Detecting Parkinson Disease Project Using Python

Updated on Jul 21, 2025 | 7 min read | 1.38K+ views

Table of Contents

What Should You Know Beforehand?
Technologies and Libraries Used
Models That Will Be Utilized for Learning
Time Taken and Difficulty
How to Build a Parkinson Disease Detection Model
Conclusion

Parkinson's disease is a progressive neurological disorder that mainly affects movement. It interferes with muscle control and coordination, causing tremors, rigidity, and difficulty with balance. The initial signs are often subtle and go unnoticed, which is why detecting Parkinson's disease at an early stage can lead to better management.

In this project, you will build a machine learning model that detects Parkinson's disease from voice measurements. The features were extracted from biomedical voice recordings of people with and without the disease, since changes in vocal patterns are a known indicator of Parkinson's.

What Should You Know Beforehand?

Before you start this project, you will want to have at least some basic knowledge of the following:

Python programming language (functions, loops, and working with variables)
Pandas and NumPy (for reading, analyzing, and manipulating data)
Scikit-learn (to build, train, and test basic machine learning models)
Basic machine learning workflow (e.g., train-test split, accuracy, confusion matrix)

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool / Library	Purpose
Python	Programming language
Google Colab	Cloud-based IDE for running the code
Pandas	Data manipulation and analysis
NumPy	Working with arrays and numerical data
Scikit-learn	Machine learning models and metrics
Matplotlib/Seaborn	Data visualization

Models That Will Be Utilized for Learning

Two lightweight but effective models will be used for this binary classification task of detecting Parkinson's disease.

Logistic Regression: A simple and widely used linear classification model that works well when the classes are roughly linearly separable. It predicts the probability that an instance belongs to a particular class and is used extensively in medical and diagnostic problems.
Support Vector Machine (SVM): An algorithm that finds the maximum-margin decision boundary between classes. It can also handle non-linear separation through kernel functions and often performs well on small- to medium-sized datasets.
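To make the "predicts the probability" part of Logistic Regression concrete, here is a minimal sketch of the underlying calculation. The weights, bias, and sample values are hypothetical and chosen only for illustration; they are not taken from the model trained later in this project.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration
weights = [0.8, -1.2]
bias = 0.1
sample = [1.5, 0.3]  # two made-up, already-scaled feature values

p = predict_proba(sample, weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```

The model computes a weighted sum of the features and passes it through the sigmoid function; the class label is then obtained by comparing that probability with a threshold.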

Time Taken and Difficulty

You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to biomedical data analysis and binary classification using machine learning.

How to Build a Parkinson Disease Detection Model

Let’s build the project from scratch. We will go through the following stages:

Loading and exploring the dataset
Preprocessing the features
Training the model
Evaluating its performance

Without any further delay, let’s start!

Step 1: Download the Dataset

To train our Parkinson's disease detection model, we will use the Parkinson’s Disease dataset available on Kaggle.

Follow the steps mentioned below to download the dataset:

Open a new tab in any web browser.
Go to https://www.kaggle.com/code/vikasukani/detecting-parkinson-s-disease-machine-learning/input.
On the Input Data page, in the right pane, under the Input section, click parkinsons.data.
Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

This will prompt you to choose a file from your system. Select the parkinsons.data file you just downloaded.

Once uploaded, use the code below to load and preview the dataset:

import pandas as pd

# Load the dataset
df = pd.read_csv('parkinsons.data')

# Display the first five rows
df.head()

Doing so will help you verify that the dataset is loaded correctly.

Step 3: Explore the Dataset and Understand the Features

Before building the model, let’s look at what the dataset contains and what each feature means.

The dataset contains 195 rows and 24 columns. Each row represents an individual’s biomedical voice measurements. The column status is our target variable:

status = 1: Patient has Parkinson’s Disease
status = 0: Healthy individual

Here’s the code to explore the data:

# Check the shape of the dataset
print("Dataset shape:", df.shape)

# Display column names
print("Column names:\n", df.columns)

# Basic statistics
df.describe()

The columns include:

name: ID for each subject (not useful for prediction)
MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz): Measures of vocal frequency
MDVP:Jitter(%), MDVP:Shimmer(dB): Measures of voice irregularities
NHR, HNR: Noise-to-harmonics and harmonics-to-noise ratios
RPDE, DFA: Nonlinear dynamical complexity measures
spread1, spread2, D2, PPE: Other signal processing features

Output

Dataset shape: (195, 24)
Column names:
 Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')
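Because status is the target variable, it can also help to check how many rows fall into each class before modeling. The sketch below uses a small made-up DataFrame (toy_df is a hypothetical stand-in, not the real dataset); on the real data you would call the same methods on df['status'].

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration
toy_df = pd.DataFrame({"status": [1, 1, 1, 0, 1, 0, 1, 1]})

# Absolute number of rows per class
counts = toy_df["status"].value_counts()

# Share of each class (fractions that sum to 1)
shares = toy_df["status"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

A strongly uneven split between the two classes is a signal to look beyond plain accuracy when evaluating the models.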

You can also check for missing values, just to be sure:

# Check for null values
df.isnull().sum()

Output

Every column should report a count of 0, which means there are no missing values in the dataset. Now, let’s move to the next step.

Step 4: Preprocess the Data and Prepare It for Modeling

In this step, we will do the following:

Drop the name column (since it’s just an identifier and doesn’t help in prediction)
Separate the features (X) and target variable (y)
Split the data into training and test sets

Here is the code to do so:

from sklearn.model_selection import train_test_split

# Drop the 'name' column
df = df.drop(['name'], axis=1)

# Separate features and target
X = df.drop('status', axis=1)
y = df['status']

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of splits
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Output

Training set shape: (156, 22)
Test set shape: (39, 22)
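The split above is purely random, so the share of healthy and Parkinson's samples in the test set can differ from the share in the full dataset. One common option, which the code above does not use, is the stratify argument of train_test_split. The sketch below demonstrates it on synthetic data with an assumed class ratio; applying it to the real data would produce a different split and therefore different results from the ones shown in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 samples, 22 features, with an assumed class ratio
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(195, 22))
y_demo = np.array([0] * 50 + [1] * 145)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Class 0 share overall:", round(float((y_demo == 0).mean()), 3))
print("Class 0 share in test:", round(float((y_te == 0).mean()), 3))
```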

Now, the data is prepared. Let’s train the machine learning models in the next step.

Step 5: Train the Models

Before training the models, let’s scale the features using StandardScaler. This puts all input values on the same scale and helps both models, especially Logistic Regression, converge faster and perform better.

Here is the code to do so:

from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Fit on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
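To see what fit_transform and transform do, here is a minimal sketch of the same standardization done by hand for a single feature. The numbers are made up for illustration. The key point is that the mean and standard deviation are learned from the training data only and then reused for the test data.

```python
from statistics import fmean, pstdev

train_values = [110.0, 120.0, 150.0, 200.0]  # made-up training values for one feature
test_values = [100.0, 145.0]                 # made-up test values for the same feature

# "Fit": learn the mean and standard deviation from the training data only
mu = fmean(train_values)
sigma = pstdev(train_values)  # population std (ddof=0), as StandardScaler uses

# "Transform": apply the training statistics to both sets
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

print("train mean after scaling:", round(fmean(train_scaled), 6))
print("train std after scaling:", round(pstdev(train_scaled), 6))
print("test values after scaling:", [round(v, 3) for v in test_scaled])
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.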

Model 1: Logistic Regression

Use the code below to train the Logistic Regression model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize and train the model
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

Output

Logistic Regression Accuracy: 0.8974358974358975
Confusion Matrix:
 [[ 3  4]
 [ 0 32]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.43      0.60         7
           1       0.89      1.00      0.94        32

    accuracy                           0.90        39
   macro avg       0.94      0.71      0.77        39
weighted avg       0.91      0.90      0.88        39

What Does This Output Mean?

The Logistic Regression model achieved ~90% accuracy on the test set (35 of 39 samples).
It identified all 32 patients with Parkinson’s (class 1), giving 100% recall, so no actual patient was missed.
However, for healthy individuals (class 0), recall dropped to 43%: the model misclassified 4 out of 7 healthy cases as Parkinson’s.
The weighted F1-score of 0.88 looks strong, but it is dominated by the larger patient class. The macro-average recall of 0.71 better reflects the weaker performance on healthy individuals.
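The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below uses the four counts from the Logistic Regression output above and assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts taken from the Logistic Regression confusion matrix above.
# Rows are true classes, columns are predicted classes.
tn, fp = 3, 4    # true class 0: 3 predicted healthy, 4 predicted Parkinson's
fn, tp = 0, 32   # true class 1: 0 predicted healthy, 32 predicted Parkinson's

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_pos = tp / (tp + fn)      # share of actual patients that were caught
recall_neg = tn / (tn + fp)      # share of healthy individuals correctly cleared
precision_pos = tp / (tp + fp)   # share of "Parkinson's" predictions that were right

print(f"accuracy            = {accuracy:.2f}")       # 0.90
print(f"recall (class 1)    = {recall_pos:.2f}")     # 1.00
print(f"recall (class 0)    = {recall_neg:.2f}")     # 0.43
print(f"precision (class 1) = {precision_pos:.2f}")  # 0.89
```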

Model 2: Support Vector Machine (SVM)

Now let’s train an SVM classifier with a linear kernel:

from sklearn.svm import SVC

# Initialize and train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))

Output

SVM Accuracy: 0.8717948717948718
Confusion Matrix:
 [[ 4  3]
 [ 2 30]]
Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.57      0.62         7
           1       0.91      0.94      0.92        32

    accuracy                           0.87        39
   macro avg       0.79      0.75      0.77        39
weighted avg       0.87      0.87      0.87        39

What Does This Output Mean?

The model correctly predicted 87% of the test samples (34 of 39).
It performed well for class 1 (patients with Parkinson's), with a precision of 0.91 and recall of 0.94.
Class 0 (healthy individuals) scored noticeably lower, with a precision of 0.67 and recall of 0.57. With only 7 healthy samples in the test set, each misclassification shifts these figures substantially, which is common in small, imbalanced datasets.
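This step uses a linear kernel, but as noted earlier, SVMs can also handle non-linear boundaries through kernel functions. The sketch below illustrates the difference on synthetic ring-shaped data generated with make_circles. It is a demonstration of the kernel idea only and says nothing about which kernel would work better on the Parkinson's dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data that no straight line can separate: one ring inside another
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_c_tr, X_c_te, y_c_tr, y_c_te = train_test_split(
    X_c, y_c, test_size=0.25, random_state=0
)

# Same data, two kernels
linear_acc = SVC(kernel="linear").fit(X_c_tr, y_c_tr).score(X_c_te, y_c_te)
rbf_acc = SVC(kernel="rbf").fit(X_c_tr, y_c_tr).score(X_c_te, y_c_te)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```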

Conclusion

Both models were fairly accurate in detecting Parkinson's disease from voice data. Logistic Regression came out ahead in accuracy and in recall for the patient class, which makes it the better choice when missing a true case is costly. The SVM model was slightly less accurate but treated the two classes more evenly, with a higher recall for healthy individuals (0.57 versus 0.43). Because the test set contains only 39 samples, these differences should be read as indicative rather than conclusive.
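One way to put a number on how evenly each model treats the two classes is balanced accuracy, the mean of the per-class recalls. The sketch below computes it from the two confusion matrices reported earlier; the helper name per_class_recall is introduced here for illustration.

```python
def per_class_recall(matrix):
    """Recall for each class from a confusion matrix (rows = true class)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

lr_matrix = [[3, 4], [0, 32]]    # Logistic Regression, from the output above
svm_matrix = [[4, 3], [2, 30]]   # SVM, from the output above

lr_recalls = per_class_recall(lr_matrix)
svm_recalls = per_class_recall(svm_matrix)

# Balanced accuracy is the mean of the per-class recalls
lr_balanced = sum(lr_recalls) / len(lr_recalls)
svm_balanced = sum(svm_recalls) / len(svm_recalls)

print(f"Logistic Regression balanced accuracy: {lr_balanced:.2f}")  # 0.71
print(f"SVM balanced accuracy:                 {svm_balanced:.2f}")  # 0.75
```

These values match the macro-average recall in the two classification reports, and they show the SVM slightly ahead once both classes are weighted equally.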

In the course of this project, you learned how machine learning applies to health diagnostics using structured biomedical data. You also learned about data preprocessing, feature scaling, model training, and performance evaluation.

Reference:
https://colab.research.google.com/drive/1YJLjpGjU506joLS7Z3bSOl5RwCp4UKYB

Rohit Sharma
