Detecting Parkinson's Disease Project Using Python
By Rohit Sharma
Updated on Jul 21, 2025 | 7 min read | 1.38K+ views
Parkinson's disease is a progressive neurological disorder that mainly affects movement. It interferes with muscle control and coordination, resulting in trembling, rigidity, and difficulty with balance. The initial signs are often subtle or go unnoticed, which is why detecting Parkinson's disease at an early stage can lead to better management.
In this project, we will build a machine learning model that detects Parkinson's disease from voice measurements. These features were extracted from biomedical voice recordings of patients, since changes in vocal patterns are a known indicator of the disease.
Before you start this project, you should have at least a basic knowledge of Python, Pandas for data handling, and fundamental machine learning concepts such as classification.
For this project, the following tools and libraries will be used:
Tool / Library | Purpose
Python | Programming language
Google Colab | Cloud-based IDE for running the code
Pandas | Data manipulation and analysis
NumPy | Working with arrays and numerical data
Scikit-learn | Machine learning models and metrics
Matplotlib/Seaborn | Data visualization
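As a quick setup check, you can import these libraries in your Colab notebook before starting. A minimal sketch:

# Sketch: verify that the required libraries are available in your environment.
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import seaborn as sns

print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)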
Two lightweight but effective models, Logistic Regression and a Support Vector Machine (SVM), will be used for this binary classification task of detecting Parkinson's disease.
You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to biomedical data analysis and binary classification using machine learning.
Let's build the project from scratch. We will start by downloading the dataset, then explore and prepare the data, and finally train and evaluate the models. Without any further delay, let's start!
To train our Parkinson's disease detection model, we will use the Parkinson's Disease dataset available on Kaggle. Search for the dataset on Kaggle, download it, and locate the parkinsons.data file on your system.
Now that you have downloaded the file, upload it to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
This will prompt you to choose a file from your system. Select the parkinsons.data file you just downloaded.
Once uploaded, use the code given below to load as well as preview the dataset:
import pandas as pd
# Load the dataset
df = pd.read_csv('parkinsons.data')
# Display the first five rows
df.head()
Doing so will help you verify that the dataset is loaded correctly.
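Optionally, you can also run df.info() to confirm the number of rows and the data type of each column:

# Optional check: shows row count, column names, data types, and non-null counts.
df.info()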
Before building the model, let's understand what the dataset looks like and what each feature means.
The dataset contains 195 rows and 24 columns. Each row represents an individual's biomedical voice measurements. The column status is our target variable: it is 1 for individuals with Parkinson's disease and 0 for healthy individuals.
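To see how many recordings fall into each class, you can check the distribution of the status column. A small sketch:

# Sketch: count how many recordings belong to each class (1 = Parkinson's, 0 = healthy).
print(df['status'].value_counts())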
Here’s the code to explore the data:
# Check the shape of the dataset
print("Dataset shape:", df.shape)
# Display column names
print("Column names:\n", df.columns)
# Basic statistics
df.describe()
The columns include an identifier (name), biomedical voice measurements such as fundamental frequency (MDVP:Fo, Fhi, Flo), jitter and shimmer variants, noise ratios (NHR, HNR), and nonlinear measures (RPDE, DFA, spread1, spread2, D2, PPE), along with the status label.
Output
Dataset shape: (195, 24)
Column names:
Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
'spread1', 'spread2', 'D2', 'PPE'],
dtype='object')
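Matplotlib and Seaborn, listed in the tools table above, can be used here to visualize how the voice features relate to each other. A minimal sketch, assuming df is already loaded:

import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: heatmap of pairwise correlations between the numeric voice features.
# The non-numeric 'name' column is excluded before computing correlations.
plt.figure(figsize=(12, 10))
sns.heatmap(df.drop(columns=['name']).corr(), cmap='coolwarm', center=0)
plt.title('Correlation between voice features')
plt.show()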
You can also check for missing values just to be sure. Here's the code to do that:
# Check for null values
df.isnull().sum()
The output shows a count of 0 for every column, which means there are no missing values in the dataset. Now, let's move to the next step.
In this step, we will drop the non-numeric name column, separate the features from the target variable, and split the data into training and test sets.
Here is the code to do so:
from sklearn.model_selection import train_test_split
# Drop the 'name' column
df = df.drop(['name'], axis=1)
# Separate features and target
X = df.drop('status', axis=1)
y = df['status']
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Check the shape of splits
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
Output
Training set shape: (156, 22)
Test set shape: (39, 22)
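Because the two classes are imbalanced (far more Parkinson's recordings than healthy ones), you may optionally pass stratify=y so that both splits keep a similar class ratio. A sketch of the alternative call:

# Optional sketch: a stratified split preserves the class ratio in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)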
Now, the data is prepared. Let’s train the machine learning models in the next step.
Before we start training the models, let's scale our features using the StandardScaler. Doing so ensures that all input values are on the same scale and helps both models, especially Logistic Regression, converge faster and perform better.
Here is the code to do so:
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Fit on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
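As a quick sanity check, the scaled training features should have a mean close to 0 and a standard deviation close to 1 in each column:

import numpy as np

# Sanity check: each scaled training column should have mean ~0 and std ~1.
print("Means (first 5 features):", np.round(X_train_scaled.mean(axis=0)[:5], 3))
print("Stds  (first 5 features):", np.round(X_train_scaled.std(axis=0)[:5], 3))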
Use the code below to train the Logistic Regression model:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Initialize and train the model
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)
# Predict on test set
y_pred_lr = lr_model.predict(X_test_scaled)
# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))
Output
Logistic Regression Accuracy: 0.8974358974358975
Confusion Matrix:
[[ 3 4]
[ 0 32]]
Classification Report:
precision recall f1-score support
0 1.00 0.43 0.60 7
1 0.89 1.00 0.94 32
accuracy 0.90 39
macro avg 0.94 0.71 0.77 39
weighted avg 0.91 0.90 0.88 39
What Does this Output Mean?
The confusion matrix shows that all 32 Parkinson's patients in the test set were correctly identified (recall of 1.00 for class 1), while only 3 of the 7 healthy individuals were classified correctly (recall of 0.43 for class 0). Overall accuracy is about 90%, so the model is very good at catching true cases but misclassifies some healthy voices as Parkinson's.
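If you prefer a visual view of these numbers, you can plot the confusion matrix as a heatmap using the predictions from above. A small sketch:

import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: plot the Logistic Regression confusion matrix as an annotated heatmap.
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Healthy (0)', "Parkinson's (1)"],
            yticklabels=['Healthy (0)', "Parkinson's (1)"])
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Logistic Regression confusion matrix')
plt.show()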
Now let's train an SVM classifier with a linear kernel. Here is the code:
from sklearn.svm import SVC
# Initialize and train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)
# Predict on test set
y_pred_svm = svm_model.predict(X_test_scaled)
# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test,y_pred_svm))
Output
SVM Accuracy: 0.8717948717948718
Confusion Matrix:
[[ 4 3]
[ 2 30]]
Classification Report:
precision recall f1-score support
0 0.67 0.57 0.62 7
1 0.91 0.94 0.92 32
accuracy 0.87 39
macro avg 0.79 0.75 0.77 39
weighted avg 0.87 0.87 0.87 39
What Does this Output Mean?
Here, 4 of the 7 healthy individuals and 30 of the 32 Parkinson's patients were classified correctly. Overall accuracy is about 87%, slightly lower than Logistic Regression, but the errors are spread more evenly across the two classes (recall of 0.57 for the healthy class and 0.94 for the patient class).
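Since the test set contains only 39 samples, a single train-test split can give a noisy estimate. As an optional extra check, here is a sketch that uses 5-fold cross-validation on the full dataset to compare both models:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Sketch: 5-fold cross-validation gives a more stable accuracy estimate than one split.
# Scaling sits inside a pipeline so it is refit on each training fold (no data leakage).
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVM (linear kernel)", SVC(kernel='linear'))]:
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")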
Both models were fairly accurate in detecting Parkinson's disease from voice data. Logistic Regression came out ahead in accuracy and in recall for the patient class, making it the better choice when identifying true cases matters most. The SVM model was slightly less accurate but treated the two classes more evenly and also worked well with the scaled features.
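Finally, to use either trained model on new voice measurements, remember to apply the same fitted scaler before predicting. A minimal sketch, using the first test row as a stand-in for a new sample:

# Sketch: score a single "new" sample (the first test row stands in for new data).
new_sample = X_test.iloc[[0]]                     # double brackets keep it 2-D
new_sample_scaled = scaler.transform(new_sample)
prediction = lr_model.predict(new_sample_scaled)
print("Predicted status:", prediction[0])         # 1 = Parkinson's, 0 = healthy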
In the course of this project, you learned how machine learning applies to health diagnostics using structured biomedical data. You also learned about data preprocessing, feature scaling, model training, and performance evaluation.
Reference:
https://colab.research.google.com/drive/1YJLjpGjU506joLS7Z3bSOl5RwCp4UKYB