Parkinson's Disease Detection Project Using Python

By Rohit Sharma

Updated on Jul 21, 2025 | 7 min read | 1.38K+ views

Parkinson's disease is a progressive neurological disorder that mainly affects movement. It interferes with muscle control and coordination, causing tremors, rigidity, and difficulty with balance. The early signs are often subtle and go unnoticed, which is why detecting Parkinson's disease at an early stage can lead to better management.

In this project, we will build machine learning models that detect Parkinson's disease from voice measurements. These features were extracted from biomedical voice recordings of people with and without the disease, since changes in vocal patterns are a known indicator of Parkinson's.

What Should You Know Beforehand?

Before you start this project, it helps to have at least a basic knowledge of Python and of the tools and libraries listed in the next section.

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool / Library       | Purpose
Python               | Programming language
Google Colab         | Cloud-based IDE for running the code
Pandas               | Data manipulation and analysis
NumPy                | Working with arrays and numerical data
Scikit-learn         | Machine learning models and metrics
Matplotlib/Seaborn   | Data visualization

Models That Will Be Utilized for Learning

Two lightweight but effective models will be used for the binary classification task of detecting Parkinson's disease.

  • Logistic Regression: A simple and widely used linear classification model that works well when the classes can be separated reasonably well by a linear boundary. It predicts the probability that an instance belongs to a particular class and is used extensively in medical and diagnostic problems.
  • Support Vector Machine (SVM): An algorithm that finds the decision boundary with the largest margin between the classes. It can also handle non-linear separation through kernel functions and often performs well on small to medium-sized datasets.
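The difference between the two models can be sketched on toy data. The example below is only an illustration and is not part of the original project: it assumes a synthetic two-class dataset generated with scikit-learn's make_moons (chosen because it is not linearly separable), and all variable names are hypothetical. It shows that LogisticRegression exposes class probabilities through predict_proba, and that SVC can be switched between a linear and an RBF kernel.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic, non-linearly separable toy data (not the Parkinson's dataset)
X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)

# Logistic Regression returns one probability per class for each sample
toy_lr = LogisticRegression().fit(X_toy, y_toy)
toy_proba = toy_lr.predict_proba(X_toy[:3])
print("Class probabilities for the first three samples:\n", toy_proba)

# SVM with a linear kernel versus a non-linear (RBF) kernel
toy_svm_linear = SVC(kernel="linear").fit(X_toy, y_toy)
toy_svm_rbf = SVC(kernel="rbf").fit(X_toy, y_toy)
linear_acc = toy_svm_linear.score(X_toy, y_toy)
rbf_acc = toy_svm_rbf.score(X_toy, y_toy)
print("Linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:", rbf_acc)
```

Each row of the probability array sums to 1. On curved class boundaries like this one, a non-linear kernel can usually follow the boundary more closely than a linear one, which is the property the SVM description above refers to.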

Time Taken and Difficulty

You can complete this project in around 2 hours. It serves as a great beginner-level hands-on introduction to biomedical data analysis and binary classification using machine learning.

How to Build a Parkinson Disease Detection Model

Let’s build the project from scratch. The work breaks down into four stages:

  1. Loading and exploring the dataset
  2. Preprocessing the features
  3. Training the model
  4. Evaluating its performance

Without any further delay, let’s start!

Step 1: Download the Dataset

To train our Parkinson's disease detection model, we will use the Parkinson’s Disease dataset available on Kaggle.

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser. 
  2. Type https://www.kaggle.com/code/vikasukani/detecting-parkinson-s-disease-machine-learning/input
  3. On the Input Data page, in the right pane, under the Input section, click parkinsons.data
  4. Click the download icon

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

This will prompt you to choose a file from your system. Select the parkinsons.data file you just downloaded.

Once it is uploaded, use the code below to load and preview the dataset:

import pandas as pd

# Load the dataset
df = pd.read_csv('parkinsons.data')

# Display the first five rows
df.head()

Doing so will help you verify that the dataset is loaded correctly.
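The same check can also be done in code. The sketch below is an optional addition, not part of the original notebook: load_voice_data is a hypothetical helper that reads the CSV and confirms that the name and status columns used later in this tutorial are present. To keep it runnable without the real file, it is demonstrated on a tiny in-memory CSV with made-up values.

```python
import io
import pandas as pd

def load_voice_data(source):
    """Read the CSV and confirm the columns this tutorial relies on exist."""
    data = pd.read_csv(source)
    missing = {"name", "status"} - set(data.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return data

# Tiny stand-in for parkinsons.data with made-up values
sample_csv = io.StringIO(
    "name,MDVP:Fo(Hz),status\n"
    "subject_a_1,120.5,1\n"
    "subject_b_1,198.2,0\n"
)
sample_df = load_voice_data(sample_csv)
print(sample_df.shape)
```

In Colab, calling load_voice_data('parkinsons.data') after the upload would run the same check on the real file.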

Step 3: Explore the Dataset and Understand the Features

Before building the model, let’s look at what the dataset contains and what each feature means.

The dataset contains 195 rows and 24 columns. Each row holds the biomedical voice measurements from one voice recording of an individual. The status column is our target variable:

  • status = 1: Patient has Parkinson’s Disease
  • status = 0: Healthy individual

Here’s the code to explore the data:

# Check the shape of the dataset
print("Dataset shape:", df.shape)

# Display column names
print("Column names:\n", df.columns)

# Basic statistics
df.describe()

The columns include:

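  • name: Identifier for the subject and recording (not useful for prediction)
  • MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz): Average, maximum, and minimum vocal fundamental frequency
  • MDVP:Jitter(%), MDVP:Shimmer(dB): Measures of voice irregularity (jitter captures variation in frequency, shimmer captures variation in amplitude)
  • NHR, HNR: Noise-to-harmonics and harmonics-to-noise ratios
  • RPDE, DFA: Nonlinear measures (recurrence period density entropy and the signal's fractal scaling exponent)
  • spread1, spread2, D2, PPE: Further nonlinear measures of fundamental frequency variation and, for D2, dynamical complexity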

Output

Dataset shape: (195, 24)
Column names:
 Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')
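Because status is the target, it is also worth checking how many rows fall into each class before modeling. The snippet below is a minimal sketch on made-up labels (toy_status is a hypothetical stand-in for df['status']); on the real data, the same two calls would be applied to df['status'].

```python
import pandas as pd

# Made-up labels standing in for df['status'] (1 = Parkinson's, 0 = healthy)
toy_status = pd.Series([1, 1, 1, 0, 1, 1, 0, 1], name="status")

# Absolute count per class, then the share of each class
class_counts = toy_status.value_counts()
class_share = toy_status.value_counts(normalize=True)
print(class_counts)
print(class_share.round(2))
```

A large gap between the two shares signals an imbalanced dataset, which matters later when reading accuracy figures.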

You can also check for missing values just to be sure. Here’s the code to do so:

# Check for null values
df.isnull().sum()

Output

The output lists every column alongside its number of null values. If each count is 0, there are no missing values in the dataset. Now, let’s move to the next step.
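Before moving on, it can help to see what the same check reports when a value actually is missing. This is a sketch on a made-up frame (toy_df and its values are hypothetical, and one entry is deliberately left empty); dropping incomplete rows is shown as one simple option, not as a step the original project needs.

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing value
toy_df = pd.DataFrame({"HNR": [21.0, np.nan, 19.5], "status": [1, 0, 1]})

# Per-column null counts, then a single total across the frame
null_counts = toy_df.isnull().sum()
total_missing = int(null_counts.sum())
print(null_counts)
print("Total missing values:", total_missing)

# One simple option if anything is missing: drop the incomplete rows
toy_clean = toy_df.dropna()
print("Rows after dropping incomplete ones:", len(toy_clean))
```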

Step 4: Preprocess the Data and Prepare It for Modeling

In this step, we will do the following:

  • Drop the name column (since it’s just an identifier and doesn’t help in prediction)
  • Separate the features (X) and target variable (y)
  • Split the data into training and test sets

Here is the code to do so:

from sklearn.model_selection import train_test_split

# Drop the 'name' column
df = df.drop(['name'], axis=1)

# Separate features and target
X = df.drop('status', axis=1)
y = df['status']

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of splits
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Output

Training set shape: (156, 22)
Test set shape: (39, 22)

Now, the data is prepared. Let’s train the machine learning models in the next step. 
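One optional variation on the split above is stratification. The sketch below uses made-up labels (toy_X and toy_y are hypothetical) to show how the stratify parameter of train_test_split keeps the class ratio the same in the test set as in the full data. The split in this tutorial does not use it, so adding it to the real code would change the numbers reported in the following steps.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 150 positives and 50 negatives
toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([1] * 150 + [0] * 50)

Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, the test set keeps the 75/25 class ratio of the full data
print("Test set size:", len(ya_test))
print("Positive share in test set:", ya_test.mean())
```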

Step 5: Train the Models

Before training the models, let’s scale the features using StandardScaler. This puts all input values on a comparable scale, which helps both models, especially Logistic Regression, converge faster and perform better.

Here is the code to do so:

from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Fit on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
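To make the scaling step concrete, the sketch below reproduces what StandardScaler does by hand on made-up numbers (toy_train and toy_test are hypothetical). Each feature is transformed as z = (x - mean) / std, where the mean and standard deviation come from the training data only, which is why the test set is passed to transform rather than fit_transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for two features on very different scales
toy_train = np.array([[100.0, 0.001], [150.0, 0.003], [200.0, 0.005]])
toy_test = np.array([[175.0, 0.004]])

toy_scaler = StandardScaler().fit(toy_train)
scaler_test_scaled = toy_scaler.transform(toy_test)

# Manual z-score using the training mean and (population) standard deviation
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_test_scaled = (toy_test - train_mean) / train_std

print(scaler_test_scaled)
print(manual_test_scaled)
```

Both printed arrays match, and the two features end up on the same scale even though their raw values differ by several orders of magnitude.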

Model 1: Logistic Regression

Use the code below to train the logistic regression model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize and train the model
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

Output

Logistic Regression Accuracy: 0.8974358974358975
Confusion Matrix:
 [[ 3  4]
 [ 0 32]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.43      0.60         7
           1       0.89      1.00      0.94        32

    accuracy                           0.90        39
   macro avg       0.94      0.71      0.77        39
weighted avg       0.91      0.90      0.88        39

What Does This Output Mean?

  • The Logistic Regression model achieved ~90% accuracy on the test set (35 of 39 samples).
  • It identified all 32 patients with Parkinson’s (class 1), giving 100% recall, meaning no actual patient was missed.
  • However, for healthy individuals (class 0), recall dropped to 43%: the model misclassified 4 of the 7 healthy cases as having Parkinson’s.
  • The weighted F1-score is 0.88, but because 32 of the 39 test samples belong to class 1, that figure is dominated by the majority class. The macro-averaged F1-score of 0.77 better reflects performance across both classes.
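The figures in the classification report follow directly from the confusion matrix. As a worked check, the snippet below recomputes them from the four counts reported above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are chosen for illustration.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
cm_lr = [[3, 4],
         [0, 32]]

tn, fp = cm_lr[0]   # healthy correctly identified, healthy flagged as patients
fn, tp = cm_lr[1]   # patients missed, patients correctly identified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of predicted patients, how many are patients
recall_1 = tp / (tp + fn)      # of actual patients, how many were found
recall_0 = tn / (tn + fp)      # of healthy individuals, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2),
      round(recall_0, 2), round(f1_1, 2))
# prints: 0.9 0.89 1.0 0.43 0.94
```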

Model 2: Support Vector Machine (SVM)

Now let’s train an SVM classifier with a linear kernel. Here is the code to do so:

from sklearn.svm import SVC

# Initialize and train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))

Output

SVM Accuracy: 0.8717948717948718
Confusion Matrix:
 [[ 4  3]
 [ 2 30]]
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.57      0.62         7
           1       0.91      0.94      0.92        32

    accuracy                           0.87        39
   macro avg       0.79      0.75      0.77        39
weighted avg       0.87      0.87      0.87        39

What Does This Output Mean?

  • The SVM model correctly classified 34 of the 39 test samples (about 87%).
  • It performed well for class 1 (patients with Parkinson's), with a precision of 0.91 and a recall of 0.94.
  • Class 0 (healthy individuals) scored lower, with a precision of 0.67 and a recall of 0.57. With only 7 healthy samples in the test set, each misclassification shifts these figures noticeably, which is common in imbalanced datasets.
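Since plain accuracy is dominated by the larger class, one way to compare the two models on equal terms is balanced accuracy, the mean of the per-class recalls. The sketch below computes it from the two confusion matrices reported in this step. The helper functions are hypothetical names written for this illustration; scikit-learn also provides balanced_accuracy_score for use on the raw predictions.

```python
def per_class_recall(cm):
    """Return (recall for class 0, recall for class 1) from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return tn / (tn + fp), tp / (tp + fn)

def balanced_accuracy(cm):
    """Mean of the per-class recalls, so each class counts equally."""
    r0, r1 = per_class_recall(cm)
    return (r0 + r1) / 2

# Confusion matrices taken from the two outputs above
cm_logreg = [[3, 4], [0, 32]]
cm_svm = [[4, 3], [2, 30]]

bal_logreg = balanced_accuracy(cm_logreg)
bal_svm = balanced_accuracy(cm_svm)
print("Logistic Regression:", round(bal_logreg, 3))  # 0.714
print("SVM:", round(bal_svm, 3))                     # 0.754
```

These values match the macro-averaged recall in each classification report (0.71 and 0.75), and on this measure the SVM comes out slightly ahead even though its plain accuracy is lower.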

Conclusion

Both models detected Parkinson's disease from voice data with reasonable accuracy: about 90% for Logistic Regression and 87% for the SVM on the 39-sample test set. Logistic Regression came out ahead in accuracy and in recall for the patient class, which makes it the better choice when missing a true case is costly. The SVM model was slightly less accurate but treated the two classes more evenly, correctly identifying 4 of the 7 healthy individuals compared with 3 for Logistic Regression.

In the course of this project, you learned how machine learning applies to health diagnostics using structured biomedical data. You also learned about data preprocessing, feature scaling, model training, and performance evaluation.

Reference:
https://colab.research.google.com/drive/1YJLjpGjU506joLS7Z3bSOl5RwCp4UKYB
