Machine learning interviews vary by category; some recruiters, for instance, ask many linear regression questions. Interviews for a Machine Learning Engineer role can cover categories such as Coding, Research, Case Study, Project Management, Presentation, System Design, and Statistics. We will focus on the most common categories and how to prepare for them.
Landing your desired job as a machine learning engineer usually means passing a machine learning interview. These interviews frequently include coding, machine learning concepts, screening, and system design rounds, each of which assesses a different facet of your expertise and knowledge. In this article, we’ll examine the most typical machine learning interview questions and offer helpful preparation advice for each of them.
It is common practice to test data science aspirants on widely used machine learning algorithms in interviews. These conventional algorithms include linear regression, logistic regression, clustering, and decision trees. Data scientists are expected to possess in-depth knowledge of these algorithms.
We consulted hiring managers and data scientists from various organisations about the typical ML questions they ask in an interview. Based on their extensive feedback, a set of questions and answers was prepared to help aspiring data scientists in these conversations. Linear regression questions are the most common in machine learning interviews. Q&As on these algorithms will be provided in a series of four blog posts.
Each blog post will cover one of the following topics:
 Linear Regression
 Logistic Regression
 Clustering
 Decision Trees and Questions which pertain to all algorithms
Let’s get started with linear regression!
1. What is linear regression?
In simple terms, linear regression is a method of finding the best straight line fitting to the given data, i.e. finding the best linear relationship between the independent and dependent variables.
In technical terms, linear regression is a machine learning algorithm that finds the best linear-fit relationship between the independent and dependent variables in the given data. This is most commonly done by minimising the sum of squared residuals (the least squares method).
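As a minimal sketch of the idea, the snippet below fits a straight line to a handful of made-up points by minimising the sum of squared residuals. The function name `fit_line` and the data are illustrative assumptions, not part of any library.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying roughly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```

This closed form only covers the single-feature case; with several features the same minimisation is usually solved with matrix methods.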
2. State the assumptions in a linear regression model.
There are three main groups of assumptions in a linear regression model:
 The assumption about the form of the model:
It is assumed that there is a linear relationship between the dependent and independent variables. It is known as the ‘linearity assumption’.
 Assumptions about the residuals:
 Normality assumption: It is assumed that the error terms, ε^{(i)}, are normally distributed.
 Zero mean assumption: It is assumed that the residuals have a mean value of zero.
 Constant variance assumption: It is assumed that the residual terms have the same (but unknown) variance, σ^{2}. This assumption is also known as the assumption of homogeneity or homoscedasticity.
 Independent error assumption: It is assumed that the residual terms are independent of each other, i.e. their pairwise covariance is zero.
 Assumptions about the estimators:
 The independent variables are measured without error.
 The independent variables are linearly independent of each other, i.e. there is no multicollinearity in the data.
Explanation:
 This is self-explanatory.
 If the residuals are not normally distributed, their randomness is lost, which implies that the model is not able to explain the relation in the data.
Also, the mean of the residuals should be zero.
Y^{(i)} = β_{0} + β_{1}x^{(i)} + ε^{(i)}
This is the assumed linear model, where ε is the residual term.
E(Y) = E(β_{0} + β_{1}x^{(i)} + ε^{(i)})
= E(β_{0} + β_{1}x^{(i)}) + E(ε^{(i)})
If the expectation (mean) of the residuals, E(ε^{(i)}), is zero, the expectations of the target variable and the model become the same, which is one of the targets of the model.
The residuals (also known as error terms) should be independent. This means that there is no correlation between the residuals and the predicted values, or among the residuals themselves. If some correlation is present, it implies that there is some relation that the regression model is not able to identify.
 If the independent variables are not linearly independent of each other, the uniqueness of the least squares solution (or normal equation solution) is lost.
3. What is feature engineering? How do you apply it in the process of modelling?
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models resulting in improved model accuracy on unseen data.
In layman’s terms, feature engineering means the development of new features that may help you understand and model the problem in a better way. Feature engineering is of two kinds: business-driven and data-driven. Business-driven feature engineering revolves around the inclusion of features from a business point of view. The job here is to transform the business variables into features of the problem.
In the case of data-driven feature engineering, the features you add do not have any significant physical interpretation, but they help the model in the prediction of the target variable.
To apply feature engineering, one must be fully acquainted with the dataset. This involves knowing what the given data is, what it signifies, what the raw features are, etc. You must also have a crystal clear idea of the problem, such as what factors affect the target variable, what the physical interpretation of the variable is, etc.
4. What is the use of regularisation? Explain L1 and L2 regularisations.
Regularisation is a technique that is used to tackle the problem of overfitting of the model. When a very complex model is fitted to the training data, it tends to overfit, whereas a very simple model might not be able to capture the patterns in the data. Regularisation addresses this problem by constraining the complexity of a flexible model.
Regularisation adds a penalty on the coefficient terms (betas) to the cost function so that large coefficients are penalised and the fitted coefficients stay small in magnitude. This helps in capturing the trends in the data and at the same time prevents overfitting by not letting the model become too complex.
 L1 or LASSO regularisation: Here, the sum of the absolute values of the coefficients, scaled by the regularisation parameter λ, is added to the cost function:
J(β) = Σ_{i} (y^{(i)} - ŷ^{(i)})^{2} + λ Σ_{j} |β_{j}|
Here, ŷ^{(i)} is the model’s prediction for the i^{th} data point, and the second term is the L1 or LASSO penalty. This regularisation technique gives sparse results, which lead to feature selection as well.
 L2 or Ridge regularisation: Here, the sum of the squares of the coefficients, scaled by λ, is added to the cost function:
J(β) = Σ_{i} (y^{(i)} - ŷ^{(i)})^{2} + λ Σ_{j} β_{j}^{2}
Here, the second term is the L2 or Ridge penalty.
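The two penalties can be sketched in a few lines of NumPy. The function names (`lasso_cost`, `ridge_cost`, `ridge_fit`) and the synthetic data are assumptions made for illustration, and the intercept is left out to keep the example short. Ridge has a closed-form solution, which is used here to show how a larger λ shrinks the coefficients; LASSO has no closed form in general, so only its cost is shown.

```python
import numpy as np


def lasso_cost(X, y, beta, lam):
    """Squared-error cost plus the L1 penalty lam * sum(|beta_j|)."""
    residuals = y - X @ beta
    return residuals @ residuals + lam * np.sum(np.abs(beta))


def ridge_cost(X, y, beta, lam):
    """Squared-error cost plus the L2 penalty lam * sum(beta_j^2)."""
    residuals = y - X @ beta
    return residuals @ residuals + lam * np.sum(beta ** 2)


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data (an assumption for the demo): 3 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)  # a larger lam shrinks the coefficients
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

In practice the features are usually standardised before a penalty is applied, since the penalty depends on the scale of each coefficient.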
5. How to choose the value of the parameter learning rate (α)?
Selecting the value of learning rate is a tricky business. If the value is too small, the gradient descent algorithm takes ages to converge to the optimal solution. On the other hand, if the value of the learning rate is high, the gradient descent will overshoot the optimal solution and most likely never converge to the optimal solution.
To overcome this problem, you can try different values of alpha over a range of values and plot the cost vs the number of iterations for each. Then, based on the graphs, the value whose curve shows a rapid and steady decrease can be chosen.
In an ideal cost vs number of iterations curve, the cost initially decreases as the number of iterations increases, but after a certain number of iterations, the gradient descent converges and the cost does not decrease anymore.
If you see that the cost is increasing with the number of iterations, your learning rate is too high and needs to be decreased.
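The comparison described above can be sketched as follows. The helper `cost_history`, the toy data, and the three candidate values of alpha are all assumptions chosen for illustration; the recorded histories are what you would plot against the iteration number.

```python
def cost_history(xs, ys, alpha, n_iters=50):
    """Run batch gradient descent on y ~ t0 + t1 * x and record the cost per iteration."""
    m = len(xs)
    t0, t1 = 0.0, 0.0
    history = []
    for _ in range(n_iters):
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Gradients of the cost with respect to t0 and t1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        t0 -= alpha * grad0
        t1 -= alpha * grad1
    return history


# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Too small, reasonable, and too large for this particular dataset.
histories = {alpha: cost_history(xs, ys, alpha) for alpha in (0.001, 0.05, 0.5)}
for alpha, history in histories.items():
    print(alpha, history[0], history[-1])
```

On this data the smallest alpha barely moves the cost in 50 iterations, the middle one decreases it steadily, and the largest makes the cost grow, which is the signal to reduce the learning rate.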
6. How to choose the value of the regularisation parameter (λ)?
Selecting the regularisation parameter is a tricky business. If the value of λ is too high, it will lead to extremely small values of the regression coefficient β, which will lead to the model underfitting (high bias – low variance). On the other hand, if the value of λ is 0 or very small, the model will tend to overfit the training data (low bias – high variance).
There is no formula that gives the best value of λ directly. What you can do is take a subsample of the data and run the algorithm multiple times on different sets with different values of λ, which is the idea behind cross-validation. Here, the person has to decide how much variance can be tolerated. Once the user is satisfied with the variance, that value of λ can be chosen for the full dataset.
One thing to be noted is that the value of λ selected here was optimal for that subset, not for the entire training data.
7. Can we use linear regression for time series analysis?
One can use linear regression for time series analysis, but the results are not promising. So, it is generally not advisable to do so. The reasons behind this are —
 Time series data is mostly used for the prediction of the future, but linear regression seldom gives good results for future prediction as it is not meant for extrapolation.
 Mostly, time series data have a pattern, such as during peak hours, festive seasons, etc., which would most likely be treated as outliers in the linear regression analysis.
8. What value is the sum of the residuals of a linear regression close to? Justify.
The sum of the residuals of a linear regression is close to 0. Linear regression works on the assumption that the errors (residuals) are normally distributed with a mean of 0, i.e.
Y = β^{T} X + ε
Here, Y is the target or dependent variable,
β is the vector of the regression coefficient,
X is the feature matrix containing all the features as the columns,
ε is the residual term such that ε ~ N(0,σ^{2}).
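So, the sum of all the residuals is approximately the expected value of the residuals times the total number of data points. Since the expectation of residuals is 0, the sum of all the residual terms is close to zero. For a least squares fit that includes an intercept term, the sum is exactly zero up to rounding error, because the intercept absorbs any constant offset.
Note: N(μ,σ^{2}) is the standard notation for a normal distribution having mean μ and variance σ^{2}.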
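This is easy to check numerically. The sketch below uses synthetic data (an assumption for the demo) and NumPy’s `np.linalg.lstsq` to fit the same data with and without an intercept column, then compares the residual sums.

```python
import numpy as np

# Synthetic data: y = 3 + 2x plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)

# Fit with an intercept: the first column of 1s carries the offset.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual_sum = (y - X @ beta).sum()

# Fit through the origin (no intercept column) for comparison.
X_no_intercept = x[:, None]
beta_no, *_ = np.linalg.lstsq(X_no_intercept, y, rcond=None)
residual_sum_no_intercept = (y - X_no_intercept @ beta_no).sum()

print(residual_sum, residual_sum_no_intercept)
```

With the intercept the sum is zero up to floating-point error; without it, the residuals no longer have to cancel out.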
9. How does multicollinearity affect the linear regression?
Multicollinearity occurs when some of the independent variables are highly correlated (positively or negatively) with each other. This causes a problem as it goes against a basic assumption of linear regression. The presence of multicollinearity does not affect the predictive capability of the model. So, if you just want predictions, the presence of multicollinearity does not affect your output. However, if you want to draw some insights from the model and apply them in, let’s say, some business model, it may cause problems.
One of the major problems caused by multicollinearity is that it leads to incorrect interpretations and provides wrong insights. The coefficients of linear regression suggest the mean change in the target value if a feature is changed by one unit. So, if multicollinearity exists, this does not hold true as changing one feature will lead to changes in the correlated variable and consequent changes in the target variable. This leads to wrong insights and can produce hazardous results for a business.
A highly effective way of dealing with multicollinearity is to use the VIF (Variance Inflation Factor). The higher the VIF of a feature, the more linearly correlated that feature is with the other features. Remove the feature with a very high VIF value and retrain the model on the remaining features.
10. What is the normal form (equation) of linear regression? When should it be preferred to the gradient descent method?
The normal equation for linear regression is:
β = (X^{T}X)^{-1}X^{T}Y
Here, Y=β^{T}X is the model for the linear regression,
Y is the target or dependent variable,
β is the vector of the regression coefficient, which is arrived at using the normal equation,
X is the feature matrix containing all the features as the columns.
Note here that the first column in the X matrix consists of all 1s. This is to incorporate the offset value for the regression line.
Comparison between gradient descent and normal equation:
| Gradient Descent | Normal Equation |
| --- | --- |
| Needs hyperparameter tuning for alpha (learning rate) | No such need |
| It is an iterative process | It is a non-iterative process |
| O(kn^{2}) time complexity | O(n^{3}) time complexity due to evaluation of (X^{T}X)^{-1} |
| Preferred when n is extremely large | Becomes quite slow for large values of n |
Here, ‘k’ is the maximum number of iterations for gradient descent, and ‘n’ is the number of features: X^{T}X is an n × n matrix, and inverting it dominates the cost of the normal equation.
Clearly, if we have a very large number of features, the normal equation is not preferred. For small values of ‘n’, the normal equation is faster than gradient descent.
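A minimal NumPy sketch of the normal equation is shown below. The function name `normal_equation` and the toy data are assumptions for illustration. Note that the code solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np


def normal_equation(X, y):
    """Solve (X^T X) beta = X^T y; X must already contain the column of 1s."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Toy data lying exactly on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# The first column of 1s incorporates the offset (intercept) term.
X = np.column_stack([np.ones_like(x), x])
beta = normal_equation(X, y)
print(beta)  # intercept first, then slope
```

If X^{T}X is singular (for example, under perfect multicollinearity), `np.linalg.solve` raises an error, which mirrors the "no unique solution" situation discussed in a later question.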
11. You run your regression on different subsets of your data, and in each subset, the beta value for a certain variable varies wildly. What could be the issue here?
This case implies that the dataset is heterogeneous. So, to overcome this problem, the dataset should be clustered into different subsets, and then separate models should be built for each cluster. Another way to deal with this problem is to use nonparametric models, such as decision trees, which can deal with heterogeneous data quite efficiently.
12. Your linear regression doesn’t run and communicates that there is an infinite number of best estimates for the regression coefficients. What could be wrong?
This condition arises when there is a perfect correlation (positive or negative) between some variables. In this case, there is no unique value for the coefficients, and hence, the given condition arises.
13. What do you mean by adjusted R^{2}? How is it different from R^{2}?
Adjusted R^{2}, just like R^{2}, measures how closely the data points lie around the regression line. That is, it shows how well the model is fitting the training data. The formula for adjusted R^{2} is:
Adjusted R^{2} = 1 - (1 - R^{2})(n - 1) / (n - k - 1)
Here, n is the number of data points, and k is the number of features.
One drawback of R^{2} is that it never decreases with the addition of a new feature, whether the new feature is useful or not. The adjusted R^{2} overcomes this drawback. The value of the adjusted R^{2} increases only if the newly added feature plays a significant role in the model.
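The following sketch computes both quantities using the standard formula Adjusted R^{2} = 1 - (1 - R^{2})(n - 1) / (n - k - 1). The function names and the example numbers are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalise R^2 for the number of features k, given n data points (n > k + 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Same R^2 of 0.90 on 50 data points, but with 5 vs 10 features.
adj_5 = adjusted_r_squared(0.90, n=50, k=5)
adj_10 = adjusted_r_squared(0.90, n=50, k=10)
print(round(adj_5, 4), round(adj_10, 4))
```

For the same R^{2}, the model with more features gets the lower adjusted R^{2}, so an extra feature has to raise R^{2} enough to offset the penalty.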
14. How do you interpret the residual vs fitted value curve?
The residual vs fitted value plot is used to see whether the predicted values and residuals have a correlation or not. If the residuals are scattered randomly around zero with a constant variance across the fitted values, our model is working fine; otherwise, there is some issue with the model.
The most common problem that can be found when training the model over a large range of a dataset is heteroscedasticity (this is explained in the answer below). The presence of heteroscedasticity can be easily seen by plotting the residual vs fitted value curve.
15. What is heteroscedasticity? What are the consequences, and how can you overcome it?
A random variable is said to be heteroscedastic when different subpopulations have different variabilities (standard deviation).
The existence of heteroscedasticity gives rise to certain problems in the regression analysis, as the model assumes that the error terms are uncorrelated and have constant variance. The presence of heteroscedasticity can often be seen in the form of a cone-like scatter plot for residual vs fitted values.
One of the basic assumptions of linear regression is that heteroscedasticity is not present in the data. When this assumption is violated, the Ordinary Least Squares (OLS) estimators are no longer the Best Linear Unbiased Estimators (BLUE). Hence, they do not have the least variance among all Linear Unbiased Estimators (LUEs).
There is no fixed procedure to overcome heteroscedasticity. However, there are some ways that may lead to a reduction of heteroscedasticity. They are —
 Logarithmising the data: A series that is increasing exponentially often results in increased variability. This can be overcome using the log transformation.
 Using weighted linear regression: Here, the OLS method is applied to the weighted values of X and Y. One way is to attach weights directly related to the magnitude of the dependent variable.
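Weighted linear regression can be sketched with NumPy by rescaling each row by the square root of its weight and then running ordinary least squares. The function name `weighted_least_squares`, the synthetic data, and the choice of weights 1/x^{2} (which assumes the noise standard deviation grows in proportion to x) are all assumptions made for this demo.

```python
import numpy as np


def weighted_least_squares(X, y, weights):
    """Minimise sum_i w_i * (y_i - x_i . beta)^2 by rescaling rows with sqrt(w_i)."""
    sqrt_w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return beta


# Synthetic heteroscedastic data: the noise grows with x (cone-shaped residuals).
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)

X = np.column_stack([np.ones_like(x), x])
beta_wls = weighted_least_squares(X, y, weights=1.0 / x ** 2)
beta_ols = weighted_least_squares(X, y, weights=np.ones_like(x))  # equal weights = OLS
print(beta_wls, beta_ols)
```

Down-weighting the noisy, large-x observations lets the more reliable points dominate the fit. The right weights depend on how the variance actually behaves in your data.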
16. What is VIF? How do you calculate it?
Variance Inflation Factor (VIF) is used to check the presence of multicollinearity in a dataset. It is calculated as:
VIF_{j} = 1 / (1 - R_{j}^{2})
Here, VIF_{j} is the value of VIF for the j^{th} variable,
R_{j}^{2} is the R^{2} value of the model when that variable is regressed against all the other independent variables.
If the value of VIF is high for a variable, it implies that the R^{2} value of the corresponding model is high, i.e. other independent variables are able to explain that variable. In simple terms, the variable is linearly dependent on some other variables.
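The calculation can be sketched in NumPy as below, using the standard definition VIF_{j} = 1 / (1 - R_{j}^{2}). The helper name `vif` and the synthetic features are assumptions for illustration: feature `c` is deliberately built as a noisy copy of feature `a`.

```python
import numpy as np


def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of the feature matrix X."""
    n_samples, n_features = X.shape
    result = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all the other columns (plus an intercept).
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        ss_res = np.sum((target - design @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                 # independent of a
c = a + 0.05 * rng.normal(size=200)      # nearly a copy of a
vif_values = vif(np.column_stack([a, b, c]))
print(vif_values)
```

Here `a` and `c` get very large VIF values while `b` stays close to 1. A common rule of thumb flags values above roughly 5 to 10, although the threshold is a judgment call.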
17. How do you know that linear regression is suitable for any given data?
To see if linear regression is suitable for any given data, a scatter plot can be used. If the relationship looks linear, we can go for a linear model. But if it is not the case, we have to apply some transformations to make the relationship linear. Plotting the scatter plots is easy in case of simple or univariate linear regression. But in case of multivariate linear regression, two-dimensional pairwise scatter plots, rotating plots, and dynamic graphs can be plotted.
18. How is hypothesis testing used in linear regression?
Hypothesis testing can be carried out in linear regression for the following purposes:
 To check whether a predictor is significant for the prediction of the target variable. Two common methods for this are:
 By the use of p-values:
If the p-value of a variable is greater than a certain limit (usually 0.05), the variable is insignificant in the prediction of the target variable.
 By checking the values of the regression coefficient:
If the value of the regression coefficient corresponding to a predictor is zero, that variable is insignificant in the prediction of the target variable and has no linear relationship with it.
 To check whether the calculated regression coefficients are good estimators of the actual coefficients.
19. Explain gradient descent with respect to linear regression.
Gradient descent is an optimisation algorithm. In linear regression, it is used to optimise the cost function and find the values of the βs (estimators) corresponding to the optimised value of the cost function.
Gradient descent works like a ball rolling down a graph (ignoring the inertia). The ball moves along the direction of steepest descent and comes to rest at the flat surface (the minimum).
Mathematically, the aim of gradient descent for linear regression is to find the solution of
ArgMin J(Θ_{0},Θ_{1}), where J(Θ_{0},Θ_{1}) is the cost function of the linear regression. It is given by:
J(Θ_{0},Θ_{1}) = (1/2m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2}
Here, h is the linear hypothesis model, h=Θ_{0} + Θ_{1}x, y is the true output, and m is the number of the data points in the training set.
Gradient Descent starts with a random solution, and then based on the direction of the gradient, the solution is updated to the new value where the cost function has a lower value.
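The update is:
Θ_{j} := Θ_{j} - α · ∂J(Θ_{0},Θ_{1})/∂Θ_{j}, for j = 0 and j = 1 (updated simultaneously), where α is the learning rate.
Repeat until convergence.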
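The procedure can be sketched in plain Python for the two-parameter hypothesis h = Θ_{0} + Θ_{1}x. The function name, the starting point (zeros rather than a random solution), the tolerance-based stopping rule, and the toy data are assumptions made for this example.

```python
def gradient_descent(xs, ys, alpha=0.05, tol=1e-12, max_iters=100_000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the squared-error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(max_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / (2 * m)
        # Stop once the cost no longer decreases by more than the tolerance.
        if prev_cost - cost < tol:
            break
        prev_cost = cost
        # Partial derivatives of the cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients come from the same old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

The learning rate of 0.05 happens to suit this dataset; a different dataset or feature scale may need a different value.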
20. How do you interpret a linear regression model?
A linear regression model is quite easy to interpret. The model is of the following form:
y = β_{0} + β_{1}x_{1} + β_{2}x_{2} + … + β_{n}x_{n}
The significance of this model lies in the fact that one can easily interpret and understand the marginal changes and their consequences. For example, if the value of x_{i} increases by 1 unit, keeping other variables constant, the total increase in the value of y will be β_{i}. Mathematically, the intercept term (β_{0}) is the response when all the predictor terms are set to zero or not considered.
21. What is robust regression?
A regression model should be robust in nature. This means that with changes in a few observations, the model should not change drastically. Also, it should not be much affected by the outliers.
A regression model with OLS (Ordinary Least Squares) is quite sensitive to outliers. To overcome this problem, we can use the WLS (Weighted Least Squares) method to determine the estimators of the regression coefficients. Here, lower weights are given to the outliers or high-leverage points in the fitting, making these points less impactful.
22. Which graphs are suggested to be observed before model fitting?
Before fitting the model, one must be well aware of the data, such as what the trends, distribution, skewness, etc. in the variables are. Graphs such as histograms, box plots, and dot plots can be used to observe the distribution of the variables. Apart from this, one must also analyse what the relationship between dependent and independent variables is. This can be done by scatter plots (in case of univariate problems), rotating plots, dynamic plots, etc.
23. What is the generalized linear model?
The generalized linear model (GLM) is a generalization of the ordinary linear regression model. GLM is more flexible in terms of residuals and can be used where linear regression does not seem appropriate. GLM allows the distribution of residuals to be other than a normal distribution. It generalizes linear regression by relating the linear model to the target variable through a link function. Model estimation is done using the method of maximum likelihood estimation.
24. Explain the bias-variance tradeoff.
Bias refers to the difference between the values predicted by the model and the real values. It is an error. One of the goals of an ML algorithm is to have a low bias.
Variance refers to the sensitivity of the model to small fluctuations in the training dataset. Another goal of an ML algorithm is to have low variance.
For a dataset that is not exactly linear, it is not possible to have both bias and variance low at the same time. A straight line model will have low variance but high bias, whereas a high-degree polynomial will have low bias but high variance.
There is no escaping the relationship between bias and variance in machine learning.
 Decreasing the bias increases the variance.
 Decreasing the variance increases the bias.
So, there is a tradeoff between the two; the ML specialist has to decide, based on the assigned problem, how much bias and variance can be tolerated. Based on this, the final model is built.
25. How can learning curves help create a better model?
Learning curves give the indication of the presence of overfitting or underfitting.
In a learning curve, the training error and cross-validation error are plotted against the number of training data points.
If the training error and true error (cross-validation error) converge to the same value and the corresponding value of the error is high, it indicates that the model is underfitting and is suffering from high bias. If the training error stays low but a large gap remains between it and the cross-validation error, the model is overfitting and is suffering from high variance.
26. What are the differences between regression and classification in machine learning?
Classification vs. Regression in Machine Learning:
 Objective:
Classification: Focuses on predicting the category or class labels of new data points.
Regression: Aims to predict a continuous quantity or numeric value for new data.
 Output:
Classification: Outputs discrete values representing class labels (e.g., spam or not spam).
Regression: Outputs continuous values, such as predicting house prices or stock prices.
 Use Cases:
Classification: Commonly used in tasks like image recognition, sentiment analysis, or spam filtering.
Regression: Applied in scenarios like predicting sales, temperature, or any numeric outcome.
 Algorithms:
Classification: Algorithms include Decision Trees, Support Vector Machines, and Neural Networks.
Regression: Algorithms encompass Linear Regression, Decision Trees, and Random Forests.
 Evaluation:
Classification: Evaluated using metrics like accuracy, precision, and recall.
Regression: Assessed using metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE).
27. What is Confusion Matrix?
It is one of the most common and interesting machine learning interview questions. Here is a simple answer.
 Definition: A Confusion Matrix is a table used in classification to evaluate the performance of a machine learning model. It clearly summarizes the model’s predictions versus the actual outcomes.
 Components:
 True Positives (TP): Instances correctly predicted as positive.
 True Negatives (TN): Instances correctly predicted as negative.
 False Positives (FP): Instances incorrectly predicted as positive.
 False Negatives (FN): Instances incorrectly predicted as negative.
 Purpose: It provides a deeper understanding of a model’s effectiveness by breaking down correct and incorrect predictions.
 Metrics: Derived metrics include accuracy, precision, recall, and F1-score, offering a nuanced assessment of model performance.
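As a rough sketch, the four counts and the derived metrics can be computed for a binary problem as follows. The function names and the example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Returning 0.0 when a denominator is zero is one convention among several; some libraries raise a warning instead.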
28. Explain Logistic Regression
 Purpose: Logistic Regression is a statistical method used for binary classification problems, predicting the probability of an instance belonging to a particular class.
 Output: It produces probabilities using the logistic function, ensuring values between 0 and 1.
 Algorithm: Utilizes the logistic function (sigmoid) to model the relationship between the independent variables and the dependent binary outcome.
 Decision Boundary: Establishes a decision boundary, classifying instances based on the calculated probabilities.
 Application: Widely applied in predicting outcomes like whether an email is spam or not, disease diagnosis, and credit risk assessment.
 Linear Relationship: Assumes a linear relationship between input features and the log odds of the predicted outcome.
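The prediction step of logistic regression can be sketched as below: a linear combination of the features gives the log odds, and the sigmoid turns it into a probability. The weights and bias here are hypothetical values picked for the example, not fitted parameters, and the sigmoid is the plain textbook form without protection against overflow for extreme inputs.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class; the log odds are linear in the features."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


def predict_class(features, weights, bias, threshold=0.5):
    """Apply the decision boundary: class 1 if the probability reaches the threshold."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical, hand-picked parameters for a two-feature model.
weights, bias = [1.5, -0.5], -1.0
print(predict_proba([2.0, 1.0], weights, bias), predict_class([2.0, 1.0], weights, bias))
print(predict_proba([0.0, 0.0], weights, bias), predict_class([0.0, 0.0], weights, bias))
```

In a real model the weights and bias would be learned by maximising the likelihood of the training labels.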
29. Why are Validation and Test Datasets Needed?
This is a must-know topic in machine learning interview preparation.
Importance of Validation and Test Datasets:
 Training Dataset:
 Purpose: Used for training machine learning models by exposing them to labeled examples.
 Validation Dataset:
 Purpose: Essential for tuning model hyperparameters and preventing overfitting.
 Test Dataset:
 Purpose: Provides an unbiased evaluation of a model’s performance on new, unseen data.
 Generalization Check:
 Validation: Ensures the model generalizes well beyond the training set.
 Test: Verifies the model’s generalization to entirely new, unseen data.
 Model Selection:
 Validation: Guides the selection of the best-performing model during training.
 Test: Confirms the chosen model’s effectiveness on independent data, validating its real-world applicability.
 Avoiding Overfitting:
 Validation: Guards against overfitting by fine-tuning the model based on its performance on a separate dataset.
 Test: Provides a final checkpoint to confirm the model’s robustness and suitability for deployment.
30. What is Dimensionality Reduction?
 Definition:
 Purpose: Dimensionality Reduction is a technique in machine learning aimed at reducing the number of input features or variables in a dataset while preserving essential information.
 Curse of Dimensionality:
 Issue: Mitigates the “curse of dimensionality,” where high-dimensional data can lead to increased computational complexity and overfitting.
 Techniques:
 Principal Component Analysis (PCA): A linear technique that transforms data into a lower-dimensional space.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): Nonlinear method suitable for visualizing high-dimensional data in lower-dimensional space.
 Benefits:
 Computational Efficiency: Reduces computational load and memory requirements.
 Enhanced Model Performance: Addresses multicollinearity and improves model generalization.
 Applications:
 Image Processing: Simplifies image features.
 Text Mining: Condenses text data dimensions.
 Feature Engineering: Aids in feature selection and simplifies model interpretation.
31. What is the meaning of Parametric and Nonparametric Models?
 Parametric Models:
 Definition: Parametric models assume a specific functional form for the underlying data distribution.
 Characteristics: They have a fixed number of parameters that remain constant regardless of the size of the dataset.
 Examples: Linear Regression, Logistic Regression.
 Nonparametric Models:
 Definition: Nonparametric models make no assumptions about the underlying data distribution.
 Characteristics: They adapt and grow in complexity with the dataset size.
 Examples: k-Nearest Neighbors (KNN), Decision Trees, and Support Vector Machines (SVM).
 Flexibility:
 Parametric: Constrained by assumed distribution, limiting flexibility.
 Nonparametric: Highly flexible, suitable for diverse data patterns.
 Data Size Impact:
 Parametric: Stable with a fixed set of parameters, less affected by data size.
 Nonparametric: Adaptability makes them more suitable for varying dataset sizes.
 Assumptions:
 Parametric: Requires assumptions about data distribution.
 Nonparametric: Free from distribution assumptions, providing more flexibility for various datasets.
32. What is Cross-validation in Machine Learning?
You can expect this question in a typical machine learning interview. The answer is explained below.
 Definition:
 Purpose: Cross-validation is a resampling technique used to assess a machine learning model’s performance by dividing the dataset into subsets for training and evaluation.
 K-Fold Cross-validation:
 Procedure: Divide the dataset into K folds, using K-1 folds for training and the remaining one for validation in each iteration.
 Benefits:
 Reduced Bias: Provides a more robust estimate of model performance, reducing bias introduced by a single train-test split.
 Stratified Cross-validation:
 Application: Ensures that each fold maintains the proportion of classes present in the original dataset, which is particularly useful for imbalanced datasets.
 Leave-One-Out Cross-validation (LOOCV):
 Special Case: When K equals the number of instances in the dataset, each validation fold contains exactly one instance.
 Model Selection:
 Use: Aids in selecting the best-performing model and helps prevent overfitting or underfitting.
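The K-fold splitting procedure can be sketched in plain Python as follows. The generator name `k_fold_indices` is an assumption for this example, and it splits the indices in order without shuffling or stratification, which a real workflow would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 10 samples split into 3 folds: every sample is validated exactly once.
folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Setting `k` equal to `n_samples` gives leave-one-out cross-validation, where each validation fold holds a single sample.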
33. What is Entropy in Machine Learning?
 Definition:
 Information Measure: Entropy is a measure of uncertainty or disorder in a set of data, often used in the context of decision trees and information theory.
 Information Gain:
 Concept: In decision tree algorithms, entropy is used to calculate information gain, representing the reduction in uncertainty achieved by splitting a dataset based on a particular feature.
 Calculation:
 Formula: Entropy is mathematically expressed as the negative sum, over all classes, of the probability of each class multiplied by the logarithm of that probability: H = -Σ_{i} p_{i} log_{2}(p_{i}).
 Low Entropy:
 Interpretation: Low entropy indicates high certainty or homogeneity in a dataset.
 Decision Trees:
 Role: Entropy guides decision tree splits, favoring features that maximize information gain, leading to more accurate and efficient tree structures.
 Entropy Reduction:
 Objective: Minimizing entropy through optimal feature selection contributes to improved decision-making and model performance.
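Entropy and information gain can be sketched in a few lines of Python. The function names and the tiny label lists are illustrative assumptions, and the logarithm is taken to base 2 so that entropy is measured in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["spam", "spam", "ham", "ham"]
perfect_split = [["spam", "spam"], ["ham", "ham"]]   # each child is pure
useless_split = [["spam", "ham"], ["spam", "ham"]]   # children as mixed as the parent
print(entropy(parent))
print(information_gain(parent, perfect_split), information_gain(parent, useless_split))
```

A decision tree that uses this criterion would choose the split with the highest information gain at each node.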
34. What is Epoch in Machine Learning?
 Definition:
 Temporal Unit: An epoch refers to one complete pass through the entire training dataset by a machine learning model during training.
 Training Iteration:
 Purpose: Models learn from the entire dataset in each epoch, adjusting weights and biases to minimize the loss function.
 Batch Processing:
 Subdivisions: In deep learning, epochs are composed of smaller batches, allowing for more efficient updates of model parameters.
 Convergence Check:
 Monitoring: Researchers often monitor training performance over multiple epochs to assess convergence and prevent overfitting.
 Hyperparameter:
 Tuning: The number of epochs is a hyperparameter that requires tuning to optimize model performance without unnecessary computational costs.
 Early Stopping:
 Strategy: Training may be halted early if further epochs no longer improve performance on a validation set, which saves computation and helps prevent overfitting.
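The relationship between epochs, batches, and early stopping can be illustrated with a tiny training loop. This is only a sketch under stated assumptions: a one-parameter model y = w * x, noise-free toy data, and hypothetical settings for the learning rate, batch size, and patience.

```python
import random

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(train_data, val_data, lr=0.1, batch_size=4,
          max_epochs=200, patience=5, seed=0):
    rng = random.Random(seed)
    data = list(train_data)
    w = 0.0
    best_val, best_w, stale_epochs = float("inf"), w, 0
    for epoch in range(1, max_epochs + 1):        # one epoch = one full pass
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                        # one update per batch
        val_loss = mse(w, val_data)
        if val_loss < best_val - 1e-9:            # meaningful improvement
            best_val, best_w, stale_epochs = val_loss, w, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:          # early stopping
                break
    return best_w, epoch

# Toy data generated from y = 3x; every fifth point is held out for validation.
points = [(i / 10, 3 * i / 10) for i in range(1, 21)]
train_data = [p for i, p in enumerate(points) if i % 5 != 0]
val_data = [p for i, p in enumerate(points) if i % 5 == 0]

best_w, epochs_run = train(train_data, val_data)
print(best_w, epochs_run)   # weight near 3, reached well before max_epochs
```

Here the number of epochs is capped at 200, but the patience rule ends training once the validation loss has stopped improving for five epochs in a row.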
35. What are Type I and Type II Errors?
 Type I Error (False Positive):
 Definition: Type I error occurs when a null hypothesis is incorrectly rejected, indicating a false positive result.
 Significance: Often denoted by the symbol α, it represents the level of significance or the probability of making such an error.
 Type II Error (False Negative):
 Definition: Type II error happens when a false null hypothesis is not rejected, leading to a false negative outcome.
 Power: The probability of a Type II error is denoted by β. The statistical power of a test is 1 - β, i.e. the probability of correctly rejecting a false null hypothesis.
 Tradeoff:
 Balancing Act: In hypothesis testing, there is a tradeoff between Type I and Type II errors; for a fixed sample size, reducing one typically increases the other.
 Critical in Hypothesis Testing:
 Importance: Understanding and minimizing Type I and Type II errors are crucial in designing robust statistical tests and ensuring the validity of results.
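The tradeoff between α and β can be made concrete with a small calculation. The sketch below assumes a one-sided z-test with known standard deviation, testing H0: mean = 0 against a true mean equal to `effect`; the function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect, sigma, n):
    """Probability of failing to reject H0 (beta) in a one-sided z-test
    when the true mean is `effect` and H0 says the mean is 0."""
    z = NormalDist()                      # standard normal distribution
    critical = z.inv_cdf(1 - alpha)       # rejection threshold for the z statistic
    shift = effect * n ** 0.5 / sigma     # mean of the z statistic under H1
    return z.cdf(critical - shift)

beta_at_05 = type_ii_error(alpha=0.05, effect=0.5, sigma=1.0, n=25)
beta_at_01 = type_ii_error(alpha=0.01, effect=0.5, sigma=1.0, n=25)

print(beta_at_05, 1 - beta_at_05)   # beta and power at alpha = 0.05
print(beta_at_01, 1 - beta_at_01)   # stricter alpha: larger beta, lower power
```

With the sample size held fixed, tightening α from 0.05 to 0.01 raises β, which is exactly the balancing act described above.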
36. How is a Random Forest different from a Gradient Boosting Machine (GBM)?
 Ensemble Learning:
 Random Forest: It is an ensemble learning method that builds multiple decision trees and merges their predictions through averaging or voting.
 GBM: Gradient Boosting Machine is another ensemble method that constructs decision trees sequentially, with each tree correcting the errors of the previous ones.
 Tree Construction:
 Random Forest: Trees are constructed independently, and the final prediction is an aggregation of individual tree predictions.
 GBM: Trees are built sequentially, focusing on reducing the errors of the previous models.
 Training Process:
 Random Forest: Training is parallelized as trees are constructed independently.
 GBM: Training is sequential, with each tree attempting to improve upon the errors of the ensemble.
 Overfitting:
 Random Forest: Less prone to overfitting due to the averaging effect of multiple trees.
 GBM: More sensitive to overfitting, especially if the number of trees is not properly tuned.
 Handling Outliers:
 Random Forest: Robust to outliers; individual trees may be affected by them, but averaging over many trees dilutes their influence on the ensemble.
 GBM: Sensitive to outliers, as subsequent trees may attempt to correct errors introduced by outliers in earlier trees.
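The difference between independent and sequential tree construction can be sketched with one-split trees ("stumps") on toy data. This is a deliberately simplified illustration, not a real Random Forest or GBM: it omits random feature selection, uses squared-error loss only, and all function names and settings are assumptions.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:                  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:                                # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_stumps(xs, ys, n_trees, rng):
    """Bagging: each stump sees its own bootstrap sample; predictions are averaged."""
    models = []
    for _ in range(n_trees):                        # trees do not depend on each other
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

def boosted_stumps(xs, ys, n_trees, learning_rate=0.5):
    """Boosting: each stump is fitted to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    models = []
    for _ in range(n_trees):                        # strictly sequential
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(m(x) for m in models)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
forest_like = bagged_stumps(xs, ys, n_trees=20, rng=random.Random(0))
boosted = boosted_stumps(xs, ys, n_trees=20)
print(mse(forest_like, xs, ys), mse(boosted, xs, ys))
```

The bagging loop could be run in parallel because no stump needs another's output, whereas the boosting loop cannot, since every stump depends on the residuals produced so far.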
37. Differentiate between Sigmoid and Softmax Functions.
This is one of the most popular machine learning interview questions. The differences between the two functions are summarized below.
 Purpose:
 Sigmoid: Primarily used for binary classification; in multi-label problems it provides an independent probability for each class.
 Softmax: Applied in multiclass classification, offering a probability distribution over multiple classes.
 Output Range:
 Sigmoid: Outputs individual probabilities between 0 and 1, suitable for binary decisions.
 Softmax: Generates a normalized probability distribution across classes, ensuring the sum equals 1.
 Application:
 Sigmoid: Common in binary classification neural networks.
 Softmax: Ideal for neural networks handling multiple mutually exclusive classes.
 Independence:
 Sigmoid: Treats each output independently, so an instance can belong to several classes at once (multi-label).
 Softmax: Assumes instances belong to a single exclusive class.
 Activation Function:
 Sigmoid: Used in the output layer for binary classification.
 Softmax: Employed in the output layer for multiclass classification.
 Decision Boundary:
 Sigmoid: Binary decisions based on a threshold (e.g., 0.5).
 Softmax: Assigns instances to the class with the highest probability.
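A short Python sketch makes the contrast visible. The function names and the example scores are assumptions for illustration; the formulas themselves are the standard definitions of the two functions.

```python
import math

def sigmoid(z):
    """Squash a single score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into a probability distribution that sums to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical scores for three classes
independent = [sigmoid(z) for z in logits]       # each in (0, 1); need not sum to 1
distribution = softmax(logits)                   # sums to 1 across classes

print(independent, sum(independent))
print(distribution, sum(distribution))
print(distribution.index(max(distribution)))     # predicted class: index 0
```

For two classes the functions agree: softmax over the scores [z, 0] gives the first class the same probability as sigmoid(z).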
38. What are the Two Main Types of Filtering in Machine Learning?
 Temporal Filtering:
 Purpose: Focuses on analyzing and processing data over time.
 Application: Commonly used in time-series analysis and forecasting tasks.
 Examples: Moving averages, exponential smoothing.
 Frequency Filtering:
 Purpose: Concentrates on the frequency components within data.
 Application: Applied in signal processing, image processing, and feature extraction.
 Examples: Fourier Transform, wavelet analysis.
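The two temporal filtering examples named above can be sketched in plain Python. The function names and the readings are assumptions for illustration; frequency-domain filtering is not covered by this sketch.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_smoothing(series, alpha):
    """Blend each new value with the previous smoothed value.
    A larger alpha puts more weight on recent observations."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

readings = [10, 12, 11, 15, 14, 18]              # hypothetical time series
print(moving_average(readings, 3))
print(exponential_smoothing(readings, 0.5))      # [10, 11.0, 11.0, 13.0, 13.5, 15.75]
```

Both filters reduce short-term fluctuations; the moving average weights the last few points equally, while exponential smoothing lets older points fade out gradually.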
39. What is Ensemble Learning?
 Definition:
 Ensemble Learning involves combining predictions from multiple machine learning models to enhance overall performance and accuracy.
 Key Components:
 Base Models: Ensemble methods utilize diverse base models, such as decision trees or neural networks.
 Voting or Weighting: Combining predictions through voting (majority) or assigning weights based on model performance.
 Advantages:
 Improved Accuracy: Ensemble methods often outperform individual models, capturing a more comprehensive understanding of complex patterns.
 Robustness: They are less prone to overfitting and generalize well to diverse datasets.
 Types of Ensemble Learning:
 Bagging (Bootstrap Aggregating): Parallel training of multiple models on bootstrapped subsets.
 Boosting: Sequential training where models focus on correcting errors of predecessors.
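Combining predictions by voting or weighting can be sketched as follows. The three "models" are just hypothetical lists of predicted labels, and the weights are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all of the same length."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def weighted_vote(predictions, weights):
    """Each model's vote counts in proportion to its weight."""
    combined = []
    for votes in zip(*predictions):
        totals = Counter()
        for label, weight in zip(votes, weights):
            totals[label] += weight
        combined.append(max(totals, key=totals.get))
    return combined

y_true  = [1, 0, 1, 1]
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

print(majority_vote([model_a, model_b, model_c]))                    # [1, 0, 1, 1]
print(weighted_vote([model_a, model_b, model_c], [0.2, 0.7, 0.1]))   # [1, 1, 0, 1]
```

In this toy case the majority vote corrects the individual mistakes of models b and c, while the weighted vote simply follows model b because its weight outweighs the other two combined.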
40. What is the difference between the Standard Scaler and the MinMax Scaler?
 Scaling Method:
 Standard Scaler: Utilizes z-score normalization, transforming data to have a mean of 0 and a standard deviation of 1.
 MinMax Scaler: Scales data to a specific range, usually between 0 and 1, maintaining the relative distances between values.
 Effect on Outliers:
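 Standard Scaler: Affected by outliers, since they shift the mean and inflate the standard deviation.
 MinMax Scaler: Even more sensitive to outliers, because the minimum and maximum define the range; a single extreme value compresses all the other values into a narrow band.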
 Output Range:
 Standard Scaler: May produce values outside the 0 to 1 range.
 MinMax Scaler: Constricts values to the specified range.
 Use Cases:
 Standard Scaler: Suitable when the distribution of features is approximately Gaussian.
 MinMax Scaler: Effective when features have varying scales, and a specific range is desired.
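Both scaling methods are easy to write out by hand, which also shows how an outlier affects them. This is a sketch with assumed function names and toy values, not a library implementation.

```python
import statistics

def standard_scale(values):
    """z-score scaling: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map the smallest value to `low` and the largest to `high`."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

values = [1.0, 2.0, 3.0, 4.0, 100.0]     # the last value is an outlier
z_scores = standard_scale(values)
ranged = min_max_scale(values)

print(z_scores)   # mean 0, standard deviation 1, values fall outside [0, 1]
print(ranged)     # first four values are squeezed close to 0; the outlier maps to 1
```

The outlier pins the top of the MinMax range, so the four ordinary values end up almost indistinguishable from one another.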
41. How does tree splitting take place?
 Feature Selection:
 Decision Point: Identify the feature that best splits the dataset based on certain criteria, commonly using measures like Gini impurity or information gain.
 Splitting Criteria:
 Threshold Determination: Establish a threshold value for the selected feature that optimally divides the data into subsets.
 Categorical Features: For categorical features, split based on distinct categories.
 Evaluation:
 Criterion Evaluation: Assess the effectiveness of the split using the chosen impurity measure.
 Best Split: Choose the split that minimizes impurity or maximizes information gain.
 Recursive Process:
 Repeat: Continue recursively splitting each subset until a stopping condition is met, such as a predefined tree depth or a minimum number of samples per leaf.
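A single splitting step on one numeric feature can be sketched as below, using Gini impurity as the criterion. The function names and the toy data are assumptions for illustration; a full tree would apply the same search recursively to each resulting subset.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values and
    return the threshold with the lowest size-weighted Gini impurity."""
    best_t, best_impurity = None, float("inf")
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity

ages = [22, 25, 30, 35, 40, 45]          # hypothetical feature
bought = [0, 0, 0, 1, 1, 1]              # hypothetical target
print(best_threshold(ages, bought))      # (32.5, 0.0)
```

The threshold 32.5 separates the two classes perfectly, so both child nodes are pure and recursion would stop there.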
42. What is the F1-score, and How Is It Used?
 Calculation:
 Precision and Recall: The F1-score is the harmonic mean of precision and recall, combining both metrics into a single value.
 Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall).
 Balanced Metric:
 Harmonizes Precision and Recall: Because the harmonic mean is pulled towards the lower of the two values, a high F1-score requires both precision and recall to be high. This is particularly useful when the class distribution is uneven and accuracy alone would be misleading.
 Application:
 Binary Classification: Commonly applied in scenarios where there are two classes (positive and negative).
 Imbalanced Datasets: Suitable for assessing models on datasets where one class significantly outnumbers the other.
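The formula above translates directly into code. The function name and the toy labels are assumptions for illustration; the example is chosen so that accuracy looks good while the F1-score does not.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Imbalanced toy data: only 2 of 10 samples are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                               # 0.8
print(precision_recall_f1(y_true, y_pred))    # (0.5, 0.5, 0.5)
```

Accuracy is 80% mostly because negatives dominate, whereas the F1-score of 0.5 reflects that only half of the positives were found and only half of the positive predictions were correct.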
43. What is Overfitting, and how can it be avoided?
 Definition:
 Issue: Overfitting occurs when a model learns the training data too well, capturing noise and patterns that don’t generalize to new, unseen data.
 Causes:
 Complex Models: Overly complex models, such as deep neural networks, are prone to overfitting.
 Small Datasets: Limited training data increases the likelihood of the model memorizing noise.
 Avoidance Strategies:
 Regularization: Introduce penalties for complex model structures to discourage overfitting.
 Cross-Validation: Evaluate model performance on multiple subsets of the data to ensure generalization.
 Feature Selection: Choose relevant features and avoid unnecessary complexity.
 Data Augmentation: Increase dataset size through transformations to expose the model to diverse examples.
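As one small illustration of the regularization strategy, the sketch below fits a slope with an L2 (ridge) penalty. It assumes a model y = w * x with no intercept, for which the penalized least-squares solution has a simple closed form; the function name and the data are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimising sum((w * x - y)^2) + penalty * w^2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(ridge_slope(xs, ys, penalty=0.0))    # 2.0, the ordinary least-squares slope
print(ridge_slope(xs, ys, penalty=14.0))   # 1.0, the penalty shrinks the slope
```

The penalty pulls the coefficient towards zero; with many features, this shrinkage limits how closely the model can chase noise in the training data.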
44. What is the Hypothesis in Machine Learning?
 Definition:
 Assumption: In machine learning, a hypothesis is an assumption or conjecture about the relationship between input features and the target variable.
 Representation:
 Function Form: Often represented as a mathematical function that maps input features to the predicted output.
 Training Process:
 Adjustment: During training, the model iteratively adjusts its hypothesis based on the error between predicted and actual outcomes.
 Example:
 Linear Regression: In linear regression, the hypothesis might be a linear equation expressing the relationship between input features and the target variable.
45. What is the Variance Inflation Factor?
 Definition:
 Multicollinearity Measure: VIF is a statistical measure that quantifies the extent to which the variance of an estimated regression coefficient increases when predictors are highly correlated.
 Calculation:
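 Formula: For each predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors. Equivalently, it is the ratio of the variance of that coefficient in the full model to its variance in a model containing only that predictor.
 Interpretation:
 High VIF: As a common rule of thumb, values exceeding 10 indicate significant multicollinearity, suggesting that the predictor is too strongly correlated with the others.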
 Impact:
 Effects: High VIF values can lead to unstable and less reliable coefficient estimates in regression models.
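For the special case of exactly two predictors, the R-squared from regressing one on the other is just their squared Pearson correlation, so VIF = 1 / (1 - r^2) can be computed by hand. The sketch below covers only that case; the function names and the data are assumptions for illustration.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors (same value for both)."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]          # moderately correlated with x1 (r = 0.8)
x3 = [1.1, 2.0, 2.9, 4.2, 5.0]          # almost a copy of x1

print(vif_two_predictors(x1, x2))       # about 2.8
print(vif_two_predictors(x1, x3))       # far above the rule-of-thumb value of 10
```

With more than two predictors, R_j^2 has to come from a full regression of predictor j on all the others rather than from a single correlation.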
Machine Learning Interviews and How to Ace Them
Machine learning interviews vary in type and emphasis; for instance, a few recruiters ask many linear regression interview questions. Interviews for a Machine Learning Engineer role can specialise in categories like Coding, Research, Case Study, Project Management, Presentation, System Design, and Statistics. We will focus on the most common categories and how to prepare for them.
1. Coding
Coding and programming are significant components of a machine learning interview and are frequently used to screen applicants. To do well in these interviews, you need solid programming abilities. Coding interviews typically run 45 to 60 minutes and usually consist of just two questions. The interviewer poses a problem and expects the candidate to solve it effectively in as little time as possible.
How to prepare – You can prepare for these interviews by building a good understanding of data structures and of time and space complexity, along with time management and the ability to understand and solve a problem. upGrad has a great software engineering course that can help you enhance your coding skills and get ready for the coding problems that come up in the interview.
2. Machine Learning
These interviews evaluate your understanding of machine learning principles. Convolutional layers, recurrent neural networks, generative adversarial networks, speech recognition, and other topics may be covered, depending on the requirements of the job.
How to prepare – To ace this interview, make sure you have a thorough understanding of the job role and its responsibilities. This will help you identify the specific areas of ML you must study. If the job description gives no such specifics, make sure you deeply understand the basics. An in-depth course in ML, such as the one upGrad provides, can help with that. You can also read the latest articles on ML and AI regularly to keep up with current trends.
3. Screening
This interview is fairly informal and is typically one of the first stages of the interview process. It is often handled by the prospective employer. Its main goal is to give the applicant an overview of the organization, the role, and the duties. In this more informal setting, the candidate is also asked about their background and experience to determine whether their area of interest matches the position.
How to prepare – This is a largely non-technical part of the interview. All it requires is honesty about your background and a grasp of the basics, both general and specific to your machine learning specialization.
4. System Design
Such interviews test a candidate’s capacity to design a fully scalable solution from start to finish. Many engineers are so focused on a single problem that they overlook the bigger picture. A system design interview calls for an understanding of the many components that combine to produce a solution, including the front-end layout, the load balancer, the cache, and more. An effective, scalable end-to-end system is easier to build when these components are well understood.
How to prepare – Learn the concepts and components of system design. Use real-life examples when explaining your proposed structure so that the interviewer can follow your approach.
If there is a significant gap between the values to which the training and cross-validation errors converge, i.e. the cross-validation error is significantly higher than the training error, it suggests that the model is overfitting the training data and suffers from high variance.
That’s the end of the first part of this series. Stick around for the next part, which consists of questions based on Logistic Regression. Feel free to post your comments.
Coauthored by – Ojas Agarwal
You can check our Executive PG Programme in Machine Learning & AI, which provides practical hands-on workshops, a one-to-one industry mentor, 12 case studies and assignments, IIIT-B Alumni status, and more.