Feature engineering is one of the most important aspects of any data science project. It refers to the techniques used for extracting and refining features from raw data, both to create suitable input for the model and to improve the model's performance.
Models are trained on the features that we derive from the raw data. Sometimes the data as given is not good enough for the model to learn from. If we can derive features that capture the underlying problem, we obtain a good representation of the data. The better the representation of the data, the better the fit of the model and the better its results.
The workflow of any data science project is an iterative process rather than a one-time process. In most data science projects, a base model is created after creating and refining the features from the raw data. Upon obtaining the results of the base model, some existing features can be tweaked, and some new features are also derived from the data to optimize the model results.
The techniques used in the feature engineering process may not give the same results for all algorithms and data sets. Some of the common techniques used in the feature engineering process are as follows:
1. Value Transformation
The values of a feature can be transformed onto another scale using functions such as the logarithm, root, or exponential. These functions have limitations and cannot be used on every data set. For instance, the square root transformation cannot be applied to features that contain negative values, and the logarithmic transformation additionally cannot be applied to zeros.
One of the most commonly used functions is the logarithmic function. It can help reduce the skewness of data that is skewed towards one end, typically with a long tail of large values. The log transformation tends to bring such data closer to a normal distribution, which reduces the effect of outliers on the performance of the model.
It also helps in reducing the magnitude of the values in a feature. This is useful with algorithms that are sensitive to the magnitude of feature values and would otherwise give more weight to features with larger values.
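As a rough illustration of the log transformation, here is a minimal sketch using NumPy. The `income` values are made up for this example, and `np.log1p` (which computes log(1 + x)) is used here as one common way to tolerate zeros; it is an assumption of this sketch, not something the technique requires.

```python
import numpy as np

# Hypothetical right-skewed feature; the values are invented for illustration.
income = np.array([20_000, 25_000, 30_000, 35_000, 1_000_000], dtype=float)

# log1p computes log(1 + x), so zeros are handled; inputs at or below -1 are still invalid.
income_log = np.log1p(income)

# Ratio of the largest to the smallest value before and after the transform.
ratio_raw = income.max() / income.min()
ratio_log = income_log.max() / income_log.min()
```

After the transform, the extreme value is far closer to the rest of the data, so `ratio_log` is much smaller than `ratio_raw`.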
2. Data Imputation
Data imputation refers to filling in the missing values in a data set with some statistical value. This technique is important because some algorithms do not work with missing values, which forces us either to use other algorithms or to impute the missing values. Imputation is preferred when the percentage of missing values in a feature is small (around 5 to 10%); otherwise it would distort the distribution of the data. The methods differ for numerical and categorical features.
We can impute the missing values in numerical features with arbitrary values within a specified range or with statistical measures like the mean or median. These imputations must be made carefully, as some statistical measures, the mean in particular, are sensitive to outliers, which can degrade the performance of the model. For categorical features, we can impute the missing values with an additional category that does not otherwise appear in the data set, or simply label them as missing if the category is unknown.
The former requires good domain knowledge to find the correct category, while the latter is a more general fallback. We can also use the mode to impute categorical features. However, imputing with the mode can lead to over-representation of the most frequent label if there are many missing values.
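The imputation options above can be sketched with pandas as follows. The DataFrame, its column names, and the "Missing" label are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values in one numerical and one categorical column.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 120.0],
    "city": ["Pune", None, "Delhi", "Pune", None],
})

# Numerical feature: the median is less affected by the extreme value (120) than the mean.
df["age_imputed"] = df["age"].fillna(df["age"].median())

# Categorical feature, option 1: flag missing entries with an explicit "Missing" label.
df["city_missing_label"] = df["city"].fillna("Missing")

# Categorical feature, option 2: fill with the most frequent label (the mode).
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])
```

Note that with two of five values missing, filling with the mode makes "Pune" appear in four of five rows, which is the over-representation risk described above.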
3. Categorical Encoding
One of the requirements in many algorithms is that the input data should be numerical in nature. This turns out to be a constraint for using categorical features in such algorithms. To represent the categorical features as numbers, we need to perform categorical encoding. Some of the methods to convert the categorical features into numbers are as follows:
1. One-hot encoding: One-hot encoding creates a new feature for each label in a categorical feature, taking a value of either 0 or 1. Each new feature indicates whether that label is present for a given observation. For instance, if a categorical feature has 4 labels, applying one-hot encoding creates 4 Boolean features.
The same amount of information can also be captured with 3 features: if all 3 contain 0, the value of the categorical feature must be the 4th label. This method increases the feature space considerably if the data set has many categorical features with a high number of labels.
2. Frequency encoding: This method calculates the count or the percentage of each label in the categorical feature and maps it to that label. It does not extend the feature space of the data set. One drawback is that if two or more labels have the same count in the data set, they are all mapped to the same number, so the model can no longer tell them apart.
3. Ordinal encoding: Also known as label encoding, this method maps the distinct values of a categorical feature to numbers from 0 to n-1, where n is the number of distinct labels in the feature. It does not enlarge the feature space of the data set, but it does impose an ordinal relationship on the labels, which can mislead the model if the labels have no natural order.
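The three encodings can be compared side by side in a short pandas sketch. The `color` column and its labels are hypothetical, and assigning ordinal codes in alphabetical order is just one arbitrary choice made for this example.

```python
import pandas as pd

# Hypothetical categorical feature with three labels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "green"]})

# One-hot encoding: one 0/1 column per label.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Dropping the first label keeps the same information in one fewer column.
one_hot_reduced = pd.get_dummies(df["color"], prefix="color", drop_first=True, dtype=int)

# Frequency encoding: replace each label with its relative frequency.
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

# Ordinal (label) encoding: map the alphabetically sorted labels to 0..n-1.
labels = sorted(df["color"].unique())
mapping = {label: i for i, label in enumerate(labels)}
df["color_ordinal"] = df["color"].map(mapping)
```

Here one-hot encoding adds three columns (two with `drop_first=True`), while the frequency and ordinal encodings each add a single column.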
4. Handling of Outliers
Outliers are data points whose values are very different from the rest of the data. To handle outliers, we need to detect them first. We can detect them using visualizations like box plots and scatter plots in Python, or we can use the interquartile range (IQR). The interquartile range is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile).
Values that fall below Q1 – 1.5*IQR or above Q3 + 1.5*IQR are termed outliers. After detecting the outliers, we can handle them by removing them from the data set, applying a transformation, or treating them as missing values and imputing them.
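The IQR rule can be sketched in a few lines of NumPy. The `values` array is invented for illustration, and capping the outliers at the fences is shown only as one possible treatment alongside those listed above.

```python
import numpy as np

# Hypothetical feature with one obviously extreme value.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# One possible treatment: cap (clip) the values at the fences instead of dropping them.
capped = np.clip(values, lower, upper)
```

With this data, only the value 102 falls outside the fences.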
5. Feature Scaling
Feature scaling is used to change the values of the features and to bring them within a comparable range. It is important to apply this process if we are using algorithms like SVM, linear regression, KNN, etc. that are sensitive to the magnitude of the values. To scale the features, we can perform standardization, normalization, or min-max scaling. Normalization (also called mean normalization) centers the values of a feature around 0 so that they fall between -1 and 1. It is the ratio of the difference between each observation and the mean to the difference between the maximum and minimum values of that feature, i.e. [X – mean(X)]/[max(X) – min(X)].
Min-max scaling uses the minimum value of the feature instead of the mean in the numerator, i.e. [X – min(X)]/[max(X) – min(X)], which rescales the values to the range 0 to 1. This method is very sensitive to outliers, as it depends only on the end values of the feature. Standardization subtracts the mean and divides by the standard deviation, i.e. [X – mean(X)]/std(X), so that the feature has a mean of 0 and a standard deviation of 1; the result is not bounded to a fixed range. None of these methods changes the shape of the distribution; they only shift and rescale it.
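To make the differences concrete, the sketch below applies the usual definitions of the three scalers to a small made-up array: min-max scaling maps to [0, 1], mean normalization centers around 0 within [-1, 1], and standardization yields zero mean and unit standard deviation. The population standard deviation (NumPy's default) is assumed here.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: (x - mean) / (max - min), result lies within [-1, 1].
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and standard deviation 1.
standardized = (x - x.mean()) / x.std()
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) would normally be computed on the training data only and then reused for the test data.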
6. Handling Date and Time Variables
We come across many variables that indicate the date and time in different formats. We can derive more features from the date like the month, day of the week/month, year, weekend or not, the difference between the dates, etc. This can allow us to extract more insightful information from the data set. From the time features, we can also extract information like hours, minutes, seconds, etc.
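A minimal pandas sketch of deriving such features is shown below. The column names and timestamps are hypothetical. Note that pandas numbers the days of the week from Monday=0 to Sunday=6, which differs from conventions that start the week on Sunday.

```python
import pandas as pd

# Hypothetical data set with two datetime columns.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-01 09:30:00", "2021-01-03 18:45:00"]),
    "delivery_date": pd.to_datetime(["2021-01-04 12:00:00", "2021-01-10 08:15:00"]),
})

# Calendar features from the date part.
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

# Features from the time part.
df["hour"] = df["order_date"].dt.hour
df["minute"] = df["order_date"].dt.minute

# Difference between two dates, in whole days.
df["days_to_delivery"] = (df["delivery_date"] - df["order_date"]).dt.days
```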
One thing that most people miss is that many date and time variables are cyclic features. For example, suppose we need to check which day between Wednesday (4) and Saturday (7) is closer to Sunday (being a 1). We know that Saturday is closer, but in numerical terms it will be Wednesday, as the distance between 4 and 1 is less than that between 7 and 1. The same applies to the hour when time is in 24-hour format: 23:00 and 00:00 are adjacent, yet numerically far apart.
To tackle this problem, we can represent these variables using the sin and cos functions. For the 'minute' feature, we can apply the sin and cos functions using NumPy to represent its cyclic nature as follows:
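import numpy as np  # df is assumed to be a pandas DataFrame with a 'minute_feature' column
minute_feature_sin = np.sin(df['minute_feature'] * (2 * np.pi / 60))
minute_feature_cos = np.cos(df['minute_feature'] * (2 * np.pi / 60))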
(Note: Dividing by 60 because there are 60 minutes in an hour. If you want to do it for months, divide it by 12 and so on)
By plotting these features on a scatter plot, you will notice that these features exhibit a cyclic relationship between them.
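The same idea can be checked on the day-of-week example, with Sunday numbered 1 and Saturday numbered 7 (so Wednesday is 4). The helper `encode_cyclic` below is a hypothetical name introduced for this sketch; it simply wraps the sin and cos expressions for an arbitrary period.

```python
import numpy as np

def encode_cyclic(value, period):
    """Return the (sin, cos) pair that places `value` on a circle of the given period."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Days numbered Sunday=1, ..., Wednesday=4, ..., Saturday=7.
sunday, wednesday, saturday = (encode_cyclic(d, 7) for d in (1, 4, 7))

# Euclidean distance between the encoded points.
dist_saturday = np.linalg.norm(saturday - sunday)
dist_wednesday = np.linalg.norm(wednesday - sunday)
```

In the encoded space, `dist_saturday` is smaller than `dist_wednesday`, so Saturday correctly comes out as the day closer to Sunday.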
This article focused on the importance of feature engineering and described some common techniques used in the process. Which of the techniques listed above will provide better insights depends on the algorithm and the data at hand.
That is hard to know in advance, and it is not safe to assume, as data sets differ and the algorithms applied to them vary as well. The better approach is to work incrementally and keep track of the models that have been built along with their results, rather than performing feature engineering recklessly.