Data Preprocessing


The objective of data preprocessing is to analyze, filter, transform, and encode data so that a machine learning algorithm can understand and work with the processed output. The phrase "garbage in, garbage out" applies directly to data mining and machine learning projects. Unclean data such as missing attributes or attribute values, noise, outliers, and duplicate or incorrect records will degrade the quality of the ML results. It is therefore important to transform the raw data into a useful and efficient format before it is used in machine learning, to ensure or enhance performance.



Important Libraries for Data Preprocessing:

To do data preprocessing in Python, we need to import some predefined Python libraries, each of which performs specific tasks. There are three libraries that we will use for data preprocessing.

·       NumPy: The NumPy library provides many kinds of mathematical operations and is the basic package for scientific computing in Python. It also supports large multidimensional arrays and matrices.

·       Matplotlib: The second library is Matplotlib, a Python 2D plotting library; along with it we import its pyplot sub-library. This library is used to plot any kind of chart in Python.

·       Pandas: The last library is Pandas, one of the most popular Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library.
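
As a minimal sketch, these three libraries are typically imported at the top of a preprocessing script with their conventional aliases:

# Standard imports used by the preprocessing examples in this post
import numpy as np               # numerical arrays and matrix operations
import matplotlib.pyplot as plt  # plotting (the pyplot sub-library)
import pandas as pd              # importing and managing datasets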

 1. DATA CLEANING

REMOVE MISSING VALUES

Missing values in a dataset are a problem because they can skew the results, depending on how they arise. If the remaining data effectively comes from an unrepresentative sample, the findings may not generalize to situations outside our study. There are multiple ways to handle missing values, presented below.

·       Deleting a particular row – In this method, a specific row that has a null value for a feature is removed; a row or column with more than 75% missing values can also be deleted. However, this method is not completely efficient, and it is recommended only when the dataset contains a sufficient number of samples. You must also ensure that deleting the data does not introduce additional bias.

·       Replacing missing values with zero – This technique may work for simple datasets in which zero acts as a base number indicating that the value is missing. In most cases, however, zero can represent a value in and of itself. For instance, if a sensor generates temperature values and the dataset is from a tropical region, filling the gaps with 0 would mislead the model, because a reading of 0 is a real value that lies far outside the plausible range. Zero should be used as a replacement only when the dataset is independent of its effect. In phone bill data, for example, a missing value in the billed-amount column can be replaced by zero because it may simply indicate that the user did not subscribe to a plan that month.

·       Calculating the mean, median, or mode – This method is helpful for features with numerical data, such as age, salary, or year. The mean, median, or mode of the feature, column, or row containing the missing value is calculated and used to fill in the gap. Because no rows are discarded, data loss is effectively offset, and this approach usually produces better results than the first strategy (omitting rows and columns). It can be applied to any feature containing numerical information, such as a year column or a home-team-goal column.

·       The choice of statistic depends on the feature. For a cricketer's skill, the average score (run rate) is a better fill value than the median or mode. For age or salary, the median is a better choice than the mean, which is pulled toward extreme values. The mode is helpful where many rows share the same value, i.e., for data with low variance, and is generally the preferred fill for missing categorical data.

      Utilizing the deviation of nearby values is another method of approximation; it works best when the data is roughly linear. This estimate can introduce variation into the dataset, but, like the statistics above, it negates the loss of data and produces better outcomes than removing rows and columns. In some cases, the mean, median, or mode of a specific number of nearby values, or of rows sharing the same key element, can be used instead. Note that if these statistics are computed over the entire dataset, including the test portion, the imputation can cause training data leakage. For a variable with longitudinal behaviour, it may make sense to carry the last valid observation forward to fill the missing value; this is known as the last observation carried forward (LOCF) method. This method isn't used in our pre-processing stages.
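
The options above can be sketched briefly in pandas. This is only an illustration, assuming a hypothetical DataFrame df with numeric columns 'age' and 'salary' and a categorical column 'city':

import numpy as np
import pandas as pd

# Hypothetical example data with missing values
df = pd.DataFrame({
    'age':    [25, np.nan, 40, 31, np.nan],
    'salary': [50000, 62000, np.nan, 58000, 61000],
    'city':   ['Chennai', None, 'Mumbai', 'Chennai', 'Delhi'],
})

# 1. Deleting rows that contain missing values
dropped = df.dropna()

# 2. Replacing missing values with zero (only when zero is meaningful)
zero_filled = df.fillna(0)

# 3. Replacing missing values with the mean / median / mode
df['salary'] = df['salary'].fillna(df['salary'].mean())
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# 4. Last observation carried forward (LOCF) for longitudinal data
locf = df.ffill()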

REMOVE NOISY DATA

Noisy data is meaningless, unstructured, or faulty data that cannot be correctly read and interpreted by programs. It can be generated by faulty data collection, data entry errors, and so on.

It can be handled in the following ways:

Binning Method:

This technique works on sorted data in order to smooth it. The entire set of data is divided into segments of equal size, and each segment is handled independently. All the data in a segment can be replaced by its mean, or boundary values can be used to complete the task.
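
As an illustration, here is a small sketch of smoothing by bin means and by bin boundaries, assuming a hypothetical sorted list of twelve values split into equal-frequency bins:

import numpy as np

# Hypothetical values to smooth, sorted and split into bins of 4
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = values.reshape(-1, 4)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), bins.shape[1], axis=1)

# Smoothing by bin boundaries: each value is replaced by the closer boundary
low, high = bins[:, [0]], bins[:, [-1]]
by_boundaries = np.where(np.abs(bins - low) <= np.abs(bins - high), low, high)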

Regression:

In this method, the data is smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
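
As a rough sketch (one possible way, not the only one), a linear regression can be fitted to a noisy attribute and its fitted values used as the smoothed data; the x and y below are purely hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy attribute y that roughly depends linearly on x
rng = np.random.default_rng(0)
x = np.arange(50).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=10.0, size=50)

# Fit a regression function and use its predictions as the smoothed values
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)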

Clustering:

This strategy groups related data into clusters. Values that fall outside the clusters can be treated as outliers, or they may simply go unnoticed.
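
A rough sketch of this idea with k-means is shown below; the data, the number of clusters, and the distance threshold are all hypothetical choices for illustration only:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data with two dense groups and a couple of far-away points
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    np.array([[10.0, -8.0], [-9.0, 9.0]]),   # likely outliers
])

# Cluster the data, then measure each point's distance to its cluster centre
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
distances = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points that lie unusually far from their cluster centre
outliers = data[distances > distances.mean() + 3 * distances.std()]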

2.  DATA TRANSFORMATION

Before performing data mining, data transformation is a crucial preprocessing step that must be applied to the data in order to produce patterns that are simpler to comprehend. Data transformation turns the data into clean, usable data by altering its format, structure, or values.

Feature scaling

Feature scaling is a technique to normalize or standardize the independent variables of a dataset within a particular range. Because the ranges of raw data values vary widely, objective functions in machine learning algorithms that rely on Euclidean distance will not perform well without feature scaling. There are two common feature-scaling techniques.

·       Normalization: In this method, the data values are scaled within a specified range (-1.0 to 1.0 or 0.0 to 1.0). It is also called min-max scaling.

Here, the formula is X' = (X - Xmin) / (Xmax - Xmin)

·       Standardization: In this method, the data values are centered around the mean with unit standard deviation. This means that the mean of the attribute becomes 0 and the resulting distribution has a standard deviation of 1.

Here, the formula is X' = (X - Xmean) / standard deviation

For feature scaling, we will use the StandardScaler class of the sklearn.preprocessing library.
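
A minimal sketch of both techniques with scikit-learn, using a hypothetical two-column numeric array X (for example age and salary):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
X = np.array([[25, 50000],
              [32, 62000],
              [47, 91000],
              [51, 48000]], dtype=float)

# Standardization: each column gets mean 0 and unit standard deviation
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): each column is rescaled to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)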

Attribute Selection:

In this strategy, new attributes are built from the given set of attributes to help the data mining process.

Discretization:

This is done to replace the raw values of the numeric attribute with interval or concept levels.
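
For instance, a hypothetical numeric age column can be discretized into intervals or concept levels with pandas:

import pandas as pd

# Hypothetical ages to discretize
ages = pd.Series([12, 25, 37, 45, 63, 78])

# Replace raw values with interval levels (three equal-width intervals)
intervals = pd.cut(ages, bins=3)

# Or replace them with concept levels
concepts = pd.cut(ages, bins=[0, 18, 60, 120], labels=['young', 'adult', 'senior'])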

Concept Hierarchy Generation:

Here, attributes are generalized from a lower level to a higher level in the hierarchy. For example, "City" can be generalized to "Country".

Data rescaling

If data consists of attributes at different scales, many ML algorithms can benefit from rescaling the attributes so they are all at the same scale. This is useful for optimization algorithms used at the core of ML algorithms such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors.

Binarizing Data

Data can be transformed using a binary threshold: all values above the threshold are marked 1 and all values below it are marked 0. This is called binarization or thresholding of the data. It is useful when crisp yes/no values are needed, or when engineering a new feature that indicates something meaningful.
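
A short sketch of thresholding on a hypothetical array, shown both with plain NumPy and with scikit-learn's Binarizer (the 0.5 threshold is arbitrary):

import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical attribute values and an arbitrary threshold of 0.5
X = np.array([[0.1, 0.7, 0.5, 0.9, 0.3]])

binary_np = (X > 0.5).astype(int)                      # plain NumPy thresholding
binary_sk = Binarizer(threshold=0.5).fit_transform(X)  # the same result via sklearn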

Binary Classification

Here, a field can be converted into binary form, for example converting specific types of disease into "diseased" and "non-diseased".

Encoding categorical data

Categorical data is statistical data consisting of variables whose values are divided into categories rather than numbers.

Machine learning models work purely on mathematics and numbers, so a categorical variable in the dataset can create problems while building the model. It is therefore necessary to encode these categorical variables as numbers.

With plain integer codes, however, the model may assume some ordering or correlation between the categories that does not exist, producing false results. To get rid of this problem, we use dummy encoding.

Dummy variable

Dummy variables take the value 0 or 1. The value 1 indicates the presence of a category in a particular column, and the remaining dummy columns become 0. With dummy encoding, we get as many columns as there are categories. For dummy encoding, we will use the OneHotEncoder class from the sklearn.preprocessing library.
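
As a minimal sketch of dummy encoding, assume a hypothetical 'country' column; both the scikit-learn OneHotEncoder and the pandas get_dummies helper produce the 0/1 columns described above:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column
df = pd.DataFrame({'country': ['India', 'France', 'India', 'Germany']})

# Option 1: scikit-learn OneHotEncoder (returns a sparse matrix by default)
encoder = OneHotEncoder()
dummies_sk = encoder.fit_transform(df[['country']]).toarray()

# Option 2: the pandas convenience function
dummies_pd = pd.get_dummies(df['country'])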

3.    DATA REDUCTION

Data mining typically works with large datasets, and it can be impractical or even impossible to perform data analysis and mining on a very large volume of data. Data reduction is the process of significantly reducing a larger volume of original data while preserving its integrity.

Many data reduction techniques produce a reduced version of the dataset that is substantially smaller in size yet yields the same analytical outcomes, which increases the effectiveness of the data mining procedure.

Data reduction can save energy, reduce physical storage costs, and lessen the load on data centers. The different stages of data reduction are:

Data Cube Aggregation:

Aggregation operations are applied to the data to construct a data cube. Several columns can be summarized so that only the columns that are actually needed remain. For example, if the dataset records different types of attacks on a machine and we only need the total number of attacks, we can keep an aggregate attack-count feature and discard the rest.
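
As a rough illustration of the attack-count idea, assuming a hypothetical log DataFrame with one row per detected attack, aggregation collapses the detailed rows into only the summary that is needed:

import pandas as pd

# Hypothetical attack log: one row per detected attack
log = pd.DataFrame({
    'machine': ['m1', 'm1', 'm2', 'm2', 'm2'],
    'attack_type': ['dos', 'probe', 'dos', 'dos', 'r2l'],
})

# Keep only the aggregate needed for the analysis: total attacks per machine
attack_counts = log.groupby('machine').size().rename('attack_count').reset_index()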

Dimensionality Reduction:

If we come across data that is only marginally significant, we keep just the attributes needed for our analysis. Dimensionality reduction lowers the amount of original data by removing such attributes from the dataset under examination: it decreases the data size by getting rid of unnecessary or outdated elements, using data encoding or compression mechanisms. The reduction may be lossless or lossy: if the original data can be recovered after rebuilding from the compressed data, the reduction is called lossless; otherwise it is called lossy.

The most effective methods for dimensionality reduction include the wavelet transform, attribute subset selection, and principal component analysis (PCA).
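
A brief PCA sketch with scikit-learn, assuming a hypothetical feature matrix X whose four correlated columns are reduced to two principal components (the data is standardized first, since PCA is variance-based):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples with 4 correlated features
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(100, 2))])

# Standardize, then project onto the first two principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # shape (100, 2)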

Numerosity Reduction:

Numerosity reduction decreases the original data volume and expresses it in a much smaller form. In this reduction strategy, smaller representations of the data, or mathematical models, are used in place of the actual data. There are two kinds of numerosity reduction: parametric and non-parametric.

When employing parametric approaches, the data is modelled in some way. The model is used to estimate the data, so only the model's parameters, rather than the actual data, need to be stored. Log-linear and regression techniques are typically used to build such models.

Histograms, clustering, sampling, and data cube aggregation are some of the non-parametric techniques employed for storing condensed representations of the data.
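
For example, simple random sampling (a non-parametric technique) can be sketched with pandas, assuming a hypothetical large DataFrame df:

import numpy as np
import pandas as pd

# Hypothetical large dataset
df = pd.DataFrame({'value': np.arange(100_000)})

# Simple random sample without replacement: keep 1% of the rows
sample = df.sample(frac=0.01, random_state=0)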

Data Compression

Data compression is the process of modifying, encoding, or transforming the structure of data so that it takes up less space. It creates a compact representation of information by eliminating redundancy and presenting the data in binary form. Compression from which the original data can be fully recovered is called lossless compression; when the original form cannot be recovered from the compressed form, it is called lossy compression. The numerosity and dimensionality reduction methods described above are also used for data compression. The most common encoding techniques are Huffman encoding and run-length encoding.
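
As a toy sketch of one of the encoding techniques named above, a simple run-length encoder compresses repeated symbols into (symbol, count) pairs:

def run_length_encode(text):
    """Compress a string into a list of (character, run length) pairs."""
    encoded = []
    for ch in text:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)
        else:
            encoded.append((ch, 1))
    return encoded

print(run_length_encode("aaabbbbcc"))  # [('a', 3), ('b', 4), ('c', 2)]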
