DATA PREPROCESSING
The objective of data preprocessing is to analyze, filter, transform, and encode data so that a machine learning algorithm can understand and work with the processed output. The phrase "garbage in, garbage out" applies very well to data mining and machine learning projects. Unclean data, such as missing attributes or attribute values, noise or outliers, and duplicate or wrong records, will degrade the quality of the ML results. It is therefore important to transform the raw data into a useful and efficient format before it is used in machine learning, in order to ensure or enhance performance.
Important Libraries for Data Preprocessing:
To do data preprocessing in Python, we need to import some predefined Python libraries, each of which handles specific tasks. There are three libraries that we will use for data preprocessing; a minimal import sketch follows the list.
· Numpy: The NumPy library provides many kinds of mathematical operations and is the basic package for scientific computing in Python. It also offers support for large multidimensional arrays and matrices.
· Matplotlib: The second library is Matplotlib, a Python 2D plotting library; along with it we import the pyplot sub-library. This library is used to plot any kind of chart in Python.
· Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library.
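A minimal sketch of these imports; the CSV file name in the comment is hypothetical and only for illustration.

# Typical imports for data preprocessing in Python.
import numpy as np                # mathematical operations, multidimensional arrays and matrices
import matplotlib.pyplot as plt   # 2D plotting (the pyplot sub-library)
import pandas as pd               # importing and managing datasets

# A dataset would normally be loaded with pandas, e.g. from a CSV file
# (the file name below is hypothetical):
# dataset = pd.read_csv("dataset.csv")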
1. DATA CLEANING
REMOVE MISSING VALUES
Missing values in the data are a problem because they can skew the results, depending on their type. If the data effectively comes from an unrepresentative sample, the findings may not generalize to situations outside of our study. There are multiple ways to handle missing values, and they are presented below.
· Deleting a particular row or column – In this method, a specific row that has a null value for a feature, or a specific column with more than 75% missing values, is removed. However, this method is not completely efficient, and it is recommended to use it only when the dataset contains a sufficient number of samples. You must also ensure that deleting the data does not introduce additional bias.
· Replacing missing values with zero – This technique may work for simple datasets in which zero can act as a base number indicating that the value is missing. In most cases, however, zero can represent a real value in itself. For instance, if a sensor generates temperature values and the dataset is from a tropical region, a zero would be read as a genuine (if implausible) temperature, so filling in missing values with 0 would mislead the model. Zero can be used as a replacement only when the dataset is independent of its effect. In phone bill data, for example, a missing value in the billed amount column can be replaced by zero because it may simply indicate that the user did not subscribe to the plan that month.
· Calculating the mean, median, or mode – This method is helpful for features with numerical data, such as age, salary, or year. The mean, median, or mode of a feature, column, or row that contains a missing value is calculated, and the result is used to fill in the gap. The technique introduces some approximation into the dataset, but it effectively offsets the loss of data and therefore produces better results than the first strategy (omitting rows and columns). It can be applied to any feature that contains numerical information, such as a year column or a home-team-goal column, by computing the feature's mean, median, or mode to fill in the missing data (a short imputation sketch follows this list).
· For a cricketer's skill, the average score (run rate) is a better choice than the median or mode. For attributes such as age or salary, the median is usually the best choice, because the mean tends to be pulled toward extreme values. The mode is helpful where many rows share the same value and is preferred for data with low variance; in particular, the mode is the usual choice for filling missing values in categorical data.
Another method of approximation is to use the values of nearby rows; this works best when the data behaves roughly linearly. It is an estimate and may introduce some variation into the dataset, but it avoids the data loss caused by removing rows and columns, and so it tends to produce better outcomes. The three statistics mentioned above can be substituted as a statistical way of handling missing values; note that if they are computed over the full dataset, including the test data, this can cause training-data leakage. In some cases, the mean, median, or mode of a limited set of nearby values, or of rows sharing the same key element, can be used instead. For a variable with longitudinal behaviour, it might make sense to carry the last valid observation forward to fill the missing value; this is known as the Last Observation Carried Forward (LOCF) method. This method isn't used in our preprocessing stages.
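A minimal imputation sketch using pandas and scikit-learn's SimpleImputer; the toy columns (age, salary, city) and their values are hypothetical.

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (np.nan).
df = pd.DataFrame({"age": [25, np.nan, 47, 31],
                   "salary": [40000, 52000, np.nan, 61000],
                   "city": ["Pune", "Delhi", np.nan, "Delhi"]})

# Mean imputation for the numeric columns (strategy="median" works the same way).
num_imputer = SimpleImputer(strategy="mean")
df[["age", "salary"]] = num_imputer.fit_transform(df[["age", "salary"]])

# Mode (most frequent) imputation for the categorical column.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])
print(df)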
REMOVE NOISY DATA
Noisy data are meaningless, unstructured, or faulty data that cannot be correctly read and interpreted by programs. They can be generated by faulty data collection, data entry errors, and so on. Noisy data can be handled in the following ways:
Binning Method:
This technique uses sorted data to smooth it out. The entire set of data is divided into equal-sized segments, and each segment is handled independently. To complete the operation, all the data in a segment can be replaced by its mean, or each value can be replaced by the nearest boundary value of its segment.
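A small sketch of smoothing by bin means and by bin boundaries, using a hypothetical sorted list of values split into three equal-sized bins.

import numpy as np

# Hypothetical sorted values to be smoothed.
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))

# Split the sorted data into equal-sized bins (here, 3 bins of 4 values each).
bins = np.array_split(values, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary.
by_bounds = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])
print(by_means)
print(by_bounds)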
Regression:
In this method, smoothing the data involves fitting it
to a regression function. There are two types of regression that can be used:
linear or multiple (having multiple independent variables).
Clustering:
This strategy groups related data into clusters. Values that do not fit into any cluster, or that lie far outside the clusters, can be treated as outliers.
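One possible way to apply this idea is sketched below, using scikit-learn's KMeans on hypothetical two-group data and flagging points that lie unusually far from every cluster centre; the threshold rule is only an illustrative assumption, not a fixed recipe.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2D data: two dense groups plus one far-away point (the outlier).
rng = np.random.RandomState(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(20, 2))
outlier = np.array([[5.0, 20.0]])
X = np.vstack([group_a, group_b, outlier])

# Cluster the data; points far from every cluster centre are outlier candidates.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_nearest_centre = kmeans.transform(X).min(axis=1)
threshold = dist_to_nearest_centre.mean() + 3 * dist_to_nearest_centre.std()
print(X[dist_to_nearest_centre > threshold])   # the far-away point should stand out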
2. DATA TRANSFORMATION
Before performing data mining, data transformation is a crucial preprocessing step that must be applied to the data in order to produce patterns that are simpler to comprehend. Data transformation converts the data into clean, usable data by altering its format, structure, or values.
Feature scaling
Feature scaling is a method or technique to normalize or standardize the independent variables of a data set within a particular range. Because the range of raw data values varies widely, the objective functions of machine learning algorithms that work on Euclidean distance will not perform well without feature scaling. There are two common techniques for feature scaling.
· Normalization: In this method, the data values are scaled within a specified range (-1.0 to 1.0 or 0.0 to 1.0). It is also called Min-Max scaling. Here, the formula is X' = (X - Xmin) / (Xmax - Xmin).
· Standardization: In this method, the data values are centered around the mean with unit standard deviation. This means that the mean of the attribute becomes 0 and the resulting distribution has a unit standard deviation. Here, the formula is X' = (X - Xmean) / standard deviation.
For feature scaling, we will use the StandardScaler class of the sklearn.preprocessing library.
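A minimal standardization sketch with StandardScaler; the age/salary matrix is hypothetical.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical feature matrix: age and salary on very different scales.
X = np.array([[25, 40000.0],
              [32, 52000.0],
              [47, 61000.0],
              [51, 75000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0))         # approximately [0, 0]
print(X_scaled.std(axis=0))          # approximately [1, 1]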
Attribute Selection:
In this strategy, new attributes are built from the
given set of attributes to help the data mining process.
Discretization:
This is done to replace the raw values of the
numeric attribute with interval or concept levels.
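A small discretization sketch using scikit-learn's KBinsDiscretizer, assuming a hypothetical age column split into three interval levels.

from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# Hypothetical ages discretized into three interval levels (e.g. young / middle / senior).
ages = np.array([[18], [22], [25], [34], [41], [52], [63], [70]])

discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_levels = discretizer.fit_transform(ages)
print(age_levels.ravel())        # a bin index (0, 1 or 2) replaces each raw age
print(discretizer.bin_edges_)    # the interval boundaries chosen for the column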
Concept Hierarchy Generation:
Here, attributes are transformed from a lower level to a higher level in the hierarchy. For example, "City" can be transformed into "Country".
Data rescaling
If data consists of attributes at different scales,
many ML algorithms can benefit from rescaling the attributes so they are all at
the same scale. This is useful for optimization algorithms used at the core of ML
algorithms such as gradient descent. It is also useful for algorithms that weight
inputs, such as regression and neural networks, and algorithms that use
distance measures, such as K-Nearest Neighbors.
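A brief rescaling sketch with scikit-learn's MinMaxScaler, using hypothetical attributes measured on very different scales.

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Hypothetical attributes on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0],
              [4.0, 1000.0]])

# Rescale every attribute to the common range [0, 1]: X' = (X - Xmin) / (Xmax - Xmin).
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(X))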
Binarizing Data
Data can be transformed using a binary threshold: all values above the threshold are marked with 1 and all values below it are marked with 0. This is called binarization or thresholding of the data. It is useful when crisp, binary values are needed, for example when engineering new features that indicate whether something meaningful has occurred.
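A minimal thresholding sketch with scikit-learn's Binarizer; the threshold of 0.5 and the sample values are only illustrative.

from sklearn.preprocessing import Binarizer
import numpy as np

# Hypothetical feature values; the threshold of 0.5 is only an illustration.
X = np.array([[0.1, 0.7],
              [0.4, 0.9],
              [0.8, 0.2]])

binarizer = Binarizer(threshold=0.5)
print(binarizer.transform(X))   # values above 0.5 become 1, the rest become 0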
Binary Classification
Here, a field can be converted into binary form; for example, the many types of disease can be converted to "diseased" and "non-diseased".
Encoding categorical data
Categorical data is statistical data consisting of variables whose values fall into categories rather than being expressed as numbers.
Since a machine learning model works purely on mathematics and numbers, a categorical variable in the dataset can create problems while building the model. It is therefore necessary to encode these categorical variables as numbers.
If the categories are simply replaced by arbitrary numbers, however, the machine learning model may assume that there is some order or correlation between them, which will produce false results. To get rid of this problem we use dummy encoding.
Dummy variable
Dummy variables are variables that take the value 0 or 1. The value 1 indicates the presence of a category in a particular column, and the remaining dummy variables become 0. With dummy encoding, we get as many columns as there are categories. For dummy encoding, we will use the OneHotEncoder class from the sklearn.preprocessing library.
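A minimal dummy-encoding sketch with OneHotEncoder, using a hypothetical country column (pandas get_dummies would work similarly).

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Hypothetical categorical column with three categories.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
dummies = encoder.fit_transform(countries).toarray()
print(encoder.categories_)   # one dummy column per category
print(dummies)               # a 1 marks the category present in each row, the rest are 0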
3. DATA REDUCTION
Data mining typically works with large datasets, and it can be impractical to analyze and mine a very large volume of data directly. Data reduction techniques lower the amount of data while aiming to preserve its integrity: data reduction is the process of taking a larger volume of original data and reducing it significantly while preserving the integrity of the original data.
Many data reduction techniques are utilized to provide
a reduced version of the dataset that is substantially smaller in size. The
effectiveness of the data mining procedure is increased through data reduction,
and the same analytical outcomes are obtained.
Data reduction can save energy, reduce physical storage costs, and decrease the load on the data center. The different stages of data reduction are:
Data Cube Aggregation:
Aggregation operations are applied to the data to build a data cube. Certain columns can be summarized so that only the columns that are actually needed remain. For example, if a dataset contains different types of attacks on a machine and we only need the total number of attacks, we can keep an attack-count feature and discard the others.
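A small aggregation sketch with pandas, summarizing a hypothetical per-event attack log into one attack count per machine.

import pandas as pd

# Hypothetical per-event attack log.
events = pd.DataFrame({
    "machine": ["m1", "m1", "m2", "m2", "m2"],
    "attack_type": ["dos", "probe", "dos", "dos", "r2l"],
})

# Keep only the total number of attacks per machine; the detailed attack_type column is discarded.
attack_counts = events.groupby("machine").size().rename("attack_count").reset_index()
print(attack_counts)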
Dimensionality Reduction:
When we come across data that is only marginally significant, we keep only the attributes needed for our analysis. Dimensionality reduction lowers the amount of original data by removing such attributes from the data set under examination, and it decreases the data size by getting rid of unnecessary or outdated elements using encoding mechanisms. The reduction may or may not involve a loss: if the original data can be recovered after rebuilding it from the compressed data, the reduction is called lossless; otherwise it is called lossy.
The most effective methods for dimensionality reduction include the wavelet transform, attribute subset selection, and PCA (Principal Component Analysis).
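A minimal PCA sketch with scikit-learn, reducing hypothetical 4-dimensional data to 2 principal components.

from sklearn.decomposition import PCA
import numpy as np

# Hypothetical 4-dimensional data reduced to 2 principal components.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component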
Numerosity Reduction:
The numerosity reduction decreases the original data
volume and expresses it in a much smaller format. In this reduction strategy,
smaller representations of the data or mathematical models are used in place of
the actual data. This method consists of two kinds of numerosity reduction:
parametric and non-parametric.
When employing parametric approaches, the data is modelled in some way. The model is used to estimate the data so that only the parameters of the model, instead of the actual data, need to be stored; regression and log-linear models are typically used here. Among non-parametric approaches, histograms, clustering, sampling, and data cube aggregation are some of the techniques employed for storing condensed representations of the data.
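As a simple non-parametric illustration, the sketch below reduces a hypothetical table to a 10% random sample with pandas.

import numpy as np
import pandas as pd

# Hypothetical large table reduced to a 10% random sample (non-parametric numerosity reduction).
rng = np.random.RandomState(0)
big = pd.DataFrame({"value": rng.rand(10_000)})

sample = big.sample(frac=0.10, random_state=0)
print(len(big), "->", len(sample))   # 10000 -> 1000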
Data Compression
Data compression is the process of modifying, encoding, or transforming the structure of data so that it takes up less space. By eliminating redundancy and presenting data in a compact binary form, data compression creates a concise representation of the information. Lossless compression describes data that can be fully recovered from its compressed state; lossy compression is the opposite, where the original form cannot be recovered exactly from the compressed form. The numerosity and dimensionality reduction methods above can also be viewed as forms of data compression. The most common encoding techniques are Huffman encoding and run-length encoding.
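A small, self-contained sketch of run-length encoding (and its lossless decoding) on a hypothetical string.

def run_length_encode(text):
    """Represent repeated characters compactly as (character, run length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def run_length_decode(runs):
    """Recover the original string exactly, so this compression is lossless."""
    return "".join(ch * count for ch, count in runs)

encoded = run_length_encode("AAAABBBCCD")
print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(run_length_decode(encoded))   # AAAABBBCCD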