Because of their typically huge size (often several gigabytes or more) and their likely origin in multiple, heterogeneous sources, today's real-world databases are highly susceptible to noisy, incomplete, and inconsistent data.
Why Is Data Preprocessing Important?
Data have quality when they satisfy the requirements of their intended use. Such requirements include accuracy, completeness, validity, timeliness, credibility, and interpretability.
Major Data Preprocessing Tasks and Techniques
A useful preprocessing step, therefore, is to run the data through data cleaning routines, which fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data reduction obtains a reduced representation of the data set that is much smaller in size yet produces the same (or nearly the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.
In dimensionality reduction, data encoding schemes are applied to obtain a simplified or “compressed” representation of the original data. Examples include data compression techniques (e.g., principal components analysis and wavelet transforms), attribute subset selection (e.g., removing irrelevant or obsolete attributes), and attribute construction (e.g., deriving a small set of more useful attributes from the original set). In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation). Data transformation includes normalization, data discretization, and concept hierarchy generation. Discretization and concept hierarchy generation are powerful tools for data analysis, enabling analysis at multiple levels of abstraction.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. There are several methods for filling in missing values.
- Ignore the tuple.
- Fill in the missing value manually.
- Use a global constant to fill in the missing value.
- Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
- Use the attribute mean or median for all samples belonging to the same class as the given tuple.
- Use the most probable value to fill in the missing value.
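As a sketch, the central-tendency strategies above can be implemented with Python's standard library; the attribute values below are invented for illustration:

```python
from statistics import mean, median

# Hypothetical "income" attribute, with missing entries marked as None.
records = [52000, None, 48000, 61000, None, 55000, 47000]

observed = [v for v in records if v is not None]

# Measures of central tendency that can serve as fill values.
fill_mean = mean(observed)      # sensitive to outliers
fill_median = median(observed)  # more robust for skewed distributions

# Fill every missing value with the chosen measure (here, the mean).
filled = [v if v is not None else fill_mean for v in records]
```

For class-aware filling, the same computation would simply be restricted to the tuples belonging to the same class as the tuple with the missing value.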
There are several data smoothing methods for noisy data. Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it: the sorted values are distributed into a number of “buckets,” or bins. Regression: linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression involving more than two attributes, where the data are fit to a multidimensional surface. Outlier analysis: outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside the set of clusters may be considered outliers.
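A minimal sketch of smoothing by bin means with equal-frequency binning (the sample values are illustrative):

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning: sort the values, split them into n_bins
    buckets of (roughly) equal size, then replace every value in a bin
    with that bin's mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # The last bin absorbs any leftover values.
        bin_vals = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[i * size:]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

# Nine sorted values partitioned into three bins of three:
result = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
# Bin means: (4+8+15)/3 = 9, (21+21+24)/3 = 22, (25+28+34)/3 = 29
```

Replacing `bin_mean` with the bin's median would give smoothing by bin medians instead.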
So far we have looked at methods for handling missing data and for smoothing data. Missing values, noise, and inconsistencies all contribute to inaccurate data. Discrepancies can be caused by multiple factors, including poorly designed data entry forms with many optional fields, human error in data entry, deliberate errors (e.g., respondents who do not want to disclose information about themselves), and data decay (e.g., outdated addresses). Other sources of inconsistency include errors in recording instruments and failures in the software that records the data. So, how do we detect discrepancies? As a starting point, use any knowledge you may already have about the properties of the data. For example, what are the data type and domain of each attribute? What are the acceptable values for each attribute? Basic statistical descriptions of the data are useful here for grasping trends in the data and recognizing anomalies. Data auditing tools, which are variants of data analysis tools, find discrepancies by analyzing the data to discover rules and relationships, and by identifying data that violate such conditions.
Redundancy is another important issue in data integration. An attribute (such as annual revenue) may be redundant if it can be “derived” from another attribute or set of attributes. Some redundancies can be detected by correlation analysis: given two attributes, such analysis measures how strongly one attribute implies the other, based on the available data. There are a few correlation analysis methods.
- χ2 Correlation Test for Nominal Data
- Correlation Coefficient for Numeric Data
- Covariance of Numeric Data
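For numeric data, the correlation coefficient can be computed directly. A minimal sketch (the revenue attributes and their values are hypothetical):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient for two numeric attributes.
    A value near +1 or -1 suggests one attribute strongly implies the
    other (a redundancy candidate); near 0 suggests no linear relation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# One attribute derivable from the other, so they are perfectly correlated:
monthly = [10, 20, 30, 40]
annual = [m * 12 for m in monthly]
r = pearson_r(monthly, annual)  # close to 1.0: annual is redundant
```

The χ² test plays the analogous role for nominal attributes, comparing observed and expected co-occurrence counts.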
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient while producing the same (or nearly the same) analytical results. Data reduction strategies include principal components analysis, attribute subset selection, parametric data reduction (regression and log-linear models), histograms, clustering, sampling, and data cube aggregation.
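Sampling, one of the numerosity-reduction strategies listed above, can be sketched with the standard library; the population below is a stand-in for a large relation:

```python
import random

def srs_without_replacement(data, s):
    """Simple random sample of s tuples drawn without replacement.
    The cost of drawing the sample is proportional to s, not to the
    size of the full data set, which is what makes sampling attractive
    for data reduction."""
    return random.sample(data, s)

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1000))            # stand-in for 1,000 tuples
reduced = srs_without_replacement(population, 50)
```

Analysis run on `reduced` approximates analysis on `population` at a fraction of the cost; stratified sampling would be the variant to reach for when the data are skewed.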
Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for analysis. Data transformation methods include the following:
- Attribute construction
- Normalization
- Discretization
- Concept hierarchy generation for nominal data
Normalizing the input values of each attribute measured in the training tuples can help speed up the learning phase of classification methods such as the neural network backpropagation algorithm. Normalization attempts to give all attributes an equal weight. Methods for data normalization include min-max normalization, z-score normalization, and normalization by decimal scaling.
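The three normalization methods can be sketched as follows; the salary figures in the worked example are assumed for illustration only:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear map of [vmin, vmax] onto [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Decimal scaling: divide by 10^j, where j is the smallest integer
    such that every scaled absolute value is below 1."""
    return v / 10 ** j

# Assumed income attribute ranging from 12,000 to 98,000;
# mapping the value 73,600 onto [0, 1]:
scaled = min_max(73600, 12000, 98000)  # about 0.716
```

Min-max normalization preserves the relationships among the original values but is vulnerable to out-of-range future inputs; z-score normalization is the usual choice when the minimum and maximum are unknown or when outliers dominate.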
Binning is also used as a discretization technique for data reduction. For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin's values with the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.