Major Tasks in Data Preprocessing - Data cleaning

Data Mining & Warehousing
Dr. VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Unit 2 - Preprocessing

Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: whether the data are updated on time
 Believability: how much users trust the data to be correct
 Interpretability: how easily the data can be understood

 Example: analyzing a company’s data on branch sales.
 On inspecting the company’s database and data warehouse, users of
the database system have reported errors, unusual values, and
inconsistencies in the data recorded for some transactions.
 The data to be analyzed by data mining techniques are
 incomplete (lacking attribute values or certain attributes of interest,
or containing only aggregate data);
 inaccurate or noisy (containing errors, or values that deviate from
the expected);
 inconsistent (e.g., containing discrepancies in the department codes
used to categorize items).

 Reasons for inaccurate data (i.e., having incorrect attribute values):
 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data
entry.
 Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information (e.g.,
by choosing the default value “January 1” displayed for birthday).
This is known as disguised missing data.
 There may be technology limitations, such as a limited buffer size
for coordinating synchronized data transfer and consumption.
 Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields
(e.g., date). Duplicate tuples also require data cleaning.

 Reasons for incomplete data:
 Attributes of interest may not always be available, such as customer
information for sales transaction data.
 Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.
 The recording of the data history or modifications may have been
overlooked. Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.
 Timeliness also affects data quality. When month-end data are not updated
in a timely fashion, data quality suffers.
 Two other factors affecting data quality are believability and interpretability.
 Believability reflects how much the data are trusted by users.
 Interpretability reflects how easily the data are understood.

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
 Dirty data can cause confusion for the mining procedure,
resulting in unreliable output.
 Data integration
 Integration of multiple databases, data cubes, or files.
 Attributes representing a given concept may have
different names in different databases, causing
inconsistencies and redundancies.
 E.g., customer id in one data store and cust id in another.
 A large amount of redundant data may slow down or confuse
the knowledge discovery process.

 Data reduction obtains a reduced representation of the data set that is
much smaller in volume, yet produces the same (or almost the same)
analytical results.
 Dimensionality reduction: data encoding schemes are applied to obtain a
reduced or “compressed” representation of the original data. Examples
include attribute subset selection (e.g., removing irrelevant attributes)
and attribute construction (e.g., where a small set of more useful
attributes is derived from the original set).
 Numerosity reduction: the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).
 Data compression
 Data transformation and data discretization
 Normalization and concept hierarchy generation are powerful tools for
data mining; concept hierarchies allow mining at multiple abstraction levels.

Data Cleaning - Introduction
 Data in the real world is dirty: lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission error
 Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
 incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
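
The kinds of dirty data listed on this slide can be flagged with a few simple checks. The following is only a sketch: the record layout, the `find_issues` helper, the MM/DD/YYYY date format, and the reference date are assumptions made for illustration, not part of the slides.

```python
from datetime import date, datetime

def find_issues(record, as_of):
    """Return a list of data-quality issues for one record (hypothetical layout)."""
    issues = []
    # Incomplete: empty or whitespace-only attribute value
    if not record.get("occupation", "").strip():
        issues.append("incomplete: occupation missing")
    # Noisy: a salary cannot be negative
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    # Inconsistent: the stated age must match the age implied by the birthday
    born = datetime.strptime(record["birthday"], "%m/%d/%Y").date()  # assumed format
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    implied_age = as_of.year - born.year - (0 if had_birthday else 1)
    if implied_age != record["age"]:
        issues.append("inconsistent: age does not match birthday")
    # Possibly disguised missing data: a default-looking birthday
    if (born.month, born.day) == (1, 1):
        issues.append("suspicious: birthday is Jan. 1")
    return issues

# Values taken from the slide's examples; the reference date is arbitrary.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/2010"}
for issue in find_issues(record, as_of=date(2024, 1, 1)):
    print(issue)
```

A Jan. 1 birthday is only a hint, since some people really are born on that day; it becomes suspicious when it appears far more often than expected.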

Data Cleaning – Missing Values
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data that were inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data not being considered important at the
time of entry
 failure to record the history or changes of the data
 Missing data may need to be inferred

 Ignore the tuple:
 Usually done when the class label is missing.
 Not very effective unless the tuple contains several
attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies
considerably.
 Fill in the missing value manually:
 Time consuming and may not be feasible given a large data
set with many missing values.
 Use a global constant to fill in the missing value:
 Replace all missing attribute values by the same constant
such as a label like “Unknown” or −∞

 Use a measure of central tendency for the attribute (e.g.,
the mean or median) to fill in the missing value:
 For normal (symmetric) data distributions, the mean can be
used, while skewed data distributions should employ the
median
 Use the attribute mean or median for all samples
belonging to the same class as the given tuple:
 If the data distribution for a given class is skewed, the
median value is a better choice
 Use the most probable value to fill in the missing value:
 Determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction
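
Three of the fill-in strategies above (global constant, attribute mean or median, and per-class mean or median) can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the `income`/`risk` sample data are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median

def fill_constant(values, constant="Unknown"):
    """Replace every missing value with the same global constant."""
    return [constant if v is None else v for v in values]

def fill_central(values, use_median=False):
    """Replace missing values with the attribute mean (or median for skewed data)."""
    known = [v for v in values if v is not None]
    center = median(known) if use_median else mean(known)
    return [center if v is None else v for v in values]

def fill_by_class(values, labels, use_median=False):
    """Replace missing values with the mean/median of tuples in the same class.

    Assumes every class has at least one known value.
    """
    groups = {}
    for v, label in zip(values, labels):
        if v is not None:
            groups.setdefault(label, []).append(v)
    stat = median if use_median else mean
    centers = {label: stat(vs) for label, vs in groups.items()}
    return [centers[label] if v is None else v for v, label in zip(values, labels)]

# Hypothetical sample: income with two missing entries, and a class label per tuple.
income = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

print(fill_constant(income))
print(fill_central(income))
print(fill_by_class(income, risk))
```

The per-class variant usually preserves more structure than the global mean, because each filled value reflects the other tuples of the same class. The "most probable value" strategy needs a predictive model (regression, Bayesian inference, or a decision tree) and is not shown here.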

Data Cleaning - Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
 Three methods to smooth noisy data:
 Binning
 Regression
 Outlier analysis

How to Handle Noisy Data?
 Binning (also used as a discretization technique)
 First sort the data and partition them into (equal-frequency)
bins.
 Then smooth by bin means, bin medians, or bin boundaries.
 In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8, and 15
in Bin 1 is 9, so each original value in this bin is replaced by the
value 9.
 In smoothing by bin medians, each bin value is replaced by
the bin median.
 In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
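
The three smoothing variants can be sketched as follows. The first bin (4, 8, 15) matches the slide's example; the remaining sample values, the function names, and the tie-breaking choice (a value equally close to both boundaries goes to the lower one) are assumptions made for illustration.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the data and split it into bins holding bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative sample data
bins = equal_frequency_bins(prices, bin_size=3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

Larger bins smooth more aggressively, since more distinct values collapse onto the same mean, median, or boundary.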

How to Handle Noisy Data?
 Regression
 Data smoothing can also be done by regression, a technique that
conforms data values to a function.
 Linear regression involves finding the “best” line
to fit two attributes (or variables) so that one
attribute can be used to predict the other.
 Multiple linear regression is an extension in which more than two
attributes are involved and the data are fit to a
multidimensional surface.
 Outlier analysis
 Outliers may be detected by clustering, in which similar values
are organized into groups, or “clusters.”
 Intuitively, values that fall outside of the set of
clusters may be considered outliers.
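
Both ideas can be illustrated with a short standard-library sketch. The regression part is ordinary least squares on two attributes. The clustering part is a deliberately simple one-dimensional, gap-based grouping chosen for brevity; it is not a specific algorithm from the slides, and the `max_gap` and `min_size` parameters are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

def cluster_outliers(values, max_gap, min_size=2):
    """Group sorted values into clusters separated by gaps larger than max_gap.

    Values in clusters with fewer than min_size members are reported as outliers.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # large gap: start a new cluster
    return [v for c in clusters if len(c) < min_size for v in c]

# Hypothetical noisy measurements that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 2.9, 5.1, 6.8, 9.0]
print(smooth_by_regression(xs, ys))

# Hypothetical values forming two clusters plus one isolated point.
print(cluster_outliers([10, 11, 12, 50, 51, 52, 95], max_gap=5))
```

In practice a general-purpose clustering method would replace the gap rule, but the principle is the same: points that do not belong to any sizeable cluster are candidates for inspection.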

Data Cleaning as a Process
 Data discrepancy detection
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors:
 poorly designed data entry forms that have many optional fields
 human error in data entry
 deliberate errors: respondents do not want to reveal information
about themselves
 data decay, e.g., outdated addresses
 inconsistent data representations and inconsistent use of codes
 errors in instrumentation devices
 data being (inadequately) used for purposes other than
originally intended
 inconsistencies due to data integration

 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check the uniqueness rule, consecutive rule, and null rule
 A uniqueness rule says that each value of the given attribute must be
different from all other values for that attribute.
 A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique.
 A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition (e.g.,
where a value for a given attribute is not available), and how such
values should be handled.
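
The three rules translate directly into small checks. The sketch below is illustrative: the function names are hypothetical, the consecutive rule is assumed to apply to integer-valued attributes, and the set of null markers is an assumption that a real null rule would define per data source.

```python
def duplicate_values(values):
    """Uniqueness rule: return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def consecutive_violations(values):
    """Consecutive rule: return (missing values in the range, duplicated values)."""
    present = set(values)
    missing = [v for v in range(min(values), max(values) + 1) if v not in present]
    return missing, duplicate_values(values)

# Assumed markers for the null condition.
NULL_MARKERS = {"", "?", "N/A", "null"}

def apply_null_rule(values, markers=NULL_MARKERS):
    """Null rule: map every string that indicates 'no value' to None."""
    return [
        None if isinstance(v, str) and v.strip() in markers else v
        for v in values
    ]

invoice_ids = [101, 102, 102, 104]  # hypothetical attribute values
print(duplicate_values(invoice_ids))
print(consecutive_violations(invoice_ids))
print(apply_null_rule(["Chennai", "?", " ", "N/A"]))
```

Normalizing null markers first is usually worthwhile, because otherwise strings such as "?" would be counted as ordinary values by the other two checks.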

 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
 Data auditing: analyze the data to discover rules and
relationships, and detect data that violate them (e.g., correlation
and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface.
 Integration of the two processes
 Performing discrepancy detection and data transformation as
separate steps is error-prone and time consuming
 New approaches are iterative and interactive
(e.g., Potter’s Wheel, a publicly available data cleaning tool)
 Development of declarative languages
