Datamining & Warehousing
Dr.VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Unit 2 - Preprocessing
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update
 Believability: how much the data are trusted to be correct
 Interpretability: how easily the data can be understood
 Example: analyzing a company’s data on branch sales.
 On inspecting the company’s database and data warehouse, users of the
database system have reported errors, unusual values, and
inconsistencies in the data recorded for some transactions.
 In other words, the data to be analyzed by data mining techniques are:
 incomplete (lacking attribute values or certain attributes of interest,
or containing only aggregate data);
 inaccurate or noisy (containing errors, or values that deviate from
the expected);
 inconsistent (e.g., containing discrepancies in the department codes
used to categorize items)
 Reasons for inaccurate data (i.e., having incorrect attribute values):
 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data
entry.
 Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information (e.g.,
by choosing the default value “January 1” displayed for birthday).
This is known as disguised missing data.
 There may be technology limitations: limited buffer size for
coordinating synchronized data transfer and consumption.
 Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields
(e.g., date). Duplicate tuples also require data cleaning.
 Reasons for incomplete data:
 Attributes of interest may not always be available, such as customer
information for sales transaction data.
 Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.
 The recording of the data history or modifications may have been
overlooked. Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.
 Timeliness also affects data quality: month-end data that are not updated
in a timely fashion have a negative impact on data quality.
 Two other factors affecting data quality are believability and interpretability.
 Believability reflects how much the data are trusted by users.
 Interpretability reflects how easily the data are understood.
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
 Dirty data can confuse the mining procedure, resulting in
unreliable output.
 Data integration
 Integration of multiple databases, data cubes, or files.
 Attributes representing a given concept may have different
names in different databases, causing inconsistencies and
redundancies.
 E.g., customer_id in one data store and cust_id in another.
 A large amount of redundant data may slow down or confuse
the knowledge discovery process.
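The naming mismatch above can be illustrated with a minimal sketch. It assumes each data store is a list of dictionaries; the mapping `CANONICAL_NAMES`, the helper `integrate`, and the sample records are hypothetical and only serve the illustration.

```python
# Hypothetical mapping from source-specific attribute names to one canonical name.
CANONICAL_NAMES = {"customer_id": "customer_id", "cust_id": "customer_id"}


def integrate(*sources):
    """Merge records from several sources, renaming attributes and dropping duplicates."""
    merged, seen = [], set()
    for records in sources:
        for record in records:
            # Rename each attribute to its canonical name (unknown names pass through).
            unified = {CANONICAL_NAMES.get(k, k): v for k, v in record.items()}
            key = tuple(sorted(unified.items()))
            if key not in seen:  # skip exact duplicate tuples
                seen.add(key)
                merged.append(unified)
    return merged


store_a = [{"customer_id": 1, "city": "Coimbatore"}]
store_b = [{"cust_id": 1, "city": "Coimbatore"}, {"cust_id": 2, "city": "Chennai"}]
customers = integrate(store_a, store_b)
```

After renaming, the first record of `store_b` is recognized as a duplicate of the record in `store_a`, so only two customers remain.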
 Data reduction obtains a reduced representation of the data set that is
smaller in volume, yet produces the same (or almost the same) analytical results.
 Dimensionality reduction: data encoding schemes are applied to obtain a
reduced or “compressed” representation of the original data. E.g., attribute
subset selection (removing irrelevant attributes) and attribute
construction (where a small set of more useful attributes is derived
from the original set).
 Numerosity reduction: the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).
 Data compression
 Data transformation and data discretization
 Normalization and concept hierarchy generation are powerful tools that
allow data mining at multiple abstraction levels.
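Two of the nonparametric numerosity reduction methods named above, sampling and histograms, can be sketched in a few lines. This is only an illustration under assumptions: the `prices` list is made-up data, and the function names and the bucket width of 10 are arbitrary choices.

```python
import random
from collections import Counter


def simple_random_sample(values, n, seed=0):
    """Numerosity reduction by sampling n values without replacement."""
    return random.Random(seed).sample(values, n)


def equal_width_histogram(values, width):
    """Replace raw values by counts per bucket; each key is the bucket's lower edge."""
    return dict(sorted(Counter((v // width) * width for v in values).items()))


# Made-up data for illustration only.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 25, 28, 30]
sample = simple_random_sample(prices, 5)
histogram = equal_width_histogram(prices, 10)
```

The histogram stores four bucket counts instead of nineteen values, which is the sense in which the representation is "smaller".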
Data Cleaning - Introduction
 Data in the real world is dirty: lots of potentially incorrect
data, e.g., due to faulty instruments, human or computer error,
or transmission error
[Figure: the data cleaning process]
 Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
 incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
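The examples above can be turned into simple record checks. The following is a minimal sketch, assuming a record is a dictionary with hypothetical keys `occupation`, `salary`, `age`, and `birthday`; the date 03/07/2010 is read as 3 July 2010, and the reference date is an arbitrary assumption.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: attribute value missing or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation missing")
    # Noisy: value outside the plausible domain.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None:
        # Inconsistent: stated age disagrees with the age derived from the birthday.
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if age is not None and age != derived_age:
            problems.append("inconsistent: age does not match birthday")
        # Possibly disguised missing data: a default date such as January 1.
        if (birthday.month, birthday.day) == (1, 1):
            problems.append("suspect: birthday is January 1 (possible default)")
    return problems


record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 15))
```

A January 1 birthday is only flagged as suspect, not as an error, because some people really are born on that day.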
Data Cleaning – Missing Values
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data that were inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data not being considered important at the
time of entry
 failure to register the history or changes of the data
 Missing data may need to be inferred
 Ignore the tuple:
 Usually done when the class label is missing (assuming the
mining task involves classification).
 Not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies
considerably.
 Fill in the missing value manually:
 Time consuming and may not be feasible given a large data
set with many missing values.
 Use a global constant to fill in the missing value:
 Replace all missing attribute values by the same constant,
such as a label like “Unknown” or −∞.
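Two of these strategies, ignoring the tuple and filling with a global constant, might look as follows. This is a sketch under assumptions: rows are dictionaries, a missing value is represented by `None`, and the attribute names and sample rows are made up.

```python
MISSING = None  # this sketch marks a missing value with None


def drop_unlabeled(rows, label="class"):
    """Ignore tuples whose class label is missing."""
    return [row for row in rows if row.get(label) is not MISSING]


def fill_constant(rows, attribute, constant="Unknown"):
    """Replace every missing value of one attribute by the same global constant."""
    return [
        {**row, attribute: constant if row.get(attribute) is MISSING else row[attribute]}
        for row in rows
    ]


# Made-up rows for illustration only.
rows = [
    {"income": 52000, "occupation": None, "class": "yes"},
    {"income": None, "occupation": "clerk", "class": None},
    {"income": 31000, "occupation": "nurse", "class": "no"},
]
labeled = drop_unlabeled(rows)
filled = fill_constant(labeled, "occupation")
```

Both helpers return new lists and leave the input rows unchanged, so the original data stay available for comparison.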
 Use a measure of central tendency for the attribute (e.g.,
the mean or median) to fill in the missing value:
 For normal (symmetric) data distributions, the mean can be
used, while skewed data distributions should use the
median.
 Use the attribute mean or median for all samples
belonging to the same class as the given tuple:
 If the data distribution for a given class is skewed, the
median value is a better choice
 Use the most probable value to fill in the missing value:
 Determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction
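Filling with a measure of central tendency, either over the whole attribute or per class, can be sketched as below. The function `impute`, its parameters, and the sample rows are assumptions made for illustration; the most probable value strategy (regression, Bayesian inference, decision trees) is not covered here.

```python
from statistics import mean, median


def impute(rows, attribute, statistic=mean, by=None):
    """Fill missing (None) values of `attribute` with a central-tendency statistic.

    If `by` is given, the statistic is computed per group (e.g., per class label).
    """
    groups = {}
    for row in rows:
        if row[attribute] is not None:
            groups.setdefault(row[by] if by else None, []).append(row[attribute])
    fill = {key: statistic(values) for key, values in groups.items()}
    return [
        {**row, attribute: fill[row[by] if by else None]}
        if row[attribute] is None
        else row
        for row in rows
    ]


# Made-up rows; the "high" group is skewed by the value 90.
rows = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": 22, "risk": "high"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "high"},
]
overall = impute(rows, "income", statistic=median)
per_class = impute(rows, "income", statistic=median, by="risk")
```

In the skewed "high" group the median (22) stays close to the typical values, whereas the mean (44) would be pulled up by the single large income.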
Data Cleaning - Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
 Three methods to smooth noisy data:
 Binning
 Regression
 Outlier analysis
How to Handle Noisy Data?
 Binning (also used as a discretization technique)
 First sort the data and partition them into (equal-frequency)
bins.
 Then smooth by bin means, bin medians, or bin boundaries.
 In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8,
and 15 in Bin 1 is 9, so each original value in this bin is
replaced by the value 9.
 Similarly, in smoothing by bin medians, each bin value is replaced
by the bin median.
 In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries, and each bin value
is replaced by the closest boundary value.
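The three smoothing variants can be sketched as follows. Only the first bin (4, 8, 15) comes from the slide; the remaining six values and the bin size of 3 are assumed for illustration, and the function names are hypothetical.

```python
from statistics import mean, median


def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace each value by the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace each value by the median of its bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the nearer of the bin's min and max (ties go to the min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


# First three values are from the slide; the rest are assumed.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # Bin 1 becomes 9, 9, 9
by_boundaries = smooth_by_boundaries(bins)  # Bin 1 becomes 4, 4, 15
```

Larger bins smooth more aggressively, because more values are pulled toward the same representative.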
 Regression
 Data smoothing can also be done by regression, a technique
that conforms data values to a function.
 Linear regression involves finding the “best” line
to fit two attributes (or variables) so that one
attribute can be used to predict the other.
 Multiple linear regression is an extension in which more
than two attributes are involved and the data are fit to a
multidimensional surface.
 Outlier analysis
 Outliers may be detected by clustering, where similar values
are organized into groups, or “clusters.”
 Intuitively, values that fall outside of the set of
clusters may be considered outliers.
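Both ideas can be sketched with the standard library alone. The line fit below is ordinary least squares for two attributes. The clustering step is a deliberately simple stand-in, not a full clustering algorithm: it groups sorted one-dimensional values that lie within an assumed `max_gap` of each other and treats values in clusters smaller than `min_size` as outliers. All data and parameter values are made up.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept for two attributes."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def outliers_by_gap_clustering(values, max_gap, min_size=2):
    """Group sorted 1-D values into clusters; values in tiny clusters are outliers."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)  # close to the previous value: same cluster
        else:
            clusters.append([v])  # large gap: start a new cluster
    return [v for c in clusters if len(c) < min_size for v in c]


# Made-up data for illustration only.
xs, ys = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # noisy values replaced by the fitted line

outliers = outliers_by_gap_clustering([10, 11, 12, 50, 30, 31, 32], max_gap=3)
```

Here the value 50 forms a cluster of its own and is therefore reported as an outlier, while the fitted line replaces each noisy `ys` value with its prediction.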
Data Cleaning as a Process
 Data discrepancy detection
 The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors:
 poorly designed data entry forms that have many optional fields
 human error in data entry
 deliberate errors (e.g., users who do not want to reveal information about themselves)
 data decay (e.g., outdated addresses)
 inconsistent data representations and inconsistent use of codes
 errors in instrumentation devices
 data that are (inadequately) used for purposes other than
originally intended
 inconsistencies due to data integration
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check the unique rule, consecutive rule, and null rule
- A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
- A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique.
- A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition (e.g.,
where a value for a given attribute is not available), and how such
values should be handled.
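The three rules lend themselves to small checks. The sketch below assumes integer-valued attributes for the consecutive rule and a hypothetical set of null markers; a real null rule would be defined per attribute in the metadata.

```python
NULL_MARKERS = {"", "?", "N/A", "null"}  # hypothetical strings treated as null


def violates_unique(values):
    """Unique rule: every value must differ from all others."""
    return len(set(values)) != len(values)


def violates_consecutive(values):
    """Consecutive rule: unique values with no gaps between lowest and highest."""
    return violates_unique(values) or sorted(values) != list(
        range(min(values), max(values) + 1)
    )


def normalize_nulls(values):
    """Null rule: map every agreed null marker to None so it is handled uniformly."""
    return [
        None if v is None or str(v).strip() in NULL_MARKERS else v for v in values
    ]


# Made-up check numbers: unique, but 103 is missing.
check_numbers = [101, 102, 104, 105]
unique_broken = violates_unique(check_numbers)
consecutive_broken = violates_consecutive(check_numbers)
cleaned = normalize_nulls(["A12", "?", " ", "B07", None])
```

The check numbers satisfy the unique rule but break the consecutive rule, which is the kind of discrepancy these rules are meant to surface.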
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal
codes, spell-checking) to detect errors and make corrections
 Data auditing: analyze the data to discover rules and
relationships, and detect data that violate them (e.g., using
correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
 Integration of the two processes
 Iterating discrepancy detection and data transformation as two
separate steps is error-prone and time consuming
 New approaches are iterative and interactive (e.g., Potter’s Wheel,
a publicly available data cleaning tool)
 Development of declarative languages for specifying data
transformation operators
