Submitted by,
M. Kavitha M.Sc.,
Nadar Saraswathi College of
Art & Science, Theni.
Data Mining
Data Integration and
Transformation
Data Integration
* Data integration combines data from several disparate
sources, which are stored using various technologies, and
provides a unified view of the data.
* The unified store that results is often called a data warehouse.
* It merges the data from multiple data stores (data
sources).
* The sources may include multiple databases, data cubes or
flat files.
* Metadata, correlation analysis, data conflict detection
and resolution of semantic heterogeneity contribute towards
smooth data integration.
Advantages :
1. Independence.
2. Faster query processing.
3. Complex query processing.
4. Advanced data summarization & storage possible.
5. High volume data processing.
Disadvantages :
1. Latency (since data needs to be loaded using ETL).
2. Costlier (data localization, infrastructure, security).
There are a number of issues to consider during data integration.
1. Schema Integration.
2. Redundancy.
3. Detection and resolution of data value conflicts.
Schema integration :
Matching equivalent real-world entities from multiple
sources is referred to as the entity identification problem.
For example,
How can the data analyst or the computer be sure that
customer_id in one database and cust_number in another refer
to the same entity? Databases and data warehouses typically
have metadata, that is, data about the data, which can help
avoid errors in schema integration.
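The entity identification step above can be sketched as a metadata lookup. This is only an illustration: the source names `sales_db` and `billing_db`, the `COLUMN_MAP` table and the helper `to_unified` are hypothetical, and real schema matching usually needs more evidence than a name mapping.

```python
# Hypothetical metadata: maps each source's column name to a shared name.
COLUMN_MAP = {
    "sales_db": {"customer_id": "customer_id"},
    "billing_db": {"cust_number": "customer_id"},
}


def to_unified(source, record):
    """Rename a record's keys using the metadata recorded for its source."""
    mapping = COLUMN_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}


# Two records that describe the same customer under different column names.
sales_row = to_unified("sales_db", {"customer_id": 17, "amount": 250.0})
billing_row = to_unified("billing_db", {"cust_number": 17, "balance": 40.0})

# After renaming, the records can be compared on the shared key.
same_entity = sales_row["customer_id"] == billing_row["customer_id"]
```

Keeping the mapping in a separate table, rather than in the code, mirrors the role of metadata described above.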
Redundancy :
* Redundancy is another important issue.
* An attribute, such as annual revenue, may be redundant
if it can be “derived” from another table.
* Some redundancies can be detected by correlation
analysis.
For example,
Given two attributes, such analysis can measure how
strongly one attribute implies the other, based on the
available data.
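The correlation between attributes A and B can be measured by
the correlation coefficient
$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B} $$
where n is the number of tuples, a_i and b_i are the values of
A and B in tuple i, \bar{A} and \bar{B} are the mean values of
A and B, and \sigma_A and \sigma_B are their (population)
standard deviations. A value close to +1 or -1 suggests that
one of the two attributes may be redundant.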
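A minimal sketch of correlation analysis between two attributes, assuming the Pearson correlation coefficient with population standard deviations. The function name `correlation` and the sample values are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n).
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return covariance / (std_a * std_b)


# Hypothetical data: annual revenue derived directly from monthly revenue.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = correlation(monthly_revenue, annual_revenue)
```

Here `r` comes out at essentially 1, which flags `annual_revenue` as a candidate redundant attribute. The function does not handle a constant attribute, whose standard deviation is zero.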
Detection and resolution of data value conflicts :
* A third important issue in data integration is the
detection and resolution of data value conflicts.
* For the same real-world entity, attribute values from
different sources may differ. This may be due to differences
in representation, scaling, or encoding.
* An attribute in one system may be recorded at a
lower level of abstraction than the “same” attribute in another.
* For example, the total sales in one database may
refer to one branch of All Electronics, while an attribute of
the same name in another database may refer to the total sales
for All Electronics stores in a given region.
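A scaling conflict of the kind described above can be illustrated with units of weight. This is an assumed example, not one taken from the slides: the records, the `unit` field and the helper `weight_in_kg` are hypothetical.

```python
# Exact conversion factor: 1 pound = 0.45359237 kilograms.
LB_TO_KG = 0.45359237


def weight_in_kg(value, unit):
    """Convert a weight to kilograms so values from different sources compare."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical records describing the same item in two source systems.
source_a = {"item": "X1", "weight": 2.0, "unit": "kg"}
source_b = {"item": "X1", "weight": 4.409, "unit": "lb"}

kg_a = weight_in_kg(source_a["weight"], source_a["unit"])
kg_b = weight_in_kg(source_b["weight"], source_b["unit"])

# The raw values (2.0 and 4.409) disagree, but after conversion to a
# common unit they agree to within rounding, so there is no real conflict.
conflict = abs(kg_a - kg_b) > 0.01
```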
Data Transformation
* In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
* Data transformation can involve
1. Smoothing.
2. Aggregation.
3. Generalization.
4. Normalization.
5. Attribute construction.
Smoothing :
Smoothing works to remove noise from the data. Such
techniques include binning, clustering and regression.
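As one illustration of binning, the sketch below smooths sorted values by bin means, using equal-frequency bins. The function name `smooth_by_bin_means`, the bin size and the sample prices are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
# Bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] with means 9, 22 and 29.
```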
Aggregation :
* Summary or aggregation operations are applied to the
data.
* For example, the daily sales data may be aggregated so
as to compute monthly and annual total amounts.
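The daily-to-monthly and daily-to-annual aggregation can be sketched as follows. The sales figures, the ISO date strings and the helper `aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 150.0),
    ("2024-02-03", 80.0),
]


def aggregate(sales, key_length):
    """Sum amounts grouped by the first key_length characters of the date:
    7 characters give the month ("YYYY-MM"), 4 give the year ("YYYY")."""
    totals = defaultdict(float)
    for date, amount in sales:
        totals[date[:key_length]] += amount
    return dict(totals)


monthly_totals = aggregate(daily_sales, 7)
annual_totals = aggregate(daily_sales, 4)
```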
Generalization :
* Low-level or “primitive” data are replaced by
higher-level concepts through the use of concept hierarchies.
* For example, categorical attributes like street can be
generalized to higher-level concepts like city or country, and
numeric attributes like age can be mapped to higher-level
concepts like young, middle-aged and senior.
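A concept hierarchy can be sketched as a set of lookups from a lower level to a higher one. Everything in this example is hypothetical: the place names, the age cut-offs of 30 and 60, and the helpers `generalize_street` and `age_group`.

```python
# Hypothetical concept hierarchy: street -> city -> country.
STREET_TO_CITY = {"Main Street": "Springfield", "Lake Road": "Shelbyville"}
CITY_TO_COUNTRY = {"Springfield": "Exampleland", "Shelbyville": "Exampleland"}


def generalize_street(street, level):
    """Climb the hierarchy from a street to the requested level."""
    city = STREET_TO_CITY[street]
    if level == "city":
        return city
    if level == "country":
        return CITY_TO_COUNTRY[city]
    raise ValueError(f"unknown level: {level}")


def age_group(age):
    """Map a numeric age to a higher-level concept (assumed cut-offs)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


city = generalize_street("Main Street", "city")
country = generalize_street("Main Street", "country")
groups = [age_group(a) for a in (25, 45, 70)]
```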
Normalization :
The attribute data are scaled so as to fall within a
small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute construction :
New attributes are constructed from the given set of
attributes and added to help the mining process.
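As an assumed illustration of attribute construction, a new attribute area can be built from existing height and width attributes. The records and the helper `add_area` are hypothetical.

```python
def add_area(record):
    """Return a copy of the record with a constructed 'area' attribute."""
    return {**record, "area": record["height"] * record["width"]}


# Hypothetical records with two given attributes.
records = [
    {"height": 2.0, "width": 3.0},
    {"height": 1.5, "width": 4.0},
]

# Each record gains the new attribute; the originals are left unchanged.
extended = [add_area(r) for r in records]
```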
There are many methods for data normalization, including:
* Min-Max normalization.
* Z-Score normalization.
* Normalization by decimal scaling.
Min – Max Normalization :
It performs a linear transformation on the original data.
Suppose that min_A and max_A are the minimum and
maximum values of attribute A. Min – Max normalization
maps a value v of A to v’ in the range
[new_min_A, new_max_A] by computing
$$ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A $$
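A minimal sketch of min-max normalization, assuming the usual linear mapping v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. The function name, its default target range of 0.0 to 1.0 and the income values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The mapping divides by (max - min), so a constant attribute
        # cannot be normalized this way.
        raise ValueError("all values are equal; min-max is undefined")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


# Hypothetical income values.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
# 73600 maps to (73600 - 12000) / (98000 - 12000), roughly 0.716.
```

Because the minimum and maximum are taken from the data, a later value outside that range would fall outside the target range.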
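Z – Score Normalization :
In Z – Score normalization, the values of an attribute A
are normalized based on the mean and standard deviation of
A. A value v of A is normalized to v’ by computing
$$ v' = \frac{v - \bar{A}}{\sigma_A} $$
where \bar{A} and \sigma_A are the mean and standard
deviation of A.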
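A minimal sketch of z-score normalization, assuming v' = (v - mean) / standard deviation with the population standard deviation. The function name `z_score_normalize` and the sample values are hypothetical.

```python
import statistics


def z_score_normalize(values):
    """Centre values on their mean and scale by the population std dev."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return [(v - mean) / std for v in values]


# Hypothetical values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_score_normalize(values)
# The value 9 lies two standard deviations above the mean.
```

This method is useful when the actual minimum and maximum of A are unknown, or when outliers would dominate a min-max mapping.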
Normalization by Decimal Scaling :
Normalization by decimal scaling normalizes by moving
the decimal point of values of attribute A.
The number of decimal places moved depends on the
maximum absolute value of A. A value v of A is normalized
to v’ by computing
$$ v' = \frac{v}{10^j} $$
where j is the smallest integer such that max(|v’|) < 1.
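A minimal sketch of decimal scaling, assuming v' = v / 10^j with j the smallest non-negative integer for which every |v'| is below 1. The function name `decimal_scale` and the sample values are hypothetical.

```python
def decimal_scale(values):
    """Return (j, scaled values) where every scaled value has |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until the largest absolute value drops below 1.
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Hypothetical values ranging from -986 to 917.
j, scaled = decimal_scale([-986, 917])
# The maximum absolute value is 986, so j = 3 and each value is divided by 1000.
```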
Thank You
