UNDER THE GUIDANCE OF: M.FLORENCE DAYANA, Head of the Department
NAME OF THE STUDENT: P.ABILA
A.AISHWARYA LAKSHMI
V.AISHWARYA
A.AYEESHABI
REGISTER NUMBER: CB17S 250338
CB17S 250344
CB17S 250343
CB17S 250355
SUBJECT CODE: P8MCA22
BATCH : 2017-2020
YEAR : 2020
WHAT IS DATA MINING?
 Data mining is the process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.
 Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.
 Data mining is the analysis step of the "knowledge discovery in databases"
process or KDD.
 It is the process of uncovering trends, common themes or patterns in
“big data”. ... For example, an early form of data mining was used by
companies to analyze huge amounts of scanner data from supermarkets.
Unit i
DATA WAREHOUSES
 Data warehousing (DW) is the process of collecting and managing data
from varied sources to provide meaningful business insights. A data
warehouse is typically used to connect and analyze business data from
heterogeneous sources. It is the core of a business intelligence (BI) system,
which is built for data analysis and reporting.
 It is a blend of technologies and components that aids the strategic use of
data. It is the electronic storage of a large amount of information by a business,
designed for query and analysis rather than transaction processing.
Data warehousing transforms data into information and makes it available
to users in a timely manner.
HOW DOES A DATA WAREHOUSE WORK?
 A Data Warehouse works as a central repository where information arrives
from one or more data sources. Data flows into a data warehouse from the
transactional system and other relational databases.
 Data may be:
 Structured
 Semi-structured
 Unstructured data
 The data is processed, transformed, and ingested so that users can access the
processed data in the Data Warehouse through Business Intelligence tools,
SQL clients, and spreadsheets. A data warehouse merges information
coming from different sources into one comprehensive database.
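The flow described above (data arrives from several sources, is transformed, and is merged into one store) can be sketched in a few lines. This is only an illustration: the two sources, their field names, and the target schema are all hypothetical and do not come from any particular warehouse product.

```python
# Hypothetical source records; all field names are assumptions for illustration.
sales_db_rows = [
    {"cust_id": 1, "amount": "120.50", "date": "2020-01-05"},
    {"cust_id": 2, "amount": "80.00", "date": "2020-01-06"},
]
web_log_rows = [
    {"customer": "1", "total": 15.25, "day": "2020-01-05"},
]

def from_sales_db(row):
    """Transform a transactional-system row into the common warehouse schema."""
    return {"customer_id": int(row["cust_id"]),
            "amount": float(row["amount"]),
            "date": row["date"],
            "source": "sales_db"}

def from_web_log(row):
    """Transform a web-log row into the same common schema."""
    return {"customer_id": int(row["customer"]),
            "amount": float(row["total"]),
            "date": row["day"],
            "source": "web_log"}

def load_warehouse(sales_rows, web_rows):
    """Merge both sources into one list that plays the role of the warehouse."""
    warehouse = [from_sales_db(r) for r in sales_rows]
    warehouse += [from_web_log(r) for r in web_rows]
    return warehouse

def total_by_customer(warehouse):
    """A simple analytical query over the merged data."""
    totals = {}
    for rec in warehouse:
        totals[rec["customer_id"]] = totals.get(rec["customer_id"], 0.0) + rec["amount"]
    return totals

warehouse = load_warehouse(sales_db_rows, web_log_rows)
totals = total_by_customer(warehouse)
print(totals)
```

Once every record shares one schema, a query such as `total_by_customer` no longer needs to know which source a record came from.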
DATA MINING FUNCTIONALITIES AND TASKS
 Data mining functionalities specify the kinds of patterns that can be mined.
The functions involved in data mining fall into two categories −
 Descriptive
 Classification and Prediction
 Descriptive Function
The descriptive function deals with the general properties of data in
the database. Here is the list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Data can be associated with classes or concepts. For example, in a
company, the classes of items for sale include computers and printers, and
the concepts of customers include big spenders and budget spenders. Such
descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived in the following two ways −
Data Characterization − This refers to summarizing the data of the class under
study. The class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or
more predefined contrasting groups or classes.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kinds of frequent patterns −
Frequent Item Set − A set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence − A sequence of events that occurs frequently, such
as the purchase of a camera being followed by the purchase of a memory card.
Frequent Substructure − Substructure refers to different structural
forms, such as graphs, trees, or lattices, which may be combined with
itemsets or subsequences.
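As a rough illustration of frequent item sets, the sketch below counts how often each pair of items appears together and keeps the pairs whose support (the fraction of transactions containing the pair) reaches a threshold. The baskets and the 0.4 threshold are made up, and the brute-force enumeration stands in for the more efficient algorithms used in practice, such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
print(pairs)
```

With this data, only the pairs (bread, milk) and (biscuits, bread) meet the threshold.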
Mining of Associations
Associations are used in retail sales to identify items that are frequently
purchased together. Association mining is the process of uncovering
relationships among data and determining association rules.
For example, a retailer generates an association rule showing that milk is sold
with bread 70% of the time, while biscuits are sold with bread only 30% of the time.
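The percentages in such a rule are usually read as the rule's confidence: of the baskets that contain bread, the share that also contain milk. The sketch below computes support and confidence for a rule. The baskets are invented so that the numbers match the example, and reading the 70% as confidence is an assumption about what the example intends.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical data: 10 baskets with bread (7 with milk, 3 with biscuits), 2 without.
baskets = ([{"bread", "milk"}] * 7) + ([{"bread", "biscuits"}] * 3) + ([{"eggs"}] * 2)

milk_support, milk_confidence = rule_metrics(baskets, {"bread"}, {"milk"})
biscuit_support, biscuit_confidence = rule_metrics(baskets, {"bread"}, {"biscuits"})
print(milk_confidence, biscuit_confidence)
```

Support is measured against all baskets, while confidence is measured only against the baskets that contain the antecedent, so the two values differ whenever some baskets lack bread.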
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between two
item sets, to determine whether they have a positive, negative, or no effect on
each other.
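One common correlation measure for two items is lift: the observed rate at which they occur together divided by the rate expected if they were independent. A lift above 1 suggests a positive correlation, below 1 a negative one, and a value near 1 suggests no effect. The sketch below uses invented baskets, and lift is only one of several measures that could be used here.

```python
def lift(transactions, a, b):
    """Lift of items a and b: P(a and b) / (P(a) * P(b))."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("lift is undefined when an item never occurs")
    p_ab = sum(1 for t in transactions if a in t and b in t) / n
    return p_ab / (p_a * p_b)

# Hypothetical baskets for illustration.
baskets = [
    {"tea", "sugar"},
    {"tea", "sugar"},
    {"coffee"},
    {"coffee", "sugar"},
]

tea_sugar = lift(baskets, "tea", "sugar")    # above 1: positive correlation
tea_coffee = lift(baskets, "tea", "coffee")  # below 1: negative correlation
print(tea_sugar, tea_coffee)
```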
Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers
to forming groups of objects that are very similar to each other but highly
different from the objects in other clusters.
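A minimal sketch of cluster analysis, assuming one-dimensional numeric data and the k-means procedure (one of many clustering methods). The values and starting centers are made up. Each value is assigned to its nearest center, and each center is then moved to the mean of its group.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the index of its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 10.0])
print(centers, clusters)
```

The two resulting groups are tight internally and far apart from each other, which is the property the definition above describes.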
Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the class of
objects whose class label is unknown. This derived model is based on the analysis of
sets of training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
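To make the first of these forms concrete, here is a sketch of a classifier expressed as IF-THEN rules. It reuses the big spender and budget spender concepts from earlier. The `annual_spend` attribute and the 5000 threshold are hypothetical, and in practice such rules would be derived from training data rather than written by hand.

```python
# Each rule is a (condition, class label) pair, checked in order.
RULES = [
    (lambda c: c["annual_spend"] >= 5000, "big spender"),
    (lambda c: c["annual_spend"] < 5000, "budget spender"),
]

def classify(customer, rules=RULES, default="unknown"):
    """Return the label of the first rule whose IF part matches the customer."""
    for condition, label in rules:
        if condition(customer):
            return label
    return default

label_a = classify({"annual_spend": 8000})
label_b = classify({"annual_spend": 1200})
print(label_a, label_b)
```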
The functions involved in these processes are as follows −
Classification − It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e., data objects whose class labels are
known.
Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression Analysis is generally used for
prediction. Prediction can also be used for identification of distribution
trends based on available data.
Outlier Analysis − Outliers may be defined as the data objects that do
not comply with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to describing and
modeling regularities or trends for objects whose behavior changes over
time.
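Since regression analysis is named above as the usual tool for prediction, a small sketch may help. It fits a straight line to known numeric values by ordinary least squares and uses it to estimate a missing value. The data points are invented, and simple linear regression is only one of many regression models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Estimate the numeric value for an unseen x."""
    return slope * x + intercept

# Hypothetical training data: the value at x = 5 is missing and will be predicted.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
estimate = predict(5, slope, intercept)
print(slope, intercept, estimate)
```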
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with
the data mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of the database in which the user is interested. This
portion includes the following −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple
levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns discovered by the process
of knowledge discovery. Different kinds of knowledge have different
interestingness measures.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be
displayed. These representations may include the following −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
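The five task primitives can be pictured as the fields of a single query object. The sketch below is not a real data mining query language. The class name `MiningQuery`, its field names, and the example values are all assumptions chosen to mirror the primitives listed above.

```python
from dataclasses import dataclass, field

# Allowed values taken from the lists above; the lowercase spelling is an assumption.
KNOWLEDGE_KINDS = {
    "characterization", "discrimination", "association", "classification",
    "prediction", "clustering", "outlier analysis", "evolution analysis",
}
PRESENTATIONS = {"rules", "tables", "charts", "graphs", "decision trees", "cubes"}

@dataclass
class MiningQuery:
    """Hypothetical container with one field per data mining task primitive."""
    task_relevant_data: list            # attributes or dimensions of interest
    knowledge_kind: str                 # kind of knowledge to be mined
    background_knowledge: dict = field(default_factory=dict)  # e.g. abstraction levels
    interestingness: dict = field(default_factory=dict)       # measures and thresholds
    presentation: str = "rules"         # how discovered patterns are displayed

    def __post_init__(self):
        if self.knowledge_kind not in KNOWLEDGE_KINDS:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind}")
        if self.presentation not in PRESENTATIONS:
            raise ValueError(f"unknown presentation: {self.presentation}")

query = MiningQuery(
    task_relevant_data=["customer.age", "customer.income", "sales.amount"],
    knowledge_kind="association",
    background_knowledge={"location": ["city", "state", "country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="tables",
)
print(query)
```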
DATA MINING ISSUES
 Data mining is not an easy task: the algorithms used can get very
complex, and the data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. This section discusses the major issues
regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
 The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore, data mining must
cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive, because interaction allows users to
focus the search for patterns, providing and refining data mining requests based
on the returned results.
Incorporation of background knowledge − Background knowledge can be used to
guide the discovery process and to express the discovered patterns, not
only in concise terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.
Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining data regularities.
Without data cleaning methods, the accuracy of the discovered
patterns will be poor.
Pattern evaluation − The patterns discovered should be interesting. Patterns that
merely represent common knowledge or lack novelty are of little value.
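For the point above about noisy or incomplete data, the sketch below shows one very simple cleaning step: filling in missing numeric values with the mean of the known ones. The records and the `age` field are hypothetical, and mean imputation is only one of many possible cleaning methods.

```python
def fill_missing_with_mean(records, field_name):
    """Return copies of the records with missing field values replaced by the mean."""
    known = [r[field_name] for r in records if r.get(field_name) is not None]
    if not known:
        raise ValueError("no known values to compute a mean from")
    mean = sum(known) / len(known)
    return [
        {**r, field_name: r[field_name] if r.get(field_name) is not None else mean}
        for r in records
    ]

# Hypothetical records: one has an explicit gap, one lacks the field entirely.
records = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": 50}, {"id": 4}]
cleaned = fill_missing_with_mean(records, "age")
print(cleaned)
```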
Performance Issues
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − To effectively
extract information from the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms divide
the data into partitions, which are then processed in parallel.
The results from the partitions are then merged. Incremental
algorithms incorporate database updates without mining the data again from
scratch.
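The partition, process, and merge pattern described above, together with an incremental update, can be sketched as follows. Item counting stands in for a real mining algorithm, the transactions are invented, and a thread pool is used only to show the structure. A real system would distribute the partitions across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how many transactions contain each item."""
    counts = Counter()
    for basket in partition:
        counts.update(basket)
    return counts

def split(data, k):
    """Divide the data into k partitions of roughly equal size."""
    return [data[i::k] for i in range(k)]

def mine_in_parallel(transactions, workers=2):
    """Process each partition concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_items, split(transactions, workers)))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def update_incrementally(total, new_transactions):
    """Fold new transactions into existing counts without re-mining old data."""
    total.update(count_items(new_transactions))
    return total

transactions = [{"milk", "bread"}, {"bread"}, {"milk", "eggs"}, {"bread", "eggs"}]
total = mine_in_parallel(transactions)
total = update_incrementally(total, [{"milk"}])
print(total)
```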
Diverse Data Types Issues
Handling of relational and complex types of data − The database
may contain complex data objects, multimedia data objects, spatial data,
temporal data, etc. It is not possible for one system to mine all these kinds
of data.
Mining information from heterogeneous databases and global
information systems − The data is available from different data sources on a
LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Therefore, mining knowledge from them adds
challenges to data mining.
APPLICATIONS AND TRENDS IN DATA MINING
 Data mining is widely used in diverse areas. A number of
commercial data mining systems are available today, yet many
challenges remain in this field. This section discusses the applications and
trends of data mining.
 Data Mining Applications
Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
SOCIAL IMPLICATIONS OF DATA MINING
Privacy
Profiling
Unauthorised use
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining. Some
typical cases are as follows −
Design and construction of data warehouses for multidimensional data analysis
and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because retail collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends
that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data
mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries,
providing various services such as fax, pager, cellular phone, internet messenger,
images, e-mail, web data transmission, etc. Due to the development of new
computer and communication technologies, the telecommunication industry is
rapidly expanding. This is why data mining has become very important in
helping to understand the business.
Data mining in the telecommunication industry helps in identifying
telecommunication patterns, catching fraudulent activities, making better use of resources,
and improving quality of service.
Here is the list of examples for which data mining improves
telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of
biology such as genomics, proteomics, functional genomics, and biomedical
research. Biological data mining is a very important part of bioinformatics.
The following are the ways in which data mining contributes to biological
data analysis −
Semantic integration of heterogeneous, distributed genomic and
proteomic databases.
Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and
protein pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and
homogeneous data sets for which statistical techniques are appropriate. Huge
amounts of data have been collected from scientific domains such as geosciences,
astronomy, etc. Large data sets are also being generated by fast
numerical simulations in various fields such as climate and ecosystem modeling,
chemical engineering, fluid dynamics, etc. The following are the applications of data
mining in scientific domains −
Data Warehouses and data preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality,
or availability of network resources. In this world of connectivity, security
has become a major issue. The increased usage of the internet and the availability of
tools and tricks for intruding into and attacking networks have prompted intrusion
detection to become a critical component of network administration. Here is the
list of areas in which data mining technology may be applied for intrusion
detection −
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
Data Mining System Products
There are many data mining system products and domain-specific data
mining applications. New data mining systems and applications continue to be
added to the existing ones. Also, efforts are being made to standardize data
mining languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features −
Data Types − The data mining system may handle formatted text, record-based
data, and relational data. The data could also be in ASCII text, relational database
data, or data warehouse data. Therefore, we should check exactly which formats the data
mining system can handle.
System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one operating
system or on several. There are also data mining systems that provide web-based user
interfaces and allow XML data as input.
Data Sources − Data sources refer to the data formats on which a data mining system
will operate. Some data mining systems may work only on ASCII text files, while
others work on multiple relational sources. A data mining system should also support ODBC
connections or OLE DB for ODBC connections.
Data Mining functions and methodologies − Some data mining systems
provide only one data mining function, such as classification, while others provide
multiple data mining functions, such as concept description, discovery-driven OLAP
analysis, association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems − Data
mining systems need to be coupled with a database or a data warehouse system.
The coupled components are integrated into a uniform information processing
environment. Here are the types of coupling listed below −
No coupling
Loose Coupling
Semi tight Coupling
Tight Coupling
Scalability − There are two scalability issues in data mining −
Row (Database size) Scalability − A data mining system is considered
row scalable if, when the number of rows is enlarged 10 times, it takes no
more than 10 times as long to execute the same query.
Column (Dimension) Scalability − A data mining system is considered
column scalable if the mining query execution time increases linearly with
the number of columns.
Visualization Tools − Visualization in data mining can be categorized as follows −
Data Visualization
Mining Results Visualization
Mining process visualization
Visual data mining
Data Mining query language and graphical user interface −
An easy-to-use graphical user interface is important to promote user-guided,
interactive data mining. Unlike relational database systems, data mining systems do
not share a common underlying data mining query language.
Trends in Data Mining
Data mining concepts are still evolving. Here are the latest trends
in this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and
web database systems.
Standardization of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.

More Related Content

PDF
Ghhh
PDF
Data mining and data warehouse lab manual updated
PPTX
Data mining: Classification and prediction
PPT
1.2 steps and functionalities
PPTX
01 Introduction to Data Mining
PPTX
Data mining tasks
DOCX
DataMining Techniq
PPTX
Classification and prediction in data mining
Ghhh
Data mining and data warehouse lab manual updated
Data mining: Classification and prediction
1.2 steps and functionalities
01 Introduction to Data Mining
Data mining tasks
DataMining Techniq
Classification and prediction in data mining

What's hot (18)

PPT
Data Mining
DOCX
Database
PPTX
Research trends in data warehousing and data mining
ODP
Data mining
PPT
Dma unit 1
PDF
Introduction to Data Mining
PPT
Knowledge discovery thru data mining
PDF
Data Mining And Data Warehousing Laboratory File Manual
PPTX
Data mining , Knowledge Discovery Process, Classification
PPTX
Data Mining Primitives, Languages & Systems
PPTX
Data mining an introduction
PPTX
The 8 Step Data Mining Process
PPTX
Data Cleaning Techniques
PPT
Data warehousing and online analytical processing
PDF
I1802055259
PPT
Data Preprocessing
PPT
Data mining
PPT
Data mininng trends
Data Mining
Database
Research trends in data warehousing and data mining
Data mining
Dma unit 1
Introduction to Data Mining
Knowledge discovery thru data mining
Data Mining And Data Warehousing Laboratory File Manual
Data mining , Knowledge Discovery Process, Classification
Data Mining Primitives, Languages & Systems
Data mining an introduction
The 8 Step Data Mining Process
Data Cleaning Techniques
Data warehousing and online analytical processing
I1802055259
Data Preprocessing
Data mining
Data mininng trends
Ad

Similar to Unit i (20)

DOCX
data mining and data warehousing
PDF
Overview of Data Mining
PPT
Data Mining-2023 (2).ppt
PPT
Sanjeev Kumar Dash D ata Mining-2023.ppt
PPT
1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt
PDF
data mining
PPTX
Unit-V-Introduction to Data Mining.pptx
PPTX
Unit3-AssociationRuleMining and data techniques.pptx
PPT
Dwdmunit1 a
PPT
20IT501_DWDM_PPT_Unit_II.ppt
PDF
G045033841
PDF
Lect 1 introduction
PPT
Data mining techniques unit 1
PPTX
Introduction_to_Data_Mining12345678.pptx
PPT
20IT501_DWDM_PPT_Unit_II.ppt
DOCX
Seminar Report Vaibhav
PDF
Data Mining Module 1 Business Analytics.
PDF
Data Mining
PPT
Chapter 01Intro.ppt full explanation used
data mining and data warehousing
Overview of Data Mining
Data Mining-2023 (2).ppt
Sanjeev Kumar Dash D ata Mining-2023.ppt
1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt
data mining
Unit-V-Introduction to Data Mining.pptx
Unit3-AssociationRuleMining and data techniques.pptx
Dwdmunit1 a
20IT501_DWDM_PPT_Unit_II.ppt
G045033841
Lect 1 introduction
Data mining techniques unit 1
Introduction_to_Data_Mining12345678.pptx
20IT501_DWDM_PPT_Unit_II.ppt
Seminar Report Vaibhav
Data Mining Module 1 Business Analytics.
Data Mining
Chapter 01Intro.ppt full explanation used
Ad

Recently uploaded (20)

PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Trump Administration's workforce development strategy
PDF
LNK 2025 (2).pdf MWEHEHEHEHEHEHEHEHEHEHE
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Lesson notes of climatology university.
PDF
Updated Idioms and Phrasal Verbs in English subject
PDF
RMMM.pdf make it easy to upload and study
PPTX
UNIT III MENTAL HEALTH NURSING ASSESSMENT
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
master seminar digital applications in india
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
Cell Types and Its function , kingdom of life
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
LDMMIA Reiki Yoga Finals Review Spring Summer
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Trump Administration's workforce development strategy
LNK 2025 (2).pdf MWEHEHEHEHEHEHEHEHEHEHE
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Anesthesia in Laparoscopic Surgery in India
Lesson notes of climatology university.
Updated Idioms and Phrasal Verbs in English subject
RMMM.pdf make it easy to upload and study
UNIT III MENTAL HEALTH NURSING ASSESSMENT
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
master seminar digital applications in india
Microbial disease of the cardiovascular and lymphatic systems
Cell Types and Its function , kingdom of life
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
Supply Chain Operations Speaking Notes -ICLT Program
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
STATICS OF THE RIGID BODIES Hibbelers.pdf
LDMMIA Reiki Yoga Finals Review Spring Summer
Orientation - ARALprogram of Deped to the Parents.pptx
202450812 BayCHI UCSC-SV 20250812 v17.pptx

Unit i

  • 1. UNDER THE GUIDENCE: M.FLORENCE DAYANA , Head of the Department NAME OF THE STUDENT: P.ABILA A.AISHWARYA LAKSHMI V.AISHWARYA A.AYEESHABI REGISTER NUMBER: CB17S 250338 CB17S 250344 CB17S 250343 CB17S 250355 SUBJECT CODE: P8MCA22 BATCH : 2017-2020 YEAR : 2020
  • 2. WHAT IS DATA MINING?  Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.  Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.  Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.  It is the process of uncovering trends, common themes or patterns in “big data”. ... For example, an early form of data mining was used by companies to analyze huge amounts of scanner data from supermarkets.
  • 4.  Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.  It is the process of uncovering trends, common themes or patterns in “big data”. ... For example, an early form of data mining was used by companies to analyze huge amounts of scanner data from supermarkets.
  • 5. DATA WAREHOUSES  A Data Warehousing (DW) is process for collecting and managing data from varied sources to provide meaningful business insights. A Data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system which is built for data analysis and reporting.  It is a blend of technologies and components which aids the strategic use of data. It is electronic storage of a large amount of information by a business which is designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.
  • 6. HOW DATA WAREHOUSE WORKS?  A Data Warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases.  Data may be:  Structured  Semi-structured  Unstructured data  The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database.
  • 7. DATA MINING FUNCTIONALITIES AND TASKS  Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in Data Mining  Descriptive  Classification and Prediction  Descriptive Function The descriptive function deals with the general properties of data in the database. Here is the list of descriptive functions  Class/Concept Description  Mining of Frequent Patterns  Mining of Associations  Mining of Correlations  Mining of Clusters
  • 8. Class/Concept Description Class/Concept refers to the data to be associated with the classes or concepts. For example, in a company, the classes of items for sales include computer and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived by the following two ways − Data Characterization − This refers to summarizing data of class under study. This class under study is called as Target Class. Data Discrimination − It refers to the mapping or classification of a class with some predefined group or class. Mining of Frequent Patterns Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kind of frequent patterns − Frequent Item Set − It refers to a set of items that frequently appear together, for example, milk and bread. Frequent Subsequence − A sequence of patterns that occur frequently such as purchasing a camera is followed by memory card. .
  • 9. Frequent Sub Structure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item- sets or subsequences. Mining of Association Associations are used in retail sales to identify patterns that are frequently purchased together. This process refers to the process of uncovering the relationship among data and determining association rules. For example, a retailer generates an association rule that shows that 70% of time milk is sold with bread and only 30% of times biscuits are sold with bread. Mining of Correlations It is a kind of additional analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs or between two item sets to analyze that if they have positive, negative or no effect on each other. Mining of Clusters Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of objects that are very similar to each other but are highly different from the objects in other clusters.
  • 10. Classification and Prediction Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms − Classification (IF-THEN) Rules Decision Trees Mathematical Formulae Neural Networks The list of functions involved in these processes are as follows − Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The Derived Model is based on the analysis set of training data i.e. the data object whose class label is well known. Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression Analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.
  • 11. Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available. Evolution Analysis − Evolution analysis refers to the description and model regularities or trends for objects whose behavior changes over time. Data Mining Task Primitives We can specify a data mining task in the form of a data mining query. This query is input to the system. A data mining query is defined in terms of data mining task primitives. Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives − Set of task relevant data to be mined. Kind of knowledge to be mined. Background knowledge to be used in discovery process. Interestingness measures and thresholds for pattern evaluation. Representation for visualizing the discovered patterns.
  • 12. Set of task-relevant data to be mined − This is the portion of the database in which the user is interested. It includes database attributes and data warehouse dimensions of interest. Kind of knowledge to be mined − This refers to the kind of functions to be performed: characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. Background knowledge − Background knowledge allows data to be mined at multiple levels of abstraction.
  • 13. Interestingness measures and thresholds for pattern evaluation − These are used to evaluate the patterns discovered by the knowledge discovery process. Different kinds of knowledge have different interestingness measures. Representation for visualizing the discovered patterns − This refers to the form in which discovered patterns are displayed. These representations may include rules, tables, charts, graphs, decision trees, and cubes.
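The five task primitives can be pictured as the fields of a single query object. The sketch below is only an illustration: the class name `MiningQuery`, its field names, and the choice of support and confidence as interestingness measures are assumptions, not a real data mining query language.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MiningQuery:
    """Hypothetical container for the five data mining task primitives."""

    relevant_data: List[str]   # attributes or warehouse dimensions of interest
    knowledge_kind: str        # e.g. "association", "classification"
    background_knowledge: Dict[str, str] = field(default_factory=dict)  # concept hierarchy
    min_support: float = 0.0      # interestingness threshold
    min_confidence: float = 0.0   # interestingness threshold
    presentation: str = "rules"   # e.g. "rules", "tables", "charts"

    def is_interesting(self, support: float, confidence: float) -> bool:
        """Keep a pattern only if it meets both thresholds."""
        return support >= self.min_support and confidence >= self.min_confidence


query = MiningQuery(
    relevant_data=["item", "store", "date"],
    knowledge_kind="association",
    background_knowledge={"milk": "dairy", "bread": "bakery"},
    min_support=0.1,
    min_confidence=0.6,
    presentation="tables",
)

keep = query.is_interesting(support=0.3, confidence=0.7)
drop = query.is_interesting(support=0.05, confidence=0.9)
```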
  • 14. DATA MINING ISSUES  Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place but must be integrated from various heterogeneous data sources. These factors create several issues. The major issues discussed here concern −  Mining Methodology and User Interaction  Performance Issues  Diverse Data Types Issues  The following diagram describes the major issues.
  • 16. Mining Methodology and User Interaction Issues − This refers to the following kinds of issues − Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks. Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results. Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction. Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
  • 17. Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable. Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor. Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not. Performance Issues − There can be performance-related issues such as the following − Efficiency and scalability of data mining algorithms − To extract information effectively from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
  • 18. Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch. Diverse Data Types Issues − Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data. Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
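The partition, process and merge idea, and the incremental update that avoids re-mining from scratch, can be sketched with simple item counting. This is an illustrative assumption rather than a specific published algorithm: the function names are hypothetical, the "mining" step is just counting item occurrences, and a thread pool stands in for whatever parallel or distributed workers a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_items(partition: List[List[str]]) -> Counter:
    """Mine one partition: count in how many transactions each item appears."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def split(data: List[List[str]], parts: int) -> List[List[List[str]]]:
    """Divide the data into roughly equal partitions."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def mine_in_partitions(data: List[List[str]], parts: int = 2) -> Counter:
    """Process the partitions in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(count_items, split(data, parts)))
    return sum(partials, Counter())


data = [["bread", "milk"], ["bread"], ["milk", "biscuits"], ["bread", "milk"]]
totals = mine_in_partitions(data, parts=2)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["bread", "biscuits"]]
totals.update(count_items(new_batch))
```

Counting merges cleanly because the partial results simply add up; more elaborate pattern mining generally needs a second pass to confirm candidates found in individual partitions.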
  • 19. APPLICATIONS AND TRENDS IN DATA MINING  Data mining is widely used in diverse areas. A number of commercial data mining systems are available today, yet many challenges remain in this field. This part discusses the applications of and trends in data mining.  Data Mining Applications − Here is the list of areas where data mining is widely used −  Financial Data Analysis  Retail Industry  Telecommunication Industry  Biological Data Analysis  Other Scientific Applications  Intrusion Detection
  • 20. SOCIAL IMPLICATIONS OF DATA MINING − Privacy, profiling, and unauthorised use.
  • 21. Financial Data Analysis − Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows − Design and construction of data warehouses for multidimensional data analysis and data mining. Loan payment prediction and customer credit policy analysis. Classification and clustering of customers for targeted marketing. Detection of money laundering and other financial crimes. Retail Industry − Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability, and popularity of the web. Data mining in the retail industry helps identify customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction. Here is the list of examples of data mining in the retail industry −
  • 22. Design and construction of data warehouses based on the benefits of data mining. Multidimensional analysis of sales, customers, products, time, and region. Analysis of the effectiveness of sales campaigns. Customer retention. Product recommendation and cross-referencing of items. Telecommunication Industry − Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for understanding the business. Data mining in the telecommunication industry helps identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
  • 23. Here is the list of examples for which data mining improves telecommunication services − Multidimensional analysis of telecommunication data. Fraudulent pattern analysis. Identification of unusual patterns. Multidimensional association and sequential pattern analysis. Mobile telecommunication services. Use of visualization tools in telecommunication data analysis. Biological Data Analysis − In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis − Semantic integration of heterogeneous, distributed genomic and proteomic databases.
  • 24. Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences. Discovery of structural patterns and analysis of genetic networks and protein pathways. Association and path analysis. Visualization tools in genetic data analysis. Other Scientific Applications − The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in scientific fields − Data warehouses and data preprocessing. Graph-based mining. Visualization and domain-specific knowledge.
  • 25. Intrusion Detection − Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding on and attacking networks have made intrusion detection a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection − Development of data mining algorithms for intrusion detection. Association and correlation analysis and aggregation to help select and build discriminating attributes. Analysis of stream data. Distributed data mining. Visualization and query tools. Data Mining System Products − There are many data mining system products and domain-specific data mining applications. New data mining systems and applications are being added to the previous systems. Also, efforts are being made to standardize data mining languages.
  • 26. Choosing a Data Mining System − The selection of a data mining system depends on the following features − Data Types − The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII text, relational database data, or data warehouse data. Therefore, we should check exactly which formats the data mining system can handle. System Issues − We must consider the compatibility of a data mining system with different operating systems. One data mining system may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow XML data as input. Data Sources − Data sources refer to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. A data mining system should also support ODBC connections or OLE DB for ODBC connections. Data mining functions and methodologies − Some data mining systems provide only one data mining function, such as classification, while others provide multiple functions, such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
  • 27. Coupling data mining with database or data warehouse systems − Data mining systems need to be coupled with a database or a data warehouse system. The coupled components are integrated into a uniform information processing environment. The types of coupling are: no coupling, loose coupling, semi-tight coupling, and tight coupling. Scalability − There are two scalability issues in data mining − Row (Database Size) Scalability − A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute a query. Column (Dimension) Scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
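The two scalability criteria above can be checked against measured query times. The sketch below is a rough illustration with made-up timings; the function names are hypothetical, and the column check uses a simplifying assumption (time per column stays within a tolerance of the first measurement) in place of a proper linear fit.

```python
from typing import List, Tuple


def is_row_scalable(base_rows: int, base_seconds: float,
                    scaled_rows: int, scaled_seconds: float) -> bool:
    """True if query time grew no faster than the number of rows."""
    return scaled_seconds / base_seconds <= scaled_rows / base_rows


def is_column_scalable(measurements: List[Tuple[int, float]],
                       tolerance: float = 0.2) -> bool:
    """True if time per column stays close to that of the first measurement.

    `measurements` holds (number_of_columns, seconds) pairs. Treating
    "linear" as roughly constant time per column is an assumption.
    """
    first_columns, first_seconds = measurements[0]
    baseline = first_seconds / first_columns
    for columns, seconds in measurements[1:]:
        per_column = seconds / columns
        if abs(per_column - baseline) > tolerance * baseline:
            return False
    return True


# Hypothetical timings: 10x the rows took 8x as long, so the system is row scalable.
row_ok = is_row_scalable(100_000, 2.0, 1_000_000, 16.0)
col_ok = is_column_scalable([(10, 1.0), (20, 2.1), (40, 4.0)])
col_bad = is_column_scalable([(10, 1.0), (20, 4.0)])
```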
  • 28. Visualization Tools − Visualization in data mining can be categorized as follows: data visualization, mining results visualization, mining process visualization, and visual data mining. Data mining query language and graphical user interface − An easy-to-use graphical user interface is important to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share a common underlying data mining query language. Trends in Data Mining − Data mining concepts are still evolving, and here are the latest trends in this field − Application exploration. Scalable and interactive data mining methods. Integration of data mining with database systems, data warehouse systems, and web database systems. Standardization of data mining query language. Visual data mining.
  • 29. New methods for mining complex types of data. Biological data mining. Data mining and software engineering. Web mining. Distributed data mining. Real time data mining. Multi database data mining. Privacy protection and information security in data mining.