Unit i

UNDER THE GUIDANCE OF: M.FLORENCE DAYANA , Head of the Department
NAME OF THE STUDENT: P.ABILA
A.AISHWARYA LAKSHMI
V.AISHWARYA
A.AYEESHABI
REGISTER NUMBER: CB17S 250338
CB17S 250344
CB17S 250343
CB17S 250355
SUBJECT CODE: P8MCA22
BATCH : 2017-2020
YEAR : 2020

WHAT IS DATA MINING?
 Data mining is the process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.
 Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.
 Data mining is the analysis step of the "knowledge discovery in databases"
process or KDD.
 It is the process of uncovering trends, common themes or patterns in
“big data”. ... For example, an early form of data mining was used by
companies to analyze huge amounts of scanner data from supermarkets.


DATA WAREHOUSES
 Data warehousing (DW) is the process of collecting and managing data
from varied sources to provide meaningful business insights. A Data
warehouse is typically used to connect and analyze business data from
heterogeneous sources. The data warehouse is the core of the BI system
which is built for data analysis and reporting.
 It is a blend of technologies and components that aid the strategic use of
data. It is the electronic storage of a large amount of information by a business,
which is designed for query and analysis instead of transaction processing.
It is a process of transforming data into information and making it available
to users in a timely manner to make a difference.

HOW DOES A DATA WAREHOUSE WORK?
 A Data Warehouse works as a central repository where information arrives
from one or more data sources. Data flows into a data warehouse from the
transactional system and other relational databases.
 Data may be:
 Structured
 Semi-structured
 Unstructured data
 The data is processed, transformed, and ingested so that users can access the
processed data in the Data Warehouse through Business Intelligence tools,
SQL clients, and spreadsheets. A data warehouse merges information
coming from different sources into one comprehensive database.
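The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two sources, their field names, and the `load_warehouse` helper are hypothetical and are not part of any real warehouse product.

```python
import json

# Hypothetical source 1: structured rows from a transactional system.
transactional_rows = [
    {"order_id": 1, "customer": "alice", "amount": "120.50"},
    {"order_id": 2, "customer": "bob", "amount": "80.00"},
]

# Hypothetical source 2: semi-structured JSON from another system.
json_feed = '[{"id": 3, "cust": "Carol", "total": 42.25}]'


def load_warehouse(rows, feed):
    """Transform both sources into one common schema and merge them."""
    warehouse = []
    for row in rows:
        warehouse.append({
            "order_id": row["order_id"],
            "customer": row["customer"].lower(),
            "amount": float(row["amount"]),
        })
    for item in json.loads(feed):
        warehouse.append({
            "order_id": item["id"],
            "customer": item["cust"].lower(),
            "amount": float(item["total"]),
        })
    return warehouse


warehouse = load_warehouse(transactional_rows, json_feed)

# Once merged, the data can be queried as a single table.
total_sales = sum(record["amount"] for record in warehouse)
```

The point of the common schema is that reporting tools only need to understand one layout, whatever the original sources looked like.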

DATA MINING FUNCTIONALITIES AND TASKS
 Data mining deals with the kind of patterns that can be mined. On the basis
of the kind of data to be mined, there are two categories of functions
involved in Data Mining
 Descriptive
 Classification and Prediction
 Descriptive Function
The descriptive function deals with the general properties of data in
the database. Here is the list of descriptive functions
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters

Class/Concept Description
Class/Concept refers to the data to be associated with the classes or
concepts. For example, in a company, the classes of items for sales include
computer and printers, and concepts of customers include big spenders and
budget spenders. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived by the following
two ways −
Data Characterization − This refers to summarizing the data of the class under
study. The class under study is called the target class.
Data Discrimination − It refers to the mapping or classification of a class
with some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kind of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear
together, for example, milk and bread.
Frequent Subsequence − A sequence of patterns that occurs frequently,
such as the purchase of a camera followed by the purchase of a memory card.

Frequent Substructure − Substructure refers to different structural
forms, such as graphs, trees, or lattices, which may be combined with
itemsets or subsequences.
Mining of Associations
Associations are used in retail sales to identify items that are frequently
purchased together. Association mining is the process of uncovering
relationships among data and determining association rules.
For example, a retailer generates an association rule showing that milk is sold
with bread 70% of the time, while biscuits are sold with bread only 30% of the time.
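Percentages like those in the milk and bread rule are usually reported as the confidence of a rule, alongside its support. The sketch below computes both on a small, made-up set of baskets; the data and the helper names are assumptions chosen for illustration.

```python
# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "biscuits"},
    {"bread", "milk"},
    {"bread", "biscuits"},
    {"milk"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also
    contain consequent. Assumes antecedent occurs at least once."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


# 3 of the 4 baskets with bread also contain milk.
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
# 2 of the 4 baskets with bread also contain biscuits.
bread_to_biscuits = confidence({"bread"}, {"biscuits"}, transactions)
```

Here the rule "bread implies milk" has confidence 0.75 and the rule "bread implies biscuits" has confidence 0.5, so the first rule is the stronger one on this data.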
Mining of Correlations
It is an additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between two
item sets, to determine whether they have a positive, negative or no effect on each
other.
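One common measure for this kind of analysis is lift, which compares how often two items occur together with how often they would if they were independent. Lift is used here as an example measure; the text above does not name one. The baskets and function names are hypothetical.

```python
def lift(a, b, transactions):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from counts.
    Assumes both items occur in at least one transaction."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    return (count_ab * n) / (count_a * count_b)


def effect(value):
    """Interpret a lift value as positive, negative, or no correlation."""
    if value > 1:
        return "positive"
    if value < 1:
        return "negative"
    return "none"


# Hypothetical baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"tea"},
    {"tea", "biscuits"},
]

bread_milk = effect(lift("bread", "milk", baskets))
bread_tea = effect(lift("bread", "tea", baskets))
```

A lift above 1 suggests the two items appear together more often than chance, below 1 less often, and exactly 1 suggests no effect on each other.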
Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers
to forming groups of objects that are very similar to each other but highly
different from the objects in other clusters.
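As a rough illustration of cluster analysis, the sketch below runs a very small k-means loop on one-dimensional values. k-means is used here as one well-known clustering technique; the text above does not prescribe an algorithm, and the data and starting centers are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for 1-D data: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster received no values.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical measurements that fall into two obvious groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[1.0, 12.0])
```

Values inside each resulting cluster are close to each other and far from the values in the other cluster, which is the property described above.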

Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the class of
objects whose class label is unknown. This derived model is based on the analysis of
sets of training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
The functions involved in these processes are as follows −
Classification − It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e. data objects whose class labels are
known.
Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression Analysis is generally used for
prediction. Prediction can also be used for identification of distribution
trends based on available data.

Outlier Analysis − Outliers may be defined as the data objects that do
not comply with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis describes and
models regularities or trends for objects whose behavior changes over
time.
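Since regression analysis is mentioned as the usual tool for prediction, here is a minimal sketch of simple least-squares linear regression used to predict a missing numerical value. The training data and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: months as a customer vs. monthly spend.
months = [1, 2, 3, 4]
spend = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(months, spend)

# Predict the unavailable value for month 5.
predicted = slope * 5 + intercept
```

Unlike classification, the output here is a number rather than a class label.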
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with
the data mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.

Set of task relevant data to be mined
This is the portion of the database in which the user is interested. This
portion includes the following −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple
levels of abstraction.

Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the process
of knowledge discovery. There are different interestingness measures for different
kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be
displayed. These representations may include the following −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
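The five task primitives can be pictured as the fields of one query object. The sketch below is not a real data mining query language; the `MiningQuery` class, its field names, and the sample values are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five data mining task primitives."""
    task_relevant_data: list   # attributes or dimensions of interest
    knowledge_kind: str        # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measure -> threshold
    presentation: str = "rules"                               # e.g. "tables", "charts"

    def is_interesting(self, measures):
        """True if a pattern meets every threshold set in the query."""
        return all(
            measures.get(name, 0.0) >= threshold
            for name, threshold in self.interestingness.items()
        )


query = MiningQuery(
    task_relevant_data=["item", "customer_age"],
    knowledge_kind="association",
    background_knowledge={"city": "country"},
    interestingness={"support": 0.1, "confidence": 0.7},
    presentation="rules",
)

strong_pattern = query.is_interesting({"support": 0.3, "confidence": 0.8})
weak_pattern = query.is_interesting({"support": 0.01, "confidence": 0.9})
```

Grouping the primitives this way shows how the thresholds act as a filter on whatever patterns the mining step returns.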

DATA MINING ISSUES
 Data mining is not an easy task, as the algorithms used can get very
complex and the data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. The major issues discussed here concern −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
 The following diagram describes the major issues.

Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore it is necessary for data
mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to focus the
search for patterns, providing and refining data mining requests based on the
returned results.
Incorporation of background knowledge − To guide discovery process and
to express the discovered patterns, the background knowledge can be used.
Background knowledge may be used to express the discovered patterns not
only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.

Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining the data regularities.
Without data cleaning methods, the accuracy of the discovered
patterns will be poor.
Pattern evaluation − Not every discovered pattern is interesting: a pattern may
be uninteresting because it represents common knowledge or lacks novelty, so the
discovered patterns need to be evaluated.
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively
extract information from huge amounts of data in databases, data mining
algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − The
factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms divide
the data into partitions, which are then processed in parallel.
The results from the partitions are then merged. Incremental
algorithms incorporate database updates without mining the data again from
scratch.
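The partition, process, and merge idea can be sketched with item counting as the mining step. This is a simplified illustration: the data is made up, a thread pool stands in for a real parallel or distributed system, and the function names are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mining step for one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts


def parallel_count(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # merging Counters adds their counts
    return total


# Hypothetical transactions.
transactions = [
    ["milk", "bread"],
    ["bread"],
    ["milk", "biscuits"],
    ["bread", "biscuits"],
]
item_counts = parallel_count(transactions)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["milk"], ["bread", "milk"]]
item_counts.update(count_items(new_batch))
```

Counting merges cleanly because the partial results can simply be added; many real mining algorithms need a more careful merge step.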
Diverse Data Types Issues
Handling of relational and complex types of data − The database
may contain complex data objects, multimedia data objects, spatial data,
temporal data etc. It is not possible for one system to mine all these kinds
of data.
Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on
a LAN or WAN. These data sources may be structured, semi-structured or
unstructured. Therefore mining the knowledge from them adds
challenges to data mining.

APPLICATIONS AND TRENDS IN DATA MINING
 Data mining is widely used in diverse areas. There are a number of
commercial data mining systems available today, yet many
challenges remain in this field. This unit discusses the applications and
trends of data mining.
 Data Mining Applications
Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection

SOCIAL IMPLICATIONS OF DATA MINING
Privacy
Profiling
Unauthorised use

Financial Data Analysis
The financial data in banking and financial industry is generally reliable and of
high quality which facilitates systematic data analysis and data mining. Some of the
typical cases are as follows −
Design and construction of data warehouses for multidimensional data analysis
and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has wide application in the retail industry, which collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends
that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −

Design and Construction of data warehouses based on the benefits of data
mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries,
providing various services such as fax, pager, cellular phone, internet messenger,
images, e-mail, web data transmission, etc. Due to the development of new
computer and communication technologies, the telecommunication industry is
rapidly expanding. This is why data mining has become very important in
helping to understand the business.
Data mining in the telecommunication industry helps to identify
telecommunication patterns, catch fraudulent activities, make better use of resources,
and improve quality of service.

Here is the list of examples for which data mining improves
telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of
biology such as genomics, proteomics, functional genomics and biomedical
research. Biological data mining is a very important part of bioinformatics.
Following are the aspects in which data mining contributes to biological
data analysis −
Semantic integration of heterogeneous, distributed genomic and
proteomic databases.

Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and
protein pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and
homogeneous data sets for which statistical techniques are appropriate. Huge
amounts of data have been collected from scientific domains such as geosciences,
astronomy, etc. A large number of data sets are being generated because of fast
numerical simulations in various fields such as climate and ecosystem modeling,
chemical engineering, fluid dynamics, etc. Following are the applications of data
mining in scientific fields −
Data Warehouses and data preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality,
or the availability of network resources. In this world of connectivity, security
has become a major issue. The increased usage of the internet and the availability of
tools and tricks for intruding on and attacking networks have prompted intrusion
detection to become a critical component of network administration. Here is the
list of areas in which data mining technology may be applied for intrusion
detection −
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
Data Mining System Products
There are many data mining system products and domain specific data
mining applications. The new data mining systems and applications are being
added to the previous systems. Also, efforts are being made to standardize data
mining languages.

Choosing a Data Mining System
The selection of a data mining system depends on the following features −
Data Types − The data mining system may handle formatted text, record-based
data, and relational data. The data could also be in ASCII text, relational database
data or data warehouse data. Therefore, we should check what exact format the data
mining system can handle.
System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one operating
system or on several. There are also data mining systems that provide web-based user
interfaces and allow XML data as input.
Data Sources − Data sources refer to the data formats on which the data mining system
will operate. Some data mining systems may work only on ASCII text files, while
others work on multiple relational sources. A data mining system should also support ODBC
connections or OLE DB for ODBC connections.
Data Mining functions and methodologies − Some data mining systems
provide only one data mining function, such as classification, while others provide
multiple data mining functions such as concept description, discovery-driven OLAP
analysis, association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search, etc.

Coupling data mining with databases or data warehouse systems − Data
mining systems need to be coupled with a database or a data warehouse system.
The coupled components are integrated into a uniform information processing
environment. Here are the types of coupling listed below −
No coupling
Loose Coupling
Semi tight Coupling
Tight Coupling
Scalability − There are two scalability issues in data mining −
Row (Database size) Scalability − A data mining system is considered
row scalable if, when the number of rows is enlarged 10 times, it takes no
more than 10 times as long to execute the same query.
Column (Dimension) Scalability − A data mining system is considered
column scalable if the mining query execution time increases linearly with
the number of columns.
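Both definitions can be turned into simple checks on measured execution times. The sketch below is one possible reading of them; the function names, the tolerance, and the timing figures are all hypothetical.

```python
def is_row_scalable(base_time, scaled_time, factor=10):
    """True if enlarging the rows by `factor` costs at most
    `factor` times the original execution time."""
    return scaled_time <= factor * base_time


def is_column_scalable(columns, times, tolerance=0.1):
    """Rough linearity check: the time added per extra column should be
    about the same between every pair of consecutive measurements.
    Needs at least three measurements to be meaningful."""
    points = list(zip(columns, times))
    slopes = [
        (t2 - t1) / (c2 - c1)
        for (c1, t1), (c2, t2) in zip(points, points[1:])
    ]
    return max(slopes) - min(slopes) <= tolerance * abs(max(slopes))


# Hypothetical timings in seconds.
row_ok = is_row_scalable(base_time=2.0, scaled_time=18.0)
row_bad = is_row_scalable(base_time=2.0, scaled_time=25.0)
col_ok = is_column_scalable([10, 20, 40], [1.0, 2.0, 4.0])
col_bad = is_column_scalable([10, 20, 40], [1.0, 4.0, 16.0])
```

In the last case the time grows roughly with the square of the number of columns, so the system would not count as column scalable.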

Visualization Tools − Visualization in data mining can be categorized as follows −
Data Visualization
Mining Results Visualization
Mining process visualization
Visual data mining
Data Mining query language and graphical user interface −
An easy-to-use graphical user interface is important to promote user-guided,
interactive data mining. Unlike relational database systems, data mining systems do
not share an underlying data mining query language.
Trends in Data Mining
Data mining concepts are still evolving and here are the latest trends that we get
to see in this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and
web database systems.
Standardization of data mining query language.
Visual data mining.

New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.
