Syllabus
Introduction to Data Science
‱ Data Science is a field that combines statistical methods,
algorithms, and technology to extract insights from structured
and unstructured data.
‱ It enables organizations to make data-driven decisions,
predict trends, and improve efficiency.
‱ Data Science is a collection of techniques used to extract value
from data.
 Data – can be a simple array of a few numeric observations or a complex matrix of
millions of observations with thousands of variables.
 Science – in data science indicates that the methods are evidence based and are
built on empirical knowledge, more specifically on historical observations.
This discipline coexists with, and is closely associated with,
‱ database systems
‱ data engineering
‱ visualization
‱ data analysis
‱ experimentation and,
‱ business intelligence.
Key features
‱ 1.2.1 Extracting Meaningful Patterns
In data science, extracting meaningful patterns refers to the process of identifying
and interpreting significant trends, relationships, or insights in data. This involves
analyzing large datasets to discover patterns that can inform decision-making,
predict future trends, or solve specific problems.
Key Aspects of Extracting Meaningful Patterns
 Data Mining: Techniques used to discover patterns in large datasets, often
involving machine learning, statistical analysis, and database systems.
 Statistical Analysis: Using statistical methods to identify relationships and
trends within data.
 Machine Learning Models: Employing algorithms to learn from data and
make predictions or classify information.
 Visualization Tools: Creating charts, graphs, and other visual aids to help
identify patterns and make data more interpretable.
 Feature Engineering: Selecting and transforming variables in the data to
improve the performance of machine learning models.
 Pattern Recognition: Detecting regularities and irregularities in data,
which could indicate significant insights.
Examples of Extracting Meaningful Patterns:
‱ Customer Segmentation: Analyzing purchase behavior to
group customers into segments for targeted marketing.
‱ Fraud Detection: Identifying unusual transaction patterns that
may indicate fraudulent activity.
‱ Predictive Maintenance: Using sensor data to predict
equipment failures before they occur.
‱ Market Basket Analysis: Discovering product purchase
combinations to optimize inventory and cross-selling
strategies.
‱ 1.2.2 Building Representative Models
 In statistics, a model is the representation of a relationship
between variables in a dataset.
 It describes how one or more variables in the data are related
to other variables.
 Modeling is a process in which a representative abstraction is
built from the observed dataset.
 For example, based on credit score, income level, and
requested loan amount, a model can be developed to
determine the interest rate of a loan. For this task, previously
known observational data including credit score, income level,
loan amount, and interest rate are needed.
‱ Fig. 1.3 shows the process of generating a model. Once the representative
model is created, it can be used to predict the value of the interest rate,
based on all the input variables.
‱ This model serves two purposes:
– On the one hand, it predicts the output (interest rate) based on the
new and unseen set of input variables (credit score, income level, and
loan amount),
– and on the other hand, the model can be used to understand the
relationship between the output variable and all the input variables.
‱ For example, does income level really matter in determining
the interest rate of a loan? Does income level matter more
than credit score? What happens when income levels double
or if credit score drops by 10 points? A model can be used for
both predictive and explanatory applications.
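As an illustration, the sketch below fits a linear model for the interest rate with scikit-learn; the loan observations are invented for this example, not taken from the referenced dataset. The fitted model is used both to predict the rate for a new applicant and to inspect the coefficients relating each input to the output.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical observations: credit score, income (in $1000s), loan amount (in $1000s)
X = np.array([
    [500, 40, 100],
    [600, 55, 120],
    [650, 70, 150],
    [700, 90, 200],
    [750, 120, 250],
])
y = np.array([9.5, 8.2, 7.4, 6.1, 5.3])   # observed interest rates (%)

model = LinearRegression().fit(X, y)

# Predictive use: estimate the rate for a new, unseen applicant
print(model.predict(np.array([[680, 80, 180]])))

# Explanatory use: how does each input variable relate to the interest rate?
print(model.coef_)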
‱ 1.2.3 Combination of Statistics, Machine Learning, and
Computing
Data science integrates these three disciplines to extract
meaningful insights from data, build predictive models, and
implement solutions. Each of these fields contributes unique
methodologies and tools, which together enable comprehensive data
analysis and decision-making processes.
‱ 1.2.4 Learning Algorithms
Learning algorithms are the methods and techniques used to build
models that can learn from data and make predictions or
decisions. These algorithms enable machines to automatically
improve their performance on a given task through experience,
without being explicitly programmed for every possible scenario.
‱ Types of Learning Algorithms
1. Supervised Learning:
Examples:
‱ Linear Regression
‱ Logistic Regression
‱ Decision Trees
‱ Support Vector Machines (SVM)
‱ Neural Networks
2. Unsupervised Learning:
– Examples:
‱ K-Means Clustering
‱ Hierarchical Clustering
‱ Principal Component Analysis (PCA)
‱ Anomaly Detection
3. Reinforcement Learning:
– Examples:
‱ Q-Learning
‱ Deep Q-Networks (DQN)
‱ Policy Gradient Methods
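A minimal sketch of the first two categories, assuming scikit-learn and synthetic data (the dataset and labels below are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 data points, 2 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # known labels -> supervised setting

# Supervised learning: learn a mapping from inputs to the known labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5, -0.2]]))        # predict the label of a new point

# Unsupervised learning: no labels, just find natural groupings
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                   # cluster assignments of the first 10 points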
Benefits of Learning Algorithms:
‱ Scalability: Can handle large datasets and complex problems
efficiently.
‱ Accuracy: Often provide high levels of predictive accuracy by
leveraging vast amounts of data.
‱ Continuous Improvement: Capable of learning and improving
over time with more data and feedback.
‱ Versatility: Applicable to a wide range of domains and
industries, from healthcare to finance to entertainment.
‱ 1.2.5. Associated Fields
‱ As we have seen, data science covers a wide set of
techniques, applications, and disciplines.
‱ There are a few associated fields that data science heavily
depends on.
‱ They are
– Descriptive Statistics,
– Exploratory Visualization,
– Dimensional Slicing,
– Hypothesis Testing,
– Data Engineering and,
– Business Intelligence.
Descriptive statistics:
‱ Computing the mean, standard deviation, correlation, and other
descriptive statistics quantifies the aggregate structure of a
dataset.
‱ This information is essential for understanding any dataset, the
structure of its data, and the relationships within it.
‱ Descriptive statistics are used in the exploration stage of the data
science process.
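For example, a few lines of pandas (with a made-up two-attribute dataset) compute these summaries:

import pandas as pd

# Hypothetical data used only to illustrate descriptive statistics
df = pd.DataFrame({
    "credit_score": [500, 600, 650, 700, 750],
    "interest_rate": [9.5, 8.2, 7.4, 6.1, 5.3],
})

print(df.mean())       # central tendency of each attribute
print(df.std())        # spread of each attribute
print(df.corr())       # pairwise correlation between attributes
print(df.describe())   # compact summary: count, mean, std, quartiles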
‱ Exploratory visualization:
The process of expressing data in visual coordinates enables users to find patterns and
relationships in the data and to comprehend large datasets. Like descriptive
statistics, visualization is integral to the pre- and post-processing steps of data science.
‱ Dimensional Slicing:
 Online analytical processing (OLAP) applications are widespread in organizations.
 They mainly provide information on the data through dimensional slicing, filtering,
and pivoting.
 OLAP analysis is enabled by a unique database schema design where the data are
organized as dimensions (e.g., products, regions, dates) and quantitative facts or
measures (e.g., revenue, quantity).
 With a well-defined database structure, it is easy to slice the yearly revenue by
product or by a combination of region and product, for example.
 These techniques are extremely useful and may reveal patterns in the data.
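The pandas sketch below mimics this kind of dimensional slicing on a small, invented sales table (a real OLAP deployment would run against a data warehouse):

import pandas as pd

# Hypothetical fact table: each row is a sale with dimensions (region, product, year)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "revenue": [100, 150, 120, 90, 130, 140],
})

# Slice yearly revenue by product (rows = year, columns = product)
print(pd.pivot_table(sales, values="revenue", index="year",
                     columns="product", aggfunc="sum"))

# Slice by a combination of region and product
print(pd.pivot_table(sales, values="revenue", index=["region", "product"],
                     columns="year", aggfunc="sum"))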
Hypothesis Testing:
‱ Hypothesis testing is a form of statistical testing.
‱ In statistics, a hypothesis is a statement about a population that we
want to verify based on information contained in the sample data.
‱ In general, data science is a process where many hypotheses are
generated and tested based on observational data.
‱ Since data science algorithms are iterative, solutions can be
refined at each step.
Steps usually followed in hypothesis testing are:
1. Figure out the null hypothesis,
2. State the null hypothesis,
3. Choose what kind of test we need to perform,
4. Either support or reject the null hypothesis.
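A small example of these steps using a one-sample t-test (SciPy assumed; the sample values are invented):

from scipy import stats

# Null hypothesis: the population mean interest rate is 7.0%
sample_rates = [9.5, 8.2, 7.4, 6.1, 5.3, 7.8, 6.9]

t_stat, p_value = stats.ttest_1samp(sample_rates, popmean=7.0)
print("t statistic:", t_stat, "p-value:", p_value)

# Support or reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")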
Data engineering:
 Data engineering is the process of sourcing, organizing, assembling, storing,
and distributing data for effective analysis and usage.
 Database engineering, distributed storage, computing frameworks (e.g.,
Apache Hadoop, Spark, Kafka), parallel computing, extraction, transformation,
and loading (ETL) processes, and data warehousing constitute data engineering
techniques.
 Data engineering helps source and prepare data for data science learning
algorithms.
Business Intelligence:
 Business intelligence helps organizations consume data effectively.
 It supports ad hoc querying of the data and uses dashboards or visualizations to
communicate facts and trends.
 Historical trends are usually reported, but in combination with data science,
both the past and the predicted future data can be combined. BI can hold and
distribute the results of data science.
DATA SCIENCE CLASSIFICATION
‱ Data science problems can be broadly categorized into
supervised or unsupervised learning models.
‱ Supervised or directed data science tries to infer a function or
relationship based on labelled training data and uses this
function to map new unlabelled data.
‱ The model generalizes the relationship between the input and
output variables and uses it to predict for a dataset where
only input variables are known. The output variable that is
being predicted is also called a class label or target variable.
‱ Unsupervised or undirected data science uncovers hidden patterns in unlabelled data.
‱ In unsupervised data science, there are no output variables to predict. The objective
of this class of data science techniques is to find patterns in data based on the
relationships between the data points themselves.
Data science problems can also be classified into tasks such as:
1. Classification
2. Regression
3. Association Analysis
4. Clustering
5. Anomaly Detection
6. Recommendation Engines
7. Feature Selection
8. Time Series Forecasting
9. Deep Learning
10. Text Mining.
‱ Classification and regression
techniques predict a target
variable based on input variables.
The prediction is based on a
generalized model built from a
previously known dataset.
‱ In regression tasks, the output
variable is numeric (e.g., the
mortgage interest rate on a loan).
‱ Classification tasks predict output
variables, which are categorical or
polynomial (e.g., the yes or no
decision to approve a loan).
Predict whether a customer is eligible for a loan. Predict the price of a car.
Predict whether the Indian team will win or lose. Predict the weather for the next 24 hours.
‱ Clustering is the process of identifying the natural groupings
in a dataset. For example, clustering is helpful in finding
natural clusters in customer datasets, which can be used for
market segmentation.
‱ Since this is unsupervised data science, it is up to the end user
to investigate why these clusters are formed in the data and
generalize the uniqueness of each cluster.
‱ Deep Learning is based on artificial neural networks used for
classification and regression problems.
‱ In retail analytics, it is common to identify pairs of items that
are purchased together, so that specific items can be bundled
or placed next to each other. This task is called market basket
analysis or association analysis, which is commonly used in
cross-selling.
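A simplified association-analysis sketch in plain Python, counting how often pairs of items are purchased together (real market basket analysis would use dedicated algorithms such as Apriori or FP-Growth; the transactions are invented):

from itertools import combinations
from collections import Counter

# Hypothetical transactions: each set is the contents of one basket
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # most frequently co-purchased pairs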
‱ Recommendation engines are the systems that recommend
items to the users based on individual user preference.
‱ Anomaly or outlier detection identifies the data points that
are significantly different from other data points in a dataset.
Credit card transaction fraud detection is one of the most
prolific applications of anomaly detection.
‱ Time series forecasting is the process of predicting the future
value of a variable (e.g., temperature) based on past historical
values that may exhibit a trend and seasonality.
‱ Text mining is a data science application where the input data
is text, which can be in the form of documents, messages,
emails, or web pages.
‱ To aid the data science on text data, the text files are first
converted into document vectors where each unique word is
an attribute.
‱ Once the text file is converted to document vectors, standard
data science tasks such as classification, clustering, etc., can be
applied.
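For instance, scikit-learn's CountVectorizer converts a few short texts into document vectors where each unique word becomes an attribute (the documents below are made up):

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the loan was approved quickly",
    "loan application was rejected",
    "quick approval for the new loan",
]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # rows = documents, columns = unique words

print(vectorizer.get_feature_names_out())          # the word attributes
print(doc_vectors.toarray())                       # word counts per document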
‱ Feature selection is a process in which the attributes in a dataset
are reduced to the few attributes that really matter.
Data Science Process
‱ The methodical discovery of useful relationships and patterns
in data is enabled by a set of iterative activities collectively
known as the data science process.
‱ The standard data science process involves
o understanding the problem,
o preparing the data samples,
o developing the model,
o applying the model to a dataset,
o deploying and maintaining the models.
‱ One of the most popular data science process frameworks is the
Cross Industry Standard Process for Data Mining (CRISP-DM).
‱ This framework was developed by a consortium of companies
involved in data mining.
‱ The CRISP-DM process is the most widely adopted framework
for developing data science solutions.
Fig. 2.1 provides a visual overview of the CRISP-DM framework.
‱ The problem at hand could be a segmentation of customers, a
prediction of climate patterns, or a simple data exploration.
‱ The learning algorithm used to solve the business question
could be a decision tree, an artificial neural network, or a
scatterplot.
‱ The software tool to develop and implement the data science
algorithm used could be custom coding, RapidMiner, R, Weka,
SAS, Oracle Data Miner, Python, etc., (Piatetsky, 2018) to
mention a few.
2.1 PRIOR KNOWLEDGE
‱ The prior knowledge step in the data science process helps to define what problem is being
solved, how it fits in the business context, and what data is needed in order to solve the
problem.
– Objective
‱ The data science process starts with a need for analysis, a question, or a business
objective. This is possibly the most important step in the data science process
(Shearer, 2000). Without a well-defined statement of the problem, it is impossible
to come up with the right dataset and pick the right data science algorithm.
– Subject Area
‱ The process of data science uncovers hidden patterns in the dataset by exposing
relationships between attributes. But the problem is that it uncovers a lot of
patterns. The false or spurious signals are a major problem in the data science
process. It is up to the practitioner to sift through the exposed patterns and accept
the ones that are valid and relevant to the answer of the objective question. Hence,
it is essential to know the subject matter, the context, and the business process
generating the data.
‱ Data
– Similar to the prior knowledge in the subject area, prior knowledge in the data can also
be gathered.
– Understanding how the data is collected, stored, transformed, reported, and used is
essential to the data science process.
– There are quite a range of factors to consider: quality of the data, quantity of data,
availability of data, gaps in data, does lack of data compel the practitioner to change the
business question, etc.
– The objective of this step is to come up with a dataset to answer the business question
through the data science process.
– It is critical to recognize that an inferred model is only as good as the data used to
create it.
‱ A dataset (example set) is a collection of data with a well-defined structure.
This structure is also sometimes referred to as a “data frame”.
‱ A data point (record, object, or example) is a single instance in the dataset.
Each row in the table is a data point. Each instance has the same
structure as the dataset.
‱ An attribute (feature, input, dimension, variable, or predictor) is a single
property of the dataset. Each column in the table is an attribute.
‱ Attributes can be numeric, categorical, date-time, text, or Boolean data
types. In this example, both the credit score and the interest rate are
numeric attributes.
‱ A label (class label, output, prediction, target, or response) is the special
attribute to be predicted based on all the input attributes. In the table, the
interest rate is the output variable.
‱ Identifiers are special attributes that are used for locating or providing
context to individual records. For example, common attributes like names,
account numbers, and employee ID numbers are identifier attributes.
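These terms can be mapped onto a small hypothetical loan table (the column names and values are invented for illustration):

import pandas as pd

# Dataset (example set): a collection of data points with a defined structure
dataset = pd.DataFrame({
    "borrower_id":   [101, 102, 103],            # identifier attribute
    "credit_score":  [500, 650, 750],            # numeric input attribute
    "income_level":  ["low", "medium", "high"],  # categorical input attribute
    "interest_rate": [9.5, 7.4, 5.3],            # label / target attribute to be predicted
})

print(dataset.dtypes)    # data type of each attribute (column)
print(dataset.iloc[0])   # one data point (row) with the same structure as the dataset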
2.2 DATA PREPARATION
‱ Preparing the dataset to suit a data science task is the most
time-consuming part of the process.
‱ It is extremely rare that datasets are available in the form
required by the data science algorithms.
‱ Most of the data science algorithms would require data to be
structured in a tabular format with records in the rows and
attributes in the columns.
‱ If the data is in any other format, the data would need to be
transformed by applying pivot, type conversion, join, or
transpose functions, etc., to condition the data into the
required structure.
2.2.1 Data Exploration
‱ Data exploration, also known as exploratory data analysis, provides a set
of simple tools to achieve basic understanding of the data.
‱ Data exploration approaches involve computing descriptive statistics and
visualization of data.
‱ They can expose the structure of the data, the distribution of the values,
and the presence of extreme values, and highlight the inter-relationships
within the dataset.
‱ Descriptive statistics like mean, median, mode, standard deviation, and
range for each attribute provide an easily readable summary of the key
characteristics of the distribution of the data.
2.2.2 Data Quality
‱ Data quality is an ongoing concern wherever data is collected,
processed, and stored.
‱ Organizations use data alerts, cleansing, and transformation
techniques to improve and manage the quality of the data and
store them in companywide repositories called data warehouses.
‱ Data sourced from well-maintained data warehouses have higher
quality, as there are proper controls in place to ensure a level of
data accuracy for new and existing data.
‱ The data cleansing practices include elimination of duplicate
records, quarantining outlier records that exceed the bounds,
standardization of attribute values, substitution of missing values,
etc.
2.2.3 Missing Values
‱ One of the most common data quality issues is that some records have missing attribute
values.
‱ For example, a credit score may be missing in one of the records. There are several
different mitigation methods to deal with this problem, but each method has pros and
cons. The first step of managing missing values is to understand the reason behind why
the values are missing. Tracking the data lineage (provenance) of the data source can
lead to the identification of systemic issues during data capture or errors in data
transformation.
‱ Knowing the source of a missing value will often guide which mitigation methodology to
use. The missing value can be substituted with a range of artificial data so that the issue
can be managed with marginal impact on the later steps in the data science process.
‱ Missing credit score values can be replaced with a credit score derived from the dataset
(mean, minimum, or maximum value, depending on the characteristics of the attribute).
This method is useful if the missing values occur randomly and the frequency of
occurrence is quite rare.
‱ Alternatively, to build the representative model, all the data records with missing values
or records with poor data quality can be ignored. This method reduces the size of the
dataset.
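Both mitigation options can be sketched with pandas (the records are invented):

import pandas as pd

df = pd.DataFrame({"credit_score": [500, 650, None, 750],
                   "interest_rate": [9.5, 7.4, 6.8, 5.3]})

# Option 1: substitute the missing credit score with a value derived from the dataset
df_imputed = df.fillna({"credit_score": df["credit_score"].mean()})

# Option 2: ignore records with missing values (reduces the size of the dataset)
df_dropped = df.dropna()

print(df_imputed)
print(df_dropped)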
2.2.4 Data Types and Conversion
‱ The attributes in a dataset can be of different types, such as continuous numeric
(interest rate), integer numeric (credit score), or categorical. For example, the
credit score can be expressed as categorical values (poor, good, excellent) or
numeric score.
‱ Different data science algorithms impose different restrictions on the attribute
data types.
‱ In the case of linear regression models, the input attributes have to be numeric. If
the available data are categorical, they must be converted to continuous numeric
attributes.
‱ A specific numeric score can be encoded for each category value, such as poor = 400,
good = 600, excellent = 700, etc.
‱ Similarly, numeric values can be converted to categorical data types by a
technique called binning, where a range of values are specified for each category,
for example, a score between 400 and 500 can be encoded as “low” and so on.
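Both conversions can be sketched with pandas (the scores, bin edges, and encodings below are illustrative, not prescribed):

import pandas as pd

df = pd.DataFrame({"credit_category": ["poor", "good", "excellent", "good"],
                   "credit_score":    [420, 610, 710, 580]})

# Categorical -> numeric: encode a numeric score for each category value
category_to_score = {"poor": 400, "good": 600, "excellent": 700}
df["encoded_score"] = df["credit_category"].map(category_to_score)

# Numeric -> categorical: binning a range of values into labeled categories
df["score_bin"] = pd.cut(df["credit_score"],
                         bins=[400, 500, 650, 800],
                         labels=["low", "medium", "high"])

print(df)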
‱ 2.2.5 Transformation
‱ In some data science algorithms like k-NN, the input attributes are
expected to be numeric and normalized, because the algorithm
compares the values of different attributes and calculates
distance between the data points.
‱ Normalization prevents one attribute from dominating the distance
results because of large values. For example, consider income
(expressed in USD, in thousands) and credit score (in hundreds).
‱ The distance calculation will always be dominated by slight
variations in income.
‱ One solution is to convert the ranges of income and credit score to
a more uniform scale from 0 to 1 by normalization. This way, a
consistent comparison can be made between the two
attributes with different units.
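A minimal min-max normalization sketch with pandas (the income and credit score values are invented):

import pandas as pd

df = pd.DataFrame({"income": [40_000, 75_000, 120_000],   # USD
                   "credit_score": [500, 650, 750]})

# Rescale each attribute to the range 0-1 so that neither attribute
# dominates a distance calculation (e.g., in k-NN)
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)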
2.2.6 Outliers
‱ Outliers are anomalies in a given dataset.
‱ Outliers may occur because of correct data capture (few
people with income in tens of millions) or erroneous data
capture (human height as 1.73 cm instead of 1.73 m).
‱ Regardless, the presence of outliers needs to be understood
and will require special treatments.
‱ The purpose of creating a representative model is to generalize
a pattern or a relationship within a dataset and the presence
of outliers skews the representativeness of the inferred model.
‱ Detecting outliers may be the primary purpose of some data
science applications, like fraud or intrusion detection.
2.2.7 Feature Selection
Reducing the number of attributes, without significant loss in the
performance of the model, is called feature selection. It leads to
a more simplified model and helps to synthesize a more effective
explanation of the model.
2.2.8 Data Sampling
Sampling is a process of selecting a subset of records as a
representation of the original dataset for use in data analysis or
modeling. The sample data serve as a representative of the
original dataset with similar properties, such as a similar mean.
Sampling reduces the amount of data that needs to be processed
and speeds up the model-building process.
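For example, with pandas a random sample can be drawn directly from a data frame (the data below are synthetic):

import pandas as pd

df = pd.DataFrame({"credit_score": range(500, 800, 3)})   # 100 hypothetical records

# Draw a 10% random sample as a representative subset of the original data
sample = df.sample(frac=0.1, random_state=42)

print(len(df), "records in the original dataset")
print(len(sample), "records in the sample; sample mean:", sample["credit_score"].mean())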
2.3 Model
A model is the abstract representation of the data and the relationships in a
given dataset. A simple rule of thumb like “mortgage interest rate reduces
with increase in credit score” is a model; although there is not enough
quantitative information to use in a production scenario, it provides
directional information by abstracting the relationship between credit score
and interest rate. There are a few hundred data science algorithms in use
today, derived from statistics, machine learning, pattern recognition, and the
body of knowledge related to computer science.
2.3.1 Training and Testing Datasets
The modeling step creates a representative model inferred from
the data. The dataset used to create the model, with known
attributes and target, is called the training dataset. The validity
of the created model will also need to be checked with another
known dataset called the test dataset or validation dataset. To
facilitate this process, the overall known dataset can be split into
a training dataset and a test dataset. A standard rule of thumb is
that two-thirds of the data are used for training and one-third as
the test dataset.
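A minimal split along these lines with scikit-learn (synthetic records; the 1/3 test size mirrors the rule of thumb above):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)   # 15 hypothetical records with 2 attributes
y = np.arange(15)                  # known target values

# Two-thirds of the data for training, one-third held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

print(len(X_train), "training records,", len(X_test), "test records")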
2.3.2 Learning Algorithms
The business question and the availability of data dictate
which data science task (association, classification, regression,
etc.) can be used. The practitioner then determines the
appropriate data science algorithm within the chosen category.
For example, within a classification task many algorithms can be
chosen from: decision trees, rule induction, neural networks,
Bayesian models, k-NN, etc. Likewise, within decision tree
techniques, there are quite a number of variations of learning
algorithms, such as classification and regression tree (CART) and the CHi-
squared Automatic Interaction Detector (CHAID).
2.3.3 Evaluation of the Model
A model should not memorize and output the same values that
are in the training records. The phenomenon of a model
memorizing the training data is called overfitting. An overfitted
model just memorizes the training records and will
underperform on real unlabeled new data. The model should
generalize or learn the relationship between credit score and
interest rate. To evaluate this relationship, the validation or test
dataset, which was not previously used in building the model, is
used for evaluation.
2.3.4 Ensemble Modeling
Ensemble modeling is a process where multiple diverse base
models are used to predict an outcome. The motivation for using
ensemble models is to reduce the generalization error of the
prediction.
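A common ensemble sketch uses a random forest, which combines the votes of many diverse decision trees (scikit-learn and its bundled Iris data assumed):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

# An ensemble of 100 trees; combining their predictions reduces
# the generalization error of any single tree
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))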
2.4 APPLICATION
Deployment is the stage at which the model becomes
production ready or live. In business applications, the results of
the data science process have to be assimilated into the business
process—usually in software applications. The model
deployment stage has to deal with: assessing model readiness,
technical integration, response time, model maintenance, and
assimilation.
2.4.1 Production Readiness
The production readiness part of the deployment determines the critical qualities required for
the deployment objective.
2.4.2 Technical Integration
Technical integration in the data science process involves integrating various technologies, tools,
and platforms to facilitate and streamline each stage of the process. Here's how technical
integration can be applied at each step:
2.4.3 Response Time
2.4.4 Model Refresh
2.4.5 Assimilation
Integrating these tools and technologies ensures an efficient workflow, enabling data scientists to focus on
extracting insights and building robust models.
2.5 KNOWLEDGE
‱ The data science process provides a framework to extract nontrivial information
from data. With the advent of massive storage, increased data collection, and
advanced computing paradigms, the available datasets to be utilized are only
increasing.
‱ To extract knowledge from these massive data assets, advanced approaches need to
be employed, like data science algorithms, in addition to standard business
intelligence reporting or statistical analysis.
‱ Data science, like any other technology, provides various options in terms of
algorithms and parameters within the algorithms. Using these options to extract the
right information from data is a bit of an art and can be developed with practice.
‱ The data science process starts with prior knowledge and ends with posterior
knowledge, which is the incremental insight gained.
‱ It is the difference between the information gained through the data science
process and the insights from basic data analysis. Finally, the whole data science
process is a framework to invoke the right questions (Chapman et al., 2000) and
provide guidance, through the right approaches, to solve a problem.
Data Exploration
‱ Data exploration can be broadly classified into two types—
descriptive statistics and data visualization.
‱ Descriptive statistics is the process of condensing key
characteristics of the dataset into simple numeric metrics.
‱ Some of the common quantitative metrics used are mean,
standard deviation, and correlation.
‱ Visualization is the process of projecting the data, or parts of
it, into multi-dimensional space or abstract images. All the
useful (and adorable) charts fall under this category.
‱ Data exploration in the context of data science uses both
descriptive statistics and visualization techniques.
OBJECTIVES OF DATA EXPLORATION
‱ Data understanding
‱ Data preparation
‱ Data science tasks
‱ Interpreting the results
Types of Data
‱ Numeric or Continuous
‱ Categorical or Nominal
UNIVARIATE ANALYSIS
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”,
so in other words the data has only one variable. It doesn’t deal with causes
or relationships (unlike regression) and its major purpose is to describe: it
takes data, summarizes that data, and finds patterns in the data.
Ways to describe patterns found in univariate data
1. Central tendency
1. Mean
2. Mode
3. Median
2. Dispersion
1. Range
2. Variance
3. maximum, minimum,
4. Quartiles (including the interquartile range), and
5. Standard deviation
3. Count /Null count
Multivariate Exploration
‱ Multivariate exploration is the study of more than one attribute in the dataset
simultaneously. This technique is critical to understanding the relationship
between the attributes, which is central to data science methods.
‱ Central Data
‱ In the Iris dataset, each data point can be expressed as a set of all four
attributes: observation = {sepal length, sepal width, petal length, petal width}.
‱ For example, observation one: {5.1, 3.5, 1.4, 0.2}. This observation point can
also be expressed in four-dimensional Cartesian coordinates and can be
plotted in a graph (although plotting more than three dimensions in a visual
graph can be challenging). In this way, all 150 observations can be expressed
in Cartesian coordinates. If the objective is to find the most “typical”
observation point, it would be a data point made up of the mean of each
attribute in the dataset taken independently. For the Iris data shown in the
referenced figure, the central mean point is {5.006, 3.418, 1.464, 0.244}. This
data point may not be an actual observation. It will be a hypothetical data
point with the most typical attribute values.
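The central point can be computed directly from the data, for example with scikit-learn's bundled copy of the Iris dataset (the resulting means depend on which observations are included, so they may differ from the values quoted above, which may have been computed on a subset of the data):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data   # 150 observations x 4 attributes

# The most "typical" observation: the mean of each attribute taken independently.
# This central point is hypothetical and need not match any actual observation.
central_point = X.mean(axis=0)
print(dict(zip(iris.feature_names, np.round(central_point, 3))))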
Correlation
‱ Correlation measures the statistical relationship between two attributes,
particularly dependence of one attribute on another attribute.
‱ When two attributes are highly correlated with each other, they both
vary at the same rate with each other either in the same or in opposite
directions.
‱ For example, consider average temperature of the day and ice cream
sales. Statistically, the two attributes that are correlated are dependent
on each other and one may be used to predict the other. If there are
sufficient data, future sales of ice cream can be predicted if the
temperature forecast is known. However, correlation between two
attributes does not imply causation, that is, one doesn’t necessarily cause
the other. The ice cream sales and the shark attacks are correlated,
however there is no causation. Both ice cream sales and shark attacks are
influenced by a third attribute—the summer season. Generally, ice
cream sales spike as temperatures rise. As more people go to beaches
during summer, encounters with sharks become more probable.
DATA VISUALIZATION
‱ Visualizing data is one of the most important techniques of data
discovery and exploration.
‱ Data visualization is the discipline of trying to understand data by placing
it in a visual context so that patterns, trends and correlations that might
not otherwise be detected can be exposed.
‱ Vision is one of the most powerful senses in the human body. As such, it
is intimately connected with cognitive thinking. Human vision is trained
to discover patterns and anomalies even in the presence of a large
volume of data. However, the effectiveness of the pattern detection
depends on how effectively the information is visually presented. Hence,
selecting suitable visuals to explore data is critically important in
discovering and comprehending hidden patterns in the data.
‱ As with descriptive statistics, visualization techniques are also
categorized into: univariate visualization, multivariate visualization and
visualization of a large number of attributes using parallel dimensions.
Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The
techniques discussed in this section give an idea of how the attribute values are distributed and
the shape of the distribution.
Histogram
‱ A histogram is one of the most basic visualization techniques to understand the frequency
of the occurrence of values.
‱ It shows the distribution of the data by plotting the frequency of occurrence in a range.
‱ In a histogram, the attribute under inquiry is shown on the horizontal axis and the
frequency of occurrence is on the vertical axis.
‱ For a continuous numeric data type, the range or binning value to group a range of values
need to be specified.
‱ For example, in the case of human height in centimetres, all the occurrences between
152.00 and 152.99 are grouped under 152.
‱ There is no optimal number of bins or bin width that works for all the distributions. If the
bin width is too small, the distribution becomes more precise but reveals the noise due to
sampling.
‱ A general rule of thumb is to have a number of bins equal to the square root or cube root of
the number of data points.
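A matplotlib sketch of a histogram using the square-root rule of thumb (the height values are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=400)   # hypothetical heights in cm

# Rule of thumb: number of bins ~ square root of the number of data points
n_bins = int(np.sqrt(len(heights)))

plt.hist(heights, bins=n_bins)
plt.xlabel("Height (cm)")   # attribute under inquiry on the horizontal axis
plt.ylabel("Frequency")     # frequency of occurrence on the vertical axis
plt.show()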
Quartile
‱ A quartile is a statistical term that describes a division of observations into
four defined intervals based on the values of the data and how they compare
to the entire set of observations.
‱ A quartile divides data into three points—a lower quartile, median, and
upper quartile—to form four groups of the dataset.
‱ The lower quartile, or first quartile, is denoted as Q1 and is the middle
number that falls between the smallest value of the dataset and the median.
The second quartile, Q2, is also the median. The upper or third quartile,
denoted as Q3, is the central point that lies between the median and the
highest number of the distribution.
‱ Each quartile contains 25% of the total observations. Generally, the data is
arranged from smallest to largest:
 First quartile: the lowest 25% of numbers
 Second quartile: between 25.1% and 50% (up to the median)
 Third quartile: 50.1% to 75% (above the median)
 Fourth quartile: the highest 25% of numbers
Suppose the distribution of math scores in a class of 19 students in ascending order is:
59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
First, mark down the median, Q2, which in this case is the 10th value: 75. Q1 is the central point
between the smallest score and the median, that is, the median of the lower half of the scores.
In this case, Q1 is the fifth score: 68. (Note that the median can also be
included when calculating Q1 or Q3 for an odd set of values.
If we were to include the median on either side of the middle point, then Q1 will be the middle
value between the first and 10th score, which is the average of the fifth and sixth score—(fifth +
sixth)/2 = (68 + 69)/2 = 68.5).
Q3 is the middle value between Q2 and the highest score: 84. (Or if you include the median, Q3 =
(82 + 84)/2 = 83).
Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) represents the first
quartile and is the 25th percentile. 68 is the median of the lower half of the score set in the
available data—that is, the median of the scores from 59 to 75.
Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the
median) is the 50th percentile and shows that 50% of the scores are less than 75, and 50% of the
scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater and
75% are less than 84.
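The same scores can be handed to NumPy; note that NumPy's default percentile method interpolates between data points, so Q1 and Q3 match the include-the-median variants above (68.5 and 83) rather than 68 and 84:

import numpy as np

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

# 25th, 50th, and 75th percentiles correspond to Q1, Q2 (median), and Q3
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)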
Box plots
‱ In descriptive statistics, a box plot or boxplot (also known as a box
and whisker plot) is a type of chart often used in exploratory data
analysis. Box plots visually show the distribution of numerical data
and skewness by displaying the data quartiles (or percentiles)
and averages.
‱ Box plots show the five-number summary of a set of data: including
the minimum score, first (lower) quartile, median, third (upper)
quartile, and maximum score.
‱ Minimum Score
The lowest score, excluding outliers (shown at the end of the left whisker).
‱ Lower Quartile
Twenty-five percent of scores fall below the lower quartile value (also known as the first
quartile).
‱ Median
The median marks the mid-point of the data and is shown by the line that divides the box into
two parts (sometimes known as the second quartile). Half the scores are greater than or equal
to this value and half are less.
‱ Upper Quartile
Seventy-five percent of the scores fall below the upper quartile value (also known as the third
quartile). Thus, 25% of data are above this value.
‱ Maximum Score
The highest score, excluding outliers (shown at the end of the right whisker).
‱ Whiskers
The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of
scores and the upper 25% of scores).
‱ The Interquartile Range (or IQR)
This is the box of the box plot, showing the middle 50% of scores (i.e., the range between the 25th and
75th percentiles).
Distribution Chart
‱ For continuous numeric attributes like petal length, instead of
visualizing the actual data in the sample, its normal
distribution function can be visualized instead. The normal
distribution function of a continuous random variable x is
f(x) = (1 / (σ√(2π))) e^(−(x − ÎŒ)² / (2σ²)),
‱ where ÎŒ is the mean of the distribution and σ is the standard
deviation of the distribution. Here an inherent assumption is
being made that the measurements of petal length (or any
continuous variable) follow the normal distribution, and
hence, its distribution can be visualized instead of the actual
values. The normal distribution is also called the Gaussian
distribution or “bell curve” due to its bell shape.
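A quick way to visualize this bell curve is to evaluate the formula above on a grid of values (NumPy and matplotlib assumed; the mean and standard deviation below are placeholders, not the actual Iris petal length statistics):

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 3.8, 1.8   # assumed mean and standard deviation of petal length (cm)

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
pdf = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

plt.plot(x, pdf)
plt.xlabel("Petal length (cm)")
plt.ylabel("Probability density")
plt.title("Normal (Gaussian) distribution")
plt.show()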
Multivariate Visualization
‱ The multivariate visual exploration considers more than one attribute
in the same visual. The techniques discussed in this section focus on
the relationship of one attribute with another attribute. The
visualizations examine two to four attributes simultaneously.
‱ Scatterplot
A scatterplot is one of the most powerful yet simple visual plots available.
In a scatterplot, the data points are marked in Cartesian space with
attributes of the dataset aligned with the coordinates. The attributes are
usually of continuous data type.
One of the key observations that can be concluded from a scatterplot is
the existence of a relationship between two attributes under inquiry. If
the attributes are linearly correlated, then the data points align closer to
an imaginary straight line; if they are not correlated, the data points are
scattered. Apart from basic correlation, scatterplots can also indicate the
existence of patterns or groups of clusters in the data and identify outliers
in the data. This is particularly useful for low-dimensional datasets.
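A basic scatterplot of two Iris attributes with matplotlib (scikit-learn's bundled Iris data assumed; points are colored by species):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]   # petal length
y = iris.data[:, 3]   # petal width

# Correlated attributes align close to an imaginary straight line;
# coloring by species also reveals groupings in the data
plt.scatter(x, y, c=iris.target)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()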
Scatter Multiple
‱ If the dataset has more than two attributes, it is important to look at combinations of
all the attributes through a scatterplot. A scatter matrix solves this need by comparing
all combinations of attributes with individual scatterplots and arranging these plots in a
matrix.
‱ A scatter matrix for all four attributes in the Iris dataset is shown in Fig. The color of the
data point is used to indicate the species of the flower. Since there are four attributes,
there are four rows and four columns, for a total of 16 scatter charts. Charts in the
diagonal are a comparison of the attribute with itself; hence, they are eliminated. Also,
the charts below the diagonal are mirror images of the charts above the diagonal. In
effect, there are six distinct comparisons in scatter multiples of four attributes. Scatter
matrices provide an effective visualization of comparative, multivariate, and high-density
data displayed in small multiples of similar scatterplots.
Bubble chart
A bubble chart is a variation of a simple scatterplot with the
addition of one more attribute, which is used to determine the
size of the data point. In the Iris dataset, petal length and petal
width are used for the x- and y-axes, respectively, and sepal width is
used for the size of the data point. The color of the data point
represents the species class label.
Density charts
Density charts are similar to scatterplots, with one more
dimension included as a background color. The data points can
also be colored to visualize one dimension, and hence, a total
of four dimensions can be visualized in a density chart. In the example
in Fig. 3.14, petal length is used for the x-axis, sepal length for
the y-axis, sepal width for the background color, and the class label
for the data point color.

More Related Content

PDF
Lesson1.2.pptx.pdf
PPTX
Predictive analytics
PPTX
7.-Data-Analytics.pptx
PPTX
Introduction to data science
PPTX
Ml leaning this ppt display number of mltypes.pptx
PPTX
MA- UNIT -1.pptx for ipu bba sem 5, complete pdf
PPTX
Fundamentals of Analytics and Statistic (1).pptx
PPTX
Data Science and Analytics Lesson 1.pptx
Lesson1.2.pptx.pdf
Predictive analytics
7.-Data-Analytics.pptx
Introduction to data science
Ml leaning this ppt display number of mltypes.pptx
MA- UNIT -1.pptx for ipu bba sem 5, complete pdf
Fundamentals of Analytics and Statistic (1).pptx
Data Science and Analytics Lesson 1.pptx

Similar to data science, prior knowledge ,modeling, scatter plot (20)

PPTX
Data Science and Analysis.pptx
PPTX
Data Science topic and introduction to basic concepts involving data manageme...
PPTX
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
PPTX
Lesson 5- Data Analysinbvs Techniques.pptx
PPTX
Lecturer3 by RamaKrishna SRU waranagal telanga
PPTX
Introduction to Data Analytics
PPTX
INTRODUCTION TO DESCRIPTIVE ANALYTICS.pptx
PPTX
Data Analytics for UG students - What is data analytics and its importance
PPTX
Unit2.pptx Statistical Interference and Exploratory Data Analysis
PPTX
Unit-V-Introduction to Data Mining.pptx
PDF
Data analysis
PDF
Data Analysis, data types and interpretation.pdf
PDF
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
PPTX
Additional themes of data mining for Msc CS
PDF
Introduction to Data Analysis for researcher.pdf
PPTX
Data mining
PPTX
Data mining
PPTX
Data Processing & Explain each term in details.pptx
PDF
Understanding the Step-by-Step Data Science Process for Beginners | IABAC
 
Data Science and Analysis.pptx
Data Science topic and introduction to basic concepts involving data manageme...
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 5- Data Analysinbvs Techniques.pptx
Lecturer3 by RamaKrishna SRU waranagal telanga
Introduction to Data Analytics
INTRODUCTION TO DESCRIPTIVE ANALYTICS.pptx
Data Analytics for UG students - What is data analytics and its importance
Unit2.pptx Statistical Interference and Exploratory Data Analysis
Unit-V-Introduction to Data Mining.pptx
Data analysis
Data Analysis, data types and interpretation.pdf
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
Additional themes of data mining for Msc CS
Introduction to Data Analysis for researcher.pdf
Data mining
Data mining
Data Processing & Explain each term in details.pptx
Understanding the Step-by-Step Data Science Process for Beginners | IABAC
 
Ad

Recently uploaded (20)

PDF
top salesforce developer skills in 2025.pdf
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PPTX
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
Softaken Excel to vCard Converter Software.pdf
PPTX
Introduction to Artificial Intelligence
PDF
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
iTop VPN Free 5.6.0.5262 Crack latest version 2025
PPTX
Computer Software and OS of computer science of grade 11.pptx
PPTX
Reimagine Home Health with the Power of Agentic AI​
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PPTX
Why Generative AI is the Future of Content, Code & Creativity?
PDF
medical staffing services at VALiNTRY
PPTX
Transform Your Business with a Software ERP System
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
top salesforce developer skills in 2025.pdf
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Odoo Companies in India – Driving Business Transformation.pdf
Softaken Excel to vCard Converter Software.pdf
Introduction to Artificial Intelligence
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
How to Choose the Right IT Partner for Your Business in Malaysia
iTop VPN Free 5.6.0.5262 Crack latest version 2025
Computer Software and OS of computer science of grade 11.pptx
Reimagine Home Health with the Power of Agentic AI​
Upgrade and Innovation Strategies for SAP ERP Customers
Why Generative AI is the Future of Content, Code & Creativity?
medical staffing services at VALiNTRY
Transform Your Business with a Software ERP System
Navsoft: AI-Powered Business Solutions & Custom Software Development
Ad

data science, prior knowledge ,modeling, scatter plot

  • 2. Introduction to Data Science ‱ Data Science is a field that combines statistical methods, algorithms, and technology to extract insights from structured and unstructured data. ‱ It enables organizations to make data-driven decisions, predict trends, and improve efficiency. ‱ Data Science is a collection of techniques used to extract value from data.  Data – can be Simple array of a few numeric observations,complex matrix of millions of observations with thousands of variables.  Science – in data science indicates that the methods are evidence based, and are built on empirical knowledge, more specifically (on) historical observations.
  • 3. This discipline coexists and closely associated with ‱ database systems ‱ data engineering ‱ visualization ‱ data analysis ‱ experimentation and, ‱ business intelligence.
  • 4. Key features ‱ 1.2.1 Extracting Meaningful Patterns Data Science refers to the process of identifying and interpreting significant trends, relationships, or insights from data. This involves analyzing large datasets to discover patterns that can inform decision-making, predict future trends, or solve specific problems.
  • 5. Key Aspects of Extracting Meaningful Patterns  Data Mining: Techniques used to discover patterns in large datasets, often involving machine learning, statistical analysis, and database systems.  Statistical Analysis: Using statistical methods to identify relationships and trends within data.  Machine Learning Models: Employing algorithms to learn from data and make predictions or classify information.  Visualization Tools: Creating charts, graphs, and other visual aids to help identify patterns and make data more interpretable.  Feature Engineering: Selecting and transforming variables in the data to improve the performance of machine learning models.  Pattern Recognition: Detecting regularities and irregularities in data, which could indicate significant insights.
  • 6. Examples of Extracting Meaningful Patterns: ‱ Customer Segmentation: Analyzing purchase behavior to group customers into segments for targeted marketing. ‱ Fraud Detection: Identifying unusual transaction patterns that may indicate fraudulent activity. ‱ Predictive Maintenance: Using sensor data to predict equipment failures before they occur. ‱ Market Basket Analysis: Discovering product purchase combinations to optimize inventory and cross-selling strategies.
  • 7. ‱ 1.2.2 Building Representative Models  In statistics, a model is the representation of a relationship between variables in a dataset.  It describes how one or more variables in the data are related to other variables.  Modeling is a process in which a representative abstraction is built from the observed dataset.  For example, based on credit score, income level, and requested loan amount, a model can be developed to determine the interest rate of a loan. For this task, previously known observational data including credit score, income level, loan amount, and interest rate are needed.
  • 8. ‱ Fig. 1.3 shows the process of generating a model. Once the representative model is created, it can be used to predict the value of the interest rate, based on all the input variables.
  • 9. ‱ This model serves two purposes: – On the one hand, it predicts the output (interest rate) based on the new and unseen set of input variables (credit score, income level, and loan amount), – and on the other hand, the model can be used to understand the relationship between the output variable and all the input variables. ‱ For example, does income level really matter in determining the interest rate of a loan? Does income level matter more than credit score? What happens when income levels double or if credit score drops by 10 points? A Model can be used for both predictive and explanatory applications
  • 10. ‱ 1.2.3 Combination of Statistics, Machine Learning, and Computing Data Science refers to the integration of these three disciplines to extract meaningful insights from data, build predictive models, and implement solutions. Each of these fields contributes unique methodologies and tools, which together enable comprehensive data analysis and decision-making processes.
  • 11. ‱ 1.2.4 Learning Algorithms Data Science refers to the methods and techniques used to build models that can learn from data and make predictions or decisions. These algorithms enable machines to automatically improve their performance on a given task through experience, without being explicitly programmed for every possible scenario.
  • 12. ‱ Types of Learning Algorithms 1. Supervised Learning: Examples: ‱ Linear Regression: ‱ Logistic Regression ‱ Decision Trees ‱ Support Vector Machines (SVM) ‱ Neural Networks
  • 13. 2. Unsupervised Learning: – Examples: ‱ K-Means Clustering ‱ Hierarchical Clustering ‱ Principal Component Analysis (PCA) ‱ Anomaly Detection 3. Reinforcement Learning: – Examples: ‱ Q-Learning ‱ Deep Q-Networks (DQN) ‱ Policy Gradient Methods
  • 14. Benefits of Learning Algorithms: ‱ Scalability: Can handle large datasets and complex problems efficiently. ‱ Accuracy: Often provide high levels of predictive accuracy by leveraging vast amounts of data. ‱ Continuous Improvement: Capable of learning and improving over time with more data and feedback. ‱ Versatility: Applicable to a wide range of domains and industries, from healthcare to finance to entertainment.
  • 15. ‱ 1.2.5. Associated Fields ‱ We understood, data science covers a wide set of techniques, applications, and disciplines. ‱ There a few associated fields that data science heavily depends on. ‱ They are – Descriptive Statistics, – Exploratory Visualization, – Dimensional Slicing, – Hypothesis Testing, – Data Engineering and, – Business Intelligence.
  • 16. Descriptive statistics: ‱ Computing mean, standard deviation, correlation, and other descriptive statistics, quantify the aggregate structure of a dataset. ‱ This is essential information for understanding any dataset in order to understand the structure of the data and the relationships within the dataset. ‱ They are used in the exploration stage of the data science process
  • 17. ‱ Exploratory visualization: The process of expressing data in visual coordinates enables users to find patterns and relationships in the data and to comprehend large datasets. Similar to descriptive statistics, they are integral in the pre- and post-processing steps in data science ‱ Dimensional Slicing:  Online analytical processing (OLAP) applications, are widespread in organizations.  They mainly provide information on the data through dimensional slicing, filtering, and pivoting.  OLAP analysis is enabled by a unique database schema design where the data are organized as dimensions (e.g., products, regions, dates) and quantitative facts or measures (e.g., revenue, quantity).  With a well-defined database structure, it is easy to slice the yearly revenue by products or combination of region and products, for example.  These techniques are extremely useful and may reveal patterns in data. .
  • 18. Hypothesis Testing: ‱ It is a kind of statistical testing. ‱ In statistics, a hypothesis is a statement about a population that we want to verify based on information contained in the sample data. ‱In general, data science is a process where many hypotheses are generated and tested based on observational data. ‱Since the data science algorithms are iterative, solutions can be refined in each step. Steps usually followed in hypothesis testing are: 1. Figure out the null hypothesis, 2. State the null hypothesis, 3. Choose what kind of test we need to perform, 4. Either support or reject the null hypothesis.
  • 19. Data engineering:  Data engineering is the process of sourcing, organizing, assembling, storing, and distributing data for effective analysis and usage.  Database engineering, distributed storage, and computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel computing, extraction transformation and loading processing, and data warehousing constitute data engineering techniques.  Data engineering helps source and prepare for data science learning algorithms. Business Intelligence:  Business intelligence helps organizations consume data effectively.  It helps query the ad hoc data use dashboards or visualizations to communicate the facts and trends.  Historical trends are usually reported, but in combination with data science, both the past and the predicted future data can be combined. BI can hold and distribute the results of data science.
  • 20. DATA SCIENCE CLASSIFICATION ‱ Data science problems can be broadly categorized into supervised or unsupervised learning models. ‱ Supervised or directed data science tries to infer a function or relationship based on labelled training data and uses this function to map new unlabelled data. ‱ The model generalizes the relationship between the input and output variables and uses it to predict for a dataset where only input variables are known. The output variable that is being predicted is also called a class label or target variable.
  • 21. ‱ Unsupervised or undirected data science uncovers hidden patterns in unlabelled data. ‱ In unsupervised data science, there are no output variables to predict. The objective of this class of data science techniques, is to find patterns in data based on the relationship between data points themselves. Data science problems can also be classified into tasks such as: 1. Classification 2. Regression 3. Association Analysis 4. Clustering 5. Anomaly Detection 6. Recommendation Engines 7. Feature Selection 8. Time Series Forecasting 9. Deep Learning 10. Text Mining.
  • 23. ‱ Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset. ‱ In regression tasks, the output variable is numeric (e.g.,the mortgage interest rate on a loan). ‱ Classification tasks predict output variables, which are categorical or polynomial (e.g., the yes or no decision to approve a loan).
  • 24. Predict whether a customer is eligible for a loan? Predict the price of the car? Predict the Indian team will win or lose ? Predict weather for next 24 hours?
  • 25. ‱ Clustering is the process of identifying the natural groupings in a dataset. For example, clustering is helpful in finding natural clusters in customer datasets,which can be used for market segmentation. ‱ Since this is unsupervised datascience, it is up to the end user to investigate why these clusters are formed in the data and generalize the uniqueness of each cluster.
  • 26. ‱ Deep Learning is based on artificial neural networks used for classification and regression problems. ‱ In retail analytics,it is common to identify pairs of items that are purchased together, so that specific items can be bundled or placed next to each other. This task is called market basket analysis or association analysis, which is commonly used in cross selling. ‱ Recommendation engines are the systems that recommend items to the users based on individual user preference.
  • 27. ‱ Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset. Credit card transaction fraud detection is one of the most prolific applications of anomaly detection. ‱ Time series forecasting is the process of predicting the future value of a variable (e.g., temperature) based on past historical values that may exhibit a trend and seasonality.
  • 28. ‱ Text mining is a data science application where the input data is text, which can be in the form of documents, messages, emails, or web pages. ‱ To aid data science on text data, the text files are first converted into document vectors, where each unique word is an attribute. ‱ Once the text file is converted to document vectors, standard data science tasks such as classification, clustering, etc., can be applied. ‱ Feature selection is a process in which the attributes in a dataset are reduced to a few attributes that really matter.
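A minimal sketch of turning text into document vectors, assuming scikit-learn's CountVectorizer; the three example sentences are made up:

    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["loan approved for customer",
                 "customer requested a new loan",
                 "interest rate increased this month"]

    vectorizer = CountVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())   # each unique word becomes an attribute
    print(doc_vectors.toarray())                # one document vector per row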
  • 30. Data Science Process ‱ The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process. ‱ The standard data science process involves o understanding the problem, o preparing the data samples, o developing the model, o applying the model to a dataset, and o deploying and maintaining the models.
  • 31. ‱ One of the most popular data science process frameworks is the Cross Industry Standard Process for Data Mining (CRISP-DM). ‱ This framework was developed by a consortium of companies involved in data mining. ‱ The CRISP-DM process is the most widely adopted framework for developing data science solutions.
  • 32. Fig. 2.1 provides a visual overview of the CRISP-DM framework.
  • 33. ‱ The problem at hand could be a segmentation of customers, a prediction of climate patterns, or a simple data exploration. ‱ The learning algorithm used to solve the business question could be a decision tree, an artificial neural network, or a scatterplot. ‱ The software tool used to develop and implement the data science algorithm could be custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, or Python, to mention a few (Piatetsky, 2018).
  • 35. 2.1 PRIOR KNOWLEDGE ‱ The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem. – Objective ‱ The data science process starts with a need for analysis, a question, or a business objective. This is possibly the most important step in the data science process (Shearer, 2000). Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm. – Subject Area ‱ The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes. The problem is that it uncovers many patterns, and false or spurious signals are a major concern in the data science process. It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer of the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
  • 36. ‱ Data – Similar to the prior knowledge in the subject area, prior knowledge in the data can also be gathered. – Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. – There is quite a range of factors to consider: quality of the data, quantity of data, availability of data, gaps in the data, and whether a lack of data compels the practitioner to change the business question. – The objective of this step is to come up with a dataset to answer the business question through the data science process. – It is critical to recognize that an inferred model is only as good as the data used to create it.
  • 37. ‱ A dataset (example set) is a collection of data with a well-defined structure; this structure is also sometimes referred to as a “data frame”. ‱ A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point, and each instance contains the same structure as the dataset. ‱ An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. ‱ Attributes can be numeric, categorical, date-time, text, or Boolean data types. In this example, both the credit score and the interest rate are numeric attributes. ‱ A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In the table, the interest rate is the output variable. ‱ Identifiers are special attributes that are used for locating or providing context to individual records. For example, common attributes like names, account numbers, and employee ID numbers are identifier attributes.
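A hypothetical loan dataset expressed as a data frame, assuming pandas is available; the column names and values are invented purely for illustration:

    import pandas as pd

    # Each row is a data point; each column is an attribute.
    df = pd.DataFrame({
        "borrower_id":   [101, 102, 103],              # identifier attribute
        "credit_score":  [500, 600, 700],              # numeric input attribute
        "income_level":  ["low", "medium", "high"],    # categorical input attribute
        "interest_rate": [9.5, 7.2, 6.1],              # label / output attribute
    })

    print(df.shape)    # (number of data points, number of attributes)
    print(df.dtypes)   # data type of each attribute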
  • 39. 2.2 DATA PREPARATION ‱ Preparing the dataset to suit a data science task is the most time-consuming part of the process. ‱ It is extremely rare that datasets are available in the form required by the data science algorithms. ‱ Most data science algorithms require data to be structured in a tabular format, with records in the rows and attributes in the columns. ‱ If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition the data into the required structure.
  • 40. 2.2.1 Data Exploration ‱ Data exploration, also known as exploratory data analysis, provides a set of simple tools to achieve a basic understanding of the data. ‱ Data exploration approaches involve computing descriptive statistics and visualization of the data. ‱ They can expose the structure of the data, the distribution of the values, the presence of extreme values, and the inter-relationships within the dataset. ‱ Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of the data.
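A small sketch of data exploration with descriptive statistics, assuming pandas; the credit score and interest rate values are invented:

    import pandas as pd

    df = pd.DataFrame({"credit_score":  [500, 600, 700, 650, 580],
                       "interest_rate": [9.5, 7.2, 6.1, 6.8, 7.9]})

    print(df.describe())                                       # mean, std, min, quartiles, max per attribute
    print(df["credit_score"].median())                         # median
    print(df["credit_score"].max() - df["credit_score"].min()) # range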
  • 41. 2.2.2 Data Quality ‱ Data quality is an ongoing concern wherever data is collected, processed, and stored. ‱ Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data and store them in companywide repositories called data warehouses. ‱ Data sourced from well-maintained data warehouses have higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data. ‱ The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, substitution of missing values, etc.
  • 42. 2.2.3 Missing Values ‱ One of the most common data quality issues is that some records have missing attribute values. ‱ For example, a credit score may be missing in one of the records. There are several different mitigation methods to deal with this problem, but each method has pros and cons. The first step of managing missing values is to understand why the values are missing. Tracking the data lineage (provenance) of the data source can lead to the identification of systemic issues during data capture or errors in data transformation. ‱ Knowing the source of the missing values will often guide which mitigation methodology to use. The missing value can be substituted with artificial data so that the issue can be managed with marginal impact on the later steps in the data science process. ‱ Missing credit score values can be replaced with a credit score derived from the dataset (mean, minimum, or maximum value, depending on the characteristics of the attribute). This method is useful if the missing values occur randomly and the frequency of occurrence is quite rare. ‱ Alternatively, to build the representative model, all the data records with missing values or records with poor data quality can be ignored. This method reduces the size of the dataset.
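A sketch of the two mitigation options above (substitute the attribute mean, or ignore the record), assuming pandas and NumPy; the records are invented:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"credit_score":  [500, np.nan, 700, 650],
                       "interest_rate": [9.5, 7.2, 6.1, 6.8]})

    # Option 1: substitute the missing credit score with the attribute mean.
    imputed = df.fillna({"credit_score": df["credit_score"].mean()})

    # Option 2: ignore records with missing values (reduces the dataset size).
    dropped = df.dropna()

    print(imputed)
    print(dropped)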
  • 43. 2.2.4 Data Types and Conversion ‱ The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score. ‱ Different data science algorithms impose different restrictions on the attribute data types. ‱ In the case of linear regression models, the input attributes have to be numeric. If the available data are categorical, they must be converted to continuous numeric attributes. ‱ A specific numeric score can be encoded for each category value, such as poor = 400, good = 600, excellent = 700, etc. ‱ Similarly, numeric values can be converted to categorical data types by a technique called binning, where a range of values is specified for each category; for example, a score between 400 and 500 can be encoded as “low” and so on.
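A sketch of both conversions, assuming pandas; the category-to-score mapping and the bin boundaries are illustrative choices, not fixed rules:

    import pandas as pd

    df = pd.DataFrame({"credit_category": ["poor", "good", "excellent"],
                       "credit_score":    [420, 610, 720]})

    # Categorical to numeric: encode a representative score for each category.
    score_map = {"poor": 400, "good": 600, "excellent": 700}
    df["category_as_score"] = df["credit_category"].map(score_map)

    # Numeric to categorical: binning score ranges into labels.
    df["score_as_category"] = pd.cut(df["credit_score"],
                                     bins=[400, 500, 650, 800],
                                     labels=["low", "medium", "high"])
    print(df)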
  • 44. ‱ 2.2.5 Transformation ‱ In some data science algorithms like k-NN, the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates the distance between the data points. ‱ Normalization prevents one attribute from dominating the distance results because of large values. For example, consider income (expressed in USD, in thousands) and credit score (in hundreds). ‱ The distance calculation will always be dominated by slight variations in income. ‱ One solution is to convert the ranges of income and credit score to a more uniform scale from 0 to 1 by normalization. This way, a consistent comparison can be made between the two attributes with different units.
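A sketch of min-max normalization to the 0-1 range, assuming pandas; the income and credit score values are invented:

    import pandas as pd

    df = pd.DataFrame({"income":       [45, 120, 250, 80],     # in thousands of USD
                       "credit_score": [520, 680, 760, 600]})  # in hundreds

    # Min-max normalization rescales every attribute to the range 0 to 1.
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)   # both attributes now contribute comparably to distance calculations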
  • 45. 2.2.6 Outliers ‱ Outliers are anomalies in a given dataset. ‱ Outliers may occur because of correct data capture (a few people with incomes in the tens of millions) or erroneous data capture (human height recorded as 1.73 cm instead of 1.73 m). ‱ Regardless, the presence of outliers needs to be understood and will require special treatment. ‱ The purpose of creating a representative model is to generalize a pattern or a relationship within a dataset, and the presence of outliers skews the representativeness of the inferred model. ‱ Detecting outliers may be the primary purpose of some data science applications, like fraud or intrusion detection.
  • 46. 2.2.7 Feature Selection Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection. It leads to a more simplified model and helps to synthesize a more effective explanation of the model. 2.2.8 Data Sampling Sampling is a process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample data serve as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the build process of the model.
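A sketch of random sampling of records, assuming pandas; the dataset and the 10% sampling fraction are illustrative only:

    import pandas as pd

    df = pd.DataFrame({"credit_score":  range(1000),
                       "interest_rate": range(1000)})

    sample = df.sample(frac=0.1, random_state=42)    # keep 10% of the records
    print(len(sample))                               # 100 records
    print(sample["credit_score"].mean(), df["credit_score"].mean())  # similar means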
  • 47. 2.3 Model A model is the abstract representation of the data and the relationships in a given dataset. A simple rule of thumb like “mortgage interest rate reduces with increase in credit score” is a model; although there is not enough quantitative information to use in a production scenario, it provides directional information by abstracting the relationship between credit score and interest rate. There are a few hundred data science algorithms in use today, derived from statistics, machine learning, pattern recognition, and the body of knowledge related to computer science.
  • 48. 2.3.1 Training and Testing Datasets The modeling step creates a representative model inferred from the data. The dataset used to create the model, with known attributes and target, is called the training dataset. The validity of the created model will also need to be checked with another known dataset called the test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for training and one-third as a test dataset.
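A sketch of the two-thirds/one-third split, assuming scikit-learn and using the Iris data purely as an example dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=1)

    print(len(X_train), len(X_test))   # 100 training records, 50 test records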
  • 50. 2.3.2 Learning Algorithms The business question and the availability of data will dictate which data science task (association, classification, regression, etc.) can be used. The practitioner then determines the appropriate data science algorithm within the chosen category. For example, within a classification task, many algorithms can be chosen from: decision trees, rule induction, neural networks, Bayesian models, k-NN, etc. Likewise, within decision tree techniques, there are quite a number of variations of learning algorithms, like classification and regression tree (CART), Chi-squared Automatic Interaction Detector (CHAID), etc.
  • 51. 2.3.3 Evaluation of the Model A model should not memorize and output the same values that are in the training records. The phenomenon of a model memorizing the training data is called overfitting. An overfitted model just memorizes the training records and will underperform on real, unlabeled new data. The model should instead generalize or learn the relationship between credit score and interest rate. To evaluate this relationship, the validation or test dataset, which was not previously used in building the model, is used for evaluation.
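A sketch of checking for overfitting by comparing training and test performance, assuming scikit-learn; the decision tree and Iris data are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=1)

    model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    print("training accuracy:", model.score(X_train, y_train))  # often close to 1.0
    print("test accuracy:", model.score(X_test, y_test))        # the honest estimate of generalization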
  • 52. 2.3.4 Ensemble Modeling Ensemble modeling is a process where multiple diverse base models are used to predict an outcome. The motivation for using ensemble models is to reduce the generalization error of the prediction. 2.4 APPLICATION Deployment is the stage at which the model becomes production ready or live. In business applications, the results of the data science process have to be assimilated into the business process—usually in software applications. The model deployment stage has to deal with: assessing model readiness, technical integration, response time, model maintenance, and assimilation.
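A sketch of an ensemble of many diverse base trees (a random forest) evaluated by cross-validation, assuming scikit-learn; the Iris data and 100 trees are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=1)  # 100 diverse base trees
    print(cross_val_score(forest, X, y, cv=5).mean())                  # averaged accuracy estimate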
  • 53. 2.4.1 Production Readiness The production readiness part of the deployment determines the critical qualities required for the deployment objective. 2.4.2 Technical Integration Technical integration in the data science process involves integrating various technologies, tools, and platforms to facilitate and streamline each stage of the process; integrating these tools and technologies ensures an efficient workflow, enabling data scientists to focus on extracting insights and building robust models. 2.4.3 Response Time 2.4.4 Model Refresh 2.4.5 Assimilation
  • 54. 2.5 KNOWLEDGE ‱ The data science process provides a framework to extract nontrivial information from data. With the advent of massive storage, increased data collection, and advanced computing paradigms, the available datasets to be utilized are only increasing. ‱ To extract knowledge from these massive data assets, advanced approaches need to be employed, like data science algorithms, in addition to standard business intelligence reporting or statistical analysis. ‱ Data science, like any other technology, provides various options in terms of algorithms and parameters within the algorithms. Using these options to extract the right information from data is a bit of an art and can be developed with practice. ‱ The data science process starts with prior knowledge and ends with posterior knowledge, which is the incremental insight gained. ‱ It is the difference between the information gained through the data science process and the insights from basic data analysis. Finally, the whole data science process is a framework to invoke the right questions (Chapman et al., 2000) and provide guidance, through the right approaches, to solve a problem.
  • 55. Data Exploration ‱ Data exploration can be broadly classified into two types: descriptive statistics and data visualization. ‱ Descriptive statistics is the process of condensing key characteristics of the dataset into simple numeric metrics. ‱ Some of the common quantitative metrics used are mean, standard deviation, and correlation. ‱ Visualization is the process of projecting the data, or parts of it, into multi-dimensional space or abstract images. All the useful (and adorable) charts fall under this category. ‱ Data exploration in the context of data science uses both descriptive statistics and visualization techniques.
  • 56. OBJECTIVES OF DATA EXPLORATION ‱ Data understanding ‱ Data preparation ‱ Data science tasks ‱ Interpreting the results
  • 57. Types of Data ‱ Numeric or Continuous ‱ Categorical or Nominal UNIVARIATE ANALYSIS Univariate analysis is the simplest form of analyzing data. “Uni” means “one”; in other words, the data has only one variable. It doesn’t deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
  • 58. Ways to describe patterns found in univariate data: 1. Central tendency: mean, mode, median. 2. Dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation. 3. Count / null count. A sketch computing these measures follows.
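A sketch computing these univariate measures, assuming pandas; the scores reuse the math-score example that appears later in this material:

    import pandas as pd

    scores = pd.Series([59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
                        76, 77, 81, 82, 84, 87, 90, 95, 98])

    print(scores.mean(), scores.median(), list(scores.mode()))     # central tendency
    print(scores.max() - scores.min(), scores.var(), scores.std()) # range and dispersion
    print(scores.quantile([0.25, 0.5, 0.75]))                      # quartiles
    print(scores.count(), scores.isna().sum())                     # count / null count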
  • 60. Multivariate Exploration ‱ Multivariate exploration is the study of more than one attribute in the dataset simultaneously. This technique is critical to understanding the relationship between the attributes, which is central to data science methods. ‱ Central Data ‱ In the Iris dataset, each data point can be expressed as a set of all four attributes: observation: {sepal length, sepal width, petal length, petal width} ‱ For example, observation one: {5.1, 3.5, 1.4, 0.2}. This observation point can also be expressed in four-dimensional Cartesian coordinates and can be plotted in a graph (although plotting more than three dimensions in a visual graph can be challenging). In this way, all 150 observations can be expressed in Cartesian coordinates. If the objective is to find the most “typical” observation point, it would be a data point made up of the mean of each attribute in the dataset, computed independently. For the Iris data shown, the central mean point is {5.006, 3.418, 1.464, 0.244}. This data point may not be an actual observation; it is a hypothetical data point with the most typical attribute values.
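A sketch of computing the attribute-wise mean point, assuming scikit-learn provides the Iris data; note that the exact values depend on which observations are included (the full 150 observations versus the sample shown on the slide):

    import numpy as np
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)
    central_point = X.mean(axis=0)        # mean of each of the four attributes
    print(np.round(central_point, 3))     # a hypothetical, most "typical" observation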
  • 61. Correlation ‱ Correlation measures the statistical relationship between two attributes, particularly the dependence of one attribute on another. ‱ When two attributes are highly correlated with each other, they both vary at the same rate, either in the same or in opposite directions. ‱ For example, consider the average temperature of the day and ice cream sales. Statistically, two attributes that are correlated are dependent on each other, and one may be used to predict the other. If there are sufficient data, future sales of ice cream can be predicted if the temperature forecast is known. However, correlation between two attributes does not imply causation, that is, one doesn’t necessarily cause the other. Ice cream sales and shark attacks are correlated; however, there is no causation. Both ice cream sales and shark attacks are influenced by a third attribute: the summer season. Generally, ice cream sales spike as temperatures rise, and as more people go to beaches during summer, encounters with sharks become more probable.
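A sketch of computing the (Pearson) correlation between two attributes, assuming pandas; the temperature and sales figures are invented:

    import pandas as pd

    df = pd.DataFrame({"avg_temp_c":      [18, 22, 25, 30, 33, 35],
                       "ice_cream_sales": [120, 150, 180, 240, 260, 300]})

    print(df["avg_temp_c"].corr(df["ice_cream_sales"]))   # Pearson correlation, close to +1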
  • 62. DATA VISUALIZATION ‱ Visualizing data is one of the most important techniques of data discovery and exploration. ‱ Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends, and correlations that might not otherwise be detected can be exposed. ‱ Vision is one of the most powerful senses in the human body. As such, it is intimately connected with cognitive thinking. Human vision is trained to discover patterns and anomalies even in the presence of a large volume of data. However, the effectiveness of the pattern detection depends on how effectively the information is visually presented. Hence, selecting suitable visuals to explore data is critically important in discovering and comprehending hidden patterns in the data. ‱ As with descriptive statistics, visualization techniques are categorized into: univariate visualization, multivariate visualization, and visualization of a large number of attributes using parallel dimensions.
  • 63. Univariate Visualization Visual exploration starts with investigating one attribute at a time using univariate charts. The techniques discussed in this section give an idea of how the attribute values are distributed and the shape of the distribution. Histogram ‱ A histogram is one of the most basic visualization techniques to understand the frequency of the occurrence of values. ‱ It shows the distribution of the data by plotting the frequency of occurrence in a range. ‱ In a histogram, the attribute under inquiry is shown on the horizontal axis and the frequency of occurrence is on the vertical axis. ‱ For a continuous numeric data type, the range or binning value used to group a range of values needs to be specified. ‱ For example, in the case of human height in centimetres, all the occurrences between 152.00 and 152.99 are grouped under 152. ‱ There is no optimal number of bins or bin width that works for all distributions. If the bin width is too small, the distribution becomes more precise but reveals noise due to sampling. ‱ A general rule of thumb is to set the number of bins equal to the square root or cube root of the number of data points.
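A sketch of a histogram with the square-root rule for the number of bins, assuming NumPy and matplotlib; the height data are randomly generated for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    heights_cm = np.random.default_rng(1).normal(loc=170, scale=8, size=400)
    bins = int(np.sqrt(len(heights_cm)))   # square-root rule of thumb: 20 bins

    plt.hist(heights_cm, bins=bins)
    plt.xlabel("Height (cm)")              # attribute under inquiry
    plt.ylabel("Frequency")                # frequency of occurrence
    plt.show()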
  • 65. Quartile ‱ A quartile is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations. ‱ A quartile divides data into three points—a lower quartile, median, and upper quartile—to form four groups of the dataset. ‱ The lower quartile, or first quartile, is denoted as Q1 and is the middle number that falls between the smallest value of the dataset and the median. The second quartile, Q2, is also the median. The upper or third quartile, denoted as Q3, is the central point that lies between the median and the highest number of the distribution. ‱ Each quartile contains 25% of the total observations. Generally, the data is arranged from smallest to largest:  First quartile: the lowest 25% of numbers  Second quartile: between 25.1% and 50% (up to the median)  Third quartile: 50.1% to 75% (above the median)  Fourth quartile: the highest 25% of numbers
  • 66. Suppose the distribution of math scores in a class of 19 students in ascending order is: 59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98 First, mark down the median, Q2, which in this case is the 10th value: 75. Q1 is the central point between the smallest score and the median; here it is the median of the lower nine scores, i.e., the fifth score: 68. (Note that the median can also be included when calculating Q1 or Q3 for an odd set of values. If the median is included on either side of the middle point, then Q1 is the middle value between the first and 10th score, which is the average of the fifth and sixth score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5.) Q3 is the middle value between Q2 and the highest score: 84. (Or, if the median is included, Q3 = (82 + 84)/2 = 83.) Now that the quartiles are known, they can be interpreted. A score of 68 (Q1) represents the first quartile and is the 25th percentile: 68 is the median of the lower half of the score set, that is, the median of the scores from 59 to 75. Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the median) is the 50th percentile and shows that 50% of the scores are less than 75 and 50% of the scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater than 84 and 75% are less.
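The same quartiles can be computed with NumPy; its default linear interpolation corresponds to the variant that includes the median region, giving 68.5 and 83 rather than 68 and 84:

    import numpy as np

    scores = np.array([59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
                       76, 77, 81, 82, 84, 87, 90, 95, 98])

    q1, q2, q3 = np.percentile(scores, [25, 50, 75])
    print(q1, q2, q3)   # 68.5 75.0 83.0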
  • 69. Box plots ‱ In descriptive statistics, a box plot or boxplot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis. Box plots visually show the distribution of numerical data and skewness by displaying the data quartiles (or percentiles) and averages. ‱ Box plots show the five-number summary of a set of data: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
  • 70. ‱ Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker). ‱ Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile). ‱ Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less. ‱ Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of the data are above this value. ‱ Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker). ‱ Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% of scores and the upper 25% of scores). ‱ The Interquartile Range (IQR): The box itself, showing the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).
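A sketch of a box plot of the earlier math scores, assuming matplotlib:

    import matplotlib.pyplot as plt

    scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
              76, 77, 81, 82, 84, 87, 90, 95, 98]

    plt.boxplot(scores, vert=False)   # box spans Q1 to Q3; the line inside marks the median
    plt.xlabel("Math score")
    plt.show()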
  • 71. Distribution Chart ‱ For continuous numeric attributes like petal length, instead of visualizing the actual data in the sample, its normal distribution function can be visualized instead. The normal distribution function of a continuous random variable x is f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), where μ is the mean of the distribution and σ is the standard deviation of the distribution. Here an inherent assumption is being made that the measurements of petal length (or any continuous variable) follow the normal distribution; hence, its distribution can be visualized instead of the actual values. The normal distribution is also called the Gaussian distribution or “bell curve” due to its bell shape.
  • 73. Multivariate Visualization ‱ The multivariate visual exploration considers more than one attribute in the same visual. The techniques discussed in this section focus on the relationship of one attribute with another attribute. The visualizations examine two to four attributes simultaneously. ‱ Scatterplot A scatterplot is one of the most powerful yet simple visual plots available. In a scatterplot, the data points are marked in Cartesian space with attributes of the dataset aligned with the coordinates. The attributes are usually of continuous data type. One of the key observations that can be concluded from a scatterplot is the existence of a relationship between two attributes under inquiry. If the attributes are linearly correlated, then the data points align closer to an imaginary straight line; if they are not correlated, the data points are scattered. Apart from basic correlation, scatterplots can also indicate the existence of patterns or groups of clusters in the data and identify outliers in the data. This is particularly useful for low-dimensional datasets.
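A sketch of a scatterplot of two Iris attributes, colored by species, assuming scikit-learn and matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

    plt.scatter(petal_length, petal_width, c=iris.target)   # color encodes the species
    plt.xlabel("Petal length (cm)")
    plt.ylabel("Petal width (cm)")
    plt.show()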
  • 74. Scatter Multiple ‱ If the dataset has more than two attributes, it is important to look at combinations of all the attributes through a scatterplot. A scatter matrix solves this need by comparing all combinations of attributes with individual scatterplots and arranging these plots in a matrix. ‱ A scatter matrix for all four attributes in the Iris dataset is shown in Fig. The color of the data point is used to indicate the species of the flower. Since there are four attributes, there are four rows and four columns, for a total of 16 scatter charts. Charts on the diagonal are a comparison of an attribute with itself; hence, they are eliminated. Also, the charts below the diagonal are mirror images of the charts above the diagonal. In effect, there are six distinct comparisons in the scatter multiples of four attributes. Scatter matrices provide an effective visualization of comparative, multivariate, high-density data displayed in small multiples of similar scatterplots.
  • 75. Bubble chart A bubble chart is a variation of a simple scatterplot with the addition of one more attribute, which is used to determine the size of the data point. In the Iris dataset, petal length and petal width are used for the x- and y-axes, respectively, and sepal width is used for the size of the data point. The color of the data point represents the species class label.
  • 76. Density charts Density charts are similar to scatterplots, with one more dimension included as a background color. The data point can also be colored to visualize one more dimension, and hence, a total of four dimensions can be visualized in a density chart. In the example in Fig. 3.14, petal length is used for the x-axis, sepal length for the y-axis, sepal width for the background color, and the class label for the data point color.