Syllabus
Introduction to Data Science
‱ Data Science is a field that combines statistical methods,
algorithms, and technology to extract insights from structured
and unstructured data.
‱ It enables organizations to make data-driven decisions,
predict trends, and improve efficiency.
‱ Data Science is a collection of techniques used to extract value
from data.
 Data – can be a simple array of a few numeric observations or a complex matrix of
millions of observations with thousands of variables.
 Science – in data science indicates that the methods are evidence based and are
built on empirical knowledge, more specifically on historical observations.
This discipline coexists with, and is closely associated with,
‱ database systems
‱ data engineering
‱ visualization
‱ data analysis
‱ experimentation and,
‱ business intelligence.
Key features
‱ 1.2.1 Extracting Meaningful Patterns
In data science, extracting meaningful patterns refers to the process of identifying
and interpreting significant trends, relationships, or insights in data. This involves
analyzing large datasets to discover patterns that can inform decision-making,
predict future trends, or solve specific problems.
Key Aspects of Extracting Meaningful Patterns
 Data Mining: Techniques used to discover patterns in large datasets, often
involving machine learning, statistical analysis, and database systems.
 Statistical Analysis: Using statistical methods to identify relationships and
trends within data.
 Machine Learning Models: Employing algorithms to learn from data and
make predictions or classify information.
 Visualization Tools: Creating charts, graphs, and other visual aids to help
identify patterns and make data more interpretable.
 Feature Engineering: Selecting and transforming variables in the data to
improve the performance of machine learning models.
 Pattern Recognition: Detecting regularities and irregularities in data,
which could indicate significant insights.
Examples of Extracting Meaningful Patterns:
‱ Customer Segmentation: Analyzing purchase behavior to
group customers into segments for targeted marketing.
‱ Fraud Detection: Identifying unusual transaction patterns that
may indicate fraudulent activity.
‱ Predictive Maintenance: Using sensor data to predict
equipment failures before they occur.
‱ Market Basket Analysis: Discovering product purchase
combinations to optimize inventory and cross-selling
strategies.
‱ 1.2.2 Building Representative Models
 In statistics, a model is the representation of a relationship
between variables in a dataset.
 It describes how one or more variables in the data are related
to other variables.
 Modeling is a process in which a representative abstraction is
built from the observed dataset.
 For example, based on credit score, income level, and
requested loan amount, a model can be developed to
determine the interest rate of a loan. For this task, previously
known observational data including credit score, income level,
loan amount, and interest rate are needed.
‱ Fig. 1.3 shows the process of generating a model. Once the representative
model is created, it can be used to predict the value of the interest rate,
based on all the input variables.
‱ This model serves two purposes:
– On the one hand, it predicts the output (interest rate) based on the
new and unseen set of input variables (credit score, income level, and
loan amount),
– and on the other hand, the model can be used to understand the
relationship between the output variable and all the input variables.
‱ For example, does income level really matter in determining
the interest rate of a loan? Does income level matter more
than credit score? What happens when income levels double
or if credit score drops by 10 points? A model can be used for
both predictive and explanatory applications.
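As an illustration, the sketch below fits a linear model for the interest rate with scikit-learn; the loan observations are invented for this example, not taken from the referenced dataset. The fitted model is used both to predict the rate for a new applicant and to inspect the coefficients relating each input to the output.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical observations: credit score, income (in $1000s), loan amount (in $1000s)
X = np.array([
    [500, 40, 100],
    [600, 55, 120],
    [650, 70, 150],
    [700, 90, 200],
    [750, 120, 250],
])
y = np.array([9.5, 8.2, 7.4, 6.1, 5.3])   # observed interest rates (%)

model = LinearRegression().fit(X, y)

# Predictive use: estimate the rate for a new, unseen applicant
print(model.predict(np.array([[680, 80, 180]])))

# Explanatory use: how does each input variable relate to the interest rate?
print(model.coef_)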
‱ 1.2.3 Combination of Statistics, Machine Learning, and
Computing
Data science integrates these three disciplines to extract
meaningful insights from data, build predictive models, and
implement solutions. Each of these fields contributes unique
methodologies and tools, which together enable comprehensive data
analysis and decision-making processes.
‱ 1.2.4 Learning Algorithms
Learning algorithms are the methods and techniques used to build
models that can learn from data and make predictions or
decisions. These algorithms enable machines to automatically
improve their performance on a given task through experience,
without being explicitly programmed for every possible scenario.
‱ Types of Learning Algorithms
1. Supervised Learning:
Examples:
‱ Linear Regression
‱ Logistic Regression
‱ Decision Trees
‱ Support Vector Machines (SVM)
‱ Neural Networks
2. Unsupervised Learning:
– Examples:
‱ K-Means Clustering
‱ Hierarchical Clustering
‱ Principal Component Analysis (PCA)
‱ Anomaly Detection
3. Reinforcement Learning:
– Examples:
‱ Q-Learning
‱ Deep Q-Networks (DQN)
‱ Policy Gradient Methods
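A minimal sketch of the first two categories, assuming scikit-learn and synthetic data (the dataset and labels below are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 data points, 2 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # known labels -> supervised setting

# Supervised learning: learn a mapping from inputs to the known labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5, -0.2]]))        # predict the label of a new point

# Unsupervised learning: no labels, just find natural groupings
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                   # cluster assignments of the first 10 points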
Benefits of Learning Algorithms:
‱ Scalability: Can handle large datasets and complex problems
efficiently.
‱ Accuracy: Often provide high levels of predictive accuracy by
leveraging vast amounts of data.
‱ Continuous Improvement: Capable of learning and improving
over time with more data and feedback.
‱ Versatility: Applicable to a wide range of domains and
industries, from healthcare to finance to entertainment.
‱ 1.2.5. Associated Fields
‱ As we have seen, data science covers a wide set of
techniques, applications, and disciplines.
‱ There are a few associated fields that data science heavily
depends on.
‱ They are
– Descriptive Statistics,
– Exploratory Visualization,
– Dimensional Slicing,
– Hypothesis Testing,
– Data Engineering and,
– Business Intelligence.
Descriptive statistics:
‱ Computing the mean, standard deviation, correlation, and other
descriptive statistics quantifies the aggregate structure of a
dataset.
‱ This information is essential for understanding any dataset, the
structure of its data, and the relationships within it.
‱ Descriptive statistics are used in the exploration stage of the data
science process.
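For example, a few lines of pandas (with a made-up two-attribute dataset) compute these summaries:

import pandas as pd

# Hypothetical data used only to illustrate descriptive statistics
df = pd.DataFrame({
    "credit_score": [500, 600, 650, 700, 750],
    "interest_rate": [9.5, 8.2, 7.4, 6.1, 5.3],
})

print(df.mean())       # central tendency of each attribute
print(df.std())        # spread of each attribute
print(df.corr())       # pairwise correlation between attributes
print(df.describe())   # compact summary: count, mean, std, quartiles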
‱ Exploratory visualization:
The process of expressing data in visual coordinates enables users to find patterns and
relationships in the data and to comprehend large datasets. Like descriptive
statistics, visualization is integral to the pre- and post-processing steps of data science.
‱ Dimensional Slicing:
 Online analytical processing (OLAP) applications are widespread in organizations.
 They mainly provide information on the data through dimensional slicing, filtering,
and pivoting.
 OLAP analysis is enabled by a unique database schema design where the data are
organized as dimensions (e.g., products, regions, dates) and quantitative facts or
measures (e.g., revenue, quantity).
 With a well-defined database structure, it is easy to slice the yearly revenue by
product or by a combination of region and product, for example.
 These techniques are extremely useful and may reveal patterns in the data.
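The pandas sketch below mimics this kind of dimensional slicing on a small, invented sales table (a real OLAP deployment would run against a data warehouse):

import pandas as pd

# Hypothetical fact table: each row is a sale with dimensions (region, product, year)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "revenue": [100, 150, 120, 90, 130, 140],
})

# Slice yearly revenue by product (rows = year, columns = product)
print(pd.pivot_table(sales, values="revenue", index="year",
                     columns="product", aggfunc="sum"))

# Slice by a combination of region and product
print(pd.pivot_table(sales, values="revenue", index=["region", "product"],
                     columns="year", aggfunc="sum"))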
Hypothesis Testing:
‱ Hypothesis testing is a form of statistical testing.
‱ In statistics, a hypothesis is a statement about a population that we
want to verify based on information contained in the sample data.
‱ In general, data science is a process where many hypotheses are
generated and tested based on observational data.
‱ Since data science algorithms are iterative, solutions can be
refined at each step.
Steps usually followed in hypothesis testing are:
1. Figure out the null hypothesis,
2. State the null hypothesis,
3. Choose what kind of test we need to perform,
4. Either support or reject the null hypothesis.
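A small example of these steps using a one-sample t-test (SciPy assumed; the sample values are invented):

from scipy import stats

# Null hypothesis: the population mean interest rate is 7.0%
sample_rates = [9.5, 8.2, 7.4, 6.1, 5.3, 7.8, 6.9]

t_stat, p_value = stats.ttest_1samp(sample_rates, popmean=7.0)
print("t statistic:", t_stat, "p-value:", p_value)

# Support or reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")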
Data engineering:
 Data engineering is the process of sourcing, organizing, assembling, storing,
and distributing data for effective analysis and usage.
 Database engineering, distributed storage, computing frameworks (e.g.,
Apache Hadoop, Spark, Kafka), parallel computing, extraction, transformation,
and loading (ETL) processes, and data warehousing constitute data engineering
techniques.
 Data engineering helps source and prepare data for data science learning
algorithms.
Business Intelligence:
 Business intelligence helps organizations consume data effectively.
 It supports ad hoc querying of the data and uses dashboards or visualizations to
communicate facts and trends.
 Historical trends are usually reported, but in combination with data science,
both the past and the predicted future data can be combined. BI can hold and
distribute the results of data science.
DATA SCIENCE CLASSIFICATION
‱ Data science problems can be broadly categorized into
supervised or unsupervised learning models.
‱ Supervised or directed data science tries to infer a function or
relationship based on labelled training data and uses this
function to map new unlabelled data.
‱ The model generalizes the relationship between the input and
output variables and uses it to predict for a dataset where
only input variables are known. The output variable that is
being predicted is also called a class label or target variable.
‱ Unsupervised or undirected data science uncovers hidden patterns in unlabelled data.
‱ In unsupervised data science, there are no output variables to predict. The objective
of this class of data science techniques is to find patterns in data based on the
relationships between the data points themselves.
Data science problems can also be classified into tasks such as:
1. Classification
2. Regression
3. Association Analysis
4. Clustering
5. Anomaly Detection
6. Recommendation Engines
7. Feature Selection
8. Time Series Forecasting
9. Deep Learning
10. Text Mining.
‱ Classification and regression
techniques predict a target
variable based on input variables.
The prediction is based on a
generalized model built from a
previously known dataset.
‱ In regression tasks, the output
variable is numeric (e.g., the
mortgage interest rate on a loan).
‱ Classification tasks predict output
variables, which are categorical or
polynomial (e.g., the yes or no
decision to approve a loan).
Predict whether a customer is eligible for a loan. Predict the price of a car.
Predict whether the Indian team will win or lose. Predict the weather for the next 24 hours.
‱ Clustering is the process of identifying the natural groupings
in a dataset. For example, clustering is helpful in finding
natural clusters in customer datasets, which can be used for
market segmentation.
‱ Since this is unsupervised data science, it is up to the end user
to investigate why these clusters are formed in the data and
generalize the uniqueness of each cluster.
‱ Deep Learning is based on artificial neural networks used for
classification and regression problems.
‱ In retail analytics, it is common to identify pairs of items that
are purchased together, so that specific items can be bundled
or placed next to each other. This task is called market basket
analysis or association analysis, which is commonly used in
cross-selling.
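A simplified association-analysis sketch in plain Python, counting how often pairs of items are purchased together (real market basket analysis would use dedicated algorithms such as Apriori or FP-Growth; the transactions are invented):

from itertools import combinations
from collections import Counter

# Hypothetical transactions: each set is the contents of one basket
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # most frequently co-purchased pairs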
‱ Recommendation engines are the systems that recommend
items to the users based on individual user preference.
‱ Anomaly or outlier detection identifies the data points that
are significantly different from other data points in a dataset.
Credit card transaction fraud detection is one of the most
prolific applications of anomaly detection.
‱ Time series forecasting is the process of predicting the future
value of a variable (e.g., temperature) based on past historical
values that may exhibit a trend and seasonality.
‱ Text mining is a data science application where the input data
is text, which can be in the form of documents, messages,
emails, or web pages.
‱ To aid the data science on text data, the text files are first
converted into document vectors where each unique word is
an attribute.
‱ Once the text file is converted to document vectors, standard
data science tasks such as classification, clustering, etc., can be
applied.
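For instance, scikit-learn's CountVectorizer converts a few short texts into document vectors where each unique word becomes an attribute (the documents below are made up):

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the loan was approved quickly",
    "loan application was rejected",
    "quick approval for the new loan",
]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # rows = documents, columns = unique words

print(vectorizer.get_feature_names_out())          # the word attributes
print(doc_vectors.toarray())                       # word counts per document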
‱ Feature selection is a process in which the attributes in a dataset
are reduced to the few attributes that really matter.
Data Science Process
‱ The methodical discovery of useful relationships and patterns
in data is enabled by a set of iterative activities collectively
known as the data science process.
‱ The standard data science process involves
o understanding the problem,
o preparing the data samples,
o developing the model,
o applying the model to a dataset,
o deploying and maintaining the models.
‱ One of the most popular data science process frameworks is the
Cross Industry Standard Process for Data Mining (CRISP-DM).
‱ This framework was developed by a consortium of companies
involved in data mining.
‱ The CRISP-DM process is the most widely adopted framework
for developing data science solutions.
Fig. 2.1 provides a visual overview of the CRISP-DM framework.
‱ The problem at hand could be a segmentation of customers, a
prediction of climate patterns, or a simple data exploration.
‱ The learning algorithm used to solve the business question
could be a decision tree, an artificial neural network, or a
scatterplot.
‱ The software tool to develop and implement the data science
algorithm used could be custom coding, RapidMiner, R, Weka,
SAS, Oracle Data Miner, Python, etc., (Piatetsky, 2018) to
mention a few.
2.1 PRIOR KNOWLEDGE
‱ The prior knowledge step in the data science process helps to define what problem is being
solved, how it fits in the business context, and what data is needed in order to solve the
problem.
– Objective
‱ The data science process starts with a need for analysis, a question, or a business
objective. This is possibly the most important step in the data science process
(Shearer, 2000). Without a well-defined statement of the problem, it is impossible
to come up with the right dataset and pick the right data science algorithm.
– Subject Area
‱ The process of data science uncovers hidden patterns in the dataset by exposing
relationships between attributes. But the problem is that it uncovers a lot of
patterns. The false or spurious signals are a major problem in the data science
process. It is up to the practitioner to sift through the exposed patterns and accept
the ones that are valid and relevant to the answer of the objective question. Hence,
it is essential to know the subject matter, the context, and the business process
generating the data.
‱ Data
– Similar to the prior knowledge in the subject area, prior knowledge in the data can also
be gathered.
– Understanding how the data is collected, stored, transformed, reported, and used is
essential to the data science process.
– There are quite a range of factors to consider: quality of the data, quantity of data,
availability of data, gaps in data, does lack of data compel the practitioner to change the
business question, etc.
– The objective of this step is to come up with a dataset to answer the business question
through the data science process.
– It is critical to recognize that an inferred model is only as good as the data used to
create it.
‱ A dataset (example set) is a collection of data with a well-defined structure.
This structure is also sometimes referred to as a “data frame”.
‱ A data point (record, object, or example) is a single instance in the dataset.
Each row in the table is a data point. Each instance has the same
structure as the dataset.
‱ An attribute (feature, input, dimension, variable, or predictor) is a single
property of the dataset. Each column in the table is an attribute.
‱ Attributes can be numeric, categorical, date-time, text, or Boolean data
types. In this example, both the credit score and the interest rate are
numeric attributes.
‱ A label (class label, output, prediction, target, or response) is the special
attribute to be predicted based on all the input attributes. In the table, the
interest rate is the output variable.
‱ Identifiers are special attributes that are used for locating or providing
context to individual records. For example, common attributes like names,
account numbers, and employee ID numbers are identifier attributes.
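These terms can be mapped onto a small hypothetical loan table (the column names and values are invented for illustration):

import pandas as pd

# Dataset (example set): a collection of data points with a defined structure
dataset = pd.DataFrame({
    "borrower_id":   [101, 102, 103],            # identifier attribute
    "credit_score":  [500, 650, 750],            # numeric input attribute
    "income_level":  ["low", "medium", "high"],  # categorical input attribute
    "interest_rate": [9.5, 7.4, 5.3],            # label / target attribute to be predicted
})

print(dataset.dtypes)    # data type of each attribute (column)
print(dataset.iloc[0])   # one data point (row) with the same structure as the dataset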
2.2 DATA PREPARATION
‱ Preparing the dataset to suit a data science task is the most
time-consuming part of the process.
‱ It is extremely rare that datasets are available in the form
required by the data science algorithms.
‱ Most of the data science algorithms would require data to be
structured in a tabular format with records in the rows and
attributes in the columns.
‱ If the data is in any other format, the data would need to be
transformed by applying pivot, type conversion, join, or
transpose functions, etc., to condition the data into the
required structure.
2.2.1 Data Exploration
‱ Data exploration, also known as exploratory data analysis, provides a set
of simple tools to achieve basic understanding of the data.
‱ Data exploration approaches involve computing descriptive statistics and
visualization of data.
‱ They can expose the structure of the data, the distribution of the values,
and the presence of extreme values, and highlight the inter-relationships
within the dataset.
‱ Descriptive statistics like mean, median, mode, standard deviation, and
range for each attribute provide an easily readable summary of the key
characteristics of the distribution of the data.
2.2.2 Data Quality
‱ Data quality is an ongoing concern wherever data is collected,
processed, and stored.
‱ Organizations use data alerts, cleansing, and transformation
techniques to improve and manage the quality of the data and
store them in companywide repositories called data warehouses.
‱ Data sourced from well-maintained data warehouses have higher
quality, as there are proper controls in place to ensure a level of
data accuracy for new and existing data.
‱ The data cleansing practices include elimination of duplicate
records, quarantining outlier records that exceed the bounds,
standardization of attribute values, substitution of missing values,
etc.
2.2.3 Missing Values
‱ One of the most common data quality issues is that some records have missing attribute
values.
‱ For example, a credit score may be missing in one of the records. There are several
different mitigation methods to deal with this problem, but each method has pros and
cons. The first step of managing missing values is to understand the reason behind why
the values are missing. Tracking the data lineage (provenance) of the data source can
lead to the identification of systemic issues during data capture or errors in data
transformation.
‱ Knowing the source of a missing value will often guide which mitigation methodology to
use. The missing value can be substituted with a range of artificial data so that the issue
can be managed with marginal impact on the later steps in the data science process.
‱ Missing credit score values can be replaced with a credit score derived from the dataset
(mean, minimum, or maximum value, depending on the characteristics of the attribute).
This method is useful if the missing values occur randomly and the frequency of
occurrence is quite rare.
‱ Alternatively, to build the representative model, all the data records with missing values
or records with poor data quality can be ignored. This method reduces the size of the
dataset.
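Both mitigation options can be sketched with pandas (the records are invented):

import pandas as pd

df = pd.DataFrame({"credit_score": [500, 650, None, 750],
                   "interest_rate": [9.5, 7.4, 6.8, 5.3]})

# Option 1: substitute the missing credit score with a value derived from the dataset
df_imputed = df.fillna({"credit_score": df["credit_score"].mean()})

# Option 2: ignore records with missing values (reduces the size of the dataset)
df_dropped = df.dropna()

print(df_imputed)
print(df_dropped)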
2.2.4 Data Types and Conversion
‱ The attributes in a dataset can be of different types, such as continuous numeric
(interest rate), integer numeric (credit score), or categorical. For example, the
credit score can be expressed as categorical values (poor, good, excellent) or
numeric score.
‱ Different data science algorithms impose different restrictions on the attribute
data types.
‱ In the case of linear regression models, the input attributes have to be numeric. If
the available data are categorical, they must be converted to continuous numeric
attributes.
‱ A specific numeric score can be encoded for each category value, such as poor = 400,
good = 600, excellent = 700, etc.
‱ Similarly, numeric values can be converted to categorical data types by a
technique called binning, where a range of values are specified for each category,
for example, a score between 400 and 500 can be encoded as “low” and so on.
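Both conversions can be sketched with pandas (the scores, bin edges, and encodings below are illustrative, not prescribed):

import pandas as pd

df = pd.DataFrame({"credit_category": ["poor", "good", "excellent", "good"],
                   "credit_score":    [420, 610, 710, 580]})

# Categorical -> numeric: encode a numeric score for each category value
category_to_score = {"poor": 400, "good": 600, "excellent": 700}
df["encoded_score"] = df["credit_category"].map(category_to_score)

# Numeric -> categorical: binning a range of values into labeled categories
df["score_bin"] = pd.cut(df["credit_score"],
                         bins=[400, 500, 650, 800],
                         labels=["low", "medium", "high"])

print(df)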
‱ 2.2.5 Transformation
‱ In some data science algorithms like k-NN, the input attributes are
expected to be numeric and normalized, because the algorithm
compares the values of different attributes and calculates
distance between the data points.
‱ Normalization prevents one attribute from dominating the distance
results because of large values. For example, consider income
(expressed in USD, in thousands) and credit score (in hundreds).
‱ The distance calculation will always be dominated by slight
variations in income.
‱ One solution is to convert the ranges of income and credit score to
a more uniform scale from 0 to 1 by normalization. This way, a
consistent comparison can be made between the two
attributes with different units.
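A minimal min-max normalization sketch with pandas (the income and credit score values are invented):

import pandas as pd

df = pd.DataFrame({"income": [40_000, 75_000, 120_000],   # USD
                   "credit_score": [500, 650, 750]})

# Rescale each attribute to the range 0-1 so that neither attribute
# dominates a distance calculation (e.g., in k-NN)
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)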
2.2.6 Outliers
‱ Outliers are anomalies in a given dataset.
‱ Outliers may occur because of correct data capture (few
people with income in tens of millions) or erroneous data
capture (human height as 1.73 cm instead of 1.73 m).
‱ Regardless, the presence of outliers needs to be understood
and will require special treatments.
‱ The purpose of creating a representative model is to generalize
a pattern or a relationship within a dataset and the presence
of outliers skews the representativeness of the inferred model.
‱ Detecting outliers may be the primary purpose of some data
science applications, like fraud or intrusion detection.
2.2.7 Feature Selection
Reducing the number of attributes, without significant loss in the
performance of the model, is called feature selection. It leads to
a more simplified model and helps to synthesize a more effective
explanation of the model.
2.2.8 Data Sampling
Sampling is a process of selecting a subset of records as a
representation of the original dataset for use in data analysis or
modeling. The sample data serve as a representative of the
original dataset with similar properties, such as a similar mean.
Sampling reduces the amount of data that needs to be processed
and speeds up the model-building process.
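For example, with pandas a random sample can be drawn directly from a data frame (the data below are synthetic):

import pandas as pd

df = pd.DataFrame({"credit_score": range(500, 800, 3)})   # 100 hypothetical records

# Draw a 10% random sample as a representative subset of the original data
sample = df.sample(frac=0.1, random_state=42)

print(len(df), "records in the original dataset")
print(len(sample), "records in the sample; sample mean:", sample["credit_score"].mean())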
2.3 Model
A model is the abstract representation of the data and the relationships in a
given dataset. A simple rule of thumb like “mortgage interest rate reduces
with increase in credit score” is a model; although there is not enough
quantitative information to use in a production scenario, it provides
directional information by abstracting the relationship between credit score
and interest rate. There are a few hundred data science algorithms in use
today, derived from statistics, machine learning, pattern recognition, and the
body of knowledge related to computer science.
2.3.1 Training and Testing Datasets
The modeling step creates a representative model inferred from
the data. The dataset used to create the model, with known
attributes and target, is called the training dataset. The validity
of the created model will also need to be checked with another
known dataset called the test dataset or validation dataset. To
facilitate this process, the overall known dataset can be split into
a training dataset and a test dataset. A standard rule of thumb is
that two-thirds of the data are used for training and one-third as
the test dataset.
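A minimal split along these lines with scikit-learn (synthetic records; the 1/3 test size mirrors the rule of thumb above):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)   # 15 hypothetical records with 2 attributes
y = np.arange(15)                  # known target values

# Two-thirds of the data for training, one-third held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

print(len(X_train), "training records,", len(X_test), "test records")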
2.3.2 Learning Algorithms
The business question and the availability of data dictate
which data science task (association, classification, regression,
etc.) can be used. The practitioner then determines the
appropriate data science algorithm within the chosen category.
For example, within a classification task many algorithms can be
chosen from: decision trees, rule induction, neural networks,
Bayesian models, k-NN, etc. Likewise, within decision tree
techniques, there are quite a number of variations of learning
algorithms, such as classification and regression tree (CART) and the CHi-
squared Automatic Interaction Detector (CHAID).
2.3.3 Evaluation of the Model
A model should not memorize and output the same values that
are in the training records. The phenomenon of a model
memorizing the training data is called overfitting. An overfitted
model just memorizes the training records and will
underperform on real unlabeled new data. The model should
generalize or learn the relationship between credit score and
interest rate. To evaluate this relationship, the validation or test
dataset, which was not previously used in building the model, is
used for evaluation.
2.3.4 Ensemble Modeling
Ensemble modeling is a process where multiple diverse base
models are used to predict an outcome. The motivation for using
ensemble models is to reduce the generalization error of the
prediction.
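A common ensemble sketch uses a random forest, which combines the votes of many diverse decision trees (scikit-learn and its bundled Iris data assumed):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

# An ensemble of 100 trees; combining their predictions reduces
# the generalization error of any single tree
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))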
2.4 APPLICATION
Deployment is the stage at which the model becomes
production ready or live. In business applications, the results of
the data science process have to be assimilated into the business
process—usually in software applications. The model
deployment stage has to deal with: assessing model readiness,
technical integration, response time, model maintenance, and
assimilation.
2.4.1 Production Readiness
The production readiness part of the deployment determines the critical qualities required for
the deployment objective.
2.4.2 Technical Integration
Technical integration in the data science process involves integrating various technologies, tools,
and platforms to facilitate and streamline each stage of the process. Here's how technical
integration can be applied at each step:
2.4.3 Response Time
2.4.4 Model Refresh
2.4.5 Assimilation
Integrating these tools and technologies ensures an efficient workflow, enabling data scientists to focus on
extracting insights and building robust models.
2.5 KNOWLEDGE
‱ The data science process provides a framework to extract nontrivial information
from data. With the advent of massive storage, increased data collection, and
advanced computing paradigms, the available datasets to be utilized are only
increasing.
‱ To extract knowledge from these massive data assets, advanced approaches need to
be employed, like data science algorithms, in addition to standard business
intelligence reporting or statistical analysis.
‱ Data science, like any other technology, provides various options in terms of
algorithms and parameters within the algorithms. Using these options to extract the
right information from data is a bit of an art and can be developed with practice.
‱ The data science process starts with prior knowledge and ends with posterior
knowledge, which is the incremental insight gained.
‱ It is the difference between the information gained through the data science
process and the insights from basic data analysis. Finally, the whole data science
process is a framework to invoke the right questions (Chapman et al., 2000) and
provide guidance, through the right approaches, to solve a problem.
Data Exploration
‱ Data exploration can be broadly classified into two types—
descriptive statistics and data visualization.
‱ Descriptive statistics is the process of condensing key
characteristics of the dataset into simple numeric metrics.
‱ Some of the common quantitative metrics used are mean,
standard deviation, and correlation.
‱ Visualization is the process of projecting the data, or parts of
it, into multi-dimensional space or abstract images. All the
useful (and adorable) charts fall under this category.
‱ Data exploration in the context of data science uses both
descriptive statistics and visualization techniques.
OBJECTIVES OF DATA EXPLORATION
‱ Data understanding
‱ Data preparation
‱ Data science tasks
‱ Interpreting the results
Types of Data
‱ Numeric or Continuous
‱ Categorical or Nominal
UNIVARIATE ANALYSIS
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”,
so in other words the data has only one variable. It doesn’t deal with causes
or relationships (unlike regression) and its major purpose is to describe: it
takes data, summarizes that data, and finds patterns in the data.
Ways to describe patterns found in univariate data
1. Central tendency
1. Mean
2. Mode
3. Median
2. Dispersion
1. Range
2. Variance
3. maximum, minimum,
4. Quartiles (including the interquartile range), and
5. Standard deviation
3. Count /Null count
Multivariate Exploration
‱ Multivariate exploration is the study of more than one attribute in the dataset
simultaneously. This technique is critical to understanding the relationship
between the attributes, which is central to data science methods.
‱ Central Data
‱ In the Iris dataset, each data point can be expressed as a set of all four
attributes: observation = {sepal length, sepal width, petal length, petal width}.
‱ For example, observation one: {5.1, 3.5, 1.4, 0.2}. This observation point can
also be expressed in four-dimensional Cartesian coordinates and can be
plotted in a graph (although plotting more than three dimensions in a visual
graph can be challenging). In this way, all 150 observations can be expressed
in Cartesian coordinates. If the objective is to find the most “typical”
observation point, it would be a data point made up of the mean of each
attribute in the dataset taken independently. For the Iris data shown in the
referenced figure, the central mean point is {5.006, 3.418, 1.464, 0.244}. This
data point may not be an actual observation. It will be a hypothetical data
point with the most typical attribute values.
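The central point can be computed directly from the data, for example with scikit-learn's bundled copy of the Iris dataset (the resulting means depend on which observations are included, so they may differ from the values quoted above, which may have been computed on a subset of the data):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data   # 150 observations x 4 attributes

# The most "typical" observation: the mean of each attribute taken independently.
# This central point is hypothetical and need not match any actual observation.
central_point = X.mean(axis=0)
print(dict(zip(iris.feature_names, np.round(central_point, 3))))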
Correlation
‱ Correlation measures the statistical relationship between two attributes,
particularly dependence of one attribute on another attribute.
‱ When two attributes are highly correlated with each other, they both
vary at the same rate with each other either in the same or in opposite
directions.
‱ For example, consider average temperature of the day and ice cream
sales. Statistically, the two attributes that are correlated are dependent
on each other and one may be used to predict the other. If there are
sufficient data, future sales of ice cream can be predicted if the
temperature forecast is known. However, correlation between two
attributes does not imply causation, that is, one doesn’t necessarily cause
the other. The ice cream sales and the shark attacks are correlated,
however there is no causation. Both ice cream sales and shark attacks are
influenced by a third attribute—the summer season. Generally, ice
cream sales spike as temperatures rise. As more people go to beaches
during summer, encounters with sharks become more probable.
DATA VISUALIZATION
‱ Visualizing data is one of the most important techniques of data
discovery and exploration.
‱ Data visualization is the discipline of trying to understand data by placing
it in a visual context so that patterns, trends and correlations that might
not otherwise be detected can be exposed.
‱ Vision is one of the most powerful senses in the human body. As such, it
is intimately connected with cognitive thinking. Human vision is trained
to discover patterns and anomalies even in the presence of a large
volume of data. However, the effectiveness of the pattern detection
depends on how effectively the information is visually presented. Hence,
selecting suitable visuals to explore data is critically important in
discovering and comprehending hidden patterns in the data.
‱ As with descriptive statistics, visualization techniques are also
categorized into: univariate visualization, multivariate visualization and
visualization of a large number of attributes using parallel dimensions.
Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The
techniques discussed in this section give an idea of how the attribute values are distributed and
the shape of the distribution.
Histogram
‱ A histogram is one of the most basic visualization techniques to understand the frequency
of the occurrence of values.
‱ It shows the distribution of the data by plotting the frequency of occurrence in a range.
‱ In a histogram, the attribute under inquiry is shown on the horizontal axis and the
frequency of occurrence is on the vertical axis.
‱ For a continuous numeric data type, the range or binning value to group a range of values
need to be specified.
‱ For example, in the case of human height in centimetres, all the occurrences between
152.00 and 152.99 are grouped under 152.
‱ There is no optimal number of bins or bin width that works for all the distributions. If the
bin width is too small, the distribution becomes more precise but reveals the noise due to
sampling.
‱ A general rule of thumb is to have a number of bins equal to the square root or cube root of
the number of data points.
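A matplotlib sketch of a histogram using the square-root rule of thumb (the height values are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=400)   # hypothetical heights in cm

# Rule of thumb: number of bins ~ square root of the number of data points
n_bins = int(np.sqrt(len(heights)))

plt.hist(heights, bins=n_bins)
plt.xlabel("Height (cm)")   # attribute under inquiry on the horizontal axis
plt.ylabel("Frequency")     # frequency of occurrence on the vertical axis
plt.show()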
Quartile
‱ A quartile is a statistical term that describes a division of observations into
four defined intervals based on the values of the data and how they compare
to the entire set of observations.
‱ A quartile divides data into three points—a lower quartile, median, and
upper quartile—to form four groups of the dataset.
‱ The lower quartile, or first quartile, is denoted as Q1 and is the middle
number that falls between the smallest value of the dataset and the median.
The second quartile, Q2, is also the median. The upper or third quartile,
denoted as Q3, is the central point that lies between the median and the
highest number of the distribution.
‱ Each quartile contains 25% of the total observations. Generally, the data is
arranged from smallest to largest:
 First quartile: the lowest 25% of numbers
 Second quartile: between 25.1% and 50% (up to the median)
 Third quartile: 50.1% to 75% (above the median)
 Fourth quartile: the highest 25% of numbers
Suppose the distribution of math scores in a class of 19 students in ascending order is:
59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
First, mark down the median, Q2, which in this case is the 10th value: 75. Q1 is the central point
between the smallest score and the median, that is, the median of the lower half of the scores.
In this case, Q1 is the fifth score: 68. (Note that the median can also be
included when calculating Q1 or Q3 for an odd set of values.
If we were to include the median on either side of the middle point, then Q1 will be the middle
value between the first and 10th score, which is the average of the fifth and sixth score—(fifth +
sixth)/2 = (68 + 69)/2 = 68.5).
Q3 is the middle value between Q2 and the highest score: 84. (Or if you include the median, Q3 =
(82 + 84)/2 = 83).
Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) represents the first
quartile and is the 25th percentile. 68 is the median of the lower half of the score set in the
available data—that is, the median of the scores from 59 to 75.
Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the
median) is the 50th percentile and shows that 50% of the scores are less than 75, and 50% of the
scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater and
75% are less than 84.
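The same scores can be handed to NumPy; note that NumPy's default percentile method interpolates between data points, so Q1 and Q3 match the include-the-median variants above (68.5 and 83) rather than 68 and 84:

import numpy as np

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

# 25th, 50th, and 75th percentiles correspond to Q1, Q2 (median), and Q3
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)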
Box plots
‱ In descriptive statistics, a box plot or boxplot (also known as a box
and whisker plot) is a type of chart often used in exploratory data
analysis. Box plots visually show the distribution of numerical data
and skewness by displaying the data quartiles (or percentiles)
and averages.
‱ Box plots show the five-number summary of a set of data: including
the minimum score, first (lower) quartile, median, third (upper)
quartile, and maximum score.
‱ Minimum Score
The lowest score, excluding outliers (shown at the end of the left whisker).
‱ Lower Quartile
Twenty-five percent of scores fall below the lower quartile value (also known as the first
quartile).
‱ Median
The median marks the mid-point of the data and is shown by the line that divides the box into
two parts (sometimes known as the second quartile). Half the scores are greater than or equal
to this value and half are less.
‱ Upper Quartile
Seventy-five percent of the scores fall below the upper quartile value (also known as the third
quartile). Thus, 25% of data are above this value.
‱ Maximum Score
The highest score, excluding outliers (shown at the end of the right whisker).
‱ Whiskers
The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of
scores and the upper 25% of scores).
‱ The Interquartile Range (or IQR)
This is the box of the box plot, showing the middle 50% of scores (i.e., the range between the 25th and
75th percentiles).
Distribution Chart
‱ For continuous numeric attributes like petal length, instead of
visualizing the actual data in the sample, its normal
distribution function can be visualized instead. The normal
distribution function of a continuous random variable x is
f(x) = (1 / (σ√(2π))) e^(−(x − ÎŒ)² / (2σ²)),
‱ where ÎŒ is the mean of the distribution and σ is the standard
deviation of the distribution. Here an inherent assumption is
being made that the measurements of petal length (or any
continuous variable) follow the normal distribution, and
hence, its distribution can be visualized instead of the actual
values. The normal distribution is also called the Gaussian
distribution or “bell curve” due to its bell shape.
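A quick way to visualize this bell curve is to evaluate the formula above on a grid of values (NumPy and matplotlib assumed; the mean and standard deviation below are placeholders, not the actual Iris petal length statistics):

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 3.8, 1.8   # assumed mean and standard deviation of petal length (cm)

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
pdf = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

plt.plot(x, pdf)
plt.xlabel("Petal length (cm)")
plt.ylabel("Probability density")
plt.title("Normal (Gaussian) distribution")
plt.show()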
Multivariate Visualization
‱ The multivariate visual exploration considers more than one attribute
in the same visual. The techniques discussed in this section focus on
the relationship of one attribute with another attribute. The
visualizations examine two to four attributes simultaneously.
‱ Scatterplot
A scatterplot is one of the most powerful yet simple visual plots available.
In a scatterplot, the data points are marked in Cartesian space with
attributes of the dataset aligned with the coordinates. The attributes are
usually of continuous data type.
One of the key observations that can be concluded from a scatterplot is
the existence of a relationship between two attributes under inquiry. If
the attributes are linearly correlated, then the data points align closer to
an imaginary straight line; if they are not correlated, the data points are
scattered. Apart from basic correlation, scatterplots can also indicate the
existence of patterns or groups of clusters in the data and identify outliers
in the data. This is particularly useful for low-dimensional datasets.
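A basic scatterplot of two Iris attributes with matplotlib (scikit-learn's bundled Iris data assumed; points are colored by species):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]   # petal length
y = iris.data[:, 3]   # petal width

# Correlated attributes align close to an imaginary straight line;
# coloring by species also reveals groupings in the data
plt.scatter(x, y, c=iris.target)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()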
Scatter Multiple
‱ If the dataset has more than two attributes, it is important to look at combinations of
all the attributes through a scatterplot. A scatter matrix solves this need by comparing
all combinations of attributes with individual scatterplots and arranging these plots in a
matrix.
‱ A scatter matrix for all four attributes in the Iris dataset is shown in Fig. The color of the
data point is used to indicate the species of the flower. Since there are four attributes,
there are four rows and four columns, for a total of 16 scatter charts. Charts in the
diagonal are a comparison of the attribute with itself; hence, they are eliminated. Also,
the charts below the diagonal are mirror images of the charts above the diagonal. In
effect, there are six distinct comparisons in scatter multiples of four attributes. Scatter
matrices provide an effective visualization of comparative, multivariate, and high-density
data displayed in small multiples of similar scatterplots.
Bubble chart
A bubble chart is a variation of a simple scatterplot with the
addition of one more attribute, which is used to determine the
size of the data point. In the Iris dataset, petal length and petal
width are used for the x- and y-axes, respectively, and sepal width is
used for the size of the data point. The color of the data point
represents the species class label.
Density charts
Density charts are similar to scatterplots, with one more
dimension included as a background color. The data points can
also be colored to visualize one dimension, and hence, a total
of four dimensions can be visualized in a density chart. In the example
in Fig. 3.14, petal length is used for the x-axis, sepal length for
the y-axis, sepal width for the background color, and the class label
for the data point color.

More Related Content

PDF
Lesson1.2.pptx.pdf
PPTX
Predictive analytics
PPTX
7.-Data-Analytics.pptx
PPTX
Introduction to data science
PPTX
Ml leaning this ppt display number of mltypes.pptx
PPTX
MA- UNIT -1.pptx for ipu bba sem 5, complete pdf
PPTX
Fundamentals of Analytics and Statistic (1).pptx
PPTX
Data Science and Analytics Lesson 1.pptx
Lesson1.2.pptx.pdf
Predictive analytics
7.-Data-Analytics.pptx
Introduction to data science
Ml leaning this ppt display number of mltypes.pptx
MA- UNIT -1.pptx for ipu bba sem 5, complete pdf
Fundamentals of Analytics and Statistic (1).pptx
Data Science and Analytics Lesson 1.pptx

Similar to data science, prior knowledge ,modeling, scatter plot (20)

PPTX
Data Science and Analysis.pptx
PPTX
Data Science topic and introduction to basic concepts involving data manageme...
PPTX
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
PPTX
Lesson 5- Data Analysinbvs Techniques.pptx
PPTX
Lecturer3 by RamaKrishna SRU waranagal telanga
PPTX
Introduction to Data Analytics
PPTX
INTRODUCTION TO DESCRIPTIVE ANALYTICS.pptx
PPTX
Data Analytics for UG students - What is data analytics and its importance
PPTX
Unit2.pptx Statistical Interference and Exploratory Data Analysis
PPTX
Unit-V-Introduction to Data Mining.pptx
PDF
Data analysis
PDF
Data Analysis, data types and interpretation.pdf
PDF
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
PPTX
Additional themes of data mining for Msc CS
PDF
Introduction to Data Analysis for researcher.pdf
PPTX
Data mining
PPTX
Data mining
PPTX
Data Processing & Explain each term in details.pptx
PDF
Understanding the Step-by-Step Data Science Process for Beginners | IABAC
 
Data Science and Analysis.pptx
Data Science topic and introduction to basic concepts involving data manageme...
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 5- Data Analysinbvs Techniques.pptx
Lecturer3 by RamaKrishna SRU waranagal telanga
Introduction to Data Analytics
INTRODUCTION TO DESCRIPTIVE ANALYTICS.pptx
Data Analytics for UG students - What is data analytics and its importance
Unit2.pptx Statistical Interference and Exploratory Data Analysis
Unit-V-Introduction to Data Mining.pptx
Data analysis
Data Analysis, data types and interpretation.pdf
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
Additional themes of data mining for Msc CS
Introduction to Data Analysis for researcher.pdf
Data mining
Data mining
Data Processing & Explain each term in details.pptx
Understanding the Step-by-Step Data Science Process for Beginners | IABAC
 
Ad

Recently uploaded (20)

PDF
top salesforce developer skills in 2025.pdf
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PPTX
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
Softaken Excel to vCard Converter Software.pdf
PPTX
Introduction to Artificial Intelligence
PDF
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
iTop VPN Free 5.6.0.5262 Crack latest version 2025
PPTX
Computer Software and OS of computer science of grade 11.pptx
PPTX
Reimagine Home Health with the Power of Agentic AI​
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PPTX
Why Generative AI is the Future of Content, Code & Creativity?
PDF
medical staffing services at VALiNTRY
PPTX
Transform Your Business with a Software ERP System
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
top salesforce developer skills in 2025.pdf
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Embracing Complexity in Serverless! GOTO Serverless Bengaluru
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Odoo Companies in India – Driving Business Transformation.pdf
Softaken Excel to vCard Converter Software.pdf
Introduction to Artificial Intelligence
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
How to Choose the Right IT Partner for Your Business in Malaysia
iTop VPN Free 5.6.0.5262 Crack latest version 2025
Computer Software and OS of computer science of grade 11.pptx
Reimagine Home Health with the Power of Agentic AI​
Upgrade and Innovation Strategies for SAP ERP Customers
Why Generative AI is the Future of Content, Code & Creativity?
medical staffing services at VALiNTRY
Transform Your Business with a Software ERP System
Navsoft: AI-Powered Business Solutions & Custom Software Development
Ad

data science, prior knowledge ,modeling, scatter plot

  • 2. Introduction to Data Science ‱ Data Science is a field that combines statistical methods, algorithms, and technology to extract insights from structured and unstructured data. ‱ It enables organizations to make data-driven decisions, predict trends, and improve efficiency. ‱ Data Science is a collection of techniques used to extract value from data.  Data – can be Simple array of a few numeric observations,complex matrix of millions of observations with thousands of variables.  Science – in data science indicates that the methods are evidence based, and are built on empirical knowledge, more specifically (on) historical observations.
  • 3. This discipline coexists and closely associated with ‱ database systems ‱ data engineering ‱ visualization ‱ data analysis ‱ experimentation and, ‱ business intelligence.
  • 4. Key features ‱ 1.2.1 Extracting Meaningful Patterns Data Science refers to the process of identifying and interpreting significant trends, relationships, or insights from data. This involves analyzing large datasets to discover patterns that can inform decision-making, predict future trends, or solve specific problems.
  • 5. Key Aspects of Extracting Meaningful Patterns  Data Mining: Techniques used to discover patterns in large datasets, often involving machine learning, statistical analysis, and database systems.  Statistical Analysis: Using statistical methods to identify relationships and trends within data.  Machine Learning Models: Employing algorithms to learn from data and make predictions or classify information.  Visualization Tools: Creating charts, graphs, and other visual aids to help identify patterns and make data more interpretable.  Feature Engineering: Selecting and transforming variables in the data to improve the performance of machine learning models.  Pattern Recognition: Detecting regularities and irregularities in data, which could indicate significant insights.
  • 6. Examples of Extracting Meaningful Patterns: ‱ Customer Segmentation: Analyzing purchase behavior to group customers into segments for targeted marketing. ‱ Fraud Detection: Identifying unusual transaction patterns that may indicate fraudulent activity. ‱ Predictive Maintenance: Using sensor data to predict equipment failures before they occur. ‱ Market Basket Analysis: Discovering product purchase combinations to optimize inventory and cross-selling strategies.
  • 7. ‱ 1.2.2 Building Representative Models  In statistics, a model is the representation of a relationship between variables in a dataset.  It describes how one or more variables in the data are related to other variables.  Modeling is a process in which a representative abstraction is built from the observed dataset.  For example, based on credit score, income level, and requested loan amount, a model can be developed to determine the interest rate of a loan. For this task, previously known observational data including credit score, income level, loan amount, and interest rate are needed.
  • 8. ‱ Fig. 1.3 shows the process of generating a model. Once the representative model is created, it can be used to predict the value of the interest rate, based on all the input variables.
  • 9. ‱ This model serves two purposes: – On the one hand, it predicts the output (interest rate) based on the new and unseen set of input variables (credit score, income level, and loan amount), – and on the other hand, the model can be used to understand the relationship between the output variable and all the input variables. ‱ For example, does income level really matter in determining the interest rate of a loan? Does income level matter more than credit score? What happens when income levels double or if credit score drops by 10 points? A Model can be used for both predictive and explanatory applications
  • 10. ‱ 1.2.3 Combination of Statistics, Machine Learning, and Computing Data Science refers to the integration of these three disciplines to extract meaningful insights from data, build predictive models, and implement solutions. Each of these fields contributes unique methodologies and tools, which together enable comprehensive data analysis and decision-making processes.
  • 11. ‱ 1.2.4 Learning Algorithms Data Science refers to the methods and techniques used to build models that can learn from data and make predictions or decisions. These algorithms enable machines to automatically improve their performance on a given task through experience, without being explicitly programmed for every possible scenario.
  • 12. ‱ Types of Learning Algorithms 1. Supervised Learning: Examples: ‱ Linear Regression: ‱ Logistic Regression ‱ Decision Trees ‱ Support Vector Machines (SVM) ‱ Neural Networks
  • 13. 2. Unsupervised Learning: – Examples: ‱ K-Means Clustering ‱ Hierarchical Clustering ‱ Principal Component Analysis (PCA) ‱ Anomaly Detection 3. Reinforcement Learning: – Examples: ‱ Q-Learning ‱ Deep Q-Networks (DQN) ‱ Policy Gradient Methods
  • 14. Benefits of Learning Algorithms: ‱ Scalability: Can handle large datasets and complex problems efficiently. ‱ Accuracy: Often provide high levels of predictive accuracy by leveraging vast amounts of data. ‱ Continuous Improvement: Capable of learning and improving over time with more data and feedback. ‱ Versatility: Applicable to a wide range of domains and industries, from healthcare to finance to entertainment.
  • 15. ‱ 1.2.5. Associated Fields ‱ We understood, data science covers a wide set of techniques, applications, and disciplines. ‱ There a few associated fields that data science heavily depends on. ‱ They are – Descriptive Statistics, – Exploratory Visualization, – Dimensional Slicing, – Hypothesis Testing, – Data Engineering and, – Business Intelligence.
  • 16. Descriptive statistics: ‱ Computing mean, standard deviation, correlation, and other descriptive statistics, quantify the aggregate structure of a dataset. ‱ This is essential information for understanding any dataset in order to understand the structure of the data and the relationships within the dataset. ‱ They are used in the exploration stage of the data science process
  • 17. ‱ Exploratory visualization: The process of expressing data in visual coordinates enables users to find patterns and relationships in the data and to comprehend large datasets. Similar to descriptive statistics, they are integral in the pre- and post-processing steps in data science ‱ Dimensional Slicing:  Online analytical processing (OLAP) applications, are widespread in organizations.  They mainly provide information on the data through dimensional slicing, filtering, and pivoting.  OLAP analysis is enabled by a unique database schema design where the data are organized as dimensions (e.g., products, regions, dates) and quantitative facts or measures (e.g., revenue, quantity).  With a well-defined database structure, it is easy to slice the yearly revenue by products or combination of region and products, for example.  These techniques are extremely useful and may reveal patterns in data. .
  • 18. Hypothesis Testing: ‱ It is a kind of statistical testing. ‱ In statistics, a hypothesis is a statement about a population that we want to verify based on information contained in the sample data. ‱In general, data science is a process where many hypotheses are generated and tested based on observational data. ‱Since the data science algorithms are iterative, solutions can be refined in each step. Steps usually followed in hypothesis testing are: 1. Figure out the null hypothesis, 2. State the null hypothesis, 3. Choose what kind of test we need to perform, 4. Either support or reject the null hypothesis.
  • 19. Data engineering:  Data engineering is the process of sourcing, organizing, assembling, storing, and distributing data for effective analysis and usage.  Database engineering, distributed storage, and computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel computing, extraction transformation and loading processing, and data warehousing constitute data engineering techniques.  Data engineering helps source and prepare for data science learning algorithms. Business Intelligence:  Business intelligence helps organizations consume data effectively.  It helps query the ad hoc data use dashboards or visualizations to communicate the facts and trends.  Historical trends are usually reported, but in combination with data science, both the past and the predicted future data can be combined. BI can hold and distribute the results of data science.
  • 20. DATA SCIENCE CLASSIFICATION ‱ Data science problems can be broadly categorized into supervised or unsupervised learning models. ‱ Supervised or directed data science tries to infer a function or relationship based on labelled training data and uses this function to map new unlabelled data. ‱ The model generalizes the relationship between the input and output variables and uses it to predict for a dataset where only input variables are known. The output variable that is being predicted is also called a class label or target variable.
  • 21. ‱ Unsupervised or undirected data science uncovers hidden patterns in unlabelled data. ‱ In unsupervised data science, there are no output variables to predict. The objective of this class of data science techniques, is to find patterns in data based on the relationship between data points themselves. Data science problems can also be classified into tasks such as: 1. Classification 2. Regression 3. Association Analysis 4. Clustering 5. Anomaly Detection 6. Recommendation Engines 7. Feature Selection 8. Time Series Forecasting 9. Deep Learning 10. Text Mining.
  • 23. ‱ Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset. ‱ In regression tasks, the output variable is numeric (e.g.,the mortgage interest rate on a loan). ‱ Classification tasks predict output variables, which are categorical or polynomial (e.g., the yes or no decision to approve a loan).
  • 24. Predict whether a customer is eligible for a loan? Predict the price of the car? Predict the Indian team will win or lose ? Predict weather for next 24 hours?
  • 25. ‱ Clustering is the process of identifying the natural groupings in a dataset. For example, clustering is helpful in finding natural clusters in customer datasets,which can be used for market segmentation. ‱ Since this is unsupervised datascience, it is up to the end user to investigate why these clusters are formed in the data and generalize the uniqueness of each cluster.
  • 26. ‱ Deep Learning is based on artificial neural networks used for classification and regression problems. ‱ In retail analytics,it is common to identify pairs of items that are purchased together, so that specific items can be bundled or placed next to each other. This task is called market basket analysis or association analysis, which is commonly used in cross selling. ‱ Recommendation engines are the systems that recommend items to the users based on individual user preference.
  • 27. ‱ Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset. Credit card transaction fraud detection is one of the most prolific applications of anomaly detection. ‱ Time series forecasting is the process of predicting the future value of a variable (e.g., temperature) based on past historical values that may exhibit a trend and seasonality.
  • 28. ‱ Text mining is a data science application where the input data is text, which can be in the form of documents, messages, emails, or web pages. ‱ To aid data science on text data, the text files are first converted into document vectors, where each unique word is an attribute. ‱ Once the text file is converted to document vectors, standard data science tasks such as classification, clustering, etc., can be applied. ‱ Feature selection is a process in which the attributes in a dataset are reduced to a few attributes that really matter.
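A minimal sketch of turning text into document vectors, assuming scikit-learn's CountVectorizer; the three example sentences are made up:

    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["loan approved for customer",
                 "customer requested a new loan",
                 "interest rate increased this month"]

    vectorizer = CountVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())   # each unique word becomes an attribute
    print(doc_vectors.toarray())                # one document vector per row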
  • 30. Data Science Process ‱ The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process. ‱ The standard data science process involves o understanding the problem, o preparing the data samples, o developing the model, o applying the model to a dataset, and o deploying and maintaining the models.
  • 31. ‱ One of the most popular data science process frameworks is the Cross Industry Standard Process for Data Mining (CRISP-DM). ‱ This framework was developed by a consortium of companies involved in data mining. ‱ The CRISP-DM process is the most widely adopted framework for developing data science solutions.
  • 32. Fig. 2.1 provides a visual overview of the CRISP-DM framework.
  • 33. ‱ The problem at hand could be a segmentation of customers, a prediction of climate patterns, or a simple data exploration. ‱ The learning algorithm used to solve the business question could be a decision tree, an artificial neural network, or a scatterplot. ‱ The software tool used to develop and implement the data science algorithm could be custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, or Python, to mention a few (Piatetsky, 2018).
  • 35. 2.1 PRIOR KNOWLEDGE ‱ The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem. – Objective ‱ The data science process starts with a need for analysis, a question, or a business objective. This is possibly the most important step in the data science process (Shearer, 2000). Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm. – Subject Area ‱ The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes. The problem is that it uncovers many patterns, and false or spurious signals are a major concern in the data science process. It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer of the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
  • 36. ‱ Data – Similar to the prior knowledge in the subject area, prior knowledge in the data can also be gathered. – Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. – There is quite a range of factors to consider: quality of the data, quantity of data, availability of data, gaps in the data, and whether a lack of data compels the practitioner to change the business question. – The objective of this step is to come up with a dataset to answer the business question through the data science process. – It is critical to recognize that an inferred model is only as good as the data used to create it.
  • 37. ‱ A dataset (example set) is a collection of data with a well-defined structure; this structure is also sometimes referred to as a “data frame”. ‱ A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point, and each instance contains the same structure as the dataset. ‱ An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. ‱ Attributes can be numeric, categorical, date-time, text, or Boolean data types. In this example, both the credit score and the interest rate are numeric attributes. ‱ A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In the table, the interest rate is the output variable. ‱ Identifiers are special attributes that are used for locating or providing context to individual records. For example, common attributes like names, account numbers, and employee ID numbers are identifier attributes.
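A hypothetical loan dataset expressed as a data frame, assuming pandas is available; the column names and values are invented purely for illustration:

    import pandas as pd

    # Each row is a data point; each column is an attribute.
    df = pd.DataFrame({
        "borrower_id":   [101, 102, 103],              # identifier attribute
        "credit_score":  [500, 600, 700],              # numeric input attribute
        "income_level":  ["low", "medium", "high"],    # categorical input attribute
        "interest_rate": [9.5, 7.2, 6.1],              # label / output attribute
    })

    print(df.shape)    # (number of data points, number of attributes)
    print(df.dtypes)   # data type of each attribute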
  • 39. 2.2 DATA PREPARATION ‱ Preparing the dataset to suit a data science task is the most time-consuming part of the process. ‱ It is extremely rare that datasets are available in the form required by the data science algorithms. ‱ Most data science algorithms require data to be structured in a tabular format, with records in the rows and attributes in the columns. ‱ If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition the data into the required structure.
  • 40. 2.2.1 Data Exploration ‱ Data exploration, also known as exploratory data analysis, provides a set of simple tools to achieve a basic understanding of the data. ‱ Data exploration approaches involve computing descriptive statistics and visualization of the data. ‱ They can expose the structure of the data, the distribution of the values, the presence of extreme values, and the inter-relationships within the dataset. ‱ Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of the data.
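A small sketch of data exploration with descriptive statistics, assuming pandas; the credit score and interest rate values are invented:

    import pandas as pd

    df = pd.DataFrame({"credit_score":  [500, 600, 700, 650, 580],
                       "interest_rate": [9.5, 7.2, 6.1, 6.8, 7.9]})

    print(df.describe())                                       # mean, std, min, quartiles, max per attribute
    print(df["credit_score"].median())                         # median
    print(df["credit_score"].max() - df["credit_score"].min()) # range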
  • 41. 2.2.2 Data Quality ‱ Data quality is an ongoing concern wherever data is collected, processed, and stored. ‱ Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data and store them in companywide repositories called data warehouses. ‱ Data sourced from well-maintained data warehouses have higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data. ‱ The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, substitution of missing values, etc.
  • 42. 2.2.3 Missing Values ‱ One of the most common data quality issues is that some records have missing attribute values. ‱ For example, a credit score may be missing in one of the records. There are several different mitigation methods to deal with this problem, but each method has pros and cons. The first step of managing missing values is to understand why the values are missing. Tracking the data lineage (provenance) of the data source can lead to the identification of systemic issues during data capture or errors in data transformation. ‱ Knowing the source of the missing values will often guide which mitigation methodology to use. The missing value can be substituted with artificial data so that the issue can be managed with marginal impact on the later steps in the data science process. ‱ Missing credit score values can be replaced with a credit score derived from the dataset (mean, minimum, or maximum value, depending on the characteristics of the attribute). This method is useful if the missing values occur randomly and the frequency of occurrence is quite rare. ‱ Alternatively, to build the representative model, all the data records with missing values or records with poor data quality can be ignored. This method reduces the size of the dataset.
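A sketch of the two mitigation options above (substitute the attribute mean, or ignore the record), assuming pandas and NumPy; the records are invented:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"credit_score":  [500, np.nan, 700, 650],
                       "interest_rate": [9.5, 7.2, 6.1, 6.8]})

    # Option 1: substitute the missing credit score with the attribute mean.
    imputed = df.fillna({"credit_score": df["credit_score"].mean()})

    # Option 2: ignore records with missing values (reduces the dataset size).
    dropped = df.dropna()

    print(imputed)
    print(dropped)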
  • 43. 2.2.4 Data Types and Conversion ‱ The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score. ‱ Different data science algorithms impose different restrictions on the attribute data types. ‱ In the case of linear regression models, the input attributes have to be numeric. If the available data are categorical, they must be converted to continuous numeric attributes. ‱ A specific numeric score can be encoded for each category value, such as poor = 400, good = 600, excellent = 700, etc. ‱ Similarly, numeric values can be converted to categorical data types by a technique called binning, where a range of values is specified for each category; for example, a score between 400 and 500 can be encoded as “low” and so on.
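A sketch of both conversions, assuming pandas; the category-to-score mapping and the bin boundaries are illustrative choices, not fixed rules:

    import pandas as pd

    df = pd.DataFrame({"credit_category": ["poor", "good", "excellent"],
                       "credit_score":    [420, 610, 720]})

    # Categorical to numeric: encode a representative score for each category.
    score_map = {"poor": 400, "good": 600, "excellent": 700}
    df["category_as_score"] = df["credit_category"].map(score_map)

    # Numeric to categorical: binning score ranges into labels.
    df["score_as_category"] = pd.cut(df["credit_score"],
                                     bins=[400, 500, 650, 800],
                                     labels=["low", "medium", "high"])
    print(df)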
  • 44. ‱ 2.2.5 Transformation ‱ In some data science algorithms like k-NN, the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates the distance between the data points. ‱ Normalization prevents one attribute from dominating the distance results because of large values. For example, consider income (expressed in USD, in thousands) and credit score (in hundreds). ‱ The distance calculation will always be dominated by slight variations in income. ‱ One solution is to convert the ranges of income and credit score to a more uniform scale from 0 to 1 by normalization. This way, a consistent comparison can be made between the two attributes with different units.
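A sketch of min-max normalization to the 0-1 range, assuming pandas; the income and credit score values are invented:

    import pandas as pd

    df = pd.DataFrame({"income":       [45, 120, 250, 80],     # in thousands of USD
                       "credit_score": [520, 680, 760, 600]})  # in hundreds

    # Min-max normalization rescales every attribute to the range 0 to 1.
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)   # both attributes now contribute comparably to distance calculations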
  • 45. 2.2.6 Outliers ‱ Outliers are anomalies in a given dataset. ‱ Outliers may occur because of correct data capture (a few people with incomes in the tens of millions) or erroneous data capture (human height recorded as 1.73 cm instead of 1.73 m). ‱ Regardless, the presence of outliers needs to be understood and will require special treatment. ‱ The purpose of creating a representative model is to generalize a pattern or a relationship within a dataset, and the presence of outliers skews the representativeness of the inferred model. ‱ Detecting outliers may be the primary purpose of some data science applications, like fraud or intrusion detection.
  • 46. 2.2.7 Feature Selection Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection. It leads to a more simplified model and helps to synthesize a more effective explanation of the model. 2.2.8 Data Sampling Sampling is a process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample data serve as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the build process of the model.
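A sketch of random sampling of records, assuming pandas; the dataset and the 10% sampling fraction are illustrative only:

    import pandas as pd

    df = pd.DataFrame({"credit_score":  range(1000),
                       "interest_rate": range(1000)})

    sample = df.sample(frac=0.1, random_state=42)    # keep 10% of the records
    print(len(sample))                               # 100 records
    print(sample["credit_score"].mean(), df["credit_score"].mean())  # similar means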
  • 47. 2.3 Model A model is the abstract representation of the data and the relationships in a given dataset. A simple rule of thumb like “mortgage interest rate reduces with increase in credit score” is a model; although there is not enough quantitative information to use in a production scenario, it provides directional information by abstracting the relationship between credit score and interest rate. There are a few hundred data science algorithms in use today, derived from statistics, machine learning, pattern recognition, and the body of knowledge related to computer science.
  • 48. 2.3.1 Training and Testing Datasets The modeling step creates a representative model inferred from the data. The dataset used to create the model, with known attributes and target, is called the training dataset. The validity of the created model will also need to be checked with another known dataset called the test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for training and one-third as a test dataset.
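A sketch of the two-thirds/one-third split, assuming scikit-learn and using the Iris data purely as an example dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=1)

    print(len(X_train), len(X_test))   # 100 training records, 50 test records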
  • 50. 2.3.2 Learning Algorithms The business question and the availability of data will dictate which data science task (association, classification, regression, etc.) can be used. The practitioner then determines the appropriate data science algorithm within the chosen category. For example, within a classification task, many algorithms can be chosen from: decision trees, rule induction, neural networks, Bayesian models, k-NN, etc. Likewise, within decision tree techniques, there are quite a number of variations of learning algorithms, like classification and regression tree (CART), Chi-squared Automatic Interaction Detector (CHAID), etc.
  • 51. 2.3.3 Evaluation of the Model A model should not memorize and output the same values that are in the training records. The phenomenon of a model memorizing the training data is called overfitting. An overfitted model just memorizes the training records and will underperform on real, unlabeled new data. The model should instead generalize or learn the relationship between credit score and interest rate. To evaluate this relationship, the validation or test dataset, which was not previously used in building the model, is used for evaluation.
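A sketch of checking for overfitting by comparing training and test performance, assuming scikit-learn; the decision tree and Iris data are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=1)

    model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    print("training accuracy:", model.score(X_train, y_train))  # often close to 1.0
    print("test accuracy:", model.score(X_test, y_test))        # the honest estimate of generalization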
  • 52. 2.3.4 Ensemble Modeling Ensemble modeling is a process where multiple diverse base models are used to predict an outcome. The motivation for using ensemble models is to reduce the generalization error of the prediction. 2.4 APPLICATION Deployment is the stage at which the model becomes production ready or live. In business applications, the results of the data science process have to be assimilated into the business process—usually in software applications. The model deployment stage has to deal with: assessing model readiness, technical integration, response time, model maintenance, and assimilation.
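A sketch of an ensemble of many diverse base trees (a random forest) evaluated by cross-validation, assuming scikit-learn; the Iris data and 100 trees are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=1)  # 100 diverse base trees
    print(cross_val_score(forest, X, y, cv=5).mean())                  # averaged accuracy estimate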
  • 53. 2.4.1 Production Readiness The production readiness part of the deployment determines the critical qualities required for the deployment objective. 2.4.2 Technical Integration Technical integration in the data science process involves integrating various technologies, tools, and platforms to facilitate and streamline each stage of the process; integrating these tools and technologies ensures an efficient workflow, enabling data scientists to focus on extracting insights and building robust models. 2.4.3 Response Time 2.4.4 Model Refresh 2.4.5 Assimilation
  • 54. 2.5 KNOWLEDGE ‱ The data science process provides a framework to extract nontrivial information from data. With the advent of massive storage, increased data collection, and advanced computing paradigms, the available datasets to be utilized are only increasing. ‱ To extract knowledge from these massive data assets, advanced approaches need to be employed, like data science algorithms, in addition to standard business intelligence reporting or statistical analysis. ‱ Data science, like any other technology, provides various options in terms of algorithms and parameters within the algorithms. Using these options to extract the right information from data is a bit of an art and can be developed with practice. ‱ The data science process starts with prior knowledge and ends with posterior knowledge, which is the incremental insight gained. ‱ It is the difference between the information gained through the data science process and the insights from basic data analysis. Finally, the whole data science process is a framework to invoke the right questions (Chapman et al., 2000) and provide guidance, through the right approaches, to solve a problem.
  • 55. Data Exploration ‱ Data exploration can be broadly classified into two types: descriptive statistics and data visualization. ‱ Descriptive statistics is the process of condensing key characteristics of the dataset into simple numeric metrics. ‱ Some of the common quantitative metrics used are mean, standard deviation, and correlation. ‱ Visualization is the process of projecting the data, or parts of it, into multi-dimensional space or abstract images. All the useful (and adorable) charts fall under this category. ‱ Data exploration in the context of data science uses both descriptive statistics and visualization techniques.
  • 56. OBJECTIVES OF DATA EXPLORATION ‱ Data understanding ‱ Data preparation ‱ Data science tasks ‱ Interpreting the results
  • 57. Types of Data ‱ Numeric or Continuous ‱ Categorical or Nominal UNIVARIATE ANALYSIS Univariate analysis is the simplest form of analyzing data. “Uni” means “one”; in other words, the data has only one variable. It doesn’t deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
  • 58. Ways to describe patterns found in univariate data: 1. Central tendency: mean, mode, median. 2. Dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation. 3. Count / null count. A sketch computing these measures follows.
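A sketch computing these univariate measures, assuming pandas; the scores reuse the math-score example that appears later in this material:

    import pandas as pd

    scores = pd.Series([59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
                        76, 77, 81, 82, 84, 87, 90, 95, 98])

    print(scores.mean(), scores.median(), list(scores.mode()))     # central tendency
    print(scores.max() - scores.min(), scores.var(), scores.std()) # range and dispersion
    print(scores.quantile([0.25, 0.5, 0.75]))                      # quartiles
    print(scores.count(), scores.isna().sum())                     # count / null count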
  • 60. Multivariate Exploration ‱ Multivariate exploration is the study of more than one attribute in the dataset simultaneously. This technique is critical to understanding the relationship between the attributes, which is central to data science methods. ‱ Central Data ‱ In the Iris dataset, each data point can be expressed as a set of all four attributes: observation: {sepal length, sepal width, petal length, petal width} ‱ For example, observation one: {5.1, 3.5, 1.4, 0.2}. This observation point can also be expressed in four-dimensional Cartesian coordinates and can be plotted in a graph (although plotting more than three dimensions in a visual graph can be challenging). In this way, all 150 observations can be expressed in Cartesian coordinates. If the objective is to find the most “typical” observation point, it would be a data point made up of the mean of each attribute in the dataset, computed independently. For the Iris data shown, the central mean point is {5.006, 3.418, 1.464, 0.244}. This data point may not be an actual observation; it is a hypothetical data point with the most typical attribute values.
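A sketch of computing the attribute-wise mean point, assuming scikit-learn provides the Iris data; note that the exact values depend on which observations are included (the full 150 observations versus the sample shown on the slide):

    import numpy as np
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)
    central_point = X.mean(axis=0)        # mean of each of the four attributes
    print(np.round(central_point, 3))     # a hypothetical, most "typical" observation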
  • 61. Correlation ‱ Correlation measures the statistical relationship between two attributes, particularly the dependence of one attribute on another. ‱ When two attributes are highly correlated with each other, they both vary at the same rate, either in the same or in opposite directions. ‱ For example, consider the average temperature of the day and ice cream sales. Statistically, two attributes that are correlated are dependent on each other, and one may be used to predict the other. If there are sufficient data, future sales of ice cream can be predicted if the temperature forecast is known. However, correlation between two attributes does not imply causation, that is, one doesn’t necessarily cause the other. Ice cream sales and shark attacks are correlated; however, there is no causation. Both ice cream sales and shark attacks are influenced by a third attribute: the summer season. Generally, ice cream sales spike as temperatures rise, and as more people go to beaches during summer, encounters with sharks become more probable.
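A sketch of computing the (Pearson) correlation between two attributes, assuming pandas; the temperature and sales figures are invented:

    import pandas as pd

    df = pd.DataFrame({"avg_temp_c":      [18, 22, 25, 30, 33, 35],
                       "ice_cream_sales": [120, 150, 180, 240, 260, 300]})

    print(df["avg_temp_c"].corr(df["ice_cream_sales"]))   # Pearson correlation, close to +1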
  • 62. DATA VISUALIZATION ‱ Visualizing data is one of the most important techniques of data discovery and exploration. ‱ Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends, and correlations that might not otherwise be detected can be exposed. ‱ Vision is one of the most powerful senses in the human body. As such, it is intimately connected with cognitive thinking. Human vision is trained to discover patterns and anomalies even in the presence of a large volume of data. However, the effectiveness of the pattern detection depends on how effectively the information is visually presented. Hence, selecting suitable visuals to explore data is critically important in discovering and comprehending hidden patterns in the data. ‱ As with descriptive statistics, visualization techniques are categorized into: univariate visualization, multivariate visualization, and visualization of a large number of attributes using parallel dimensions.
  • 63. Univariate Visualization Visual exploration starts with investigating one attribute at a time using univariate charts. The techniques discussed in this section give an idea of how the attribute values are distributed and the shape of the distribution. Histogram ‱ A histogram is one of the most basic visualization techniques to understand the frequency of the occurrence of values. ‱ It shows the distribution of the data by plotting the frequency of occurrence in a range. ‱ In a histogram, the attribute under inquiry is shown on the horizontal axis and the frequency of occurrence is on the vertical axis. ‱ For a continuous numeric data type, the range or binning value used to group a range of values needs to be specified. ‱ For example, in the case of human height in centimetres, all the occurrences between 152.00 and 152.99 are grouped under 152. ‱ There is no optimal number of bins or bin width that works for all distributions. If the bin width is too small, the distribution becomes more precise but reveals noise due to sampling. ‱ A general rule of thumb is to set the number of bins equal to the square root or cube root of the number of data points.
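A sketch of a histogram with the square-root rule for the number of bins, assuming NumPy and matplotlib; the height data are randomly generated for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    heights_cm = np.random.default_rng(1).normal(loc=170, scale=8, size=400)
    bins = int(np.sqrt(len(heights_cm)))   # square-root rule of thumb: 20 bins

    plt.hist(heights_cm, bins=bins)
    plt.xlabel("Height (cm)")              # attribute under inquiry
    plt.ylabel("Frequency")                # frequency of occurrence
    plt.show()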
  • 65. Quartile ‱ A quartile is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations. ‱ A quartile divides data into three points—a lower quartile, median, and upper quartile—to form four groups of the dataset. ‱ The lower quartile, or first quartile, is denoted as Q1 and is the middle number that falls between the smallest value of the dataset and the median. The second quartile, Q2, is also the median. The upper or third quartile, denoted as Q3, is the central point that lies between the median and the highest number of the distribution. ‱ Each quartile contains 25% of the total observations. Generally, the data is arranged from smallest to largest:  First quartile: the lowest 25% of numbers  Second quartile: between 25.1% and 50% (up to the median)  Third quartile: 50.1% to 75% (above the median)  Fourth quartile: the highest 25% of numbers
  • 66. Suppose the distribution of math scores in a class of 19 students in ascending order is: 59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98 First, mark down the median, Q2, which in this case is the 10th value: 75. Q1 is the central point between the smallest score and the median; here it is the median of the lower nine scores, i.e., the fifth score: 68. (Note that the median can also be included when calculating Q1 or Q3 for an odd set of values. If the median is included on either side of the middle point, then Q1 is the middle value between the first and 10th score, which is the average of the fifth and sixth score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5.) Q3 is the middle value between Q2 and the highest score: 84. (Or, if the median is included, Q3 = (82 + 84)/2 = 83.) Now that the quartiles are known, they can be interpreted. A score of 68 (Q1) represents the first quartile and is the 25th percentile: 68 is the median of the lower half of the score set, that is, the median of the scores from 59 to 75. Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the median) is the 50th percentile and shows that 50% of the scores are less than 75 and 50% of the scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater than 84 and 75% are less.
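The same quartiles can be computed with NumPy; its default linear interpolation corresponds to the variant that includes the median region, giving 68.5 and 83 rather than 68 and 84:

    import numpy as np

    scores = np.array([59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
                       76, 77, 81, 82, 84, 87, 90, 95, 98])

    q1, q2, q3 = np.percentile(scores, [25, 50, 75])
    print(q1, q2, q3)   # 68.5 75.0 83.0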
  • 69. Box plots ‱ In descriptive statistics, a box plot or boxplot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis. Box plots visually show the distribution of numerical data and skewness by displaying the data quartiles (or percentiles) and averages. ‱ Box plots show the five-number summary of a set of data: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
  • 70. ‱ Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker). ‱ Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile). ‱ Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less. ‱ Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of the data are above this value. ‱ Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker). ‱ Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% of scores and the upper 25% of scores). ‱ The Interquartile Range (IQR): The box itself, showing the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).
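A sketch of a box plot of the earlier math scores, assuming matplotlib:

    import matplotlib.pyplot as plt

    scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
              76, 77, 81, 82, 84, 87, 90, 95, 98]

    plt.boxplot(scores, vert=False)   # box spans Q1 to Q3; the line inside marks the median
    plt.xlabel("Math score")
    plt.show()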
  • 71. Distribution Chart ‱ For continuous numeric attributes like petal length, instead of visualizing the actual data in the sample, its normal distribution function can be visualized instead. The normal distribution function of a continuous random variable x is f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), where μ is the mean of the distribution and σ is the standard deviation of the distribution. Here an inherent assumption is being made that the measurements of petal length (or any continuous variable) follow the normal distribution; hence, its distribution can be visualized instead of the actual values. The normal distribution is also called the Gaussian distribution or “bell curve” due to its bell shape.
  • 73. Multivariate Visualization ‱ The multivariate visual exploration considers more than one attribute in the same visual. The techniques discussed in this section focus on the relationship of one attribute with another attribute. The visualizations examine two to four attributes simultaneously. ‱ Scatterplot A scatterplot is one of the most powerful yet simple visual plots available. In a scatterplot, the data points are marked in Cartesian space with attributes of the dataset aligned with the coordinates. The attributes are usually of continuous data type. One of the key observations that can be concluded from a scatterplot is the existence of a relationship between two attributes under inquiry. If the attributes are linearly correlated, then the data points align closer to an imaginary straight line; if they are not correlated, the data points are scattered. Apart from basic correlation, scatterplots can also indicate the existence of patterns or groups of clusters in the data and identify outliers in the data. This is particularly useful for low-dimensional datasets.
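A sketch of a scatterplot of two Iris attributes, colored by species, assuming scikit-learn and matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

    plt.scatter(petal_length, petal_width, c=iris.target)   # color encodes the species
    plt.xlabel("Petal length (cm)")
    plt.ylabel("Petal width (cm)")
    plt.show()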
  • 74. Scatter Multiple ‱ If the dataset has more than two attributes, it is important to look at combinations of all the attributes through a scatterplot. A scatter matrix solves this need by comparing all combinations of attributes with individual scatterplots and arranging these plots in a matrix. ‱ A scatter matrix for all four attributes in the Iris dataset is shown in Fig. The color of the data point is used to indicate the species of the flower. Since there are four attributes, there are four rows and four columns, for a total of 16 scatter charts. Charts on the diagonal are a comparison of an attribute with itself; hence, they are eliminated. Also, the charts below the diagonal are mirror images of the charts above the diagonal. In effect, there are six distinct comparisons in the scatter multiples of four attributes. Scatter matrices provide an effective visualization of comparative, multivariate, high-density data displayed in small multiples of similar scatterplots.
  • 75. Bubble chart A bubble chart is a variation of a simple scatterplot with the addition of one more attribute, which is used to determine the size of the data point. In the Iris dataset, petal length and petal width are used for the x- and y-axes, respectively, and sepal width is used for the size of the data point. The color of the data point represents the species class label.
  • 76. Density charts Density charts are similar to scatterplots, with one more dimension included as a background color. The data point can also be colored to visualize one more dimension, and hence, a total of four dimensions can be visualized in a density chart. In the example in Fig. 3.14, petal length is used for the x-axis, sepal length for the y-axis, sepal width for the background color, and the class label for the data point color.