Data Science With Python - Lesson 02 - Data Analytics Overview

The document provides an overview of the data analytics process and techniques. It describes data analytics as combining processes to extract information from datasets to address business problems or questions. The key steps involve acquiring data from various sources, wrangling and exploring the data which includes cleaning, manipulating, aggregating and reshaping it, and using exploratory techniques like quantitative analysis, graphical visualizations, and hypothesis testing to select appropriate models and draw conclusions or predictions. Challenges in the data wrangling phase include dealing with unexpected formats, errors, large volumes, and determining relationships in the data.

Uploaded by

Swarnajyoti Mazumdar

Data Science with Python

Data Analytics Overview

Learning Objectives

By the end of this lesson, you will be able to:

Describe the data analytics process and its steps

List the skills and tools required for data analysis

Understand the challenges of the data analytics process

Explain the exploratory data analysis (EDA) technique

Illustrate data visualization techniques

Describe hypothesis testing

Data Analytics
Why Data Analytics

Data by itself is just an information source. Unless you understand it, you will not be able to use it effectively.

Date    Description                    Deposit   Withdrawal   Balance
Apr 1   ATM Post Debit                           100          $200,000
Apr 2   PayPal Transfer 231054         200                    $202,000
Apr 3   Simplilearn course fee                   150          $200,500
Apr 4   Starbucks Café                           210          $198,400
Apr 5   Walmart TX                               230          $196,100
Apr 6   eBay swiss watch 239                     250          $193,600
Apr 7   Caterpillar black boots men              270          $190,900
Apr 8   Halo blue shirt 831                      160          $189,300

Information source; overall patterns not clearly visible

Why Data Analytics

When the transaction details are presented as a line chart, the deposit and withdrawal patterns become apparent. It helps view and analyze general trends and discrepancies.

[Figure: line chart of the transactions, with the overall pattern and a discrepancy marked]
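
As a rough sketch of the same idea without a chart, the day-to-day change in the Balance column of the transaction table can be computed directly. The variable names are illustrative, and only the Python standard library is assumed; a plotting library such as Matplotlib would normally draw the line chart itself.

```python
# Balance column from the transaction table, Apr 1 to Apr 8.
balances = [200_000, 202_000, 200_500, 198_400, 196_100, 193_600, 190_900, 189_300]

# Day-to-day change: positive means the balance rose, negative means it fell.
changes = [later - earlier for earlier, later in zip(balances, balances[1:])]

# Days of April on which the balance rose (index 0 is the change on Apr 2).
rising_days = [i + 2 for i, change in enumerate(changes) if change > 0]

print(changes)
print(rising_days)
```

The single rising day stands out against the run of falling balances, which is the kind of pattern that is hard to see in the raw table.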
Introduction to Data Analytics

Data Analytics is a combination of processes to extract information from datasets.

Data Analytics Process
Business Problem

The process of analytics begins with questions or business problems of stakeholders.

Example areas: sales, inventory, customers, traffic volume.

Business problems trigger the need to analyze data and find answers.
Data Acquisition

Collect data from various sources for analysis to answer the question raised in step 1.

Data sources:
• Twitter, Facebook, LinkedIn, and other social media and information sites provide streaming APIs.
• Server logs can be extracted from enterprise system servers to analyze and optimize application performance.

Data Scientist Expertise:
• File handling
• Database skills
• File formats
• Web scraping
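
A minimal sketch of two of these skills, file handling and database access, using only the Python standard library. The CSV text, table name, and values are made up for illustration; a real project would read an actual file or connect to an actual database.

```python
import csv
import io
import sqlite3

# File handling: parse CSV text (an in-memory stand-in for a real file).
csv_text = (
    "date,description,amount\n"
    "Apr 2,PayPal Transfer,200\n"
    "Apr 4,Starbucks Cafe,-210\n"
)
rows_from_file = list(csv.DictReader(io.StringIO(csv_text)))

# Database skills: query a table (an in-memory SQLite database for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("TX", 12), ("CA", 30)])
rows_from_db = conn.execute(
    "SELECT region, units FROM sales ORDER BY region"
).fetchall()
conn.close()

print(rows_from_file[0]["description"])
print(rows_from_db)
```

Note that `csv.DictReader` returns every field as a string, so numeric columns still need conversion during data wrangling.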
Data Wrangling and Exploration

Data wrangling is the most important phase of the data analytics process.

Data Wrangling: data cleansing, data manipulation
Data Exploration: data discovery, data pattern

Data Wrangling: Challenges

This phase includes data cleansing, data manipulation, data aggregation, data split, and reshaping of data.

Causes of challenges in the data wrangling phase:

• Unexpected data format

• Erroneous data
• Voluminous data to be manipulated
• Classifying data into linear or clustered
• Determining relationship between observation, feature, and response

Data wrangling is the most challenging phase and takes up 70% of the Data Scientist’s time.
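
The cleansing, manipulation, and aggregation steps can be sketched as follows. The records, field names, and parsing rules are hypothetical, and the standard library is used here only to keep the example self-contained; Pandas is the more typical tool for this work.

```python
from collections import defaultdict

# Hypothetical raw records showing the kinds of problems listed above.
raw = [
    {"store": "TX", "amount": "230"},
    {"store": "tx ", "amount": "$1,200"},  # unexpected format
    {"store": "CA", "amount": "n/a"},      # erroneous value
    {"store": "CA", "amount": "150"},
]


def clean_amount(text):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(text.replace("$", "").replace(",", ""))
    except ValueError:
        return None


# Cleansing and manipulation: normalize store codes, drop unparseable rows.
cleaned = []
for record in raw:
    amount = clean_amount(record["amount"])
    if amount is not None:
        cleaned.append({"store": record["store"].strip().upper(), "amount": amount})

# Aggregation: total amount per store.
totals = defaultdict(float)
for record in cleaned:
    totals[record["store"]] += record["amount"]

print(dict(totals))
```

Dropping bad rows is only one possible policy; depending on the business problem, erroneous values might instead be corrected or imputed.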
Data Exploration: Model Selection

Model selection
• Based on the overall data analysis process
• Should be accurate to avoid iterations
• Depends on pattern identification and algorithms
• Depends on hypothesis building and testing
• Leads to building mathematical statistical functions
Exploratory Data Analysis (EDA)

Approach: The EDA approach studies the data to recommend suitable models that best fit the data.
Focus: The focus is on the data: its structure, outliers, and models suggested by the data.
Assumptions: EDA techniques make minimal or no assumptions. They present and show all the underlying data without any data loss.
EDA techniques:
• Quantitative: Provides numeric outputs for the inputted data
• Graphical: Uses statistical functions for graphical output
EDA: Quantitative Technique

The quantitative technique has two goals: measuring central tendency and measuring the spread of data.

Measurement of Central Tendency

• Mean: The mean is the point that indicates how centralized the data points are. Suitable for symmetric distributions.
• Median: The median is the exact middle value. Suitable for skewed distributions and for catching outliers in the dataset, because extreme values pull the mean but barely move the median.
• Mode: The mode is the most common value in the data (frequency).

Measurement of Spread

• Variance: Variance is approximately the mean of the squares of the deviations from the mean.
• Standard deviation: Standard deviation is the square root of the variance.
• Inter-quartile range: The inter-quartile range is the distance between the 75th and 25th percentiles. It is essentially the spread of the middle 50% of the data.
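
These measures can be computed with Python's built-in `statistics` module, as in the sketch below. The sample values are invented, and the "inclusive" quantile method is one of several conventions, chosen here only for illustration.

```python
import statistics

# Hypothetical sample with one large outlier (95).
data = [2, 3, 3, 4, 5, 6, 95]

# Central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Spread: sample variance divides by n - 1, hence "approximately the mean"
# of the squared deviations.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Inter-quartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode)
print(variance, std_dev, iqr)
```

The outlier drags the mean far above the median, while the median and the inter-quartile range are barely affected by it.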
EDA: Graphical Technique

Histograms and scatter plots are two popular graphical techniques to depict data.

A histogram graphically summarizes the distribution of a univariate dataset.

It shows:
• the center or location of data (mean, median, or mode)
• the spread of data
• the skewness of data
• the presence of outliers
• the presence of multiple modes in the data

[Figure: histogram with miles per gallon on the x-axis and frequency on the y-axis]
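
The counting behind a histogram can be sketched in a few lines. The miles-per-gallon readings and the bin width are hypothetical; in practice a library such as Matplotlib would both bin and draw the data.

```python
from collections import Counter

# Hypothetical miles-per-gallon readings.
mpg = [12, 14, 15, 15, 16, 17, 18, 18, 19, 21, 22, 24]
bin_width = 5

# Map each value to the lower edge of its bin, then count values per bin.
counts = Counter((value // bin_width) * bin_width for value in mpg)

# Print a simple text histogram, one row per bin.
for lower in sorted(counts):
    print(f"{lower:>2}-{lower + bin_width - 1:<2} {'#' * counts[lower]}")
```

The tallest row marks the center of the data, and the number of occupied bins gives a rough sense of its spread.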

A scatter plot represents the relationship between two variables.

It can answer these questions visually:

• Are variables X and Y related?

• Are variables X and Y linearly related?
• Are variables X and Y non-linearly related?
• Does the variation in Y depend on X?
• Are there outliers?
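
The question of whether X and Y are linearly related also has a numeric counterpart, the Pearson correlation coefficient. The sketch below computes it by hand on invented data; the function name `pearson_r` is illustrative and not part of any library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]     # exactly linear in x
y_curved = [(v - 3) ** 2 for v in x]  # related to x, but not linearly

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

print(r_linear)
print(r_curved)
```

The curved series is fully determined by X yet has a correlation near zero, which is why a scatter plot is still needed to spot non-linear relationships.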
Hypothesis
Conclusion or Prediction

This step involves reaching a conclusion and making predictions based on the data analysis.

• Involves heavy use of mathematical and statistical functions
• Requires model selection, training, and testing to help in forecasting
• Is called machine learning as data analysis is fully or semi-automated with minimal or no human intervention

Meaning of Hypothesis

A hypothesis is used to establish the relationship between dependent and independent variables.

Hypothesis building begins in the data exploration stage, but becomes more mature in the conclusion or prediction phase.

Key Considerations of Hypothesis Building

• A hypothesis is a testable explanation of a problem or observation
• It is used in quantitative and qualitative analyses to provide research solutions
• It involves two variables, one dependent on the other
• The independent variable is manipulated by the researcher
• The dependent variable changes when the independent variable changes
Hypothesis Building Using Feature Engineering

Domain knowledge leads to hypothesis building using feature engineering.

Feature engineering involves domain expertise to:

• Make sense of data
• Construct new features from raw data automatically
• Construct new features from raw data manually
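
Constructing features manually might look like the sketch below. The raw records, the feature names, and the 200 threshold are all hypothetical choices; deciding which features matter is where domain expertise comes in.

```python
from datetime import date

# Hypothetical raw transactions: (date, amount).
raw = [
    (date(2024, 4, 5), 230.0),
    (date(2024, 4, 6), 250.0),
    (date(2024, 4, 8), 160.0),
]


def engineer(tx_date, amount):
    """Derive new features from one raw record."""
    return {
        "amount": amount,
        "day_of_week": tx_date.weekday(),      # 0 = Monday ... 6 = Sunday
        "is_weekend": tx_date.weekday() >= 5,
        "is_large": amount > 200,              # threshold chosen for illustration
    }


features = [engineer(tx_date, amount) for tx_date, amount in raw]

for row in features:
    print(row)
```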
Hypothesis Building Using a Model
There are three phases to hypothesis building, which are model building, model evaluation, and model deployment.

Phase 1: Model Building

• Identify best input variables
• Evaluate the model’s capacity to forecast with these variables

Phase 2: Model Evaluation

• Train and test the model for accuracy
• Optimize model accuracy, performance, and comparisons with other models

Phase 3: Model Deployment

• Use the model for prediction
• Use the model to compare actual outcome with expectations
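
The three phases can be sketched with a simple least-squares line. The data points are invented and the helper names `fit_line` and `mean_abs_error` are illustrative; Scikit-learn would normally handle this, and the mapping of code to phases is a loose approximation rather than a prescribed workflow.

```python
# Hypothetical data in which y is roughly 3 * x + 2.
train = [(1, 5.1), (2, 7.9), (3, 11.2), (4, 13.8), (5, 17.0)]
test = [(6, 20.1), (7, 22.9)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(points, slope, intercept):
    """Average absolute gap between actual and predicted y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Phase 1: build the model from the chosen input variable x.
slope, intercept = fit_line(train)

# Phase 2: evaluate the model on data it has not seen.
test_error = mean_abs_error(test, slope, intercept)

# Phase 3: use the model for prediction.
prediction = slope * 8 + intercept

print(slope, intercept)
print(test_error, prediction)
```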
Hypothesis Testing

Draw two samples from the population and calculate the difference between their means.

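Hypothesis testing asks whether the difference between the two means is larger than chance alone would explain.

[Figure: two samples drawn from the population, S1 with mean μ1 and S2 with mean μ2]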
Hypothesis Testing

Alternative Hypothesis
• Proposed model outcome is accurate and matches the data.
• There is a difference between the means of S1 and S2.

Null Hypothesis
• Opposite of the alternative hypothesis.
• There is no difference between the means of S1 and S2.
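
One common way to weigh the difference between two sample means is Welch's t statistic, sketched below with invented samples. This is one test among several, not necessarily the one the lesson has in mind. Turning the statistic into a p-value needs the t distribution, for which a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` could be used.

```python
import math
import statistics

# Hypothetical samples S1 and S2.
s1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
s2 = [6.3, 6.1, 5.9, 6.8, 7.0, 6.5]

mean1, mean2 = statistics.mean(s1), statistics.mean(s2)
var1, var2 = statistics.variance(s1), statistics.variance(s2)

# Welch's t statistic: difference in means divided by its standard error.
standard_error = math.sqrt(var1 / len(s1) + var2 / len(s2))
t_stat = (mean1 - mean2) / standard_error

print(mean1 - mean2)
print(t_stat)
```

A t statistic far from zero is evidence against the null hypothesis that the two means are equal.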
Hypothesis Testing Process

The process involves choosing the training and test datasets and evaluating them against the null and alternative hypotheses.

Usually the training dataset is between 60% and 80% of the full dataset and the test dataset is between 20% and 40% of the full dataset.
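
A split in this range can be sketched with the standard library. The ten-record dataset, the seed, and the 80% fraction are arbitrary choices for illustration.

```python
import random

# Hypothetical dataset of 10 records.
records = list(range(10))

# Shuffle a copy with a fixed seed so the split is repeatable.
rng = random.Random(42)
shuffled = records[:]
rng.shuffle(shuffled)

# Use 80% for training, within the 60% to 80% range mentioned above.
train_fraction = 0.8
cut = int(len(shuffled) * train_fraction)
train, test = shuffled[:cut], shuffled[cut:]

print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the original data from leaking into one side of the split.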
Data Visualization
Communication

The last step of data analysis is communication, where the analyzed data is formally presented to stakeholders.

Forms of Data analysis presentations:

• Visual graphs
• Plotting maps
• Reports
• Whitepaper reports
• PowerPoint presentations
Data Visualization

Data visualization techniques are used for effective communication of data.

Benefits of data visualization:

• Simplifies quantitative information through visuals
• Shows the relationship between data points and variables
• Identifies patterns
• Establishes trends

Examples of data visualization:

• Presenting information about new and existing customers on the website and their behavior when they access the website
• Representing web traffic pattern for the website, for example, more activity on the website in the morning than in the evening
Plotting

Plotting is a data visualization technique used to represent underlying data through graphics.

Features of plotting:

• Plotting is like telling a story about data using different colors, shapes, and sizes.
• Plotting shows the relationship between variables.
• Example:
  o Change in value of Y results in change in value of X
  o X is independent of Y
Data Types for Plotting

Numerical Data

There are two types of numerical data:

• Discrete data: Distinct or counted values
  Example: Number of employees in a company or number of students in a class
• Continuous data: Values within a range that can be measured
  Example: Height can be measured in feet or inches and weight can be measured in pounds or kilograms

Categorical Data

There are two types of categorical data:

• Cluster or group: Grouped values
  Example: Students can be divided into different groups based on height: Tall, Medium, and Short
• Ordinal data: Grouped values as per ranks
  Example: A ranking system; a five-point scale with ranks like Agree, Strongly agree, and Disagree

Time Series

Data is measured in time blocks, such as date, month, year, and time (hours, minutes, and seconds).
Types of Plot

Different data types can be visualized using various plotting techniques.

Data Analytics: An Iterative Process
Data Analytics is an iterative process involving tracing back the steps, often to ensure that you are on the right track.

Process Result: Question is answered or business problem is solved.

Data Analytics: Skills and Tools
Skills and tools required for each step of the data analysis process.

Question or Business Problem:
• Ability to ask appropriate questions and know the business
• Domain knowledge
• Passion for data
• Analytical approach

Data Acquisition:
• BeautifulSoup for web scraping
• CSV or other file knowledge
• NumPy
• Pandas
• Database

Data Wrangling:
• CSV or other file knowledge
• NumPy
• Pandas
• Database
• SciPy

Data Exploration:
• NumPy
• SciPy
• Pandas
• Matplotlib
• CSV or other file knowledge

Conclusion or Predictions:
• Scikit-learn (the main machine learning library)
• CSV or other file knowledge
• NumPy
• Pandas
• Database
• SciPy

Communication or Data Visualization:
• Pandas
• Database
• Matplotlib
• PPT
Knowledge Check

Knowledge Check 1

What is the goal of data acquisition? Select all that apply.

a. Collect data from various data sources
b. Answer business questions through graphics
c. Collect web server logs
d. Scrape the web through web APIs

The correct answer is a, c, d

Data acquisition is a process to collect data from various data sources, such as RDBMS, NoSQL databases, and web server logs, and also to scrape the web through web APIs.

Knowledge Check 2

What is the exploratory data analysis technique? Select all that apply.

a. Analysis of data using quantitative techniques
b. Conducted only on a small subset of data
c. Analysis of data using graphical techniques
d. Suggests models that best fit the data

The correct answer is a, c, d

Most EDA techniques are graphical in nature, with a few quantitative techniques, and they also suggest models that best fit the data. They use almost the entire dataset with minimal or no assumptions.
Knowledge Check 3

Which plotting technique is used for continuous data? Select all that apply.

a. Regression plot
b. Line chart
c. Histogram
d. Heat map

The correct answer is b, c

Line charts and histograms are used to plot continuous data.

Knowledge Check 4

Which Python library is the main machine learning library?

a. Pandas
b. Matplotlib
c. Scikit-learn
d. NumPy

The correct answer is c

Scikit-learn is the main machine learning library in Python.

Knowledge Check 5

Which of the following includes data transformation, merging, aggregation, group by operation, and reshaping?

a. Data acquisition
b. Data visualization
c. Data wrangling
d. Machine learning

The correct answer is c

Data wrangling includes data transformation, merging, aggregation, group by operation, and reshaping.
Knowledge Check 6

Which measure of central tendency is used to catch outliers in the data?

a. Mean
b. Median
c. Mode
d. Variance

The correct answer is b

Median is the exact middle value and most suitable to catch outliers.
Knowledge Check 6

In hypothesis testing, the proposed model is built on:

a. Entire dataset
b. Test dataset
c. Small subset
d. Training dataset

The correct answer is d

The proposed model is built on the training dataset in hypothesis testing.

Knowledge Check 7

Beautiful Soup library is used for _____.

a. Data wrangling
b. Web scraping
c. Plotting
d. Machine learning