Statistics for
Data Scientists
Agenda
Revision
Data
Statistics -Descriptive, Central Tendency, Variation, Distributions
Data Mining
Basics of Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
the culture of academia, which does not reward researchers for understanding technology.
Danger zone: this overlap of skills gives people the ability to create what appears to be
a legitimate analysis without any understanding of how they got there or
what they have created.
Being able to manipulate text files at the command line,
understanding vectorized operations, thinking algorithmically:
these are the hacking skills that make for a successful data hacker.
Data plus math and statistics only gets you machine learning,
which is great if that is what you are interested in, but not if you are doing data science.
What is Business Analytics
Definition: the study of business data using statistical techniques and
programming to create decision support and insights for achieving
business goals.
Predictive: to predict the future.
Descriptive: to describe the past.
Data
Data is a set of values of qualitative or quantitative variables. An example of qualitative
data would be an anthropologist's handwritten notes about her interviews. Data is
collected by a huge range of organizations and institutions, including businesses (e.g.,
sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment
rates, literacy rates) and non-governmental organizations (e.g., censuses of the number
of homeless people by non-profit organizations). Data is measured, collected and
reported, and analyzed, whereupon it can be visualized using graphs, images or other
analysis tools.
https://en.wikipedia.org/wiki/Data
Data is distinct pieces of information, usually formatted in a special way. All software is
divided into two general categories: data and programs. Programs are collections of
instructions for manipulating data. Data can exist in a variety of forms -- as numbers or
text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored
in a person's mind.
http://www.webopedia.com/TERM/D/data.html
Data
https://en.oxforddictionaries.com/definition/data Definition of data in English:
data
noun
[mass noun] Facts and statistics collected together for reference or analysis:
‘there is very little data available’
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted
in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Philosophy: Things known or assumed as facts, making the basis of reasoning or calculation.
Variable
Something that varies
Variable
Nominal variables have two or more categories with no intrinsic order.
Dichotomous variables are nominal variables that have only two categories or levels.
Ordinal variables have two or more categories, just like nominal variables, but the categories can also be ordered or
ranked (e.g. from Excellent to Horrible).
Interval variables are measured along a continuum and have a numerical value (for example, temperature measured in
degrees Celsius or Fahrenheit).
Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that
variable. For example, a distance of ten metres is twice a distance of five metres.
https://statistics.laerd.com/statistical-guides/types-of-variable.php
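The variable types above map fairly directly onto R data structures. The following is a minimal sketch; all variable names and values are made up for illustration and do not come from the slides.

```r
# Nominal: categories with no intrinsic order (hypothetical example data)
blood_group <- factor(c("A", "B", "O", "AB", "O"))

# Dichotomous: a nominal variable with exactly two levels
smoker <- factor(c("yes", "no", "no", "yes", "no"))

# Ordinal: categories with a meaningful order, declared via ordered = TRUE
rating <- factor(c("Good", "Horrible", "Excellent", "Good", "Poor"),
                 levels = c("Horrible", "Poor", "Good", "Excellent"),
                 ordered = TRUE)

# Interval: differences are meaningful, but zero is arbitrary
temp_celsius <- c(10, 20, 30)

# Ratio: a true zero exists, so ratios are meaningful
distance_m <- c(5, 10)

nlevels(smoker)                 # number of categories of the dichotomous variable
rating[1] < rating[3]           # comparisons only work because the factor is ordered
distance_m[2] / distance_m[1]   # ten metres is twice five metres
```

Declaring the order explicitly matters: comparing levels of an unordered factor with `<` gives `NA` and a warning in R.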
Central Tendency
Mean
Arithmetic mean: the sum of the values divided by the number of values.
Geometric mean: an average that is useful for sets of positive numbers that are interpreted according to their product and
not their sum (as is the case with the arithmetic mean), e.g. rates of growth.
Median
The median is the number separating the higher half of a data sample, a population, or a probability distribution from the lower
half.
Mode
The mode is the value that occurs most often.
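These measures can be computed in a few lines of R. This is a sketch on a made-up vector; `stat_mode` is a hypothetical helper, needed because base R's `mode()` returns an object's storage mode rather than the statistical mode.

```r
x <- c(2, 4, 4, 8, 16)   # hypothetical positive data

arith_mean <- mean(x)             # sum of values / number of values
geo_mean   <- exp(mean(log(x)))   # n-th root of the product, computed via logs
med        <- median(x)           # middle value of the sorted data

# Hypothetical helper: return the most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

arith_mean     # 6.8
geo_mean       # 4096^(1/5), about 5.28
med            # 4
stat_mode(x)   # 4
```

The geometric mean is smaller than the arithmetic mean here because the single large value (16) pulls the arithmetic mean upward more strongly.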
Dispersion
Range
The range of a set of data is the difference between the largest and smallest values.
Variance
The mean of the squared differences of the values from their mean.
Standard Deviation
The square root of the variance.
Frequency
A frequency distribution is a table that displays the frequency of various outcomes in a sample.
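As a sketch of the dispersion measures on a made-up vector: the definition of variance given here divides by n (the population form), whereas R's built-in `var()` and `sd()` divide by n - 1 (the sample form), so both are shown.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data, mean = 5

data_range <- max(x) - min(x)        # note: range(x) returns c(min, max), not the difference
pop_var    <- mean((x - mean(x))^2)  # mean of squared deviations (divides by n)
pop_sd     <- sqrt(pop_var)          # square root of the variance
sample_var <- var(x)                 # R's default divides by n - 1

data_range   # 7
pop_var      # 4
pop_sd       # 2
sample_var   # 32/7, about 4.57
```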
Distribution
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of
the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of
individuals in each group.
http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/
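For categorical data, the counts and percentages per group can be tabulated directly. A minimal sketch with made-up grades:

```r
grades <- c("A", "B", "B", "C", "A", "B")   # hypothetical categorical data

freq <- table(grades)             # number of individuals in each group
pct  <- 100 * prop.table(freq)    # percentage of individuals in each group

freq
pct
```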
Distributions
Normal
The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ=0 and σ=1.
Skewed Distribution
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The
skewness value can be positive or negative, or even undefined.
Image
https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg
Skewed Distribution
Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Kurtosis
is a descriptor of the shape of a probability distribution.
Image
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewed Distribution
In the R package moments, skewness() returns the value of skewness and kurtosis() returns the value of kurtosis.
https://cran.r-project.org/web/packages/moments/moments.pdf
Image
http://www.janzengroup.net/stats/lessons/descriptive.html
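To see what these two numbers measure without installing a package, the moment-based formulas can be written in base R. This is a sketch: `central_moment`, `skew_manual` and `kurt_manual` are hypothetical helper names, and the claim that they match the `moments` package (which, as far as I know, reports Pearson's kurtosis rather than excess kurtosis) is an assumption worth checking against the package manual.

```r
# k-th central moment: mean of (x - mean)^k
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: third central moment scaled by variance^(3/2)
skew_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Moment-based (Pearson) kurtosis: fourth central moment scaled by variance^2
kurt_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)        # hypothetical symmetric data
right_tailed <- c(1, 1, 2, 2, 3, 10)    # hypothetical data with a long right tail

skew_manual(symmetric)      # 0: no asymmetry
kurt_manual(symmetric)      # 1.7
skew_manual(right_tailed)   # positive: the tail extends to the right
```

On this scale a normal distribution has kurtosis 3, so values above 3 indicate heavier tails than the normal.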
Distributions
Bernoulli
Distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 − p. It
can be used, for example, to represent the toss of a coin.
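A Bernoulli variable is a binomial variable with a single trial, so in R it can be sketched with `rbinom` and `dbinom` using `size = 1`. The fair-coin probability and the seed below are illustrative choices.

```r
p <- 0.5        # success probability (a fair coin, chosen for illustration)
set.seed(1)     # make the simulated tosses reproducible

tosses <- rbinom(10, size = 1, prob = p)   # ten tosses: 1 = success, 0 = failure

prob_one  <- dbinom(1, size = 1, prob = p)   # P(X = 1) = p
prob_zero <- dbinom(0, size = 1, prob = p)   # P(X = 0) = 1 - p

tosses
prob_one
prob_zero
```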
Distributions
Chi Square
The distribution of the sum of the squares of k independent standard normal random variables, where k is the number of degrees of freedom.
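The definition can be checked by simulation: draw k standard normal values, square and sum them, and repeat many times. The sketch below uses k = 7 and 10,000 repetitions, both arbitrary choices, and compares the result with R's built-in chi-squared quantile function.

```r
set.seed(123)    # reproducible simulation
k     <- 7       # degrees of freedom (arbitrary choice)
n_sim <- 10000   # number of simulated sums (arbitrary choice)

# Each row holds k independent standard normal draws
z <- matrix(rnorm(n_sim * k), nrow = n_sim, ncol = k)

# Sum of squares per row: one simulated chi-squared value per row
q <- rowSums(z^2)

mean(q)                  # should be close to k, the mean of a chi-squared(k)
quantile(q, 0.95)        # empirical 95th percentile
qchisq(0.95, df = k)     # theoretical 95th percentile for comparison
```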
Distributions
Poisson
A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
and/or space if these events occur with a known average rate and independently of the time since the last event.
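With average rate λ, the probability of exactly k events is λ^k e^(−λ) / k!. The sketch below evaluates that formula by hand for an illustrative rate of 3 events per interval and compares it with R's `dpois`.

```r
lambda <- 3      # average number of events per interval (illustrative value)
k      <- 0:5    # event counts of interest

manual  <- lambda^k * exp(-lambda) / factorial(k)   # Poisson formula written out
builtin <- dpois(k, lambda)                         # R's Poisson probability mass function

all.equal(manual, builtin)   # the two agree up to floating-point error
ppois(5, lambda)             # P(X <= 5), the sum of the probabilities above
```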
Probability
Probability Distribution
The figure shows the probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", the most important
continuous probability distribution. As notated on the figure, the probabilities of intervals of values correspond to the area
under the curve.
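The link between probability and area can be verified numerically. As a sketch, the probability that a standard normal variable falls within one standard deviation of the mean is computed twice: by integrating the density `dnorm`, and by differencing the cumulative distribution function `pnorm`.

```r
# Area under the standard normal density between -1 and 1
area <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function
via_cdf <- pnorm(1) - pnorm(-1)

area      # about 0.6827
via_cdf   # about 0.6827
```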
Refresher in Statistics
Using RCmdr for Statistics
Central Limit Theorem
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently
large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will
be approximately normally distributed, regardless of the underlying distribution.
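A small simulation illustrates the theorem. The sketch below draws samples from an exponential distribution, which is strongly skewed, and looks at the distribution of the sample means. The sample size of 50, the 5,000 repetitions and the seed are arbitrary choices.

```r
set.seed(42)   # reproducible simulation
n    <- 50     # observations per sample (arbitrary choice)
reps <- 5000   # number of samples (arbitrary choice)

# Mean of each sample drawn from a skewed exponential distribution (mean 1, sd 1)
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to 1 / sqrt(n), about 0.141
hist(sample_means)   # roughly bell-shaped despite the skewed source distribution
```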
Hypothesis testing
Hypothesis testing is the use of statistics to assess how compatible the observed data are with a given hypothesis. The
usual process of hypothesis testing consists of four steps.
1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the
alternative hypothesis (commonly, that the observations show a real effect combined with a component of
chance variation).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed
would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the
evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If P ≤ α, the
observed effect is statistically significant and the null hypothesis is rejected in favour of the alternative
hypothesis.
http://mathworld.wolfram.com/HypothesisTesting.html
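The four steps can be sketched in R with a one-sample t-test worked by hand and then cross-checked against `t.test`. The simulated data, the null value `mu0` and the significance level `alpha` are illustrative assumptions, not values from the slides.

```r
set.seed(7)                          # reproducible hypothetical sample
x <- rnorm(20, mean = 0.5, sd = 1)   # made-up observations

# Step 1: null hypothesis mean = mu0, alternative mean != mu0
mu0   <- 0
alpha <- 0.05

# Step 2: test statistic, the one-sample t statistic
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: two-sided P-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Step 4: compare the P-value with the significance level
reject_null <- p_value <= alpha

# Cross-check against R's built-in test
check <- t.test(x, mu = mu0)
c(manual = p_value, builtin = check$p.value)
```

Writing the steps out makes clear that the P-value is computed under the assumption that the null hypothesis holds; it is not the probability that the null hypothesis is true.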
Hypothesis testing
http://cmapskm.ihmc.us/rid=1052458963987_678930513_8647/Hypothesis%20testing.cmap
T test
http://statistics.berkeley.edu/computing/r-t-tests
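> set.seed(1)             # make the random samples reproducible
> x <- rnorm(10)          # first sample: 10 draws from a standard normal
> y <- rnorm(10)          # second sample: 10 draws from a standard normal
> t.test(x, y)            # Welch two-sample t-test, printed to the console
> ttest <- t.test(x, y)   # store the test result as an object
> names(ttest)            # list the components of the result
> ttest$statistic         # extract the t statistic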
Chi Square Distribution
Problem
Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom.
Solution
We apply the quantile function qchisq of the Chi-Squared distribution against the decimal value 0.95.
> qchisq(.95, df=7) # 7 degrees of freedom
[1] 14.067
http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution
Normal Distribution
To find the percentage of students scoring higher than 84, we apply the function pnorm of the normal
distribution with mean 72 and standard deviation 15.2. We
are interested in the upper tail of the normal distribution, so we set lower.tail=FALSE.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Student T Distribution
Problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.
Solution
We apply the quantile function qt of the Student t distribution against the decimal values 0.025 and 0.975.
> qt(c(.025, .975), df=5) # 5 degrees of freedom
[1] -2.5706 2.5706
Some code
http://rpubs.com/newajay/stats1
Some code
http://rpubs.com/newajay/stats4
Bayes Theorem
https://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/BayesTheorem.html
Bayes Theorem
https://en.wikipedia.org/wiki/Bayes'_theorem
