Data Science With Python
Mosky
Data Science
➤ = Extract knowledge or insights from data.
➤ Data science includes:
➤ Visualization
➤ Statistics
➤ Machine learning
➤ Deep learning
➤ Big data
➤ And related methods
➤ ≈ Data mining
2
Data Science
➤ = Extract knowledge or insights from data.
➤ This talk will introduce visualization, statistics, and machine learning; learning resources for the rest are listed at the end.
3
Data Science With Python
➤ The roadmaps below are somewhat outdated, but still contain a lot of useful keywords.
➤ MrMimic/data-scientist-roadmap – GitHub
➤ Becoming a Data Scientist – Curriculum via Metromap
5
➤ Machine learning = statistics - checking of assumptions 😆
➤ But it does resolve more problems.
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
Statistics vs. Machine Learning
6
Probability, Descriptive Statistics, and Inferential Statistics
7
Population ⇄ Sample:
➤ Probability reasons from the population to a sample.
➤ Descriptive statistics summarizes a sample.
➤ Inferential statistics infers the population from a sample.
➤ Deep learning is the most renowned part of machine learning.
➤ A.k.a. the “AI”.
➤ Deep learning uses artificial neural networks (NNs).
➤ Which are especially good at:
➤ Computer vision (CV) 👀
➤ Natural language processing (NLP) 📖
➤ Machine translation
➤ Speech recognition
➤ Too costly for simple problems.
Machine Learning vs. Deep Learning
8
Big Data
➤ The “size” is constantly moving.
➤ As of 2012, it ranges from 10·n TB to n PB, a 100× spread.
➤ Has high-3Vs:
➤ Volume, amount of data.
➤ Velocity, speed of data in and out.
➤ Variety, range of data types and sources.
➤ A practical definition:
➤ A single computer can't process in a reasonable time.
➤ Distributed computing is a big deal.
9
Today,
➤ “Models” are mathematical models.
➤ “Statistical models” emphasize inferences.
➤ “Machine learning models” emphasize predictions.
➤ “Deep learning” and “big data” are gigantic subfields.
➤ We won't introduce them here.
➤ But the learning resources are listed at the end.
10
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours on teaching Python.
➤ Owns the Python packages: ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
11
The Outline
➤ “Data”
➤ The Analysis Steps
➤ Visualization
➤ Preprocessing
➤ Dimensionality Reduction
➤ Statistical Models
➤ Machine Learning Models
➤ Keep Learning
12
The Packages
➤ $ pip3 install jupyter numpy scipy sympy matplotlib
ipython pandas seaborn statsmodels scikit-learn
➤ Or
➤ > conda install jupyter numpy scipy sympy matplotlib
ipython pandas seaborn statsmodels scikit-learn
13
Common Jupyter Notebook Shortcuts
14
Esc: edit mode → command mode.
Ctrl-Enter: run the cell.
B: insert a cell below.
D, D: delete the current cell.
M: convert to a Markdown cell.
Cmd-/: comment the code.
H: show keyboard shortcuts.
P: open the command palette.
Checkpoint: The Packages
➤ Open 00_preface_the_packages.ipynb up.
➤ Run it.
➤ The notebooks are available at https://github.com/moskytw/data-science-with-python.
15
“Data”
“Data”
➤ = Variables
➤ = Dimensions
➤ = Labels + Features
17
Data in Different Types
18
Discrete
➤ Nominal: categories, e.g., {male, female}.
➤ Ordinal / Ranked: ↑ & can be ordered, e.g., {great > good > fair}.
Continuous
➤ Interval: ↑ & distance is meaningful, e.g., temperatures.
➤ Ratio: ↑ & 0 is meaningful, e.g., weights.
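As a sketch of how an ordinal variable can be encoded in pandas (the values here are made up for illustration), an explicit category order makes the levels comparable:

```python
import pandas as pd

# Hypothetical ordinal ratings; the explicit order makes min/max meaningful.
ratings = pd.Series(["good", "great", "fair", "good"])
ordinal = ratings.astype(
    pd.CategoricalDtype(categories=["fair", "good", "great"], ordered=True)
)

print(ordinal.min())  # fair
print(ordinal.max())  # great
```

A plain `object` column would compare the strings alphabetically; the ordered categorical respects the declared ranking instead.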
Data in the X-Y Form
19
y | x
dependent variable | independent variable
response variable | explanatory variable
regressand | regressor
endogenous variable (endog) | exogenous variable (exog)
outcome | design
label | feature
➤ Confounding variables:
➤ Variables other than x that may also affect y.
➤ May lead to erroneous conclusions: “garbage in, garbage out”.
➤ Common remedies:
➤ Controlling, e.g., fix the environment.
➤ Randomizing, e.g., assign by computer.
➤ Matching, e.g., order by gender and then assign groups.
➤ Statistical control, e.g., adjust for BMI to remove the height effect.
➤ Double-blind, or even triple-blind, trials.
20
Get the Data
➤ Logs
➤ Existing datasets
➤ The Datasets Package – StatsModels
➤ Kaggle
➤ Experiments
21
The Analysis Steps
The Three Steps
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
23
1. Define Assumption
➤ Specify a feasible objective.
➤ “Use AI to get the moon!”
➤ Write a formal assumption.
➤ “The users will buy 1% of the items from our recommendations” rather than “The users will love our recommendations!”
➤ Note the dangerous gaps.
➤ “All the items from recommendation are free!”
➤ “Correlation does not imply causation.”
➤ Consider the next actions.
➤ “Release to 100% of users.” rather than “So great!”
24
2. Validate Assumption
➤ Collect potential data.
➤ List possible methods.
➤ A plot, a median, or even a mean may be good enough.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Evaluate the metrics of methods with data.
25
3. Validated Assumption?
➤ Yes → Congrats! Report fully and take the actions! 🎉
➤ No → Check:
➤ The hypotheses of the methods.
➤ The confounding variables in the data.
➤ The formality of the assumption.
➤ The feasibility of the objective.
26
Iterate Fast While the Industry Changes Rapidly
➤ Resolve the small problems first.
➤ Resolve the problems with a high impact-to-effort ratio first.
➤ One week to get a quick result and improve, rather than one year to get the maybe-the-best result.
➤ Fail fast!
27
Checkpoint: Pick up a Method
➤ Think of an interesting problem.
➤ E.g., revenue is higher, but is it random?
➤ Pick one method from the cheatsheets.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Remember the three analysis steps.
28
Visualization
Visualization
➤ Make Data Colorful – Plotting
➤ 01_1_visualization_plotting.ipynb
➤ In a Statistical Way – Descriptive Statistics
➤ 01_2_visualization_descriptive_statistics.ipynb
30
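A minimal plotting sketch (with made-up data; the Agg backend renders off-screen, so it also works on a server without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=500)  # fake heights in cm

fig, ax = plt.subplots()
ax.hist(heights, bins=30)
ax.set_xlabel("height (cm)")
ax.set_ylabel("count")
fig.savefig("heights_hist.png")
```

In a Jupyter notebook the figure displays inline, so the backend line and `savefig` are unnecessary there.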
➤ Star98
➤ star98_df = sm.datasets.star98.load_pandas().data
➤ Fair
➤ fair_df = sm.datasets.fair.load_pandas().data
➤ Howell1
➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')
➤ Or your own datasets.
➤ Plot the variables that interest you.
Checkpoint: Plot the Variables
31
Preprocessing
Feed the Data That Models Like
33
➤ Preprocess data for:
➤ Hard requirements, e.g.,
➤ corpus → vectors
➤ “What kind of news will be voted down on PTT?”
➤ Soft requirements (hypotheses), e.g.,
➤ t-test: better when samples are normally distributed.
➤ SVM: better when features range from -1 to 1.
➤ More representative features, e.g., total price / units.
➤ Note that different models have different tastes.
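For instance, standardization with scikit-learn (a sketch with two fake features on very different scales; `StandardScaler` rescales each column to zero mean and unit variance):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two fake features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```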
Preprocessing
➤ The Dishes – Containers
➤ 02_1_preprocessing_containers.ipynb
➤ A Cooking Method – Standardization
➤ 02_2_preprocessing_standardization.ipynb
➤ Watch Out for Poisonous Data Points – Removing Outliers
➤ 02_3_preprocessing_removing_outliers.ipynb
34
➤ Try to standardize and compare.
➤ Try to trim the outliers.
Checkpoint: Preprocess the Variables
35
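One common way to trim outliers is Tukey's 1.5×IQR fence; a sketch with a made-up sample:

```python
import numpy as np

data = np.array([9.0, 10.0, 10.0, 11.0, 12.0, 10.0, 11.0, 95.0])  # 95 is suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)

trimmed = data[mask]  # the 95.0 is dropped
```

Whether a point is a true outlier or a real signal depends on the domain, so inspect before dropping.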
Dimensionality
Reduction
The Model Sicks Up!
➤ Let's reduce the variables.
➤ Feed a subset → feature selection.
➤ Feature selection using SelectFromModel – Scikit-Learn
➤ Feed a transformation → feature extraction.
➤ PCA, FA, etc.
➤ Another definition: non-numbers → numbers.
37
➤ Principal Component Analysis
➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb
➤ Factor Analysis
➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
Dimensionality Reduction
38
➤ Try PCA on all the variables → the better components; or try FA.
➤ Then plot the n-dimensional data onto a 2-dimensional plane.
Checkpoint: Reduce the Variables
39
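A PCA sketch with random data (the correlated column is fabricated so there is some structure to find):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # two correlated columns

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # ready to scatter-plot

print(X_2d.shape)  # (100, 2)
```

`pca.explained_variance_ratio_` tells you how much of the original variance the two components keep.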
Statistical Models
Statistical Models
➤ Identify Boring or Interesting – Hypothesis Testings
➤ 04_1_statistical_models_hypothesis_testings.ipynb
➤ “Hypothesis Testing With Python”
➤ Identify X-Y Relationships – Regression
➤ 04_2_statistical_models_regression_anova.ipynb
41
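For example, a two-sample t-test with SciPy (the revenue numbers are simulated, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
last_month = rng.normal(loc=100, scale=10, size=50)  # fake daily revenues
this_month = rng.normal(loc=115, scale=10, size=50)

t, p = stats.ttest_ind(last_month, this_month)
print(t, p)  # a small p suggests the difference is unlikely to be random
```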
More Regression Models
➤ If y is not continuous, e.g., binary or counts:
➤ Logit or Poisson Regression | Generalized Linear Models, GLMs
➤ If the observations of y are correlated:
➤ Linear Mixed Models, LMMs | Generalized Estimating Equations, GEE
➤ If x has multicollinearity:
➤ Lasso or Ridge Regression
➤ If the error term is heteroscedastic:
➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS
➤ If x is a time series, i.e., predict xₜ from xₜ₋₁ rather than y from x:
➤ Autoregressive Integrated Moving Average, ARIMA
42
➤ Try to apply the analysis steps with a statistical method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Checkpoint: Apply a Statistical Method
43
Machine Learning
Models
➤ Apple or Orange? – Classification
➤ 05_1_machine_learning_models_classification.ipynb
➤ Without Labels – Clustering
➤ 05_2_machine_learning_models_clustering.ipynb
➤ Predict the Values – Regression
➤ Who Are the Best? – Model Selection
➤ sklearn.model_selection.GridSearchCV
Machine Learning Models
45
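A minimal classification sketch on the iris dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically well above 0.9 on iris
```

The held-out test split keeps the accuracy estimate honest; `GridSearchCV` extends the same idea to hyperparameter choices.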
Confusion matrix, where A = C[0, 0]
46

             predicted −    predicted +
actual −     A: true −      B: false +
actual +     C: false −     D: true +
➤ precision = D / BD
➤ recall = D / CD
➤ sensitivity = D / CD = recall = observed power
➤ specificity = A / AB = observed confidence level
➤ false positive rate = B / AB = observed α
➤ false negative rate = C / CD = observed β
Common “rates” in confusion matrix
47
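These letters map directly onto scikit-learn's `confusion_matrix` (a toy example):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# cm[0, 0] = A (true -),  cm[0, 1] = B (false +)
# cm[1, 0] = C (false -), cm[1, 1] = D (true +)
A, B, C, D = cm.ravel()

precision = D / (B + D)  # 2 / 3
recall = D / (C + D)     # 2 / 3
```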
Ensemble Models
➤ Bagging
➤ N independent models and average their output.
➤ e.g., the random forest models.
➤ Boosting
➤ N sequential models; the nth model learns from the (n−1)th's errors.
➤ e.g., gradient tree boosting.
48
➤ Try to apply the analysis steps with a ML method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Checkpoint: Apply a Machine Learning Method
49
Keep Learning
Keep Learning
➤ Statistics
➤ Seeing Theory
➤ Biological Statistics
➤ scipy.stats + StatsModels
➤ Research Methods
➤ Machine Learning
➤ Scikit-learn Tutorials
➤ Stanford CS229
➤ Hsuan-Tien Lin

➤ Deep Learning
➤ TensorFlow | PyTorch
➤ Stanford CS231n
➤ Stanford CS224n
➤ Big Data
➤ Dask
➤ Hive
➤ Spark
➤ HBase
➤ AWS
51
The Facts
➤ ∵
➤ You can't learn everything in data science!
➤ ∴
➤ “Let's learn to do” ❌
➤ “Let's do to learn” ✅
52
The Learning Flow
1. Ask a question.
➤ “How to tell the differences confidently?”
2. Explore the references.
➤ “T-test, ANOVA, ...”
3. Digest into an answer.
➤ Explore in a breadth-first way.
➤ Write the code.
➤ Make it work, make it right, finally make it fast.
53
Recap
➤ Let's do to learn, not learn to do.
➤ What is your objective?
➤ For the objective, what is your assumption?
➤ For the assumption, what method may validate it?
➤ For the method, how will you evaluate it with data?
➤ Q & A
54
