Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Python
Open Source Community
Package
Web Service
Good Readability
Machine Learning
R
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python
Java
High Performance
Big Data
Poor Visualization, Modeling
Matlab & Octave
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Jupyter Notebook
Supports 40+ programming languages,
e.g. Python, R, Scala...
Excellent for sharing your experiments
Markdown, LaTeX
example1
example2
Data Science Roadmap
Data Science Toolchains
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Data Collection
Using API: Facebook, Wikipedia
Web Scraper
HTTP request + HTML Parser
HTTP: python-requests
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
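A minimal sketch of session cookie persistence with python-requests. The cookie name, its value, and the URL are hypothetical. No request is sent here: the example only prepares one, to show that the session attaches its stored cookie.

```python
import requests

# A Session keeps cookies (and pooled connections) across requests.
session = requests.Session()

# Hypothetical cookie; normally a server sets it via a Set-Cookie header.
session.cookies.set("sessionid", "abc123")

# Prepare (but do not send) a request to a placeholder URL.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

# The stored cookie is attached to the outgoing request headers.
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

In real use, `session.get(url)` would send the request and keep any cookies the server returns for later calls on the same session.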
Web page
Parser!
Regular Expression?
BeautifulSoup
HTML/XML parser
BeautifulSoup
Ptt
HTML parser
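A small sketch of HTML parsing with BeautifulSoup. The HTML snippet, class name, and links are made up for illustration and are not taken from Ptt or any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded web page.
html = """
<html><body>
  <div class="title"><a href="/post/1">First post</a></div>
  <div class="title"><a href="/post/2">Second post</a></div>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser is required.
soup = BeautifulSoup(html, "html.parser")

# Collect (text, link) pairs from the anchor inside each div.title element.
posts = []
for div in soup.find_all("div", class_="title"):
    anchor = div.find("a")
    posts.append((anchor.get_text(strip=True), anchor["href"]))

print(posts)
```

Compared with regular expressions, the parser tolerates attribute reordering and whitespace changes in the markup.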
More Powerful Tool?
Scrapy
An open source and collaborative framework for
extracting the data you need from websites. In a
fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial
Scrapy
path: /scrapy/dmoz.py
crawler name: dmoz
$ scrapy crawl dmoz
robots.txt
youtube.com/robots.txt
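A short sketch of checking robots.txt rules before scraping, using only the Python standard library. The rules, the crawler name, and the URLs below are invented for illustration; a real crawler would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not copied from any real site).
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a crawler with this (made-up) name may fetch each URL.
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")
allowed = parser.can_fetch("my-crawler", "https://example.com/public/page")
print(blocked, allowed)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file in one step.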
"I believe that visualization is one of the most
powerful means of achieving personal goals."
Harvey Mackay
Data Visualization
Data Visualization
Matplotlib, ggplot2
D3.js
Bokeh
Tableau
PlotDB
Leaflet
Matplotlib
ggplot2
D3.js
Data Visualization Project
Interactive
Web frontend
example1
example2
Bokeh
Python, R, Scala, Julia
Interactive
Jupyter Notebook
Tableau
code
Data Source
Data Visualization
Code
Programming
Using GeoJSON with Leaflet
Configurable
Using GeoJSON with Leaflet
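A sketch of preparing GeoJSON in Python for a Leaflet map. The coordinates and the property name are placeholders. On the JavaScript side, Leaflet's `L.geoJSON(...)` accepts an object of this shape.

```python
import json

# Hypothetical point; GeoJSON orders coordinates as [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [120.2133, 22.9970]},
    "properties": {"name": "Sample location"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize to text that can be saved as a .geojson file or embedded in a page.
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Note the coordinate order: Leaflet's own `L.marker` takes [latitude, longitude], while GeoJSON uses [longitude, latitude].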
S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
Algorithm & Modeling
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
NumPy + Pandas + Scikit-learn
Numpy
C
Numpy - data structure
ndarray (n-dim array)
ndim
size
shape
dtype
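The four ndarray attributes listed above can be inspected directly. The array contents here are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(a.ndim)   # number of dimensions
print(a.size)   # total number of elements
print(a.shape)  # length of each dimension
print(a.dtype)  # element type
```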
Numpy
generate matrix
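A few common ways to generate matrices with NumPy. This is an illustrative selection, not necessarily the calls shown on the original slides.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 matrix filled with 0.0
ones = np.ones((2, 3))              # 2x3 matrix filled with 1.0
identity = np.eye(3)                # 3x3 identity matrix
grid = np.arange(6).reshape(2, 3)   # values 0..5 laid out in 2 rows
steps = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values from 0 to 1

# Seeded generator so the random matrix is reproducible.
rng = np.random.default_rng(seed=0)
noise = rng.random((2, 3))          # uniform samples in [0, 1)

print(grid)
```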
Numpy - linalg
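A small sketch of `numpy.linalg` on a 2x2 system. The matrix and right-hand side are made-up values.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)   # solve the linear system A @ x = b
det = np.linalg.det(A)      # determinant of A
inv = np.linalg.inv(A)      # inverse of A

print(x, det)
```

`solve` is generally preferable to computing the inverse and multiplying, since it is faster and numerically more stable.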
Pandas
Built on NumPy
Series, DataFrame
Import: csv, json ...
NaN handling
Series
DataFrame - Series
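A sketch of how the two Pandas data structures relate: each column of a DataFrame is a Series. The names and scores are invented.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
scores = pd.Series([90, 85, 70], index=["amy", "bob", "cat"], name="score")

# A DataFrame is a table whose columns are Series sharing one index.
df = pd.DataFrame({"score": scores, "passed": scores >= 80})

print(df)
print(type(df["score"]))
```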
Pandas - import
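A minimal sketch of importing CSV and JSON data. In-memory strings stand in for files so the example runs without any data on disk; the column names and values are hypothetical.

```python
import io

import pandas as pd

# In-memory text standing in for data.csv and data.json.
csv_text = "name,age\namy,30\nbob,25\n"
json_text = '[{"name": "amy", "age": 30}, {"name": "bob", "age": 25}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

With real files, pass the path instead, for example `pd.read_csv("data.csv")`.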
Pandas - NaN
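Three common ways to deal with NaN (missing values), sketched on a small made-up table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

missing_per_column = df.isna().sum()   # count NaN values in each column
complete_rows = df.dropna()            # keep only rows without any NaN
filled = df.fillna(0.0)                # replace every NaN with 0.0

print(missing_per_column)
```

Whether to drop or fill depends on the dataset; filling with 0.0 is only one possible choice.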
Pandas - operation
Merge
Grouping
Reshaping
. . .
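The three operations named above, sketched on two small hypothetical tables (all names and values are invented).

```python
import pandas as pd

orders = pd.DataFrame(
    {"user": ["amy", "bob", "amy"], "item": ["pen", "ink", "ink"], "qty": [2, 1, 3]}
)
users = pd.DataFrame({"user": ["amy", "bob"], "city": ["Tainan", "Taipei"]})

# Merge: attach each user's city to their orders (like a SQL left join).
merged = orders.merge(users, on="user", how="left")

# Grouping: total quantity ordered per city.
qty_per_city = merged.groupby("city")["qty"].sum()

# Reshaping: one row per user, one column per item.
wide = merged.pivot_table(
    index="user", columns="item", values="qty", aggfunc="sum", fill_value=0
)

print(wide)
```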
Dataset
Feature Engineering
Modeling
Evaluation
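The four steps above can be sketched with scikit-learn. The dataset (iris), the scaling step, and the classifier are illustrative choices, not the ones used in the original talk.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: a small built-in classification dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering (here only standardization) + modeling in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation on data the model has not seen during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Putting the scaler inside the pipeline ensures it is fitted on the training split only, which avoids leaking test data into the model.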
LIBSVM
C
Easy to use
Support many programming languages
Dataset
LIBSVM - install
$ git clone
LIBSVM - install
$ make
LIBSVM - workflow
LIBSVM - data format
label
index: attribute index
value: attribute value
LIBSVM - data format
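A sketch of reading one line of the LIBSVM data format, where each line is a label followed by sparse `index:value` pairs. The helper name `parse_libsvm_line` and the sample line are hypothetical; this is not part of LIBSVM itself.

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, features)."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# Attributes that are absent from the line are implicitly zero.
label, features = parse_libsvm_line("+1 1:0.5 3:1.2 7:-0.25")
print(label, features)
```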
LIBSVM - toy
MLlib
Hadoop
Java, Scala, R, Python
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
MLlib
Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
Weka
Java library
Big Data
Support GUI
Deep Learning
Theano
Pylearn2
Keras
Tensorflow
Caffe
Deeplearning4J
...
Theano
Based on NumPy
Implemented by Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
A CPU and GPU Math Compiler in Python
Theano tutorial:
http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
Theano, Tensorflow
Support GPU
prototype
High-level neural networks library
Tool ?
Homework
GitHub repo Data science
Database, Social Network Analytics, ML library, Deep Learning Platform ...
README.md: Repo
Demo Code
email: ita3051@gmail.com
Google Form
https://goo.gl/forms/PQPz8u2glyunQvfM2
