Python for Data Science and Machine Learning
Rod Salvador
Senior Data Scientist
Reed Elsevier Philippines
https://www.extremetech.com/extreme/319005-the-day-i-learned-what-data-science-is
Contents:
• Fundamentals of Data Science
• Data Science Workflow
• Applications of Data Science
• Tools for Data Science
• Hands-on Activities
    • Data Engineering
    • Exploratory Data Analysis (EDA)
    • Data Preprocessing/Cleansing
    • Machine Learning Modeling
    • Data Visualization
• Q&A
Fundamentals of Data Science
Data science combines multiple fields, including
mathematics, computer science, and domain expertise
to extract value from data.
It encompasses preparing data for analysis, including
cleansing, aggregating, and manipulating the data to
perform advanced data analysis, machine learning,
visualization, and deployment [1].
Applications of Data Science
Optimize campaign efforts by analyzing which platforms
end users use heavily and which they rarely use.
Improve sales by creating targeted recommendations for
customers based on previous purchases and spending
habits.
Predict customer churn by analyzing data from
profiles, marketing interactions, sales history, and surveys
so sales and marketing can take action to retain at-risk customers.
Improve the event experience by analyzing the sentiment of
exhibitors, visitors, and hosted buyers based on open-text
survey responses.
Improve patient diagnoses by analyzing medical test data
and reported symptoms so doctors can diagnose diseases
earlier and treat them more effectively.
Improve efficiency by analyzing traffic patterns, weather
conditions, and other factors so logistics companies can
improve delivery speeds and reduce costs.
Forecast the growth of COVID-19 cases in a particular
region, country, continent, etc.
Detect fraud in financial services by recognizing
suspicious behaviors and anomalous actions.
Data Science Workflow
Tools for Data Science
Programming Languages
Python
Java
JavaScript
C/C++
SQL
Libraries
Data Engineering: requests, selenium, pyodbc, boto3, json5, beautifulsoup4, awswrangler, etc.
Data Analysis/Cleaning: numpy, pandas, pandas-profiling, scipy, etc.
Machine Learning: scikit-learn, tensorflow, keras, pytorch, TPOT, etc.
Data Visualization: matplotlib, ggplot, d3.js, seaborn, plotly, etc.
NLP: nltk, textblob, twython, huggingface, etc.
Automation: pyautogui, selenium, etc.
Other Tools
IDE: JupyterLab/Jupyter Notebook via Anaconda Navigator, VS Code, Sublime Text, PyCharm, etc.
Data Sources: Kaggle, Google open datasets, KDnuggets, NASA, etc.
Research: arxiv.org, paperswithcode.com, Google Scholar, etc.
Cloud/Distributed Computing: GCP, Azure, AWS, Databricks, Hadoop, Spark, etc.
Version Control: Git/GitHub, Bitbucket, Subversion, etc.
Deployment: Heroku, Streamlit, Flask, Django, FastAPI, Docker, Kubernetes, Jenkins, etc.
Data Engineering
Data engineering is the practice of designing and
building systems for collecting, storing, and
analyzing data at scale.
The ultimate goal is to make data accessible so that
organizations can use it to evaluate and optimize their
performance [2].
What’s the difference between a data scientist/analyst
and a data engineer?
Data scientists and data analysts analyze data sets to
glean knowledge and insights.
Data engineers build systems for collecting,
validating, and preparing that high-quality data [2].
DEMO 1: Collect data from an API endpoint using requests library
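A minimal sketch of what this demo might look like, assuming the endpoint returns JSON. The helper name fetch_json is hypothetical; requests.get, raise_for_status, and json are the library calls it relies on.

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body.

    fetch_json is a hypothetical helper written for this sketch.
    """
    response = requests.get(url, params=params, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

Usage would look like fetch_json("https://api.example.com/v1/records", params={"limit": 5}), where the URL is a placeholder and not a real endpoint. Setting a timeout keeps the call from hanging indefinitely on an unresponsive server.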
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used to analyze and
investigate data sets and summarize their main
characteristics, often employing data visualization
methods [3].
It helps determine how to manipulate data sources to get
the answers you need, making it easier to:
1. discover patterns
2. spot anomalies
3. test a hypothesis
4. check assumptions
DEMO 2: Perform EDA using pandas profiling and create
exploratory visuals using matplotlib
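A rough sketch of the kind of checks this demo covers. It uses plain pandas summaries in place of a generated profiling report, and the small DataFrame is made up for illustration only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 31, None, 52, 41],
    "spend": [120.0, 340.5, 210.0, 95.0, 480.0, 2000.0],
    "segment": ["a", "b", "a", "c", "b", "b"],
})

summary = df.describe()                        # count, mean, std, quartiles of numeric columns
missing = df.isna().sum()                      # number of missing values per column
segment_counts = df["segment"].value_counts()  # frequency of each category

print(summary)
print(missing)
print(segment_counts)

# Histogram of one numeric column to look at its distribution and spot anomalies.
fig, ax = plt.subplots()
df["spend"].plot(kind="hist", bins=5, ax=ax, title="Distribution of spend")
ax.set_xlabel("spend")
fig.savefig("spend_hist.png")
```

The summary statistics help check assumptions, the missing-value counts flag incomplete data, and the histogram makes the unusually large spend value easy to spot.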
Data Preprocessing/Cleansing
Data preprocessing is the process of transforming raw
data into an understandable format.
The quality of the data should be checked before applying
machine learning or data mining algorithms [4].
Characteristics of dirty data:
1. Incomplete data (e.g., missing/null values)
2. Duplicates
3. Inconsistent data (e.g., data types, data versions)
4. Outliers
5. Outdated data
6. Inaccurate data
7. Insecure data
DEMO 3: Cleanse tabular data using numpy and pandas
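A minimal sketch of a few cleansing steps, assuming a small made-up table. It touches duplicates, inconsistent data types, missing values, and outliers; the median fill and the 1.5 × IQR rule are illustrative choices, not the only options.

```python
import numpy as np
import pandas as pd

# Made-up raw data: one duplicated row, ages stored as text,
# a missing age, and one suspiciously large spend value.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": ["34", "41", "41", np.nan, "29"],
    "spend": [120.0, 95.5, 95.5, 110.0, 9999.0],
})

# 1. Duplicates: drop fully identical rows.
clean = raw.drop_duplicates().copy()

# 2. Inconsistent types: convert age from text to numbers.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# 3. Incomplete data: fill the missing age with the median age.
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Outliers: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = clean["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["spend_is_outlier"] = (clean["spend"] < q1 - 1.5 * iqr) | (
    clean["spend"] > q3 + 1.5 * iqr
)

print(clean)
```

Flagging outliers rather than deleting them keeps the decision visible, since an extreme value may be an error or a genuine observation.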
Break!
10 minutes
Introduction to Machine Learning
Machine learning is defined as the ability of a machine to
learn from data without being explicitly programmed.
Machine learning is best used for…
• Problems for which existing solutions require a lot of hand-tuning or long
lists of rules.
• Complex problems for which there is no good solution at all using a
traditional approach.
• Fluctuating environments: a machine learning system can adapt to new
data.
• Getting insights about complex problems and large amounts of data.
Machine Learning Workflow
Types of Machine Learning
Supervised learning is a type of machine learning that requires both input (features) data and
output (label) data. The goal is to find a mapping between the input and the output data.
https://ai.plainenglish.io/introduction-to-machine-learning-2316e048ade3
Unsupervised learning is a type of machine learning that only requires input data. The goal is to
find similarities, differences, and patterns in the data.
https://towardsdatascience.com/supervised-vs-unsupervised-learning-in-2-minutes-72dad148f242
Tasks under supervised learning
https://medium.com/big-data-at-berkeley/choosing-fine-tuning-your-machine-learning-model-8c28fc1bd2fc
Tasks under unsupervised learning
https://www.reddit.com/r/datascience/comments/d6buto/kmeans_be_like_mine_mine_mine/
https://towardsdatascience.com/dimensionality-reduction-cheatsheet-15060fee3aa
DEMO 4: Supervised learning using scikit-learn library
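A minimal sketch of a supervised classification task, assuming the iris dataset bundled with scikit-learn and a logistic regression model. Both are illustrative choices; the demo itself may use different data or a different estimator.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features (X) are flower measurements; labels (y) are the species.
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is scored on examples it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Learn a mapping from features to labels.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

The split into training and test sets is what makes the accuracy figure an estimate of performance on new data rather than a measure of memorization.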
DEMO 5: Unsupervised learning using scikit-learn library
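A minimal sketch of two unsupervised tasks, clustering and dimensionality reduction, assuming the iris features with the labels discarded. The choice of k-means with three clusters and PCA with two components is an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Only the input features are used; the labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# Put all features on the same scale so no single feature dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group similar rows into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Dimensionality reduction: project 4 features down to 2 for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Cluster assigned to the first 10 rows:", cluster_labels[:10])
print("Shape after PCA:", X_2d.shape)
```

The cluster numbers are arbitrary identifiers, so they should not be read as matching any particular species.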
Data Visualization
Data visualization is the graphical representation of
information and data.
By using visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data [5].
DEMO 6: Data visualization using seaborn and plotly
Q&A
References
1. https://www.oracle.com/ph/data-science/what-is-data-science/
2. https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one
3. https://www.ibm.com/cloud/learn/exploratory-data-analysis
4. https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
5. https://www.tableau.com/learn/articles/data-visualization
Thank you