SlideShare a Scribd company logo
DevOps for Data Engineers
Automate your Data Science pipeline with Ansible, Python and Kubernetes
Mihai Criveti
2nd September 2019
1
1 Introduction
2 Data Science Landscape
3 Process and Flow
4 The Data
5 Data Science Toolkit
6 The Big Data Challange
7 Cloud Computing Solutions
8 The rise of DevOps
10 Automate Your Infrastructure with Packer and Ansible
2
1 Introduction
3
Speaker Bio
Mihai Criveti, IBM
• Designs and builds multi-cloud customer solutions for Cloud Native applications, big data analytics and machine learning
workloads.
• Pursuing a MSc in Data Science at UCD.
• Leads the Cloud Native competency for IBM Cloud Solutioning.
4
2 Data Science Landscape
What is Data Science
Data Science
Multi-disciplinary field that brings together Computer Science, Statistics/Machine Learning, and Data Analysis to understand and
extract insights from ever-increasing amounts of data.
Machine Learning
The science of getting computers to act without being explicitly programmed.
Deep Learning
Family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be
supervised, semi-supervised or unsupervised.
AI
Intelligent machines that work and react like humans.
5
Moving Towards Big Data
Figure 1: Powerful models and big data support Machine Learning
6
Data Scientist Domain
Data Engineering:
• Linux, Cloud, Big Data Platforms.
• Streaming, big data pipelines.
Software Development
• Coding skills, such as Python, R, SQL.
• Development practices: Agile, DevOps, CI/CD and using GitOps
effectively.
Figure 2: Data Scientist Venn Diagram
“While data scientists are recognised for their brilliant algorithms, up to 80% of their time could be spent collecting, cleaning
and organising data.” 1
1
forbes.com/sites/forbestechcouncil/2019/03/01/radical-change-is-coming-to-data-science-jobs
7
Data Science Roles
Data Scientist / Analyst:
• They turn raw data into valuable insights that an organization needs in order to grow or compete.
• Analytical data experts with the skill to solve complex problems and the curiosity to explore what problems need solving.
• They use Data, visualization, machine learning, deep learning, pattern recognition, natural language processing, analytics.
• Always curious about the world: ”What can we learn from this data? What actions can we take after?
Data Engineer / Data Architect:
• Prepare the “big data” infrastructure to be analysed by Data Scientists.
• Software engineers who design, build, integrate data from various resources, and manage big data.
8
3 Process and Flow
Example ML Flow
Stage Description
01. Business Understanding Ask relevant questions, define objectives
02. Data Mining Gather the necessary data
03. Data Cleaning Scrub and fix data inconsistencies
04. Data Exploration Form hyphothesis about the data
05. Feature Engineering Select / construct important features
06. Predictive Modeling Train Machine Learning Models
07. Data Visualization Communicate findings with key stakeholder
08. Data Automation Automate and deploy ml models in production
9
Often, it’s a manual process
Figure 3: Pen and paper - planning
10
Data Science is awesome
Data Science is OSEMN (pronounced AWESOME!)2
- an interactive process that consists, largely, of the following steps:
1. Inquire: ask a meaningful question.
2. Obtain: get the required data.
3. Scrub: clean the data.
4. Explore: learn about your data, try stuff.
5. Model: create a couple of models and test them.
6. iNterpret: gain insight from data, present it in a usable form (reports, dashboards, applications, etc).
2
O’Reilly - “Data Science at the Command Line, Facing the Future with Time-Tested Tools, Jeroen Janssens”
11
CRISP-DM Process
Figure 4: CRISP-DM - widely used analytics model
12
Design Thinking
Figure 5: Design Thinking
13
Data Development Lifecycle
Figure 6: Development, Data and Analytics Lifecycle
14
4 The Data
Data is fundamental
Figure 7: Data and the AI Ladder
15
Types of Data
Data can be:
• Structured: tables, spreadsheets, relational databases.
• Unstructured: text, images, audio, video.
• Quantitative: numerical (ex: pulse)
• Categorical: ex: hair colour.
• Big Data: massive datasets that cannot fit in memory on a
single machine.
Figure 8: Different types of data
Data becomes information when viewed in context or post-analysis.
16
Obtaining Data
Private / Enterprise Data
• Private data can often be found in: Data warehouse, SQL Database.
• NoSQL store, Data Lakes or HDFS, document repositories.
• Private wiks, ERP and CRM platforms, object storage and more often then not, spreadsheets.
Public Data
• Weather, social media, location / geographical data, stock data, public internet data (scraping), wikis, Eurostat, kaggle, government
data portal - are all sources of external data.
Data compliance, governance and security are key to a successful data strategy.
17
Data Portal
Figure 9: Open Data portals such as data.gov.ie
18
Common Data Formats
• XML, JSON, YAML
• CSV, TSV, Parquet, XLSX
• Markdown, HTML, DOCX
• TXT, PDF
• Audio, Video
• Data APIs that return JSON or XML
• Streaming data
• SQL and other database formats
• HDFS and other big data stores or encapsulated data on object storage
19
Big Data
Figure 10: The 4 Vs of Big Data
20
5 Data Science Toolkit
5 Data Science Toolkit
Figure 11: Big Data Landscape by Firstmark
21
Tools Data Scientists use
• Mathematics - Linear Algebra, Statistics, Combinatorics
• Some of them use R - focusing on statistics
• A lot of them use Python - usually with Jupyter notebook as a front-end
• Libraries such as Pandas and Numpy are very handy!
• Natural Language Processing with NLTK
• or Machine Learning libraries - Scikit-Learn, Tensorflow or PyTorch
• SQL and databases tend to be quite popular. After all, where does data live?
• NoSQL databases such as MongoDB are quite useful too…
• And a whole bunch of Big Data tools: Hadoop, Spark, Kafka, etc.
• They write papers too, so Markdown and LaTeX come in handy!
• Lots of code, so typical software development tools (git, IDEs, CI/CD, etc.)
• Processes (SCRUM, Agile, Lean, CRISP-DM, Design Thinking)
22
Tools to IOSEMN process
+-----------------+ Project Management / Lifecycle
| INQUIRE | Git, Github, Gitlab (Project documentation)
+-----------------+ Documentation systems
v
+------------------+ Requests, APIs, sensors, surveys
| OBTAIN | SQL, CSV, JSON, XLS, NoSQL, Hadoop, Spark
+------------------+ Store / Cache data locally (SQLite, PostgreSQL)
v (Gather internal and external data)
+-----------------+ Jupyter Notebook
| SCRUB | Regular Expression (re), BeautifulSoup
+-----------------+ SQLite, ETL, Glue
23
Tools (continued)
+-----------------+ Jupyter Notebook
| EXPLORE | Pandas, Orange
+-----------------+ Matplotlib
^ v (Explore and understand the data)
+-----------------+ SciKit-Learn, Tensorflow
| MODEL | PyTorch, NumPy
+-----------------+ Machine Learning
RE-INQUIRE | (Model: predict, check accuracy, evaluate model)
^ +-----------------+ Jupyter Notebook, MatplotLib
+--------- | INTERPRET | Bokeh, D3.JS, XLSXWriter
+-----------------+ Dashboards, Reports, etc.
(Choose a good representation, interpret the results)
24
Jupyter Lab / Notebook
Figure 12: Jupyter Notebook
25
Graphing and Dashboards
Figure 13: Grafana: dashboard for time series analytics
26
Apache Superset Visualization
Figure 14: Apache Supserset
27
Geospacial Data Visualization
Figure 15: Visualize geospatial data with deck.gl
28
Local Cloud - Docker Compose
version: '3'
services:
jupyter:
image: cmihai/genealgo-dev:v1
container_name: genealgo-dev
volumes:
- './notebooks:/notebooks'
ports:
- '9000:9000'
29
Composable Environments
+-----------------+
| Jupyter |
PYTHON | ports:9000 +---------------------------------+
| vol: /notebooks | |
| (Anaconda 3) +-----------------+ |
+---------|-------+ | |
| | |
+---------v-------+ +-----v------+ +-----v-----+
| PostgreSQL | NOSQL: | REDIS | | MONGODB |
SQL | | | | | |
| | | | | |
| | | | | |
+-----------------+ +------------+ +-----------+
30
Machine Learning Frameworks
Figure 16: Architecture: Jupyter Notebook using Keras with Tensorflow
31
Open Data, Open Tools
Figure 17: Open tools analysing open medical data
32
6 The Big Data Challange
6 The Big Data Challange
Making sense of ever growing datasets through automation, machine learning and Big Data pipelines.
33
Gartner Hype Cycle
Figure 18: Gartner Hype Cycle 2018
34
7 Cloud Computing Solutions
Cloud Computing
Cloud Computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing
resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
35
Why Cloud in Data Science
1. Setting up environments with ease.
2. Access to virtually unlimited resources, GPU computing, etc.
3. Free access to lite tiers from most providers.
4. You’re likely already using it. Github? Kaggle notebooks? Google Docs? Dropbox? AWS Free Tier? JupyterHub?
36
Cloud Native Architectures
• Take advantage of modern techiques, PaaS, multi-cloud, microservice design, agile development, containers, CI/CD, DevOps.
• Quickly scale-out to thousands of CPU or GPU cores.
• Store limitless ammounts of data.
• Process events in real time.
• Pay only for what you consume.
37
Machine Learning as a Service
Figure 19: IBM Watson Machine Learning Services
38
Cloud, Multi-Cloud, Data Lakes
Figure 20: Data Lake Architecture on AWS
39
8 The rise of DevOps
8 The rise of DevOps
DevOps: People, Processes and Tools working together to bring continuous delivery of value to clients.
40
Collaborate to continuously deliver
Figure 21: Practices to implement DevOps
41
Cultural Transformation
• Culture: Build trust and align your team with better communication and transparency.
• Discover: Understand the problem domain and align on common goals.
• Think: Know your audience and meet its needs faster than the competition.
• Develop: Collaborate to build, continuously integrate and deliver high-quality code.
• Reason: Apply AI techniques so that you can make better decisions.
• Operate: Harness the power of the cloud to quickly get your minimum viable product (MVP) into production, and monitor and
manage your applications to a high degree of quality and meet your service level agreements. Grow or shrink your resources
based on demand.
• Learn: Gain insights from your users as they interact with your application.
42
10 Automate Your Infrastructure
with Packer and Ansible
Hashicorp Packer: Image Build Automation
packer build rhel8.json
"builders": [ {
"type": "virtualbox-iso",
"boot_command": [
"<up><wait><tab>",
" text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/{{user `vm_name`}}.cfg",
"<enter><wait>"
]}],
"provisioners": [
{
"type": "ansible",
"playbook_file": "{{user `playbook_file`}}"
}]
43
Packer building a VirtualBox image for RHEL 8 using Kickstart Automated Install
Figure 22: Image Build with Packer
44
OpenSCAP: Automatic security remediation as shell, ansible or puppet
45
Molecule: Test your Ansible Playbooks on Docker, Vagrant or Cloud
Ansible Molecule
• Test Ansible playbooks on virtual machines (Vagrant, Docker, Cloud, etc).
• Can trigger python or goss for testing
Creating a vagrant or docker machine:
molecule create -s vagrant-centos-7
molecule converge -s vagrant-centos-7
molecule login
In one step
molecule test
46
Kubernetes: Container Orchestration at Scale
Figure 24: Kubernetes is Desired State Management
47

More Related Content

What's hot (20)

PPT
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
PDF
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
PPTX
HTAP Queries
Atif Shaikh
 
PDF
NOSQL- Presentation on NoSQL
Ramakant Soni
 
PDF
Generative AI
All Things Open
 
PDF
Pourquoi Leroy Merlin a besoin d'un Knowledge Graph ?
Neo4j
 
PPTX
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
PDF
Combining Logs, Metrics, and Traces for Unified Observability
Elasticsearch
 
PPTX
Map Reduce
Prashant Gupta
 
PDF
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
confluent
 
PDF
Clean Infrastructure as Code
QAware GmbH
 
PDF
Introduction of Knowledge Graphs
Jeff Z. Pan
 
PPTX
Introduction to NoSQL
Dr-Dipali Meher
 
PPTX
DataOps introduction : DataOps is not only DevOps applied to data!
Adrien Blind
 
PDF
Apache Spark Introduction
sudhakara st
 
PDF
Observability
Diego Pacheco
 
PDF
[Giovanni Galloro] How to use machine learning on Google Cloud Platform
MeetupDataScienceRoma
 
PPTX
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 
PDF
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
IT Arena
 
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
HTAP Queries
Atif Shaikh
 
NOSQL- Presentation on NoSQL
Ramakant Soni
 
Generative AI
All Things Open
 
Pourquoi Leroy Merlin a besoin d'un Knowledge Graph ?
Neo4j
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Combining Logs, Metrics, and Traces for Unified Observability
Elasticsearch
 
Map Reduce
Prashant Gupta
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
confluent
 
Clean Infrastructure as Code
QAware GmbH
 
Introduction of Knowledge Graphs
Jeff Z. Pan
 
Introduction to NoSQL
Dr-Dipali Meher
 
DataOps introduction : DataOps is not only DevOps applied to data!
Adrien Blind
 
Apache Spark Introduction
sudhakara st
 
Observability
Diego Pacheco
 
[Giovanni Galloro] How to use machine learning on Google Cloud Platform
MeetupDataScienceRoma
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
IT Arena
 

Similar to DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible, Python and Kubernetes (20)

PDF
Data Science at Scale - The DevOps Approach
Mihai Criveti
 
PPTX
Introduction to data science
Mahir Haque
 
PDF
Data Science Introduction and Process in Data Science
Pyingkodi Maran
 
PPTX
Unit 1-FDS. .pptx
kavalishiva33
 
PDF
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
DATAVERSITY
 
PPTX
Lesson 3 ai in the enterprise
ankit_ppt
 
PPTX
Next generation of data scientist
TanujaSomvanshi1
 
PDF
Data - Science and Engineering slide at Bandungpy Sharing Session
Hendri Karisma
 
PPTX
Data Science Training in Chandigarh h
asmeerana605
 
PDF
Data science a practitioner's perspective
Amir Ziai
 
PPTX
CSE3038_Module1 - updated v1.1bvjchcghvkhvjkvjvkjvh.pptx
aryan209525
 
PPTX
In-Depth Data Analytics
YASH GAIKWAD
 
PPTX
Introducition to Data scinece compiled by hu
wekineheshete
 
PPTX
Demystifying Data Science & Analytics - 757ColorCoded 2019
Guillermo A. Fisher
 
PDF
How to Prepare for a Career in Data Science
Juuso Parkkinen
 
PPTX
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
Vivian S. Zhang
 
PDF
Untitled document.pdf
MuhammadTahiriqbal13
 
PPTX
Software engineering practices for the data science and machine learning life...
DataWorks Summit
 
PPTX
The Power of Data Science by DICS INNOVATIVE.pptx
gs5545791
 
PPTX
Data Science Overview
Davide Mauri
 
Data Science at Scale - The DevOps Approach
Mihai Criveti
 
Introduction to data science
Mahir Haque
 
Data Science Introduction and Process in Data Science
Pyingkodi Maran
 
Unit 1-FDS. .pptx
kavalishiva33
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
DATAVERSITY
 
Lesson 3 ai in the enterprise
ankit_ppt
 
Next generation of data scientist
TanujaSomvanshi1
 
Data - Science and Engineering slide at Bandungpy Sharing Session
Hendri Karisma
 
Data Science Training in Chandigarh h
asmeerana605
 
Data science a practitioner's perspective
Amir Ziai
 
CSE3038_Module1 - updated v1.1bvjchcghvkhvjkvjvkjvh.pptx
aryan209525
 
In-Depth Data Analytics
YASH GAIKWAD
 
Introducition to Data scinece compiled by hu
wekineheshete
 
Demystifying Data Science & Analytics - 757ColorCoded 2019
Guillermo A. Fisher
 
How to Prepare for a Career in Data Science
Juuso Parkkinen
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
Vivian S. Zhang
 
Untitled document.pdf
MuhammadTahiriqbal13
 
Software engineering practices for the data science and machine learning life...
DataWorks Summit
 
The Power of Data Science by DICS INNOVATIVE.pptx
gs5545791
 
Data Science Overview
Davide Mauri
 
Ad

More from Mihai Criveti (13)

PDF
ContextForge MCP Gateway - the missing proxy for AI Agents and Tools
Mihai Criveti
 
PDF
10 Limitations of Large Language Models and Mitigation Options
Mihai Criveti
 
PDF
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
PDF
Ansible Workshop for Pythonistas
Mihai Criveti
 
PDF
Mihai Criveti - PyCon Ireland - Automate Everything
Mihai Criveti
 
PDF
ShipItCon - Continuous Deployment and Multicloud with Ansible and Kubernetes
Mihai Criveti
 
PDF
OpenShift Virtualization - VM and OS Image Lifecycle
Mihai Criveti
 
PDF
Kubernetes Story - Day 3: Deploying and Scaling Applications on OpenShift
Mihai Criveti
 
PDF
Kubernetes Story - Day 2: Quay.io Container Registry for Publishing, Building...
Mihai Criveti
 
PDF
Kubernetes Story - Day 1: Build and Manage Containers with Podman
Mihai Criveti
 
PDF
Container Technologies and Transformational value
Mihai Criveti
 
PDF
OpenShift Commons - Adopting Podman, Skopeo and Buildah for Building and Mana...
Mihai Criveti
 
PDF
AnsibleFest 2021 - DevSecOps with Ansible, OpenShift Virtualization, Packer a...
Mihai Criveti
 
ContextForge MCP Gateway - the missing proxy for AI Agents and Tools
Mihai Criveti
 
10 Limitations of Large Language Models and Mitigation Options
Mihai Criveti
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
Ansible Workshop for Pythonistas
Mihai Criveti
 
Mihai Criveti - PyCon Ireland - Automate Everything
Mihai Criveti
 
ShipItCon - Continuous Deployment and Multicloud with Ansible and Kubernetes
Mihai Criveti
 
OpenShift Virtualization - VM and OS Image Lifecycle
Mihai Criveti
 
Kubernetes Story - Day 3: Deploying and Scaling Applications on OpenShift
Mihai Criveti
 
Kubernetes Story - Day 2: Quay.io Container Registry for Publishing, Building...
Mihai Criveti
 
Kubernetes Story - Day 1: Build and Manage Containers with Podman
Mihai Criveti
 
Container Technologies and Transformational value
Mihai Criveti
 
OpenShift Commons - Adopting Podman, Skopeo and Buildah for Building and Mana...
Mihai Criveti
 
AnsibleFest 2021 - DevSecOps with Ansible, OpenShift Virtualization, Packer a...
Mihai Criveti
 
Ad

Recently uploaded (20)

PDF
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
PDF
AWS Consulting Services: Empowering Digital Transformation with Nlineaxis
Nlineaxis IT Solutions Pvt Ltd
 
PDF
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
 
PPTX
Android Notifications-A Guide to User-Facing Alerts in Android .pptx
Nabin Dhakal
 
PPTX
IObit Driver Booster Pro 12.4-12.5 license keys 2025-2026
chaudhryakashoo065
 
PDF
The Next-Gen HMIS Software AI, Blockchain & Cloud for Housing.pdf
Prudence B2B
 
PDF
Building scalbale cloud native apps with .NET 8
GillesMathieu10
 
PDF
Writing Maintainable Playwright Tests with Ease
Shubham Joshi
 
PDF
Which Hiring Management Tools Offer the Best ROI?
HireME
 
PPTX
IDM Crack with Internet Download Manager 6.42 Build 41 [Latest 2025]
pcprocore
 
PDF
AI Software Development Process, Strategies and Challenges
Net-Craft.com
 
PPTX
Agentforce – TDX 2025 Hackathon Achievement
GetOnCRM Solutions
 
PDF
Code Once; Run Everywhere - A Beginner’s Journey with React Native
Hasitha Walpola
 
PDF
The Rise of Sustainable Mobile App Solutions by New York Development Firms
ostechnologies16
 
PDF
Automated Test Case Repair Using Language Models
Lionel Briand
 
DOCX
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
PDF
Automated Testing and Safety Analysis of Deep Neural Networks
Lionel Briand
 
PDF
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
 
PDF
Azure AI Foundry: The AI app and agent factory
Maxim Salnikov
 
PDF
Telemedicine App Development_ Key Factors to Consider for Your Healthcare Ven...
Mobilityinfotech
 
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
AWS Consulting Services: Empowering Digital Transformation with Nlineaxis
Nlineaxis IT Solutions Pvt Ltd
 
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
 
Android Notifications-A Guide to User-Facing Alerts in Android .pptx
Nabin Dhakal
 
IObit Driver Booster Pro 12.4-12.5 license keys 2025-2026
chaudhryakashoo065
 
The Next-Gen HMIS Software AI, Blockchain & Cloud for Housing.pdf
Prudence B2B
 
Building scalbale cloud native apps with .NET 8
GillesMathieu10
 
Writing Maintainable Playwright Tests with Ease
Shubham Joshi
 
Which Hiring Management Tools Offer the Best ROI?
HireME
 
IDM Crack with Internet Download Manager 6.42 Build 41 [Latest 2025]
pcprocore
 
AI Software Development Process, Strategies and Challenges
Net-Craft.com
 
Agentforce – TDX 2025 Hackathon Achievement
GetOnCRM Solutions
 
Code Once; Run Everywhere - A Beginner’s Journey with React Native
Hasitha Walpola
 
The Rise of Sustainable Mobile App Solutions by New York Development Firms
ostechnologies16
 
Automated Test Case Repair Using Language Models
Lionel Briand
 
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
Automated Testing and Safety Analysis of Deep Neural Networks
Lionel Briand
 
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
 
Azure AI Foundry: The AI app and agent factory
Maxim Salnikov
 
Telemedicine App Development_ Key Factors to Consider for Your Healthcare Ven...
Mobilityinfotech
 

DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible, Python and Kubernetes

  • 1. DevOps for Data Engineers Automate your Data Science pipeline with Ansible, Python and Kubernetes Mihai Criveti 2nd September 2019 1
  • 2. 1 Introduction 2 Data Science Landscape 3 Process and Flow 4 The Data 5 Data Science Toolkit 6 The Big Data Challange 7 Cloud Computing Solutions 8 The rise of DevOps 10 Automate Your Infrastructure with Packer and Ansible 2
  • 4. 3
  • 5. Speaker Bio Mihai Criveti, IBM • Designs and builds multi-cloud customer solutions for Cloud Native applications, big data analytics and machine learning workloads. • Pursuing a MSc in Data Science at UCD. • Leads the Cloud Native competency for IBM Cloud Solutioning. 4
  • 6. 2 Data Science Landscape
  • 7. What is Data Science Data Science Multi-disciplinary field that brings together Computer Science, Statistics/Machine Learning, and Data Analysis to understand and extract insights from ever-increasing amounts of data. Machine Learning The science of getting computers to act without being explicitly programmed. Deep Learning Family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. AI Intelligent machines that work and react like humans. 5
  • 8. Moving Towards Big Data Figure 1: Powerful models and big data support Machine Learning 6
  • 9. Data Scientist Domain Data Engineering: • Linux, Cloud, Big Data Platforms. • Streaming, big data pipelines. Software Development • Coding skills, such as Python, R, SQL. • Development practices: Agile, DevOps, CI/CD and using GitOps effectively. Figure 2: Data Scientist Venn Diagram “While data scientists are recognised for their brilliant algorithms, up to 80% of their time could be spent collecting, cleaning and organising data.” 1 1 forbes.com/sites/forbestechcouncil/2019/03/01/radical-change-is-coming-to-data-science-jobs 7
  • 10. Data Science Roles Data Scientist / Analyst: • They turn raw data into valuable insights that an organization needs in order to grow or compete. • Analytical data experts with the skill to solve complex problems and the curiosity to explore what problems need solving. • They use Data, visualization, machine learning, deep learning, pattern recognition, natural language processing, analytics. • Always curious about the world: ”What can we learn from this data? What actions can we take after? Data Engineer / Data Architect: • Prepare the “big data” infrastructure to be analysed by Data Scientists. • Software engineers who design, build, integrate data from various resources, and manage big data. 8
  • 12. Example ML Flow Stage Description 01. Business Understanding Ask relevant questions, define objectives 02. Data Mining Gather the necessary data 03. Data Cleaning Scrub and fix data inconsistencies 04. Data Exploration Form hyphothesis about the data 05. Feature Engineering Select / construct important features 06. Predictive Modeling Train Machine Learning Models 07. Data Visualization Communicate findings with key stakeholder 08. Data Automation Automate and deploy ml models in production 9
  • 13. Often, it’s a manual process Figure 3: Pen and paper - planning 10
  • 14. Data Science is awesome Data Science is OSEMN (pronounced AWESOME!)2 - an interactive process that consists, largely, of the following steps: 1. Inquire: ask a meaningful question. 2. Obtain: get the required data. 3. Scrub: clean the data. 4. Explore: learn about your data, try stuff. 5. Model: create a couple of models and test them. 6. iNterpret: gain insight from data, present it in a usable form (reports, dashboards, applications, etc). 2 O’Reilly - “Data Science at the Command Line, Facing the Future with Time-Tested Tools, Jeroen Janssens” 11
  • 15. CRISP-DM Process Figure 4: CRISP-DM - widely used analytics model 12
  • 16. Design Thinking Figure 5: Design Thinking 13
  • 17. Data Development Lifecycle Figure 6: Development, Data and Analytics Lifecycle 14
  • 19. Data is fundamental Figure 7: Data and the AI Ladder 15
  • 20. Types of Data Data can be: • Structured: tables, spreadsheets, relational databases. • Unstructured: text, images, audio, video. • Quantitative: numerical (ex: pulse) • Categorical: ex: hair colour. • Big Data: massive datasets that cannot fit in memory on a single machine. Figure 8: Different types of data Data becomes information when viewed in context or post-analysis. 16
  • 21. Obtaining Data Private / Enterprise Data • Private data can often be found in: Data warehouse, SQL Database. • NoSQL store, Data Lakes or HDFS, document repositories. • Private wiks, ERP and CRM platforms, object storage and more often then not, spreadsheets. Public Data • Weather, social media, location / geographical data, stock data, public internet data (scraping), wikis, Eurostat, kaggle, government data portal - are all sources of external data. Data compliance, governance and security are key to a successful data strategy. 17
  • 22. Data Portal Figure 9: Open Data portals such as data.gov.ie 18
  • 23. Common Data Formats • XML, JSON, YAML • CSV, TSV, Parquet, XLSX • Markdown, HTML, DOCX • TXT, PDF • Audio, Video • Data APIs that return JSON or XML • Streaming data • SQL and other database formats • HDFS and other big data stores or encapsulated data on object storage 19
  • 24. Big Data Figure 10: The 4 Vs of Big Data 20
  • 25. 5 Data Science Toolkit
  • 26. 5 Data Science Toolkit Figure 11: Big Data Landscape by Firstmark 21
  • 27. Tools Data Scientists use • Mathematics - Linear Algebra, Statistics, Combinatorics • Some of them use R - focusing on statistics • A lot of them use Python - usually with Jupyter notebook as a front-end • Libraries such as Pandas and Numpy are very handy! • Natural Language Processing with NLTK • or Machine Learning libraries - Scikit-Learn, Tensorflow or PyTorch • SQL and databases tend to be quite popular. After all, where does data live? • NoSQL databases such as MongoDB are quite useful too… • And a whole bunch of Big Data tools: Hadoop, Spark, Kafka, etc. • They write papers too, so Markdown and LaTeX come in handy! • Lots of code, so typical software development tools (git, IDEs, CI/CD, etc.) • Processes (SCRUM, Agile, Lean, CRISP-DM, Design Thinking) 22
  • 28. Tools to IOSEMN process +-----------------+ Project Management / Lifecycle | INQUIRE | Git, Github, Gitlab (Project documentation) +-----------------+ Documentation systems v +------------------+ Requests, APIs, sensors, surveys | OBTAIN | SQL, CSV, JSON, XLS, NoSQL, Hadoop, Spark +------------------+ Store / Cache data locally (SQLite, PostgreSQL) v (Gather internal and external data) +-----------------+ Jupyter Notebook | SCRUB | Regular Expression (re), BeautifulSoup +-----------------+ SQLite, ETL, Glue 23
  • 29. Tools (continued) +-----------------+ Jupyter Notebook | EXPLORE | Pandas, Orange +-----------------+ Matplotlib ^ v (Explore and understand the data) +-----------------+ SciKit-Learn, Tensorflow | MODEL | PyTorch, NumPy +-----------------+ Machine Learning RE-INQUIRE | (Model: predict, check accuracy, evaluate model) ^ +-----------------+ Jupyter Notebook, MatplotLib +--------- | INTERPRET | Bokeh, D3.JS, XLSXWriter +-----------------+ Dashboards, Reports, etc. (Choose a good representation, interpret the results) 24
  • 30. Jupyter Lab / Notebook Figure 12: Jupyter Notebook 25
  • 31. Graphing and Dashboards Figure 13: Grafana: dashboard for time series analytics 26
  • 32. Apache Superset Visualization Figure 14: Apache Supserset 27
  • 33. Geospacial Data Visualization Figure 15: Visualize geospatial data with deck.gl 28
  • 34. Local Cloud - Docker Compose version: '3' services: jupyter: image: cmihai/genealgo-dev:v1 container_name: genealgo-dev volumes: - './notebooks:/notebooks' ports: - '9000:9000' 29
  • 35. Composable Environments +-----------------+ | Jupyter | PYTHON | ports:9000 +---------------------------------+ | vol: /notebooks | | | (Anaconda 3) +-----------------+ | +---------|-------+ | | | | | +---------v-------+ +-----v------+ +-----v-----+ | PostgreSQL | NOSQL: | REDIS | | MONGODB | SQL | | | | | | | | | | | | | | | | | | +-----------------+ +------------+ +-----------+ 30
  • 36. Machine Learning Frameworks Figure 16: Architecture: Jupyter Notebook using Keras with Tensorflow 31
  • 37. Open Data, Open Tools Figure 17: Open tools analysing open medical data 32
  • 38. 6 The Big Data Challange
  • 39. 6 The Big Data Challange Making sense of ever growing datasets through automation, machine learning and Big Data pipelines. 33
  • 40. Gartner Hype Cycle Figure 18: Gartner Hype Cycle 2018 34
  • 41. 7 Cloud Computing Solutions
  • 42. Cloud Computing Cloud Computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. 35
  • 43. Why Cloud in Data Science 1. Setting up environments with ease. 2. Access to virtually unlimited resources, GPU computing, etc. 3. Free access to lite tiers from most providers. 4. You’re likely already using it. Github? Kaggle notebooks? Google Docs? Dropbox? AWS Free Tier? JupyterHub? 36
  • 44. Cloud Native Architectures • Take advantage of modern techiques, PaaS, multi-cloud, microservice design, agile development, containers, CI/CD, DevOps. • Quickly scale-out to thousands of CPU or GPU cores. • Store limitless ammounts of data. • Process events in real time. • Pay only for what you consume. 37
  • 45. Machine Learning as a Service Figure 19: IBM Watson Machine Learning Services 38
  • 46. Cloud, Multi-Cloud, Data Lakes Figure 20: Data Lake Architecture on AWS 39
  • 47. 8 The rise of DevOps
  • 48. 8 The rise of DevOps DevOps: People, Processes and Tools working together to bring continuous delivery of value to clients. 40
  • 49. Collaborate to continuously deliver Figure 21: Practices to implement DevOps 41
  • 50. Cultural Transformation • Culture: Build trust and align your team with better communication and transparency. • Discover: Understand the problem domain and align on common goals. • Think: Know your audience and meet its needs faster than the competition. • Develop: Collaborate to build, continuously integrate and deliver high-quality code. • Reason: Apply AI techniques so that you can make better decisions. • Operate: Harness the power of the cloud to quickly get your minimum viable product (MVP) into production, and monitor and manage your applications to a high degree of quality and meet your service level agreements. Grow or shrink your resources based on demand. • Learn: Gain insights from your users as they interact with your application. 42
  • 51. 10 Automate Your Infrastructure with Packer and Ansible
  • 52. Hashicorp Packer: Image Build Automation packer build rhel8.json "builders": [ { "type": "virtualbox-iso", "boot_command": [ "<up><wait><tab>", " text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/{{user `vm_name`}}.cfg", "<enter><wait>" ]}], "provisioners": [ { "type": "ansible", "playbook_file": "{{user `playbook_file`}}" }] 43
  • 53. Packer building a VirtualBox image for RHEL 8 using Kickstart Automated Install Figure 22: Image Build with Packer 44
  • 54. OpenSCAP: Automatic security remediation as shell, ansible or puppet 45
  • 55. Molecule: Test your Ansible Playbooks on Docker, Vagrant or Cloud Ansible Molecule • Test Ansible playbooks on virtual machines (Vagrant, Docker, Cloud, etc). • Can trigger python or goss for testing Creating a vagrant or docker machine: molecule create -s vagrant-centos-7 molecule converge -s vagrant-centos-7 molecule login In one step molecule test 46
  • 56. Kubernetes: Container Orchestration at Scale Figure 24: Kubernetes is Desired State Management 47