Creating #serverless data analytics
system on GCP using BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
March 2018 - Tirgu Mures, Romania
● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow (120k reputation)
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery/Redis and database engine expert
● Active in mentoring and IT community
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10
Creating #serverless data analytics system on GCP using BigQuery @martonkodok
About me
REEA.net uses GCP
Build on the same infrastructure
that powers Google
Google Cloud Platform (GCP)
Compute: Compute Engine, App Engine, Kubernetes Engine, GPU, Cloud Functions, Container-Optimized OS
Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics, Data Studio, Cloud Dataprep
Storage & Databases: Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API, Cloud Video Intelligence API, Advanced Solutions Lab
Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, Key Management Service, BeyondCorp, Data Loss Prevention API, Identity-Aware Proxy, Security Key Enforcement
Internet of Things: Cloud IoT Core
Transfer Appliance
Google Cloud Platform (GCP)
Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Container Registry, Google Plug-in for Eclipse, Cloud Test Lab, Container Builder
Networking: Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud DNS, Cloud Network, Cloud External IP Addresses, Cloud Firewall Rules, Cloud Routes, Cloud VPN, Cloud Router, Dedicated Interconnect
Management Tools: Stackdriver, Monitoring, Logging, Error Reporting, Trace, Debugger, Cloud Deployment Manager, Cloud Endpoints, Cloud Console, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
Meet Serverless
serverless data center depicted
Event-driven serverless compute platform
[Diagram labels: Cloud Services; Changes in data state; Business logic events; Integrations; Event Router; Gateway; HTTPS; Event Source; Multiple Platforms; Data Warehouse; Pub/Sub; Cloud Functions; Streaming; Business Value; Application; Task; Analysis]
Serverless is about maximizing elasticity, cost
savings, and agility of cloud computing.
@martonkodok
Goal today
Crafting a solution for building a high-performance, petabyte-scale, serverless data analytics and reporting system on Google Cloud Platform
Legacy Reporting System
[Diagram labels: App; Cloud Load Balancing; NGINX (Compute Engine, 10GB PD); Database Service (Master/Slave) on three Compute Engine instances (10GB PD each); Report & Share; Business Analysis; Scheduled Tasks; Batch Processing (Compute Engine, Multiple Instances)]
Serverless Reporting System
[Diagram labels: App; Cloud Load Balancing; NGINX (Compute Engine, 10GB PD); Database Service (Master/Slave) on three Compute Engine instances (10GB PD each); Report & Share; Business Analysis; Scheduled Tasks; Batch Processing (Compute Engine, Multiple Instances); BigQuery; Data Studio; Report & Share; Business Analysis]
What is BigQuery?
● Analytics-as-a-Service: a data warehouse in the cloud
● Scales into petabytes on managed Google infrastructure (US or EU zone)
● SQL:2011 + JavaScript UDFs (User Defined Functions)
● Familiar DB structure (tables, views, struct, nested, JSON)
● Integrates with Google Sheets, Cloud Storage, and Pub/Sub connectors
● Decent pricing (queries: $5/TB, storage: $20/TB per month, cold storage: $10/TB per month) *Mar 2018
● Open interfaces (Web UI, BQ command-line tool, REST, ODBC)
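To make the pricing line concrete, here is a minimal Python sketch of the arithmetic. It uses only the March 2018 list prices quoted on the slide. The function name, the sample workload, and the assumption that storage is billed per TB per month are illustrative; free tiers and later price changes are ignored.

```python
# Prices quoted on the slide (March 2018), in USD per TB.
QUERY_PRICE_PER_TB = 5.0
ACTIVE_STORAGE_PER_TB_MONTH = 20.0
COLD_STORAGE_PER_TB_MONTH = 10.0


def estimate_monthly_cost(scanned_tb, active_tb, cold_tb):
    """Hypothetical helper: rough monthly bill in USD for one project."""
    query_cost = scanned_tb * QUERY_PRICE_PER_TB
    storage_cost = (active_tb * ACTIVE_STORAGE_PER_TB_MONTH
                    + cold_tb * COLD_STORAGE_PER_TB_MONTH)
    return query_cost + storage_cost


if __name__ == "__main__":
    # Assumed workload: 12 TB scanned by queries, 2 TB active and 6 TB cold storage.
    print(estimate_monthly_cost(12, 2, 6))
```

Because queries are billed by bytes scanned, the query term is usually the one worth watching: selecting fewer columns or filtering on a partition lowers it directly.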
BigQuery: Convenience of SQL
● Columnar storage (max 10,000 columns per table)
● Large files for loading: 5 TB (CSV or JSON)
● UDFs in JavaScript or SQL
● Rich SQL:2011: JSON, IP, Math, RegExp, Geocode, Window functions
● Modern data types: Record, Nested, Struct, Array
● Append-only tables preferred (DML syntax available)
● Day-column partitioned tables (SELECT * FROM t WHERE day = '2018-01-01')
Architecting for The Cloud
[Diagram labels: On-Premises Servers; Event Sourcing; Frontend; Platform Services; Metrics / Logs / Streaming; Pipelines; ETL Engine; BigQuery]
“Our project generates many/big files. How can I seamlessly ingest them?”
Serverless file ingest
[Diagram labels: On-Premises Servers; Application; Event Sourcing; Frontend; Platform Services; Metrics / Logs / Streaming; Cloud Storage; Cloud Functions (Triggered Code); BigQuery]
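The file-ingest flow above (a file lands in Cloud Storage, triggered code loads it into BigQuery) can be sketched in Python as follows. This is an illustration under assumptions, not the deck's implementation: the project, dataset, and table names are hypothetical, the event is assumed to carry the `bucket` and `name` of the uploaded object, and the sketch only builds a request body in the shape of a BigQuery REST load job instead of submitting it.

```python
import json

# Hypothetical destination; replace with your own project, dataset and table.
DESTINATION = {"projectId": "my-project", "datasetId": "analytics", "tableId": "events"}


def build_load_job(event):
    """Turn a Cloud Storage upload event into a BigQuery load job body."""
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    # Assumption: files are either CSV or newline-delimited JSON.
    if event["name"].endswith(".csv"):
        source_format = "CSV"
    else:
        source_format = "NEWLINE_DELIMITED_JSON"
    return {
        "configuration": {
            "load": {
                "sourceUris": [uri],
                "destinationTable": DESTINATION,
                "sourceFormat": source_format,
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


def on_file_uploaded(event, context=None):
    """Entry-point sketch: a real function would submit the job here."""
    job = build_load_job(event)
    print(json.dumps(job, sort_keys=True))
    return job


if __name__ == "__main__":
    on_file_uploaded({"bucket": "my-ingest-bucket", "name": "logs/2018-03-01.json"})
```

Keeping the job construction in a pure function makes the trigger easy to test locally, without any cloud credentials.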
“Data needs to be processed in multiple services. How can we pipe to multiple places?”
Architecting for The Cloud
[Diagram labels: On-Premises Servers; Event Sourcing; Frontend; Platform Services; Metrics / Logs / Streaming; Cloud Storage; Stream; Batch; Cloud Dataflow; Process; Analyze; BigQuery; Cloud SQL; Data Studio; Third-Party Tools]
“We have our app outside of GCP. How can we use the benefits of BigQuery?”
Data Pipeline Integration at REEA.net
[Diagram labels: Standard Devices; HTTPS; On-Premises Servers; Application; Servers; Event Sourcing; Frontend; Platform Services; Metrics / Logs / Streaming; Pipelines; FluentD; Analytics Backend; BigQuery; Cloud Storage archive; Load; Export; Replay; Database; SQL; Development Team; Data Analysts; Report & Share; Business Analysis; Tools: Tableau, QlikView, Data Studio, Internal Dashboard]
The following slides present a sample Fluentd configuration to:
1. Transform a record
2. Copy the event to multiple outputs
3. Store event data in a file (for backup/log purposes)
4. Stream to BigQuery (for immediate analysis)
<filter frontend.user.*>
  @type record_transformer
</filter>
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
  </store>
  <store>
    @type bigquery
  </store>
  …
</match>
● The filter plugin mutates incoming data: add, modify, or delete event data and transform attributes without a code deploy.
● The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great for shipping the same event to multiple subsystems.
● The BigQuery output plugin streams the event to the BigQuery warehouse on the fly. No need to write an integration. Data is available immediately for querying.
● Whenever needed, other output plugins can be wired in: Kafka, Google Cloud Storage output plugin.
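Step 1: record_transformer
<filter frontend.user.*>
  @type record_transformer
  # enable_ruby allows Ruby expressions inside ${...}
  enable_ruby
  # drop the top-level host key from the emitted record
  remove_keys host
  <record>
    # nested field with the BigQuery insert id, host and creation time
    # (newer Fluentd versions may require ${record["uid"]} instead of ${uid})
    bq {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
    # value computed on the fly from two existing fields
    avg ${record["total"] / record["count"]}
  </record>
</filter>
Syntax: Ruby, easy to use.
Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in a clear log/analytics DB,
- renaming without a code deploy.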
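To show what this filter does to a single event, here is a small Python illustration. It is not Fluentd code: the function name and the sample event are hypothetical, and it assumes the record carries `uid`, `host`, `total`, and `count` fields, as the placeholders in the configuration suggest.

```python
import time


def transform(record, event_time):
    """Mimic the record_transformer filter on one event (illustration only)."""
    out = dict(record)
    # <record> section: add a nested `bq` field built from the original record.
    out["bq"] = {
        "insert_id": record["uid"],
        "host": record["host"],
        "created": int(event_time),
    }
    # Note: Python's `/` is true division; Ruby's `/` on two integers truncates.
    out["avg"] = record["total"] / record["count"]
    # remove_keys host: the top-level key is dropped, the copy in `bq` stays.
    out.pop("host", None)
    return out


if __name__ == "__main__":
    event = {"uid": "u-1", "host": "web-1", "total": 30, "count": 4}
    print(transform(event, time.time()))
```

The point of the pattern is that this reshaping lives in configuration, so a field can be renamed or derived without redeploying the application that emits the events.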
Steps 2 and 3: copy and file
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
    <template>
      path /tank/storage/${tag}.*.log
      time_slice_format %Y%m%d
    </template>
  </store>
</match>
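The copy-then-store idea can be sketched in plain Python: every sink receives the same event, and the file sink writes to a path derived from the tag and the day. All names here are hypothetical, and the exact file naming of the real Fluentd file output differs; the sketch only illustrates the fan-out.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def file_path(base_dir, tag, event_time):
    """Build a per-tag, per-day log path: <base>/<tag>.<YYYYMMDD>.log."""
    day = datetime.fromtimestamp(event_time, tz=timezone.utc).strftime("%Y%m%d")
    return os.path.join(base_dir, "{}.{}.log".format(tag, day))


def file_sink(base_dir):
    """Return a sink that appends each event as one JSON line."""
    def write(tag, event_time, record):
        with open(file_path(base_dir, tag, event_time), "a") as fh:
            fh.write(json.dumps(record) + "\n")
    return write


def copy_to_all(sinks, tag, event_time, record):
    """Deliver the same event to every sink, like the copy output."""
    for sink in sinks:
        sink(tag, event_time, dict(record))  # each sink gets its own copy


if __name__ == "__main__":
    streamed = []  # stand-in for the BigQuery store
    base = tempfile.mkdtemp()
    sinks = [file_sink(base), lambda tag, ts, rec: streamed.append(rec)]
    copy_to_all(sinks, "frontend.user.login", 1514764800, {"uid": "u-1"})
    print(os.listdir(base), streamed)
```

Handing each sink its own copy of the record keeps one output from accidentally mutating what another output sees.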
Step 4: BigQuery
<match frontend.user.*>
  @type bigquery
  method insert
  auth_method json_key
  json_key /etc/td-agent/keys/key-31da042be48c.json
  time_field timestamp
  time_slice_format %Y%m%d
  table user$%{time_slice}
  ignore_unknown_values
  schema_path /etc/td-agent/schema/user_login.json
</match>
Connector uses:
- JSON key auth file
- JSON table schema
Pro features:
- streaming to partitioned tables
- ignore unknown values (not reflected in schema)
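The two "pro features" can be illustrated with a short Python sketch. It assumes the `table user$%{time_slice}` setting resolves to a day-partition decorator such as `user$20180101`, and it builds a request body in the shape of BigQuery's streaming insert (`tabledata.insertAll`) without sending it. The helper names and the choice of `uid` as the deduplication key are assumptions made for illustration.

```python
from datetime import datetime, timezone


def partitioned_table(table, event_time):
    """Return `table$YYYYMMDD`, the decorator addressing one day partition."""
    day = datetime.fromtimestamp(event_time, tz=timezone.utc).strftime("%Y%m%d")
    return "{}${}".format(table, day)


def insert_all_body(records, insert_id_key="uid"):
    """Build a streaming-insert request body for a list of records."""
    return {
        # Fields missing from the table schema are dropped instead of failing.
        "ignoreUnknownValues": True,
        # insertId lets BigQuery de-duplicate retried rows on a best-effort basis.
        "rows": [{"insertId": str(r[insert_id_key]), "json": r} for r in records],
    }


if __name__ == "__main__":
    print(partitioned_table("user", 1514764800))
    print(insert_all_body([{"uid": "u-1", "action": "login"}]))
```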
Where to use BigQuery?
● On data that is difficult to process/analyze using traditional databases
● Not a replacement for traditional DBs, but it complements the system
● Major strength is handling large datasets
● Applying JavaScript UDFs on columnar storage to resolve complex tasks (e.g., JS for natural language processing)
● On streams (forms, IoT, Kafka)
● On exploring unstructured data
➢ Optimize product pages
➢ Email engagement
➢ Funnel Analysis
Achievements - goal reached by measuring everything
● Funnel Analysis
Achievements
Funnel analysis: Time on upsell pages
Attribute credit to first article visited on purchase
Example HITS chain:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
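The attribution rule can be sketched in a few lines of Python. This is an illustration, not the query used in the talk: it assumes page names follow the prefixes shown in the example chains (`article...`, `orderpage...`) and that each visit is an ordered list of hits.

```python
def credit_first_article(hits):
    """Return the first article seen before the first order page, else None."""
    for i, page in enumerate(hits):
        if page.startswith("orderpage"):
            # Look only at hits that happened before the purchase.
            return next((p for p in hits[:i] if p.startswith("article")), None)
    return None  # no purchase in this chain, so no credit is assigned


if __name__ == "__main__":
    chains = [
        ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"],
        ["page1", "article2", "page3", "orderpage2"],
    ]
    for chain in chains:
        print(credit_first_article(chain))
```

In BigQuery itself the same logic would typically be expressed with window functions over the hits of a session; the Python version only makes the rule explicit.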
Achievements Continued
● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics: engaged users, etc.
Our benefits
● SQL language to run Big Data queries
● run raw ad-hoc queries (by analysts, sales, or devs)
● no more throwing away, expiring, or aggregating old data
● no provisioning/deploy
● no running out of resources
● no more focus on large-scale execution plans
● no need to re-implement tricky concepts (time windows / join streams)
● No manual sharding
● No capacity guessing
● No idle resources
● No maintenance windows
● No manual scaling
● No file mgmt
BigQuery: Serverless Data Warehouse
Serverless means
● No servers to provision or manage
● Abstract away the complexity
● Scales with usage (ready every time for viral spikes or #BlackFriday)
● Availability and fault tolerance built in
● No orchestration in code
● Never pay for idle
● Cost savings (P.S.: we don’t have the same budget for security as GCP or AWS)
● Decoupled: APIs as contracts
● Monitored: Metrics and logging are a universal right
● Think concurrent, stateless, queue, stream based.
Easily Build Custom Reports and Dashboards
Thank you.
Slides available on:
slideshare.net/martonkodok
Reea.net - Integrated web solutions driven by
creativity to deliver projects.
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery

  • 1. Creating #serverless data analytics system on GCP using BigQuery Márton Kodok / @martonkodok Google Developer Expert at REEA.net March 2018 - Tirgu Mures, Romania
  • 2. ● Geek. Hiker. Do-er. ● Among the Top3 romanians on Stackoverflow 120k reputation ● Google Developer Expert on Cloud technologies ● Crafting Web/Mobile backends at REEA.net ● BigQuery/Redis and database engine expert ● Active in mentoring and IT community Twitter: @martonkodok StackOverflow: pentium10 Slideshare: martonkodok GitHub: pentium10 Creating #serverless data analytics system on GCP using BigQuery @martonkodok About me
  • 3. REEA.net uses GCP Build on the same infrastructure that powers Google
  • 4. Google Cloud Platform (GCP) Compute Big Data BigQuery Cloud Dataflow Cloud Dataproc Cloud Datalab Cloud Pub/Sub Genomics Storage & Databases Cloud Storage Cloud Bigtable Cloud Datastore Cloud SQL Cloud Spanner Persistent Disk Machine Learning Cloud Machine Learning Cloud Vision API Cloud Speech API Cloud Natural Language API Cloud Translation API Cloud Jobs API Data Studio Cloud Dataprep Cloud Video Intelligence API Advanced Solutions Lab Compute Engine App Engine Kubernetes Engine GPU Cloud Functions Container- Optimized OS Identity & Security Cloud IAM Cloud Resource Manager Cloud Security Scanner Key Management Service BeyondCorp Data Loss Prevention API Identity-Aware Proxy Security Key Enforcement Internet of Things Cloud IoT Core Transfer Appliance
  • 5. Google Cloud Platform (GCP) Developer Tools Cloud SDK Cloud Deployment Manager Cloud Source Repositories Cloud Tools for Android Studio Cloud Tools for IntelliJ Cloud Tools for PowerShell Cloud Tools for Visual Studio Container Registry Google Plug-in for Eclipse Cloud Test Lab Networking Virtual Private Cloud Cloud Load Balancing Cloud CDN Cloud Interconnect Cloud DNS Cloud Network Cloud External IP Addresses Cloud Firewall Rules Cloud Routes Cloud VPN Management Tools Stackdriver Monitoring Logging Error Reporting Trace Debugger Cloud Deployment Manager Cloud Endpoints Cloud Console Cloud Shell Cloud Mobile App Cloud Billing API Cloud APIs Cloud Router Dedicated Interconnect Container Builder
  • 6. Meet Serverless Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 7. Meet Serverless serverless data center depicted Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 8. Event-driven serverless compute platform Cloud Services Changes in data state Business logic events Integrations Event Router Gateway HTTPS Event Source Multiple Platforms Data Warehouse Pub/Sub Cloud Functions Streaming Business Value Application Task Analysis
  • 9. Serverless is about maximizing elasticity, cost savings, and agility of cloud computing. @martonkodok Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 10. Crafting a solution for building high-performance, petabyte scale data analytics, serverless reporting system on Google Cloud Platform Goal today Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 11. Legacy Reporting System App Cloud Load Balancing NGINX Compute Engine 10GB PD 2 1 Database Service (Master/Slave) Compute Engine 10GB PD 4 1 Compute Engine 10GB PD 4 1 Compute Engine 10GB PD 4 1 Report & Share Business Analysis Scheduled Tasks Batch Processing Compute Engine Multiple Instances Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 12. Serverless Reporting System App Cloud Load Balancing NGINX Compute Engine 10GB PD 2 1 Database Service (Master/Slave) Compute Engine 10GB PD 4 1 Compute Engine 10GB PD 4 1 Compute Engine 10GB PD 4 1 Report & Share Business Analysis Scheduled Tasks Batch Processing Compute Engine Multiple Instances BigQuery Data Studio Report & Share Business Analysis Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 13. Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 14. Analytics-as-a-Service - Data Warehouse in the Cloud Scales into Petabytes on Managed Google Infrastructure (US or EU zone) SQL 2011 + Javascript UDF (User Defined Functions) Familiar DB Structure (table, views, struct, nested, JSON) Integrates with Google Sheets + Cloud Storage + Pub/Sub connectors Decent pricing (queries $5/TB, storage: $20/TB cold: $10/TB) *Mar 2018 Open Interfaces (Web UI, BQ command line tool, REST, ODBC) What is BigQuery? Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 15. Columnar storage (max 10 000 columns in table) Large files for loading: 5TB (CSV or JSON) UDF in Javascript or SQL Rich SQL 2011: JSON,IP,Math,RegExp,Geocode,Window functions Modern data types: Record, Nested, Struct, Array. Append-only tables prefered (DML syntax available) Day column partitioned tables (select * from t where day=’2018-01-01’) BigQuery: Convenience of SQL Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 16. Architecting for The Cloud BigQuery On-Premises Servers Pipelines ETL Engine Event Sourcing Frontend Platform Services Metrics / Logs/ Streaming Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 17. “ Our project generates many/big files. How can I seamlessly ingest them? Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 18. Serverless file ingest BigQuery On-Premises Servers ApplicationEvent Sourcing Frontend Platform Services Metrics / Logs/ Streaming Cloud Storage Cloud Functions Triggered Code Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 19. “ Data needs to be processed in multiple services. How can we pipe to multiple places? Creating #serverless data analytics system on GCP using BigQuery @martonkodok
  • 20. Architecting for The Cloud On-Premises Servers Event Sourcing Frontend Platform Services Analyze Metrics / Logs/ Streaming Cloud Storage Creating #serverless data analytics system on GCP using BigQuery @martonkodok Cloud Dataflow Process BigQuery Cloud SQL Stream Batch Data Studio Third-Party Tools
  • 21. “We have our app outside of GCP. How can we use the benefits of BigQuery?”
  • 22. Data Pipeline Integration at REEA.net (diagram). Components: Standard Devices (HTTPS), Application Servers, Database SQL, On-Premises Servers, Event Sourcing, Frontend Platform Services, Metrics / Logs / Streaming; Pipelines: FluentD; Analytics Backend: BigQuery; Cloud Storage archive (Load, Export, Replay); Report & Share: Business Analysis Tools (Tableau, QlikView, Data Studio, Internal Dashboard); Development Team, Data Analysts.
  • 23. The following slides present a sample Fluentd configuration to:
      1. Transform a record
      2. Copy the event to multiple outputs
      3. Store event data in a file (for backup/log purposes)
      4. Stream to BigQuery (for immediate analysis)
  • 24. Fluentd configuration skeleton:
        <filter frontend.user.*>
          @type record_transformer
        </filter>
        <match frontend.user.*>
          @type copy
          <store>
            @type forest
            subtype file
          </store>
          <store>
            @type bigquery
          </store>
          …
        </match>
      ● The filter plugin mutates incoming data: add/modify/delete event data and transform attributes without a code deploy.
      ● The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great for shipping the same event to multiple subsystems.
      ● The BigQuery output plugin streams each event to the BigQuery warehouse on the fly. No need to write an integration; data is available immediately for querying.
      ● Other output plugins can be wired in whenever needed: Kafka, Google Cloud Storage.
  • 25. Steps: 1. record_transformer  2. copy  3. file  4. BigQuery
        <filter frontend.user.*>
          @type record_transformer
          enable_ruby
          remove_keys host
          <record>
            bq {"insert_id":"${uid}","host":"${host}", "created":"${time.to_i}"}
            avg ${record["total"] / record["count"]}
          </record>
        </filter>
      Syntax: Ruby, easy to use. Great for:
      ● date transformation
      ● quick normalizations
      ● calculating something on the fly and storing it in a clear log/analytics DB
      ● renaming without a code deploy
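For readers who do not use Fluentd, the effect of this filter on a single event can be mimicked in a few lines of Python. This is a sketch, not the plugin's implementation: the function name `transform` is hypothetical, and it assumes that `${uid}` and `${host}` refer to fields of the incoming record and that new fields are computed before `host` is removed.

```python
import json


def transform(record, event_time):
    """Mimic the slide's filter: add `bq` and `avg`, then drop `host`."""
    out = dict(record)  # copy so the caller's event is left untouched
    # `bq` bundles the event metadata as a JSON string, like the <record> block.
    out["bq"] = json.dumps({
        "insert_id": str(record["uid"]),
        "host": str(record["host"]),
        "created": str(int(event_time)),  # counterpart of time.to_i
    })
    # Note: Ruby's `/` on two integers performs integer division,
    # while Python's `/` always returns a float.
    out["avg"] = record["total"] / record["count"]
    out.pop("host", None)  # counterpart of `remove_keys host`
    return out


if __name__ == "__main__":
    event = {"uid": "u1", "host": "web1", "total": 10, "count": 4}
    print(transform(event, 1520000000.7))
```

Keeping this logic in the pipeline configuration rather than in the application is what allows fields to be renamed or derived without a code deploy.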
  • 26. Steps: 1. record_transformer  2. copy  3. file  4. BigQuery
        <match frontend.user.*>
          @type copy
          <store>
            @type forest
            subtype file
            <template>
              path /tank/storage/${tag}.*.log
              time_slice_format %Y%m%d
            </template>
          </store>
        </match>
  • 27. Steps: 1. record_transformer  2. copy  3. file  4. BigQuery
        <match frontend.user.*>
          @type bigquery
          method insert
          auth_method json_key
          json_key /etc/td-agent/keys/key-31da042be48c.json
          time_field timestamp
          time_slice_format %Y%m%d
          table user$%{time_slice}
          ignore_unknown_values
          schema_path /etc/td-agent/schema/user_login.json
        </match>
      Connector uses:
      ● JSON key auth file
      ● JSON table schema
      Pro features:
      ● streaming to partitioned tables
      ● ignoring unknown values (fields not reflected in the schema)
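The `table user$%{time_slice}` line targets one daily partition per event by appending a `$YYYYMMDD` decorator to the table name. A small Python sketch of that naming rule follows; the function name `partitioned_table` is hypothetical, and using UTC to pick the day is an assumption (the time zone actually used depends on the Fluentd setup).

```python
from datetime import datetime, timezone


def partitioned_table(base, epoch_seconds, slice_format="%Y%m%d"):
    """Return `base$<time slice>` for the day the event falls on (UTC assumed)."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}${day.strftime(slice_format)}"


if __name__ == "__main__":
    # 1519862400 is 2018-03-01 00:00:00 UTC.
    print(partitioned_table("user", 1519862400))
```

Routing each event to its own day partition is what lets later queries restrict themselves to a date range instead of scanning the whole table.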
  • 28. Where to use BigQuery?
      ● On data that is difficult to process/analyze using traditional databases
      ● Not a replacement for traditional DBs, but it complements the system
      ● Major strength is handling large datasets
      ● Applying JavaScript UDFs on columnar storage to resolve complex tasks (e.g. JS for natural language processing)
      ● On streams (forms, IoT, Kafka)
      ● On exploring unstructured data
  • 29. Achievements: goal reached by measuring everything
      ➢ Optimize product pages
      ➢ Email engagement
      ➢ Funnel Analysis
  • 30. Achievements
      ● Funnel Analysis
  • 31. Funnel analysis: Time on upsell pages
  • 32. Attribute credit to first article visited on purchase
      Example HITS chain:
      ● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
      ● page1 -> article2 -> page3 -> orderpage2 -> ...
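The attribution rule on this slide (credit the first article a visitor saw before purchasing) can be sketched in Python as below. The talk runs this kind of analysis in BigQuery SQL; the version here only illustrates the logic. The function name `first_article_credit` is hypothetical, and it assumes that page names starting with `article` are articles and that a `thankyoupage` hit marks a completed purchase. The second chain is truncated on the slide, so the example treats it as unfinished.

```python
from collections import Counter


def first_article_credit(chains):
    """Count purchases credited to the first article seen before the purchase."""
    credit = Counter()
    for hits in chains:
        # Position of the first thank-you page, or None if no purchase happened.
        purchase_at = next(
            (i for i, h in enumerate(hits) if h.startswith("thankyoupage")), None
        )
        if purchase_at is None:
            continue
        # First article among the hits that precede the purchase.
        first_article = next(
            (h for h in hits[:purchase_at] if h.startswith("article")), None
        )
        if first_article is not None:
            credit[first_article] += 1
    return credit


if __name__ == "__main__":
    chains = [
        ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"],
        ["page1", "article2", "page3", "orderpage2"],  # no purchase recorded
    ]
    print(first_article_credit(chains))
```

This only works because every raw hit is kept; pre-aggregated page-view counts would not preserve the order of visits within a chain.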
  • 33. Achievements Continued
      ● Funnel Analysis
      ● Email URL click heatmap
      ● Email Health Dashboard (SPAM, ISP deferral, content A/B split tests, trends or low open rate campaigns)
      ● Advanced segmentation (all raw data stored)
      ● Behavioral analytics: engaged users, etc.
  • 34. Our benefits
      ● SQL language to run Big Data queries
      ● run raw ad-hoc queries (by analysts, sales, or devs)
      ● no more throwing away, expiring, or aggregating old data
      ● no provisioning/deploy
      ● no running out of resources
      ● no more focusing on large-scale execution plans
      ● no need to re-implement tricky concepts (time windows / join streams)
  • 35. BigQuery: Serverless Data Warehouse
      ● No manual sharding
      ● No capacity guessing
      ● No idle resources
      ● No maintenance windows
      ● No manual scaling
      ● No file management
  • 36. Serverless means
      ● No servers to provision or manage
      ● Abstract away the complexity
      ● Scales with usage (always ready for viral spikes or #BlackFriday)
      ● Availability and fault tolerance built in
      ● No orchestration in code
      ● Never pay for idle
      ● Cost savings (P.S.: we don't have the same security budget as GCP or AWS)
      ● Decoupled: APIs as contracts
      ● Monitored: metrics and logging are a universal right
      ● Think concurrent, stateless, queue- and stream-based
  • 37. Easily Build Custom Reports and Dashboards
  • 38. Thank you. Slides available on: slideshare.net/martonkodok Reea.net - Integrated web solutions driven by creativity to deliver projects.