Creating #serverless data analytics
system on GCP using BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
March 2018 - Tirgu Mures, Romania
About me
● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow (120k reputation)
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery/Redis and database engine expert
● Active in mentoring and the IT community
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10
Creating #serverless data analytics system on GCP using BigQuery @martonkodok
REEA.net uses GCP
Build on the same infrastructure
that powers Google
Google Cloud Platform (GCP)
Compute: Compute Engine, App Engine, Kubernetes Engine, GPU, Cloud Functions, Container-Optimized OS
Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics, Data Studio, Cloud Dataprep
Storage & Databases: Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API, Cloud Video Intelligence API, Advanced Solutions Lab
Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, Key Management Service, BeyondCorp, Data Loss Prevention API, Identity-Aware Proxy, Security Key Enforcement
Internet of Things: Cloud IoT Core
Transfer Appliance
Google Cloud Platform (GCP)
Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Container Registry, Google Plug-in for Eclipse, Cloud Test Lab, Container Builder
Networking: Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud DNS, Cloud Network, Cloud External IP Addresses, Cloud Firewall Rules, Cloud Routes, Cloud VPN, Cloud Router, Dedicated Interconnect
Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
Meet Serverless
serverless data center depicted
Event-driven serverless compute platform
(diagram labels: Cloud Services, Changes in data state, Business logic events, Integrations, Event Router, Gateway, HTTPS, Event Source, Multiple Platforms, Data Warehouse, Pub/Sub, Cloud Functions, Streaming, Business Value, Application, Task, Analysis)
Serverless is about maximizing elasticity, cost
savings, and agility of cloud computing.
@martonkodok
Goal today
Crafting a solution for building a high-performance, petabyte-scale, serverless data analytics and reporting system on Google Cloud Platform
Legacy Reporting System
(diagram labels: App, Cloud Load Balancing, NGINX on Compute Engine (10GB PD), Database Service (Master/Slave) on three Compute Engine instances (10GB PD each), Report & Share, Business Analysis, Scheduled Tasks, Batch Processing on Compute Engine (Multiple Instances))
Serverless Reporting System
(diagram labels: the same components as the Legacy Reporting System, i.e. App, Cloud Load Balancing, NGINX, Database Service (Master/Slave), Scheduled Tasks, Batch Processing, Report & Share, Business Analysis; shown together with BigQuery and Data Studio for Report & Share and Business Analysis)
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Scales into petabytes on managed Google infrastructure (US or EU zone)
SQL 2011 + JavaScript UDF (User Defined Functions)
Familiar DB structure (tables, views, struct, nested, JSON)
Integrates with Google Sheets + Cloud Storage + Pub/Sub connectors
Decent pricing (queries: $5/TB, storage: $20/TB/month, cold storage: $10/TB/month) *Mar 2018
Open interfaces (Web UI, BQ command line tool, REST, ODBC)
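To make the price points concrete, here is a rough back-of-the-envelope estimator. It is a sketch only: it uses the March 2018 list prices quoted on the slide, treats the storage prices as per TB per month, and ignores free tiers and discounts. The function and parameter names are made up for illustration.

```python
# Prices quoted on the slide (March 2018); check current pricing before relying on them.
QUERY_USD_PER_TB = 5.0
STORAGE_USD_PER_TB_MONTH = 20.0
COLD_STORAGE_USD_PER_TB_MONTH = 10.0


def estimate_monthly_cost(scanned_tb: float, active_tb: float, cold_tb: float = 0.0) -> float:
    """Rough monthly bill in USD: on-demand queries plus active and cold storage."""
    query_cost = scanned_tb * QUERY_USD_PER_TB
    storage_cost = (active_tb * STORAGE_USD_PER_TB_MONTH
                    + cold_tb * COLD_STORAGE_USD_PER_TB_MONTH)
    return query_cost + storage_cost


if __name__ == "__main__":
    # Hypothetical workload: 10 TB scanned, 2 TB active storage, 3 TB cold storage.
    print(estimate_monthly_cost(scanned_tb=10, active_tb=2, cold_tb=3))
```

With on-demand pricing the query term dominates quickly, which is why reducing scanned bytes is usually the first optimization to look at.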
BigQuery: Convenience of SQL
Columnar storage (max 10 000 columns in table)
Large files for loading: 5TB (CSV or JSON)
UDF in JavaScript or SQL
Rich SQL 2011: JSON, IP, Math, RegExp, Geocode, Window functions
Modern data types: Record, Nested, Struct, Array
Append-only tables preferred (DML syntax available)
Day column partitioned tables (select * from t where day='2018-01-01')
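Columnar storage and day partitions matter because on-demand queries are billed by the bytes they read. The sketch below is a toy model, not BigQuery's actual accounting: it assumes every day partition stores the same number of bytes per column, and the table, column names, and sizes are all hypothetical.

```python
from typing import Dict, Iterable


def scanned_bytes(column_bytes_per_day: Dict[str, int],
                  columns: Iterable[str],
                  days: int) -> int:
    """Bytes read when a query touches only `columns` in `days` day partitions."""
    return days * sum(column_bytes_per_day[name] for name in columns)


# Hypothetical table: per-day size of each column, in bytes.
TABLE = {"user_id": 2 * 10**9, "url": 30 * 10**9, "payload": 200 * 10**9}

# select * over a full year reads every column in every partition.
full_scan = scanned_bytes(TABLE, TABLE.keys(), days=365)
# select user_id, url ... where day='2018-01-01' reads two columns in one partition.
pruned_scan = scanned_bytes(TABLE, ["user_id", "url"], days=1)

if __name__ == "__main__":
    print(full_scan, pruned_scan)
```

The practical takeaway is to name only the columns you need and to filter on the partition column, rather than running `select *` across the whole table.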
Architecting for The Cloud
(diagram labels: BigQuery, On-Premises Servers, Pipelines, ETL Engine, Event Sourcing, Frontend, Platform Services, Metrics / Logs / Streaming)
“Our project generates many/big files.
How can I seamlessly ingest them?”
Serverless file ingest
(diagram labels: BigQuery, On-Premises Servers, Application, Event Sourcing, Frontend, Platform Services, Metrics / Logs / Streaming, Cloud Storage, Cloud Functions, Triggered Code)
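One way to read this diagram: a file landing in Cloud Storage fires a Cloud Function, and the function asks BigQuery to load that file. The sketch below only builds the request body for BigQuery's `jobs.insert` REST method from a Cloud Storage event; sending it and authenticating are left out. It assumes newline-delimited JSON files and an event that carries `bucket` and `name` fields. The project, dataset, table, and bucket names are placeholders.

```python
from typing import Any, Dict


def build_load_job(event: Dict[str, Any], project: str, dataset: str, table: str) -> Dict[str, Any]:
    """Turn a Cloud Storage 'object finalized' event into a BigQuery load job body."""
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    return {
        "configuration": {
            "load": {
                "sourceUris": [uri],
                "sourceFormat": "NEWLINE_DELIMITED_JSON",  # assumption: JSON lines input
                "writeDisposition": "WRITE_APPEND",        # append rather than overwrite
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical event, shaped like what a storage-triggered function receives.
    sample_event = {"bucket": "my-ingest-bucket", "name": "exports/2018-03-01.json"}
    print(build_load_job(sample_event, "my-project", "analytics", "events"))
```

Keeping the request construction in a pure function like this makes the triggered code easy to test without touching any cloud service.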
“Data needs to be processed in
multiple services.
How can we pipe to multiple places?”
Architecting for The Cloud
(diagram labels: On-Premises Servers, Event Sourcing, Frontend, Platform Services, Analyze, Metrics / Logs / Streaming, Cloud Storage, Cloud Dataflow, Process, BigQuery, Cloud SQL, Stream, Batch, Data Studio, Third-Party Tools)
“We have our app outside of GCP.
How can we use the benefits of BigQuery?”
Data Pipeline Integration at REEA.net
(diagram labels: Analytics Backend, BigQuery, On-Premises Servers, Pipelines, FluentD, Event Sourcing, Frontend, Platform Services, Metrics / Logs / Streaming, Development Team, Data Analysts, Report & Share, Business Analysis, Tools, Tableau, QlikView, Data Studio, Internal Dashboard, Database, SQL, Application Servers, Servers, Cloud Storage, archive, Load, Export, Replay, Standard Devices, HTTPS)
The following slides will present a sample Fluentd configuration to:
1. Transform a record
2. Copy event to multiple outputs
3. Store event data in File (for backup/log purposes)
4. Stream to BigQuery (for immediate analyses)
<filter frontend.user.*>
  @type record_transformer
</filter>
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
  </store>
  <store>
    @type bigquery
  </store>
  …
</match>
Filter plugin mutates incoming data. Add, modify, or delete event data and transform attributes without a code deploy.
The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great for shipping the same event to multiple subsystems.
The BigQuery output plugin streams the event on the fly to the BigQuery warehouse. No need to write an integration. Data is available immediately for querying.
Whenever needed, other output plugins can be wired in: Kafka, Google Cloud Storage output plugin.
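The fan-out idea behind the copy plugin can be sketched outside Fluentd too. The following is an illustration in Python, not how Fluentd is implemented: each output is just a callable, and every output receives its own copy of the event so one sink cannot mutate what another sees. All names are hypothetical.

```python
import copy
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Output = Callable[[Event], None]


def copy_to_outputs(event: Event, outputs: List[Output]) -> None:
    """Deliver an independent deep copy of `event` to every output."""
    for output in outputs:
        output(copy.deepcopy(event))


if __name__ == "__main__":
    file_log: List[Event] = []   # stands in for the file store
    warehouse: List[Event] = []  # stands in for the BigQuery store
    copy_to_outputs({"uid": "42", "action": "login"}, [file_log.append, warehouse.append])
    print(file_log, warehouse)
```

Adding another destination is then a matter of appending one more callable to the list, which mirrors adding one more `<store>` section.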
record_transformer copy file BigQuery
<filter frontend.user.*>
  @type record_transformer
  enable_ruby
  remove_keys host
  <record>
    bq {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
    avg ${record["total"] / record["count"]}
  </record>
</filter>
Syntax: Ruby, easy to use.
Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in a clear log/analytics DB,
- renaming without a code deploy.
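For readers who do not know Ruby, roughly the same transformation can be written in Python. This is a sketch of what the filter appears to do, under assumptions: `uid`, `host`, `total`, and `count` are fields of the incoming record, and `event_time` is the event's Unix timestamp. Note that Ruby's `/` truncates when both operands are integers, while the sketch uses true division.

```python
from typing import Any, Dict


def transform(record: Dict[str, Any], event_time: float) -> Dict[str, Any]:
    """Mimic the filter: add `bq` and `avg`, then drop the top-level `host` key."""
    out = dict(record)  # leave the caller's record untouched
    out["bq"] = {
        "insert_id": str(record["uid"]),
        "host": str(record["host"]),
        "created": str(int(event_time)),
    }
    # True division; the Ruby expression truncates for integer operands.
    out["avg"] = record["total"] / record["count"]
    out.pop("host", None)  # counterpart of `remove_keys host`
    return out


if __name__ == "__main__":
    sample = {"uid": "u-1", "host": "web-01", "total": 9, "count": 2}
    print(transform(sample, event_time=1520000000.0))
```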
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
    <template>
      path /tank/storage/${tag}.*.log
      time_slice_format %Y%m%d
    </template>
  </store>
</match>
<match frontend.user.*>
  @type bigquery
  method insert
  auth_method json_key
  json_key /etc/td-agent/keys/key-31da042be48c.json
  time_field timestamp
  time_slice_format %Y%m%d
  table user$%{time_slice}
  ignore_unknown_values
  schema_path /etc/td-agent/schema/user_login.json
</match>
Connector uses:
- JSON key auth file
- JSON table schema
Pro features:
- streaming to partitioned tables
- ignore unknown values (not reflected in schema)
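What the connector sends can be approximated by hand. The sketch below builds a request for BigQuery's `tabledata.insertAll` REST method: the table id gets a `$YYYYMMDD` partition decorator derived from the event time (mirroring `time_slice_format %Y%m%d`), each row carries an `insertId` that BigQuery uses for best-effort de-duplication, and `ignoreUnknownValues` corresponds to the `ignore_unknown_values` flag. It is an illustration under assumptions, not the plugin's code: the function names are invented, and UTC is assumed for the partition day.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple


def partition_table_id(table: str, unix_ts: float) -> str:
    """Return e.g. 'user$20180301' for the UTC day that contains `unix_ts`."""
    day = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y%m%d")
    return "{}${}".format(table, day)


def build_insert_all(rows: List[Tuple[str, Dict[str, Any]]]) -> Dict[str, Any]:
    """Build an insertAll body from (insert_id, row) pairs."""
    return {
        "ignoreUnknownValues": True,  # fields missing from the schema are dropped
        "rows": [{"insertId": insert_id, "json": row} for insert_id, row in rows],
    }


if __name__ == "__main__":
    print(partition_table_id("user", 1519862400))  # 2018-03-01 00:00:00 UTC
    print(build_insert_all([("u-1", {"uid": "u-1", "action": "login"})]))
```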
Where to use BigQuery?
● On data that is difficult to process/analyze using traditional databases
● Not a replacement for traditional DBs, but it complements the system
● Major strength is handling large datasets
● Applying JavaScript UDFs on columnar storage to resolve complex tasks (e.g. JS for natural language processing)
● On streams (forms, IoT, Kafka)
● On exploring unstructured data
Achievements - goal reached by measuring everything
➢ Optimize product pages
➢ Email engagement
➢ Funnel Analysis
Achievements
● Funnel Analysis
Funnel analysis: Time on upsell pages
Attribute credit to first article visited on purchase
Example HITS chain:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
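The attribution rule can be sketched as a small function. It assumes, purely for illustration, that page names starting with `article` are articles and that a chain counts as a purchase once it reaches a page starting with `thankyoupage`; the real system presumably classifies hits differently.

```python
from typing import List, Optional


def first_article_credit(hits: List[str]) -> Optional[str]:
    """Return the first article seen before the first purchase, or None."""
    for position, page in enumerate(hits):
        if page.startswith("thankyoupage"):
            # Purchase found: credit the earliest article that preceded it.
            for earlier in hits[:position]:
                if earlier.startswith("article"):
                    return earlier
            return None
    return None  # chain never reached a purchase


if __name__ == "__main__":
    chain = ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"]
    print(first_article_credit(chain))
```

Chains that never reach a purchase, or that reach one without any earlier article, yield no credit in this sketch.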
Achievements Continued
● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics - engaged users etc.
Our benefits
● SQL language to run BigData queries
● run raw ad-hoc queries (by analysts, sales, or devs)
● no more throwing away, expiring, or aggregating old data
● no provisioning/deploy
● no running out of resources
● no more focus on large-scale execution plans
● no need to re-implement tricky concepts (time windows / join streams)
BigQuery: Serverless Data Warehouse
● No manual sharding
● No capacity guessing
● No idle resources
● No maintenance windows
● No manual scaling
● No file management
Serverless means
● No servers to provision or manage
● Abstract away the complexity
● Scales with usage (always ready for viral spikes or #BlackFriday)
● Availability and fault tolerance built in
● No orchestration in code
● Never pay for idle
● Cost savings (P.S. we don't have the same security budget as GCP or AWS)
● Decoupled: APIs as contracts
● Monitored: metrics and logging are a universal right
● Think concurrent, stateless, queue- and stream-based
Easily Build Custom Reports and Dashboards
Thank you.
Slides available on:
slideshare.net/martonkodok
Reea.net - Integrated web solutions driven by
creativity to deliver projects.

CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
