JRuby with Java Code
in Data Processing World
JRubyConf.EU, 31 Jul 2015
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, Norikra, MessagePack-Ruby,...
Docker logging driver for Fluentd (docker v1.8)
Treasure Data, Inc.
https://jobs.lever.co/treasure-data
We're hiring!
OSS team (developer / community manager)
Distributed system engineer (Hadoop, queue/workers)
Front-end engineer (RoR)
Data Processing World
Java
Data Processing World
Hadoop, Spark, Tez, Flink, Storm, Kafka, ...
Hive, Pig, Drill, Impala, Presto, ....
Java + Scala, Clojure + C++, ....
Data Processing World
on JVM
Data Processing World
Many CPU cores, Large memory, High rate Disk I/O, ...
High throughput data processing
Hadoop YARN/MapReduce/HDFS API compatibility
Two OSS projects using Java & JRuby
Norikra:
Stream Processing with SQL for everybody
Server software, written in JRuby, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
Distributed on rubygems.org
"gem i norikra"
What Norikra does:
SELECT path, SUM(bytes) AS s
FROM www_access_logs.win:length_batch(10)
WHERE status=200
GROUP BY path ORDER BY s DESC
Event 1: {"path":"/", "status":200, "bytes":301, "duration":0.03, "referer":"...", "user-agent":"...."}
  path:"/", s:301
Event 2: {"path":"/download/a", "status":200, "bytes":10240, "duration":0.53, "referer":"...", "user-agent":"...."}
  path:"/", s:301
  path:"/download/a", s:10240
Event 3: {"path":"/", "status":404, "bytes":0, "duration":0.08, "referer":"...", "user-agent":"...."}
  (status 404: not counted)
  path:"/", s:301
  path:"/download/a", s:10240
Event 4: {"path":"/", "status":200, "bytes":301, "duration":0.01, "referer":"...", "user-agent":"...."}
  path:"/", s:602
  path:"/download/a", s:10240
Event 5: {"path":"/download/b", "status":200, "bytes":678, "duration":0.11, "referer":"...", "user-agent":"...."}
  path:"/", s:602
  path:"/download/a", s:10240
  path:"/download/b", s:678
Event 6: {"path":"/download/b", "status":200, "bytes":678, "duration":0.13, "referer":"...", "user-agent":"...."}
  path:"/", s:602
  path:"/download/a", s:10240
  path:"/download/b", s:1356
Event 7: {"path":"/", "status":200, "bytes":301, "duration":0.02, "referer":"...", "user-agent":"...."}
  path:"/", s:903
  path:"/download/a", s:10240
  path:"/download/b", s:1356
Event 8: {"path":"/", "status":200, "bytes":301, "duration":0.09, "referer":"...", "user-agent":"...."}
  path:"/", s:1204
  path:"/download/a", s:10240
  path:"/download/b", s:1356
Event 9: {"path":"/download/a", "status":200, "bytes":10240, "duration":1.1, "referer":"...", "user-agent":"...."}
  path:"/", s:1204
  path:"/download/a", s:20480
  path:"/download/b", s:1356
Event 10: {"path":"/", "status":200, "bytes":301, "duration":0.05, "referer":"...", "user-agent":"...."}
  path:"/", s:1505
  path:"/download/a", s:20480
  path:"/download/b", s:1356
Output after 10 events (batch complete):
{"path":"/download/a", "s":20480}
{"path":"/", "s":1505}
{"path":"/download/b", "s":1356}
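The walk-through above can be imitated in plain Ruby. This is only a sketch of the query's semantics, not Norikra or Esper code: `LengthBatchWindow`, `summarize` and `ev` are hypothetical names, the events carry only the fields the query uses, and every status-200 hit on "/" is taken as 301 bytes, which is what the running sums imply.

```ruby
# Hypothetical stand-in for the query on the slides; not Norikra or Esper code.
BATCH_SIZE = 10

# Mirrors: SELECT path, SUM(bytes) AS s ... WHERE status=200
#          GROUP BY path ORDER BY s DESC
def summarize(batch)
  sums = Hash.new(0)
  batch.each do |event|
    next unless event["status"] == 200
    sums[event["path"]] += event["bytes"]
  end
  sums.sort_by { |_path, s| -s }.map { |path, s| { "path" => path, "s" => s } }
end

# Mirrors .win:length_batch(10): buffer events, emit one result per full batch.
class LengthBatchWindow
  def initialize(size)
    @size = size
    @buffer = []
  end

  # Returns the summarized rows when the batch fills up, otherwise nil.
  def push(event)
    @buffer << event
    return nil if @buffer.size < @size
    batch, @buffer = @buffer, []
    summarize(batch)
  end
end

# Builds an event with only the fields the query reads.
def ev(path, status, bytes)
  { "path" => path, "status" => status, "bytes" => bytes }
end

EVENTS = [
  ev("/", 200, 301), ev("/download/a", 200, 10240), ev("/", 404, 0),
  ev("/", 200, 301), ev("/download/b", 200, 678), ev("/download/b", 200, 678),
  ev("/", 200, 301), ev("/", 200, 301), ev("/download/a", 200, 10240),
  ev("/", 200, 301)
]

window = LengthBatchWindow.new(BATCH_SIZE)
RESULTS = EVENTS.map { |e| window.push(e) }
puts RESULTS.last.inspect
```

The first nine calls to `push` return nil; only the tenth returns rows, which matches the single output at the end of the walk-through.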
Norikra and Java
Norikra is written in JRuby and uses Esper
Key factor: productivity (33 days until first release)
Esper: a Java library that provides Complex Event Processing
SQL parser, executor
Many features and good performance
Licensed under GPLv2
Plugins as rubygems
[Diagram: Norikra Server (on JVM). Components: Norikra Engine, Esper (Query Engine), Type Definition Manager, Output Event Pool, RPC Server, mizuno (Jetty + Rack), Rack RPC Handler. Plugins: Listener, UDF]
UDF: User-Defined Functions
  "gem i norikra-udf-xxx"
  written in Java, or JRuby (compiled to Java)
  works in the Esper instance: must be a Java class
Listener: handler for output data of queries, written in JRuby
  "gem i norikra-listener-xxx"
Embulk
"Embulk is a open-source bulk data loader
that helps data transfer between various
databases, storages, file formats, and
cloud services."
http://www.embulk.org/docs/
Embulk:
makes painful data integration work relaxed
Plugin-based parallel bulk data loader
Open source software (Apache License v2.0)
http://www.embulk.org/
https://github.com/embulk/embulk
Distributed as .jar or on rubygems.org
Plugins are on rubygems.org
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.slideshare.net/HiroshiNakamura/embulk-20150411
[Diagram: Embulk bulk-loads data through plugins between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotent retrying
[Diagram: plugin architecture. config; InputPlugin → records → Filter plugins (convert, …) → records → OutputPlugin; Executor plugin (Threads, MapReduce, …) runs input, … output]
[Diagram: plugin architecture for files. config; InputPlugin made of FileInput plugin (HDFS, S3, Riak CS, …) → buffer → Decoder plugin (gzip, bzip2, aes, …) → buffer → Parser plugin (CSV, JSON, pcap, …); records → Filter plugins → records; OutputPlugin made of Formatter plugin → buffer → Encoder plugin → buffer → FileOutput plugin; Executor plugin]
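The chain of file input, decoder, parser, filter, formatter and encoder can be sketched as plain functions. This is a toy illustration under assumptions, not Embulk's API: all method names are hypothetical, the "file" is an in-memory gzip string, and the parser is a naive comma split without quoting support.

```ruby
require "json"
require "zlib"

# Hypothetical stages; the names are illustrative and are not Embulk's API.

# Parser stage: CSV-like text with a header row -> array of record hashes.
# (Naive split on commas; no quoting support.)
def parse_csv(text)
  header, *rows = text.lines.map(&:chomp)
  names = header.split(",")
  rows.map { |row| names.zip(row.split(",")).to_h }
end

# Filter stage: records -> records.
def upcase_name(records)
  records.map { |r| r.merge("name" => r["name"].upcase) }
end

# Formatter stage: records -> one JSON document per line.
def format_jsonl(records)
  records.map { |r| JSON.generate(r) }.join("\n") + "\n"
end

# File input stage: stands in for bytes read from HDFS, S3, ...
SOURCE_BYTES = Zlib.gzip("id,name\n1,alice\n2,bob\n")

# Decoder (gunzip) -> parser -> filter -> formatter -> encoder (gzip).
RECORDS = upcase_name(parse_csv(Zlib.gunzip(SOURCE_BYTES)))
OUTPUT_BYTES = Zlib.gzip(format_jsonl(RECORDS))

puts Zlib.gunzip(OUTPUT_BYTES)
```

Because each stage only agrees on its input and output shape (bytes, text, or records), any one of them can be swapped without touching the others, which is the point of splitting the plugin types this way.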
Embulk and Java
Embulk core is written in Java
mainly for performance
Embulk plugins:
are loaded through a JRuby-based API
are written in JRuby or Java
JRuby for early release
Java for performance
InputPlugin
module Embulk
  class InputExample < InputPlugin
    # Register this class as the input plugin named "example"
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)
      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]
      # BEGIN here
      commit_reports = yield(task, columns, threads)
      # COMMIT here
      puts "Example input finished"
      return {}
    end

    def run(task, schema, index, page_builder)
      puts "Example input thread #{index}…"
      10.times do |i|
        # one record per call, matching the three columns declared above
        page_builder.add([i, 10.0, "example"])
      end
      page_builder.finish
      commit_report = { }
      return commit_report
    end
  end
end
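The two-phase shape of the plugin above (a class-level `transaction` that yields, then a per-task `run`) is easier to follow with a caller in view. The sketch below fakes that caller in plain Ruby: `FakePageBuilder`, `CountingInput` and `run_input` are hypothetical stand-ins, tasks run sequentially here, and none of it is Embulk's real core.

```ruby
# Hypothetical stand-ins for Embulk's core; only the call order is the point.
class FakePageBuilder
  attr_reader :rows, :finished

  def initialize
    @rows = []
    @finished = false
  end

  def add(row)
    @rows << row
  end

  def finish
    @finished = true
  end
end

class CountingInput
  # Class-level phase: decide the task, schema and task count, then hand
  # control back to the caller; the block returns one report per task.
  def self.transaction(config)
    task = { "message" => config["message"] }
    columns = %w[col0 col1 col2]
    task_count = config.fetch("threads", 2)
    reports = yield(task, columns, task_count)
    { "tasks" => reports.size }
  end

  # Per-task phase: emit rows through the page builder, return a report.
  def run(task, schema, index, page_builder)
    3.times { |i| page_builder.add([i, index.to_f, task["message"]]) }
    page_builder.finish
    { "rows" => 3 }
  end
end

# The fake "core": calls transaction, and inside the block runs every task.
def run_input(plugin_class, config)
  builders = []
  result = plugin_class.transaction(config) do |task, schema, count|
    count.times.map do |index|
      builder = FakePageBuilder.new
      builders << builder
      plugin_class.new.run(task, schema, index, builder)
    end
  end
  [result, builders]
end

RESULT, BUILDERS = run_input(CountingInput, { "message" => "hi", "threads" => 2 })
puts RESULT.inspect
```

Everything before the `yield` plays the role of the "BEGIN here" comment and everything after it the "COMMIT here" comment: the plugin sees all task reports before it decides what to return.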
OutputPlugin
require 'json'  # for to_json

module Embulk
  class OutputExample < OutputPlugin
    # Register this class as the output plugin named "example"
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }
      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"
      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    # Called with each page of records
    def add(page)
      page.each do |record|
        # pair column names with values, e.g. {"col0" => 0, ...}
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    # The returned report is collected into commit_reports in transaction
    def commit
      commit_report = {
        "records" => @records
      }
      return commit_report
    end
  end
end
Plugin management: Norikra
[Diagram: Esper instance, Engine, and plugin management, with UDF and Listener plugins; color legend: Java / JRuby]
plugins as gems
plugin loader written in JRuby
Plugin management: Embulk
[Diagram: Embulk core and plugin management, with plugin types input/output/filter, parser/formatter, decoder/encoder, file-input/output, executor; color legend: Java / JRuby]
plugins as gems
plugin loader written in JRuby
Pluggable software
on JVM & Java API
Java? Scala? Clojure? JRuby?: JRuby
Plugin packaging: jar? gem?: gem
rubygems.org >>> maven central (or others)
especially for plugin authors
Plugin loader: Class Loader? "require"?: require
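A `require`-based loader can be very small, which is part of its appeal. The sketch below is a hypothetical registry loosely modeled on the `Plugin.register_input` call in the earlier example; `MiniPlugin` and the path convention `mini_plugin/input/<name>` are assumptions made for illustration, not the loader of Embulk or Norikra.

```ruby
# Hypothetical registry; not Embulk's or Norikra's actual loader.
module MiniPlugin
  @inputs = {}

  # Called by a plugin file when it is loaded.
  def self.register_input(name, klass)
    @inputs[name.to_s] = klass
  end

  # Look a plugin up by name; if unknown, try to `require` the file that
  # is expected to register it (assumed naming: mini_plugin/input/<name>).
  def self.input(name)
    key = name.to_s
    unless @inputs.key?(key)
      begin
        require "mini_plugin/input/#{key}"
      rescue LoadError
        raise ArgumentError, "input plugin '#{key}' is not installed"
      end
    end
    @inputs.fetch(key) do
      raise ArgumentError, "input plugin '#{key}' did not register itself"
    end
  end
end

# What a plugin gem's file would contain: define a class, register it.
class ExampleInput
  MiniPlugin.register_input("example", self)
end

puts MiniPlugin.input("example")
```

With this shape, installing a gem that ships a file at the agreed path is enough to make the plugin discoverable: `require` searches installed gems, so the host needs no class-loader configuration.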
JRuby in Japan
Not so many users :(
CRuby is hugely popular in Japan
Java -> Ruby -> Scala? Golang?
Make your software pluggable.
Build an ecosystem & community.
with JRuby!
Thanks!
JRuby with Java Code in Data Processing World

  • 1. JRuby with Java Code in Data Processing World JRubyConf.EU at 31 Jul 2015 Satoshi Tagomori (@tagomoris)
  • 2. Satoshi "Moris" Tagomori (@tagomoris) Fluentd, Norikra, MessagePack-Ruby,... Docker logging driver for Fluentd (docker v1.8) Treasure Data, Inc.
  • 3. https://jobs.lever.co/treasure-data We're hiring! OSS team (developer / community manager) Distributed system engineer (Hadoop, queue/workers) Front-end engineer (RoR)
  • 7. Data Processing World Hadoop, Spark, Tez, Flink, Storm, Kafka, ... Hive, Pig, Drill, Impala, Presto, ....
  • 8. Java + Scala, Clojure + C++, .... Data Processing World on JVM
  • 9. Data Processing World Many CPU cores, Large memory, High rate Disk I/O, ... High throughput data processing Hadoop YARN/MapReduce/HDFS API compatibility
  • 10. Two OSS using Java&JRuby
  • 11. Norikra: Stream Processing with SQL for everybody Server software, written in JRuby, runs on JVM Open source software (GPLv2) http://norikra.github.io/ https://github.com/norikra/norikra Distributed on rubygems.org "gem i norikra"
  • 12. What Norikra does: SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC
  • 13. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":200, "bytes":300, "duration":0.03, "referer":"...", "user-agent":"...." path:"/", s:301 1
  • 14. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/download/a", "status":200, "bytes":10240, "duration":0.53, "referer":"...", "user-agent":"...." path:"/", s:301 path:"/download/a", s:10240 2
  • 15. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":404, "bytes":0, "duration":0.08, "referer":"...", "user-agent":"...." path:"/", s:301 path:"/download/a", s:10240 3
  • 16. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":200, "bytes":301, "duration":0.01, "referer":"...", "user-agent":"...." path:"/", s:602 path:"/download/a", s:10240 4
  • 17. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/download/b", "status":200, "bytes":678, "duration":0.11, "referer":"...", "user-agent":"...." path:"/", s:602 path:"/download/a", s:10240 path:"/download/b", s:678 5
  • 18. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/download/b", "status":200, "bytes":678, "duration":0.13, "referer":"...", "user-agent":"...." path:"/", s:602 path:"/download/a", s:10240 path:"/download/b", s:1356 6
  • 19. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":200, "bytes":301, "duration":0.02, "referer":"...", "user-agent":"...." path:"/", s:903 path:"/download/a", s:10240 path:"/download/b", s:1356 7
  • 20. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":200, "bytes":301, "duration":0.09, "referer":"...", "user-agent":"...." path:"/", s:1204 path:"/download/a", s:10240 path:"/download/b", s:1356 8
  • 21. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/download/a", "status":200, "bytes":10240, "duration":1.1, "referer":"...", "user-agent":"...." path:"/", s:1204 path:"/download/a", s:20480 path:"/download/b", s:1356 9
  • 22. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC {"path":"/", "status":200, "bytes":301, "duration":0.05, "referer":"...", "user-agent":"...." path:"/", s:1505 path:"/download/a", s:20480 path:"/download/b", s:1356 10
  • 23. SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10) WHERE status=200 GROUP BY path ORDER BY s DESC 10 {"path":"/download/a", "s":20480} {"path":"/", "s":1505} {"path":"/download/b", "s":1356}
  • 24. Norikra and Java Norikra is written in JRuby, and using Esper Key factor: productivity (33days until first release) Esper:Java library, provides Complex Event Processing SQL parser, executor Many features and good performance Licensed under GPLv2
  • 25. Plugins as rubygems Norikra Server (on JVM) Esper (Query Engine) Type Definition Manager Output Event Pool Norikra Engine RPC Server mizuno (Jetty + Rack) Rack RPC Handler Listener UDF UDF User-Defined Functions "gem i norikra-udf-xxx" written in Java, or JRuby (compiled to Java) works in Esper instance: must be a Java class Listener handler for output data of queries, written in JRuby "gem i norikra-listener-xxx"
  • 26. Embulk "Embulk is a open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services." http://www.embulk.org/docs/
  • 27. Embulk: makes painful data integration work relaxed Plugin-based parallel bulk data loader Open source software (Apache License v2.0) http://www.embulk.org/ https://github.com/embulk/embulk Distributed as .jar or on rubygems.org Plugins are on rubygems.org http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed http://www.slideshare.net/HiroshiNakamura/embulk-20150411
  • 28. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotent retrying Plugins Plugins bulk load
  • 29. Diagram: config; InputPlugin (input, …) → records → Filter plugins (convert, …) → records → OutputPlugin (output, …); Executor plugin (Threads, MapReduce).
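The input, filter, and output roles on this slide can be sketched in plain Ruby. This is an illustration under stated assumptions, not Embulk's API: `RangeInput`, `EvenOnlyFilter`, `AddLabelFilter`, `ArrayOutput`, and `run_pipeline` are all hypothetical names, and the executor (threads or MapReduce) is left out so that the record flow stays visible.

```ruby
# Hypothetical input: produces records and hands them to the pipeline.
class RangeInput
  def initialize(count)
    @count = count
  end

  def each_record
    @count.times { |i| yield({ 'id' => i, 'value' => i * 10 }) }
  end
end

# Hypothetical filter: returns the record to keep it, or nil to drop it.
class EvenOnlyFilter
  def call(record)
    record['id'].even? ? record : nil
  end
end

# Hypothetical filter: returns a copy of the record with one extra column.
class AddLabelFilter
  def call(record)
    record.merge('label' => "row-#{record['id']}")
  end
end

# Hypothetical output: collects whatever reaches the end of the pipeline.
class ArrayOutput
  attr_reader :records

  def initialize
    @records = []
  end

  def add(record)
    @records << record
  end
end

# Pass each input record through the filters in order; stop at the first
# filter that drops it, otherwise hand the result to the output.
def run_pipeline(input, filters, output)
  input.each_record do |record|
    filters.each do |filter|
      record = filter.call(record)
      break if record.nil?
    end
    output.add(record) unless record.nil?
  end
  output
end

output = run_pipeline(
  RangeInput.new(5),
  [EvenOnlyFilter.new, AddLabelFilter.new],
  ArrayOutput.new
)
output.records.each { |r| puts r.inspect }
```

Because each role only needs to respond to one method here, swapping an input, filter, or output does not touch the others, which is the property the plugin diagram relies on.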
  • 30. Diagram: config; InputPlugin = FileInput plugin (HDFS, S3, Riak CS, …) → buffer → Decoder plugin (gzip, bzip2, aes, …) → buffer → Parser plugin (CSV, JSON, pcap, …); records → Filter plugins → records; OutputPlugin = Formatter plugin → buffer → Encoder plugin → buffer → FileOutput plugin; Executor plugin.
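The file-input side of this slide (raw bytes, then a decoder, then a parser) can be sketched with Ruby's standard library. The function names `decode_gzip` and `parse_json_lines` are hypothetical and are not Embulk plugin interfaces; gzip and newline-delimited JSON are picked from the formats listed on the slide only because the standard library covers them.

```ruby
require 'zlib'
require 'json'

# Decoder stage (hypothetical): compressed bytes in, plain text out.
def decode_gzip(bytes)
  Zlib.gunzip(bytes).force_encoding(Encoding::UTF_8)
end

# Parser stage (hypothetical): newline-delimited JSON text in, records out.
def parse_json_lines(text)
  text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# Stand-in for what a file-input stage would read from HDFS, S3, and so on:
# a gzip-compressed buffer built in memory for this example.
lines = [
  '{"path":"/","status":200,"bytes":301}',
  '{"path":"/download/a","status":200,"bytes":10240}'
]
raw_file = Zlib.gzip(lines.join("\n") + "\n")

records = parse_json_lines(decode_gzip(raw_file))
records.each { |r| puts r.inspect }
```

Keeping the decoder and parser as separate stages means a new compression format or a new file format each needs only one new piece, rather than one plugin per combination.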
  • 31. Embulk and Java Embulk core is written in Java, mainly for performance. Embulk plugins are loaded through a JRuby-based API and are written in JRuby or Java: JRuby for early release, Java for performance.
  • 32. InputPlugin module Embulk class InputExample < InputPlugin Plugin.register_input('example', self) def self.transaction(config, &control) # read config task = { 'message' => config.param('message', :string, default: nil) } threads = config.param('threads', :int, default: 2) columns = [ Column.new(0, 'col0', :long), Column.new(1, 'col1', :double), Column.new(2, 'col2', :string), ] # BEGIN here commit_reports = yield(task, columns, threads) # COMMIT here puts "Example input finished" return {} end def run puts "Example input thread #{@index}…" 10.times do |i| @page_builder.add([i, 10.0, "example"]) end @page_builder.finish commit_report = { } return commit_report end end end
  • 33. OutputPlugin module Embulk class OutputExample < OutputPlugin Plugin.register_output('example', self) def self.transaction( config, schema, processor_count, &control) # read config task = { 'message' => config.param('message', :string, default: "record") } puts "Example output started." commit_reports = yield(task) puts "Example output finished. Commit reports = #{commit_reports.to_json}" return {} end def initialize(task, schema, index) puts "Example output thread #{index}..." super @message = task.prop('message', :string) @records = 0 end def add(page) page.each do |record| hash = Hash[schema.names.zip(record)] puts "#{@message}: #{hash.to_json}" @records += 1 end end def finish end def abort end def commit commit_report = { "records" => @records } return commit_report end end end
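Both plugin examples share one control-flow pattern: `transaction` prepares a task, yields to a block that runs every task, and receives the commit reports back. The sketch below imitates that pattern without Embulk. `CountingOutput` and `run_bulk_load` are hypothetical, the runner is only a guess at the shape of what the core does, and a "page" is simplified to an array of records.

```ruby
# Hypothetical output plugin with the same method shape as the slide's example.
class CountingOutput
  # Called once per bulk load. The block runs all tasks and returns their
  # commit reports; code before the yield is "BEGIN", code after is "COMMIT".
  def self.transaction(config)
    task = { 'message' => config.fetch('message', 'record') }
    commit_reports = yield(task)
    { 'total' => commit_reports.sum { |report| report['records'] } }
  end

  def initialize(task, index)
    @task = task
    @index = index
    @records = 0
  end

  # A page is modelled here as a plain array of records.
  def add(page)
    page.each { |_record| @records += 1 }
  end

  def commit
    { 'index' => @index, 'records' => @records }
  end
end

# Hypothetical runner standing in for the core: one thread per task, each with
# its own plugin instance, and one commit report collected per task.
def run_bulk_load(plugin_class, config, pages_per_task)
  plugin_class.transaction(config) do |task|
    threads = pages_per_task.each_with_index.map do |pages, index|
      Thread.new do
        plugin = plugin_class.new(task, index)
        pages.each { |page| plugin.add(page) }
        plugin.commit
      end
    end
    threads.map(&:value) # waits for each thread and keeps task order
  end
end

pages_per_task = [
  [[['a'], ['b']], [['c']]], # task 0: two pages, three records
  [[['d'], ['e']]]           # task 1: one page, two records
]
summary = run_bulk_load(CountingOutput, { 'message' => 'demo' }, pages_per_task)
puts summary.inspect
```

Each task gets its own plugin instance, so per-task state such as `@records` needs no locking; only the commit reports cross thread boundaries.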
  • 34. Plugin management: Norikra Diagram labels: Esper instance, Engine, Plugin management, UDF, Listener (legend: Java, JRuby). Plugins as gems; plugin loader written in JRuby.
  • 35. Plugin management: Embulk Diagram labels: Embulk core, Plugin management, input/output/filter, parser/formatter, decoder/encoder, file-input/output, executor (legend: Java, JRuby). Plugins as gems; plugin loader written in JRuby.
  • 36. Pluggable software on JVM & Java API Java? Scala? Clojure? JRuby?: JRuby Plugin packaging: jar? gem?: gem rubygems.org >>> maven central (or others) especially for plugin authors Plugin loader: Class Loader? "require"?: require
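A `require`-based plugin loader can be very small. The sketch below is an assumption about the general technique, not the loader used by Norikra or Embulk: the `TinyPlugin` module, its registry, and the `tiny_plugin/<type>/<name>` naming convention are all hypothetical. A temporary directory added to `$LOAD_PATH` stands in for an installed plugin gem.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical plugin registry and loader.
module TinyPlugin
  REGISTRY = {}

  # Plugin files call this when they are required.
  def self.register(type, name, klass)
    REGISTRY[[type, name]] = klass
  end

  # Find a plugin class by type and name, requiring its file on first use.
  # The file "tiny_plugin/<type>/<name>.rb" must be reachable via $LOAD_PATH.
  def self.lookup(type, name)
    key = [type, name]
    require "tiny_plugin/#{type}/#{name}" unless REGISTRY.key?(key)
    REGISTRY.fetch(key)
  end
end

# Simulate an installed plugin gem by writing its lib/ tree to a temp directory.
gem_lib = Dir.mktmpdir('tiny_plugin_gem')
FileUtils.mkdir_p(File.join(gem_lib, 'tiny_plugin', 'output'))
File.write(File.join(gem_lib, 'tiny_plugin', 'output', 'upcase.rb'), <<~RUBY)
  class UpcaseOutput
    def add(record)
      record.upcase
    end
  end
  TinyPlugin.register('output', 'upcase', UpcaseOutput)
RUBY
$LOAD_PATH.unshift(gem_lib)

plugin = TinyPlugin.lookup('output', 'upcase').new
puts plugin.add('hello')
```

The loader never needs a list of known plugins: the name in the configuration maps to a file path, and the required file registers itself. That is the convenience the slide contrasts with writing a custom Class Loader.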
  • 37. JRuby in Japan Not so many users :( CRuby is super major software in Japan Java -> Ruby -> Scala? Golang?
  • 38. Make your software pluggable. Make eco-system&community. with JRuby! Thanks!