Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink

Miguel Pérez Colino // @mmmmmmpc
CLOUD OPERATIONS WITH STREAMING
ANALYTICS USING BIG DATA TOOLS
DataWorks Summit Sydney 2017
Miguel Pérez Colino
Senior Design Product Manager, ISBU - Red Hat
miguel@redhat.com / @mmmmmmpc
Suneel Marthi
Senior Principal Software Engineer - Red Hat
smarthi@redhat.com / @suneelmarthi

THE PROBLEM

Cloud Deployments
Act as one single thing …
… and need to be managed and operated as one
Source: https://commons.wikimedia.org/wiki/File:Auklet_flock_Shumagins_1986.jpg

Cloud Deployments
They do really scale ...
https://www.cncf.io/blog/2016/08/23/deploying-1000-nodes-of-openshift-on-the-cncf-cluster-part-1/
● Higher scalability
● More workloads per physical
machine (multi-tenant)
● Network and Storage also
Software Defined
● Containers and
Microservices providing
more granularity

THE CHALLENGE

Questions to solve
● Who is the user?
● What is their problem?
● How do other people solve this problem?
● How can we better solve the problem?
● What would the end result look/feel like?

[DESIGN THINKING]
THE BEST WAY TO HAVE A GOOD
IDEA IS TO HAVE LOTS OF IDEAS.

Who is the user? (Personas)
● Cloud Ops
● Developer
● Security Ops
● Monitoring
● Service Designer
● Marketing
● IT Manager
● Infrastructure Architect?
Customer’s issues are mostly
“Day 2” → Operations
● Operate OpenStack
● Operate OpenShift
○ Platform Ops
○ Developer logs
Logs → root cause analysis + forensics

Logs
Config
Telemetry
App debug info
Events
Monitoring: Provides Events, Consumes Logs
Cloud Ops: Root Cause Analysis
Developer: App Analysis & Debug
Security Engineer: Sec Analysis, Audits
Marketing: Access to stats
Service Designer / IT Manager: Access to aggregated data, e.g. SLA, usage
Personae

What are their problems?
● Data aggregation
○ Ingestion
○ Transport
● Data Model → Common Data Model
● Correlation
○ With external sources (Events / Metrics / Config …)
○ Add more Information types to the solution
● Coherency (Data format and Enrichment)

Data (What)
Data + Information flow in Log Aggregation
Generate → Ingest → Collect → Process → Store → Query → View
Derived from: http://www.dataintensive.info/

Personae (Who)
That can use Log Aggregation
Log Aggregation
Monitoring: Provides Events, Consumes Logs
Cloud Ops: Root Cause Analysis
Developer: App Analysis & Debug
Security Engineer
User / Marketing: Access to stats
Service Designer / IT Manager: Access to aggregated data, e.g. SLA, usage

Personae (Motivation)
That need Log Aggregation
Cloud Ops (Apps)
“I want to proactively know about active or potential degradation of service”
Cloud Ops (OpenStack)
“User reports that their VM request failed and returned error”
Developer (OpenShift)
“My recent commit resulted in Jenkins test failure”
Cloud Suite User
“Application (multi-tiered) launched from CloudForms returns error”

Situational Awareness (Why)
Or the need for it!
Source: https://en.wikipedia.org/wiki/Situation_awareness

THE SOLUTION

Focus on One Persona and Use Case
“Oscar the OpenStack Operator”
Log Aggregation
Monitoring: Provides Events, Consumes Logs
Cloud Ops: Root Cause Analysis
Developer: App Analysis & Debug
Security Engineer
User / Marketing: Access to stats
Service Designer / IT Manager: Access to aggregated data, e.g. SLA, usage

Prototyped User Experience
Creating User Interface Mockups

Implementation
Red Hat’s containerized solution with EFK stack
Elasticsearch / Fluentd / Kibana
Create → Ingest → Collect → Process → Store → Query → View

Implementation
KEEDIO’s containerized solution with a Big Data toolset
Create → Ingest → Collect → Process → Store → Query → View
Create: Rsyslog
Ingest: Flume / NiFi
Collect: Kafka
Process: Spark / Flink
Store + Query: SOLR / Cassandra, HDFS (tier 2)
View: PatternFly

Implementation: Generation
Rsyslog
What?
● Open-source software used for
forwarding log messages in a network.
● Implements the syslog protocol
Why?
● Fast system for log processing.
● High performance, Low footprint,
included in the OS
● Inputs from a wide variety of sources
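
A minimal sketch of how the syslog priority prefix can be decoded, assuming the standard encoding PRI = facility × 8 + severity. The function names and the sample message are hypothetical and are not taken from the talk.

```python
import re
from typing import Tuple

# Severity names indexed by their numeric syslog severity (0 = most severe).
SEVERITY_NAMES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

PRI_PATTERN = re.compile(r"<(\d{1,3})>")


def decode_pri(pri: int) -> Tuple[int, int]:
    """Split a syslog PRI value into (facility, severity)."""
    if not 0 <= pri <= 191:
        raise ValueError("PRI must be between 0 and 191")
    return divmod(pri, 8)


def parse_pri(message: str) -> Tuple[int, str, str]:
    """Return (facility, severity name, rest of message) for a raw syslog line."""
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with <PRI>")
    facility, severity = decode_pri(int(match.group(1)))
    return facility, SEVERITY_NAMES[severity], message[match.end():]


# Hypothetical sample line: facility 23 (local7), severity 3 (err).
SAMPLE = "<187>nova: Instance failed network setup"
print(parse_pri(SAMPLE))
```

Downstream stages can then filter on the decoded severity instead of re-parsing the raw line.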

Implementation: Ingestion
Apache NiFi
What?
● Reliable system to process and
distribute data
● Language: Java
Why?
● Graphical management
● Clusterable
● Data Provenance
● Many sources and destinations

Use Case: Ingestion
Apache NiFi
Easily customize “tagging” and processing
rules via Graphical User Interface
Review steps with data provenance
“Like having an IDE and a Debugger for
data processing rules.”
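
The idea of "tagging" rules can be sketched in plain Python. This is only an illustration of regex-based tagging, not NiFi's API or configuration (in NiFi the rules are built in the graphical interface); the rule list below is an assumption made for the example.

```python
import re

# Hypothetical tagging rules: (tag, pattern). Not taken from the talk.
RULES = [
    ("nova", re.compile(r"\bnova\.")),
    ("neutron", re.compile(r"\bneutron\.")),
    ("error", re.compile(r"\bERROR\b")),
]


def tag_line(line):
    """Return every tag whose pattern matches the log line, in rule order."""
    return [tag for tag, pattern in RULES if pattern.search(line)]


print(tag_line("2016-12-05 10:28:14.292 10253 ERROR nova.compute.manager ConnectTimeout"))
print(tag_line("2016-12-05 10:28:16.878 13187 INFO neutron.wsgi POST /v2.0/ports.json"))
```

Keeping the rules as data rather than code mirrors the point of the slide: operators can change them without redeploying anything.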

Implementation: Collect
Apache Kafka
What?
● Open-source distributed messaging
system
● Languages: Java & Scala
Why?
● High throughput and low-latency
● Clusterable, with load balancing and asynchronous sends
● Allows handling real-time data feeds
● Customizable data retention on disk
● Enables multiple consumers on the
same data
● “Rewind and Replay”
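
"Multiple consumers on the same data" and "Rewind and Replay" both follow from keeping records in an append-only log with one read offset per consumer group. The toy model below is an in-memory stand-in written for illustration; it is not the Kafka client API, and the class and method names are assumptions.

```python
class TopicLog:
    """Toy in-memory stand-in for a single topic partition."""

    def __init__(self):
        self._records = []   # append-only list of records
        self._offsets = {}   # consumer group -> next offset to read

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return the next batch for a group and advance only that group's offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or fast-forward) one group without affecting the others."""
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._offsets[group] = offset


log = TopicLog()
for line in ["nova error", "neutron info", "nova info"]:
    log.append(line)

print(log.poll("alerts"))    # the first group reads everything
print(log.poll("storage"))   # the second group reads the same records independently
log.seek("alerts", 1)        # rewind the first group ...
print(log.poll("alerts"))    # ... and replay from offset 1
```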

Implementation: Process
Apache Flink
What?
● Open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming apps.
● Languages: Java, Scala
Why?
● Streaming-first, continuous processing
● Fault-tolerant, stateful computations
● Scalable and performant: high throughput, low latency
● Advanced filtering capabilities (CEP)

Use Case: Collect + Process
Apache Kafka + Flink
● Long retention periods in the queue let new post-processing targets consume previous events
● Only the right info is sent to the right target
● Detect anomalies and trigger alerts

● Different storage targets receive filtered, post-processed output

● Alerts are sent to Kafka. A listener can enable all kinds of alerts
Alert Listener → Telegram, E-Mail

Implementation: Store + Query
Apache Cassandra
What?
● Open-source NoSQL database, <key, value> based
● Language: Java
Why?
● Fault tolerant
● Decentralized & scalable
● Proven and highly performant
● Flexible data model

Implementation: View
PatternFly
What?
● Open-source responsive framework for frontends
● Languages and frameworks: JavaScript, Bootstrap, AngularJS 1
Why?
● Easy to implement new interfaces
● Includes capabilities for graphs (D3.js + C3.js)
● Natively responsive (mobile / tablet)
● Well supported and extended (used in most Red Hat products)

Implementation
Infrastructure

Deployment

Deployment: View
PatternFly

USE CASE EXAMPLE (CEP)

Use Case: OpenStack Timeouts
Network timeout is 30 seconds by default
1. Request of VM
2. Request of vPort (Virtual NIC)
3. vPort generated in more than 30 secs → Timeout!
4. Error generating VM
5. No error generating vPort
Correlation across both services is needed to detect this
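
The five steps above can be sketched as a correlation rule: a Nova port timeout paired with a Neutron port creation that succeeded but took longer than the timeout. This is a plain-Python stand-in, not Flink CEP code; the event field names and the 10-second correlation window are assumptions made for the example.

```python
from datetime import datetime, timedelta

TIMEOUT_SECS = 30.0              # default network timeout from the slide
WINDOW = timedelta(seconds=10)   # assumed correlation window


def correlate(events):
    """Pair each Nova port timeout with a slow but successful Neutron port creation."""
    nova_errors = [e for e in events
                   if e["service"] == "nova" and e["kind"] == "port_timeout"]
    slow_ports = [e for e in events
                  if e["service"] == "neutron" and e["kind"] == "port_created"
                  and e["duration"] > TIMEOUT_SECS]
    alerts = []
    for err in nova_errors:
        for port in slow_ports:
            # Only correlate events that happened close together in time.
            if abs(port["time"] - err["time"]) <= WINDOW:
                alerts.append((err, port))
    return alerts


# Hypothetical, already-parsed events.
EVENTS = [
    {"service": "nova", "kind": "port_timeout",
     "time": datetime(2016, 12, 5, 10, 28, 14)},
    {"service": "neutron", "kind": "port_created", "duration": 32.6,
     "time": datetime(2016, 12, 5, 10, 28, 16)},
    {"service": "neutron", "kind": "port_created", "duration": 1.2,
     "time": datetime(2016, 12, 5, 10, 28, 17)},
]

for nova_event, neutron_event in correlate(EVENTS):
    print("ALERT:", nova_event["kind"], "+ slow", neutron_event["kind"])
```

Neither event is alarming in isolation from Neutron's point of view, which is why the rule needs both streams.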

What we see ...
Error in Nova
2016-12-05 10:28:14.292 10253 ERROR nova.compute.manager [req-190de497-d90f-48e0-91ea-f1f1c0877704 688ae4039aad471fbab98da1b1e1fcb6 e21be8c7ab34490386508bbd0c58f511 - - -] Instance failed network setup after 1 attempt(s)
2016-12-05 10:28:14.292 10253 ERROR nova.compute.manager ConnectTimeout: Request to https://[::1]:9696/v2.0/ports.json timed out
Info in Neutron
2016-12-05 10:28:16.878 13187 INFO neutron.wsgi [req-827495e1-2ae2-41c1-b51b-2eda57f4ba1d 688ae4039aad471fbab98da1b1e1fcb6 e21be8c7ab34490386508bbd0c58f511 - - -] ::1 - - [05/Dec/2016 10:28:16] "POST /v2.0/ports.json HTTP/1.1" 201 900 32.589028
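
A minimal sketch of pulling the interesting fields out of the Neutron line. It assumes the three numbers after the quoted request are the HTTP status, the response length, and the elapsed time in seconds; the function names are hypothetical.

```python
import re

# Matches the quoted request followed by status, length and elapsed seconds.
WSGI_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"\s+'
    r'(?P<status>\d{3})\s+(?P<length>\d+)\s+(?P<seconds>\d+\.\d+)'
)

TIMEOUT_SECS = 30.0  # default network timeout mentioned in the talk

NEUTRON_LINE = (
    '2016-12-05 10:28:16.878 13187 INFO neutron.wsgi '
    '[req-827495e1-2ae2-41c1-b51b-2eda57f4ba1d - - -] ::1 - - '
    '[05/Dec/2016 10:28:16] "POST /v2.0/ports.json HTTP/1.1" 201 900 32.589028'
)


def parse_wsgi(line):
    """Return method, path, status, length and seconds, or None if the line does not match."""
    match = WSGI_PATTERN.search(line)
    if match is None:
        return None
    return {
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "length": int(match.group("length")),
        "seconds": float(match.group("seconds")),
    }


def is_slow_success(request):
    """True for a 2xx response that took longer than the timeout."""
    return 200 <= request["status"] < 300 and request["seconds"] > TIMEOUT_SECS


request = parse_wsgi(NEUTRON_LINE)
print(request, is_slow_success(request))
```

The request succeeded with status 201, yet it took more than 30 seconds, so Nova had already given up.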

Both lines are detected and correlated, and an alert is generated. → Alert sent to Kafka
ErrorAlert:
Nova-3-2017-04-28 12:48:20.321
Neutron-6-2017-04-28 12:48:23.123
{"severity":"3","body":"[ Generating synthetic log CEP_ID=67c8c1cc3d48c3987aee13dce5cf35a1]","spriority":"191","hostname":"overcloud-compute-1","protocol":"TCP","port":"7790","sender":"/192.168.1.16","service":"Nova","id":"c1318482-11a1-41cd-949e-5195c54767e5","facility":"23","timestamp":"2017-04-28 12:48:20.321"}
{"severity":"6","body":"[ Generating synthetic log CEP_ID=67c8c1cc3d48c3987aee13dce5cf35a1]","spriority":"191","hostname":"overcloud-controller-1","protocol":"TCP","port":"7793","sender":"/192.168.1.13","service":"Neutron","id":"e617d049-7e40-4114-8727-c6c41140567e","facility":"23","timestamp":"2017-04-28 12:48:23.123"}
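
A listener consuming these alerts could group the JSON events by their shared CEP_ID before notifying anyone. The sketch below uses only the standard library and trimmed copies of the two events above; the function name and the choice of fields to keep are assumptions.

```python
import json
import re
from collections import defaultdict

CEP_ID_PATTERN = re.compile(r"CEP_ID=([0-9a-f]+)")

# Trimmed copies of the two alert events shown above.
SAMPLE = [
    '{"severity":"3","body":"[ Generating synthetic log '
    'CEP_ID=67c8c1cc3d48c3987aee13dce5cf35a1]","service":"Nova",'
    '"timestamp":"2017-04-28 12:48:20.321"}',
    '{"severity":"6","body":"[ Generating synthetic log '
    'CEP_ID=67c8c1cc3d48c3987aee13dce5cf35a1]","service":"Neutron",'
    '"timestamp":"2017-04-28 12:48:23.123"}',
]


def group_by_cep_id(json_lines):
    """Map each CEP_ID to the (service, severity, timestamp) of its events."""
    groups = defaultdict(list)
    for raw in json_lines:
        event = json.loads(raw)
        match = CEP_ID_PATTERN.search(event.get("body", ""))
        if match:  # events without a CEP_ID are ignored
            groups[match.group(1)].append(
                (event["service"], event["severity"], event["timestamp"])
            )
    return dict(groups)


for cep_id, events in group_by_cep_id(SAMPLE).items():
    print(cep_id, events)
```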

Both lines are detected and correlated, and an alert is generated. → Alert routed to Telegram

THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews

BACKUP SLIDES
