Swift Distributed Tracing Method and Tools
by Zhang Hua (Edward)
Standards Team/ETI/CDL/IBM
Agenda
• Background
• Tracing Proposal
• Tracing Architecture
• Tracing Data Model
• Tracing Analysis Tools
• Reference
Background
• Swift is a large-scale distributed object store spanning thousands of nodes
across multiple zones and regions.
– End-to-end performance is critical to the success of Swift.
– Tools that help in understanding system behavior and reasoning about performance issues are
invaluable.
• Motivation
– For a particular client request X, what is the actual route as it is served by
different services? Does the actual route differ from the expected route, even when we
know the access patterns?
– How do the server components and third-party services perform?
Which part is slower than expected?
– How can we quickly diagnose the problem when a request breaks at some point?
e.g. PUT request X: Client (1) → X → Proxy-Server (1) → X' → Container-Server (1, 2, 3)
Container-Server (1) → X1'' → Account-Server (1)
Container-Server (2) → X2'' → Account-Server (2)
Container-Server (3) → X3'' → Account-Server (3)
Which part is slow? Looking at your logs?
When a request is made to Swift, it is given a unique transaction id. This id should appear
in every log line related to that request, which is useful when looking at all
the services hit by a single request. But is it efficient or handy to do?
Correlate the logs
Proxy server log @ node-P
Container server log @ node-C
Account server log @ node-A
Object server log @ node-O
Correlate the information pieces by transaction id and client IP from all logs of related hashed nodes!
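The manual correlation described above can be sketched as follows. This is a minimal illustration, assuming each log line carries the transaction id as a token that starts with "tx"; the regular expression, the node names and the sample log lines are hypothetical, and real Swift log formats depend on configuration.

```python
import re
from collections import defaultdict

# Assumption: transaction ids appear as "tx" followed by hex digits or dashes.
TXN_RE = re.compile(r"\btx[0-9a-f-]+\b")


def correlate(logs_by_node):
    """Group log lines from several nodes by the transaction id they mention."""
    by_txn = defaultdict(list)
    for node, lines in logs_by_node.items():
        for line in lines:
            match = TXN_RE.search(line)
            if match:
                by_txn[match.group(0)].append((node, line))
    return dict(by_txn)


# Hypothetical sample lines, one list per node.
sample_logs = {
    "node-P": ["proxy-server: PUT /v1/AUTH_test/c 201 txabc123"],
    "node-C": ["container-server: PUT /sdb1/347/AUTH_test/c 201 txabc123"],
    "node-A": ["account-server: PUT /sdb1/12/AUTH_test 201 txabc123",
               "account-server: GET /sdb1/12/AUTH_test 200 txdef456"],
}
correlated = correlate(sample_logs)
```

Even in this toy form, every node's log has to be fetched and scanned before a single request can be followed, and the result says nothing about which call caused which.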
StatsD Metrics
• Counters + Counter_rate (sampling)
– Proxy-Server.{ACO}.{METHOD}.{CODE}
– {ACO}-server.{METHOD}.{CODE}
• Timers + Timer_data
– {ACO}-{DAEMON}.timing
– {ACO}-{DAEMON}.error.timing
– {ACO}-server.{METHOD}.timing
StatsD logging options:
# access_log_statsd_host = localhost
# access_log_statsd_port = 8125
# access_log_statsd_default_sample_rate = 1.0
# access_log_statsd_sample_rate_factor = 1.0
# access_log_statsd_metric_prefix =
# access_log_headers = false
# log_statsd_valid_http_methods = GET,HEAD,POST,PUT,DELETE,COPY,OPTIONS
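StatsD counters and timers travel as short plain-text datagrams over UDP. The sketch below shows the general StatsD wire format for one counter and one timer; the concrete metric names are illustrative instances of the patterns above, and the helper function names are assumptions made for this example, not Swift code.

```python
import socket


def format_counter(name, sample_rate=1.0):
    """Build a StatsD counter increment, e.g. 'name:1|c' or 'name:1|c|@0.5'."""
    payload = "%s:1|c" % name
    if sample_rate < 1.0:
        payload += "|@%s" % sample_rate
    return payload


def format_timer(name, millis):
    """Build a StatsD timing sample in milliseconds, e.g. 'name:12.5|ms'."""
    return "%s:%s|ms" % (name, millis)


def send(payload, host="localhost", port=8125):
    """Fire-and-forget UDP send; no reply is expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("utf-8"), (host, port))
    finally:
        sock.close()


counter = format_counter("proxy-server.object.PUT.201", sample_rate=0.5)
timer = format_timer("object-server.PUT.timing", 12.5)
```

Each datagram describes one aggregate metric and carries no request identity, which is why these metrics cannot be tied back to an individual request.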
Pros and cons of the current implementation
StatsD pros
• Real-time performance metrics to monitor the health of the Swift cluster
• Low performance impact: metrics are sent over UDP, with no hit on local disk I/O
• Supported by different backends for reporting and visualization
StatsD cons
• Designed for cluster-level health, not for end-to-end performance
• Cannot provide metrics for a specific set of requests
• No relationship between the different sets of metrics for specific transactions or requests
Logging pros
• Lightweight
• Simple to use
• Rich logging tools
Logging cons
• Not designed for real time
• Requires more effort to collect and analyze
• No representation of an individual span
• Message size limitation
• Rethink it
Can we provide a real-time, end-to-end performance tracing/tracking tool in the Swift
infrastructure that helps developers and users with their analysis in development and
operations environments?
Our Proposal
• Goal
– Give researchers, developers and admins a method of traceability to
understand end-to-end performance issues and identify bottlenecks.
• Scope
– Add WSGI middleware and hooks to Swift components to collect trace data
   – The middleware controls the activation and generation of traces
   – Generate trace and span ids, collect the data and tie them together
   – Send trace data to an aggregator and save it in a repository
– Minor fixes to the current Swift implementation so that the path includes all hops
   – Similar to trans-id, the trace-id and span-id need to be propagated correctly through HTTP headers b/w
services and components.
– Analysis tools for reporting and visualization
   – Query the trace data by trace id
   – Reconstruct the span tree for each trace
Swift Messaging Route
Create a new container: PUT /account/container
• Swift components talk via HTTP request and response messages.
• It is easy to use HTTP headers as the clue to trace the route.
[Figure: message route among Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers. Labelled messages: Request-X PUT / Response-X PUT, Request-X' GET / Response-X' GET, Request-X'' PUT / Response-X'' PUT, Request-X''' PUT / Response-X''' PUT.]
Span Tree of Trace
Create a new container: PUT /account/container
• X-Trace-Id: identification of each trace
– Use X-Trans-Id to support different clusters?
– Or generate a new id for this purpose?
• X-Span-Id: identification of each span, representing an individual
HTTP RESTful call and WSGI call.
– Generate a new span id for this purpose
(note: a UUID can be used for the implementation)
[Figure: message route among Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers, annotated with trace headers. Request-X PUT carries X-Trace-Id: 1234; Request-X'' PUT carries X-Trace-Id: 1234 and X-Span-Id: 1; Request-X''' PUT carries X-Trace-Id: 1234 and X-Span-Id: 2. Other labelled messages: Response-X PUT, Response-X'' PUT, Response-X''' PUT, Request-X' GET / Response-X' GET.]
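Following the note that a UUID can be used, id generation might look like the sketch below. It assumes the option of generating a new trace id rather than reusing X-Trans-Id; the function names are hypothetical, and headers are modeled as a plain dict.

```python
import uuid


def new_id():
    """Return a 32-character hex id derived from a random UUID."""
    return uuid.uuid4().hex


def start_trace(headers):
    """Reuse the incoming X-Trace-Id if present, otherwise start a new trace.

    A fresh span id is always allocated for the current call.
    """
    trace_id = headers.get("X-Trace-Id") or new_id()
    span_id = new_id()
    return trace_id, span_id


# First hop: no trace header yet, so a new trace starts.
trace_id, span_id = start_trace({})
# Later hop: the trace id is carried over, the span id is new.
same_trace, child_span = start_trace({"X-Trace-Id": trace_id})
```

Random UUIDs avoid any coordination between nodes, at the cost of longer ids than the small integers used in the slide.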
X-trace Middleware Architecture
1. Generate trace ids based on configuration.
2. Create spans and collect trace data
3. Propagate trace ids to next hop
4. Send trace data into a repository via
separate transport protocol/channel
[Figure: Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers, with x-trace middleware boxes and a trace data repository.]
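The four steps above can be sketched as a small WSGI filter. This is an illustrative outline under stated assumptions, not the actual x-trace middleware: the class name, the `report` callable and the span field layout are hypothetical, and the transport to the repository is reduced to a function call.

```python
import time
import uuid


class TraceMiddleware(object):
    """Hypothetical WSGI filter that records one span per request."""

    def __init__(self, app, service_name, report):
        self.app = app
        self.service_name = service_name
        self.report = report  # callable that ships a span dict to the repository

    def __call__(self, environ, start_response):
        # Steps 1 and 3: reuse the caller's trace id or generate one, and leave
        # the ids in the environ so later code can copy them to the next hop.
        trace_id = environ.get("HTTP_X_TRACE_ID") or uuid.uuid4().hex
        parent = environ.get("HTTP_X_SPAN_ID")
        span_id = uuid.uuid4().hex
        environ["HTTP_X_TRACE_ID"] = trace_id
        environ["HTTP_X_SPAN_ID"] = span_id
        span = {"trace_id": trace_id, "_id": span_id, "parent": parent,
                "name": environ.get("REQUEST_METHOD"),
                "endpoint": {"name": self.service_name},
                "start_time": time.time()}

        def traced_start_response(status, headers, exc_info=None):
            span["status"] = status
            return start_response(status, headers, exc_info)

        try:
            # Step 2: the span covers the wrapped application's call.
            return self.app(environ, traced_start_response)
        finally:
            # Step 4: hand the finished span to the transport.
            span["end_time"] = time.time()
            self.report(span)


def demo_app(environ, start_response):
    start_response("201 Created", [("Content-Length", "0")])
    return [b""]


collected = []
app = TraceMiddleware(demo_app, "proxy.server", collected.append)
body = app({"REQUEST_METHOD": "PUT", "PATH_INFO": "/v1/a/c"},
           lambda status, headers, exc_info=None: None)
```

In this sketch the span ends when the wrapped application returns its response iterable, not when the body has been fully sent; a fuller version would wrap the iterable as well.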
Patches to fix the request path
• The trace id is passed along by the proxy server in HTTP headers, but it is lost
at some points because a new request is created for the next hop.
• Patches are needed to fix this problem
and form a complete tracing path through the
container server, object server, etc.
[Figure: Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers, with x-trace middleware boxes and a trace data repository; annotation: propagate the trace id in the next new request.]
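The kind of fix described above amounts to copying the trace headers onto each newly built request. The following is a minimal sketch with headers modeled as plain dicts (so header names are matched case-sensitively here); the function name and the sample header values are assumptions for illustration.

```python
TRACE_HEADERS = ("X-Trace-Id", "X-Span-Id")


def propagate_trace(incoming_headers, outgoing_headers):
    """Copy trace headers from the request being served to a newly built one."""
    for name in TRACE_HEADERS:
        if name in incoming_headers:
            outgoing_headers[name] = incoming_headers[name]
    return outgoing_headers


# Hypothetical headers of the request a server is handling.
incoming = {"X-Trace-Id": "1234", "X-Span-Id": "1", "X-Trans-Id": "tx-demo"}
# Headers of the new request it creates for the next hop.
backend_request_headers = propagate_trace(incoming, {"X-Timestamp": "0"})
```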
Tie together tracing data
Reconstruct causal and temporal relationship view for PUT container call
Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Container-Server.PUT parent-span-id=1, span-id=2
Container-Server.PUT parent-span-id=1, span-id=3
Container-Server.PUT parent-span-id=1, span-id=4
Account-Server.PUT parent-span-id=2, span-id=5
Account-Server.PUT parent-span-id=3, span-id=6
Account-Server.PUT parent-span-id=4, span-id=7
[Figure: the spans drawn as bars on a timeline from 0 ms to 200 ms, with 201 response codes marked.]
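Rebuilding the causal view from flat span records only needs the parent-span-id links. A minimal sketch using the span ids listed above (the tuple layout and function names are assumptions made for this example):

```python
from collections import defaultdict

# Span records as listed above: (name, parent_span_id, span_id).
SPANS = [
    ("Swift-Client.PUT", None, 0),
    ("Proxy-Server.PUT", 0, 1),
    ("Container-Server.PUT", 1, 2),
    ("Container-Server.PUT", 1, 3),
    ("Container-Server.PUT", 1, 4),
    ("Account-Server.PUT", 2, 5),
    ("Account-Server.PUT", 3, 6),
    ("Account-Server.PUT", 4, 7),
]


def build_tree(spans):
    """Index spans by parent id and return (root_ids, children, names)."""
    children = defaultdict(list)
    names = {}
    roots = []
    for name, parent, span_id in spans:
        names[span_id] = name
        if parent is None:
            roots.append(span_id)
        else:
            children[parent].append(span_id)
    return roots, children, names


def render(span_id, children, names, depth=0):
    """Return indented lines for the subtree rooted at span_id."""
    lines = ["%s%s (span-id=%d)" % ("  " * depth, names[span_id], span_id)]
    for child in children[span_id]:
        lines.extend(render(child, children, names, depth + 1))
    return lines


roots, children, names = build_tree(SPANS)
tree_lines = render(roots[0], children, names)
print("\n".join(tree_lines))
```

The temporal view then follows by attaching start and end times to each node of this tree.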
Another example: upload an object
Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Object-Server.PUT parent-span-id=1, span-id=2
Object-Server.PUT parent-span-id=1, span-id=3
Object-Server.PUT parent-span-id=1, span-id=4
Container-Server.PUT parent-span-id=2, span-id=5
Container-Server.PUT parent-span-id=3, span-id=6
Container-Server.PUT parent-span-id=4, span-id=7
[Figure: the spans drawn as bars on a timeline from 0 ms to 200 ms, with 201 response codes marked.]
Trace into middleware of the pipeline
• Expand the trace path into the
WSGI calls b/w middleware to
get more complete trace data.
• Possible choices
– Decorators for __call__
@trace_here()
def __call__(self, environ, start_response): ...
– Hack the paste deployment package
– Profile with filters
pipeline:main
Pipeline = catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk slo dlo ratelimit crossdomain tempauth tempurl formpost staticweb container-quotas account-quotas proxy-logging proxy-serve
[Figure: Swift Client and Proxy Server with x-trace and a trace data repository; pipeline middleware shown: tempauth, cache, tempurl, dlo, slo, …]
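The decorator option could be realized roughly as below. Only the name `trace_here` comes from the slide; the body, the in-memory `TRACE_LOG` standing in for the repository, and the `DemoFilter` class are assumptions made to keep the sketch self-contained.

```python
import functools
import time

TRACE_LOG = []  # stand-in for the trace data repository


def trace_here(name=None):
    """Decorator factory that records one span per call of a WSGI __call__."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, environ, start_response):
            span = {"name": name or type(self).__name__,
                    "trace_id": environ.get("HTTP_X_TRACE_ID"),
                    "start_time": time.time()}
            try:
                return func(self, environ, start_response)
            finally:
                span["end_time"] = time.time()
                TRACE_LOG.append(span)
        return wrapper
    return decorator


class DemoFilter(object):
    """Hypothetical pass-through middleware used only to exercise the decorator."""

    def __init__(self, app):
        self.app = app

    @trace_here()
    def __call__(self, environ, start_response):
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    start_response("200 OK", [])
    return [b"ok"]


result = DemoFilter(demo_app)({"HTTP_X_TRACE_ID": "1234"}, lambda s, h: None)
```

A decorator keeps the change local to each middleware class, whereas hooking the paste deployment loader would cover every filter in the pipeline without touching their source.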
Backend trace data model
{
"_id" : "14a467a402904aee87de4028a8595493",
"endpoint" : {
"port" : "6031",
"type" : "server",
"name" : "container.server",
"ipv4" : "127.0.0.1"
},
"name" : "GET",
"parent" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
"start_time" : 1400146616.554865,
"trace_id" : "d7ff028674c5471e94b964ec37d35546",
"end_time" : 1400146616.559608,
"annotations" : [
{
"type" : "string",
"value" : "/sdb1/347/TEMPAUTH_test/summit",
"key" : "request_path",
"event" : "sr"
},
{
"type" : "string",
"value" : "200 OK",
"key" : "return_code",
"event" : "ss"
}
]
}
{
"_id" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
"endpoint" : {
"port" : "8080",
"type" : "server",
"name" : "proxy.server",
"ipv4" : "127.0.0.1"
},
"name" : "GET",
"parent" : "5602ca4010fe420c9fa56528faf711ab",
"start_time" : 1400146616.490691,
"trace_id" : "d7ff028674c5471e94b964ec37d35546",
"end_time" : 1400146616.58012,
"annotations" : [
{
"type" : "string",
"value" : "/v1/TEMPAUTH_test/summit",
"key" : "request_path",
"event" : "sr"
},
{
"type" : "string",
"value" : "200 OK",
"key" : "return_code",
"event" : "ss"
}
]
}
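The two documents above already carry enough to link and time the spans. The sketch below copies only their id and timing fields and derives the parent/child link and the durations, which come out at roughly 4.7 ms for the container server span and 89.4 ms for the proxy server span; the helper names are assumptions made for this example.

```python
# Id and timing fields copied from the two span documents above.
container_span = {
    "_id": "14a467a402904aee87de4028a8595493",
    "parent": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
    "trace_id": "d7ff028674c5471e94b964ec37d35546",
    "start_time": 1400146616.554865,
    "end_time": 1400146616.559608,
}
proxy_span = {
    "_id": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
    "parent": "5602ca4010fe420c9fa56528faf711ab",
    "trace_id": "d7ff028674c5471e94b964ec37d35546",
    "start_time": 1400146616.490691,
    "end_time": 1400146616.58012,
}


def duration_ms(span):
    """Elapsed time of a span in milliseconds."""
    return (span["end_time"] - span["start_time"]) * 1000.0


def is_child_of(child, parent):
    """True when both spans share a trace and the child points at the parent."""
    return (child["trace_id"] == parent["trace_id"]
            and child["parent"] == parent["_id"])


def nested_within(child, parent):
    """True when the child's time interval lies inside the parent's."""
    return (parent["start_time"] <= child["start_time"]
            and child["end_time"] <= parent["end_time"])
```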
Query and analysis tools
• Query
– Query trace data by trace_id or span_id, order or filter by time range, and group by node or
annotation key
• Trace timeline
– Plot the spans on the timeline with causal relationships
• Diagnose
– Analyze the critical path of a successful response
– Identify the failure point in the path
• Simulation
– Replay the recorded processing of the requests
• Data Mining
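Two of the operations above, querying by trace_id and finding a critical path, can be sketched over in-memory span dicts. The sample spans are hypothetical, with times in milliseconds, and "critical path" is taken here to mean following, from the root, the child span that finishes last at each level; that definition is an assumption of this sketch.

```python
def query_trace(spans, trace_id):
    """Return the spans of one trace ordered by start time."""
    return sorted((s for s in spans if s["trace_id"] == trace_id),
                  key=lambda s: s["start_time"])


def critical_path(spans, root_id):
    """Follow, from the root, the child that finishes last at each level."""
    by_parent = {}
    by_id = {}
    for s in spans:
        by_id[s["_id"]] = s
        by_parent.setdefault(s["parent"], []).append(s)
    path = [by_id[root_id]]
    while by_parent.get(path[-1]["_id"]):
        path.append(max(by_parent[path[-1]["_id"]],
                        key=lambda s: s["end_time"]))
    return path


# Hypothetical spans shaped like the PUT container example (times in ms).
SPANS = [
    {"_id": 1, "parent": 0, "trace_id": "t1", "name": "Proxy-Server.PUT",
     "start_time": 10, "end_time": 190},
    {"_id": 2, "parent": 1, "trace_id": "t1", "name": "Container-Server.PUT",
     "start_time": 20, "end_time": 90},
    {"_id": 3, "parent": 1, "trace_id": "t1", "name": "Container-Server.PUT",
     "start_time": 20, "end_time": 150},
    {"_id": 6, "parent": 3, "trace_id": "t1", "name": "Account-Server.PUT",
     "start_time": 60, "end_time": 140},
    {"_id": 9, "parent": None, "trace_id": "t2", "name": "Proxy-Server.GET",
     "start_time": 5, "end_time": 15},
]

trace = query_trace(SPANS, "t1")
path_ids = [s["_id"] for s in critical_path(trace, 1)]
```

Here the path runs through the slower container server and its account server call, which is where an investigation of a slow PUT would start.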
Reference
• Google Dapper: a large-scale distributed systems tracing infrastructure
• Twitter Zipkin: a distributed tracing system that helps us gather timing
data for all the disparate services at Twitter
• Berkeley X-Trace: a pervasive network tracing framework
Demo
Q&A
