Logistics
Due Friday, June 1, 5pm
Quantifying the effect of tail latency on networked services
In this exercise, you are going to simulate a distributed, replicated network service’s overall performance. In particular, you are going to quantify the effect of latency and latency variation on delivered performance.
Start by reading Dean and Barroso’s The Tail at Scale, Comm. of the ACM, Vol 56, No 2, Pages 74-80.
Overview
You are going to simulate the performance of three different systems:
- Scenario 1: A single client issues N independent, sequential, non-overlapping requests to a network service
- Scenario 2: A single client issues P parallel requests to a network service, waiting until all responses have been completed and the results have been returned before considering the operation complete
- Scenario 3: A single client issues P parallel requests to a network service, waiting until the first 90% of responses have been completed and returned before considering the operation complete. The remaining 10% of responses are simply ignored.
Note that you do not need to actually code up such a service or send any data over the network! Instead, you can use Matlab, Excel, Python, or any other software you like to calculate the latency of each request, and then analyze those latency values. This assignment isn't a programming assignment per se.
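As a concrete illustration, here is one minimal Python sketch of how each scenario's latency could be computed from a list of per-request delays. The function names and the 90th-percentile indexing convention are only one reasonable choice, not a required one:

```python
def scenario_1(delays):
    # Sequential, non-overlapping requests: the mean service latency.
    return sum(delays) / len(delays)


def scenario_2(delays):
    # Parallel requests, wait for every response: the slowest reply
    # determines the operation latency.
    return max(delays)


def scenario_3(delays):
    # Parallel requests, wait for the first 90% of responses: the reply
    # at the 90th percentile determines the operation latency, and the
    # slowest 10% are ignored.
    ordered = sorted(delays)
    return ordered[int(0.9 * len(ordered)) - 1]
```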
Emulating delay
To emulate this service, you're going to draw samples of a random variable N(mu, sigma): a Gaussian/Normal distribution with mean mu and standard deviation sigma. Each sample represents an observed request/response delay, in milliseconds. Any negative samples should be “rounded up” to 0 (i.e., don't wait).
One way to generate values from this distribution in Python is to use the random.gauss() function. In Matlab, you can use the normrnd function, and in Excel you can use NORMINV.
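For instance, a minimal sketch of this sampling step in Python, assuming a helper called sample_delay (the name is just for illustration):

```python
import random

def sample_delay(mu, sigma):
    # One observed request/response delay in milliseconds, drawn from
    # N(mu, sigma); negative draws are rounded up to 0, per the assignment.
    return max(0.0, random.gauss(mu, sigma))

# Example: a single delay from a node with mean 250 ms and std. dev. 250 ms.
delay_ms = sample_delay(250, 250)
```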
Thought question: How do you find the average time from a sorted list of numbers? The 90th percentile of a sorted list of length N? The 100th percentile of a list of length N?
Experiments
You are going to run a factorial-design experiment, in which you measure the effect of variance and scale on the performance of a distributed system.
For each of Scenarios 1, 2, and 3 described above, compute the latency for each of the following configurations (a sketch of the full sweep appears after the list):
- A 10-node cluster, mean request time of 250 ms, std. deviation of 250 ms
- A 10-node cluster, mean request time of 250 ms, std. deviation of 10 ms
- A 10-node cluster, mean request time of 200 ms, std. deviation of 200 ms
- A 10-node cluster, mean request time of 200 ms, std. deviation of 10 ms
- A 10,000-node cluster, mean request time of 250 ms, std. deviation of 250 ms
- A 10,000-node cluster, mean request time of 250 ms, std. deviation of 10 ms
- A 10,000-node cluster, mean request time of 200 ms, std. deviation of 200 ms
- A 10,000-node cluster, mean request time of 200 ms, std. deviation of 10 ms
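To make the sweep concrete, here is one possible end-to-end sketch in Python that runs every configuration above. It assumes the cluster size is the number of requests (N or P) issued in each scenario; all helper names are illustrative, and you may want to average over several trials for more stable numbers:

```python
import random

def sample_delays(n_nodes, mu, sigma):
    # One clamped Gaussian delay (in ms) per node in the cluster.
    return [max(0.0, random.gauss(mu, sigma)) for _ in range(n_nodes)]

# (cluster size, mean ms, std. dev. ms) for the eight configurations above.
configs = [
    (10, 250, 250), (10, 250, 10), (10, 200, 200), (10, 200, 10),
    (10_000, 250, 250), (10_000, 250, 10), (10_000, 200, 200), (10_000, 200, 10),
]

for n_nodes, mu, sigma in configs:
    delays = sorted(sample_delays(n_nodes, mu, sigma))
    s1 = sum(delays) / len(delays)           # Scenario 1: mean service latency
    s2 = delays[-1]                          # Scenario 2: wait for all responses
    s3 = delays[int(0.9 * len(delays)) - 1]  # Scenario 3: wait for first 90%
    print(f"nodes={n_nodes:>6} mu={mu} sigma={sigma}: "
          f"s1={s1:.1f} ms  s2={s2:.1f} ms  s3={s3:.1f} ms")
```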
Report
For each of the experiments and scenarios, report your results. You should include the mean service latency in the case of Scenario 1, and the operation latency in the case of Scenarios 2 and 3.
In a paragraph, describe the impact that tail latency has on parallelism within a distributed system. What was the effect of “shedding” the slowest 10% of the requests (the so-called stragglers)? Does this shedding effect change depending on the scale of the service?
Submission
Please submit this homework as a PDF on Gradescope.