CSE 291
2018 May 5: Homework 2: Sorting around the world

Update

The due date has been extended on account of the AWS limits that were in place. Those limits have been removed. Let us know if you come across a limit.

Notes

  • Due date: Monday May 9, 5pm
  • You should do this homework individually
  • This homework looks long and involved–don’t panic! We’ve provided the code you’ll need to use, and you’ll be spending your time running the code on the servers and collecting measurements.

Overview

In this exercise, you’re going to do three things: (1) learn how to use RPCs, in particular Google’s gRPC framework, to write a wide-area distributed application, (2) push your code to datacenters on four continents(!), and (3) measure the resulting performance of the application.

During this course we’ll cover a number of important application performance metrics, however three of the most important are:

  • Latency: The time taken to complete an operation
  • Throughput: The amount of data exchanged per unit time.
  • Bandwidth: The maximum amount of raw data that can be transferred between a source and a destination per unit time.

Setting up AWS

Before starting the homework, please read the ACMS instructions for using Amazon AWS.

The TAs have also created a very helpful walkthrough-guide located on the course webpage.

Once you’re inside your VM, you need to get the homework code. You can download this code from GitHub: https://github.com/gmporter/sp18-hw2

Prepare your VM:

./setup_vm.sh
. ~/.bashrc

When running your tests, you will need to use the VM’s public IP. This can be viewed in the AWS console.

When you’re done working, stop any VMs you started! Even if you haven’t finished the homework, stop your VMs. VMs cost money, and you only have a set amount you can use before ACMS will prevent you from using AWS and thus preventing you from completing the homework. To stop your VMs, just click on the VM in the AWS console, and select Actions -> Instance State -> Stop.

The application

You are going to implement a data sorting service called GlobeSort. GlobeSort is a cloud-hosted, RPC-based service which accepts data from a client, sorts that data, and returns the data in sorted order back to the client. As you might imagine, GlobeSort is unlikely to succeed as a startup, but it does have the quality of helping to illuminate some of the issues of building widely distributed systems.

GlobeSort only sorts integers, and its interface is extremely simple:

void ping(), // returns if the service is running
int[] sortIntegers(int[] input)  // sorts input

The ping() command is used to simply detect whether the service is up and running–it is generally good practice to always include it in any RPC you develop.

The sortIntegers() command sorts the list given to it as input in ascending order, returning the result in an ordered list. The input can have duplicate and negative values. For example, here are a few examples of invocations of sortIntegers():

sortIntegers([3,1,9,4]) == [1,3,4,9]
sortIntegers([3,3,1,2]) == [1,2,3,3]
sortIntegers([]) == []

We’ve actually already implemented this simple sort service using Google’s gRPC library. The starter code includes this implementation is in Java. Although we’ve implemented the basic sort, you will need to (slightly) modify the code we’ve provided to you in order to perform your measurements.

Java

For Java, we will be using the Maven build system. Inside of the main directory, run mvn protobuf:compile protobuf:compile-custom to create the protobuf stubs required, then run mvn package to compile the Java code. To run the server, use the following command:

./target/globesort/bin/runServer <server_port>

To run the client, use this command:

./target/globesort/bin/runClient <server_ip> <server_port> <num_values>

The client sends a random list to the server for sorting.

In terms of resources to help with learning gRPC and Java, consult:

Measuring GlobeSort

For this project, you will be using Amazon’s AWS EC2 cloud service to provision VMs onto multiple datacenters, or “regions”, across the world. You’re going to measure the latency, throughput, and bandwidth of GlobeSort.

Measuring throughput

To measure GlobeSort’s throughput, you’re going to invoke the sortIntegers() call with a list of integers, and use the measured invocation time to compute the (1) throughput of the application, and (2) the one-way network throughput between the client and the server.

You will need to choose the number of integers to send. You want to pick a value that is large enough so that the transfer runs for 5 or more seconds to get a good, stable reading. If you pick a number that is too small, then the overheads of setting up the connection and invoking gRPC will dominate your results. And if you set a number that is too big, then your program will run for too long. Part of your experiment will be to figure out how big to set this number so that your experiment runs for at least a few seconds. Something in the low 1 Millions should be about right.

Application throughput

The throughput of the sortIntegers() call can be expressed in two ways: as either records sorted per second, or (input) bytes sorted per second. Measuring the former is straightforward, you only need the number of integers sorted and the total invocation time. For this assignment, you only need to compute the number of integers sorted per second for your report. If you’re feeling adventurous, investigate how you might compute the latter metric.

One-way network throughput

Estimating the one-way network throughput of a sort operation is more challenging than the overall throughput, since in reality, invoking a sort operation involves three different steps:

  • The client transfers the input data to the server
  • The server sorts the data
  • The server transfers the sorted result back to the client

Measuring the combination of these three steps gives the application throughput, which we covered in the previous section. For network throughput, we want to remove the amount of time the server took to sort the data- that way we only measure the amount of time taken by the network. To do this, you will need to modify the sortIntegers() call to return the amount of time the server took sorting the data.

We are going to assume that the network throughput between the client and server is symmetric, meaning that the throughput from the client to the server is that same as the throughput from the server back to to the client. So you need to measure the total time of the sort RPC at the client, and subtract out the time it takes to sort the data, and divide the resulting time in half to estimate the one-way throughput. You’ll need to modify the provided RPC to provide the amount of time the server took sorting the data to the client.

Measuring latency

To measure latency, you’re going to measure the time it takes to run the ping() command. Similar to the one-way bandwidth calculation above, we’re going to assume that the latency between the client and the server is symmetric.

Measuring GlobeSort: Summary

In summary, you are going to modify the GlobeSort API as needed to measure the one-way network throughput of the service. You are then going to collect estimates of the round-trip latency, the application throughput, and the one-way network throughput of GlobeSort on Amazon AWS.

Data collection

Remote measurements

For the overall application throughput, one-way network throughput, and round-trip latency metrics above, collect results each for:

  • Seoul, Korea ↔ Dublin, Ireland
  • Dublin, Ireland ↔ Sao Paulo, Brazil
  • Sao Paulo, Brazil ↔ Mumbai, India
  • Mumbai, India ↔ Seoul, Korea

Using the provided AWS account given to you, instantiate one t2.micro instance type in each of the above four locations, which correspond to the AWS regions:

  • ap-northeast-2 (Seoul, Korea)
  • eu-west-1 (Dublin, Ireland)
  • sa-east-1 (Sao Paulo, Brazil)
  • ap-south-1 (Mumbai, India)

Perform the measurements three times each of the pairs above (12 sets of measurements in all).

Report

You will need to report on your observations and measurements in a written report, submitted in PDF format to GradeScope.com

Your report (which will likely be 1-2 pages) should include:

  • Your name
  • Your PID
  • Your GitHub ID
  • Latency experiment
    • The date/time you ran the experiment
    • Your raw data in a table. The table should have a row for each source city, and a column for each destination city. In each cell, put the three measurements. Note that if you fill in the cell from city A to B, you don’t need to fill in the cell from B to A.
    • A short description (a few sentences) describing your results–what did you discover running this experiment? What do your results show in terms of the latency between different parts of the world?
  • Application-level throughput experiment
    • The date/time you ran the experiment
    • Your raw data in a table. The table should have a row for each source city, and a column for each destination city. In each cell, put the three measurements. Note that if you fill in the cell from city A to B, you don’t need to fill in the cell from B to A. Remember to sort enough data so that the RPC call takes about 20 seconds or so.
    • A short description (a few sentences) describing your results–what did you discover running this experiment? What do your results show in terms of the application-level throughput between different parts of the world?
  • Network-level throughput experiment
    • The date/time you ran the experiment
    • Your raw data in a table. The table should have a row for each source city, and a column for each destination city. In each cell, put the three measurements. Note that if you fill in the cell from city A to B, you don’t need to fill in the cell from B to A. Remember to sort enough data so that the RPC call takes about 20 seconds or so.
    • A short description (a few sentences) describing your results–what did you discover running this experiment? What do your results show in terms of the network-level throughput between different parts of the world? How does the application-level throughput compare to the network-level throughput?