Unit testing data with marbles - Jane Stewart Adams, Leif Walsh

UNIT TESTING DATA WITH MARBLES
JANE ADAMS & LEIF WALSH

ONCE UPON A TIME...

CHILDREN WERE OLDER THAN THEIR PARENTS.

THEY WERE A LOT OLDER THAN THEIR PARENTS.
THEY WERE A LOT OLDER THAN EVERYONE.

EVERYONE THAT WORKS WITH DATA HAS STORIES LIKE THIS

WHAT WERE MY ASSUMPTIONS?

1. Children are born after their parents
2. People can't live forever

WHAT ELSE DO WE ASSUME ABOUT DATA?

Values are correct
We're not missing any data
Records are unique
Measurements are precise
(this is a non-exhaustive list)

WHY DOES THIS MATTER?

WE DON'T JUST HAVE DATA TO HAVE IT.
WE USE DATA TO MAKE DECISIONS.

WE SHOULD BE EXPLICIT ABOUT OUR ASSUMPTIONS.

WHAT ARE THE IMPORTANT PROBLEMS HERE?

Data are always changing

Some changes are loud while others are silent

Manually checking data is inconsistent and error-prone

We're working with a lot of data
We're working with a lot of different kinds of data

WHAT DO WE WANT TO DO?

Encode our assumptions in testable form

Test those assumptions on incoming data

Report when our assumptions don't hold

Report all of the assumptions that don't hold

"What if we wrote unit tests for data
like we write unit tests for code?"

HOW DOES UNITTEST SOLVE OUR PROBLEM?

Report all of the assumptions that don't hold

tripduration 0 days 00:18:09
starttime 2018-08-01 00:00:09.341000-04:00
stoptime 2018-08-01 00:18:18.889000-04:00
start station id 31
start station name Seaport Hotel - Congress St at Seaport Ln
start station latitude 42.3488
start station longitude -71.0417
end station id 190
end station name Nashua Street at Red Auerbach Way
end station latitude 42.3657
end station longitude -71.0643
bikeid 1026
usertype Subscriber
birth year 1969
gender 0
Hubway Bike Share Dataset

How long was the trip?
How far was the trip?
Internal metadata
Who took the trip?
???

import unittest


class TripDistanceTestCase(unittest.TestCase):

    def setUp(self):
        # Load the data (elided in the talk)
        self.data = ...

    def tearDown(self):
        delattr(self, 'data')

    def test_for_long_trips(self):
        thresholds = [
            ('marathon', 42195),
            ('10km', 10000)
        ]

        for severity, threshold in thresholds:
            # Each threshold is reported as its own sub-test
            with self.subTest(severity=severity):
                long_trips = self.data[
                    self.data['distance_meters'] > threshold]
                self.assertTrue(long_trips.empty)

1. Load the data
2. Pick some thresholds
3. For each threshold
4. Find trips longer than the threshold
5. Assert none exist

$ python -m unittest test_bikeshare.py
======================================================================
FAIL: test_for_long_trips (test_bikeshare.TripDistanceTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/leif/test_bikeshare.py", line 107, in test_for_long_trips
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

WHAT DOES THIS GIVE US?

New data are automatically tested for long trips
Don't have to remember how to test for long trips
Can easily run this test over historical data

TEST WRITING INTERLUDE...

WE'RE IN A PRETTY GOOD SPOT!

1. We've thought through our assumptions about the data

2. We've made them explicit by writing them down

3. We've made them executable

4. We've automated them
⭐

Her: "Is there a way to see local variables in my unittest output?"
Him: "I think pytest does that..."
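
Seeing local variables comes down to inspecting the frame in which the assertion failed. The following is a rough sketch of that idea using only the standard library. It illustrates the concept and is not how pytest or marbles implement it; `check_long_trips` and `locals_at_failure` are hypothetical helper names.

```python
def check_long_trips(distances, threshold):
    """Fail if any distance exceeds the threshold."""
    long_trips = [d for d in distances if d > threshold]
    assert not long_trips, "found trips over the threshold"


def locals_at_failure(func, *args):
    """Call func; if it raises AssertionError, return the local
    variables of the innermost frame at the point of failure."""
    try:
        func(*args)
    except AssertionError as exc:
        tb = exc.__traceback__
        while tb.tb_next is not None:  # walk to the innermost frame
            tb = tb.tb_next
        return dict(tb.tb_frame.f_locals)
    return None
```

For example, `locals_at_failure(check_long_trips, [1200, 50000], 42195)` returns a dictionary that includes `threshold` and the offending `long_trips` list, which is the kind of context a bare "False is not true" leaves out.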

PUT YOURSELF IN THE TEST CONSUMER'S SHOES

What is this test doing? Why is it here?

What am I supposed to do about this failure?

How bad is it?

Have we seen this failure before? When?

CONTEXT IS EXPENSIVE TO RECOVER

THIS ISN'T UNIQUE TO DATA,
BUT IT'S ESPECIALLY HARD WITH DATA

1. Assumptions aren't black-and-white

2. Failures are usually introduced by someone else
3. Different tests require different follow-up

ANATOMY OF A MARBLES FAILURE MESSAGE

$ python -m marbles test_bikeshare.py
======================================================================
----------------------------------------------------------------------
marbles.core.marbles.ContextualAssertionError: False is not true

Source (/home/leif/test_bikeshare.py):
151 self.data['distance_meters'] > threshold]
> 152 self.assertTrue(long_trips.empty, note=note)
153
Locals:
severity = 'marathon'
threshold = 42195
long_trips =
start station latitude stop station latitude ...
27955 42.366277 0.0 ...
Note:
There appear to be some trips in the data that are longer than a
marathon! If these are legitimate trips, consider contacting the
local news station about a human-interest story. If these do not
appear to be legitimate trips, contact the bike share mechanics
to have affected bikes identified and repaired.

What is this test doing?
Why is it here?
How bad is it?
Can we add more context?

SEMANTIC ASSERTIONS

self.assertTrue((lower < x) and (x < upper))

self.assertGreater(x, lower)
self.assertGreater(upper, x)

self.assertTrue(all(a < b for a, b in zip([lower, x], [x, upper])))

self.assertBetween(x, lower, upper)

marbles.mixins
from marbles.mixins import mixins


class TripDistanceTestCase(BikeshareTestCase, mixins.BetweenMixins):

    def setUp(self):
        self.data = ...

    def tearDown(self):
        delattr(self, 'data')

    def test_for_unreasonable_distances(self):
        for distance in self.data['distance_meters']:
            self.assertBetween(distance, 100, 42195)

CUSTOM ASSERTIONS

self.assertEqual(len(long_trips), 0)

class DataFrameMixins(object):

    def assertDataFrameEmpty(self, df, msg=None):
        self.assertTrue(df.empty, msg=msg)

DOES MARBLES RECOVER THE CONTEXT WE WANTED?

How bad is it?

ASSERTION LOGGING

import marbles.core
from marbles.core import log


class TripDistanceTestCase(BikeshareTestCase):
    ...


if __name__ == '__main__':
    log.logger.configure(logfile='marbles.log')
    marbles.core.main()

{
"case": "test_for_long_trips (test_bikeshare.TripDistanceTestCase)",
"test_case": "TripDistanceTestCase",
"test_method": "test_for_long_trips",
"assertion": "assertTrue",
...
"locals": [
{
"key": "severity",
"value": "marathon"
},
{
"key": "threshold",
"value": "42195"
},
{
"key": "long_trips",
"value": "
27955 42.366277 0.0 ... "
}
],
"month": "2016-07-01",
"severity": "marathon",
"anomalies": "1",
"result": "fail"
}

Which test was running?
What did we assert?
Local variables
Which data were we testing?
Other information about the assertion
More (not pictured)

HISTORICAL FAILURES

"Have we seen this kind of problem before?"

AGGREGATE DATASET HEALTH METRICS

df = df.pivot_table(
    index=['month'], columns=['severity'],
    values='anomalies', aggfunc='sum')
df.describe()
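
The same aggregation can be sketched without pandas. The example below assumes the log holds one JSON object per assertion and reuses the `month`, `severity`, and `anomalies` keys from the sample log record; the log lines themselves and the `anomalies_by_month` helper are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical log lines, one JSON object per logged assertion.
LOG_LINES = [
    '{"month": "2016-07-01", "severity": "marathon", "anomalies": "1"}',
    '{"month": "2016-07-01", "severity": "10km", "anomalies": "4"}',
    '{"month": "2016-08-01", "severity": "10km", "anomalies": "2"}',
    '{"month": "2016-08-01", "severity": "marathon", "anomalies": "0"}',
]


def anomalies_by_month(lines):
    """Return {month: {severity: total anomalies}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        record = json.loads(line)
        # The logged values are strings, so convert before summing.
        totals[record["month"]][record["severity"]] += int(
            record["anomalies"])
    return {month: dict(by_sev) for month, by_sev in totals.items()}
```

Because the sample record stores `anomalies` as a string, the conversion to `int` matters: summing the raw strings would concatenate them rather than add them.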

CONTEXT IS GOOD FOR SOFTWARE TESTS, TOO

$ python -m unittest
F
======================================================================
FAIL: test_return_code (docs.examples.getting_started.ResponseTestCase)
----------------------------------------------------------------------
File "/home/leif/git/marbles/docs/examples/getting_started.py", line 43, in test_return_code
201
AssertionError: 409 != 201

----------------------------------------------------------------------

$ python -m marbles
F
======================================================================
FAIL: test_return_code (docs.examples.getting_started.ResponseTestCase)
----------------------------------------------------------------------
marbles.core.marbles.ContextualAssertionError: 409 != 201

Source (/home/leif/git/marbles/docs/examples/getting_started.py):
40 res = requests.put(endpoint, data=data)
> 41 self.assertEqual(
42 res.status_code,
43 201
44 )
Locals:
endpoint = 'http://example.com/api/v1/resource'
data = {'id': 1, 'name': 'Little Bobby Tables'}
res = <docs.examples.getting_started.Response object at 0x7fae97e78978>

----------------------------------------------------------------------

TWO STEPS TO MARBLES
$ pip install marbles
$ python -m marbles test_module.py

GITHUB
github.com/twosigma/marbles

DOCUMENTATION
marbles.readthedocs.io

CONTRIBUTING AND GETTING HELP
github.com/twosigma/marbles/issues

✨ READ BETTER TEST FAILURES ✨
github.com/twosigma/marbles
marbles.readthedocs.io
@thejunglejane @leifwalsh
