Time Series Analysis for Network Security
Phil Roth
Data Scientist @ Endgame
mrphilroth.com
First, an introduction. My history of Python
scientific computing, in function calls:
os.path.walk
Physics Undergraduate @ PSU
AMANDA Neutrino Telescope
pylab.plot
Physics Graduate Student @ UMD
IceCube Neutrino Telescope
numpy.fft.fft
Radar Scientist @ User Systems, Inc.
Various Radar Simulations
pandas.io.parsers.read_csv
Side Projects
Scraping data from the web
sklearn.linear_model.LogisticRegression
Side Projects
Machine learning competitions
(the rest of this talk…)
Data Scientist @ Endgame
Time Series Anomaly Detection
Problem:
Highlight when recorded metrics deviate from
normal patterns.
for example: a high number of connections might be an
indication of a brute force attack
for example: a large volume of outgoing data might be an
indication of an exfiltration event
Solution:
Build a system that can track and store
historical records of any metric. Develop an
algorithm that will detect irregular behavior
with minimal false positives.
Gathering Data:
kairos
kafka-python
pyspark

Building Models:
classification
ewma
arima
[Architecture diagram, "data flow": Data Sources feed Kafka Topics (distributed message passing); from there, a real-time stream feeds Redis (an in-memory key-value data store) and a batch path feeds HDFS (a large-scale distributed data store) for historical data.]
kairos
A Python interface to backend storage databases
(redis in my case, others available) tailored for time
series storage.
Takes care of expiring data and different types of time
series (series, histogram, count, gauge, set).
Open sourced by Agora Games.
https://github.com/agoragames/kairos
kairos
Example code:
from redis import Redis
from kairos import Timeseries

intervals = {"days"   : {"step" : 60,   "steps" : 2880},
             "months" : {"step" : 1800, "steps" : 4032}}

rclient = Redis("localhost", 6379)
ktseries = Timeseries(rclient, type="histogram", intervals=intervals)

ktseries.insert(metric_name, metric_value, timestamp)
kafka-python
A Python interface to Apache Kafka, where Kafka is
publish-subscribe messaging rethought as a
distributed commit log.
Allows me to subscribe to the events as they come in
real time.
https://github.com/mumrah/kafka-python
kafka-python
Example code:

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kclient = KafkaClient("localhost:9092")
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

for message in kconsumer :
    insert_to_kairos(message)
pyspark
A Python interface to Apache Spark, where Spark is a
fast and general engine for large scale data
processing.
Allows me to backfill historical data into the time series
when I add or modify metrics.
http://spark.apache.org/
pyspark
Example code:

from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(hdfs_files)
       .map(insert_to_kairos)
       .count())
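(count() here is simply an action that forces the lazy RDD pipeline to execute; the real work happens in the insert_to_kairos calls.)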
pyspark
Example code:

from json import loads
from functools import partial
import timevault as tv
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(tv.conf.hdfs_files)
       .map(loads)
       .flatMap(tv.flatten_message)
       .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
       .filter(lambda tup : tup[2] < float(tv.conf.limit_time))
       .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
       .count())
the end result
from pandas import DataFrame, to_datetime

series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data" : fields}, index=to_datetime(ts, unit="s"))
building models
The first naïve model is simply the mean and standard
deviation across all time.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
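A minimal sketch of this model (my own helper, not from the deck), assuming the DataFrame built earlier with its "conns" column:

def mean_std_outlier(tsdf, stdlimit=5) :
    # Global statistics across the whole history.
    mean = tsdf["conns"].mean()
    std = tsdf["conns"].std()
    # Flag points that fall outside mean ± stdlimit · std.
    tsdf["conns_stds"] = (tsdf["conns"] - mean) / std
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf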
building models
The second, slightly less naïve model fits a sine curve
to the whole series.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
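A minimal sketch of this model as well (the helper names and the residual-based threshold are my assumptions; the deck's actual sine-fitting code appears in the classification section below):

import numpy as np
from scipy.optimize import leastsq

def sine_model(p, t) :
    # p holds the offset, relative amplitude, and phase of a daily sine.
    return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (t + p[2])))

def sine_outlier(tsdf, stdlimit=5) :
    t = tsdf.index.astype(np.int64) / 1e9  # timestamps in seconds
    p0 = np.array([tsdf["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(lambda p, y, t : y - sine_model(p, t), p0,
                        args=(tsdf["conns"].values, t))
    # Flag points whose residual exceeds stdlimit residual standard deviations.
    resid = tsdf["conns"] - sine_model(plsq, t)
    tsdf["conns_outlier"] = resid.abs() > stdlimit * resid.std()
    return tsdf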
classification
Both naïve models left a lot to be desired. Two simple
classifications would help us treat different types of
time series appropriately:
Does this metric show a weekly pattern (i.e., different
behavior on weekends versus weekdays)?
Does this metric show a daily pattern?
classification
Fit a sine curve to the weekday and weekend periods.
Use the ratio of the levels of those fits to determine
whether weekdays should be divided from weekends.
weekly
classification weekly
import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x) :
    return (p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2]))))

def residuals(p, y, x) :
    return y - fitfunc(p, x)

def fit(tsdf) :
    tsgb = tsdf.groupby(tsdf.timeofday).mean()
    p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(residuals, p0, args=(tsgb["conns"],
                                             np.array(tsgb.index)))
    return plsq
classification weekly
import pandas as pd

def weekend_ratio(tsdf) :
    tsdf['weekday'] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
    tsdf['timeofday'] = (tsdf.index.second + tsdf.index.minute * 60 +
                         tsdf.index.hour * 3600)
    wdayplsq = fit(tsdf[tsdf.weekday == 1])
    wendplsq = fit(tsdf[tsdf.weekday == 0])
    return wendplsq[0] / wdayplsq[0]
[Diagram: a number line from 0 past 1, with cutoff and 1 / cutoff marked; ratios falling between them indicate no weekly variation.]
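A sketch of how the ratio might be applied (the cutoff value is hypothetical; the talk does not specify one):

cutoff = 0.8  # hypothetical threshold below 1
ratio = weekend_ratio(tsdf)
# A weekend level far enough from the weekday level implies a weekly pattern.
weekly_pattern = (ratio < cutoff) or (ratio > 1.0 / cutoff)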
classification weekly
[Plots: one example time series showing a weekly pattern, one showing no weekly pattern.]
classification
Take a Fourier transform of the time series, and inspect the bins
associated with a frequency of a day.
Use the ratio of those bins to the first (constant, or DC, component)
to classify the time series.
daily
classification
Time series on weekdays, shown with a strong daily pattern.
Fourier transform with the bins around the day frequency highlighted.
daily
classification
Time series on weekends, shown with no daily pattern.
Fourier transform with the bins around the day frequency highlighted.
daily
classification
def daily_ratio(tsdf) :
    nbins = len(tsdf)
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds
    deltaf = 1.0 / (nbins * deltat)
    daybin = int((1.0 / (24 * 3600)) / deltaf)
    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    daily_ratio = np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
    return daily_ratio
daily
Find the bin associated with the frequency of a day using:
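Reconstructed from the code above (N is the number of samples, Δt the sample spacing in seconds):

daybin = f_day / Δf = (1 / (24 · 3600)) / (1 / (N · Δt)) = N · Δt / (24 · 3600)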
ewma
Exponentially weighted moving average:
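In the adjusted form that pandas uses by default, the average at time t is:

y_t = (x_t + (1 − α) x_{t−1} + (1 − α)² x_{t−2} + … + (1 − α)^t x_0) /
      (1 + (1 − α) + (1 − α)² + … + (1 − α)^t)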
The decay parameter is specified as a span, s, in
pandas, related to α by:
α = 2 / (s + 1)
A normal EWMA analysis is done when the metric
shows no daily pattern. A stacked EWMA analysis is
done when there is a daily pattern.
ewma
def ewma_outlier(tsdf, stdlimit=5, span=15) :
    tsdf['conns_binpred'] = pd.ewma(tsdf['conns'], span=span).shift(1)
    tsdf['conns_binstd'] = pd.ewmstd(tsdf['conns'], span=span).shift(1)
    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)
    return tsdf
normal
ewma normal
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
ewma
blue: actual response size
green: prediction window
red: actual value exceeded standard deviation limit
normal
ewma stacked
[Three plot-only slides illustrating the stacked EWMA approach.]
ewma
def stacked_outlier(tsdf, stdlimit=4, span=10) :
    gbdf = tsdf.groupby('timeofday')['conns']
    gbdf = pd.DataFrame({'conns_binpred' : gbdf.apply(pd.ewma, span=span),
                         'conns_binstd' : gbdf.apply(pd.ewmstd, span=span)})
    interval = tsdf.timeofday[1] - tsdf.timeofday[0]
    nshift = int(86400.0 / interval)
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)
    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)
    return tsdf
stacked
Shift the EWMA results by a day and overlay them on the original DataFrame.
ewma
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
stacked
arima
I am currently investigating using ARIMA
(autoregressive integrated moving average) models to
make better predictions.
I’m not convinced that this level of detail is necessary
for the analysis I’m doing, but I wanted to highlight
another cool scientific computing library that’s
available.
arima
from statsmodels.tsa.arima_model import ARIMA

def arima_model_forecast(tsdf, p, d, q) :
    arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
    forecast, stderr, conf_int = arima_model.forecast(1)
    tsdf["conns_binpred"][-1] = forecast[0]
    tsdf["conns_binstd"][-1] = stderr[0]
    return tsdf
arima
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
p = d = q = 1
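A hypothetical invocation with the parameters from this slide:

tsdf = arima_model_forecast(tsdf, 1, 1, 1)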
takeaways
Python provides simple and usable interfaces to most
data handling projects.
Combined, these interfaces can create a full data
analysis pipeline from collection to analysis.
© 2014 Endgame
Editor's Notes

  • #27: y = p_0 \left[ 1 - p_1 \sin \left( \frac{2\pi}{24 \cdot 3600} (x - p_2) \right) \right]