EY Hong Kong NextWave
Data Science Challenge
Byung Eun Jeon - byungeuni
Hyunju Shim - sg04088
University of Hong Kong
May 30, 2019
Disclaimer
These presentation slides are not official EY presentation slides.
A winning team of the EY Hong Kong NextWave Data Science Competition produced
these slides for the presentation on May 30, 2019 at Citic Tower, Admiralty, Hong Kong.
Agenda
1. Methodology and Algorithms
2. Findings and Patterns
3. Opportunities to Improve Performance
4. Smart Cities Applications
Methodology and Algorithms
Overview of Methodology
1. Problem Formulation
2. EDA & Feature Engineering
3. Model Exploration and Selection
4. Training & Fine-Tuning
5. Prediction & Ensembling
Problem Formulation
Objective: build a model that predicts whether a specific citizen will be in a predefined city center.

Key points:
ID resets every 24 hours, so the same device cannot be traced across days
The exact date and day of the week are not known, which limits the use of weekend/weekday trends and public holidays
Two-thirds of the velocity-related data is missing, so handling missing values is an issue
A hash links several trajectories into one sequence; the number of trajectories per unique hash ranges from 1 to 20 (grouping sketched below)
Target variable: 0 (not in center) or 1 (in center)

The task is therefore variable-length sequence binary classification.
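For concreteness, here is a minimal sketch (not the team's exact code) of how trajectories sharing a hash can be grouped into one variable-length sequence with a 0/1 target; the column names and the city-center bounding box are assumptions for illustration:

```python
# Group trajectories by hash into variable-length sequences and derive the
# binary target from the final trajectory's exit point.
import pandas as pd

# Hypothetical city-center bounding box (projected coordinates).
X_MIN, X_MAX = 3750901.5, 3770901.5
Y_MIN, Y_MAX = -19268905.6, -19208905.6

def in_center(x, y):
    """1 if the point falls inside the predefined city-center box, else 0."""
    return int(X_MIN <= x <= X_MAX and Y_MIN <= y <= Y_MAX)

df = pd.read_csv("data_train.csv")            # assumed file name
df = df.sort_values(["hash", "time_entry"])   # keep trajectories in temporal order

sequences, targets = {}, {}
for h, group in df.groupby("hash"):
    sequences[h] = group                      # 1 to 20 trajectories per hash
    last = group.iloc[-1]                     # the label comes from the last trajectory
    targets[h] = in_center(last["x_exit"], last["y_exit"])
```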
Selection of Approach: Deep Learning
Why Deep Learning?
The dataset is large
It is difficult to hand-design useful columns
Non-ML statistical models (e.g. ARIMA, Holt-Winters) require many assumptions, yet it is difficult to make good assumptions

Other machine learning models (Random Forest, k-Nearest Neighbors) find it difficult to capture a sense of time.
Deep learning requires minimal feature engineering because it is flexible at approximating the non-linear functions useful for prediction.
Within deep learning:
Convolutional NN family: has no sense of time (i.e. difficult to learn the seasonality during the day)
Recurrent NN family: why LSTM? It copes better with the vanishing gradient problem and is capable of learning long-term dependencies (a minimal sketch follows)
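As a concrete illustration of this choice, a minimal Keras sketch of a recurrent binary classifier over padded trajectory sequences; the layer sizes and feature count are placeholders, not the team's final architecture:

```python
from tensorflow.keras import layers, models

N_FEATURES = 16  # assumed number of engineered features per trajectory

model = models.Sequential([
    # mask zero-padded timesteps so they do not affect the recurrent state
    layers.Masking(mask_value=0.0, input_shape=(None, N_FEATURES)),
    layers.LSTM(128),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # P(citizen ends up in the city center)
])
model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])
```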
Feature Engineering
Modification of given features:
Target
Entry Time (to seconds), Exit Time (to seconds)

Handling missing values of velocity-related features:
Fill with each group's median and mark NaN values with new "valid" columns (sketched below)
If the time spent in the trajectory is 0, fill velocities with 0; otherwise fill with the median of the non-NaN values

Design of new features:
Entry Center, Exit Center
Time Spent, Time After
New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash)
Vmin Valid, Vmean Valid, Vmax Valid

MinMax normalization on continuous variables:
Rescale so that inputs with different ranges share the same scale

Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash.
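The missing-value handling and scaling steps above can be sketched as follows (assumed column names vmin/vmean/vmax and time_entry/time_exit; not the team's exact code):

```python
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["time_spent"] = df["time_exit"] - df["time_entry"]

    for col in ["vmin", "vmean", "vmax"]:
        df[f"{col}_valid"] = df[col].notna().astype(int)   # new "valid" indicator column
        median = df[col].median()                          # median of non-NaN values
        zero_time = df["time_spent"] == 0
        df.loc[zero_time, col] = df.loc[zero_time, col].fillna(0.0)
        df.loc[~zero_time, col] = df.loc[~zero_time, col].fillna(median)

    # MinMax-normalize continuous features so different ranges share the same scale
    for col in ["time_entry", "time_exit", "time_spent", "vmin", "vmean", "vmax"]:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0
    return df
```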
Input with Variable Sequence Length
Objective: create batches of similar-length sequences to reduce the sparsity of the input data.
Problem: if sequence lengths range from 1 to 20, the input data becomes too sparse.

General way of "variable sequence length LSTM": zero-pad each batch to the maximum sequence length within the batch, which yields about 1.2 million zero paddings per epoch on average (estimated over 1,000 simulations).

Our approach: combine trajectories into one long sequence per unique hash, then bucket and zero-pad by sorting hashes according to their number of trajectories, which yields only 313 zero paddings per epoch (sketched below).
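A sketch of the bucketing-and-padding idea (NumPy only; variable names and batch size are illustrative):

```python
import numpy as np

def make_batches(sequences, batch_size=64):
    """sequences: dict mapping hash -> array of shape (n_trajectories, n_features)."""
    # sort hashes by number of trajectories so each batch holds similar lengths
    order = sorted(sequences, key=lambda h: len(sequences[h]))
    batches, n_padded = [], 0
    for i in range(0, len(order), batch_size):
        chunk = [sequences[h] for h in order[i:i + batch_size]]
        max_len = max(len(s) for s in chunk)            # pad only to the batch maximum
        padded = np.zeros((len(chunk), max_len, chunk[0].shape[1]))
        for j, seq in enumerate(chunk):
            padded[j, :len(seq)] = seq
            n_padded += max_len - len(seq)              # count zero-padded timesteps
        batches.append(padded)
    return batches, n_padded
```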
Multiplicative LSTM
Motivation
An LSTM variant whose hidden-to-hidden transition functions are input-dependent, making it better suited to recovering from surprising inputs
Larger number of parameters (×1.25): a trade-off between model flexibility and training time
Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for sequence
modelling. ArXiv, 2016.
[Diagrams: Traditional LSTM vs. Multiplicative LSTM]
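For reference, the multiplicative LSTM of Krause et al. (2016) roughly replaces the shared hidden-to-hidden transition with an input-dependent intermediate state; this is our paraphrase of the cited paper, biases omitted (see the source above for the exact formulation):

```latex
\begin{aligned}
m_t &= (W_{mx}\, x_t) \odot (W_{mh}\, h_{t-1}) && \text{input-dependent intermediate state}\\
\hat{h}_t &= W_{hx}\, x_t + W_{hm}\, m_t\\
i_t,\ f_t,\ o_t &= \sigma\!\left(W_{\{i,f,o\}x}\, x_t + W_{\{i,f,o\}m}\, m_t\right) && \text{gates use } m_t \text{ instead of } h_{t-1}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(\hat{h}_t), \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
```

The extra W matrices are what drive the roughly 1.25x parameter count noted above.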
Training & Prediction
Loss Function
• The metric (F1) is not differentiable and, when modified into a differentiable surrogate, does not converge well
• Use binary cross-entropy loss instead

Optimizer
• Adadelta optimizer for the first part of training
• Adam optimizer with learning-rate decay for the later part

Train/Val Data Split
• 70/30 split when exploring various models
• 95/5 split when fine-tuning

Computing
• Cloud computing for GPU computation

Simple weighted ensembling with higher weights on the predictions with higher scores (sketched below).
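A rough sketch of the two-stage optimization and the weighted ensemble; hyperparameter values are placeholders, and `model`, the padded arrays, `trained_models`, and `val_f1_scores` are assumed to come from the earlier steps:

```python
import numpy as np
from tensorflow.keras.optimizers import Adadelta, Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# Stage 1: Adadelta for the first part of training
model.compile(optimizer=Adadelta(), loss="binary_crossentropy")
model.fit(x_train, y_train, epochs=20, validation_split=0.05)

# Stage 2: Adam with learning-rate decay for the later part
schedule = ExponentialDecay(1e-3, decay_steps=10_000, decay_rate=0.9)
model.compile(optimizer=Adam(learning_rate=schedule), loss="binary_crossentropy")
model.fit(x_train, y_train, epochs=20, validation_split=0.05)

# Weighted ensemble: weight each model's prediction by its validation score
preds = np.stack([m.predict(x_test).ravel() for m in trained_models])
weights = np.array(val_f1_scores) / np.sum(val_f1_scores)
final = (weights[:, None] * preds).sum(axis=0) > 0.5
```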
Findings and Patterns
Inspired by theory and practice, we present domain-specific findings
Disharmony between BN & Dropout
Batch Normalization (BN) and Dropout combined do not work well
Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains it
Conducted experiments on three cases: i) BN and Dropout, ii) Dropout only, iii) Dropout only after BN
Empirically observed that using Dropout only performs best (see the sketch after the sources)

Significance of Velocity-related Variables
In NLP, tokenization with greater granularity sometimes achieves better results
Tried training the model with each trajectory separated into two positions, which forced us to either delete or duplicate the velocity-related variables
Neither deleting nor duplicating velocity led to better predictions
Although ⅔ of the data is missing, the velocity information is valuable

Experiment on LSTM+CNN
Some practitioners have achieved state-of-the-art results using LSTM+CNN
We replicated the model and tried to generalize it to the given geolocation domain
For this domain, our approach (LSTM with a fully connected layer at the end) performed better than LSTM+CNN
Sources:
1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch
normalization by variance shift. ArXiv, 2018
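A minimal Keras sketch of the three configurations we compared (layer sizes are illustrative, not the competition architecture):

```python
from tensorflow.keras import layers, models

def make_model(variant, n_features=16):
    x = inp = layers.Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(x)
    if variant == "bn_and_dropout":          # i) BN and Dropout together
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.3)(x)
    elif variant == "dropout_only":          # ii) Dropout only (best in our runs)
        x = layers.Dropout(0.3)(x)
    elif variant == "dropout_after_bn":      # iii) Dropout applied only after the BN block
        x = layers.BatchNormalization()(x)
        x = layers.Dense(64, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    m = models.Model(inp, out)
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m
```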
Through exploratory data analysis, we found patterns of citizens in the city
Time:
00:00 ~ 04:00
Target Variable:
Trajectories inside
the yellow box
(city center)
Through exploratory data analysis, we found patterns of citizens in the city
Time:
04:00 ~ 08:00
Target Variable:
Trajectories inside
the yellow box
(city center)
Through exploratory data analysis, we found patterns of citizens in the city
Time:
08:00 ~ 12:00
Target Variable:
Trajectories inside
the yellow box
(city center)
Through exploratory data analysis, we found patterns of citizens in the city
Time:
12:00 ~ 16:00
Target Variable:
Trajectories inside
the yellow box
(city center)
Populated areas such as residential areas, highways, and business areas can be inferred from this visualization
Analysis of activity percentage complements the visualization of counts
[Maps: Trajectory Count and City-center Percentage]
We explored the data extensively, and
this led to better feature engineering
Both the distribution of time within a trajectory and the distribution of time between trajectories are right-skewed
"Broken GPS" records exist but are negligible: no time spent in the trajectory, yet a shift in position
This happens very rarely, and deep learning handles such small noise well (robustness of DL)
Limitations worth noting: information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given data
Opportunities to Improve Performance
With fewer constraints on resources, each stage of the process could be improved
Feature Engineering: designing a "Highway" column
Ensemble results and EDA suggest that hashes on the highway are difficult to predict
A similar approach to designing the "center" column, which significantly improved performance, could be applied
We prioritized fine-tuning the model over hand-designing columns

Hyperparameter Tuning: grid search with multi-GPU
If more computing power were available, a grid search over hyperparameters such as the learning rate and number of epochs may have outperformed manual fine-tuning

Feeding the Input: Stratified Random Shuffling (SRS)
When bucketing and zero-padding, randomly shuffle among hashes that have the same sequence length (sketched below)
Keras does not support SRS, so it would have to be implemented with TensorFlow and NumPy
Less prioritized due to the time limit
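Returning to Stratified Random Shuffling, a sketch of what would need to be implemented by hand (standard library only; not an existing Keras feature):

```python
import random
from collections import defaultdict

def stratified_shuffle(hashes, lengths, seed=0):
    """hashes: list of hash ids; lengths: matching list of trajectory counts per hash."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for h, n in zip(hashes, lengths):
        buckets[n].append(h)
    order = []
    for n in sorted(buckets):      # keep the length-sorted bucket order
        bucket = buckets[n]
        rng.shuffle(bucket)        # shuffle only among hashes with the same length
        order.extend(bucket)
    return order                   # feed this order to the batch builder each epoch
```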
Smart Cities Applications
Atlanta stands to benefit from a data-driven litter collection application
Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution
5. TechRepublic 6. Bisnow
While urbanization and the surge of trash are expected to continue,
Atlanta's ecosystem presents an opportunity to develop technology for trash management
Trends in Atlanta

Demography
Higher percentage change in employment than the U.S. average in 2014~19
Metro Atlanta in 2019 has the 4th fastest-growing population in the U.S.
Implication: development of urban culture, which leads to an increase in economic activity and consumption

Policy
Major changes to the trash collection schedule were made in each of the last 2 consecutive years (April 1, 2019 and July 9, 2018)
Implication: frequent adjustments to policy indicate an increase in trash, and a willingness of the state to tackle the surging amount of trash

Ecosystem
Rubicon Global, the first tech unicorn in trash, is Atlanta-based
A "Smart Dumpster" for recycling has been developed by the Government of Atlanta
Implication: potential to leverage the existing infrastructure and synergies with existing services
Agile software development with a clear KPI is crucial for the application
Value Proposition
Improve citizens' quality of life
Protect soil and water quality
Save cleaning costs

Agile cycle:
Analyze: determine requirements by involving users continually
Design: database, server, and UI/UX for real-time prediction
Develop: predict the locations and crowdedness of public gatherings
Test: as in the EY NextWave Competition, aim for a higher F1 score
Deploy: e.g. population density before and after a baseball game
Review: Community Appearance Index (CAI) and number of complaints

Sources: 1. Team's Analysis 2. Keep Atlanta Beautiful Commission

The prediction system can effectively allocate cleaning staff to temporarily crowded areas that generate more pedestrian litter, and optimize trash truck routes.
Thank you!
  • 24. Agile Software Development with a clear KPI is crucial for the Application Page 23 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU) Review Community Appearance Index (CAI), Number of Complaints Analyze Determine requirements by involving users continually Value Proposition Improve citizens’ life quality Protect soil and water quality Save cost of cleaning Deploy Ex) population density before and after the baseball game Design Database, server, and UI/UX for real-time prediction Test Similar to EY NextWave Competition, aim for higher F-1 score Develop Predict locations/crowdedness of public gatherings Sources: 1. Team’s Analysis 2. Keep Atlanta Beautiful Commission Prediction system can effectively allocate cleaning staffs to temporally crowded areas that yield more pedestrian litters and optimize trash truck routes