Art of Feature Engineering For Data Science
Nabeel Sarwar
Machine Learning Engineer at Comcast
Spark Summit Europe - October 26th, 2017
What is Comcast NBCUniversal?
• Much like Ireland’s Sky
• Provides broadband services in the United States
- Internet
- Video
- Media
- Mobile
• ~25 million customers on a combination of broadband services
• Use Case: machine learning to optimize customer experience
- While complying with all applicable privacy policies and laws
XFINITY TV
XFINITY Internet
XFINITY Voice
XFINITY Home ®
Digital & Other
*Minority interest and/or non-controlling interest.
Slide is not comprehensive of all Comcast NBCUniversal assets
Updated: December 22, 2015
Where in the process?
Obtain
Select
Clean
Transform & Enrich
Define
Model Selection
Tuning & Eval
Feedback
Imagine Arrows Everywhere
Feature Engineering 101
• Cheat sheet to some of the terminology and best practices
• Why is feature engineering necessary?
• How and how much feature engineering?
• Feature Engineering Process variations
• Select
• Clean
• Normalize and Transform
• Philosophies
- Confirm a hypothesis
- Explore trends
- Middle ground approach
Features
• ML algorithms need pieces of information
• Each individual piece in a sample: feature
• Encoded in all sorts of different ways
• Differentiate by use case
- Insurance: Model of car
- Imaging: Color of pixels
- General Relativity (black hole detection): coordinates in space
- Online Food Ordering: what you ate last
• Differentiate by model
- Sequencing: Last event states
- Imaging: Channels of the pixels
- Reinforcement Learning: Actions from state to state
• Differentiate by available data:
- How do you partition?
- Feature selection
Example: Titanic Survival Prediction
• Who survives on the Titanic?
• Select the features
• Prioritize who survives
- Age
- Role on ship
- Language spoken by passenger
• Pick Decision Tree
• Bucket Age
• Keep role as is
• Encode the spoken language as a binary variable: speaks English or not
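The three encoding choices on this slide can be sketched in plain Scala. This is a minimal illustration, not the speaker's code: the `Passenger` record, its field names, and the age cut points are assumptions made up for the example.

```scala
object TitanicFeatures {
  // Hypothetical raw record; field names are illustrative, not from a real dataset.
  final case class Passenger(age: Double, role: String, language: String)

  // Bucket a continuous age into coarse bins (the cut points are assumptions).
  def bucketAge(age: Double): String =
    if (age < 13) "child"
    else if (age < 20) "teen"
    else if (age < 60) "adult"
    else "senior"

  // Collapse the spoken language into a binary flag: 1 if English, 0 otherwise.
  def speaksEnglish(language: String): Int =
    if (language.trim.equalsIgnoreCase("english")) 1 else 0

  // Feature tuple: (age bucket, role kept as is, English flag).
  def features(p: Passenger): (String, String, Int) =
    (bucketAge(p.age), p.role, speaksEnglish(p.language))

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Passenger(8.0, "passenger", "English"),
      Passenger(35.0, "crew", "Swedish")
    )
    sample.map(features).foreach(println)
  }
}
```

A decision tree can split on the bucket and the flag directly, which is why no scaling step appears here.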
How to actually do it
Categorical Variables
• Discrete (but not necessarily disjoint) levels
• Can often seem numeric
• Encoding methods
- Leave as is (but encode into some number - StringIndexer)
- One Hot Encoding
- Binary Encoding
• Text: Tokenizing & Word2Vec
• Decision Trees and variants can often do worse with one hot encoding
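To make the first two encoding methods concrete, here is a small sketch in plain Scala with no Spark dependency. Ordering levels by descending frequency loosely mirrors the default behaviour of Spark's StringIndexer; the alphabetical tie-break and the function names are assumptions of this example.

```scala
object CategoricalEncoding {
  // Map each distinct level to an integer index, most frequent first
  // (ties broken alphabetically so the result is deterministic).
  def stringIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .map { case (level, occurrences) => (level, occurrences.size) }
      .sortBy { case (level, count) => (-count, level) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // One-hot vector of length numLevels with a single 1.0 at the level's index.
  def oneHot(index: Int, numLevels: Int): Vector[Double] =
    Vector.tabulate(numLevels)(i => if (i == index) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val roles = Seq("crew", "passenger", "passenger", "officer")
    val index = stringIndex(roles)
    roles.foreach(r => println(s"$r -> ${oneHot(index(r), index.size)}"))
  }
}
```

The index alone implies an ordering between levels that may not exist, while the one-hot form removes that ordering at the cost of one column per level.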
Numerical Variables
• Can be continuous or discrete
• Can become categorical through binning and bucketing
• To scale variables or not
- Some variables might be weighted higher than others (KNN?)
- Per mini-batch in neural network training
- Normalization or min-max scaling
• Always determine relevant stats:
- Mean
- Variance
- Kurtosis
- Mode
- Median
• Specifically for words: term frequency-inverse document frequency (tf-idf)
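The two scaling options named above can be sketched as follows, assuming each feature is a simple in-memory column of doubles. The function names are illustrative, and the variance uses the population form (divide by n).

```scala
object Scaling {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance (divides by n); use n - 1 for the sample variance.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Standardize to zero mean and unit variance (z-score).
  // A constant column has no spread, so it maps to all zeros.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val sd = math.sqrt(variance(xs))
    if (sd == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - m) / sd)
  }

  // Rescale linearly to the interval [0, 1] (min-max scaling).
  def minMax(xs: Seq[Double]): Seq[Double] = {
    val lo = xs.min
    val hi = xs.max
    if (hi == lo) xs.map(_ => 0.0) else xs.map(x => (x - lo) / (hi - lo))
  }
}
```

Distance-based models such as KNN are the usual motivation: without scaling, the feature with the largest numeric range dominates the distance.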
Cleaning Data
• Correcting for input mistakes or known mistakes in the data (normalization)
- If predicting text: knowing the last few tokens were "Iceland" rather than "ice land"
• Transforming missing data and outliers
- Missing Data -> Imputation
- Mean
- Mode
- Matrix Factorization (A = LDU)
• Converting data into known formats for rest of pipeline
- Can be very simple: Mapping categories to different numbers
- Or very difficult: Cleaning irregularly sampled data
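Mean and mode imputation can be sketched in a few lines, assuming missing values are modelled as `None`. The names below are illustrative, and ties for the mode are resolved by taking the alphabetically last level, which is an arbitrary choice of this example.

```scala
object Imputation {
  // Replace missing numeric values (None) with the mean of the observed ones.
  def imputeMean(xs: Seq[Option[Double]]): Seq[Double] = {
    val observed = xs.flatten
    require(observed.nonEmpty, "need at least one observed value")
    val mean = observed.sum / observed.size
    xs.map(_.getOrElse(mean))
  }

  // Replace missing categorical values with the most frequent observed level.
  // Ties go to the alphabetically last level (an arbitrary but deterministic rule).
  def imputeMode(xs: Seq[Option[String]]): Seq[String] = {
    val observed = xs.flatten
    require(observed.nonEmpty, "need at least one observed value")
    val mode = observed.groupBy(identity)
      .maxBy { case (level, occurrences) => (occurrences.size, level) }._1
    xs.map(_.getOrElse(mode))
  }
}
```

In a real pipeline the mean or mode would be computed on the training split only and reused at prediction time.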
Selecting Features
• Domain Knowledge
• Statistically correlate features
- Weights
- Gini
- Correlation
- Information Gain
- Information Criteria (AIC, BIC, Bayes Factors)
• Global Gains vs Local Gains
• Throw in all the data and then downsample
• Throw in little data and then increase until acceptable tolerance
• Noisy or dirty data
• Some models prefer no inter-feature or other interactions
- Ordinary Least Squares and some exponential families
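As one possible instance of the "Correlation" criterion, the sketch below keeps features whose absolute Pearson correlation with the label reaches a threshold. The threshold, the map-of-columns layout, and the function names are assumptions for illustration; the other criteria on the slide (Gini, information gain, AIC/BIC) would need different code.

```scala
object CorrelationFilter {
  // Pearson correlation coefficient between two equally long columns.
  // Returns 0.0 when either column is constant (correlation is undefined there).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "columns must be non-empty and equally long")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    if (sx == 0.0 || sy == 0.0) 0.0 else cov / (sx * sy)
  }

  // Keep the names of features whose |correlation with the label| >= threshold.
  def select(features: Map[String, Seq[Double]], label: Seq[Double], threshold: Double): Set[String] =
    features.collect {
      case (name, column) if math.abs(pearson(column, label)) >= threshold => name
    }.toSet
}
```

Pearson correlation only captures linear relationships, so a feature dropped by this filter may still be useful to a non-linear model.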
Transformations and Enrichment
• Build Features from Features
- Add
- Subtract (the mean)
- Multiply
- Ratios
- Polynomials (kernel trick in SVMs)
- Rational Differences
- Logarithms, Exponentials, Sigmoids
- Fourier or Laplace Transforms
• Encode categorical variables
• Scaling numerical variables
• Look up tables to bring in additional features
• Transforming dates
- Year, month, day, hour, minute, second
- Holidays and Special Events
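import org.apache.spark.ml.feature.OneHotEncoder

// Spark 2.x API: OneHotEncoder is a Transformer over a column of category indices.
// In Spark 3.x it is an Estimator, so call encoder.fit(categoryDF).transform(categoryDF).
val encoder = new OneHotEncoder()
  .setInputCol("categories")
  .setOutputCol("oneHot")
val encodedDF = encoder.transform(categoryDF)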
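The "Transforming dates" bullet can be illustrated with `java.time` from the JDK. This is a sketch under assumptions: the `Parts` record, the weekend flag, and the holiday lookup set are made up for the example, with the set standing in for the kind of look-up table the slide mentions.

```scala
import java.time.{DayOfWeek, LocalDate, LocalDateTime}

object DateFeatures {
  // Hypothetical feature record; field names are illustrative.
  final case class Parts(
    year: Int, month: Int, day: Int,
    hour: Int, minute: Int, second: Int,
    isWeekend: Boolean, isHoliday: Boolean
  )

  // Split a timestamp into calendar parts and flag weekends and known holidays.
  // `holidays` is a caller-supplied lookup table of dates.
  def extract(ts: LocalDateTime, holidays: Set[LocalDate]): Parts = {
    val dayOfWeek = ts.getDayOfWeek
    Parts(
      ts.getYear, ts.getMonthValue, ts.getDayOfMonth,
      ts.getHour, ts.getMinute, ts.getSecond,
      isWeekend = dayOfWeek == DayOfWeek.SATURDAY || dayOfWeek == DayOfWeek.SUNDAY,
      isHoliday = holidays.contains(ts.toLocalDate)
    )
  }
}
```

Each part becomes its own feature, so a model can pick up hourly or seasonal patterns that a raw timestamp hides.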
Dimension Reduction
• Curse of Dimensionality
- Unhealthy prunes
- Distances look the same
- Slow training
• Optimization in production
• Reduce number of dimensions by learning the most important “directions” or combination of features
- Finding the right basis (eigen) vectors
• Principal Component Analysis
- Singular Value Decomposition on Covariance Matrix
• Factor Analysis
- Expectation-Maximization to find basis vectors
• Non-negative Matrix Factorization
- Poisson-gamma factorization
• Just remove features or transform better!
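To show what "finding the right basis vectors" means for PCA, the sketch below computes only the first principal component. Note the substitution: instead of a full singular value decomposition it uses power iteration on the covariance matrix, which is simpler to write without a linear algebra library. The starting vector and iteration count are assumptions, and the method can fail if the start happens to be orthogonal to the leading direction.

```scala
object PcaSketch {
  type Vec = Vector[Double]

  // Per-column means of the data rows.
  private def columnMeans(data: Seq[Vec]): Vec =
    Vector.tabulate(data.head.size)(j => data.map(_(j)).sum / data.size)

  // Covariance matrix of the rows in `data` (population form, divides by n).
  def covariance(data: Seq[Vec]): Vector[Vec] = {
    val means = columnMeans(data)
    val d = means.size
    Vector.tabulate(d, d) { (i, j) =>
      data.map(row => (row(i) - means(i)) * (row(j) - means(j))).sum / data.size
    }
  }

  // Scale a vector to unit length; a zero vector is returned unchanged.
  private def normalize(v: Vec): Vec = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    if (norm == 0.0) v else v.map(_ / norm)
  }

  // Leading eigenvector of a symmetric matrix via power iteration.
  def firstComponent(cov: Vector[Vec], iterations: Int = 200): Vec = {
    // Asymmetric start (1, 1/2, 1/3, ...) to reduce the chance of an orthogonal start.
    val start = normalize(Vector.tabulate(cov.size)(i => 1.0 / (i + 1)))
    (1 to iterations).foldLeft(start) { (v, _) =>
      normalize(cov.map(row => row.zip(v).map { case (a, b) => a * b }.sum))
    }
  }

  // Project each centered row onto the component: d features become one.
  def project(data: Seq[Vec], component: Vec): Seq[Double] = {
    val means = columnMeans(data)
    data.map(row => row.indices.map(j => (row(j) - means(j)) * component(j)).sum)
  }
}
```

For points lying on the line y = x, the component found this way points along that line, and the projection keeps all of the variance in a single number per row.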
A Harder Use Case: Star-Galaxy Separation
• Stars and galaxies look like little points in space, hard to differentiate on images
• Can read temperatures
• Know that Stars are one-dimensional based on temperature
• Galaxies are composed of various stars
• Bin the temperatures
• Logarithm of temperatures
• Form Features from the combination of differences
• Logarithms on same magnitudes
- No need to scale
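The "logarithm, then differences" recipe can be sketched as below. The input is assumed to be a vector of positive readings per object (for example, one reading per band); that layout and the function name are hypothetical, since the slide does not specify the data format.

```scala
object StarGalaxyFeatures {
  // Pairwise differences of log10 readings for all index pairs i < j.
  // Since log(a) - log(b) = log(a / b), each feature is a log-ratio.
  def logDifferences(readings: Vector[Double]): Vector[Double] = {
    require(readings.forall(_ > 0.0), "logarithm needs positive readings")
    val logs = readings.map(x => math.log10(x))
    for {
      i <- logs.indices.toVector
      j <- logs.indices.toVector
      if i < j
    } yield logs(i) - logs(j)
  }
}
```

Because every feature is a log-ratio, a common multiplicative factor across the readings cancels out, which is consistent with the slide's remark that no further scaling is needed.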
Philosophy
• Explore:
- Not sure what exactly is there
- Try out different combinations until you see trends
- Select all and throw out as needed
• Confirmation:
- Have some hypothesis on how data is supposed to behave
- Statistical tests
- Add Feature if it passes your confirmation test
• Hybrid:
- Data sets often have a combination of features that fit both philosophies
New Horizons
• Deep Learning: Feature Engineering -> Architecture building
- Still need good data and good features
• Meta Features
- Models building features for other models
• Automation!
- Combinatorial complexity