The document discusses lessons learned from implementing a sparse logistic regression algorithm in Spark, highlighting optimization techniques and the importance of choosing data representations suited to distributed execution. Key insights include the use of mini-batch gradient descent, better bias initialization, and the Adam optimizer to speed up convergence. Combined, these improvements yielded a 40x reduction in iteration time.
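
To make the listed techniques concrete, here is a minimal, single-machine sketch of mini-batch gradient descent with Adam for logistic regression. It is illustrative only, not the post's Spark implementation: it uses dense numpy arrays rather than a sparse distributed representation, the log-odds bias initialization is an assumed reading of "better bias initialization", and the function name `train` and the hyperparameter defaults are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, batch_size=256, lr=0.01, epochs=5,
          beta1=0.9, beta2=0.999, eps=1e-8):
    n, d = X.shape
    w = np.zeros(d)
    # Bias set to the log-odds of the positive rate, so the model
    # starts at the base rate instead of 0.5 (assumed interpretation
    # of the post's "better bias initialization").
    p = y.mean()
    b = np.log(p / (1.0 - p))
    # Adam state: first and second moment estimates for w and b.
    m_w = np.zeros(d); v_w = np.zeros(d)
    m_b = 0.0; v_b = 0.0
    t = 0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the logistic loss on this mini-batch.
            err = sigmoid(Xb @ w + b) - yb
            g_w = Xb.T @ err / len(idx)
            g_b = err.mean()
            # Adam update with bias-corrected moment estimates.
            t += 1
            m_w = beta1 * m_w + (1 - beta1) * g_w
            v_w = beta2 * v_w + (1 - beta2) * g_w**2
            m_b = beta1 * m_b + (1 - beta1) * g_b
            v_b = beta2 * v_b + (1 - beta2) * g_b**2
            mw_hat = m_w / (1 - beta1**t); vw_hat = v_w / (1 - beta2**t)
            mb_hat = m_b / (1 - beta1**t); vb_hat = v_b / (1 - beta2**t)
            w -= lr * mw_hat / (np.sqrt(vw_hat) + eps)
            b -= lr * mb_hat / (np.sqrt(vb_hat) + eps)
    return w, b
```

In a distributed setting, the per-batch gradient would be computed as an aggregate over partitions of the sparse feature matrix; the optimizer update itself is the same element-wise arithmetic shown above.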