Building massive scale,
    fault tolerant,
job processing systems
    with Scala Akka
      framework
     Vignesh Sukumar
        SVCC 2012
About me

• Storage group, Backend Engineering at Box
• Love enterprise software!
• Interested in Big Data and building distributed
  systems in the cloud
About Box

• Leader in enterprise cloud collaboration and
  storage
• Cutting-edge work in backend, frontend,
  platform and engineering services
• A really fun place to work – we have a long
  slide!
Talk outline
• Job processing requirements
• Traditional & new models for job processing

• Akka actors framework
• Achieving and controlling high IO throughput
• Fine-grained fault tolerance
Typical architecture in a cloud storage
             environment
Practical realities

• Storage nodes usually have varying
  configurations (OS, processing power, storage
  capacity, etc.), mainly because provisioning
  practices evolve rapidly
• Some nodes are more heavily loaded than
  others (for example, those accepting live uploads)
• Billions of files; petabytes of data
Job processing requirements

• Iterate over all files (billions, petabyte scale):
  for example, check the consistency of all files

• High throughput

• Fault tolerant

• Secure
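
A consistency check of this kind can be pictured as recomputing each file's checksum and comparing it with a recorded value. The sketch below is only an illustration under assumptions: the slides do not say which checksum is used or where the expected value comes from, so SHA-256 and the `expectedSha256Hex` parameter are hypothetical choices.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object ConsistencyCheck {
  /** Hex-encoded SHA-256 of a file's content, streamed in 64 KiB chunks.
    * SHA-256 is an assumption; the talk does not name a checksum. */
  def sha256Hex(path: Path): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val in = Files.newInputStream(path)
    try {
      val buffer = new Array[Byte](64 * 1024)
      var read = in.read(buffer)
      while (read != -1) {
        digest.update(buffer, 0, read)
        read = in.read(buffer)
      }
    } finally in.close()
    digest.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  /** True when the file on disk matches the digest recorded in metadata
    * (the metadata lookup itself is outside this sketch). */
  def isConsistent(path: Path, expectedSha256Hex: String): Boolean =
    sha256Hex(path).equalsIgnoreCase(expectedSha256Hex)
}
```

Streaming the file in chunks keeps memory use flat regardless of file size, which matters when the same check runs over billions of files.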
Traditional job processing model
Why traditional models fail in cloud
       storage environments
• Not scalable: petabyte scale, billions of files
• Insecure: cannot move files out of storage
  nodes
• No performance control: easy to overwhelm
  any storage node
• No fine-grained fault tolerance
Compute on Storage

• Move job computation directly to storage
  nodes
• Utilize abundant CPU on storage nodes
• Metadata store still stays in a highly available
  system such as an RDBMS
• Results from operations on a file are
  completely independent
Master – slave architecture
Benefits

• High IO throughput: Direct access; no transfer
  of files over a network
• Secure: files do not leave storage nodes
• Better performance control: compute can
  easily monitor system load and back off
• Better fault tolerance: finer-grained
  handling of errors
Master node

• Responsible for accepting job submissions and
  splitting them into tasks for slave nodes
• Stateful: keeps a durable copy of jobs and tasks
  in ZooKeeper
• Horizontally scalable: service can be run on
  multiple nodes
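
The job-splitting step can be sketched as grouping a job's files by the storage node that holds them and cutting each group into bounded tasks. This is a minimal illustration, not the master's actual logic: the `Task` type, the `fileToNode` map, and the `maxFilesPerTask` parameter are all hypothetical names introduced here.

```scala
// Hypothetical data model; the slides do not show the real one.
final case class Task(jobId: String, node: String, files: Vector[String])

object JobSplitter {
  /** Group a job's files by storage node, then cut each node's file list
    * into tasks of at most `maxFilesPerTask` files. Nodes and files are
    * sorted so the result is deterministic. */
  def split(jobId: String,
            fileToNode: Map[String, String],
            maxFilesPerTask: Int): Vector[Task] = {
    require(maxFilesPerTask > 0, "maxFilesPerTask must be positive")
    fileToNode.toVector
      .groupBy { case (_, node) => node }
      .toVector
      .sortBy { case (node, _) => node }
      .flatMap { case (node, entries) =>
        entries.map(_._1).sorted
          .grouped(maxFilesPerTask)
          .map(chunk => Task(jobId, node, chunk))
          .toVector
      }
  }
}
```

Because every task names exactly one node, an agent only ever touches files it can read locally, which matches the compute-on-storage model described earlier.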
Agent

• Runs directly on the storage nodes in a
  machine-independent JVM container
• Stateless: no task state is maintained
• Monitors system load with back-off
• Reports results directly to master without
  synchronizing with other agents
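
One way to picture load monitoring with back-off is a pause that grows with the node's load. The sketch below is an assumption-laden illustration: the slides do not say which load signal the agent reads or how the delay is computed, so the per-core load average, the `threshold`, and the linear delay formula are all hypothetical.

```scala
import java.lang.management.ManagementFactory

object LoadBackoff {
  /** One-minute load average per core, or None where the JVM cannot
    * report it (getSystemLoadAverage returns a negative value then). */
  def loadPerCore(): Option[Double] = {
    val os = ManagementFactory.getOperatingSystemMXBean
    val load = os.getSystemLoadAverage
    if (load < 0) None else Some(load / os.getAvailableProcessors)
  }

  /** Pause in milliseconds before starting the next task: zero at or below
    * the threshold, growing linearly with the excess load, capped at maxDelayMs. */
  def delayMillis(loadPerCore: Double, threshold: Double, maxDelayMs: Long): Long = {
    require(threshold > 0, "threshold must be positive")
    if (loadPerCore <= threshold) 0L
    else math.min(maxDelayMs, ((loadPerCore - threshold) / threshold * maxDelayMs).toLong)
  }
}
```

An agent could call `loadPerCore()` between tasks and sleep for `delayMillis(...)` before picking up more work, so live traffic on the node gets priority.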
Implementation with the
   Scala Akka Actor
       framework
Actors

• Concurrent threads abstraction with no
  shared state
• Exchange messages
• Asynchronous, non-blocking
• Multiple actors can map to a single OS thread
• Parent-children hierarchical relationship
Actors and messages
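import akka.actor.{Actor, ActorSystem, Props}

// a message type the actor understands
case object MsgType1

class MyActor extends Actor {
  def receive = {
    case MsgType1 => // do something
  }
}

// instantiation and sending messages
val system = ActorSystem("example")
val actorRef = system.actorOf(Props(new MyActor))
actorRef ! MsgType1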
Agent Actor System
Achieving high IO throughput
• Parallel, asynchronous IO through “Futures”
val fileIOResult = Future {
  // issue high latency tasks like file IO
}
val networkIOResult = Future {
  // read from network
}

Futures.awaitAll(<wait time>, fileIOResult, networkIOResult)
fileIOResult onSuccess {
  case result => // do something
}
networkIOResult onFailure {
  case error => // retry
}
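
The slide's snippet uses the Akka Future API of the time. As a rough equivalent, here is a minimal sketch of the same pattern with the standard library's `scala.concurrent`: start both operations in parallel, wait a bounded time for both, then inspect each outcome separately. The `readFile` and `readNetwork` functions and the 5-second timeout are stand-ins invented for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object ParallelIO {
  // Stand-ins for real file and network reads (hypothetical).
  def readFile(): String = "file-bytes"
  def readNetwork(): String = throw new java.io.IOException("connection reset")

  /** Turn a Future into one that always succeeds with a Try, so waiting on
    * several futures does not short-circuit on the first failure. */
  def settle[A](f: Future[A]): Future[Try[A]] =
    f.map[Try[A]](a => Success(a)).recover { case e => Failure(e) }

  def run(): (Try[String], Try[String]) = {
    // Both futures start running as soon as they are created.
    val fileIOResult = Future(readFile())
    val networkIOResult = Future(readNetwork())
    val both = settle(fileIOResult).zip(settle(networkIOResult))
    Await.result(both, 5.seconds)
  }
}
```

Calling `ParallelIO.run()` yields a pair of `Try` values; a caller can pattern match on `Failure` for the network result to decide whether to retry, which mirrors the `onFailure` hook on the slide.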
Controlling system throughput

• The problem: agents need to throttle
  themselves because storage nodes also serve live traffic

• Adjust number of parallel workers dynamically
  through a monitoring service
Controlling throughput: Examples

• Parallelism parameters can be fetched from a
  separate configuration service on a per-node
  basis
• Some machines can be sped up and others
  slowed down this way
• The configuration can be updated on a cron
  schedule to speed up during weekends
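
The idea of per-node parallelism can be sketched as a lookup that returns a worker count, which then sizes a thread pool. This is only an illustration under assumptions: the real configuration service is not shown in the talk, so the `NodeLimits` type, the in-memory `config` map, and the built-in weekend rule (standing in for the cron-driven update) are hypothetical.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration.Duration

// Hypothetical per-node settings; a real system would fetch these remotely.
final case class NodeLimits(weekday: Int, weekend: Int)

object Parallelism {
  /** Worker count for a node on a given date, falling back to `default`
    * for nodes without an explicit entry. */
  def workersFor(node: String, date: LocalDate,
                 config: Map[String, NodeLimits], default: NodeLimits): Int = {
    val limits = config.getOrElse(node, default)
    val day = date.getDayOfWeek
    if (day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY) limits.weekend
    else limits.weekday
  }

  /** Run all tasks with at most `workers` running at once; results keep input order. */
  def runAll[A](tasks: Seq[() => A], workers: Int): Seq[A] = {
    require(workers > 0, "workers must be positive")
    val pool = Executors.newFixedThreadPool(workers)
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.sequence(tasks.map(task => Future(task()))), Duration.Inf)
    finally pool.shutdown()
  }
}
```

Re-reading the worker count before each batch of tasks is one simple way to let a configuration change take effect without restarting the agent.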
Fine-grained fault tolerance with
              Supervisors

• Parents of child actors can define specific
  fault-handling strategies for each failure
  scenario in their children
• Components can fail gracefully without
  affecting the entire system
Supervision strategy: Examples


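import akka.actor.{Actor, OneForOneStrategy}
import akka.actor.SupervisorStrategy.{Restart, Resume, Stop}
import java.io.IOException
import java.sql.SQLException

class TaskActor extends Actor {
  // create child workers
  override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 3) {
    case _: SQLException            => Resume  // retry the same file
    case _: FileCorruptionException => Stop    // don't clobber it!
    case _: IOException             => Restart // report and move on
  }
  // receive is omitted on the slide
}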
Unit testing

• Scalatra test framework: very easy to read!
    TaskActorTest.receive(BadFileMsg) must throw FileNotFoundException
• Mocks for network and database calls
    val mockHttp = mock[HttpExecutor]
    TaskActorTest ! doHttpPost
    there was atLeastOne(mockHttp).POST


• Extensive testing of failure injection scenarios
Takeaways
• Keep your architecture simple by modeling
  actor message flow along the same paths as
  parent-child actor hierarchy (i.e., no message
  exchange between peer child actors)
• Design and implement for component failures
• Write unit tests extensively: we had no
  breakage of fundamental functionality
• Box Engineering is awesome!

Editor's Notes

  • #8: An example job is checking the consistency of all files: this involves iterating over every file on all storage nodes, reading each file, and verifying content integrity.
  • #10: Not scalable: performance suffers because of the IO bottleneck in getting files to the application cluster. Insecure: application clusters can store the files locally. No performance control: it is easy to melt a single storage node by reading or writing a lot to it. No fine-grained fault tolerance.