CommonCrawl
Building an open Web-Scale crawl using Hadoop.
Ahad Rana
Architect / Engineer at CommonCrawl
ahad@commoncrawl.org
Who is CommonCrawl?
• A 501(c)3 non-profit “dedicated to building, maintaining and
making widely available a comprehensive crawl of the
Internet for the purpose of enabling a new wave of
innovation, education and research.”
• Funded through a grant by Gil Elbaz, former Googler and
founder of Applied Semantics, and current CEO of Factual Inc.
• Board members include Carl Malamud and Nova Spivack.
Motivations Behind CommonCrawl
• Internet is a massively disruptive force.
• Exponential advances in computing capacity, storage and
bandwidth are creating constant flux and disequilibrium in the IT
domain.
• Cloud computing makes large scale, on-demand computing
affordable for even the smallest startup.
• Hadoop provides the technology stack that enables us to crunch
massive amounts of data.
• Having the ability to “Map-Reduce the Internet” opens up lots of
new opportunities for disruptive innovation and we would like to
reduce the cost of doing this by an order of magnitude, at least.
• The trend among webmasters of whitelisting only the major search engines puts
the future of the Open Web at risk and stifles future search
innovation and evolution.
Our Strategy
• Crawl broadly and frequently across all TLDs.
• Prioritize the crawl based on simplified criteria (rank and
freshness).
• Upload the crawl corpus to S3.
• Make our S3 bucket widely accessible to as many users as
possible.
• Build support libraries to facilitate access to the S3 data via
Hadoop (a minimal access sketch follows this list).
• Focus on doing a few things really well.
• Listen to customers and open up more metadata and services
as needed.
• We are not a comprehensive crawl, and may never be one.
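
As a minimal sketch of the kind of S3 access the support libraries make easier, the following Java program lists crawl output through Hadoop's s3n:// native S3 filesystem. The bucket name and key prefix are hypothetical placeholders, and this is illustrative only, not the official library.

    // A minimal sketch, not the official support library: listing crawl output
    // in S3 from Hadoop via the older s3n:// native S3 filesystem.
    // The bucket name and key prefix below are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListCrawlOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials for the s3n filesystem (Hadoop 0.20-era property names);
        // assumes the AWS keys are present in the environment.
        conf.set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"));
        conf.set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"));

        // Hypothetical location; substitute the real bucket and prefix.
        Path prefix = new Path("s3n://example-crawl-bucket/crawl-001");
        FileSystem fs = prefix.getFileSystem(conf);

        for (FileStatus status : fs.listStatus(prefix)) {
          System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
      }
    }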
Some Numbers
• URLs in Crawl DB – 14 billion
• URLs with inverse link graph – 1.6 billion
• URLs with content in S3 – 2.5 billion
• Recently crawled documents – 500 million
• Uploaded documents after deduping – 300 million
• Newly discovered URLs – 1.9 billion
• # of Vertices in Page Rank Graph (recent calculation) – 3.5 billion
• # of Edges in Page Rank Graph (recent calculation) – 17 billion
Current System Design
• Batch oriented crawl list generation.
• High volume crawling via independent crawlers.
• Crawlers dump data into HDFS.
• Map-Reduce jobs parse and extract metadata from crawled
documents in bulk, independently of the crawlers.
• Periodically, we ‘checkpoint’ the crawl, which involves, among
other things:
– Post processing of crawled documents (deduping etc.)
– ARC file generation
– Link graph updates
– Crawl database updates.
– Crawl list regeneration.
Our Cluster Config
• Modest internal cluster consisting of 24 Hadoop nodes, 4
crawler nodes, and 2 NameNode / Database servers.
• Each Hadoop node has 6 x 1.5 TB drives and dual quad-core
Xeons with 24 or 32 GB of RAM.
• 9 map tasks per node, 4 reducers per node on average, and BLOCK
compression using LZO (sketched as configuration properties below).
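
Expressed as Hadoop 0.20-era property names, the slot counts and compression settings above would look roughly like the sketch below. The property names are standard Hadoop ones, but treat the snippet as illustrative rather than our exact configuration; the LZO codec class assumes the hadoop-lzo library is installed on the cluster.

    // A hedged sketch of the settings above as Hadoop 0.20-era properties.
    // The two slot maximums are read by each TaskTracker from its own
    // mapred-site.xml and are shown here only to document the values; the
    // compression properties can be set per job.
    import org.apache.hadoop.conf.Configuration;

    public class ClusterSettings {
      public static Configuration describe() {
        Configuration conf = new Configuration(false);

        // Per-node task slots (TaskTracker-level settings): 9 maps, 4 reduces.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 9);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);

        // BLOCK-compressed job output using LZO (job-level settings).
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        conf.set("mapred.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        return conf;
      }
    }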
Crawler Design Overview
Crawler Design Details
• Java codebase.
• Asynchronous I/O model using a custom NIO-based HTTP stack (a
simplified sketch follows this list).
• Lots of worker threads that synchronize with the main thread via
asynchronous message queues.
• Can sustain a crawl rate of ~250 URLs per second.
• Up to 500 active HTTP connections at any one time.
• Currently, no document parsing in the crawler process.
• We currently run 8 crawlers and crawl on average ~100 million
URLs per day, when crawling.
• During the post-processing phase, we process on average 800
million documents.
• After deduping, we package and upload on average
approximately 500 million documents to S3.
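
A greatly simplified sketch of the asynchronous I/O idea follows: a single selector thread driving one non-blocking HTTP fetch. This is illustrative only; the real crawler uses its own NIO HTTP stack plus robots handling, politeness throttling, and the worker-thread message queues mentioned above, none of which is shown here.

    // Illustrative only: a single-selector, non-blocking HTTP/1.0 fetch.
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;

    public class MiniFetcher {
      public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();

        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.connect(new InetSocketAddress("example.org", 80));
        channel.register(selector, SelectionKey.OP_CONNECT);

        ByteBuffer request = ByteBuffer.wrap(
            "GET / HTTP/1.0\r\nHost: example.org\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII));
        ByteBuffer response = ByteBuffer.allocate(64 * 1024);

        while (selector.select() > 0) {
          for (SelectionKey key : selector.selectedKeys()) {
            SocketChannel ch = (SocketChannel) key.channel();
            if (key.isConnectable() && ch.finishConnect()) {
              key.interestOps(SelectionKey.OP_WRITE);   // connected: send request
            } else if (key.isWritable()) {
              ch.write(request);
              if (!request.hasRemaining()) {
                key.interestOps(SelectionKey.OP_READ);  // request sent: await response
              }
            } else if (key.isReadable()) {
              if (ch.read(response) == -1) {            // EOF: document fetched
                System.out.println(new String(response.array(), 0,
                    response.position(), StandardCharsets.ISO_8859_1));
                ch.close();
                selector.close();
                return;
              }
            }
          }
          selector.selectedKeys().clear();
        }
      }
    }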
Crawl Database
• Primary keys are 128-bit URL fingerprints, consisting of a 64-bit
domain fingerprint and a 64-bit URL fingerprint (Rabin hash); a
key/shard sketch follows this list.
• Keys are distributed via a modulo operation on the URL portion of
the fingerprint only.
• Currently, we run 4 reducers per node, and there is one node
down, so we have 92 unique shards.
• Keys in each shard are sorted by domain FP, then URL FP.
• We like the 64-bit domain id, since it is a generated key, but it
is wasteful.
• We may move to a 32-bit root domain id / 32-bit domain id +
64-bit URL fingerprint key scheme in the future, and then sort by
root domain, domain, and then FP per shard.
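
A hedged sketch of the key scheme: the real system uses Rabin fingerprints, but the stand-in 64-bit hash below (FNV-1a) keeps the example self-contained while showing the 64-bit domain / 64-bit URL split and the shard-by-URL-modulo rule.

    // Hedged sketch of the 128-bit fingerprint key and shard assignment.
    // FNV-1a is only a stand-in; production uses Rabin fingerprints.
    public final class UrlFingerprint {
      final long domainFP;
      final long urlFP;

      UrlFingerprint(String domain, String url) {
        this.domainFP = fnv1a64(domain);
        this.urlFP = fnv1a64(url);
      }

      /** Shard assignment uses only the URL portion of the fingerprint. */
      int shard(int numShards) {
        // Mask the sign bit so the modulo result is non-negative.
        return (int) ((urlFP & Long.MAX_VALUE) % numShards);
      }

      // Stand-in 64-bit hash (FNV-1a), not the production Rabin hash.
      static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
          hash ^= s.charAt(i);
          hash *= 0x100000001b3L;
        }
        return hash;
      }

      public static void main(String[] args) {
        UrlFingerprint fp = new UrlFingerprint("commoncrawl.org",
                                               "http://commoncrawl.org/about/");
        // With 92 shards (23 live nodes x 4 reducers), as described above.
        System.out.println("shard = " + fp.shard(92));
      }
    }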
Crawl Database – Continued
• Values in the Crawl Database consist of extensible metadata
structures.
• We currently use our own DDL and compiler for generating
structures (vs. using Thrift/ProtoBuffers/Avro).
• Avro / ProtoBufs were not available when we started, and we
added lots of Hadoop-friendly features to our version (multi-part [key]
attributes lead to auto-generated WritableComparable-derived classes, with
built-in RawComparator support, etc.).
• Our compiler also generates RPC stubs, with Google ProtoBuf-style
message-passing semantics (Message with optional Struct in, optional
Struct out) instead of Thrift-style semantics (method with multiple
arguments and a return type); both styles are illustrated after this list.
• We prefer the former because it is better suited to our preference
for an asynchronous style of RPC programming.
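
The hypothetical Java interfaces below illustrate the two RPC styles being contrasted. They are not our actual generated stubs, just a sketch of why single-request / single-response messages with a completion callback suit asynchronous programming better than multi-argument method calls.

    // Hypothetical interfaces illustrating the two RPC styles discussed above
    // (not CommonCrawl's actual generated stubs).

    // Thrift-style: a method with multiple arguments and a return value.
    interface CrawlerControlThriftStyle {
      int setCrawlRate(String crawlerId, int urlsPerSecond, boolean persist);
    }

    // ProtoBuf-style message passing: one optional "in" struct, one optional
    // "out" struct, completion delivered asynchronously via a callback.
    interface CrawlerControlMessageStyle {
      final class SetCrawlRateRequest  { String crawlerId; int urlsPerSecond; boolean persist; }
      final class SetCrawlRateResponse { int effectiveRate; }

      interface Callback<T> {
        void onCompleted(T response);
        void onFailed(Exception e);
      }

      void setCrawlRate(SetCrawlRateRequest request,
                        Callback<SetCrawlRateResponse> done);
    }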
Map-Reduce Pipeline – Parse/Dedupe/ARC Generation
Phase 1
Phase 2
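
As an illustration of the dedupe step named in the pipeline title, here is a hedged sketch of a reducer that keeps one representative URL per content hash (old mapred API; the record layout is an assumption, not our exact implementation).

    // Hedged illustration of content-hash deduplication.
    // Input key: hex content hash emitted by the map side.
    // Input values: all URLs whose fetched content produced that hash.
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DedupeReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text contentHash, Iterator<Text> urls,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        if (urls.hasNext()) {
          // Keep one representative URL per unique content hash; the rest
          // are duplicates and are dropped from the upload set.
          output.collect(contentHash, urls.next());
        }
      }
    }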
Map-Reduce Pipeline – Link Graph Construction
Link Graph Construction
Inverse Link Graph Construction
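
As an illustration of inverse link graph construction, a hedged sketch of a mapper that inverts each (source, target) edge so the shuffle groups all in-links of a target URL together; the input format and record layout are assumptions for illustration.

    // Hedged sketch: invert outgoing edges to build the inverse link graph.
    // Input:  key = source URL, value = one outlink (target URL).
    // Output: key = target URL, value = source URL.
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertLinksMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      public void map(Text sourceUrl, Text targetUrl,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        output.collect(targetUrl, sourceUrl);
      }
    }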
Map-Reduce Pipeline – PageRank Edge Graph Construction
Page Rank Process
Distribution Phase
Calculation Phase
Generate Page Rank Values
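
A hedged sketch of the calculation phase: the distribution-phase map is assumed to emit, for every in-link of a URL, the contribution rank(source) / outdegree(source); the reducer below sums those contributions and applies the standard damping factor. Record layout and the bookkeeping needed to carry the link structure into the next iteration are omitted.

    // Hedged sketch of the PageRank calculation phase (old mapred API).
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PageRankCalcReducer extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      private static final double DAMPING = 0.85;

      // Values: contributions rank(src) / outdegree(src) emitted by the
      // distribution-phase map for every in-link of this URL.
      public void reduce(Text url, Iterator<DoubleWritable> contributions,
                         OutputCollector<Text, DoubleWritable> output,
                         Reporter reporter) throws IOException {
        double sum = 0.0;
        while (contributions.hasNext()) {
          sum += contributions.next().get();
        }
        // Classic PageRank update: PR(u) = (1 - d) + d * sum of contributions.
        output.collect(url, new DoubleWritable((1.0 - DAMPING) + DAMPING * sum));
      }
    }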
The Need For a Smarter Merge
• The pipelining nature of HDFS means each reducer writes its
output to local disk first, then to (replication level – 1) other nodes.
• If intermediate record sets are already sorted, having to run an
identity map / shuffle / merge-sort phase just to join two
sorted record sets is very expensive (a general illustration follows).
Our Solution:
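
As a general illustration of joining record sets that are already sorted and identically partitioned without the extra identity-map/shuffle pass (not necessarily the approach CommonCrawl adopted), Hadoop's map-side join support can be configured as below; the paths are hypothetical placeholders.

    // Illustration only: map-side join of two pre-sorted, identically
    // partitioned inputs using Hadoop's join support (old mapred API).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideMergeJob {
      public static JobConf configure(JobConf conf) {
        Path crawlDbShards = new Path("/crawl/crawldb");     // sorted shards, input 1
        Path linkGraphShards = new Path("/crawl/linkgraph"); // sorted shards, input 2

        // Both inputs must have the same number of partitions, sorted by the
        // same key; the join then happens in the mappers, with no reduce.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class, crawlDbShards, linkGraphShards));
        conf.setNumReduceTasks(0);
        return conf;
      }
    }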
