What is Web Scraping in Node.js?

Last Updated : 29 Jul, 2024

Web scraping is the automated process of extracting data from websites. It involves using a script or a program to collect information from web pages, which can then be stored or used for various purposes such as data analysis, research, or application development. In Node.js, web scraping is commonly performed using libraries and tools that facilitate HTTP requests and HTML parsing.

Why Use Web Scraping?

Data Collection: Gather data from multiple sources for research, analysis, or machine learning.
Market Research: Track competitors' pricing and product details.
Content Aggregation: Compile information from different websites into a single platform.
Automation: Automate repetitive tasks like checking website updates.

Tools and Libraries for Web Scraping in Node.js

Here are some popular tools and libraries used for web scraping in Node.js:

Axios: For making HTTP requests.
Cheerio: For parsing and manipulating HTML.
Puppeteer: For scraping JavaScript-heavy websites using a headless browser.
Node-fetch: A lightweight HTTP request library.
Request-promise: A promise-based HTTP request library (now deprecated, along with the request package it wraps).
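For static pages, scraping comes down to two phases: download the HTML, then extract the part you need. The sketch below illustrates that flow using only Node.js built-ins, assuming Node.js 18 or later for the global fetch. The function names extractTagText and scrapeHeading are hypothetical, and the regular expression is a deliberate simplification standing in for a real parser such as Cheerio, which is the more robust choice on real pages.

```javascript
// Minimal sketch of the fetch-then-parse flow for a static page.
// Assumes Node.js 18+ (global fetch). The regex below is a simplification
// for illustration; a parser such as Cheerio handles real-world HTML better.

// Return the text of the first <tag>...</tag> in html, or null if absent.
function extractTagText(html, tag) {
  const pattern = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'i');
  const match = html.match(pattern);
  if (!match) return null;
  // Drop nested tags and collapse whitespace.
  return match[1].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
}

// Download a page and return the text of its first <h1>.
async function scrapeHeading(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractTagText(await response.text(), 'h1');
}

// Offline demonstration on an inline HTML string (no network needed).
const sampleHtml =
  '<html><body><h1 class="title">Hello <em>Node.js</em></h1></body></html>';
console.log(extractTagText(sampleHtml, 'h1')); // Hello Node.js
```

Calling scrapeHeading with a real URL performs the network request. This approach only sees the HTML the server returns, so content rendered by client-side JavaScript requires a headless browser such as Puppeteer.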

Puppeteer

Node.js has many modules for web scraping, and Puppeteer is one of the most popular and easiest to use. It controls a headless browser and provides methods that simplify both web scraping and browser automation. Install it in the project directory with the following command:

npm install puppeteer

Installation Steps

Step 1: Create a folder for the project.

mkdir myapp

Step 2: Navigate to the project directory.

cd myapp

Step 3: Initialize the Node.js project inside the myapp folder.

npm init -y

Step 4: Install the required dependencies with the following command:

npm install puppeteer

The updated dependencies in the package.json file will look like this:

"dependencies": {
    "puppeteer": "^22.12.1"
}

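Step 5: Create an async function. Puppeteer's methods return promises, so the scraping steps are written inside an async function and awaited.

async function webScraper() {
    // ... scraping steps go here
}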

webScraper();
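Because every step in the scraper is asynchronous, a failure in any of them rejects the promise returned by the function. The sketch below shows one way to handle that at the call site. It uses a hypothetical fakeStep helper in place of real Puppeteer calls so it can run without a browser; the step names are illustrative only.

```javascript
// fakeStep is a hypothetical stand-in for an asynchronous Puppeteer call.
// It resolves with its name after a short delay, or rejects if told to fail.
function fakeStep(name, shouldFail = false) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (shouldFail) reject(new Error(`${name} failed`));
      else resolve(name);
    }, 10);
  });
}

// Runs the steps in order; awaiting each one means a rejection
// stops the sequence and propagates to the caller.
async function webScraper(failAt) {
  const done = [];
  for (const step of ['launch', 'newPage', 'goto']) {
    done.push(await fakeStep(step, step === failAt));
  }
  return done;
}

// Handle success and failure where the function is called.
webScraper()
  .then((steps) => console.log('completed:', steps.join(', ')))
  .catch((err) => console.error('scrape failed:', err.message));
```

Passing a step name, for example webScraper('goto'), makes that step reject, and the catch handler receives the error instead of the process ending with an unhandled rejection.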

Step 6: Inside the function, create two constants: browser, which holds the browser instance launched by Puppeteer, and page, which represents a new page (tab) used for scraping.

const puppeteer = require('puppeteer');

async function webScraper() {
    const browser = await puppeteer.launch({})
    const page = await browser.newPage()
}

webScraper();

Step 7: Use the goto method to open the website to scrape, wait for the element whose text we want, extract the text from that element, and log it to the console.

// Open the page to scrape
await page.goto(
    'https://www.geeksforgeeks.org/explain-the-mechanism-of-event-loop-in-node-js/')
// Wait for the first <h1> element and get a handle to it
let element = await page.waitForSelector("h1")
// Read the element's text inside the page context
let text = await page.evaluate(element => element.textContent, element)
console.log(text)
// Close the browser so the Node.js process can exit
await browser.close()

Example: Implementation to show web scraping in Node.js

JavaScript

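// app.js

const puppeteer = require('puppeteer');

async function webScraper() {
    // Launch a headless browser and open a new page
    const browser = await puppeteer.launch({})
    const page = await browser.newPage()

    // Open the page to scrape
    await page.goto(
        'https://www.geeksforgeeks.org/explain-the-mechanism-of-event-loop-in-node-js/')

    // Wait for the first <h1> element and get a handle to it
    let element = await page.waitForSelector("h1")

    // Read the element's text inside the page context
    let text = await page.evaluate(
        element => element.textContent, element)
    console.log(text)

    // Close the browser so the Node.js process can exit
    await browser.close()
}

webScraper();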
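If navigation or element lookup throws in the example above, the browser is never closed and the process can hang. One possible refinement, sketched here under stated assumptions, is a reusable helper that wraps the page work in try/finally. The names scrapeText and createStubBrowser are hypothetical; the stub only mimics the few Puppeteer page methods the helper calls (goto, waitForSelector, $eval, close) so the sketch can run without launching a real browser.

```javascript
// Sketch: a reusable helper that always releases the page, even on errors.
// `browser` is expected to expose Puppeteer's newPage() method.
async function scrapeText(browser, url, selector) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector(selector);
    // $eval runs the callback in the page against the first matching element.
    return await page.$eval(selector, (el) => el.textContent.trim());
  } finally {
    // Runs whether the steps above succeeded or threw.
    await page.close();
  }
}

// Hypothetical stub browser, for demonstration only. It records calls in
// `log` and serves text from a selector-to-text map instead of a real page.
function createStubBrowser(textBySelector) {
  const log = [];
  const page = {
    goto: async (url) => { log.push(`goto ${url}`); },
    waitForSelector: async (sel) => {
      if (!(sel in textBySelector)) throw new Error(`No element matches ${sel}`);
    },
    $eval: async (sel, fn) => fn({ textContent: textBySelector[sel] }),
    close: async () => { log.push('close'); },
  };
  return { log, newPage: async () => page };
}

const stub = createStubBrowser({ h1: '  Example heading  ' });
scrapeText(stub, 'https://example.com/', 'h1').then((text) => {
  console.log(text);     // Example heading
  console.log(stub.log); // [ 'goto https://example.com/', 'close' ]
});
```

With real Puppeteer, the same helper would be called with the object returned by puppeteer.launch(), and the caller would await browser.close() in its own finally block once all pages have been scraped.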

Step to run the application: Open the terminal and type the following command.