Node.js and Playwright: A Comprehensive Step-by-Step Guide to Web Scraping

Web scraping is a powerful tool for extracting data from websites, but it can be challenging to get started with if you're unfamiliar with the right tools and technologies. Today, we'll be exploring how to use Node.js and Playwright to create a web scraping script that can extract data from any website. Our blog post will cover the following points:

  • What is web scraping?
  • Why use Node.js and Playwright for web scraping?
  • Creating a Playwright script: launching a browser, creating a page, navigating to a website, and interacting with the web page.
  • Parsing and extracting data from the web page.


By the end of this blog post, you'll have a solid understanding of how to use Node.js and Playwright to scrape data from any website.

Our objective is to scrape information about the blogs on the esterox.com website, including their URL, image URL, name, and creation date. To accomplish this, we will navigate to the blogs page of the esterox.com website and extract these fields from each blog entry listed there.


What is web scraping?

Web scraping is the process of automatically extracting data from websites. It involves writing code that can programmatically visit web pages, parse the HTML or other structured data on those pages, and extract the data that is needed.

Web scraping can be used for a variety of purposes, such as collecting data for research, monitoring websites for changes or updates, or gathering data for business intelligence purposes. Some examples of data that can be extracted using web scraping include product information, customer reviews, job listings, real estate listings, and news articles.

Overall, web scraping is a powerful tool for data collection and analysis, but it requires careful consideration of ethical and legal considerations, as well as technical expertise in the tools and technologies used for scraping.


Why use Node.js and Playwright for web scraping?


Node.js and Playwright are a great combination for web scraping for several reasons. Node.js is a lightweight, efficient, and popular JavaScript runtime that makes it easy to write server-side applications in JavaScript. Playwright is a Node.js library that provides a high-level API for automating web browsers, supports multiple browser types, and offers built-in support for modern web technologies. Together, they make it easy to visit pages, extract data, and save it in the desired format.

 

Setting up the environment


Before we can start scraping with Node.js and Playwright, we need to set up our development environment. First, we need to install Node.js on our machine.
We can download and install Node.js from the official Node.js website by following the installation wizard for our specific operating system.

Once Node.js is installed, we can create a new project directory for our scraping project. We can create a new directory using the mkdir command in the terminal. For example, to create a new directory called scraping-project, we can run the following commands:

mkdir scraping-project
cd scraping-project

 

Next, we need to initialize our project as a Node.js project using the npm init command. This command will prompt us for some basic information about our project, such as the project name, version, description, and entry point. We can accept the default settings by running the following command:

npm init -y

With our project initialized, we can now install the Playwright package using the npm install command. We can run the following command in the terminal:

npm install playwright
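
Note: the playwright package normally downloads the browser binaries as a post-install step. If they are missing in your environment (for example, in some CI setups), you can fetch them explicitly with Playwright's CLI:

npx playwright install chromium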


Basic scraping with Playwright

First, we need to create our root file, scraping.js, as well as a folder and a file for setting up the browser, by running the following commands:
touch scraping.js

mkdir browser
cd browser
touch index.js

 

Now we need to write our browser/index.js file. First, we require a browser type from Playwright; in our case it is chromium.

const { chromium } = require('playwright');

We will create two functions, openBrowser and closeBrowser.

The first function is openBrowser. Here we launch our browser. The headless option is set to false, which means the browser will open in a visible window rather than running in the background. We then create a context from the browser, which we will use later to open a page.

async function openBrowser() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    return { browser, context };
}

The second function is closeBrowser, which takes a browser object as an argument and uses the close method to close the browser.

async function closeBrowser(browser) {
    await browser.close();
}

Finally, we need to export the functions so that they can be imported and used in other files.

module.exports = { openBrowser, closeBrowser };

Here is the complete browser/index.js file:

const { chromium } = require('playwright');

async function openBrowser() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    return { browser, context };
}

async function closeBrowser(browser) {
    await browser.close();
}

module.exports = { openBrowser, closeBrowser };
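
As a quick sanity check (a hypothetical snippet, not part of the final project, assuming the file above lives in browser/index.js), you can open the browser, visit a page, and close it again:

const { openBrowser, closeBrowser } = require('./browser/index');

(async () => {
    // Open the browser and create a fresh context
    const { browser, context } = await openBrowser();
    const page = await context.newPage();

    // Visit any page to confirm everything is wired up
    await page.goto('https://example.com');
    console.log('Page title:', await page.title());

    // Always close the browser when done
    await closeBrowser(browser);
})();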

Now we will write our main scraping.js file to do the actual scraping.


We need to import functions from browser/index.js for opening and closing the browser.

const {closeBrowser, openBrowser} = require('./browser/index');

We create a function named startScraping. 

async function startScraping() {}

To start scraping, we need to open the browser and a page.

const { browser, context } = await openBrowser();
const page = await context.newPage();

After opening the page in the browser using our chosen web automation library, we need to navigate to the esterox.com website. To accomplish this, we can use the page.goto function with the URL of the desired website passed as a parameter.

await page.goto('https://www.esterox.com/');
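
By default, page.goto waits for the page's load event. If needed, you can also pass options such as waitUntil or timeout; this is optional for our example, and the values below are purely illustrative:

// Optional: only wait until the DOM is parsed, and give up after 30 seconds
await page.goto('https://www.esterox.com/', { waitUntil: 'domcontentloaded', timeout: 30000 });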

To find the cookie acceptance button on the esterox.com website, we can use the page.$$ method with a selector that searches for HTML elements containing the desired text. Once we have located the button element, we can use the element.click() function to programmatically click on it.

const acceptButton = await page.$$("button:text('Accept')");
if (acceptButton.length) {
    await acceptButton[0].click();
}
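
As an aside, newer Playwright versions also offer the locator API, which auto-waits for elements. A roughly equivalent version of the snippet above could look like this (assuming the button text is exactly 'Accept'):

// Locator-based alternative: click the cookie button only if it is present
const accept = page.locator("button:text('Accept')");
if (await accept.count()) {
    await accept.first().click();
}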

After accepting the cookies, we navigate to the blog page by clicking the link with the text Blog.

await page.click("a:text('Blog')");

We need to wait until the content of the page has loaded. There are two ways to do that:

1. await page.waitForTimeout(6000);

// wait a fixed 6 seconds

2. await page.waitForSelector('.blog-block__image');

// wait until the selector exists

We will use the second approach, since waiting for a specific element is more reliable than a fixed delay:

await page.waitForSelector('.blog-block__image');
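
waitForSelector also accepts an options object, so you can, for example, cap how long to wait; the values below are illustrative, not required:

// Wait at most 10 seconds for the first blog image to become visible
await page.waitForSelector('.blog-block__image', { state: 'visible', timeout: 10000 });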

Once the content of the page has loaded, we inject a script into the page with the page.evaluate function; this script collects the data we need from the DOM.

const result = await page.evaluate(() => {
    const blogs = document.querySelectorAll('.blog-block');
    const results = [];
    for (let i = 0; i < blogs.length; i++) {
        const result = {};
        const image = blogs[i].querySelector(".blog-block__image").getAttribute("src");
        const name = blogs[i].querySelector(".blog-block__name").innerText;
        const createdAt = blogs[i].querySelector(".blog-block__date").innerText;
        const url = `https://esterox.com${blogs[i].querySelector(".blog-block__link").getAttribute("href")}`;
        result['img'] = image;
        result['name'] = name;
        result['createdAt'] = createdAt;
        result['url'] = url;
        results.push(result);
    }
    return results;
});

We can then do whatever we want with the result: store it in a database, write it to a file, or simply print it in the terminal.
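
For example, a minimal sketch of writing the result to a JSON file with Node's built-in fs module could look like this (the filename blogs.json is just an illustrative choice):

const fs = require('fs');

// Persist the scraped blog data as pretty-printed JSON
fs.writeFileSync('blogs.json', JSON.stringify(result, null, 2));
console.log(`Saved ${result.length} blog entries to blogs.json`);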

When scraping has finished successfully, we close the browser.

await closeBrowser(browser);

 


After that, we create an async IIFE to call the startScraping function.

(async () => {
    try {
        await startScraping();
    } catch (e) {
        console.log('error ------>', e)
    }
})();

Here is the complete scraping.js file:

const { closeBrowser, openBrowser } = require('./browser/index');

async function startScraping() {
    const { browser, context } = await openBrowser();
    const page = await context.newPage();

    await page.goto('https://www.esterox.com/');

    const acceptButton = await page.$$("button:text('Accept')");
    if (acceptButton.length) {
        await acceptButton[0].click();
    }

    await page.click("a:text('Blog')");

    await page.waitForSelector('.blog-block__image');
    const result = await page.evaluate(() => {
        const blogs = document.querySelectorAll('.blog-block');
        const results = [];
        for (let i = 0; i < blogs.length; i++) {
            const result = {};
            const image = blogs[i].querySelector(".blog-block__image").getAttribute("src");
            const name = blogs[i].querySelector(".blog-block__name").innerText;
            const createdAt = blogs[i].querySelector(".blog-block__date").innerText;
            const url = `https://esterox.com${blogs[i].querySelector(".blog-block__link").getAttribute("href")}`;
            result['img'] = image;
            result['name'] = name;
            result['createdAt'] = createdAt;
            result['url'] = url;
            results.push(result);
        }
        return results;
    });

    // Print the scraped data to the terminal
    console.log(result);

    await closeBrowser(browser);
}


(async () => {
    try {
        await startScraping();
    } catch (e) {
        console.log('error ------>', e)
    }
})();
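
One possible refinement, not part of the original script, is to guarantee the browser closes even if an error is thrown mid-scrape, for example by restructuring startScraping with a try/finally block. A sketch, assuming the same openBrowser and closeBrowser helpers:

const { closeBrowser, openBrowser } = require('./browser/index');

async function startScraping() {
    const { browser, context } = await openBrowser();
    try {
        const page = await context.newPage();
        // ... navigation and extraction steps as shown above ...
    } finally {
        // Close the browser whether scraping succeeded or failed
        await closeBrowser(browser);
    }
}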

You can run it with the following command:

node scraping

With the knowledge gained from this article, you now have the foundation to start building your own web scraping scripts using Node.js and Playwright. Remember to always consider the ethical and legal implications of web scraping and be mindful of the terms of service of the websites you scrape. Happy scraping!

 


P.S. In case you need the main code, it's here: https://github.com/arthur-sahakyan/scraping-project