npmpackage.info

Gathering detailed insights and metrics for puppeteer-cluster

Other packages similar to puppeteer-cluster

@toologin/puppeteer-cluster

0.24.6

Cluster management for puppeteer

@zhaow-de/puppeteer-cluster

0.37.0

Cluster management for puppeteer

@drtz/puppeteer-cluster

1.0.3

Cluster management for puppeteer

@pricerobot/puppeteer-cluster

0.23.7

Cluster management for puppeteer

puppeteer-cluster

Puppeteer Pool, run a cluster of instances in parallel

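Version: 0.24.0

Stars: 3,322

License: MIT

Language: TypeScript

Minified + Gzipped Size: 7.00 kB

Total Downloads: 10,337,426

OpenSSF Scorecard Score: 2.5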

Installations

npm install puppeteer-cluster

Developer Guide

BETA

Typescript

Yes

Module System

CommonJS

Score

65.8

Supply Chain

97.1

Quality

75.1

Maintenance

100

Vulnerability

98.9

License

Pull Requests

Open

19

Total

289

Closed

219

Merged

51

Issues

Open

114

Total

263

Closed

149

Releases

v0.24.0

Updated on Mar 17, 2024

v0.23.0

Updated on Jan 23, 2022

v0.22.0

Updated on Jan 23, 2022

v0.21.0

Updated on May 24, 2020

v0.20.0

Updated on May 24, 2020

v0.19.0

Updated on May 24, 2020

View All 25 releases

Contributors

View All 12 Contributors

Languages

TypeScript

JavaScript

TypeScript (99.43%)

JavaScript (0.57%)

Developer

thomasdondorf

Download Statistics

Total Downloads

10,337,426

Last Day

12,239

Last Week

77,268

Last Month

330,241

Last Year

3,264,730

GitHub Statistics

MIT License

3,322 Stars

364 Commits

312 Forks

49 Watchers

24 Branches

12 Contributors

Updated on Feb 14, 2025

Bundle Size

25.34 kB

Minified

7.00 kB

Minified + Gzipped

Bundlephobia

Maintainers

Package Meta Information

Latest Version

0.24.0

Package Id

puppeteer-cluster@0.24.0

Unpacked Size

119.30 kB

Size

30.77 kB

File Count

Published on

Mar 17, 2024

Total Downloads

Cumulative downloads: 10,337,426

Last Day: 12,239 (-11% compared to previous day)

Last Week: 77,268 (-6.2% compared to previous week)

Last Month: 330,241 (24.8% compared to previous month)

Last Year: 3,264,730 (44.2% compared to previous year)

Puppeteer Cluster

Create a cluster of puppeteer workers. This library spawns a pool of Chromium instances via Puppeteer and helps to keep track of jobs and errors. This is helpful if you want to crawl multiple pages or run tests in parallel. Puppeteer Cluster takes care of reusing Chromium and restarting the browser in case of errors.

Installation
Usage
Examples
Concurrency implementations
Typings for input/output (via TypeScript Generics)
Debugging
API
License

What does this library do?

Handling of crawling errors
Auto restarts the browser in case of a crash
Can automatically retry if a job fails
Different concurrency models to choose from (pages, contexts, browsers)
Simple to use, small boilerplate
Progress view and monitoring statistics (see below)

Installation

Install using your favorite package manager:

npm install --save puppeteer # in case you don't already have it installed
npm install --save puppeteer-cluster

Alternatively, use yarn:

yarn add puppeteer puppeteer-cluster

Usage

The following is a typical example of using puppeteer-cluster. A cluster is created with 2 concurrent workers. Then a task is defined which includes going to the URL and taking a screenshot. We then queue two jobs and wait for the cluster to finish.

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

Examples

Concurrency implementations

There are different concurrency models, which define how isolated each job is from the others. You can set the model in the options when calling Cluster.launch. The default is Cluster.CONCURRENCY_CONTEXT, but it is recommended to always specify which one you want to use.

Concurrency	Description	Shared data
`CONCURRENCY_PAGE`	One Page for each URL	Shares everything (cookies, localStorage, etc.) between jobs.
`CONCURRENCY_CONTEXT`	Incognito page (see BrowserContext) for each URL	No shared data.
`CONCURRENCY_BROWSER`	One browser (using an incognito page) per URL. If one browser instance crashes for any reason, this will not affect other jobs.	No shared data.
Custom concurrency (experimental)	You can create your own concurrency implementation. Copy one of the files of the `concurrency/built-in` directory and implement `ConcurrencyImplementation`. Then provide the class to the option `concurrency`. This part of the library is currently experimental and might break in the future, even in a minor version upgrade while the version has not reached 1.0.	Depends on your implementation
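
As a rough illustration of the table above, the following sketch maps two isolation requirements to one of the three built-in constants. The helper `pickConcurrency` and its option names `shareState` and `isolateCrashes` are hypothetical and not part of puppeteer-cluster. Only the `Cluster.CONCURRENCY_*` constants come from the library.

```javascript
// Hypothetical helper (not part of puppeteer-cluster): choose a built-in
// concurrency model from two requirements. `Cluster` is the class exported
// by puppeteer-cluster and is passed in by the caller.
function pickConcurrency(Cluster, { shareState = false, isolateCrashes = false } = {}) {
  if (shareState) {
    // Jobs share cookies, localStorage, etc.
    return Cluster.CONCURRENCY_PAGE;
  }
  if (isolateCrashes) {
    // One browser per job: a crashed browser does not affect other jobs.
    return Cluster.CONCURRENCY_BROWSER;
  }
  // Incognito context per job: no shared data.
  return Cluster.CONCURRENCY_CONTEXT;
}

// Usage, inside an async function (requires puppeteer-cluster to be installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch({
//   concurrency: pickConcurrency(Cluster, { isolateCrashes: true }),
//   maxConcurrency: 2,
// });
```

Passing the model explicitly, as done here, follows the recommendation above to always specify which one you want instead of relying on the default.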

Typings for input/output (via TypeScript Generics)

To allow proper type checks with TypeScript you can provide generics. In case no types are provided, any is assumed for input and output. See the following minimal example or check out the more complex typings example for more information.

const cluster: Cluster<string, number> = await Cluster.launch(/* ... */);

await cluster.task(async ({ page, data }) => {
  // TypeScript knows that data is a string and expects this function to return a number
  return 123;
});

// TypeScript expects a string as argument ...
cluster.queue('http://...');

// ... and will return a number when execute is called.
const result = await cluster.execute('https://www.google.com');

Debugging

Check out the Puppeteer debugging tips first. Your problem might not be related to puppeteer-cluster, but to Puppeteer itself. Additionally, you can enable verbose logging to see which data is consumed by which worker and some other cluster information. Set the DEBUG environment variable to puppeteer-cluster:*. See the example below or check out the debug docs for more information.

# Linux
DEBUG='puppeteer-cluster:*' node examples/minimal
# Windows Powershell
$env:DEBUG='puppeteer-cluster:*';node examples/minimal

API

class: Cluster

Cluster module provides a method to launch a cluster of Chromium instances.

event: 'taskerror'

Emitted when a queued task ends in an error for some reason. Reasons might be a network error, your code throwing an error, a timeout being hit, etc. The first argument will be the error itself. The second argument is the URL or data of the job (as given to Cluster.queue). If retryLimit is set to a value greater than 0, the cluster will automatically requeue the job and retry it later. The third argument is a boolean which indicates whether this task will be retried. If the task was queued via Cluster.execute, no event is fired.

cluster.on('taskerror', (err, data, willRetry) => {
  if (willRetry) {
    console.warn(`Encountered an error while crawling ${data}. ${err.message}\nThis job will be retried`);
  } else {
    console.error(`Failed to crawl ${data}: ${err.message}`);
  }
});

event: 'queue'

<?Object>
<?function>

Emitted when a task is queued via Cluster.queue or Cluster.execute. The first argument is the object containing the data (if any data is provided). The second argument is the queued function (if any). In case only a function is provided via Cluster.queue or Cluster.execute, the first argument will be undefined. If only data is provided, the second argument will be undefined.
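
A small sketch of a listener for this event follows. The helper names `describeQueued` and `logQueuedJobs` are hypothetical and used only for illustration. The `(data, taskFunction)` argument order follows the description above.

```javascript
// Hypothetical helper: summarize what was queued, based on the two event arguments.
function describeQueued(data, taskFunction) {
  const hasData = data !== undefined;
  const hasFunction = typeof taskFunction === 'function';
  if (hasData && hasFunction) return `data + function: ${JSON.stringify(data)}`;
  if (hasData) return `data only: ${JSON.stringify(data)}`;
  if (hasFunction) return 'function only';
  return 'empty job';
}

// Attach the listener to a cluster created with Cluster.launch.
// `log` defaults to console.log and can be swapped for another sink.
function logQueuedJobs(cluster, log = console.log) {
  cluster.on('queue', (data, taskFunction) => {
    log(describeQueued(data, taskFunction));
  });
}
```

Keeping the formatting in a separate function means it can be checked without launching a browser.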

Cluster.launch(options)

options <Object> Set of configurable options for the cluster. Can have the following fields:
- concurrency <Cluster.CONCURRENCY_PAGE|Cluster.CONCURRENCY_CONTEXT|Cluster.CONCURRENCY_BROWSER|ConcurrencyImplementation> The chosen concurrency model. See Concurrency implementations for more information. Defaults to Cluster.CONCURRENCY_CONTEXT. Alternatively you can provide a class implementing ConcurrencyImplementation.
- maxConcurrency <number> Maximum number of parallel workers. Defaults to 1.
- puppeteerOptions <Object> Object passed to puppeteer.launch. See puppeteer documentation for more information. Defaults to {}.
- perBrowserOptions <Array<Object>> Object passed to puppeteer.launch for each individual browser. If set, puppeteerOptions will be ignored. Defaults to undefined (meaning that puppeteerOptions will be used).
- retryLimit <number> How many times a job is retried before it is marked as failed. Ignored by tasks queued via Cluster.execute. Defaults to 0.
- retryDelay <number> Minimum time that must pass between a job's execution and its retry. Ignored by tasks queued via Cluster.execute. Defaults to 0.
- sameDomainDelay <number> How much time should pass at minimum between two requests to the same domain. If you use this field, the queued data must be your URL or data must be an object containing a field called url.
- skipDuplicateUrls <boolean> If set to true, will skip URLs which were already crawled by the cluster. Defaults to false. If you use this field, the queued data must be your URL or data must be an object containing a field called url.
- timeout <number> Specify a timeout for all tasks. Defaults to 30000 (30 seconds).
- monitor <boolean> If set to true, will provide a small command line output to provide information about the crawling process. Defaults to false.
- workerCreationDelay <number> Time between creation of two workers. Set this to a value like 100 (0.1 seconds) in case you want some time to pass before another worker is created. You can use this to prevent a network peak right at the start. Defaults to 0 (no delay).
- puppeteer <Object> In case you want to use a different puppeteer library (like puppeteer-core or puppeteer-extra), pass the object here. If not set, will default to using puppeteer. When using puppeteer-core, make sure to also provide puppeteerOptions.executablePath.
returns: <Promise<Cluster>>

The method launches a cluster instance.
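
The sketch below shows how several of these options might be combined for a crawler that retries failed jobs and throttles requests per domain. The helper names `buildOptions` and `toJob` are hypothetical, and all numeric values are arbitrary examples. The delay values are assumed to be in milliseconds, like the documented `timeout` default.

```javascript
// Hypothetical helper: assemble launch options for a throttled crawler.
// All numeric values are arbitrary examples, not recommendations from the library.
// Delays are assumed to be in milliseconds, like the documented timeout default.
function buildOptions(Cluster) {
  return {
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
    retryLimit: 2,           // retry a failed job up to two times
    retryDelay: 1000,        // minimum wait between a job's execution and its retry
    sameDomainDelay: 500,    // minimum gap between two requests to the same domain
    skipDuplicateUrls: true, // skip URLs the cluster has already crawled
    timeout: 30000,          // same as the documented default (30 seconds)
    monitor: false,
  };
}

// sameDomainDelay and skipDuplicateUrls need the queued data to be the URL
// itself or an object with a field called `url`, so extra fields ride along.
function toJob(url, extra = {}) {
  return { ...extra, url };
}

// Usage, inside an async function (requires puppeteer-cluster to be installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch(buildOptions(Cluster));
// cluster.queue(toJob('https://example.com/', { depth: 0 }));
```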

cluster.task(taskFunction)

taskFunction <function(string|Object, Page, Object)> Sets the function, which will be called for each job. The function will be called with an object having the following fields:
- page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
- data The data of the job you provided to Cluster.queue.
- worker <Object> An object containing information about the worker executing the current job.
  - id <number> ID of the worker. Worker IDs start at 0.
returns: <Promise>

Specifies a task for the cluster. A task is called for each job you queue via Cluster.queue. Alternatively you can directly queue the function that you want to be executed. See Cluster.queue for an example.
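
To illustrate the three fields passed to the task function, here is a minimal sketch that uses all of them. The names `titleTask` and `registerTitleTask` are hypothetical. `page.goto` and `page.title` are standard Puppeteer page methods.

```javascript
// Task function: receives { page, data, worker } for every queued job.
async function titleTask({ page, data: url, worker }) {
  await page.goto(url);
  const title = await page.title();
  // worker.id identifies which worker handled the job (IDs start at 0).
  return `[worker ${worker.id}] ${url} -> ${title}`;
}

// Register the task once; jobs queued without their own function will use it.
async function registerTitleTask(cluster) {
  await cluster.task(titleTask);
}
```

The returned string is only visible to callers that wait for the job, for example via Cluster.execute.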

cluster.queue([data] [, taskFunction])

data Data to be queued. This might be your URL (a string) or a more complex object containing data. The data given will be provided to your task function(s). See [examples] for a more complex usage of this argument.
taskFunction <function> Function like the one given to Cluster.task. If a function is provided, this function will be called (only for this job) instead of the function provided to Cluster.task. The function will be called with an object having the following fields:
- page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
- data The data of the job you provided as first argument to Cluster.queue. This might be undefined in case you only specified a function.
- worker <Object> An object containing information about the worker executing the current job.
  - id <number> ID of the worker. Worker IDs start at 0.
returns: <Promise>

Puts a URL or data into the queue. Alternatively (or even additionally) you can queue functions. See the examples about function queuing for more information: (Simple function queuing, complex function queuing).

Be aware that this function only returns a Promise for backward compatibility reasons. This function does not run asynchronously and will immediately return.
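
The call shapes described above can be sketched as follows. The function name `queueMixedJobs` and the example.com URLs are placeholders chosen for illustration.

```javascript
// Queue three kinds of jobs on an existing cluster.
function queueMixedJobs(cluster) {
  // 1. Plain URL: handled by the function registered via cluster.task.
  cluster.queue('https://example.com/');

  // 2. Object data: the whole object arrives as `data` in the task function.
  cluster.queue({ url: 'https://example.com/about', label: 'about' });

  // 3. Data plus a job-specific function, which is called for this job
  //    instead of the function registered via cluster.task.
  cluster.queue('https://example.com/contact', async ({ page, data: url }) => {
    await page.goto(url);
    return page.title();
  });
}
```

Since queue returns immediately, the calls are not awaited here. Waiting for the jobs to finish is done separately with cluster.idle.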

cluster.execute([data] [, taskFunction])

data Data to be queued. This might be your URL (a string) or a more complex object containing data. The data given will be provided to your task function(s). See [examples] for a more complex usage of this argument.
taskFunction <function> Function like the one given to Cluster.task. If a function is provided, this function will be called (only for this job) instead of the function provided to Cluster.task. The function will be called with an object having the following fields:
- page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
- data The data of the job you provided as first argument to Cluster.execute. This might be undefined in case you only specified a function.
- worker <Object> An object containing information about the worker executing the current job.
  - id <number> ID of the worker. Worker IDs start at 0.
returns: <Promise>

Works like Cluster.queue, but this function returns a Promise which will be resolved after the task is executed. This means the job is still queued, but the script will wait for it to finish. If an error happens during execution, this function rejects the Promise with the thrown error. No "taskerror" event is fired. In addition, tasks queued via execute ignore "retryLimit" and "retryDelay". For an example, see the Execute example.
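
Because errors reject the returned Promise instead of firing an event, a caller would typically wrap the call in try/catch. A minimal sketch, in which the wrapper name `fetchTitle` and its result shape are hypothetical:

```javascript
// Run one job and wait for its result. Errors reject the promise rather
// than firing 'taskerror', so they are handled here with try/catch.
async function fetchTitle(cluster, url) {
  try {
    const title = await cluster.execute(url, async ({ page, data }) => {
      await page.goto(data);
      return page.title();
    });
    return { ok: true, title };
  } catch (err) {
    // No automatic retry happens for execute; the caller decides what to do.
    return { ok: false, error: err.message };
  }
}
```

Returning a result object rather than rethrowing is just one design choice. Letting the rejection propagate is equally valid when the caller already handles errors.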

cluster.idle()

returns: <Promise>

Promise is resolved when the queue becomes empty.

cluster.close()

returns: <Promise>

Closes the cluster and all opened Chromium instances including all open pages (if any were opened). It is recommended to run Cluster.idle before calling this function. The Cluster object itself is considered to be disposed and cannot be used anymore.
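
The recommended order (idle first, then close) can be sketched as below. The function name `runToCompletion` is hypothetical. The try/finally is an addition of this sketch, so that the browsers are closed even if something throws while queueing or waiting.

```javascript
// Queue all URLs, wait for the queue to drain, and always close the cluster.
async function runToCompletion(cluster, urls) {
  try {
    for (const url of urls) {
      cluster.queue(url);
    }
    // Resolves when the queue becomes empty.
    await cluster.idle();
  } finally {
    // Closes all Chromium instances; the cluster cannot be used afterwards.
    await cluster.close();
  }
}
```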

License

MIT license.

No vulnerabilities found.

10

Binary-Artifacts

Determines if the project has generated executable (binary) artifacts in the source repository.

10

Dangerous-Workflow

Determines if the project's GitHub Action workflows avoid dangerous patterns.

10

License

Determines if the project has defined a license.

0

Code-Review

Determines if the project requires human code review before pull requests (aka merge requests) are merged.

0

Maintained

Determines if the project is "actively maintained".

0

Token-Permissions

Determines if the project's workflows follow the principle of least privilege.

0

Pinned-Dependencies

Determines if the project has declared and pinned the dependencies of its build process.

0

CII-Best-Practices

Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.

0

Security-Policy

Determines if the project has published a security policy.

0

Fuzzing

Determines if the project uses fuzzing.

0

Branch-Protection

Determines if the default and release branches are protected with GitHub's branch protection settings.

0

SAST

Determines if the project uses static code analysis.

0

Vulnerabilities

Determines if the project has open, known unfixed vulnerabilities.

Score: 2.5/10

Last Scanned on 2025-02-10

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.
