A lightweight robots.txt parser for Node.js with support for wildcards, caching and promises.
npm install robots-txt-parser
Language: JavaScript (100%)
MIT License · 14 Stars · 80 Commits · 9 Forks · 1 Watcher · 3 Branches · 2 Contributors · Updated on Jul 09, 2025
Latest Version: 2.0.3
Package Id: robots-txt-parser@2.0.3
Unpacked Size: 58.23 kB
Size: 14.52 kB
File Count: 41
NPM Version: 7.14.0
Node Version: 16.2.0
Via NPM: `npm install robots-txt-parser --save`.
After installing robots-txt-parser, it needs to be required and initialised:
```js
const robotsParser = require('robots-txt-parser');
const robots = robotsParser({
  userAgent: 'Googlebot', // The default user agent to use when looking for allow/disallow rules; if this agent isn't listed in the active robots.txt, * is used.
  allowOnNeutral: false, // The value to use when the robots.txt allow and disallow rules are balanced on whether a link can be crawled.
});
```
Example Usage:
```js
const robotsParser = require('robots-txt-parser');

const robots = robotsParser({
  userAgent: 'Googlebot', // The default user agent to use when looking for allow/disallow rules; if this agent isn't listed in the active robots.txt, * is used.
  allowOnNeutral: false, // The value to use when the robots.txt allow and disallow rules are balanced on whether a link can be crawled.
});

robots.useRobotsFor('http://example.com')
  .then(() => {
    robots.canCrawlSync('http://example.com/news'); // Returns true if the link can be crawled, false if not.
    robots.canCrawl('http://example.com/news', (value) => {
      console.log('Crawlable: ', value);
    }); // Calls the callback with true if the link is crawlable, false if not.
    robots.canCrawl('http://example.com/news') // If no callback is provided, returns a promise which resolves with true if the link is crawlable, false if not.
      .then((value) => {
        console.log('Crawlable: ', value);
      });
  });
```
Below is a condensed form of the documentation; each entry is a function that can be found on the robotsParser object.
Method | Parameters | Return |
---|---|---|
parseRobots(key, string) | key:String, string:String | None |
isCached(domain) | domain:String | Boolean for whether the robots.txt for the domain is cached. |
fetch(url) | url:String | Promise, resolved when the robots.txt has been retrieved. |
useRobotsFor(url) | url:String | Promise, resolved when the robots.txt is fetched. |
canCrawl(url, callback) | url:String, callback:Func (Opt) | Promise, resolves with Boolean. |
getSitemaps(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
getCrawlDelay(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with Number. |
getCrawlableLinks(links, callback) | links:[String], callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
getPreferredHost(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with String. |
setUserAgent(userAgent) | userAgent:String | None. |
setAllowOnNeutral(allow) | allow:Boolean | None. |
clearCache() | None | None. |
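For orientation, here is a short end-to-end sketch that strings several of these methods together using async/await. It only uses the API documented above; the site URL and user agent are placeholders.

```js
const robotsParser = require('robots-txt-parser');

// A minimal crawl set-up sketch; 'Googlebot' and example.com are placeholders.
const robots = robotsParser({ userAgent: 'Googlebot', allowOnNeutral: false });

async function inspectSite(siteUrl) {
  await robots.useRobotsFor(siteUrl);             // Fetch (or reuse) the site's robots.txt.
  const allowed = await robots.canCrawl(siteUrl); // Boolean for a single URL.
  const sitemaps = await robots.getSitemaps();    // Sitemap URLs listed in the robots.txt.
  const delay = await robots.getCrawlDelay();     // Crawl delay for the active user agent.
  console.log({ allowed, sitemaps, delay });
}

inspectSite('http://example.com');
```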
robots.parseRobots(key, string)
Parses a string representation of a robots.txt file and caches it under the given key.
Returns nothing.
```js
robots.parseRobots('https://example.com',
  `
  User-agent: *
  Allow: /*.php$
  Disallow: /
  `);
```
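As a rough sketch of offline testing, a manually parsed robots.txt can then be queried without any network access. This assumes the cache key matches the domain later passed to useRobotsFor, which the documentation above does not spell out.

```js
// Sketch: parse a robots.txt string, then query it without hitting the network.
// Assumes the cache key matches the domain passed to useRobotsFor below.
robots.parseRobots('https://example.com',
  `
  User-agent: *
  Allow: /*.php$
  Disallow: /
  `);

robots.useRobotsFor('https://example.com') // Should resolve from the cache.
  .then(() => {
    console.log(robots.canCrawlSync('https://example.com/index.php')); // Expected: true
    console.log(robots.canCrawlSync('https://example.com/private/'));  // Expected: false
  });
```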
robots.isCached(domain)
A method used to check if a robots.txt has already been fetched and parsed.
Returns true if a robots.txt has already been fetched and cached by the robots-txt-parser.
```js
robots.isCached('https://example.com'); // true or false
robots.isCached('example.com'); // Attempts to check the cache for only http:// and returns true or false.
```
robots.fetch(url)
Attempts to fetch and parse a robots.txt file located at the url. This method bypasses the built-in cache and always retrieves a fresh copy of the robots.txt.
Returns a Promise which will resolve with the parsed robots.txt once it has been fetched.
```js
robots.fetch('https://example.com/robots.txt')
  .then((tree) => {
    console.log(Object.keys(tree)); // Will log sitemap and any user agents.
  });
```
robots.useRobotsFor(url)
Attempts to download and use the robots.txt at the given url; if it has already been downloaded, the cached copy is used instead.
Returns a Promise that resolves once the URL is fetched and parsed.
```js
robots.useRobotsFor('https://example.com/news')
  .then(() => {
    // Logic to check if links are crawlable.
  });
```
robots.canCrawl(url, callback)
Tests whether a url can be crawled for the current active robots.txt and user agent. If a robots.txt isn't cached for the domain of the url, it is fetched and parsed before returning a boolean value.
Returns a Promise which will resolve with a boolean value.
```js
robots.canCrawl('https://example.com/news')
  .then((crawlable) => {
    console.log(crawlable); // Will log a boolean value.
  });
```
robots.getSitemaps(callback)
Returns a list of sitemaps present on the active robots.txt.
Returns a Promise which will resolve with an array of strings.
```js
robots.getSitemaps()
  .then((sitemaps) => {
    console.log(sitemaps); // Will log a list of strings.
  });
```
robots.getCrawlDelay(callback)
Returns the crawl delay specified in the current active robots.txt for the active user agent.
Returns a Promise which will resolve with an Integer.
```js
robots.getCrawlDelay()
  .then((crawlDelay) => {
    console.log(crawlDelay); // Will be an Integer greater than or equal to 0.
  });
```
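As a usage sketch, the delay can be used to pace requests. This assumes the value is in seconds (the usual robots.txt convention); fetchPage is a hypothetical placeholder for your own request logic, not part of robots-txt-parser.

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical placeholder for real request logic, not part of robots-txt-parser.
const fetchPage = (url) => Promise.resolve(console.log('Fetching', url));

async function politeCrawl(urls) {
  const delaySeconds = await robots.getCrawlDelay(); // 0 if no crawl delay is set.
  for (const url of urls) {
    if (await robots.canCrawl(url)) {
      await fetchPage(url);
    }
    await sleep(delaySeconds * 1000); // Assumes the delay is given in seconds.
  }
}
```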
robots.getCrawlableLinks(links, callback)
Takes an array of links and returns an array of the links which are crawlable for the current active robots.txt.
Returns a Promise which will resolve with an Array of all the crawlable links.
```js
robots.getCrawlableLinks([])
  .then((links) => {
    console.log(links);
  });
```
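A slightly fuller sketch with placeholder links (the URLs are illustrative, not from the package docs):

```js
// Sketch: filter a candidate link list down to the crawlable ones.
// The URLs are placeholders for illustration only.
robots.getCrawlableLinks([
  'https://example.com/news',
  'https://example.com/admin',
  'https://example.com/news/article-1',
])
  .then((crawlable) => {
    console.log(crawlable); // Only the links permitted by the active robots.txt.
  });
```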
robots.getPreferredHost(callback)
Returns the preferred host name specified in the active robots.txt's host: directive, or undefined if there isn't one.
Returns a Promise which will resolve with a String if the host is defined, undefined otherwise.
```js
robots.getPreferredHost()
  .then((host) => {
    console.log(host);
  });
```
robots.setUserAgent(userAgent)
Sets the current user agent to use when checking if a link can be crawled.
Returns undefined.
```js
robots.setUserAgent('exampleBot'); // When interacting with the robots.txt we now look for records for 'exampleBot'.
robots.setUserAgent('testBot'); // When interacting with the robots.txt we now look for records for 'testBot'.
```
robots.setAllowOnNeutral(allow)
Sets whether canCrawl returns true or false when the robots.txt allow and disallow rules are balanced on whether a link should be crawled.
Returns undefined.
```js
robots.setAllowOnNeutral(true); // If the allow/disallow rules are balanced, canCrawl returns true.
robots.setAllowOnNeutral(false); // If the allow/disallow rules are balanced, canCrawl returns false.
```
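A rough sketch of what a "balanced" case can look like, assuming an Allow and a Disallow rule of equal specificity for the same path count as neutral; the domain and robots.txt content are purely illustrative.

```js
// Sketch: equally specific Allow and Disallow rules for the same path.
// Whether this exact case counts as "neutral" depends on the library's
// rule-weighing logic; the robots.txt below is purely illustrative.
robots.parseRobots('https://neutral.example',
  `
  User-agent: *
  Allow: /news
  Disallow: /news
  `);

robots.useRobotsFor('https://neutral.example')
  .then(() => {
    robots.setAllowOnNeutral(true);
    console.log(robots.canCrawlSync('https://neutral.example/news')); // Likely true.
    robots.setAllowOnNeutral(false);
    console.log(robots.canCrawlSync('https://neutral.example/news')); // Likely false.
  });
```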
robots.clearCache()
The cache can grow very large over extended crawling; this method resets it.
Takes no parameters.
Returns nothing.
```js
robots.clearCache();
```
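A sketch of when this might be called during a long-running crawl; the 10,000-domain threshold is arbitrary and purely illustrative.

```js
// Sketch: reset the cache periodically during a long crawl to bound memory.
// The 10,000-domain threshold is an arbitrary illustration.
let domainsVisited = 0;

async function visitDomain(url) {
  await robots.useRobotsFor(url);
  domainsVisited += 1;
  if (domainsVisited % 10000 === 0) {
    robots.clearCache(); // Later domains will simply be re-fetched as needed.
  }
}
```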
Synchronous variants of the API; these will be deprecated in a future version.
robots.canCrawlSync(url)
Tests whether a url can be crawled for the current active robots.txt and user agent. This won't attempt to fetch the robots.txt if it is not cached.
Returns a boolean value depending on whether the url is crawlable. If there is no cached robots.txt for this url, it will always return true.
```js
robots.canCrawlSync('https://example.com/news'); // true or false.
```
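Because the sync variant never fetches, a common pattern is to warm the cache with useRobotsFor first and then check links synchronously. A minimal sketch (the URLs are placeholders):

```js
// Sketch: warm the cache once, then do cheap synchronous checks.
// Without the useRobotsFor call, canCrawlSync would return true for every
// link on an unseen domain, since nothing is cached yet.
robots.useRobotsFor('https://example.com')
  .then(() => {
    const links = ['https://example.com/news', 'https://example.com/private'];
    const crawlable = links.filter((link) => robots.canCrawlSync(link));
    console.log(crawlable);
  });
```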
robots.getSitemapsSync()
Returns a list of sitemaps present on the active robots.txt.
Takes no parameters.
Returns an Array of Strings.
```js
robots.getSitemapsSync(); // Will be an array, e.g. ['http://example.com/sitemap1.xml', 'http://example.com/sitemap2.xml'].
```
robots.getCrawlDelaySync()
Returns the crawl delay specified in the active robots.txt for the active user agent.
Takes no parameters.
Returns an Integer greater than or equal to 0.
```js
robots.getCrawlDelaySync(); // Will be an Integer.
```
robots.getCrawlableLinksSync(links)
Takes an array of links and returns an array of the links which are crawlable for the current active robots.txt.
Returns an Array of all the links that are crawlable.
```js
robots.getCrawlableLinksSync(['example.com/test/news', 'example.com/test/news/article']); // Will return an array of the links that can be crawled.
```
robots.getPreferredHostSync()
Returns the preferred host name specified in the active robots.txt's host: directive or undefined if there isn't one.
Takes no parameters.
Returns a String if the host is defined, undefined otherwise.
```js
robots.getPreferredHostSync(); // Will be a string if the host directive is defined.
```
See LICENSE file.
No vulnerabilities found.
OpenSSF Scorecard findings:
- No dangerous workflow patterns detected.
- No binaries found in the repo.
- 0 existing vulnerabilities detected.
- License file detected.
- Found 1/29 approved changesets -- score normalized to 0.
- 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0.
- Detected GitHub workflow tokens with excessive permissions.
- Dependency not pinned by hash detected -- score normalized to 0.
- No effort to earn an OpenSSF best practices badge detected.
- Security policy file not detected.
- Project is not fuzzed.
- Branch protection not enabled on development/release branches.
- SAST tool is not run on all commits -- score normalized to 0.
Last Scanned on 2025-07-14
The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.