Extract the main article from a given URL with Node.js.

Installation

```shell
npm install @extractus/article-extractor
```
Developer Guide

- TypeScript: Yes
- Module System: ESM, UMD
- Min. Node Version: >= 18
- Node Version: 22.11.0
- NPM Version: 10.9.0

Score

- Overall: 57.6
- Supply Chain: 97.3
- Quality: 87.6
- Maintenance: 100
- Vulnerability: 98.9
Languages

- JavaScript (60.54%)
- HTML (39.46%)
Download Statistics

- Total Downloads: 288,571
- Last Day: 168
- Last Week: 4,572
- Last Month: 17,877
- Last Year: 193,746
GitHub Statistics

- 1,620 Stars
- 755 Commits
- 140 Forks
- 17 Watching
- 1 Branch
- 16 Contributors
Package Meta Information

- Latest Version: 8.0.16
- Package Id: @extractus/article-extractor@8.0.16
- Unpacked Size: 83.98 kB
- Size: 25.61 kB
- File Count: 35
- NPM Version: 10.9.0
- Node Version: 22.11.0
- Published On: 09 Nov 2024
Cumulative Downloads

| Period | Downloads | Change vs. previous period |
| --- | --- | --- |
| Total | 288,571 | n/a |
| Last day | 168 | -30.9% |
| Last week | 4,572 | +19.7% |
| Last month | 17,877 | +33.5% |
| Last year | 193,746 | +106.3% |
Dev Dependencies: 5
@extractus/article-extractor

Extract the main article, main image and metadata from a URL.

(This library was derived from article-parser and renamed.)
Demo
Install & Usage
Node.js
```shell
npm i @extractus/article-extractor

# pnpm
pnpm i @extractus/article-extractor

# yarn
yarn add @extractus/article-extractor
```

```js
// es6 module
import { extract } from '@extractus/article-extractor'
```
Deno
```js
import { extract } from 'https://esm.sh/@extractus/article-extractor'

// deno > 1.28
import { extract } from 'npm:@extractus/article-extractor'
```
Browser
```js
import { extract } from 'https://esm.sh/@extractus/article-extractor'
```
Please check the examples for reference.
APIs
extract()
Loads and extracts article data. Returns a Promise.
Syntax
```js
extract(String input)
extract(String input, Object parserOptions)
extract(String input, Object parserOptions, Object fetchOptions)
```
Example:
```js
import { extract } from '@extractus/article-extractor'

const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

// here we use top-level await, assuming the current platform supports it
try {
  const article = await extract(input)
  console.log(article)
} catch (err) {
  console.error(err)
}
```
The result, `article`, can be null or an object with the following structure:
```js
{
  url: String,
  title: String,
  description: String,
  image: String,
  author: String,
  favicon: String,
  content: String,
  published: Date String,
  type: String, // page type
  source: String, // original publisher
  links: Array, // list of alternative links
  ttr: Number, // time to read in seconds, 0 = unknown
}
```
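Because the result can be null, guard before using it. A minimal sketch of consuming this structure; the `summarize()` helper and its output format are hypothetical, not part of the library:

```javascript
// Sketch: `article` can be null, so guard before using it.
// The field names (title, ttr) follow the result structure documented
// above; summarize() itself is a hypothetical helper.
function summarize(article) {
  if (article === null) return 'No article extracted'
  // ttr is expressed in seconds; 0 means unknown
  const minutes = article.ttr > 0 ? Math.ceil(article.ttr / 60) : null
  return minutes ? `${article.title} (~${minutes} min read)` : article.title
}
```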
Parameters
input (required)

A URL string that links to the article, or the HTML content of that web page.
parserOptions (optional)

An object with all or some of the following properties:

- wordsPerMinute: Number, used to estimate the time to read. Default: 300.
- descriptionTruncateLen: Number, max number of characters generated for the description. Default: 210.
- descriptionLengthThreshold: Number, min number of characters required for the description. Default: 180.
- contentLengthThreshold: Number, min number of characters required for the content. Default: 200.
For example:
```js
import { extract } from '@extractus/article-extractor'

const article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
  descriptionLengthThreshold: 120,
  contentLengthThreshold: 500
})

console.log(article)
```
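The wordsPerMinute option drives the ttr field of the result. A rough sketch of how such an estimate can be computed; this is an illustration only, not the library's actual implementation:

```javascript
// Rough sketch (assumption): a time-to-read estimate derived from a
// word count and the wordsPerMinute option. The library's internal
// formula may differ.
function estimateTtr(text, wordsPerMinute = 300) {
  const words = text.trim().split(/\s+/).filter(Boolean).length
  // ttr is expressed in seconds, matching the result structure above
  return Math.round((words / wordsPerMinute) * 60)
}
```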
fetchOptions (optional)

fetchOptions is an object that can have the following properties:

- headers: to set request headers
- proxy: another endpoint to forward the request to
- agent: an HTTP proxy agent
- signal: an AbortController signal or AbortSignal timeout to terminate the request
For example, you can use this parameter to set request headers for the fetch, as below:
```js
import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
const article = await extract(url, {}, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  }
})

console.log(article)
```
You can also specify a proxy endpoint to load remote content, instead of fetching directly.
For example:
```js
import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

await extract(url, {}, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  },
  proxy: {
    target: 'https://your-secret-proxy.io/loadXml?url=',
    headers: {
      'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
    },
  }
})
```
Passing requests to a proxy is useful when running @extractus/article-extractor in the browser. See examples/browser-article-parser for a reference example.
For more info about proxy authentication, please refer to HTTP authentication.
For deeper customization, you can consider using a proxy to replace fetch behaviors with your own handlers.
Another way to work with a proxy is to use the agent option instead of proxy, as below:
```js
import { extract } from '@extractus/article-extractor'

import { HttpsProxyAgent } from 'https-proxy-agent'

const proxy = 'http://abc:RaNdoMpasswORd_country-France@proxy.packetstream.io:31113'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

const article = await extract(url, {}, {
  agent: new HttpsProxyAgent(proxy),
})
console.log('Run article-extractor with proxy:', proxy)
console.log(article)
```
For more info about https-proxy-agent, check its repo.
By default, there is no request timeout. You can use the signal option to cancel the request at the right time.
The common way is to use an AbortController:
```js
const controller = new AbortController()

// stop after 5 seconds
setTimeout(() => {
  controller.abort()
}, 5000)

const data = await extract(url, null, {
  signal: controller.signal,
})
```
A newer solution is AbortSignal's timeout() static method:
```js
// stop after 5 seconds
const data = await extract(url, null, {
  signal: AbortSignal.timeout(5000),
})
```
extractFromHtml()

Extracts article data from an HTML string. Returns a Promise, the same as the extract() method above.
Syntax
```js
extractFromHtml(String html)
extractFromHtml(String html, String url)
extractFromHtml(String html, String url, Object parserOptions)
```
Example:
```js
import { extractFromHtml } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

const res = await fetch(url)
const html = await res.text()

// you can do whatever you want with this raw html here: clean it up, remove ad banners, etc.
// just ensure an HTML string is returned

const article = await extractFromHtml(html, url)
console.log(article)
```
Parameters
html (required)

An HTML string which contains the article you want to extract.
url (optional)

A URL string that indicates the source of the HTML content. article-extractor may use this info to handle internal/relative links.
parserOptions (optional)

See parserOptions above.
Transformations
Sometimes the default extraction algorithm may not work well. That is when we need transformations.
By adding some functions before and after the main extraction step, we aim to produce the best possible result.
There are 2 methods for working with transformations:

- addTransformations(Object transformation | Array transformations)
- removeTransformations(Array patterns)
First, let's look at the transformation object.

The transformation object

In @extractus/article-extractor, a transformation is an object with the following properties:
- patterns: required, a list of regexps to match the URLs
- pre: optional, a function to process the raw HTML
- post: optional, a function to process the extracted article
Basically, a transformation can be interpreted like this: for the URLs that match these patterns, run the pre function to normalize the HTML content, then extract the main article content from the normalized HTML, and on success, run the post function to normalize the extracted article content.
Here is an example transformation:
```js
{
  patterns: [
    /([\w]+.)?domain.tld\/*/,
    /domain.tld\/articles\/*/
  ],
  pre: (document) => {
    // remove all .advertise-area elements and their siblings from the raw HTML
    document.querySelectorAll('.advertise-area').forEach((element) => {
      if (element.nodeName === 'DIV') {
        while (element.nextSibling) {
          element.parentNode.removeChild(element.nextSibling)
        }
        element.parentNode.removeChild(element)
      }
    })
    return document
  },
  post: (document) => {
    // in the extracted article, replace all h4 tags with h2
    document.querySelectorAll('h4').forEach((element) => {
      const h2Element = document.createElement('h2')
      h2Element.innerHTML = element.innerHTML
      element.parentNode.replaceChild(h2Element, element)
    })
    // change small-sized images to their original version
    document.querySelectorAll('img').forEach((element) => {
      const src = element.getAttribute('src')
      if (src.includes('domain.tld/pics/150x120/')) {
        const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
        element.setAttribute('src', fullSrc)
      }
    })
    return document
  }
}
```
To write better transformation logic, please refer to linkedom and the Document object.
addTransformations(Object transformation | Array transformations)
Adds a single transformation or a list of transformations. For example:
```js
import { addTransformations } from '@extractus/article-extractor'

addTransformations({
  patterns: [
    /([\w]+.)?abc.tld\/*/
  ],
  pre: (document) => {
    // do something with document
    return document
  },
  post: (document) => {
    // do something with document
    return document
  }
})

addTransformations([
  {
    patterns: [
      /([\w]+.)?def.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  },
  {
    patterns: [
      /([\w]+.)?xyz.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  }
])
```
Transformations without patterns will be ignored.
removeTransformations(Array patterns)
Removes transformations that match the specified patterns.
For example, we can remove all added transformations above:
```js
import { removeTransformations } from '@extractus/article-extractor'

removeTransformations([
  /([\w]+.)?abc.tld\/*/,
  /([\w]+.)?def.tld\/*/,
  /([\w]+.)?xyz.tld\/*/
])
```
Calling removeTransformations() without a parameter will remove all current transformations.
Priority order
While processing an article, more than one transformation can be applied.
Suppose that we have the following transformations:
```js
[
  {
    patterns: [
      /http(s?):\/\/google.com\/*/,
      /http(s?):\/\/goo.gl\/*/
    ],
    pre: function_one,
    post: function_two
  },
  {
    patterns: [
      /http(s?):\/\/goo.gl\/*/,
      /http(s?):\/\/google.inc\/*/
    ],
    pre: function_three,
    post: function_four
  }
]
```
As you can see, an article from goo.gl certainly matches both of them.
In this scenario, @extractus/article-extractor will execute both transformations, one by one:

function_one -> function_three -> extraction -> function_two -> function_four
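The selection step above can be sketched as follows. This is an assumption about how matching behaves based on the documented ordering, not the library's actual code; the `name` field and `findMatches()` helper are illustrative only:

```javascript
// Sketch (assumption): every transformation whose patterns match the
// URL is selected, in the order the transformations were registered.
const transformations = [
  {
    patterns: [/http(s?):\/\/google\.com\/*/, /http(s?):\/\/goo\.gl\/*/],
    name: 'first', // stands in for the function_one/function_two pair
  },
  {
    patterns: [/http(s?):\/\/goo\.gl\/*/, /http(s?):\/\/google\.inc\/*/],
    name: 'second', // stands in for the function_three/function_four pair
  },
]

const findMatches = (url) =>
  transformations.filter((t) => t.patterns.some((p) => p.test(url)))

// a goo.gl URL matches both entries, so both pre/post pairs would run
const matched = findMatches('https://goo.gl/abc123').map((t) => t.name)
```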
sanitize-html's options

@extractus/article-extractor uses sanitize-html to clean up the HTML content.
Here are the default options.
Depending on the needs of your content system, you might want to keep some HTML tags/attributes while ignoring others.
There are 2 methods to access and modify these options in @extractus/article-extractor:
- getSanitizeHtmlOptions()
- setSanitizeHtmlOptions(Object sanitizeHtmlOptions)
Read sanitize-html docs for more info.
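For instance, to allow a few extra tags, you could read the current options, merge in your changes, and write them back. The snippet below shows only the merge step on a plain options object: currentOptions is a stand-in for the object returned by getSanitizeHtmlOptions(), and in practice extendedOptions would be passed to setSanitizeHtmlOptions(). The allowedTags/allowedAttributes keys are standard sanitize-html option names:

```javascript
// Sketch: extending sanitize-html options. currentOptions here is a
// hypothetical stand-in for the object getSanitizeHtmlOptions() returns.
const currentOptions = {
  allowedTags: ['a', 'p', 'img'],
  allowedAttributes: { a: ['href'], img: ['src'] },
}

const extendedOptions = {
  ...currentOptions,
  // keep everything already allowed, plus figure/figcaption
  allowedTags: [...currentOptions.allowedTags, 'figure', 'figcaption'],
}
```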
Test
```shell
git clone https://github.com/extractus/article-extractor.git
cd article-extractor
pnpm i
pnpm test
```
Quick evaluation
```shell
git clone https://github.com/extractus/article-extractor.git
cd article-extractor
pnpm i
pnpm eval {URL_TO_PARSE_ARTICLE}
```
License
The MIT License (MIT)
Support the project
If you find value in this open source project, you can support it in the following ways:
- Give it a star ⭐
- Buy me a coffee: https://paypal.me/ndaidong 🍵
- Subscribe to the Article Extractor service on RapidAPI 😉
Thank you.
No security vulnerabilities found.