npmpackage.info

Gathering detailed insights and metrics for word-extractor

Other packages similar to word-extractor

@types/word-extractor

1.0.6

TypeScript definitions for word-extractor

@gmr-fms/word-extractor

0.3.0

Node.js package to read Word .doc files

office-text-extractor

3.0.3

Yet another library to extract text from MS Office and PDF files

@lambda121/word-extractor

1.0.0

Node.js package to read Word .doc files

word-extractor

Read data from a Word document using node.js

1.0.4

145

MIT

JavaScript

72.47 kB

2.6

Installations

npm install word-extractor

Developer Guide

BETA

Typescript

No

Module System

CommonJS

Node Version

12.16.1

NPM Version

6.14.4

Pull Requests

Open

2

Total

23

Closed

5

Merged

16

Issues

Open

7

Total

45

Closed

38

Languages

JavaScript

JavaScript (100%)

Developer

morungos

GitHub Statistics

MIT License

145 Stars

260 Commits

32 Forks

5 Watchers

3 Branches

5 Contributors

Updated on Jul 09, 2025

Package Meta Information

Latest Version

1.0.4

Package Id

word-extractor@1.0.4

Unpacked Size

72.47 kB

Size

18.37 kB

File Count

NPM Version

6.14.4

Node Version

12.16.1

Dependencies

saxes yauzl

Dev Dependencies

eslint jest jest-specific-snapshot jsdoc

word-extractor

Read data from a Word document (.doc or .docx) using Node.js

Why use this module?

There are a fair number of npm components which can extract text from Word .doc files, but they often appear to require some external helper program, and involve either spawning a process or communicating with a persistent one. That raises the installation and deployment burden as well as the runtime one.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the Node.js environment.

This means you do not need to install Word, Office, or anything else, and the module will work on all platforms, without any native binary code requirements.

As of version 1.0, this module supports both traditional, OLE-based Word files (usually .doc) and modern, Office Open XML (ECMA-376) Word files (usually .docx). It can be used both with files and with file contents in a Node.js Buffer.

How do I install this module?

yarn add word-extractor

# Or using npm...
npm install word-extractor

How do I use this module?

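const WordExtractor = require("word-extractor");
const extractor = new WordExtractor();

// extract() returns a promise that resolves to a document object
const extracted = extractor.extract("file.doc");

extracted
  .then(function (doc) { console.log(doc.getBody()); })
  .catch(function (err) { console.error(err); });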

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.
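Because extract() returns a promise, the same flow can also be written with async/await. The following is a minimal sketch rather than the package's documented usage: the helper name printBody and the choice to pass the extractor in as a parameter are assumptions made for illustration, and the error handling is only one reasonable option.

```javascript
// Sketch: async/await form of the promise-based example.
// `printBody` is a hypothetical helper, not part of word-extractor.
// `extractor` is expected to be a WordExtractor instance (or anything
// with a compatible extract() method); `source` is a filename or Buffer.
async function printBody(extractor, source) {
  try {
    // extract() resolves to a document object
    const doc = await extractor.extract(source);
    const body = doc.getBody();
    console.log(body);
    return body;
  } catch (err) {
    // A missing or unreadable file would be expected to reject the promise
    console.error("Could not read the document:", err.message);
    return null;
  }
}

// With the package installed, this could be wired up as:
//   const WordExtractor = require("word-extractor");
//   printBody(new WordExtractor(), "file.doc");
```

Passing the extractor in keeps the helper easy to exercise with a stand-in object when no Word file is at hand.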

Methods

WordExtractor#extract(<filename> | <Buffer>)

Main method to open a Word file and retrieve the data. Returns a promise which resolves to a Document. If a Buffer is passed instead of a filename, the buffer is used directly, instead of reading a file from disk.
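As an illustration of the Buffer form, the sketch below reads the file contents itself and hands the resulting Buffer to extract(). The helper name extractFromBuffer is hypothetical, and the extractor is passed in as a parameter purely for convenience. The same pattern should apply when the bytes come from somewhere other than disk, such as an upload, although that is not shown here.

```javascript
const fs = require("fs").promises;

// Sketch: pass file contents to extract() as a Buffer instead of a filename.
// `extractFromBuffer` is a hypothetical helper, not part of word-extractor.
async function extractFromBuffer(extractor, path) {
  // Read the raw bytes ourselves; no encoding argument, so we get a Buffer
  const buffer = await fs.readFile(path);

  // extract() accepts the Buffer directly and resolves to a document object
  const doc = await extractor.extract(buffer);
  return doc.getBody();
}

// With the package installed, this could be wired up as:
//   const WordExtractor = require("word-extractor");
//   extractFromBuffer(new WordExtractor(), "file.docx").then(console.log);
```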

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getEndnotes()

Retrieves the endnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders(options?)

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getHeaders() returns one string, containing all headers and footers. This is compatible with previous versions. If you want to separate headers and footers, use getHeaders({includeFooters: false}), to return only the headers, and the new method getFooters() (from version 1.0.1) to return the footers separately.

Document#getFooters()

From version 1.0.1. Retrieves the footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.
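The difference between the combined and separated calls can be sketched as follows. The helper name splitHeadersAndFooters is an assumption for illustration; only getHeaders(), its includeFooters option, and getFooters() come from the description above, and getFooters() assumes version 1.0.1 or later.

```javascript
// Sketch: the three header/footer views of a document side by side.
// `splitHeadersAndFooters` is a hypothetical helper; `doc` is the object
// that extract() resolved to.
function splitHeadersAndFooters(doc) {
  return {
    // Default behaviour: one string with headers and footers together
    combined: doc.getHeaders(),
    // Headers only
    headers: doc.getHeaders({ includeFooters: false }),
    // Footers only (requires version 1.0.1 or later)
    footers: doc.getFooters(),
  };
}

// Possible usage once a document has been extracted:
//   extractor.extract("file.docx").then((doc) => {
//     const { headers, footers } = splitHeadersAndFooters(doc);
//     console.log(headers, footers);
//   });
```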

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.
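Since each of these methods returns a string, several views can be gathered into one plain object for later processing. This is only a sketch: collectViews is a hypothetical helper name, and the choice of which views to collect is arbitrary.

```javascript
// Sketch: gather the main text views of a document into one object.
// `collectViews` is a hypothetical helper; `doc` is the object that
// extract() resolved to.
function collectViews(doc) {
  return {
    body: doc.getBody(),
    footnotes: doc.getFootnotes(),
    endnotes: doc.getEndnotes(),
    annotations: doc.getAnnotations(), // comment bubble text
  };
}

// Possible usage once a document has been extracted:
//   extractor.extract("file.doc").then((doc) => {
//     const views = collectViews(doc);
//     console.log(views.footnotes);
//   });
```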

Document#getTextboxes(options?)

Retrieves the textbox content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getTextboxes() returns one string, containing all textbox content from both main document and the headers and footers. You can control what gets included by using the options includeHeadersAndFooters (which defaults to true) and includeBody (also defaults to true). So, as an example, if you only want the body text box content, use: doc.getTextboxes({includeHeadersAndFooters: false}).
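The two options can be combined to select different subsets of textbox content, as sketched below. The helper name textboxViews is hypothetical, and the assumption that includeBody: false leaves only header and footer textboxes is inferred from the option descriptions above rather than stated outright.

```javascript
// Sketch: textbox content selected with the two documented options.
// `textboxViews` is a hypothetical helper; `doc` is the object that
// extract() resolved to.
function textboxViews(doc) {
  return {
    // Default: body plus header/footer textboxes in one string
    all: doc.getTextboxes(),
    // Body textboxes only
    bodyOnly: doc.getTextboxes({ includeHeadersAndFooters: false }),
    // Assumed: header/footer textboxes only
    headersAndFootersOnly: doc.getTextboxes({ includeBody: false }),
  };
}

// Possible usage once a document has been extracted:
//   extractor.extract("file.docx").then((doc) => {
//     console.log(textboxViews(doc).bodyOnly);
//   });
```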

License

Licensed under the MIT License.

No vulnerabilities found.

10

Dangerous-Workflow

Determines if the project's GitHub Action workflows avoid dangerous patterns.

10

Binary-Artifacts

Determines if the project has generated executable (binary) artifacts in the source repository.

10

License

Determines if the project has defined a license.

1

Code-Review

Determines if the project requires human code review before pull requests (aka merge requests) are merged.

0

Maintained

Determines if the project is "actively maintained".

0

Token-Permissions

Determines if the project's workflows follow the principle of least privilege.

0

Pinned-Dependencies

Determines if the project has declared and pinned the dependencies of its build process.

0

CII-Best-Practices

Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.

0

Security-Policy

Determines if the project has published a security policy.

0

Fuzzing

Determines if the project uses fuzzing.

0

Branch-Protection

Determines if the default and release branches are protected with GitHub's branch protection settings.

0

SAST

Determines if the project uses static code analysis.

0

Vulnerabilities

Determines if the project has open, known unfixed vulnerabilities.

Score: 2.6/10

Last Scanned on 2025-07-07

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.
