npmpackage.info

Gathering detailed insights and metrics for office-text-extractor

Other packages similar to office-text-extractor

office-text-extractor-browser

3.1.4

Fork of office-text-extractor with unreleased changes that include browser support

@mtamayo/office-text-extractor

3.0.4

Yet another library to extract text from MS Office and PDF files

doxtract

1.0.13

Very Fast Pure JS Text Extractor For Your Office Files

doc-textify

1.0.1

A Node.js library to extract text from office documents (docx, pptx, xlsx, odt, odp, ods, pdf, text, html ...)


office-text-extractor - 3.0.3 | npmpackage.info

office-text-extractor

Yet another library to extract text from MS Office and PDF files

3.0.3

ISC

TypeScript

33.37 kB

473,424 3.1

Installations

npm install office-text-extractor

Developer Guide

BETA

Typescript

Yes

Module System

ESM

Min. Node Version

>= 16

Node Version

20.12.2

NPM Version

10.5.0

Score

64.5

Supply Chain

93.4

Quality

76.9

Maintenance

100

Vulnerability

96.2

License

Pull Requests

Open

0

Total

3

Closed

1

Merged

2

Issues

Open

7

Total

15

Closed

8

Releases

v3.0.3

Updated on Apr 19, 2024

v3.0.2

Updated on Oct 09, 2023

v3.0.1

Updated on Jul 10, 2023

v3.0.0

Updated on Jul 10, 2023

v2.0.0

Updated on Jun 14, 2022

v1.5.0

Updated on Jun 12, 2021

View All 14 releases

Languages

TypeScript

TypeScript (100%)

Developer

gamemaker1

Download Statistics

Total Downloads

473,424

Last Day

412

Last Week

10,863

Last Month

36,468

Last Year

298,423

GitHub Statistics

ISC License

78 Stars

86 Commits

7 Forks

1 Watchers

1 Branches

1 Contributors

Updated on May 14, 2025

Maintainers

View All 1 Contributors

Package Meta Information

Latest Version

3.0.3

Package Id

office-text-extractor@3.0.3

Unpacked Size

33.37 kB

Size

9.36 kB

File Count

NPM Version

10.5.0

Node Version

20.12.2

Published on

Apr 19, 2024

Total Downloads

Cumulative downloads

Total Downloads

473,424

Last Day

-18.3%

412

Compared to previous day

Last Week

31.5%

10,863

Compared to previous week

Last Month

-14%

36,468

Compared to previous month

Last Year

133.5%

298,423

Compared to previous year


Dependencies

fflate file-type got js-yaml mammoth pdf-parse text-encoding xlsx xml2js

Dev Dependencies

@types/js-yaml @types/node @types/text-encoding @types/xml2js ava np npm-run-all prettier tsx typescript xo

office-text-extractor

yet another library to extract text from docx, pptx, xlsx, and pdf files.

similar libraries

there are other great libraries that do the same job and have inspired this project, such as:

however, office-text-extractor has the following differences:

parses file based on its mime type, not its file extension.
does not spawn a child process to use a tool installed on the device.
reads and returns text from the file if it contains plain text.
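
to illustrate the first point, here is a minimal sketch of content-based detection using a file's leading "magic bytes". the `sniffMime` helper is hypothetical and is not this package's implementation (the dependency list suggests `file-type` handles detection); it only shows why a misleading file extension does not matter when the content itself is inspected.

```typescript
// Hypothetical helper: guess a MIME type from a file's leading "magic bytes".
// This is NOT office-text-extractor's code; it is an illustration only.
function sniffMime(bytes: Uint8Array): string | undefined {
  const startsWith = (signature: number[]): boolean =>
    signature.every((byte, index) => bytes[index] === byte)

  // "%PDF" marks a PDF document.
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return 'application/pdf'
  // "PK\x03\x04" marks a ZIP container. docx, pptx, and xlsx files are ZIP archives,
  // so telling them apart requires looking inside the archive (not done here).
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return 'application/zip'
  return undefined
}

// A PDF saved with a misleading ".txt" extension is still recognised by its content.
const mislabelled = new TextEncoder().encode('%PDF-1.7 ...')
console.log(sniffMime(mislabelled)) // prints "application/pdf"
```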

libraries used

this package uses some amazing existing libraries that perform better than the ones that originally existed in this module, and are therefore used instead:

pdf-parse, for parsing pdf files
xlsx, for parsing xlsx files
mammoth, for parsing docx files

a big thank you to the contributors of these projects!

installation

node

from version 2.0.0 onwards, this package is pure esm. please read this article for a guide on how to ensure your project can import this library.

to use office-text-extractor in a Node project, install it using npm/pnpm/yarn:

npm install office-text-extractor
pnpm add office-text-extractor
yarn add office-text-extractor

browser

the library currently cannot be used in the browser due to its usage of the node:buffer library. pull requests that can replace node:buffer with a different library are welcome!

usage

an example of using the library to extract text is as follows:

import { readFile } from 'node:fs/promises'
import { getTextExtractor } from 'office-text-extractor'

// this function returns a new instance of the `TextExtractor` class, with the default
// extraction methods (docx, pptx, xlsx, pdf) registered.
const extractor = getTextExtractor()

// extract text from a url, because that's a neat first example :p
const url = 'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const textFromUrl = await extractor.extractText({ input: url, type: 'url' })

// you can extract text from a file too, like so:
const path = 'stuff/boring.pdf'
const textFromFile = await extractor.extractText({ input: path, type: 'file' })

// if you have a buffer with the file in it, you can pass that too:
const buffer = await readFile(path)
const textFromBuffer = await extractor.extractText({ input: buffer, type: 'buffer' })

console.log(textFromUrl, textFromFile, textFromBuffer)
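
the three input kinds above (url, file, buffer) all have to end up as raw bytes before any parsing can happen. the sketch below shows one way that normalisation could look. the `ExtractionInput` type and `toBytes` function are hypothetical names, not part of this package, and the global `fetch` used here assumes Node 18 or newer.

```typescript
import { readFile } from 'node:fs/promises'

// Hypothetical shape mirroring the `{ input, type }` argument used above.
type ExtractionInput =
  | { input: string; type: 'url' }
  | { input: string; type: 'file' }
  | { input: Uint8Array; type: 'buffer' }

// Resolve any of the three input kinds to raw bytes.
async function toBytes(source: ExtractionInput): Promise<Uint8Array> {
  switch (source.type) {
    case 'buffer':
      // Already in memory: pass it through untouched.
      return source.input
    case 'file':
      // Read the whole file from disk.
      return readFile(source.input)
    case 'url': {
      // Download the resource and fail loudly on a non-2xx response.
      const response = await fetch(source.input)
      if (!response.ok) throw new Error(`request failed with status ${response.status}`)
      return new Uint8Array(await response.arrayBuffer())
    }
  }
}
```

the discriminated union lets the compiler reject mismatched pairs, such as a buffer passed with `type: 'file'`.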

the following is an example of how to create and use your own text extraction method:

import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'

/**
 * Extracts text from images.
 */
class ImageExtractor implements TextExtractionMethod {
  /**
   * The mime types of the file that the extractor accepts.
   */
  mimes = ['image/png', 'image/jpeg']

  /**
   * Extracts text from the image file passed by the user.
   */
  apply = async (input: Buffer): Promise<string> => {
    // `processImage` is a placeholder for your own image-to-text function;
    // it is not provided by this package.
    const text = await processImage(input)
    return text
  }
}

// create a new extractor and register our extraction method
const extractor = new TextExtractor()
extractor.addMethod(new ImageExtractor())

// then use it like you would normally
const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
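
to make the idea of registering methods by mime type more concrete, here is a self-contained sketch of a registry that dispatches on mime type. `ExtractionMethod` and `MethodRegistry` are hypothetical names modelled on the `mimes` and `apply` members shown above; how `TextExtractor` works internally is not documented here, so treat this as an assumption rather than the package's actual design.

```typescript
// Hypothetical interface, modelled on the `mimes` and `apply` members shown above.
interface ExtractionMethod {
  mimes: string[]
  apply: (input: Uint8Array) => Promise<string>
}

// Hypothetical registry: maps each MIME type to the method registered for it.
class MethodRegistry {
  private readonly methods = new Map<string, ExtractionMethod>()

  // Register a method under every MIME type it accepts; later registrations win.
  addMethod(method: ExtractionMethod): this {
    for (const mime of method.mimes) this.methods.set(mime, method)
    return this
  }

  // Look up the method for a MIME type and run it, or fail if none is registered.
  async extract(mime: string, input: Uint8Array): Promise<string> {
    const method = this.methods.get(mime)
    if (!method) throw new Error(`no extraction method registered for ${mime}`)
    return method.apply(input)
  }
}

// Usage: a trivial method that decodes plain text as UTF-8.
const registry = new MethodRegistry().addMethod({
  mimes: ['text/plain'],
  apply: async (input) => new TextDecoder().decode(input),
})

registry
  .extract('text/plain', new TextEncoder().encode('hello'))
  .then((text) => console.log(text)) // prints "hello"
```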

license

this project is licensed under the ISC license. please see license.md for more details.

No vulnerabilities found.

Dangerous-Workflow: 10
Determines if the project's GitHub Action workflows avoid dangerous patterns.

Binary-Artifacts: 10
Determines if the project has generated executable (binary) artifacts in the source repository.

License: 10
Determines if the project has defined a license.

Vulnerabilities: 6
Determines if the project has open, known unfixed vulnerabilities.

Maintained: 0
Determines if the project is "actively maintained".

Code-Review: 0
Determines if the project requires human code review before pull requests (aka merge requests) are merged.

Token-Permissions: 0
Determines if the project's workflows follow the principle of least privilege.

CII-Best-Practices: 0
Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.

Pinned-Dependencies: 0
Determines if the project has declared and pinned the dependencies of its build process.

Security-Policy: 0
Determines if the project has published a security policy.

Fuzzing: 0
Determines if the project uses fuzzing.

Branch-Protection: 0
Determines if the default and release branches are protected with GitHub's branch protection settings.

SAST: 0
Determines if the project uses static code analysis.

Score: 3.1/10

Last Scanned on 2025-06-30

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.

Learn More