Installations

npm install cldr-segmentation

Pull Requests

Open

0

Total

6

Closed

0

Merged

6

Issues

Open

0

Total

12

Closed

12

Developer

camertron

Developer Guide

Module System

CommonJS

Min. Node Version

TypeScript Support

No

Node Version

21.1.0

NPM Version

10.2.0

Statistics

38 Stars

43 Commits

5 Forks

5 Watching

2 Branches

4 Contributors

Updated on 24 Jul 2024

Languages

JavaScript (98.67%)

Ruby (1.08%)

HTML (0.25%)

Total Downloads

Cumulative downloads

498,582

Last day: 1,084 (-26.2% compared to previous day)

Last week: 6,936 (2.3% compared to previous week)

Last month: 23,145 (63.9% compared to previous month)

Last year: 208,379 (17.3% compared to previous year)

Dependencies

1

utfstring

Dev Dependencies

11

babel-core babel-plugin-transform-class-properties babel-plugin-transform-es2015-modules-umd babel-preset-es2015 cldrSegmentation grunt grunt-babel grunt-contrib-concat grunt-contrib-uglify jasmine-node load-grunt-tasks

cldr-segmentation

Text segmentation library for JavaScript.

What is this thing?

This library provides CLDR-based text segmentation capabilities in JavaScript. Text segmentation is the process of identifying word, sentence, and other boundaries in a text. The segmentation rules are published by the Unicode Consortium as part of the Common Locale Data Repository, or CLDR, and made freely available to the public.

Why not just split on spaces or periods?

Good question. Most of the time, that'll probably work fine. However, it's not always obvious where words or sentences should start or end. Consider this sentence:

I like Mrs. Murphy. She's nice.

Splitting only on periods will give you ["I like Mrs. ", "Murphy. ", "She's nice."], which probably isn't what you wanted - the period after Mrs doesn't indicate the end of the sentence.
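The naive approach is easy to reproduce. Here is a minimal sketch, where naiveSentenceSplit is a hypothetical helper (not part of this library) that breaks after every period and keeps any trailing whitespace with the preceding chunk:

```javascript
// Hypothetical helper, not part of cldr-segmentation: break after every
// period, keeping trailing whitespace attached to the preceding chunk.
// Text after the last period, if any, is dropped by this simple pattern.
function naiveSentenceSplit(text) {
  return text.match(/[^.]+\.\s*/g) || [];
}

const naiveResult = naiveSentenceSplit("I like Mrs. Murphy. She's nice.");
console.log(naiveResult);
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
```

The abbreviation "Mrs." is indistinguishable from a sentence ending under this rule, which is the problem the segmentation rules and suppressions are meant to address.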

In addition, other languages use different segmentation rules than English. For example, identifying sentence boundaries in Japanese is a little more difficult because sentences tend to end with \u3002 - the ideographic full stop - as opposed to a period. The CLDR contains support for hundreds of languages, meaning you don't have to consider every language when dealing with international text.
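To see why a period-based split fails outright on Japanese text, here is a small sketch in plain JavaScript. The sample sentence and variable names are illustrative only, and a single hard-coded character is no substitute for the CLDR rules:

```javascript
// Two short Japanese sentences, each ending with U+3002 (ideographic full stop).
const japanese = "これはペンです。それは本です。";

// Splitting on the ASCII period finds nothing to split on.
const byPeriod = japanese.split(".");
console.log(byPeriod.length); // => 1

// Breaking after U+3002 recovers both sentences.
const byFullStop = japanese.match(/[^\u3002]+\u3002/g) || [];
console.log(byFullStop.length); // => 2
```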

Examples

cldr-segmentation is published as both a UMD module and an ES6 module, so it should work in Node via require or import and in the browser via a <script> tag. In the browser, use window.cldrSegmentation to access the library's functionality.

UMD module:

const cldrSegmentation = require("cldr-segmentation");

ES6 module:

import * as cldrSegmentation from 'cldr-segmentation';

Sentence Segmentation

cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.");
// => ["I like Mrs. ", "Murphy. ", "She's nice."]

You'll notice that Mrs. was treated as the end of a sentence. To avoid this, use the suppressions for the language you care about. Suppressions are essentially arrays of strings. Each string represents a series of characters after which there should not be a break. Using the English suppressions for the example above yields better results:

var supp = cldrSegmentation.suppressions.en;
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.", supp);
// => ["I like Mrs. Murphy. ", "She's nice."]
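Conceptually, a suppression check asks whether the text before a proposed break ends with one of the suppression strings. The sketch below illustrates that idea in isolation. splitWithSuppressions is a hypothetical function that uses a simple period-plus-whitespace rule rather than the CLDR rules, so it should not be read as how the library is implemented:

```javascript
// Hypothetical illustration only, not the library's implementation.
// Proposes a break after every period followed by whitespace, then
// rejects the break if the text so far ends with a suppression string.
function splitWithSuppressions(text, suppressions) {
  const sentences = [];
  const boundary = /\.\s+/g;
  let start = 0;
  let match;

  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    // Text from the current sentence start through the period itself.
    const candidate = text.slice(start, match.index + 1);
    const suppressed = suppressions.some((s) => candidate.endsWith(s));

    if (!suppressed) {
      sentences.push(text.slice(start, end));
      start = end;
    }
  }

  // Whatever remains after the last accepted break is the final sentence.
  if (start < text.length) {
    sentences.push(text.slice(start));
  }
  return sentences;
}

const input = "I like Mrs. Murphy. She's nice.";
console.log(splitWithSuppressions(input, []));
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
console.log(splitWithSuppressions(input, ["Mrs."]));
// => ["I like Mrs. Murphy. ", "She's nice."]
```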

If you'd like to iterate over each sentence instead of splitting, use a BreakIterator:

var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";

breakIter.eachSentence(str, (sentence, start, stop) => {
  // do something
});
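The callback receives each sentence along with its start and stop offsets. To show that callback shape without depending on this library, the sketch below defines a hypothetical stand-in eachSentence built on the built-in Intl.Segmenter. Treating stop as an exclusive offset is an assumption, the offsets here are UTF-16 code unit indexes, and Intl.Segmenter may place boundaries differently than cldr-segmentation does:

```javascript
// Hypothetical stand-in built on the built-in Intl.Segmenter; it is not
// this library's BreakIterator. Calls back with (sentence, start, stop),
// where stop is assumed to be exclusive.
function eachSentence(text, callback, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  for (const { segment, index } of segmenter.segment(text)) {
    callback(segment, index, index + segment.length);
  }
}

const str = "I like Mrs. Murphy. She's nice.";
const ranges = [];

eachSentence(str, (sentence, start, stop) => {
  // Collect each sentence with its offsets into the original string.
  ranges.push({ sentence, start, stop });
});

console.log(ranges);
```

Iterating this way is handy when the offsets matter, for example when highlighting sentences in place rather than building an array of substrings.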

Suppressions for all languages are available via cldrSegmentation.suppressions.all.

Other Types of Segmentation

Word, line, and grapheme cluster segmentation are supported:

cldrSegmentation.wordSplit("I like Mrs. Murphy. She's nice.");
// => ["I", " ", "like", " ", "Mrs", ".", " ", "Murphy", ".", " ", "She's", " ", "nice", "."]

Also available are the lineSplit and graphemeSplit functions.
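Grapheme cluster segmentation matters because a single user-perceived character can span several code points. The sketch below uses the built-in Intl.Segmenter as a stand-in (it is not this library's graphemeSplit) to compare the three ways of counting; countUnits is a hypothetical helper:

```javascript
// Hypothetical helper: count UTF-16 code units, code points, and
// grapheme clusters in a string using only built-in APIs.
function countUnits(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: text.length,
    codePoints: Array.from(text).length,
    graphemes: Array.from(segmenter.segment(text)).length,
  };
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one character.
const accented = countUnits("e\u0301");
// Thumbs-up emoji followed by a skin tone modifier renders as one character.
const thumbsUp = countUnits("\u{1F44D}\u{1F3FD}");

console.log(accented); // codeUnits: 2, codePoints: 2, graphemes: 1
console.log(thumbsUp); // codeUnits: 4, codePoints: 2, graphemes: 1
```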

When using a break iterator:

var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";

breakIter.eachWord(str, (word, start, stop) => {
  // do something
});

Also available are the eachLine and eachGraphemeCluster functions.

Custom Suppressions

Suppressions are just strings after which a break should not occur. This library comes with a set of common suppressions for a variety of languages, but you may want to add your own. Suppression objects can be merged. For example, here's how to add "Dr." to the set of English suppressions:

var customSupps = cldrSegmentation.Suppressions.create(['Dr.']);
var supps = cldrSegmentation.suppressions.en.merge(customSupps);
cldrSegmentation.sentenceSplit("We love Dr. Strange. He's cool.", supps);

Custom Suppression Objects

Suppression objects are just plain ol' JavaScript objects with a single shouldBreak function that returns a boolean. The function is passed a cursor object positioned at the index of the proposed break. Cursors deal exclusively with Unicode code points, meaning your custom suppression logic will need to be implemented in those terms. For example, let's create a custom suppression function that doesn't allow breaks after sentences that end with the letter 't'.

class TeeSuppression {
  shouldBreak(cursor) {
    var position = cursor.logicalPosition;
    // declared outside the loop so the while condition can see it
    var cp;

    // skip backwards past spaces (32) and periods (46)
    do {
      cp = cursor.getCodePoint(position);
      position--;
    } while (cp === 32 || cp === 46);

    // we skipped one too many in the loop
    position++;

    // if the ending character is 't' (116), return false;
    // otherwise return true
    return cursor.getCodePoint(position) !== 116;
  }
}
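The numeric literals in the class above are Unicode code points: 32 is a space, 46 is a period, and 116 is a lowercase 't'. A short sketch using only built-in string methods shows where those values come from, and why code point indexes can differ from ordinary string indexes. The sample string is illustrative:

```javascript
// Look up the code points used as literals in the suppression logic.
const SPACE = " ".codePointAt(0);
const PERIOD = ".".codePointAt(0);
const LOWER_T = "t".codePointAt(0);
console.log(SPACE, PERIOD, LOWER_T); // => 32 46 116

// A character outside the Basic Multilingual Plane occupies two UTF-16
// code units but only one code point, so the two index spaces diverge.
const sample = "a\u{1F600}t";
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));

console.log(sample.length);             // => 4 (UTF-16 code units)
console.log(codePoints.length);         // => 3 (code points)
console.log(codePoints[2] === LOWER_T); // => true
```

Named constants like these can make custom suppression logic easier to read than bare numbers.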

Note that you don't have to use ES6 classes. It's equally valid to create a simple object:

let teeSuppression = {
  shouldBreak: (cursor) => {
    // logic here
  }
};

Running Tests

Tests are written in Jasmine and can be executed via jasmine-node:

npm install -g jasmine-node
jasmine-node spec

Authors

Written and maintained by Cameron C. Dutro (@camertron).

License

No vulnerabilities found.

Binary-Artifacts: 10/10
Determines if the project has generated executable (binary) artifacts in the source repository.

Vulnerabilities: 10/10
Determines if the project has open, known unfixed vulnerabilities.

License: 10/10
Determines if the project has defined a license.

Code-Review: 1/10
Determines if the project requires human code review before pull requests (aka merge requests) are merged.

Maintained: 0/10
Determines if the project is "actively maintained".

CII-Best-Practices: 0/10
Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.

Security-Policy: 0/10
Determines if the project has published a security policy.

Fuzzing: 0/10
Determines if the project uses fuzzing.

Branch-Protection: 0/10
Determines if the default and release branches are protected with GitHub's branch protection settings.

SAST: 0/10
Determines if the project uses static code analysis.

Score: 3.2/10

Last Scanned on 2024-11-25

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.
