Gathering detailed insights and metrics for cldr-segmentation
npm install cldr-segmentation
38 Stars
43 Commits
5 Forks
5 Watching
2 Branches
4 Contributors
Updated on 24 Jul 2024
JavaScript (98.67%)
Ruby (1.08%)
HTML (0.25%)
Cumulative downloads
Last day: 1,084 (-26.2% compared to previous day)
Last week: 6,936 (2.3% compared to previous week)
Last month: 23,145 (63.9% compared to previous month)
Last year: 208,379 (17.3% compared to previous year)
Text segmentation library for JavaScript.
This library provides CLDR-based text segmentation capabilities in JavaScript. Text segmentation is the process of identifying word, sentence, and other boundaries in a text. The segmentation rules are published by the Unicode consortium as part of the Common Locale Data Repository, or CLDR, and made freely available to the public.
Splitting text on spaces or periods will probably work fine most of the time. However, it's not always obvious where words or sentences should start or end. Consider this sentence:
I like Mrs. Murphy. She's nice.
Splitting only on periods will give you ["I like Mrs. ", "Murphy. ", "She's nice."], which probably isn't what you wanted, since the period after Mrs doesn't indicate the end of the sentence.
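To make the problem concrete, here is a minimal sketch of a period-only splitter. The function name `splitOnPeriods` is made up for illustration and is not part of cldr-segmentation.

```javascript
// Hypothetical helper: treats every period (plus trailing whitespace)
// as the end of a sentence. Not part of cldr-segmentation.
function splitOnPeriods(text) {
  // One or more non-period characters, a period, then any whitespace.
  return text.match(/[^.]+\.\s*/g) || [];
}

const pieces = splitOnPeriods("I like Mrs. Murphy. She's nice.");
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
```

The abbreviation is cut off from the name it belongs to, which is exactly the failure described above.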
In addition, other languages use different segmentation rules than English. For example, identifying sentence boundaries in Japanese is a little more difficult because sentences tend to end with \u3002 (the ideographic full stop) as opposed to a period. The CLDR contains support for hundreds of languages, meaning you don't have to consider every language when dealing with international text.
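As a small illustration of why hand-written rules don't travel well, the sketch below splits a Japanese string first on the ASCII period and then with a pattern that also knows about U+3002. Both approaches are naive stand-ins written for this example, not the CLDR rules.

```javascript
const japanese = "これはペンです。あれは本です。";

// Splitting on the ASCII period finds no boundary at all.
const byPeriod = japanese.split(".");

// Hypothetical workaround: also treat U+3002 as a sentence terminator.
const byEither = japanese.match(/[^.\u3002]+[.\u3002]\s*/g) || [];

console.log(byPeriod.length); // 1
console.log(byEither.length); // 2
```

Every additional language would need another special case like this one, which is the work the CLDR data is meant to save.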
Cldr-segmentation is published as both a UMD module and an ES6 module, meaning it should work in node via require or import and the browser via a <script> tag. In the browser, use window.cldrSegmentation to access the library's functionality.
UMD module:
const cldrSegmentation = require("cldr-segmentation");
ES6 module:
import * as cldrSegmentation from 'cldr-segmentation'
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.");
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
You'll notice that Mrs. was treated as the end of a sentence. To avoid this, use the suppressions for the language you care about. Suppressions are essentially arrays of strings. Each string represents a series of characters after which there should not be a break. Using the English suppressions for the example above yields better results:
var supp = cldrSegmentation.suppressions.en;
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.", supp);
// => ["I like Mrs. Murphy. ", "She's nice."]
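The idea behind suppressions can be sketched without the library. The function below is a simplified stand-in, not the library's implementation: it proposes a break after every period followed by whitespace, and vetoes the break when the text so far ends with one of the suppression strings. The name `naiveSentenceSplit` and the single-entry suppression list are assumptions made for this example.

```javascript
// Hypothetical helper, not part of cldr-segmentation.
function naiveSentenceSplit(text, suppressions = []) {
  const sentences = [];
  let start = 0;
  const boundary = /\.\s+/g; // a period followed by whitespace
  let match;

  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + 1; // index just past the period
    const head = text.slice(start, end);

    // Veto the break if the text so far ends with a suppression string.
    const suppressed = suppressions.some((s) => head.endsWith(s));
    if (!suppressed) {
      sentences.push(text.slice(start, boundary.lastIndex));
      start = boundary.lastIndex;
    }
  }

  // Whatever remains after the last accepted break is the final sentence.
  if (start < text.length) sentences.push(text.slice(start));
  return sentences;
}

const text = "I like Mrs. Murphy. She's nice.";
const withoutSupp = naiveSentenceSplit(text);
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
const withSupp = naiveSentenceSplit(text, ["Mrs."]);
// => ["I like Mrs. Murphy. ", "She's nice."]
```

The real library applies the CLDR break rules rather than a single regular expression, so this only mirrors the behavior for simple inputs like the one shown.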
If you'd like to iterate over each sentence instead of splitting, use a BreakIterator:
var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";

breakIter.eachSentence(str, (sentence, start, stop) => {
  // do something
});
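One thing the callback might do is record each sentence together with its offsets. The sketch below assumes only what the snippet above shows, namely that the callback receives `(sentence, start, stop)`. The helper `collectSentenceSpans` and the stand-in iterator `fakeIter` are hypothetical; the stand-in exists so the example runs without the package installed, and it treats `stop` as an exclusive offset, which is an assumption about the real iterator.

```javascript
// Hypothetical helper: gathers the callback arguments into an array.
function collectSentenceSpans(breakIter, text) {
  const spans = [];
  breakIter.eachSentence(text, (sentence, start, stop) => {
    spans.push({ sentence, start, stop });
  });
  return spans;
}

// Stand-in iterator (hypothetical). It reports one hard-coded boundary;
// the real BreakIterator computes boundaries from the CLDR rules.
const fakeIter = {
  eachSentence(text, callback) {
    const cut = text.indexOf(" She") + 1;
    callback(text.slice(0, cut), 0, cut);
    callback(text.slice(cut), cut, text.length);
  },
};

const spans = collectSentenceSpans(fakeIter, "I like Mrs. Murphy. She's nice.");
// spans[1] => { sentence: "She's nice.", start: 20, stop: 31 }
```

With the package installed, passing a real `new cldrSegmentation.BreakIterator(supp)` in place of `fakeIter` should work the same way, provided the callback signature matches.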
Suppressions for all languages are available via cldrSegmentation.suppressions.all.
Word, line, and grapheme cluster segmentation are supported:
cldrSegmentation.wordSplit("I like Mrs. Murphy. She's nice.");
// => ["I", " ", "like", " ", "Mrs", ".", " ", "Murphy", ".", " ", "She's", " ", "nice", "."]
Also available are the lineSplit and graphemeSplit functions.
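Because word segmentation returns whitespace and punctuation as segments of their own, counting words usually means filtering the result. The sketch below works on a hand-written array in the shape wordSplit returns; the helper name `countWords` is hypothetical.

```javascript
// Hypothetical helper: counts segments that contain at least one
// letter or digit, skipping whitespace and punctuation segments.
function countWords(segments) {
  return segments.filter((segment) => /[\p{L}\p{N}]/u.test(segment)).length;
}

// Hand-written sample in the shape of a wordSplit result.
const segments = ["We", " ", "love", " ", "Dr", ".", " ", "Strange", "."];
const wordCount = countWords(segments);
console.log(wordCount); // 4
```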
When using a break iterator:
var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";

breakIter.eachWord(str, (word, start, stop) => {
  // do something
});
Also available are the eachLine and eachGraphemeCluster functions.
Suppressions are just strings after which a break should not occur. This library comes with a set of common suppressions for a variety of languages, but you may want to add your own. Suppression objects can be merged. For example, here's how to add "Dr." to the set of English suppressions:
var customSupps = cldrSegmentation.Suppressions.create(['Dr.']);
var supps = cldrSegmentation.suppressions.en.merge(customSupps);
cldrSegmentation.sentenceSplit("We love Dr. Strange. He's cool.", supps);
Suppression objects are just plain ol' JavaScript objects with a single shouldBreak function that returns a boolean. The function is passed a cursor object positioned at the index of the proposed break. Cursors deal exclusively with Unicode codepoints, meaning your custom suppression logic will need to be implemented in those terms. For example, let's create a custom suppression function that doesn't allow breaks after sentences that end with the letter 't'.
class TeeSuppression {
  shouldBreak(cursor) {
    var position = cursor.logicalPosition;
    // declared outside the loop so the while condition can see it
    let cp;

    // skip backwards past spaces (32) and periods (46)
    do {
      cp = cursor.getCodePoint(position);
      position --;
    } while (cp === 32 || cp === 46);

    // we skipped one too many in the loop
    position ++;

    // if the ending character is 't' (116), return false;
    // otherwise return true
    return cursor.getCodePoint(position) !== 116;
  }
}
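Custom suppression logic can be exercised in isolation with a stand-in cursor. The sketch below assumes the cursor exposes only the two members used above, `logicalPosition` and `getCodePoint`; the factory `makeCursor` is hypothetical. It also assumes the position points at the whitespace just before the next sentence, and where the real cursor sits relative to the break is not confirmed here.

```javascript
// Stand-in cursor (hypothetical): exposes only the members used above.
function makeCursor(text, position) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  return {
    logicalPosition: position,
    getCodePoint: (index) => codePoints[index],
  };
}

const teeSuppression = {
  shouldBreak(cursor) {
    let position = cursor.logicalPosition;
    let cp;

    // skip backwards past spaces (32) and periods (46)
    do {
      cp = cursor.getCodePoint(position);
      position--;
    } while (cp === 32 || cp === 46);

    // step forward again onto the last non-space, non-period character
    position++;

    // break only if that character is not 't' (116)
    return cursor.getCodePoint(position) !== 116;
  },
};

// Position 10 is the space after "hot."; position 11 is the space after "cold.".
const afterHot = teeSuppression.shouldBreak(makeCursor("It is hot. Yes.", 10));
const afterCold = teeSuppression.shouldBreak(makeCursor("It is cold. Yes.", 11));
console.log(afterHot, afterCold); // false true
```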
Note that you don't have to use ES6 classes. It's equally valid to create a simple object:
let teeSuppression = {
  shouldBreak: (cursor) => {
    // logic here
  }
}
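Since a suppression is any object with a shouldBreak function, several of them can be combined by hand. The sketch below is one plausible way to do that, under the assumption that a break should be allowed only when every suppression allows it. The helper `combineSuppressions` is hypothetical, and whether the library's own merge behaves this way is not confirmed here.

```javascript
// Hypothetical helper: a break is allowed only if no suppression vetoes it.
function combineSuppressions(...suppressions) {
  return {
    shouldBreak: (cursor) => suppressions.every((s) => s.shouldBreak(cursor)),
  };
}

// Two trivial suppressions for demonstration purposes.
const alwaysBreak = { shouldBreak: () => true };
const neverBreak = { shouldBreak: () => false };

const permissive = combineSuppressions(alwaysBreak, alwaysBreak);
const strict = combineSuppressions(alwaysBreak, neverBreak);

console.log(permissive.shouldBreak({})); // true
console.log(strict.shouldBreak({}));     // false
```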
Tests are written in Jasmine and can be executed via jasmine-node:
npm install -g jasmine-node
jasmine-node spec
Written and maintained by Cameron C. Dutro (@camertron).
Copyright 2017 Cameron Dutro, licensed under the MIT license.
No vulnerabilities found.
no binaries found in the repo
0 existing vulnerabilities detected
license file detected
Found 3/19 approved changesets -- score normalized to 1
0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0
no effort to earn an OpenSSF best practices badge detected
security policy file not detected
project is not fuzzed
branch protection not enabled on development/release branches
SAST tool is not run on all commits -- score normalized to 0
Last Scanned on 2024-11-25
The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.