npmpackage.info

Gathering detailed insights and metrics for grapheme-splitter

Other packages similar to grapheme-splitter

graphemer

1.4.0

A JavaScript library that breaks strings into their individual user-perceived characters (including emojis!)

graphemesplit

2.4.4

A JavaScript implementation of the Unicode 14.0 grapheme cluster breaking algorithm. ([UAX #29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries))

text-segmentation

1.0.3

grapheme-breaker

0.3.2

An implementation of the Unicode grapheme cluster breaking algorithm (UAX #29)

grapheme-splitter

A JavaScript library that breaks strings into their individual user-perceived characters.

Version: 1.0.4
Stars: 929
License: MIT
Language: JavaScript
Size (minified + gzipped): 7.27 kB
Total downloads: 1,234,414,398
Scorecard score: 3.4/10

Installations

npm install grapheme-splitter

Pull Requests: 1 open, 2 closed, 9 merged (12 total)

Issues: 7 open, 13 closed (20 total)

Developer

orling

Developer Guide

Module System: CommonJS
Min. Node Version: not specified
TypeScript Support: No
Node Version: 6.11.0
NPM Version: 6.4.0

Statistics

929 Stars

42 Commits

45 Forks

19 Watching

1 Branches

9 Contributors

Updated on 18 Nov 2024

Bundle Size (Bundlephobia)

Minified: 22.66 kB
Minified + Gzipped: 7.27 kB

Languages

JavaScript (100%)

Total Downloads

Cumulative downloads: 1,234,414,398
Last day: 1,316,292 (-0% compared to previous day)
Last week: 7,201,481 (3.2% compared to previous week)
Last month: 28,781,415 (14.4% compared to previous month)
Last year: 349,219,446 (-43.5% compared to previous year)

Dev Dependencies

tape

Background

In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.

For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,

"🌷".length == 2

The combined emoji are even longer:

"🏳️‍🌈".length == 6
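The difference between code units and code points can be checked with built-in string methods alone. The following is a minimal sketch, not part of the library; the variable and helper names are illustrative, and the emoji are written as escapes so the result does not depend on how an editor stores them.

```javascript
// Illustrative names only; nothing here comes from grapheme-splitter.
const tulip = "\u{1F337}";                     // the tulip emoji, one code point
const flag = "\u{1F3F3}\uFE0F\u200D\u{1F308}"; // rainbow flag: flag + VS16 + ZWJ + rainbow

// .length counts UTF-16 code units, not user-perceived characters.
const codeUnits = (s) => s.length;
// Array.from iterates by code point, so surrogate pairs stay together.
const codePoints = (s) => Array.from(s).length;

console.log(codeUnits(tulip), codePoints(tulip)); // 2 1
console.log(codeUnits(flag), codePoints(flag));   // 6 4

// The two halves of the surrogate pair that encodes the tulip.
console.log(tulip.charCodeAt(0).toString(16)); // d83c (high surrogate)
console.log(tulip.charCodeAt(1).toString(16)); // df37 (low surrogate)
```

Counting code points fixes the simple emoji but still reports 4 for the flag, which a user sees as a single symbol.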

What's more, some languages often include combining marks: characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Some of these letters can be represented either as a single character or as a letter + combining mark, with both forms equally valid:

var two = "\u006E\u0303"; // unnormalized two-char form: n + combining tilde
var one = "\u00F1"; // normalized single-char form: ñ
console.log(one != two); // prints 'true'

Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters. But it is not enough in all cases. Some languages like Hindi make extensive use of combining marks on their letters, and those combinations have no dedicated single-codepoint Unicode representation, due to the sheer number of possible combinations. For example, the Hindi word "अनुच्छेद" consists of 5 letters and 3 combining marks:

अ + न + ु + च + ् + छ + े + द

which is in fact just 5 user-perceived letters:

अ + नु + च् + छे + द

and which Unicode normalization would not combine properly. There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
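The limits of normalization described above can be seen with the built-in String.prototype.normalize. This is a small sketch with illustrative variable names; the strings are written as escapes so that the exact code points are unambiguous.

```javascript
// Illustrative names only; uses just the built-in normalize method.
const decomposed = "\u006E\u0303"; // n + combining tilde (two code units)
const precomposed = "\u00F1";      // ñ as a single code unit

console.log(decomposed === precomposed);                  // false
console.log(decomposed.normalize("NFC") === precomposed); // true

// The Hindi word from the text: 5 letters and 3 combining marks.
const hindi = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926";

console.log(hindi.length);                  // 8
console.log(hindi.normalize("NFC").length); // 8, nothing to compose
```

NFC merges the n and its tilde because a precomposed ñ exists, but it leaves the Hindi word at 8 code units, so the 5 user-perceived letters still have to be found another way.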

Enter the grapheme-splitter.js library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the Default Grapheme Cluster Boundary rules of UAX #29.
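To give a feel for what grapheme clustering involves, here is a deliberately simplified sketch. It is not the library's implementation and covers only a tiny part of UAX #29: it attaches combining marks (Unicode category M) and zero-width-joiner sequences to the preceding code point. The function name roughGraphemes is hypothetical, and the real rules also cover Hangul Jamo, regional indicators, control characters and more.

```javascript
// Simplified illustration only; NOT the full UAX #29 rule set and NOT
// how grapheme-splitter is implemented. Requires regex property escapes.
const ZWJ = "\u200D";
const isMark = (codePoint) => /^\p{M}$/u.test(codePoint);

function roughGraphemes(text) {
  const clusters = [];
  for (const codePoint of text) { // for...of iterates by code point
    const last = clusters.length - 1;
    const attach =
      last >= 0 &&
      (isMark(codePoint) ||             // combining mark or variation selector
        codePoint === ZWJ ||            // the joiner sticks to what precedes it
        clusters[last].endsWith(ZWJ));  // and pulls in what follows it
    if (attach) {
      clusters[last] += codePoint;
    } else {
      clusters.push(codePoint);
    }
  }
  return clusters;
}

// The Hindi word from the Background section, written as escapes.
const word = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926";
console.log(roughGraphemes(word).length); // 5

// The rainbow flag: 6 code units, 4 code points, 1 cluster.
console.log(roughGraphemes("\u{1F3F3}\uFE0F\u200D\u{1F308}").length); // 1
```

For anything beyond experimentation, a complete implementation such as this library is the safer choice, since the shortcuts above will mis-split many real-world strings.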

Installation

You can use the index.js file directly as-is, or you can install grapheme-splitter into your project using the npm command below:

$ npm install --save grapheme-splitter

Tests

To run the tests on grapheme-splitter, use the command below:

$ npm test

Usage

Just initialize and use:

var GraphemeSplitter = require("grapheme-splitter");
var splitter = new GraphemeSplitter();

// split the string into an array of grapheme clusters (one string each)
var graphemes = splitter.splitGraphemes(string);

// iterate over the string's grapheme clusters (an iterable iterator, one string each)
var graphemeIterator = splitter.iterateGraphemes(string);

// or do this if you just need their number
var graphemeCount = splitter.countGraphemes(string);
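One common use for a split into grapheme clusters is truncating text without cutting a symbol in half. The sketch below is an illustration under assumptions, not part of the library: truncateGraphemes is a hypothetical helper that takes the splitting function as a parameter. With the library you would pass something like `function (s) { return splitter.splitGraphemes(s); }`; here a stand-in that splits by code point is used so the example runs without any dependency.

```javascript
// Hypothetical helper; `splitGraphemes` is any function (string) => string[].
function truncateGraphemes(text, maxGraphemes, splitGraphemes) {
  const clusters = splitGraphemes(text);
  if (clusters.length <= maxGraphemes) {
    return text;
  }
  // Keep whole clusters only, then mark the cut with an ellipsis.
  return clusters.slice(0, maxGraphemes).join("") + "\u2026";
}

// Stand-in splitter for demonstration only: it splits by code point, so it
// keeps surrogate pairs together but not combining-mark sequences.
const byCodePoint = (s) => Array.from(s);

const text = "ab\u{1F337}cd"; // "ab", the tulip emoji, "cd"

// Naive slicing by code units cuts the emoji's surrogate pair in half.
console.log(text.slice(0, 3) === "ab\uD83C"); // true

// Truncating by clusters keeps the emoji intact.
console.log(truncateGraphemes(text, 3, byCodePoint) === "ab\u{1F337}\u2026"); // true
```

Passing the splitter in as a parameter keeps the helper independent of how the clusters are produced.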

Examples

var splitter = new GraphemeSplitter();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes("abcd"); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes("🌷🎁💩😜👍🏳️‍🌈"); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes("Ĺo͂řȩm̅"); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes("뎌쉐"); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes("अनुच्छेद"); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes("Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]

TypeScript

Grapheme splitter includes TypeScript declarations.

import GraphemeSplitter = require('grapheme-splitter')

const splitter = new GraphemeSplitter()

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')

Acknowledgements

This library is heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library at https://github.com/devongovett/grapheme-breaker with an emphasis on ease of integration and pure JavaScript implementation.

No vulnerabilities found.

Binary-Artifacts: 10/10. Determines if the project has generated executable (binary) artifacts in the source repository.
Vulnerabilities: 10/10. Determines if the project has open, known unfixed vulnerabilities.
License: 10/10. Determines if the project has defined a license.
Code-Review: 3/10. Determines if the project requires human code review before pull requests (aka merge requests) are merged.
Maintained: 0/10. Determines if the project is "actively maintained".
CII-Best-Practices: 0/10. Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.
Security-Policy: 0/10. Determines if the project has published a security policy.
Fuzzing: 0/10. Determines if the project uses fuzzing.
Branch-Protection: 0/10. Determines if the default and release branches are protected with GitHub's branch protection settings.
SAST: 0/10. Determines if the project uses static code analysis.

Score: 3.4/10

Last Scanned on 2024-11-18

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.
