Gathering detailed insights and metrics for graphemer
npm install graphemer
146 Stars
32 Commits
12 Forks
4 Watching
2 Branches
4 Contributors
Updated on 29 Oct 2024
TypeScript (99.08%)
JavaScript (0.92%)
This library continues the work of Grapheme Splitter and supports the following unicode versions:
[v1.4.0]
[v1.3.0]
[v1.1.0]
[v1.0.0] (Unicode 10 supported by grapheme-splitter)
In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.
For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,
'🌷'.length == 2;
The combined emoji are even longer:
'🏳️🌈'.length == 6;
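The gap between code units and code points can be checked with built-ins alone. The following is a minimal sketch, not part of the library; the variable names are illustrative, and the characters are written as escapes so that the file encoding cannot change them.

```typescript
// The tulip emoji U+1F337, written as an escape.
const tulip: string = '\u{1F337}';

// .length counts UTF-16 code units: one high and one low surrogate.
const codeUnits: number = tulip.length;

// Array.from uses the string iterator, which walks code points,
// so the surrogate pair comes back as a single element.
const codePoints: number = Array.from(tulip).length;

// A base letter followed by a combining tilde is still two code points,
// so counting code points does not yet give user-perceived letters.
const nTilde: string = 'n\u0303';
const nTildeCodePoints: number = Array.from(nTilde).length;

console.log(codeUnits, codePoints, nTildeCodePoints); // prints 2 1 2
```

Counting code points handles surrogate pairs but not combining marks, which is why a grapheme-aware splitter is still needed.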
What's more, some languages often include combining marks - characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Sometimes they can be represented alternatively both as a single character and as a letter + combining mark, with both forms equally valid:
var two = '\u006E\u0303'; // unnormalized two-char n+◌̃
var one = '\u00F1'; // normalized single-char ñ

console.log(one != two); // prints 'true'
Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters, but it is not enough in all cases. Some languages, like Hindi, make extensive use of combining marks on their letters, and because of the sheer number of possible combinations these have no dedicated single-codepoint forms. For example, the Hindi word "अनुच्छेद" is composed of 5 letters and 3 combining marks:
अ + न + ु + च + ् + छ + े + द
which is in fact just 5 user-perceived letters:
अ + नु + च् + छे + द
and which Unicode normalization would not combine properly. There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
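The limits of normalization described above can be seen with the built-in String.prototype.normalize. This is a small sketch using only standard JavaScript; the variable names are illustrative, and the Hindi word is spelled out as escapes for its 8 code points.

```typescript
// n + combining tilde (two code units) versus precomposed U+00F1 (one code unit).
const decomposed: string = '\u006E\u0303';
const composed: string = '\u00F1';

// NFC composes the pair into the single precomposed character.
const composedByNfc: boolean = decomposed.normalize('NFC') === composed;

// The Hindi word from the text: 5 letters and 3 combining marks.
const hindi: string = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';

// No precomposed forms exist for these combinations, so NFC leaves all 8 code units in place.
const hindiLengthAfterNfc: number = hindi.normalize('NFC').length;

console.log(composedByNfc, hindiLengthAfterNfc); // prints true 8
```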
Enter the graphemer library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the Default Grapheme Cluster Boundary rules of UAX #29.
Install graphemer using the NPM command below:
$ npm i graphemer
If you're using TypeScript or a compiler like Babel (or something like Create React App), things are pretty simple: just import, initialize and use!
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// split the string to an array of grapheme clusters (one string each)
const graphemes = splitter.splitGraphemes(string);

// iterate the string to an iterable iterator of grapheme clusters (one string each)
const graphemeIterator = splitter.iterateGraphemes(string);

// or do this if you just need their number
const graphemeCount = splitter.countGraphemes(string);
If you're using vanilla Node, you can use the require() function instead.
const Graphemer = require('graphemer').default;

const splitter = new Graphemer();

const graphemes = splitter.splitGraphemes(string);
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes('abcd'); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes('🌷🎁💩😜👍🏳️🌈'); // returns ["🌷","🎁","💩","😜","👍","🏳️🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes('Ĺo͂řȩm̅'); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes('뎌쉐'); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]
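One common use of the split result is shortening text without cutting a letter in half. The sketch below is an illustration rather than part of the library: `GraphemeSplitter`, `truncateGraphemes` and `codePointSplitter` are hypothetical names, and the stand-in splitter only splits by code point so that the block runs without the package installed. In real use you would presumably pass `new Graphemer()`, whose `splitGraphemes` method is shown above.

```typescript
// Structural type covering the single method this sketch relies on.
// Assumption: a Graphemer instance fits it, since splitGraphemes(string)
// returns an array of strings in the usage shown above.
interface GraphemeSplitter {
  splitGraphemes(text: string): string[];
}

// Keep at most `max` clusters, joining them back into a string.
function truncateGraphemes(splitter: GraphemeSplitter, text: string, max: number): string {
  const clusters = splitter.splitGraphemes(text);
  return clusters.length <= max ? text : clusters.slice(0, max).join('');
}

// Hypothetical stand-in: splits by code point only, which is NOT full
// grapheme segmentation. It exists so the example is self-contained.
const codePointSplitter: GraphemeSplitter = {
  splitGraphemes: (text: string): string[] => Array.from(text),
};

const shortened: string = truncateGraphemes(codePointSplitter, 'ab\u{1F337}cd', 3);
console.log(shortened); // prints "ab🌷"
```

A naive `'ab\u{1F337}cd'.slice(0, 3)` would instead end on a lone high surrogate, which is the kind of breakage the library is meant to avoid.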
Graphemer is built with TypeScript and, of course, includes type declarations.
import Graphemer from 'graphemer';

const splitter = new Graphemer();

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞');
See Contribution Guide.
This library is a fork of the incredible work done by Orlin Georgiev and Huáng Jùnliàng at https://github.com/orling/grapheme-splitter.
The original library was heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library.