npmpackage.info

Gathering detailed insights and metrics for graphemer


graphemer

Unicode character splitter

1.4.0

146

MIT

TypeScript

14.81 kB

1,374,943,590 3.6

Installations

npm install graphemer

Pull Requests

Open

0

Total

4

Closed

0

Merged

4

Issues

Open

5

Total

7

Closed

2

Releases

1.4.0

Published on 19 Sept 2022

1.3.0

Published on 13 Dec 2021

1.2.0

Published on 29 Jan 2021

1.1.0

Published on 14 Sept 2020

1.0.0

Published on 14 Sept 2020

View all 5 releases

Developer

flmnt

Developer Guide

BETA

Module System

CommonJS

Min. Node Version

Typescript Support

Yes

Node Version

12.22.12

NPM Version

6.14.16

Statistics

146 Stars

32 Commits

12 Forks

4 Watching

2 Branches

4 Contributors

Updated on 29 Oct 2024

Bundle Size

93.06 kB

Minified

14.81 kB

Minified + Gzipped

Bundlephobia

Languages

TypeScript (99.08%)

JavaScript (0.92%)

Total Downloads

Cumulative downloads

Total Downloads

1,374,943,590

Last day

-1.5%

5,280,657

Compared to previous day

Last week

27,978,933

Compared to previous week

Last month

12.9%

114,926,472

Compared to previous month

Last year

269.7%

1,081,939,795

Compared to previous year


Dev Dependencies

@types/tape husky lint-staged prettier tape ts-node typescript

Versions

Graphemer: Unicode Character Splitter 🪓

Introduction

This library continues the work of Grapheme Splitter and supports the following Unicode versions:

Unicode 15 and below [v1.4.0]
Unicode 14 and below [v1.3.0]
Unicode 13 and below [v1.1.0]
Unicode 11 and below [v1.0.0] (Unicode 10 supported by grapheme-splitter)

In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.

For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,

'🌷'.length == 2;

The combined emoji are even longer:

'🏳️‍🌈'.length == 6;
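The difference between code units, code points, and what a user sees can be checked directly. The following is a minimal sketch using only built-in string methods; the variable names are illustrative, and the emoji are written as escapes so the values do not depend on how the file is encoded.

```typescript
// The two emoji discussed above, written as escapes.
const tulip = '\u{1F337}'; // 🌷
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}'; // 🏳️‍🌈

// .length counts UTF-16 code units.
const tulipUnits = tulip.length; // 2
const flagUnits = rainbowFlag.length; // 6

// Array.from iterates by code point, so a surrogate pair stays together...
const tulipCodePoints = Array.from(tulip).length; // 1
// ...but the flag is still four code points, not one user-perceived character.
const flagCodePoints = Array.from(rainbowFlag).length; // 4

// Slicing by code unit can cut a surrogate pair in half.
const broken = ('a' + tulip).slice(0, 2);
const endsWithLoneHighSurrogate = broken.charCodeAt(1) === 0xd83c; // true

console.log(tulipUnits, flagUnits, tulipCodePoints, flagCodePoints, endsWithLoneHighSurrogate);
```

Iterating by code point fixes the surrogate-pair case but not the combined emoji, which is why a grapheme-aware splitter is needed.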

What's more, some languages use combining marks: characters that modify the letter before them. Common examples are the German letter ü and the Spanish letter ñ. Such letters can often be represented either as a single character or as a letter plus a combining mark, and both forms are equally valid:

var two = 'n\u0303'; // unnormalized two-char form: n + combining tilde
var one = '\u00F1'; // normalized single-char form: ñ

console.log(one != two); // prints 'true'
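For a simple case like this one, the built-in `String.prototype.normalize` method converts between the two forms. A minimal sketch (variable names are illustrative):

```typescript
const decomposed = 'n\u0303'; // n + combining tilde (two code units)
const precomposed = '\u00F1'; // ñ as a single code unit

// The raw strings differ even though they render identically.
const equalRaw = decomposed === precomposed; // false

// NFC composes, NFD decomposes; after normalizing, the forms compare equal.
const equalAfterNfc = decomposed.normalize('NFC') === precomposed; // true
const equalAfterNfd = precomposed.normalize('NFD') === decomposed; // true

console.log(equalRaw, equalAfterNfc, equalAfterNfd);
```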

Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters. But it is not enough in all cases. Some languages, like Hindi, make extensive use of combining marks on their letters, and the resulting combinations have no dedicated single-codepoint forms because of the sheer number of possible combinations. For example, the Hindi word "अनुच्छेद" consists of 5 letters and 3 combining marks:

अ + न + ु + च + ् + छ + े + द

which is in fact just 5 user-perceived letters:

अ + नु + च् + छे + द

and which Unicode normalization would not combine properly. There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
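The limit of normalization for the Hindi word can be seen with a short check. This sketch writes the word as escapes for the eight code points listed above and uses only `String.prototype.normalize`; the variable names are illustrative.

```typescript
// अनुच्छेद as individual code points: 5 letters and 3 combining marks.
const word = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';

const unitsBefore = word.length; // 8
// NFC has nothing to compose here, so the length does not shrink to 5.
const unitsAfterNfc = word.normalize('NFC').length; // 8

console.log(unitsBefore, unitsAfterNfc);
```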

Enter the graphemer library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It implements the default grapheme cluster boundary rules of UAX #29.
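To give a feel for what grouping code points into clusters involves, here is a deliberately simplified sketch. It is not graphemer's algorithm and does not implement UAX #29: the hypothetical `naiveClusters` function only attaches combining marks (Unicode category M) to the preceding code point.

```typescript
// Simplified illustration only: NOT UAX #29 and NOT how graphemer works.
function naiveClusters(text: string): string[] {
  // Either one non-mark code point followed by any combining marks,
  // or a run of stray marks with no base character before them.
  return text.match(/\P{M}\p{M}*|\p{M}+/gu) ?? [];
}

// The Hindi word from above splits into the 5 expected pieces.
const hindi = naiveClusters('\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926');

// The rainbow flag is wrongly split into 3 pieces, because this sketch
// knows nothing about zero-width joiner sequences.
const flag = naiveClusters('\u{1F3F3}\uFE0F\u200D\u{1F308}');

console.log(hindi.length, flag.length);
```

The flag result shows why a full implementation of the boundary rules is needed: combining marks are only one of several cases the rules cover.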

Installation

Install graphemer using the npm command below:

$ npm i graphemer

Usage

If you're using TypeScript or a compiler like Babel (or something like Create React App) things are pretty simple; just import, initialize and use!

import Graphemer from 'graphemer';

const splitter = new Graphemer();

// `string` stands for any string you want to process

// split the string into an array of grapheme clusters (one string each)
const graphemes = splitter.splitGraphemes(string);

// get an iterable iterator over the grapheme clusters (one string each)
const graphemeIterator = splitter.iterateGraphemes(string);

// or do this if you just need their number
const graphemeCount = splitter.countGraphemes(string);

If you're using vanilla Node you can use the require() method.

const Graphemer = require('graphemer').default;

const splitter = new Graphemer();

const graphemes = splitter.splitGraphemes(string);
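One common use of a splitter is truncating text without cutting a user-perceived character in half. The following is a sketch under stated assumptions: `ClusterSplitter` and `truncateClusters` are hypothetical names, and the stand-in splitter only splits by code point so that the example runs without any dependency.

```typescript
// Hypothetical structural type: anything with a splitGraphemes method,
// such as a Graphemer instance.
interface ClusterSplitter {
  splitGraphemes(text: string): string[];
}

// Keep at most `max` clusters, as defined by the given splitter.
function truncateClusters(text: string, max: number, splitter: ClusterSplitter): string {
  const clusters = splitter.splitGraphemes(text);
  return clusters.length <= max ? text : clusters.slice(0, max).join('');
}

// Stand-in splitter for this sketch; it splits by code point only.
const codePointSplitter: ClusterSplitter = {
  splitGraphemes: (text) => Array.from(text),
};

// 'ab🌷cd' truncated to 3 clusters keeps the whole emoji.
const shortened = truncateClusters('ab\u{1F337}cd', 3, codePointSplitter);

console.log(shortened);
```

In real use you would pass `new Graphemer()` as the third argument instead of the stand-in, so that combined emoji and combining marks are also kept whole.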

Examples

import Graphemer from 'graphemer';

const splitter = new Graphemer();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes('abcd'); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes('🌷🎁💩😜👍🏳️‍🌈'); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes('Ĺo͂řȩm̅'); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes('뎌쉐'); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]

TypeScript

Graphemer is built with TypeScript and, of course, includes type declarations.

import Graphemer from 'graphemer';

const splitter = new Graphemer();

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞');

Contributing

See Contribution Guide.

Acknowledgements

This library is a fork of the incredible work done by Orlin Georgiev and Huáng Jùnliàng at https://github.com/orling/grapheme-splitter.

The original library was heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library.

No vulnerabilities found.

Binary-Artifacts: 10/10. Determines if the project has generated executable (binary) artifacts in the source repository.
Dangerous-Workflow: 10/10. Determines if the project's GitHub Action workflows avoid dangerous patterns.
License: 10/10. Determines if the project has defined a license.
Vulnerabilities: 4/10. Determines if the project has open, known unfixed vulnerabilities.
Pinned-Dependencies: 3/10. Determines if the project has declared and pinned the dependencies of its build process.
Code-Review: 2/10. Determines if the project requires human code review before pull requests (aka merge requests) are merged.
Maintained: 0/10. Determines if the project is "actively maintained".
Token-Permissions: 0/10. Determines if the project's workflows follow the principle of least privilege.
CII-Best-Practices: 0/10. Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.
Security-Policy: 0/10. Determines if the project has published a security policy.
Fuzzing: 0/10. Determines if the project uses fuzzing.
SAST: 0/10. Determines if the project uses static code analysis.

Score: 3.6/10

Last Scanned on 2024-11-25

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.
