Gathering detailed insights and metrics for stringzilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
npm install stringzilla
Supply Chain: 59.2
Quality: 95.7
Maintenance: 76.2
Vulnerability: 100
License: 100
Release v3.11.2 (published 19 Dec 2024)
v3.11.1: Matching N3322 for `memcpy` UB in C2y (published 11 Dec 2024)
v3.11.0: Checksums in AVX-512, AVX2, NEON (published 01 Dec 2024)
Release v3.10.11 (published 30 Nov 2024)
Release v3.10.10 (published 12 Nov 2024)
Release v3.10.9 (published 09 Nov 2024)
C++ (60.3%)
C (22.13%)
Rust (5.18%)
Python (4.64%)
Jupyter Notebook (4.5%)
Swift (1.46%)
CMake (1.31%)
Shell (0.26%)
JavaScript (0.22%)
Total Downloads: 811
Last Day: 1
Last Week: 1
Last Month: 2
Last Year: 312
2,314 Stars
859 Commits
82 Forks
26 Watching
3 Branches
40 Contributors
Latest Version: 2.0.4
Package Id: stringzilla@2.0.4
Unpacked Size: 81.11 kB
Size: 21.34 kB
File Count: 8
NPM Version: 10.2.3
Node Version: 18.19.0
Published On: 04 Jan 2024
Cumulative downloads:
Last day: 1 (0% compared to previous day)
Last week: 1 (0% compared to previous week)
Last month: 2 (-85.7% compared to previous month)
Last year: 312 (-37.5% compared to previous year)
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment.
So if you want to process the 6 billion images from LAION, the 250 billion web pages from CommonCrawl, or even just a few million lines of server logs, and are haunted by Python's `open(...).readlines()` and `str().splitlines()` taking forever, this should help 😊
StringZilla is built on a very simple heuristic:
If the first 4 bytes of the string are the same, the strings are likely to be equal. Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.
Thanks to that, it can avoid scalar code processing one `char` at a time and use hyper-scalar code to achieve `memcpy` speeds.
The implementation fits into a single C99 header file and uses different SIMD flavors, with SWAR on older platforms.
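To make the heuristic concrete, here is a rough Python sketch of the idea (illustrative only; StringZilla does this with machine words and SIMD registers, not Python integers). Interpreting the first 4 bytes as a big-endian integer gives a cheap comparison whose order agrees with full lexicographic comparison whenever the prefixes differ:

```python
def prefix_word(s: bytes) -> int:
    # Interpret the first 4 bytes as a big-endian integer, zero-padded,
    # so integer order matches lexicographic byte order.
    return int.from_bytes(s[:4].ljust(4, b"\x00"), "big")

# Ordering agrees with full comparison whenever the 4-byte prefixes differ
a, b = b"appleseed", b"banana"
assert (prefix_word(a) < prefix_word(b)) == (a < b)

# Equal prefixes are inconclusive: a full comparison is still needed
c, d = b"abcdX", b"abcdY"
assert prefix_word(c) == prefix_word(d) and c != d
```

Most string pairs in real datasets diverge within the first word, so the expensive byte-by-byte fallback runs rarely.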
| Backend \ Device | IoT | Laptop | Server |
| :--- | :--- | :--- | :--- |
| **Speed Comparison 🐇** | | | |
| Python `for` loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ `for` loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ `string.find` | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
| **Efficiency Metrics 📊** | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
| Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
Coming soon.
pip install stringzilla
from stringzilla import Str, Strs, File
StringZilla offers two mostly interchangeable core classes:
```python
from stringzilla import Str, File

text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))
```
The `Str` class is designed to replace long Python `str` strings and wrap our C-level API. The `File` class, on the other hand, memory-maps a file from persistent storage without loading a copy into RAM. The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
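For intuition, the same memory-mapping pattern can be sketched with Python's standard-library `mmap` (this is only an illustration of the mechanism, not StringZilla's implementation): the OS pages the file in on demand and can share the mapping between processes, much like `Str(File(...))`.

```python
import mmap
import os
import tempfile

# Write a small "dataset" to disk
path = os.path.join(tempfile.mkdtemp(), "dataset.txt")
with open(path, "wb") as f:
    f.write(b"alpha\nbeta\ngamma\n")

# Map it read-only: no copy of the file is loaded into the Python heap
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        lines = bytes(mapped).splitlines()

assert lines == [b"alpha", b"beta", b"gamma"]
```

With `ACCESS_READ`, the mapping is immutable from the process's point of view, which is what makes it safe to share across worker processes.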
```python
len(text) -> int
text[42] -> str
text[42:46] -> Str
str(text) -> str
'substring' in text -> bool
hash(text) -> int
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
```
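The `allowoverlap` flag in `count` is worth a closer look. A plain-Python model of the two counting modes (a sketch of the semantics, not StringZilla's implementation) shows the difference: non-overlapping counting skips past each match, while overlapping counting advances one character at a time.

```python
def count(haystack: str, needle: str, allowoverlap: bool = False) -> int:
    # Model of substring-count semantics; str.count() is the
    # non-overlapping case.
    matches, start = 0, 0
    while (pos := haystack.find(needle, start)) != -1:
        matches += 1
        # Overlapping search resumes one past the match start;
        # non-overlapping search resumes after the whole match.
        start = pos + (1 if allowoverlap else len(needle))
    return matches

assert count("aaaa", "aa") == 2              # same as "aaaa".count("aa")
assert count("aaaa", "aa", allowoverlap=True) == 3
```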
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices.
```python
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
```
Need copies?
```python
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
```
Basic `list`-like operations are also supported:
```python
lines.append('Pythonic string')
lines.extend(shuffled_copy)
```
The StringZilla CPython bindings implement vector-call conventions for faster calls.
```python
import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")
```
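For reference, the `levenstein` call above computes the classic Levenshtein edit distance. A minimal pure-Python equivalent (the textbook dynamic program, not StringZilla's accelerated implementation) makes the semantics explicit:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic row-by-row dynamic program: dp[j] holds the edit
    # distance between a prefix of `a` and the first j chars of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein("needle", "nidl") == 3
```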
There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
```c
#include "stringzilla.h"

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(
    haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);

// Perform collection-level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
```
Future development plans include:
Here's how to set up your dev environment and run some tests.
CPython:
```sh
# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x

# Install without dependencies
pip install -e . --no-index --no-deps
```
NodeJS:
```sh
npm install && npm test
```
To benchmark on some custom file and pattern combinations:
```sh
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
```
To benchmark on synthetic data:
```sh
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
```
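The pattern choice here is deliberately adversarial, as a small sketch illustrates: with a haystack of repeated `abcd` and the needle `abce`, every fourth position matches the first three bytes of the needle, so the matcher must repeatedly examine and reject near-misses without ever finding a full match.

```python
# A miniature version of the synthetic benchmark input
haystack = "abcd" * 1_000
needle = "abce"

# The needle never occurs...
assert haystack.find(needle) == -1

# ...but its 3-byte prefix occurs at every multiple of 4,
# producing 1,000 near-misses the matcher has to reject.
near_misses = sum(haystack.startswith("abc", i) for i in range(len(haystack)))
assert near_misses == 1_000
```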
To validate packaging:
```sh
cibuildwheel --platform linux
```
```sh
cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test
```
On macOS, it's recommended to use a non-default toolchain:

```sh
# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test
```
Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗
No vulnerabilities found.