npmpackage.info

Gathering detailed insights and metrics for stringzilla

stringzilla - 2.0.4 | npmpackage.info

stringzilla

Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖

2.0.4

2,617

Apache-2.0

81.11 kB

Installations

npm install stringzilla

Developer Guide

BETA

Typescript

No

Module System

ESM

Min. Node Version

~10 >=10.20 || >=12.17

Node Version

18.19.0

NPM Version

10.2.3 Pull Requests

Open

5

Total

127

Closed

10

Merged

112

Issues

Open

36

Total

98

Closed

62

Releases

Release v3.12.5

v3.12.5

Updated on Apr 18, 2025

Release v3.12.4

v3.12.4

Updated on Apr 12, 2025

Release v3.12.3

v3.12.3

Updated on Mar 09, 2025

Release v3.12.2

v3.12.2

Updated on Mar 02, 2025

Release v3.12.1

v3.12.1

Updated on Feb 26, 2025

GoLang support in StringZilla v3.12 🥳

v3.12.0

Updated on Feb 23, 2025

View All 75 releases

Languages

C++

Rust

Python

Jupyter Notebook

Swift

CMake

JavaScript

C (51.89%)

C++ (29.74%)

Rust (5.12%)

Python (4.59%)

Jupyter Notebook (4.45%)

Swift (1.44%)

CMake (1.29%)

Go (1.25%)

JavaScript (0.22%)

Developer

ashvardanian

Download Statistics

Total Downloads

Last Day

Last Week

Last Month

Last Year

GitHub Statistics

Apache-2.0 License

2,617 Stars

881 Commits

88 Forks

25 Watchers

5 Branches

43 Contributors

Updated on Jul 10, 2025

Maintainers

View All 43 Contributors

Package Meta Information

Latest Version

2.0.4

Package Id

stringzilla@2.0.4

Unpacked Size

81.11 kB

Size

21.34 kB

File Count

NPM Version

10.2.3

Node Version

18.19.0

Published on

Jan 04, 2024

Total Downloads

Cumulative downloads

Total Downloads

NaN

Last Day

NaN

Compared to previous day

Last Week

NaN

Compared to previous week

Last Month

NaN

Compared to previous month

Last Year

NaN

Compared to previous year

Weekly Downloads

Monthly Downloads

Yearly Downloads

Dependencies

@types/node bindings node-addon-api

Dev Dependencies

@semantic-release/exec @semantic-release/git conventional-changelog-eslint semantic-release typescript

StringZilla 🦖

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅

✅ Single-header pure C 99 implementation docs
✅ Direct CPython bindings with minimal call latency docs
✅ SWAR and SIMD acceleration on x86 (AVX2) and ARM (NEON)
✅ Radix-like sorting faster than C++ std::sort
✅ Memory-mapping to work with larger-than-RAM datasets
✅ Memory-efficient compressed arrays to work with sequences
🔜 JavaScript bindings are on their way.

This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment. So if you want to process the 6 Billion images from LAION, or the 250 Billion web pages from the CommonCrawl, or even just a few million lines of server logs, and haunted by Python's open(...).readlines() and str().splitlines() taking forever, this should help 😊

Performance

StringZilla is built on a very simple heuristic:

If the first 4 bytes of the string are the same, the strings are likely to be equal. Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.

Thanks to that it can avoid scalar code processing one char at a time and use hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms.

Substring Search

Backend \ Device	IoT	Laptop	Server
Speed Comparison 🐇
Python `for` loop	4 MB/s	14 MB/s	11 MB/s
C++ `for` loop	520 MB/s	1.0 GB/s	900 MB/s
C++ `string.find`	560 MB/s	1.2 GB/s	1.3 GB/s
Scalar StringZilla	2 GB/s	3.3 GB/s	3.5 GB/s
Hyper-Scalar StringZilla	4.3 GB/s	12 GB/s	12.1 GB/s
Efficiency Metrics 📊
CPU Specs	8-core ARM, 0.5 W/core	8-core Intel, 5.6 W/core	22-core Intel, 6.3 W/core
Performance/Core	2.1 - 3.3 GB/s	11 GB/s	10.5 GB/s
Bytes/Joule	4.2 GB/J	2 GB/J	1.6 GB/J

Split, Partition, Sort, and Shuffle

Coming soon.

Quick Start: Python 🐍

Install via pip: pip install stringzilla
Import the classes you need: from stringzilla import Str, Strs, File

Basic Usage

StringZilla offers two mostly interchangeable core classes:

1from stringzilla import Str, File
2
3text_from_str = Str('some-string')
4text_from_file = Str(File('some-file.txt'))

The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

Length: len(text) -> int
Indexing: text[42] -> str
Slicing: text[42:46] -> Str
String conversion: str(text) -> str
Substring check: 'substring' in text -> bool
Hashing: hash(text) -> int

Advanced Operations

text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

1lines: Strs = text.split(separator='\n')
2lines.sort()
3lines.shuffle(seed=42)

Need copies?

1sorted_copy: Strs = lines.sorted()
2shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

1lines.append('Pythonic string')
2lines.extend(shuffled_copy)

Low-Level Python API

The StringZilla CPython bindings implement vector-call conventions for faster calls.

1import stringzilla as sz
2
3contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
4offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
5count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
6levenstein: int = sz.levenstein("needle", "nidl")

Quick Start: C 🛠️

There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

1#include "stringzilla.h"
2
3// Initialize your haystack and needle
4sz_string_view_t haystack = {your_text, your_text_length};
5sz_string_view_t needle = {your_subtext, your_subtext_length};
6
7// Perform string-level operations
8sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
9sz_size_t substring_position = sz_find_substring(haystack.start, haystack.length, needle.start, needle.length);
10
11// Hash strings
12sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);
13
14// Perform collection level operations
15sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
16sz_sort(&array, &your_config);

Contributing 👾

Future development plans include:

Replace PyBind11 with CPython, blog
Bindings for JavaScript
Faster string sorting algorithm
Reverse-order operations in Python
Splitting with multiple separators at once
Splitting CSV rows into columns
UTF-8 validation.
Arm SVE backend
Bindings for Java and Rust

Here's how to set up your dev environment and run some tests.

Development

CPython:

1# Clean up, install, and test!
2rm -rf build && pip install -e . && pytest scripts/ -s -x
3
4# Install without dependencies
5pip install -e . --no-index --no-deps

NodeJS:

1npm install && npm test

Benchmarking

To benchmark on some custom file and pattern combinations:

1python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

1python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"

Packaging

To validate packaging:

1cibuildwheel --platform linux

Compiling C++ Tests

1cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test

On MacOS it's recommended to use non-default toolchain:

1# Install dependencies
2brew install libomp llvm
3
4# Compile and run tests
5cmake -B ./build_release \
6    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
7    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
8    -DSTRINGZILLA_USE_OPENMP=1 \
9    -DSTRINGZILLA_BUILD_TEST=1 \
10    && \
11    make -C ./build_release -j && ./build_release/stringzilla_test

License 📜

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.

If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗

No vulnerabilities found.

No security vulnerabilities found.