Gathering detailed insights and metrics for stringzilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
npm install stringzilla
Supply Chain: 59.2
Quality: 95.7
Maintenance: 76.2
Vulnerability: 100
License: 100
Release v3.11.2 (published 19 Dec 2024)
v3.11.1: Matching N3322 for `memcpy` UB in C2y (published 11 Dec 2024)
v3.11.0: Checksums in AVX-512, AVX2, NEON (published 01 Dec 2024)
Release v3.10.11 (published 30 Nov 2024)
Release v3.10.10 (published 12 Nov 2024)
Release v3.10.9 (published 09 Nov 2024)
C++ (60.3%)
C (22.13%)
Rust (5.18%)
Python (4.64%)
Jupyter Notebook (4.5%)
Swift (1.46%)
CMake (1.31%)
Shell (0.26%)
JavaScript (0.22%)
Total Downloads: 811
Last Day: 1
Last Week: 1
Last Month: 2
Last Year: 312
2,314 Stars
859 Commits
82 Forks
26 Watching
3 Branches
40 Contributors
Latest Version: 2.0.4
Package Id: stringzilla@2.0.4
Unpacked Size: 81.11 kB
Size: 21.34 kB
File Count: 8
NPM Version: 10.2.3
Node Version: 18.19.0
Published On: 04 Jan 2024
Cumulative downloads:
Last day: 1 (0% compared to previous day)
Last week: 1 (0% compared to previous week)
Last month: 2 (-85.7% compared to previous month)
Last year: 312 (-37.5% compared to previous year)
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment.
So if you want to process the 6 billion images from LAION, the 250 billion web pages from CommonCrawl, or even just a few million lines of server logs, and are haunted by Python's `open(...).readlines()` and `str().splitlines()` taking forever, this should help 😊
StringZilla is built on a very simple heuristic:
If the first 4 bytes of the string are the same, the strings are likely to be equal. Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.
Thanks to that, it can avoid scalar code processing one `char` at a time and use hyper-scalar code to achieve `memcpy` speeds.
The implementation fits into a single C99 header file and uses different SIMD flavors, with SWAR on older platforms.
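To make the heuristic concrete, here is a rough Python sketch of the idea (illustrative only; StringZilla does this with machine words and SIMD registers, not Python integers). Interpreting the first 4 bytes as a big-endian integer gives a cheap comparison whose order agrees with full lexicographic comparison whenever the prefixes differ:

```python
def prefix_word(s: bytes) -> int:
    # Interpret the first 4 bytes as a big-endian integer, zero-padded,
    # so integer order matches lexicographic byte order.
    return int.from_bytes(s[:4].ljust(4, b"\x00"), "big")

# Ordering agrees with full comparison whenever the 4-byte prefixes differ
a, b = b"appleseed", b"banana"
assert (prefix_word(a) < prefix_word(b)) == (a < b)

# Equal prefixes are inconclusive: a full comparison is still needed
c, d = b"abcdX", b"abcdY"
assert prefix_word(c) == prefix_word(d) and c != d
```

Most string pairs in real datasets diverge within the first word, so the expensive byte-by-byte fallback runs rarely.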
| Backend \ Device | IoT | Laptop | Server |
| :--- | :--- | :--- | :--- |
| **Speed Comparison 🐇** | | | |
| Python `for` loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ `for` loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ `string.find` | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
| **Efficiency Metrics 📊** | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
| Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
Coming soon.
pip install stringzilla
from stringzilla import Str, Strs, File
StringZilla offers two mostly interchangeable core classes:
```python
from stringzilla import Str, File

text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))
```
The `Str` class is designed to replace long Python `str` strings and wrap our C-level API. The `File` class, on the other hand, memory-maps a file from persistent storage without loading a copy into RAM. The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
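For intuition, the same memory-mapping pattern can be sketched with Python's standard-library `mmap` (this is only an illustration of the mechanism, not StringZilla's implementation): the OS pages the file in on demand and can share the mapping between processes, much like `Str(File(...))`.

```python
import mmap
import os
import tempfile

# Write a small "dataset" to disk
path = os.path.join(tempfile.mkdtemp(), "dataset.txt")
with open(path, "wb") as f:
    f.write(b"alpha\nbeta\ngamma\n")

# Map it read-only: no copy of the file is loaded into the Python heap
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        lines = bytes(mapped).splitlines()

assert lines == [b"alpha", b"beta", b"gamma"]
```

With `ACCESS_READ`, the mapping is immutable from the process's point of view, which is what makes it safe to share across worker processes.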
```python
len(text) -> int
text[42] -> str
text[42:46] -> Str
str(text) -> str
'substring' in text -> bool
hash(text) -> int
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
```
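The `allowoverlap` flag in `count` is worth a closer look. A plain-Python model of the two counting modes (a sketch of the semantics, not StringZilla's implementation) shows the difference: non-overlapping counting skips past each match, while overlapping counting advances one character at a time.

```python
def count(haystack: str, needle: str, allowoverlap: bool = False) -> int:
    # Model of substring-count semantics; str.count() is the
    # non-overlapping case.
    matches, start = 0, 0
    while (pos := haystack.find(needle, start)) != -1:
        matches += 1
        # Overlapping search resumes one past the match start;
        # non-overlapping search resumes after the whole match.
        start = pos + (1 if allowoverlap else len(needle))
    return matches

assert count("aaaa", "aa") == 2              # same as "aaaa".count("aa")
assert count("aaaa", "aa", allowoverlap=True) == 3
```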
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices.
```python
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
```
Need copies?
```python
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
```
Basic `list`-like operations are also supported:
```python
lines.append('Pythonic string')
lines.extend(shuffled_copy)
```
The StringZilla CPython bindings implement vector-call conventions for faster calls.
```python
import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")
```
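For reference, the `levenstein` call above computes the classic Levenshtein edit distance. A minimal pure-Python equivalent (the textbook dynamic program, not StringZilla's accelerated implementation) makes the semantics explicit:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic row-by-row dynamic program: dp[j] holds the edit
    # distance between a prefix of `a` and the first j chars of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein("needle", "nidl") == 3
```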
There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
```c
#include "stringzilla.h"

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(
    haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);

// Perform collection-level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
```
Future development plans include:
Here's how to set up your dev environment and run some tests.
CPython:
```sh
# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x

# Install without dependencies
pip install -e . --no-index --no-deps
```
NodeJS:
```sh
npm install && npm test
```
To benchmark on some custom file and pattern combinations:
```sh
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
```
To benchmark on synthetic data:
```sh
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
```
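The pattern choice here is deliberately adversarial, as a small sketch illustrates: with a haystack of repeated `abcd` and the needle `abce`, every fourth position matches the first three bytes of the needle, so the matcher must repeatedly examine and reject near-misses without ever finding a full match.

```python
# A miniature version of the synthetic benchmark input
haystack = "abcd" * 1_000
needle = "abce"

# The needle never occurs...
assert haystack.find(needle) == -1

# ...but its 3-byte prefix occurs at every multiple of 4,
# producing 1,000 near-misses the matcher has to reject.
near_misses = sum(haystack.startswith("abc", i) for i in range(len(haystack)))
assert near_misses == 1_000
```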
To validate packaging:
```sh
cibuildwheel --platform linux
```
```sh
cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test
```
On macOS, it's recommended to use a non-default toolchain:

```sh
# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test
```
Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗
No vulnerabilities found.