llm-text-splitter
A lightweight TypeScript library for splitting text into chunks based on various criteria, ideal for preparing text for Large Language Models (LLMs) and for RAG applications.
Features
- Multiple Splitting Methods: Split text by sentences, paragraphs, markdown headings, or custom regular expressions.
- Size Control: Define minimum and maximum chunk lengths.
- Overlap: Introduce overlapping text between chunks to maintain context.
- Extra Space Removal: Optionally remove extra whitespace within chunks.
- Robust API: A class-based API for better configuration and error handling.
Installation
npm install llm-text-splitter
Usage
Creating a Splitter Instance
To use the Splitter
, first create an instance with your desired options:
import { Splitter } from 'llm-text-splitter';
const splitter = new Splitter({
maxLength: 30,
overlap: 5,
splitter: 'sentence',
});
Splitting Text by Sentences with Overlap
const text =
'This is the first sentence. This is the second, slightly longer sentence. And a final short one.';
const chunks = splitter.split(text);
console.log(chunks);
// Expected output (may vary slightly depending on overlap handling):
// [
// "This is the first sentence.",
// "sentence. This is the",
// "the second, slightly longer",
// "longer sentence. And a",
// "And a final short one."
// ]
Splitting Text by Paragraphs
const text2 =
'This is the first paragraph.\n\nThis is the second paragraph, which is a bit longer.\n\nShort paragraph.';
const chunks2 = splitter.split(text2, { splitter: 'paragraph' });
console.log(chunks2);
// Expected output:
// [
// "This is the first paragraph.",
// "This is the second paragraph, which is a bit longer.",
// "Short paragraph."
// ]
Splitting Text by Markdown Headings
const text3 =
'# Heading 1\nThis is some text under heading 1.\n\n## Heading 2\nMore text here.\n\n### Heading 3\nAnd even more text.';
const chunks3 = splitter.split(text3, { splitter: 'markdown' });
console.log(chunks3);
// Expected output:
// [
// "# Heading 1\nThis is some text under heading 1.",
// "## Heading 2\nMore text here.",
// "### Heading 3\nAnd even more text."
// ]
Custom Regex Splitter
const text4 = 'Item 1; Item 2; Item 3; Item 4';
const chunks4 = splitter.split(text4, { regex: /[;]/ });
console.log(chunks4);
// Expected output:
// ["Item 1", " Item 2", " Item 3", " Item 4"]
Removing Extra Spaces
const text5 =
'This is a string with extra spaces. This is another string with extra spaces.';
const chunks5 = splitter.split(text5, { removeExtraSpaces: true });
console.log(chunks5);
// Expected output:
// [
// "This is a string with extra spaces.",
// "This is another string with extra spaces."
// ]
API
Splitter
The main class for splitting text.
Constructor
constructor(options: SplitOptions = {})
options
: An optional object with the following properties:
minLength
: The minimum length of a chunk (default: 0).
maxLength
: The maximum length of a chunk (default: 5000).
overlap
: The number of overlapping characters between chunks (default: 0).
splitter
: The splitting method ('sentence', 'paragraph', 'markdown') (default: 'sentence').
regex
: A custom regular expression for splitting. Overrides splitter
if provided.
removeExtraSpaces
: Whether to remove extra spaces within chunks (default: false).
Method
split(text: string, options: SplitOptions = {}): string[]
text
: The input text string.
options
: An optional object to override the default options for this specific split.
Return Value
An array of strings, where each string is a chunk of the input text.
Error Handling
The Splitter
class will throw an error if minLength
is greater than maxLength
during instantiation.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.