Gathering detailed insights and metrics for streaming-sequence-extractor
Extract sequence data from GenBank, FASTA or plain-text formats in a streaming manner.
npm install streaming-sequence-extractor
JavaScript (100%)
1 Stars
19 Commits
3 Watchers
1 Branches
2 Contributors
Updated on Aug 20, 2018
Latest Version: 0.0.4
Package Id: streaming-sequence-extractor@0.0.4
Size: 29.09 kB
NPM Version: 2.15.0
Node Version: 4.4.2
Stream processor that takes GenBank, FASTA or SBOL 2.x formats as input and streams out just the sequence data (DNA, RNA or Amino Acid sequences) with all formatting and meta-data removed.
Optionally, you can specify a FASTA header to be prepended to each sequence.
This module was written to facilitate building BLAST databases from large numbers of user-contributed sequence files, using the prepended FASTA header to trace BLAST query results back to the original file.
This is not a strict parser. It will successfully parse things that only bear a vague resemblance to their correct formats. This parser is meant to be fast, asynchronous and platform-independent. If you need strict format validation look elsewhere.
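var sse = require('streaming-sequence-extractor');
var fs = require('fs');
var seqStream = sse(); // extractor stream with default settings
fs.createReadStream('myseq.gb').pipe(seqStream);
seqStream.on('data', function(data) {
  console.log(data);
});
// errorOnUnexpected defaults to true, so handle emitted errors
seqStream.on('error', function(err) {
  console.error(err);
});
seqStream.on('end', function() { // 'end' listeners receive no arguments
  console.log("stream ended");
});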
type can be, among the values mentioned in this document: 'auto' (the default), 'DNA', 'RNA' or 'AA'.
options (with defaults specified):
{
  convertToExpected: false, // convert between T and U based on type 'DNA' or 'RNA'
  stripUnexpected: false, // strip unexpected characters
  errorOnUnexpected: true, // emit error if any unexpected chars encountered
  header: '', // the FASTA header to prepend before each sequence. can be a function
  inputEncoding: 'utf8', // decode input using this encoding
  outputEncoding: inputEncoding // encode output using this encoding
}
If type is set to 'AA' and GenBank format is encountered then this will cause the parser to only look at the Amino Acid version of the sequence (GenBank allows specifying both translated and untranslated versions of the same sequence).
If convertToExpected is true, then all encountered 'T' characters are converted to 'U' when type is set to 'RNA', and all encountered 'U' characters are converted to 'T' when type is set to 'DNA'.
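As a rough illustration of this option, here is a minimal sketch. It assumes the conversion goes toward the alphabet expected for the type (T to U for 'RNA', U to T for 'DNA'), as the option comment "convert between T and U based on type" and the allowed character lists suggest. The function name convertSequence and the case-preserving behaviour are assumptions made for this example, not the module's actual implementation.

```javascript
// Hypothetical helper, not part of the module's API: rewrites T/U so the
// sequence matches the alphabet expected for `type`, preserving case.
function convertSequence(seq, type) {
  if (type === 'RNA') {
    // RNA expects U, so rewrite T/t as U/u
    return seq.replace(/T/g, 'U').replace(/t/g, 'u');
  }
  if (type === 'DNA') {
    // DNA expects T, so rewrite U/u as T/t
    return seq.replace(/U/g, 'T').replace(/u/g, 't');
  }
  return seq; // any other type: leave the sequence unchanged
}

console.log(convertSequence('ATGCtt', 'RNA')); // AUGCuu
console.log(convertSequence('AUGCuu', 'DNA')); // ATGCtt
```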
If stripUnexpected is true then any characters in the sequence that were not expected based on the type are stripped from the output; otherwise all sequence characters are kept.
If errorOnUnexpected is true then the first unexpected character encountered in the sequence will result in an emitted error.
Keep in mind that the expected characters for Amino Acid sequences include all expected characters for DNA and RNA sequences, so you will receive no errors if you set errorOnUnexpected to true and type to 'AA' and then receive DNA or RNA sequences.
If header is set then each sequence will be output with a FASTA header in the >header style. If header is a non-empty string then that string will be used directly as the header for all sequences. If it is an object then it will be converted to JSON first. If it is a function then the function will be passed the sequence count (the number of sequences since the stream was initialized, starting from 0) as its argument and is expected to synchronously return a string or sequence.
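The three header forms can be sketched as follows. This is only an illustration of the rules described above: the name resolveHeader, the trailing newline, and the choice to JSON-encode an object returned by a function are all assumptions made for this example.

```javascript
// Hypothetical helper: turns the `header` option into a FASTA header line
// for the sequence with index `seqCount` (0 for the first sequence).
function resolveHeader(header, seqCount) {
  let value = header;
  if (typeof value === 'function') {
    value = value(seqCount); // called synchronously with the sequence count
  }
  if (value !== null && typeof value === 'object') {
    value = JSON.stringify(value); // objects are converted to JSON first
  }
  if (typeof value !== 'string' || value === '') {
    return ''; // assumption: an empty header means no header line
  }
  return '>' + value + '\n';
}

console.log(resolveHeader('myseq.gb', 0));
console.log(resolveHeader({ file: 'myseq.gb' }, 0));
console.log(resolveHeader(function (n) { return 'seq' + n; }, 2));
```

A function header is useful when one input stream contains several sequences, since each one can then carry its own index back to the original file.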
The output will consist of all nucleotide or amino acid characters encountered in the sequences, as specified in the 'Allowed sequence characters' section, with the optional FASTA header.
Allowed input characters for DNA are the IUPAC characters as per the GenBank Submissions Handbook plus the extra characters allowed by BLAST:
TGACRYMKSWHBVDN-
and for RNA:
UGACRYMKSWHBVDN-
and for Amino Acids:
ABCDEFGHIKLMNPQRSTUVWYZX*-
This parser additionally allows lower case versions of the allowed characters.
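To make the interplay between these alphabets and the stripUnexpected and errorOnUnexpected options concrete, here is a minimal sketch. The helper names are hypothetical, the sketch throws where the module would emit an error on the stream, and checking errorOnUnexpected before stripUnexpected is an assumption.

```javascript
// Alphabets copied from the lists above (upper case only).
const ALLOWED = {
  DNA: 'TGACRYMKSWHBVDN-',
  RNA: 'UGACRYMKSWHBVDN-',
  AA: 'ABCDEFGHIKLMNPQRSTUVWYZX*-'
};

// Hypothetical helper: true if `ch` is allowed for `type`, ignoring case.
function isExpected(ch, type) {
  return ALLOWED[type].indexOf(ch.toUpperCase()) !== -1;
}

// Hypothetical helper mirroring stripUnexpected / errorOnUnexpected.
function filterSequence(seq, type, opts) {
  let out = '';
  for (const ch of seq) {
    if (isExpected(ch, type)) {
      out += ch;
    } else if (opts.errorOnUnexpected) {
      // the module emits an error event; this sketch throws instead
      throw new Error('Unexpected character: ' + JSON.stringify(ch));
    } else if (!opts.stripUnexpected) {
      out += ch; // unexpected characters are kept when not stripping
    }
  }
  return out;
}

// Every DNA and RNA character is also a valid Amino Acid character,
// so type 'AA' never reports nucleotide input as unexpected.
const nucleotides = (ALLOWED.DNA + ALLOWED.RNA).split('');
console.log(nucleotides.every(function (ch) { return isExpected(ch, 'AA'); }));
```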
To identify SBOL format the parser looks for the pattern '<?xml' or '<rdf:RDF' (case insensitive) and then uses a streaming XML parser to find 'sbol:Elements' tags inside of 'sbol:Sequence' tags inside of the 'rdf:RDF' tag. It currently does not check the encoding.
It extracts all text nodes from within all 'sbol:Elements' tags.
This parser only works with SBOL 2. It is currently not backwards compatible with SBOL 1.
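The SBOL detection step amounts to a case-insensitive pattern match. A minimal sketch, using a hypothetical function name and operating on an already-decoded string rather than a stream:

```javascript
// Hypothetical sniffing helper: case-insensitive test for an XML
// declaration or an rdf:RDF root tag, the two markers named above.
const SBOL_START = /<\?xml|<rdf:RDF/i;

function looksLikeSBOL(text) {
  return SBOL_START.test(text);
}

console.log(looksLikeSBOL('<?xml version="1.0" ?>')); // true
console.log(looksLikeSBOL('>seq1\nACGT'));            // false
```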
To identify FASTA format the parser looks for the first non-empty line that begins with either > or ; and then assumes that the sequence begins at the first non-empty line after that which doesn't begin with a ;.
Any subsequent lines beginning with > are taken to mean that multiple sequences are present in the file.
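The FASTA rules above can be sketched as a simple line-by-line loop. This is a non-streaming simplification over a complete string, not the module's implementation: the function name extractFasta is hypothetical, and skipping blank lines inside a record is an assumption.

```javascript
// Non-streaming sketch: returns the sequences found in FASTA-like text.
function extractFasta(text) {
  const sequences = [];
  let current = null; // null until the first '>' or ';' line is seen

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;            // skip empty lines
    if (line[0] === '>') {
      // a header line starts a new sequence
      if (current) sequences.push(current);
      current = '';
    } else if (line[0] === ';') {
      // a ';' line can open the record; later ones are comments
      if (current === null) current = '';
    } else if (current !== null) {
      current += line;                    // sequence data
    }
  }
  if (current) sequences.push(current);
  return sequences;
}

console.log(extractFasta('>seq1\n;comment\nACGT\nTTAA\n>seq2\nGGCC\n'));
// [ 'ACGTTTAA', 'GGCC' ]
```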
To identify GenBank format the parser looks for the first non-empty line that begins with LOCUS and then assumes that the sequence begins after the first line starting with ORIGIN and ends either when encountering a line beginning with // or end of file.
Subsequent lines starting with LOCUS are taken to mean that multiple sequences are present in the file.
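A comparable sketch for the GenBank rules, again non-streaming and with a hypothetical function name. It assumes the usual GenBank layout in which ORIGIN lines carry position numbers and space-separated blocks, so digits and whitespace are dropped; it does not handle the Amino Acid case described for type 'AA'.

```javascript
// Non-streaming sketch: returns the ORIGIN sequences found in GenBank-like text.
function extractGenBank(text) {
  const sequences = [];
  let inRecord = false; // a LOCUS line has been seen
  let inOrigin = false; // an ORIGIN line has been seen in this record
  let current = '';

  for (const line of text.split(/\r?\n/)) {
    if (line.startsWith('LOCUS')) {
      // a new LOCUS also closes a record that never reached "//"
      if (inOrigin) sequences.push(current);
      inRecord = true;
      inOrigin = false;
      current = '';
    } else if (!inRecord) {
      continue; // ignore everything before the first LOCUS
    } else if (line.startsWith('//')) {
      if (inOrigin) sequences.push(current);
      inRecord = false;
      inOrigin = false;
      current = '';
    } else if (inOrigin) {
      current += line.replace(/[\s\d]/g, ''); // drop positions and spacing
    } else if (line.startsWith('ORIGIN')) {
      inOrigin = true;
    }
  }
  if (inOrigin) sequences.push(current); // record ended at end of file
  return sequences;
}

const sample = [
  'LOCUS       TEST 8 bp DNA',
  'ORIGIN',
  '        1 acgtac gt',
  '//'
].join('\n');
console.log(extractGenBank(sample)); // [ 'acgtacgt' ]
```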
If auto is specified as the type (the default) then for GenBank format it will output the DNA or RNA sequence if such a sequence is present. It will not transform the character T to U, since GenBank format has no simple way of specifying whether a sequence is wholly DNA or RNA. If no DNA or RNA sequence is present then it will output nothing at all.
The parser currently is not able to auto-detect if an SBOL formatted stream contains Amino Acid sequences vs. DNA/RNA sequences.
The input stream is supposed to be able to contain a mix of the supported formats concatenated together. E.g. you should be able to stream a FASTA file followed by an SBOL file and then a GenBank file, as long as each FASTA or FASTQ file ends with a double-newline. However, currently the SBOL parser overconsumes in some cases (see examples/multi_fail.js) which can cause it to eat up the beginning of the next format. This is an issue with the sax npm module not stopping when it reaches the closing tag of the root node (which is fair if it was designed for parsing a single xml file per initialization).
The tests called *.nobrake.js pipe input into streaming-sequence-extractor as fast as possible, which usually means that it is received as a single large chunk, or at most as a few large chunks. The other tests use the brake module to throttle the input rate so that failures associated with receiving the stream in small (usually single-character) increments can be discovered.
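The kind of bug the throttled tests are looking for shows up whenever parser state is lost at a chunk boundary. The sketch below does not use the brake module or this package; it is a hypothetical, self-contained line buffer that illustrates the property being tested, namely that the result should be the same whether the input arrives as one chunk or one character at a time.

```javascript
// Hypothetical helper: collects chunks and returns complete lines,
// carrying any partial line over to the next chunk.
function createLineBuffer() {
  let pending = '';
  return {
    push: function (chunk) {
      const parts = (pending + chunk).split('\n');
      pending = parts.pop(); // the last part may be an incomplete line
      return parts;
    },
    flush: function () {
      const rest = pending;
      pending = '';
      return rest === '' ? [] : [rest];
    }
  };
}

// Runs a list of chunks through a fresh buffer and returns all lines.
function linesFrom(chunks) {
  const buf = createLineBuffer();
  let lines = [];
  for (const chunk of chunks) {
    lines = lines.concat(buf.push(chunk));
  }
  return lines.concat(buf.flush());
}

const input = '>seq1\nACGT\nTTAA';
console.log(linesFrom([input]));         // one large chunk
console.log(linesFrom(input.split(''))); // single-character chunks
```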
License: AGPLv3
Copyright 2017 The BioBricks Foundation