Installation
npm install pelias-parser

Developer Guide
- Module System: CommonJS
- Min. Node Version: >= 10.0.0
- TypeScript Support: No
- Node Version: 16.20.2
- NPM Version: 6.14.18
Statistics
55 Stars
268 Commits
28 Forks
11 Watching
10 Branches
21 Contributors
Updated on 26 Nov 2024
Languages
JavaScript (97.78%)
HTML (1.91%)
Shell (0.21%)
Dockerfile (0.09%)
Total Downloads
Cumulative downloads: 55,172

- Last day: 46 (130% compared to previous day)
- Last week: 129 (40.2% compared to previous week)
- Last month: 426 (-0.9% compared to previous month)
- Last year: 10,647 (23.5% compared to previous year)
A modular, open-source search engine for our world.
Pelias is a geocoder powered completely by open data, available freely to everyone.
Local Installation · Cloud Webservice · Documentation · Community Chat
What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.
We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.
Pelias Parser
A natural language classification engine for geocoding.
This library contains primitive 'building blocks' which can be composed together to produce a powerful and flexible natural language parser.
The project was designed and built to work with the Pelias geocoder, so it comes bundled with a parser called AddressParser which can be included in other npm projects independently of Pelias.
It is also possible to modify the configuration of AddressParser, the dictionaries or the semantics. You can also easily create a completely new parser to suit your own domain.
AddressParser Example
30 w 26 st nyc 10010
(0.95) ➜ [
{ housenumber: '30' },
{ street: 'w 26 st' },
{ locality: 'nyc' },
{ postcode: '10010' }
]
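The same parse can be driven from JavaScript. The sketch below is based on assumptions about the library's internals: the module paths (tokenization/Tokenizer, parser/AddressParser) and the classify/solve/solution names follow the repository layout but may differ between versions, so verify them against the code you install:

// sketch only: module paths and method names are assumptions, verify against the repo
const Tokenizer = require('pelias-parser/tokenization/Tokenizer')
const AddressParser = require('pelias-parser/parser/AddressParser')

const parser = new AddressParser()
const tokenizer = new Tokenizer('30 w 26 st nyc 10010')

parser.classify(tokenizer) // run the bundled classifiers over the token graph
parser.solve(tokenizer)    // generate and score candidate solutions

// solutions are sorted by confidence, highest first
console.log(tokenizer.solution)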
Application Interfaces
You can access the library via three different interfaces:
- all parts of the codebase are available in javascript via npm
- on the command line via the node bin/cli.js script
- through a web service via the node server/http.js script

the web service provides an interactive demo at the URL /parser/parse
Quick Start
A quick and easy way to get started with the library is to use the command-line interface:
node bin/cli.js West 26th Street, New York, NYC, 10010
Architecture Description
Please refer to the CLI screenshot above for a visual reference.
Tokenization
Tokenization is the process of splitting text into individual words.
The splitting process used by the engine maintains token positions, so it's able to 'remember' where each character was in the original input text.
Tokenization is coloured blue on the command-line.
Span
The most primitive element is called a span; this is essentially just a single string of text with some metadata attached.
The terms word, phrase and section (explained below) are all just ways of using a span.
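As a rough illustration only (not the library's actual Span implementation), a span can be pictured as a string slice plus position metadata:

// illustrative shape only, not the real implementation
const span = {
  body: 'Main St',    // the text covered by this span
  start: 3,           // character offset of the first character in the original input
  end: 10             // character offset just past the last character
}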
Section Boundaries
Some parsers, such as libpostal, ignore characters such as comma, tab, newline and quote.
While it's unrealistic to expect commas to always be present, it's very useful to record their positions when they are.
These boundary positions help to avoid parsing errors for queries such as Main St, East Village being parsed as Main St East in Village.
Once sections are established there is no 'bleeding' of information between sections, which avoids the issue above.
Word Splitting
Each section is then split into individual words; by default this simply considers whitespace as a word boundary.
As with sections, the original token positions are maintained.
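A minimal sketch of that idea, assuming spans are represented as in the illustration above (the real tokenizer in the repository is more involved):

// sketch only: split a section into word spans, preserving character positions
function splitWords (sectionBody, sectionStart = 0) {
  const words = []
  const pattern = /\S+/g
  let match
  while ((match = pattern.exec(sectionBody)) !== null) {
    words.push({
      body: match[0],
      start: sectionStart + match.index,
      end: sectionStart + match.index + match[0].length
    })
  }
  return words
}

console.log(splitWords('Main St', 0))
// [ { body: 'Main', start: 0, end: 4 }, { body: 'St', start: 5, end: 7 } ]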
Phrase Generation
Many terms, such as 'New York City', span multiple words; these multi-word tokens are called phrases.
In order to be able to classify phrase terms, permutations of adjacent words are generated.
Phrase generation is performed per-section, so it will not generate a phrase which contains words from more than one section.
Phrase generation is controlled by a configuration which specifies things like the minimum and maximum number of words allowed in a phrase.
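A hedged sketch of the permutation step: every contiguous run of adjacent words within one section becomes a candidate phrase, up to a configurable maximum length (the real implementation also applies a minimum length and other configuration):

// sketch only: contiguous word windows within a single section become candidate phrases
function generatePhrases (words, maxWords = 3) {
  const phrases = []
  for (let i = 0; i < words.length; i++) {
    for (let len = 2; len <= maxWords && i + len <= words.length; len++) {
      phrases.push(words.slice(i, i + len).join(' '))
    }
  }
  return phrases
}

console.log(generatePhrases(['new', 'york', 'city']))
// [ 'new york', 'new york city', 'york city' ]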
Token Graph
A graph is used to associate word, phrase and section elements with each other.
The graph is free-form, so it's easy to add a new relationship between terms in the future, as required.
Graph Example:
// find the next word in this section
word.findOne('next')

// find all words in this phrase
phrase.findAll('child')
Classification
Classification is the process of establishing that a word or phrase represents a 'concept' (such as a street name).
Classification can be based on:
- Dictionary matching (usually with normalization applied)
- Pattern matching (such as regular expressions)
- Composite matching (such as relative positioning)
- External API calls (such as calling other services)
- Other semantic matching techniques
Classification is coloured green and red on the command-line.
Classifier Types
The library comes with three generic classifiers which can be extended in order to create a new classifier:
- WordClassifier
- PhraseClassifier
- SectionClassifier
Classifiers
The library comes bundled with a range of classifiers out of the box.
You can find them in the /classifier directory; dictionary-based classifiers usually store their data in the /resources directory.
Example of some of the included classifiers:
// word classifiers
HouseNumberClassifier
PostcodeClassifier
StreetPrefixClassifier
StreetSuffixClassifier
CompoundStreetClassifier
DirectionalClassifier
OrdinalClassifier
StopWordClassifier

// phrase classifiers
IntersectionClassifier
PersonClassifier
GivenNameClassifier
SurnameClassifier
PersonalSuffixClassifier
PersonalTitleClassifier
ChainClassifier
PlaceClassifier
WhosOnFirstClassifier
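The bundled classifiers broadly follow one pattern: visit each word or phrase span and, when it matches a dictionary entry or a regular expression, attach a labelled classification with a confidence score. The following self-contained sketch illustrates that pattern only; it does not use the library's actual base classes or span API:

// sketch only: illustrates dictionary-based classification, not the library's API
const STREET_SUFFIXES = new Set(['st', 'street', 'ave', 'avenue', 'rd', 'road'])

function classifyStreetSuffix (span) {
  const normalized = span.body.toLowerCase().replace(/\./g, '')
  if (STREET_SUFFIXES.has(normalized)) {
    // record a labelled classification along with a confidence score
    span.classifications.StreetSuffix = { label: 'street_suffix', confidence: 1.0 }
  }
}

const span = { body: 'St.', start: 5, end: 8, classifications: {} }
classifyStreetSuffix(span)
console.log(span.classifications)
// { StreetSuffix: { label: 'street_suffix', confidence: 1 } }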
Solvers
Solving is the final process, where solutions are generated based on all the classifications that have been made.
Each parse can contain multiple solutions; each is given a confidence score and the solutions are displayed sorted from the highest-scoring to the lowest.
The core of this process is the ExclusiveCartesianSolver module.
This solver generates all the possible permutations of the different classifications while taking care to:
- ensure the same span position is not used more than once
- ensure that the same classification is not used more than once

After the ExclusiveCartesianSolver has run, there are additional solvers which can:
- filter the solutions to remove inconsistencies
- add new solutions to provide additional functionality (such as intersections)
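A hedged sketch of the exclusive-cartesian idea: combine at most one candidate per classification, dropping any combination that reuses a span position. It mirrors the intent described above, not the actual ExclusiveCartesianSolver code:

// sketch only: exclusive cartesian product over classified spans
// candidates maps a classification label to the spans carrying that classification
function exclusiveCartesian (candidates) {
  let solutions = [[]]
  for (const label of Object.keys(candidates)) {
    const next = []
    for (const partial of solutions) {
      next.push(partial) // a solution may omit this classification entirely
      for (const span of candidates[label]) {
        // ensure the same span position is not used more than once
        const overlaps = partial.some(p => span.start < p.span.end && p.span.start < span.end)
        if (!overlaps) next.push(partial.concat({ label, span }))
      }
    }
    solutions = next
  }
  return solutions
}

const solutions = exclusiveCartesian({
  housenumber: [{ body: '30', start: 0, end: 2 }, { body: '26', start: 5, end: 7 }],
  street: [{ body: 'w 26 st', start: 3, end: 10 }]
})

// '26' overlaps the street span, so only '30' can pair with 'w 26 st'
console.log(solutions.filter(s => s.length === 2).length) // 1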
Solution Masks
It is possible to produce a simple mask for any generated solution; this is useful for comparing the solution to the original text:
VVV VVVV NN SSSSSSS AAAAAA PPPPP
Foo Cafe 10 Main St London 10010 Earth
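A sketch of how such a mask might be derived, assuming the legend V = venue, N = housenumber, S = street, A = administrative area/locality and P = postcode (the legend is an assumption, and the function is illustrative, not the library's implementation):

// sketch only: overlay one letter per classified span onto a blank mask
function solutionMask (input, solution) {
  const mask = input.replace(/\S/g, ' ').split('') // start with an all-blank mask
  for (const { letter, span } of solution) {
    for (let i = span.start; i < span.end; i++) mask[i] = letter
  }
  return mask.join('').trimEnd()
}

const input = 'Foo Cafe 10 Main St London 10010 Earth'
console.log(solutionMask(input, [
  { letter: 'V', span: { start: 0, end: 3 } },   // Foo
  { letter: 'V', span: { start: 4, end: 8 } },   // Cafe
  { letter: 'N', span: { start: 9, end: 11 } },  // 10
  { letter: 'S', span: { start: 12, end: 19 } }, // Main St
  { letter: 'A', span: { start: 20, end: 26 } }, // London
  { letter: 'P', span: { start: 27, end: 32 } }  // 10010
]))
// 'VVV VVVV NN SSSSSSS AAAAAA PPPPP'  ('Earth' is left unclassified)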
Contributing
Please fork and pull request against upstream master on a feature branch. Pretty please; provide unit tests.
Unit tests
You can run the unit test suite using the command:
$ npm test
Continuous Integration
CI tests every release against all supported Node.js versions.
Versioning
We rely on semantic-release and Greenkeeper to maintain our module and dependency versions.
No vulnerabilities found.
Reason: no dangerous workflow patterns detected

Reason: no binaries found in the repo

Reason: 0 existing vulnerabilities detected

Reason: license file detected
Details:
- Info: project has a license file: LICENSE:0
- Info: FSF or OSI recognized license: MIT License: LICENSE:0

Reason: Found 15/24 approved changesets -- score normalized to 6

Reason: 0 commit(s) and 3 issue activity found in the last 90 days -- score normalized to 2

Reason: detected GitHub workflow tokens with excessive permissions
Details:
- Warn: no topLevel permission defined: .github/workflows/_test.yml:1
- Warn: no topLevel permission defined: .github/workflows/pull_request.yml:1
- Warn: no topLevel permission defined: .github/workflows/push.yml:1
- Info: no jobLevel write permissions found

Reason: no effort to earn an OpenSSF best practices badge detected

Reason: project is not fuzzed
Details:
- Warn: no fuzzer integrations found

Reason: security policy file not detected
Details:
- Warn: no security policy file detected
- Warn: no security file to analyze

Reason: dependency not pinned by hash detected -- score normalized to 0
Details:
- Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/_test.yml:15: update your workflow using https://app.stepsecurity.io/secureworkflow/pelias/parser/_test.yml/master?enable=pin
- Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/_test.yml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/pelias/parser/_test.yml/master?enable=pin
- Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/push.yml:11: update your workflow using https://app.stepsecurity.io/secureworkflow/pelias/parser/push.yml/master?enable=pin
- Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/push.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/pelias/parser/push.yml/master?enable=pin
- Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/push.yml:31: update your workflow using https://app.stepsecurity.io/secureworkflow/pelias/parser/push.yml/master?enable=pin
- Warn: containerImage not pinned by hash: Dockerfile:2: pin your Docker image by updating pelias/baseimage to pelias/baseimage@sha256:a0203c7e348c26f4dec408467a3e3c338d22c46a93d0a800ca7f5870da909582
- Warn: npmCommand not pinned by hash: Dockerfile:10
- Warn: npmCommand not pinned by hash: .github/workflows/_test.yml:22
- Warn: downloadThenRun not pinned by hash: .github/workflows/push.yml:22
- Warn: downloadThenRun not pinned by hash: .github/workflows/push.yml:37
- Info: 0 out of 5 GitHub-owned GitHubAction dependencies pinned
- Info: 0 out of 2 downloadThenRun dependencies pinned
- Info: 0 out of 1 containerImage dependencies pinned
- Info: 1 out of 3 npmCommand dependencies pinned

Reason: SAST tool is not run on all commits -- score normalized to 0
Details:
- Warn: 0 commits out of 30 are checked with a SAST tool

Score: 4.6/10
Last Scanned on 2024-11-18
The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.