Gathering detailed insights and metrics for pg-copy-streams
COPY FROM / COPY TO for node-postgres. Stream from one database to another, and stuff.
npm install pg-copy-streams
99.6
Supply Chain
99.5
Quality
79.8
Maintenance
100
Vulnerability
100
License
334 Stars
242 Commits
40 Forks
8 Watching
3 Branches
13 Contributors
Updated on 26 Nov 2024
JavaScript (99.53%)
Makefile (0.47%)
Did you know that PostgreSQL supports streaming data directly into and out of a table? This means you can take your favorite CSV or TSV file and pipe it directly into an existing PostgreSQL table.
PostgreSQL supports text, csv/tsv and binary data. If you have data in another format (JSON, for example), convert it to one of the supported formats and pipe it directly into an existing PostgreSQL table!
You can also take a table and pipe it directly to a file, another database, stdout, even to /dev/null if you're crazy!
What this module gives you is a Readable or Writable stream directly into/out of a table in your database. This mode of interfacing with your table is very fast and very brittle. You are responsible for properly encoding and ordering all your columns. If anything is out of place, PostgreSQL will send you back an error. The stream works within a transaction so you won't leave things in a 1/2 borked state, but it's still good to be aware of.
If you're not familiar with the feature (I wasn't either), you can read this for some good help: https://www.postgresql.org/docs/current/sql-copy.html
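Since encoding is left to the caller, here is a minimal sketch of how one row could be encoded for the default text format of COPY FROM. The `encodeTextRow` helper, the sample records and the column names are hypothetical and not part of pg-copy-streams. The sketch assumes the default tab delimiter and the default `\N` NULL marker; the csv and binary formats follow different rules.

```javascript
// Hypothetical helper (not part of pg-copy-streams): encodes one row for
// COPY ... FROM STDIN in the default text format.
function encodeTextRow(values) {
  const fields = values.map((value) => {
    // NULL is written as \N in the default text format
    if (value === null || value === undefined) return '\\N'
    return String(value)
      .replace(/\\/g, '\\\\') // escape backslashes first
      .replace(/\t/g, '\\t') // the delimiter must not appear raw inside a field
      .replace(/\n/g, '\\n') // neither must the row terminator
      .replace(/\r/g, '\\r')
  })
  return fields.join('\t') + '\n'
}

// Illustrative data: the column order must match the target table
// (or the column list given in the COPY statement).
const columns = ['id', 'name', 'note']
const records = [
  { id: 1, name: 'alice', note: null },
  { id: 2, name: 'bob', note: 'line one\nline two' },
]
const payload = records
  .map((record) => encodeTextRow(columns.map((column) => record[column])))
  .join('')
```

The resulting `payload` string is what would be written into a copyFrom stream, for example through `Readable.from([payload])` and `pipeline`.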
```javascript
var { Pool } = require('pg')
// CommonJS destructuring renames with ":" ("as" is only valid in import statements)
var { to: copyTo } = require('pg-copy-streams')

var pool = new Pool()

pool.connect(function (err, client, done) {
  if (err) throw err // no client could be acquired
  var stream = client.query(copyTo('COPY my_table TO STDOUT'))
  stream.pipe(process.stdout)
  stream.on('end', done) // release the client once the COPY is complete
  stream.on('error', done)
})
```

```javascript
// async/await
import { pipeline } from 'node:stream/promises'
import pg from 'pg'
import { to as copyTo } from 'pg-copy-streams'

const pool = new pg.Pool()
const client = await pool.connect()
try {
  const stream = client.query(copyTo('COPY my_table TO STDOUT'))
  await pipeline(stream, process.stdout)
} finally {
  client.release() // always give the client back to the pool
}
await pool.end()
```
Important: When copying data out of PostgreSQL, PostgreSQL will chunk the data on 64kB boundaries. You should expect rows to be cut across the boundaries of these chunks (the end of a chunk will not always match the end of a row). If you are piping the csv output of postgres into a file, this might not be a problem. But if you are trying to analyse the csv output on-the-fly, you need to make sure that you correctly discover the lines of the csv output across the chunk boundaries. We do not recommend any specific streaming csv parser, but csv-parser and csv-parse seem to handle this correctly.
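To illustrate what handling rows across chunk boundaries involves, here is a minimal sketch that reassembles rows from arbitrary chunks. `RowSplitter` and `rowStream` are hypothetical names, not part of pg-copy-streams. The sketch assumes the default text format, where a newline inside a value is escaped so every raw newline ends a row; csv output with quoted multi-line fields needs a real csv parser such as the ones mentioned above.

```javascript
const { Transform } = require('node:stream')
const { StringDecoder } = require('node:string_decoder')

// Hypothetical helper: buffers a partial row until the newline that
// terminates it arrives in a later chunk.
class RowSplitter {
  constructor() {
    this.decoder = new StringDecoder('utf8') // keeps split multi-byte characters intact
    this.pending = ''
  }

  // Returns the rows completed by this chunk.
  push(chunk) {
    const parts = (this.pending + this.decoder.write(chunk)).split('\n')
    this.pending = parts.pop() // trailing partial row, or '' if the chunk ended on a newline
    return parts
  }

  // Returns whatever is left once the source has ended.
  flush() {
    const rest = this.pending + this.decoder.end()
    this.pending = ''
    return rest === '' ? [] : [rest]
  }
}

// Transform wrapper (object mode on the readable side) that could be placed
// after a copyTo stream in a pipeline; it emits one string per row.
function rowStream() {
  const splitter = new RowSplitter()
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, callback) {
      for (const row of splitter.push(chunk)) this.push(row)
      callback()
    },
    flush(callback) {
      for (const row of splitter.flush()) this.push(row)
      callback()
    },
  })
}
```

With this in place, something like `pipeline(client.query(copyTo('COPY my_table TO STDOUT')), rowStream(), destination)` would hand complete rows to `destination`, regardless of where the chunks were cut.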
```javascript
var fs = require('node:fs')
var { Pool } = require('pg')
// CommonJS destructuring renames with ":" ("as" is only valid in import statements)
var { from: copyFrom } = require('pg-copy-streams')

var pool = new Pool()

pool.connect(function (err, client, done) {
  if (err) throw err // no client could be acquired
  var stream = client.query(copyFrom('COPY my_table FROM STDIN'))
  var fileStream = fs.createReadStream('some_file.tsv')
  fileStream.on('error', done)
  stream.on('error', done)
  stream.on('finish', done) // 'finish' signals the end of the COPY operation
  fileStream.pipe(stream)
})
```

```javascript
// async/await
import { pipeline } from 'node:stream/promises'
import fs from 'node:fs'
import pg from 'pg'
import { from as copyFrom } from 'pg-copy-streams'

const pool = new pg.Pool()
const client = await pool.connect()
try {
  const ingestStream = client.query(copyFrom('COPY my_table FROM STDIN'))
  const sourceStream = fs.createReadStream('some_file.tsv')
  await pipeline(sourceStream, ingestStream)
} finally {
  client.release() // always give the client back to the pool
}
await pool.end()
```
Note: In versions prior to 4.0.0, when copying data into PostgreSQL, it was necessary to wait for the 'end' event of pg-copy-streams.from to correctly detect the end of the COPY operation. This was necessary due to the internals of the module but was non-standard. This is no longer true for version 4.0.0 and later. The end of the COPY operation must now be detected via the standard 'finish' event. Users of 4.0.0+ should not wait for the 'end' event because it is not fired anymore.
In version 6.0.0+, if you have not yet finished ingesting data into a copyFrom stream and you want to ask PostgreSQL to abort the process, you can call destroy() on the stream (or let pipeline do it for you if it detects an error in the pipeline). This will send a CopyFail message to the backend that will roll back the operation. Please take into account that this will not revert the operation if the CopyDone message has already been sent and is being processed by the backend.
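As a rough sketch of the abort path, the helper below calls destroy() only while the stream is still ingesting. `abortCopy` is a hypothetical name, and a plain Writable stands in for the stream returned by `client.query(copyFrom(...))` so that the example runs without a database. The check on `writableEnded` is an assumption about when CopyDone may already have been sent, not documented behaviour of the module.

```javascript
const { Writable } = require('node:stream')

// Hypothetical helper (not part of pg-copy-streams): requests an abort only
// while the stream is still ingesting. Once end() has been called, CopyDone
// may already be on its way, so destroy() is skipped (assumption).
function abortCopy(ingestStream, reason) {
  if (ingestStream.destroyed || ingestStream.writableEnded) return false
  ingestStream.destroy(new Error(reason))
  return true
}

// Stand-in for client.query(copyFrom('COPY my_table FROM STDIN')).
const ingestStream = new Writable({
  write(chunk, encoding, callback) {
    callback()
  },
})
// destroy(err) emits 'error'; without a listener the process would crash.
ingestStream.on('error', (err) => console.error('COPY aborted:', err.message))

ingestStream.write('1\talice\n')
const aborted = abortCopy(ingestStream, 'validation failed')
const abortedAgain = abortCopy(ingestStream, 'validation failed')
```

When the source and the copyFrom stream are connected with `pipeline`, an error in the source destroys the copyFrom stream automatically, so an explicit helper like this is only needed for aborts decided elsewhere in the application.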
The copy-both duplex stream, intended for replication and logical decoding scenarios, is a more advanced topic. Check the test/copy-both.js file for an example of how it can be used.
Note regarding logical decoding: Parsers for logical decoding scenarios are easier to write when copy-both.js pushes chunks that are aligned on the copyData protocol frames. This is not the default mode of operation of copy-both.js, in order to increase the streaming performance. If you need the pushed chunks to be aligned on copyData frames, use the alignOnCopyDataFrame: true option.
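To show why frame alignment simplifies parsing, here is a hedged sketch of a parser for one aligned chunk. It assumes that, with alignOnCopyDataFrame: true, each pushed chunk is exactly one CopyData payload starting with its message-type byte, and it follows the XLogData ('w') and primary keepalive ('k') layouts of the PostgreSQL streaming replication protocol. `parseReplicationMessage` is a hypothetical name; compare with test/copy-both.js before relying on it.

```javascript
// Hypothetical parser for one frame-aligned chunk (assumption: the chunk is a
// single CopyData payload beginning with its message-type byte).
function parseReplicationMessage(chunk) {
  const tag = String.fromCharCode(chunk[0])
  if (tag === 'w') {
    // XLogData: WAL start (8 bytes), WAL end (8 bytes), server clock (8 bytes), then data
    return {
      type: 'XLogData',
      walStart: chunk.readBigUInt64BE(1),
      walEnd: chunk.readBigUInt64BE(9),
      serverClock: chunk.readBigInt64BE(17),
      data: chunk.subarray(25),
    }
  }
  if (tag === 'k') {
    // Primary keepalive: WAL end (8 bytes), server clock (8 bytes), reply-requested flag (1 byte)
    return {
      type: 'keepalive',
      walEnd: chunk.readBigUInt64BE(1),
      serverClock: chunk.readBigInt64BE(9),
      replyRequested: chunk[17] === 1,
    }
  }
  return { type: 'unknown', tag }
}

// Hand-built XLogData frame used only to exercise the parser.
const header = Buffer.alloc(25)
header.write('w', 0)
header.writeBigUInt64BE(16n, 1)
header.writeBigUInt64BE(32n, 9)
const sampleMessage = parseReplicationMessage(Buffer.concat([header, Buffer.from('BEGIN')]))
```

Without alignment, a chunk may start or end in the middle of such a frame, and the parser would first have to reassemble frames from the byte stream.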
```shell
$ npm install pg-copy-streams
```
This module only works with the pure JavaScript bindings. If you're using require('pg').native, please make sure to use normal require('pg') or require('pg.js') when you're using copy streams.
Before you set out on this magical piping journey, you really should read this: http://www.postgresql.org/docs/current/static/sql-copy.html, and you might want to take a look at the tests to get an idea of how things work.
Take note of the following warning in the PostgreSQL documentation:
COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will already have received earlier rows in a COPY FROM. These rows will not be visible or accessible, but they still occupy disk space. This might amount to a considerable amount of wasted disk space if the failure happened well into a large copy operation. You might wish to invoke VACUUM to recover the wasted space.
The COPY command is commonly used to move huge sets of data. This can put some pressure on the Node.js event loop, on CPU usage or on memory usage.
There is a bench/ directory in the repository where benchmark scripts are stored. If you have performance issues with pg-copy-streams, do not hesitate to write a new benchmark that highlights your issue. Please avoid committing huge files (such PRs won't be accepted) and find other ways to generate huge datasets.
If you have a local instance of postgres on your machine, you can start a benchmark, for example with:

```shell
$ cd bench
$ PGPORT=5432 PGDATABASE=postgres node copy-from.js
```
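One way to generate a huge dataset without committing a file is to produce the rows on the fly. The sketch below is only an illustration: `generateRows` and `syntheticSource` are hypothetical names, and the two-column layout is made up rather than taken from the scripts in bench/.

```javascript
const { Readable } = require('node:stream')

// Hypothetical generator: yields `count` tab-separated rows, one at a time,
// so the whole dataset never has to exist in memory or on disk.
function* generateRows(count) {
  for (let i = 0; i < count; i++) {
    yield `${i}\tname-${i}\n`
  }
}

// Wraps the generator in a Readable that pulls rows lazily as the
// destination asks for more data.
function syntheticSource(count) {
  return Readable.from(generateRows(count), { objectMode: false })
}

const preview = [...generateRows(3)].join('')
```

A benchmark could then pipe `syntheticSource(1e6)` into a copyFrom stream instead of reading a file from the repository.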
In order to launch the test suite, you need to have a local instance of postgres running on your machine.
Since version 5.1.0 and the implementation of copy-both.js for logical decoding scenarios, your local postgres instance needs to be configured to accept replication scenarios:
postgresql.conf
  wal_level = logical
  max_wal_senders > 0
  max_replication_slots > 0
pg_hba.conf
  make sure your user can connect using the replication mode
```shell
$ PGPORT=5432 PGDATABASE=postgres make test
```
Instead of adding a bunch more code to the already bloated node-postgres I am trying to make the internals extensible and work on adding edge-case features as 3rd party modules. This is one of those.
Please, if you have any issues with this, open an issue.
Better yet, submit a pull request. I love pull requests.
Generally how I work is if you submit a few pull requests and you're interested I'll make you a contributor and give you full access to everything.
Since this isn't a module with tons of installs and dependent modules I hope we can work together on this to iterate faster here and make something really useful.
pg optional timeout mechanism
pipeline will automatically send a CopyFail message to the backend if a source triggers an error. cf #115
This version is a major change because some users of the library may have been using other techniques in order to ask the backend to roll back the current operation.
Bugfix release handling a corner case when an empty stream is piped into copy-from
This version adds a Duplex stream implementation of the PostgreSQL copyBoth mode described on https://www.postgresql.org/docs/9.6/protocol-flow.html. This mode opens the possibility of dealing with replication and logical decoding scenarios.
This version's major change is a modification in the COPY TO implementation. The new implementation now extends Readable, while previous versions extended Transform. This should not affect how users use the module, but it was considered to justify a major version number: even though the test suite coverage is wide, the change could have an impact on the streaming dynamics in certain edge cases that are not yet captured by the tests.
Readable instead of Transform
This version's major change is a modification in the COPY FROM implementation. In previous versions, copy-from was internally designed as a Transform duplex stream. The user-facing API was writable, and the readable side of the Transform was piped into the postgres connection stream to copy the data inside the database.
This led to an issue because Transform was emitting its 'finish' event too early, right after the writable side was ended. Postgres had not yet read all the data on the readable side and had not confirmed that the COPY operation was finished. The recommendation was to wait for the 'end' event on the readable side, which correctly detected the end of the COPY operation and the fact that the pg connection was ready for new queries.
This recommendation worked, but this way of detecting the end of a writable is not standard and was leading to various issues (interaction with the finished and pipeline APIs, for example).
The new copy-from implementation extends Writable and now emits 'finish' with the correct timing: after the COPY operation and after the postgres connection has reached the readyForQuery state.
Another big change in this version is that copy-to now shortcuts the core pg parsing during the COPY operation. This avoids double-parsing and avoids the fact that pg buffers whole postgres protocol messages.
Writable instead of Transform
This version's major change is a modification in the COPY TO implementation. In the previous versions, a row could be pushed downstream only after the full row was gathered in memory. In many cases, rows are small and this is not an issue. But there are some use cases where rows can grow bigger (think of a row containing a 1MB raw image in a BYTEA field. cf issue #91). In these cases, the library was constantly trying to allocate very big buffers and this could lead to severe performance issues. In the new implementation, all the data payload received from a postgres chunk is sent downstream without waiting for full row boundaries.
Some users may in the past have relied on the fact that the copy-to chunk boundaries exactly matched row boundaries. A major difference in the 3.x version is that the module does not offer any guarantee that its chunk boundaries match row boundaries. The data of a row could (and you have to realize that this will happen) be split across 2 or more chunks depending on the size of the rows and on postgres's own chunking decisions.
As a consequence, when the copy-to stream is piped into a pipeline that does row/CSV parsing, you need to make sure that this pipeline correctly handles rows that span chunk boundaries. For its tests, this module uses the csv-parser module.
prettier configuration following discussion on brianc/node-postgres#2172
This version's major change is a modification in the COPY TO implementation. In the previous version, when a chunk was received from the database, it was analyzed and every row contained within that chunk was pushed individually down the stream pipeline. Small rows could lead to a "one chunk" / "thousands of rows pushed" performance issue in node. Thanks to @rafatower & CartoDB for the patch. This is considered to be a major change since some people could be relying on the fact that each outgoing chunk is an individual row.
Other changes in this version
The MIT License (MIT)
Copyright (c) 2013 Brian M. Carlson
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.