Do transform streams have to handle backpressure as well? #2695
Comments
Yes. Transform and Duplex are just glue for Readable / Writable, so with respect to push just think of the rules for Readable. If you post a link to what you saw online we can dissect it. Maybe there's some nuance to it, but it's equally likely that it's just wrong. Streams take some time to grasp, but like most things, if you think about it long enough it starts to make sense. Ignoring push (or write) returning false is very bad.

Note that since a Promise is really just a callback wrapped in an object, and you already have an object (`this`), you could just do things the declarative way:

```js
_pushString(arrayOfString, ai, callback) {
  if (ai === arrayOfString.length)
    return callback();
  if (this.push(arrayOfString[ai]) === false) {
    return this.once('drain', () => {
      this._pushString(arrayOfString, ai + 1, callback);
    });
  }
  this._pushString(arrayOfString, ai + 1, callback);
}

_transform(chunk, encoding, callback) {
  const arrayOfString = extractString(chunk);
  this._pushString(arrayOfString, 0, callback);
}
```

At least this could be a lot faster if the reader is slow, because you're not creating lots of Promise objects and doing unnecessary context switches.
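For comparison, the Promise-based alternative would look something like the sketch below (hypothetical; it assumes the same extractString helper and relies on the same 'drain' event as the callback version above), allocating a Promise per backpressure pause:

```js
async _transform(chunk, encoding, callback) {
  try {
    for (const s of extractString(chunk)) {
      if (!this.push(s)) {
        // Wait for the same 'drain' event the callback version waits for,
        // but wrapped in a throwaway Promise each time.
        await new Promise(resolve => this.once('drain', resolve));
      }
    }
    callback();
  } catch (err) {
    callback(err);
  }
}
```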
This would be even easier to read IMO.

```ts
export class MyTransform extends stream.Transform {
  constructor() {
    super({ objectMode: true });
  }

  _pushWithBackpressure(chunk, encoding) {
    if (!this.push(chunk, encoding))
      this.once('drain', () => this._pushWithBackpressure(chunk, encoding))
  }

  _transform(chunk: any, encoding: string, callback: TransformCallback) {
    const arrayOfStrings = extractStrings(chunk);
    for (const string of arrayOfStrings) {
      this._pushWithBackpressure(string)
    }
    callback()
  }
}
```
No! That will queue up all strings AND install extraneous drain handlers. When doing I/O, JS is more declarative than functional. It's a fundamentally different way of thinking. That is what Promises try to solve but fail to, IMO.
Yeah, I've tested this and I get a memory leak warning due to the number of 'drain' listeners that end up being added.

Also, regarding Promises: fundamentally, adding listeners and using promises do the same thing.

TBH I have come to realize that the streams documentation and the stream API around backpressure handling are woefully inadequate. There's talk about backpressure handling and yet no fundamental mechanism that can be leveraged to handle it gracefully. I get that when push() returns false you're supposed to stop pushing; what's missing is a clear way to know when to resume from inside a Transform.
Or, have documentation for handling common backpressure scenarios (e.g. a Transform that pushes multiple times from a single chunk, etc.).
For example, with my solution and your solution we'll still get the max listener warning for, say, a stream of 72K records in object mode coming through. Your solution is optimized for pushing multiple chunks/objects sequentially, but it still ends up adding a 'drain' listener every time push() returns false.

So the question is: how are we supposed to handle this so that we don't get the listener warning nonsense and yet still respect backpressure? I guess just arbitrarily raise the listener limit?

For some context, I have a scenario where I'm reading a large CSV file (tens of thousands of records) and transforming each row into a JSON object.
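For what it's worth, raising the EventEmitter listener limit only silences the warning; the accumulated listeners are still there. A sketch of that blunt workaround (myTransform is a placeholder for the stream instance):

```js
// 0 removes the listener limit entirely; any positive number raises it.
// This hides the MaxListenersExceededWarning but does not stop the
// 'drain' listeners from piling up.
myTransform.setMaxListeners(0)
```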
@squarewav I see the genius of your solution; you only ever have a single 'drain' listener outstanding at a time.

NOTE: I've made some corrections from my original. Would this be a more generalized approach to your solution?

```js
/**
 * Pushes one or more chunks to a Duplex stream while handling backpressure.
 *
 * @example
 * pushWithBackpressure(stream, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], () => ...)
 * pushWithBackpressure(stream, ['hi', 'hello'], 'utf8', () => ...)
 * @param {Duplex} stream The Duplex stream the chunks will be pushed to
 * @param {any} chunks The chunk or array of chunks to push to the stream
 * @param {string|Function} [encoding] The encoding of each string chunk, or the callback function
 * @param {Function} [callback] Callback function called after all chunks have been pushed to the stream
 * @return {Duplex}
 */
const pushWithBackpressure = (stream, chunks, encoding, callback, { $index = 0 } = {}) => {
  if (!(stream instanceof Duplex)) {
    throw new TypeError('Argument "stream" must be an instance of Duplex')
  }
  chunks = [].concat(chunks).filter(x => x !== undefined)
  if (typeof encoding === 'function') {
    callback = encoding
    encoding = undefined
  }
  if ($index >= chunks.length) {
    if (typeof callback === 'function') {
      callback()
    }
    return stream
  } else if (!stream.push(chunks[$index], ...([encoding].filter(Boolean)))) {
    console.error('BACKPRESSURE', $index)
    return stream.once('drain', () => {
      console.error('DRAIN')
      pushWithBackpressure(stream, chunks, encoding, callback, { $index: $index + 1 })
    })
  }
  return pushWithBackpressure(stream, chunks, encoding, callback, { $index: $index + 1 })
}
```
I've made several attempts to implement this function in a Transform stream, and I've found that the 'drain' event is never emitted. My Transform only works if I iterate over the objects I want to push to the stream and then call the callback.

I noticed a passage about this in the backpressure guidelines: https://nodejs.org/es/docs/guides/backpressuring-in-streams/

Could it be that Transform streams automatically handle backpressure when calling push multiple times?
I didn't really read through all of your comments, but your understanding of write() is not accurate. There are two ways to call write(). One is with a callback, in which case you should not call write() again until after the callback is called. So for a transform, this means that _transform will not be called repeatedly and therefore no drain events will build up (using my code). The other way to call write() is to not use the callback and instead monitor the return value of write(). When it returns false, you stop writing. For an object mode stream, you would normally set writableHighWaterMark to something low (the default is 16, I think), but in practice it might be like 2 or 3 depending on what exactly the transform is doing. So again, no significant number of drain events will build up.

You can call push multiple times for one chunk. That's what buffering is for. Streams are a relatively sophisticated feature. It takes some time to "get" it. Just keep going.

The real problem with streams is error handling. You can propagate errors up but not down. If an error occurs in a stream that is not at the end of a pipeline of streams, it is borderline impossible to clean up the downstream streams. The result is resource leaks. That is the Achilles' heel of nodejs streams actually (and maybe nodejs in general, since HTTP clients are effectively downstream streams and you have no control over what HTTP clients do or do not do).
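To make the two ways of calling write() concrete, here is a minimal sketch (the pass-through transform, the small high-water mark, and the item list are made up for illustration):

```js
const { Transform } = require('stream')

// Object-mode pass-through Transform with a small writable buffer so
// backpressure kicks in early (in object mode the high-water mark is
// counted in objects, not bytes).
const transform = new Transform({
  objectMode: true,
  writableHighWaterMark: 2,
  transform (chunk, encoding, callback) {
    callback(null, chunk)
  }
})

// Consume the readable side so the transform keeps draining.
transform.on('data', () => {})

// Style 1: pass a callback to write() and don't write again until it fires.
function writeSerially (items, i = 0) {
  if (i === items.length) return transform.end()
  transform.write(items[i], () => writeSerially(items, i + 1))
}

// Style 2: watch write()'s return value; when it returns false, stop and
// wait for 'drain' on the writable side before resuming.
function writeUntilFull (items, i = 0) {
  while (i < items.length) {
    if (!transform.write(items[i++])) {
      transform.once('drain', () => writeUntilFull(items, i))
      return
    }
  }
  transform.end()
}

writeSerially([1, 2, 3, 4, 5]) // or: writeUntilFull([1, 2, 3, 4, 5])
```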
@squarewav I honestly can't get your solution to work though (or mine, for that matter). My Transform stream never receives a 'drain' event after push() returns false.

Agreed, error handling with streams is a crap shoot; using pipeline() at least gives you a single place to catch errors.
Hate to draw this out. However, I was able to write a contrived test script that doesn't have the issue. A problem arises when I read from a CSV text file, parse it in Transform#_transform, and call pushWithBackpressure:

```js
/**
 * Converts a CSV file into a stream of JSON records.
 *
 * The CSV stream will have its encoding set to "utf-8" when calling this function.
 *
 * Each JSON record will be keyed with the first CSV record in the file (i.e. the
 * column/field names will be taken from the first line of the file).
 *
 * The returned Readable stream is readable in object mode, where each record is
 * available on the stream as a JSON object.
 *
 * @example
 * const r = FS.createReadStream('data.csv')
 * csvToJsonStream(r).on('data', record => console.log(record))
 * @param {import('stream').Readable} csvStream The CSV string stream to convert into a JSON stream
 * @param {{ columns?: string[] }} [options]
 * @return {import('stream').Readable} The readable stream of JSON records
 */
const csvToJsonStream = (csvStream, { columns = null } = {}) => {
  if (!(csvStream instanceof Readable)) {
    throw new TypeError('CSV stream must be an instance of Readable')
  }
  return pipeline(
    csvStream,
    new CsvRowStream(),
    new JsonStream(columns),
    error => {
      if (error) {
        console.error('CsvExtractor error :: ' + error)
      }
    }
  )
}

// Convert a CSV text stream into an object-mode stream of string arrays.
// The CSV headers are read from the first line in the text stream.
class CsvRowStream extends Transform {
  constructor () {
    super({ readableObjectMode: true, defaultEncoding: 'utf8' })
    this._buffer = ''
    this._columnCount = 0
  }

  _transform (chunk, encoding, done) {
    try {
      this._buffer += chunk
      const { records, lastIndex } = parseCsv(this._buffer, { colCount: this._columnCount })
      if (lastIndex && this._buffer[lastIndex - 1] === '\n') {
        this._buffer = this._buffer.slice(lastIndex)
      }
      if (!this._columnCount) {
        this._columnCount = (records[0] || []).length
      }
      // Drain is never emitted! So this doesn't work.
      pushWithBackpressure(this, records, done)
      // for (const r of records) {
      //   this.push(r)
      // }
      // done()
    } catch (error) {
      done(error)
    }
  }
}

// Convert an object-mode stream of string arrays into an object-mode stream of JSON objects.
// Object field names are read from the first record/entry in the stream.
class JsonStream extends Transform {
  constructor (columns) {
    super({ objectMode: true })
    this._columns = columns ? columns.slice() : columns
  }

  get columns () {
    return this._columns ? this._columns.slice() : null
  }

  set columns (value) {
    if (!Array.isArray(value) || value.some(c => !c || typeof c !== 'string')) {
      throw new TypeError('Columns must be an array of non-empty strings')
    }
    this._columns = value.slice()
  }

  _transform (fields, _, done) {
    if (!Array.isArray(fields)) {
      done(null, fields)
    } else if (this._columns && this._columns.length) {
      try {
        if (fields.length !== this._columns.length) {
          throw Object.assign(
            new Error(`Column count mismatch. Expected ${this._columns.length} fields and only got ${fields.length}`),
            { name: 'ColumnMismatchError', fields: fields.slice(), columns: this._columns.slice() }
          )
        }
        const entity = fields.reduce((obj, value, col) => {
          return Object.assign(obj, { [this._columns[col]]: value })
        }, {})
        done(null, entity)
      } catch (error) {
        done(error)
      }
    } else if (!this._columns) {
      try {
        this.columns = fields
        done(null)
      } catch (error) {
        done(error)
      }
    } else {
      done(null, fields)
    }
  }
}
```
@squarewav I've narrowed my problem down to a completely different transform. So, to conclude my blunder here: your solution and my generalized solution both work as advertised. My problem was due to some other area of my pipeline. I will report back when I discover my issue.
@squarewav My problem came from me creating the source stream, then running some async code (awaiting) before I piped the source to the transforms and the destination stream. My guess is that the source stream was filling the writable side of the stream too quickly. Moving the async work to before the source stream is opened solved my problem of no drain events being fired in my Transform instances. However, I have a few questions, if you don't mind helping me with them:
Example:
@squarewav Ok. After changing the order of my async operations I thought I was out of the dark, but the fundamental problem still remains: Transform classes DO NOT emit the 'drain' event. I have found that in order to get a 'drain' event at all, you have to listen for it on the stream(s) that the readable side is piped to.

This was the point I was trying to make above, although I got a bit wordy about it; there is inadequate documentation on how to properly handle backpressure when implementing a Transform class that needs to push several chunks onto the read queue for every call to _transform(). I get the whole backpressure dance when calling write(); it's the push() side that is underdocumented.
Here's an updated version of pushWithBackpressure that attaches the 'drain' listener to the stream(s) the readable side is piped to:

```js
const { Duplex } = require('stream')

/**
 * Pushes one or more chunks to a Duplex stream while handling backpressure.
 *
 * @example
 * pushWithBackpressure(stream, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], () => ...)
 * pushWithBackpressure(stream, ['hi', 'hello'], 'utf8', () => ...)
 * @param {Duplex} stream The Duplex stream the chunks will be pushed to
 * @param {any} chunks The chunk or array of chunks to push to the stream
 * @param {string|Function} [encoding] The encoding of each string chunk, or the callback function
 * @param {Function} [callback] Callback function called after all chunks have been pushed to the stream
 * @return {Duplex}
 */
const pushWithBackpressure = (stream, chunks, encoding, callback = null, $index = 0) => {
  if (!(stream instanceof Duplex)) {
    throw new TypeError('Argument "stream" must be an instance of Duplex')
  }
  chunks = [].concat(chunks).filter(x => x !== undefined)
  if (typeof encoding === 'function') {
    callback = encoding
    encoding = undefined
  }
  if ($index >= chunks.length) {
    if (typeof callback === 'function') {
      callback()
    }
    return stream
  } else if (!stream.push(chunks[$index], ...([encoding].filter(Boolean)))) {
    // 'drain' comes from the writable side of whatever this stream is piped
    // to, not from this stream's own readable side, so listen there.
    const pipedStreams = [].concat(
      (stream._readableState || {}).pipes || stream
    ).filter(Boolean)
    let listenerCalled = false
    const drainListener = () => {
      if (listenerCalled) {
        return
      }
      listenerCalled = true
      for (const stream of pipedStreams) {
        stream.removeListener('drain', drainListener)
      }
      pushWithBackpressure(stream, chunks, encoding, callback, $index + 1)
    }
    for (const stream of pipedStreams) {
      stream.once('drain', drainListener)
    }
    return stream
  }
  return pushWithBackpressure(stream, chunks, encoding, callback, $index + 1)
}

exports.pushWithBackpressure = pushWithBackpressure
```
@squarewav I couldn't have done this without your help. Thanks, just wish there was a better way to go about this.
OMG, I'm having exactly the same issue and went through the same painful process. I really wish someone could update the documentation on how to properly handle backpressure, or add a mechanism for handling it easily.
It works for me:
Hello, I have carefully debugged the Readable and Writable source code (Transform streams rely on them), so you don't need to check the return value of this.push(). You should just use the Transform stream like the following code:
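A minimal sketch of the pattern being described here (push everything from a chunk synchronously, then call the callback, and let pipe() do the throttling); the splitIntoRecords helper is a hypothetical stand-in for whatever per-chunk parsing you do:

```js
const { Transform } = require('stream')

class RecordSplitter extends Transform {
  constructor () {
    super({ readableObjectMode: true })
  }

  _transform (chunk, encoding, callback) {
    // Push everything produced from this chunk synchronously...
    for (const record of splitIntoRecords(chunk.toString())) {
      this.push(record)
    }
    // ...then signal completion. pipe() decides when to feed the next
    // chunk based on how quickly the downstream consumer keeps up.
    callback()
  }
}
```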
It's very important to call this.push and the callback synchronously. Why is that? You can understand it simply as: this.push only puts the data into the readable stream's buffer, and callback() lets that data be consumed immediately without having to pass through the writable stream's buffer. So it fits very well with the pipe method for controlling backpressure. I can implement a simplified pipe method in Node.js:
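A minimal sketch of what such a simplified pipe might look like, roughly the pause/resume dance the real pipe() performs rather than the actual Node.js implementation:

```js
// Simplified pipe: pause the source whenever the destination's buffer is
// full, and resume it once the destination emits 'drain'.
function simplePipe (src, dest) {
  src.on('data', chunk => {
    if (!dest.write(chunk)) {
      src.pause()
      dest.once('drain', () => src.resume())
    }
  })
  src.on('end', () => dest.end())
  return dest
}
```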
I wrote a Chinese article exploring a similar principle, so forgive me for not being able to provide it in English.
@vlopp is your issue resolved?
It seems there has been no activity on this issue for a while, and it is being closed in 30 days. If you believe this issue should remain open, please leave a comment.
It seems there has been no activity on this issue for a while, and it is being closed. If you believe this issue should remain open, please leave a comment.
Do Node's Transform streams have to watch push's return value and wait for the 'drain' event as well? Since under the hood they're just a read stream and a write stream connected together, I'd assume so; nevertheless, all online implementations seem to just push at will.