streams #30
Hi David, when the input image is on the filesystem, libvips only reads data as it is needed further down the processing pipeline. It's not "streaming" in the Node.js sense of the word but it does minimise memory usage. True streaming of compressed input data from a network socket would probably require modification to libvips. Summoning the vastly superior knowledge of @jcupitt, who may be able to offer help here.
btw you'd want to change the …
Hi @davidmarkclements, as @lovell says, this is currently sort-of supported. When you process a large jpeg image (for example), libvips never decompresses the whole thing all at once; it just decompresses and then recompresses the area being processed as it works down the image. If you want to work over unix pipes or network sockets, you'd need to add a new pair of load and save operations. You'd then need to expose that in sharp somehow; @lovell would be the expert there.
If you're curious, the load-jpeg-from-buffer class is defined here: https://github.com/jcupitt/libvips/blob/master/libvips/foreign/jpegload.c#L249 And it calls this set of functions to do the work: https://github.com/jcupitt/libvips/blob/master/libvips/foreign/jpeg2vips.c#L1042 So you'd probably need to add under 500 lines of C.
👍 for streams!
Sorry for the delay on this, guys. Look, I can happily get involved in the JS side of things, in terms of creating the stream interfaces and hooking them up to the bindings (although I'd have to research that; I haven't worked with bindings before). But as for actually editing the C code, it's not my area and I wouldn't want to screw it up. If someone else who's strong in C wants to team up, let me know and we'll get this done. I think it would be a huge win to have true stream image processing available in JS - I really don't know of anything else right now that offers that.
OK, I'll try doing jpeg as an experiment. I've not done much with sockets; I guess I just read() and don't seek(). Would you always be reading from streams with a known type? Sniffing jpeg/png/etc. from a stream would be fiddly. I guess you'd read a MIME header (something like that?) first to get the file type, or perhaps you'd be streaming from a source you controlled.
I've made an experimental branch with a quick hack to support load-jpeg-from-stream: https://github.com/jcupitt/libvips/tree/load-from-stream

For example, you can do this:

    $ cat k2.jpg | vips jpegload_fd 0 x.v

Call from C like this:

/**
* vips_jpegload_fd:
* @fd: descriptor to load from
* @out: image to write
* @...: %NULL-terminated list of optional named arguments
*
* Optional arguments:
*
* @shrink: shrink by this much on load
* @fail: fail on warnings
*
* Read a JPEG-formatted stream from @fd into a VIPS image. Exactly as
* vips_jpegload(), but read from a descriptor. This operation does not seek()
* and therefore can read from a network socket.
*
* See also: vips_jpegload().
*
* Returns: 0 on success, -1 on error.
*/
int
vips_jpegload_fd( int descriptor, VipsImage **out, ... )

We should try hooking this up to sharp somehow and see how it behaves.
I thought of a problem with this: it can only load a single image from a socket. If you want to send a series of images, you'll have to close and reopen the connection. I suppose this might be OK in some situations, but probably not over a slow network connection. The reason is buffering. As you can see, it reads as many bytes as it can each time, up to the buffer capacity, so it can read the end of one image and the first part of the next. When image1 is complete, we will have unused bytes buffered which would have to be saved somehow to be used for the next image. It's unclear how this could be done neatly. vips does not support image streams (reading many images from one source) and adding it would be hard, so that's probably out. We'd probably need to add some sort of intermediate buffer object which supported "unread", i.e. allowed unused bytes to be pushed back into it. How does node tackle this sort of issue? Do they have a socket object that supports pushback?
Thanks @jcupitt, Node.js' net.Socket object is our friend here. I'll try to make some time over the next few days to experiment with the new load-from-stream branch.
Using a socket would make the streams API super easy - essentially you just expose the socket directly, or add methods that act as proxies to the socket's methods. But is that really the right thing to do? The normative way, as I understand it, is to use C bindings (http://nodejs.org/api/addons.html) and then wrap a stream interface around the exposed parts of the underlying C lib. Using a socket will add unnecessary overhead IMO.
Sorry, on mobile I didn't see @jcupitt's comment, so what I said may or may not be relevant. Btw, pushback can be done when creating streams using the Stream module, but I don't know if that fits what we're talking about.
@jcupitt If libvips provided an API that accepted partial chunks of image data, then that would fit with @davidmarkclements' suggestion about managing the Streams in JavaScript. As soon as the chunk(s) that comprise the image header data have arrived, processing can begin, but a thread(s) may have to block waiting for further chunks of image data. This is essentially the approach taken by the Node.js wrapper around LAME, which takes advantage of the underlying library's chunk-aware methods such as lame_encode_buffer. Sockets and Streams both involve buffering data somewhere; something will block on a lack of data :) Shall we trial the Sockets method first, as that currently feels like the quickest way to a proof-of-concept?
I was thinking about this again. It would work if you're OK only being able to read a single image off a socket. That might be useful in some situations. The problem is that vips (or rather the libjpeg read system) HAS to have a buffer, and therefore you can't send more than one image down a socket, since the end of the first image will eat the first part of the second. We need a reusable intermediate read object to handle the buffering, and we need to be able to hand the libjpeg read system pointers into it. Therefore I think the buffer has to be a new vips object. In C, the API might look like:

int descriptor;
VipsInputStream *stream = vips_input_stream_new( descriptor );
for( i = 0; i < 10; i++ ) {
VipsImage *x;
if( vips_jpegload_stream( stream, &x, NULL ) )
return( -1 );
... do something with image x
g_object_unref( x );
}
g_object_unref( stream );

So I think we just need to get a file descriptor from the node.js stream.
After playing with Node.js Streams for the last couple of evenings I now believe the best approach is to avoid file descriptors and instead have sharp extend the stream.Transform class, buffering data in JavaScript-land first. This should then provide support for multiple images passing through the same pipe. I've got some sample code working locally that needs tuning, documenting and benchmarking, but the early signs are that we should be able to support JPEG, PNG and WebP (but not TIFF) as both input and output streams without any modification to libvips. Hooray!
YES! That's exactly the right approach.
In terms of buffering the data in JS land, it would be great to ensure the transform stream honours the highWaterMark option (i.e. the capacity of the buffer) - that way a dev can control the trade-off between speed and memory usage. That's one of the problems with GraphicsMagick/ImageMagick wrappers: lack of control over memory usage leads to crashing on under-resourced servers. A stream with a configurable highWaterMark, by contrast, allows the dev to operate on lower resources with slower delivery time - and vice versa, if they've plenty to spare they can up the watermark number (as long as the CPU can keep up). If there's any way I can help let me know (peer review?)
This sounds great, but I don't really understand how it can work :-( but please prove me wrong, of course. libjpeg's read system just has a callback for "I need more data"; it does not tell you how much more data it would like. So you always read 4096 bytes (or whatever) and hand over that buffer. After it has finished reading an image, there will be some unused bytes left in libjpeg's input buffer. These unused bytes will contain the beginning of the next image (perhaps part of a MIME header?). They can't be thrown away; they somehow have to be reused. They need to be extracted from libjpeg and given to the input system of whatever image reader you need next. So ... I still think we'll need a VipsInputStream-style object.
libpng's input system works in the same way, it asks for more data but won't tell you how much. libtiff needs to be able to seek on the input, so it can't read from a stream, unfortunately. webp, again, just asks for more data. |
In summary, we need to buffer image data somewhere, be that within the underlying decompression libraries themselves, in libvips or in sharp. The memory usage problems of *magick are usually due to them holding the entire uncompressed image in memory at once. If we buffer compressed image data only, that's a big improvement over existing solutions. What I'm suggesting is that, to start with, we have sharp manage Buffer concatenation of compressed image data as chunks arrive (I may move memory management to the C++ part of sharp if Node.js Buffer performance is poor). Then, when libvips provides support for John's suggested VipsInputStream, sharp can hand chunks over as they arrive. The external API of sharp will remain the same and can slowly get a bit more efficient over time, by pushing the responsibility for buffering upstream. Here's a usage example of how elegant the Stream-based API of sharp should hopefully be, in this case resizing to 100px wide whilst maintaining the same format.
This unlocks the ability to create memory-efficient command line tools that pipe/transform image data from stdin to stdout. |
Ah, I see. Sure, sounds great. This is using https://github.com/jcupitt/libvips/blob/master/libvips/iofuncs/image.c#L1916 - I'll add a quick hack of the input stream stuff; you should be able to subclass it. Edit: if you've not come across it, there's a write-to-buffer operation too you could maybe use for the output side: https://github.com/jcupitt/libvips/blob/master/libvips/iofuncs/image.c#L2205
@lovell I love how you've totally grokked the awesomeness. So. Something else we need to consider is handling backpressure. Since it's a transform stream, it's going to need to do two things in order to play nice within a stream chain (and indeed within the larger OS environment).
@davidmarkclements I think your 1 and 2 should happen automatically. They do on Linux at least.
@jcupitt But if the stream is being piped to another stream (say the other stream is a slow HTTP connection, for instance), that stream may need to apply back pressure to the sharp stream. As for the writing side, we're talking about a source Node stream writing to the sharp Node stream.
So say the read stream from disk is coming in too fast - I don't see how the OS can suspend the stream in this case. Further, the input stream doesn't even need to be a file; it could be, for instance, another kind of stream entirely.
Each stream implementation in Node has to manage and handle backpressure itself.
Given initially we'll be placing input data into a memory buffer, I think we can rely on the OS for read-side back-pressure. For write-side, sharp will have to manage any back-pressure applied to it by the OS. I've got a couple of ideas here, mostly involving reducing the output buffer data into exponentially smaller chunks. The rather handy brake module can be used to emulate back-pressure to test this.
Ah, I thought these were network sockets and local pipes. You're right - if you want it to work for JS software streams you'd need a bit more logic. This could go in the …
I think I've worked out how to correctly inherit from stream.Duplex with some minor API breakage. Currently writing lots of tests.
Hey @lovell, sounds awesome. Just to confirm, in case there's anything else that needs hooking up: the …
Yes, the pipeline will be triggered lazily only when …
I've updated the … Any final thoughts from @sebpiq, @jonathanong and @davidmarkclements before this hits master?
Thanks to everyone involved for all the help getting this far.
hey man, that's awesome - sorry for lack of final thoughts
Closing this feature request - thank you everyone for all the suggestions and feedback - I look forward to hearing all about your use of Streams. |
Hello, I've spent a few days implementing experimental support for true streaming. It seems to work. Any testing or feedback is very welcome.
Does the whole buffer have to load into runtime memory, or is it possible with the underlying C library to implement a true streaming API, so that a file can be streamed from disk, through a sharp stream, to a network socket or file write stream or whatever else?
ref: http://nodejs.org/api/stream.html
Neither GraphicsMagick nor ImageMagick wrappers are capable of streams - there's one module (gm, I think) that has a stream API, but it's faked somewhat; the module still has to load the full image into memory regardless. So a true streaming interface for sharp would be a massive USP.