This repository was archived by the owner on Aug 28, 2019. It is now read-only.

Observable is not appropriate for directory listing; async iterable would be better #4

Open
domenic opened this issue May 29, 2015 · 14 comments

Comments

@domenic

domenic commented May 29, 2015

https://github.com/zenparsing/async-iteration/ is where async iterable is proposed.

The difference is that with an observable the creator of the observable defines the pace at which "events" are produced (push), whereas with an async iterable, the consumer defines the pace at which work is performed. The latter is a much better match for filesystem APIs, e.g. Windows's FindFirstFileEx/FindNextFile/FindClose, or POSIX opendir/readdir. In these cases the async iterable's .next() would map pretty directly to FindFirstFileEx/FindNextFile and readdir.
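The pull-based shape can be sketched as an async iterable whose `next()` performs exactly one readdir-style call. This is illustrative only: `fakeReaddir` and `makeDirectoryLister` are hypothetical stand-ins, not any proposed API.

```javascript
// Sketch: a pull-based directory listing. `fakeReaddir` stands in for
// an OS call like POSIX readdir() -- it returns the next entry (or
// null at the end), and is only invoked when the consumer asks.
function makeDirectoryLister(entries) {
  let i = 0;
  let calls = 0; // how many "readdir" calls have actually happened
  const fakeReaddir = async () => {
    calls++;
    return i < entries.length ? entries[i++] : null;
  };
  return {
    callCount: () => calls,
    [Symbol.asyncIterator]() {
      return {
        async next() { // one next() maps onto one readdir-style call
          const entry = await fakeReaddir();
          return entry === null
            ? { done: true, value: undefined }
            : { done: false, value: entry };
        },
      };
    },
  };
}

// The consumer sets the pace: if it never iterates, no I/O happens.
async function demo() {
  const lister = makeDirectoryLister(['a.txt', 'b.txt', 'c.txt']);
  const seen = [];
  for await (const name of lister) seen.push(name);
  return seen;
}
```

Note that creating the lister performs no "I/O" at all; the first readdir-style call happens only on the first pull.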

/cc @zenparsing

@arunranga
Contributor

Correct me if I'm totally wrong on this, but it honestly does seem like both can be used for the primary use cases:

  1. Enumerate ONLY files, deep through the containment hierarchy, "discarding" directory results (e.g. enumerateDeep()).
  2. Do something useful only upon encountering certain kinds of results (e.g., only *.jpeg or *.txt; only binary data, only texture data, etc.) regardless of containment hierarchy, but determine the total volume of results.

"Observable" lent itself to filtering and mapping. I think the pace at which work is performed is dependent on a few things. Then there's backpressure too, which I'm not 100% clear on :-(

But async iterables look totally usable.

@domenic
Author

domenic commented May 29, 2015

> Correct me if I'm totally wrong on this, but it honestly does seem like both can be used for the primary use cases:

Both can be used, for sure. However, async iterables map more directly (in the sense of giving more control) than observables do. The async iterable to observable transformation is lossy, in the sense that to create an observable from an async iterable you have to pump the async iterable in a loop in order to "emit events" (i.e., call next() on the passed generator object), instead of waiting for the consumer to pull at the pace it desires.
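The lossy direction can be sketched as a small adapter that pumps an async iterable into a push-style subscriber. The `subscribe`/`unsubscribe` names here are illustrative, matching the general observable shape rather than any specific spec text.

```javascript
// Sketch: turning an async iterable into a push-style observable.
// The pump loop drains the source at the source's own pace -- the
// subscriber can no longer slow production down, which is the loss
// described above.
function fromAsyncIterable(asyncIterable) {
  return {
    subscribe(observer) {
      let cancelled = false;
      (async () => {
        try {
          for await (const value of asyncIterable) {
            if (cancelled) return;
            observer.next(value); // pushed as fast as the source yields
          }
          if (!cancelled) observer.complete();
        } catch (err) {
          if (!cancelled) observer.error(err);
        }
      })();
      return { unsubscribe() { cancelled = true; } };
    },
  };
}
```

Going the other way (observable to async iterable) requires buffering pushed values, which is exactly where unbounded memory use can creep in.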

> "Observable" lent itself to filtering and mapping.

As proposed to TC39, neither async iterable nor observable has filtering or mapping. But it is equally trivial to add filtering and mapping to either. (I kind of doubt either will happen before we add filtering and mapping to sync iterables though, like maps and sets.)

> I think the pace at which work is performed is dependent on a few things.

Unfortunately no. An observable is just a wrapper for a function f({ next, return, throw }), so that when you call observable.subscribe(g), it calls f(g). Thus it's totally up to the implementation of f when next() is called. An observable for directory listings would thus call readdir in a loop, continuously calling next() when the readdir call finishes.
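That shape can be shown in a few lines. This is a minimal sketch of the idea described above, not the TC39 spelling; `makeObservable` and `directoryObservable` are hypothetical names, and `entries` stands in for the results of successive readdir calls.

```javascript
// Sketch: an observable is essentially a wrapper around a function
// that receives the observer. The *producer* decides when next() fires.
function makeObservable(producer) {
  return { subscribe(observer) { return producer(observer); } };
}

// A directory-listing observable loops over its "readdir" results
// itself, pushing entries whether or not the consumer is ready.
function directoryObservable(entries) {
  return makeObservable(observer => {
    for (const e of entries) observer.next(e); // producer-paced
    observer.complete();
  });
}
```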

> Then there's backpressure too, which I'm not 100% clear on :-(

An async iterable, on the other hand, consists of an object with a promise-returning next() method. The pace at which work is performed on an async iterable can thus be determined by how fast the next() method is called---for example, if it is never called, then there is no need to call the next readdir. This kind of matching of consumer speed to producer speed, in combination with how it propagates through transforms (like map and filter), is backpressure.
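How backpressure propagates through a transform can be sketched with an async-generator `map`: because the mapped iterable is itself pulled, each `next()` on it triggers exactly one pull on the source. The `mapAsync` and `countingSource` names are illustrative.

```javascript
// Sketch: map over an async iterable. Consumer pace propagates
// upstream -- one next() on the mapped iterable causes exactly one
// next() on the source.
async function* mapAsync(source, fn) {
  for await (const value of source) {
    yield fn(value);
  }
}

// A source that counts its pulls, standing in for readdir calls.
async function* countingSource(n, counter) {
  for (let i = 0; i < n; i++) {
    counter.pulls++; // one "readdir" per pull
    yield i;
  }
}
```

If the consumer stops calling `next()` on the mapped iterable, the source simply stops being pulled; no buffering is needed anywhere in the chain.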

My main concern in opening this issue is to ensure that observable is not misused in places where it is a bad fit. It makes sense for events like "click", where the producer drives the production of events ("push"), but it doesn't make sense for things like directory listings, where the consumer drives things ("pull"). I know it is the most-evangelized asynchronous-plural type among the trio of observable, async iterable, and behavior, but I wanted to make sure the right tool is used for the job here instead of the most-evangelized one :)

@jhusain

jhusain commented May 30, 2015

Evangelizer-in-chief here. ;)

It's true that an asynchronous iterator gives the consumer control (backpressure). However, it's not clear to me what the motivating use cases are for that control. It is important to spell that out, because backpressure support will come at a hefty performance cost given the fine granularity of the proposed API.

Let's say you did a deep enumeration and saw 1000 files (not unrealistic). You would get a minimum of 1000 promise allocations. A single map operation would double that to 2000. And if you were calling next() again before the previous promise resolved, you would also grow the job queue as notifications queued up.

By comparison, an Observable would involve only three allocations at a minimum: the observer, the subscription, and the Observable itself. A map operation would introduce another three allocations, plus one for the map closure argument. In other words, the number of allocations would not grow with the number of files being processed.

Under the circumstances it seems worth enumerating the motivating use cases for backpressure here. If there are compelling use cases, an asynchronous iterator might be made more efficient by making the granularity of the API more coarse. (AsyncIterator of Promise of Array of File)
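That coarser granularity can be sketched as an async iterable that yields arrays of entries, so a walk of 1000 files costs roughly 1000/batchSize promises instead of 1000. The `chunked` helper and batch size are illustrative, not part of any proposal.

```javascript
// Sketch: a chunked async iterable. Each next() resolves with up to
// `batchSize` entries, amortizing one promise over a whole batch.
async function* chunked(source, batchSize) {
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```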

@domenic
Author

domenic commented May 30, 2015

We did a pretty hefty real-world performance comparison for this sort of stuff in streams and found that the cost of promise allocations was absolutely trivial (nanoseconds, in aggregate). The actual I/O is where all the work happens.

The motivation for backpressure here is quite simply to avoid the extra I/O---which is where the real cost lies. Your program will be much slower if it is constantly being forced to perform readdirs at inconvenient times than if it is performing readdirs (and, incidentally, allocating promises) on demand.

A simple example would be that with an async iterator you could notice that you've used up 4 ms of your 16 ms frame budget on readdirs, and decide to stop asking for results until next frame. (Well, the rendering thread actually gets hit by the IPC cost of getting it from the I/O process into the render process, but you know what I mean.) Whereas with an observable, this isn't under your control.
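That frame-budget pattern can be sketched as a consumer that checks elapsed time after each pull and defers the rest of the walk. The function names are hypothetical; `setTimeout` stands in for `requestAnimationFrame`, and the 4 ms threshold just mirrors the figure above.

```javascript
// Sketch: pull directory entries until a time budget is spent, then
// stop and resume on the next "frame". Because the source is pulled,
// no entries are produced while we are paused.
async function drainWithBudget(iterator, budgetMs, onEntry) {
  const start = Date.now();
  while (Date.now() - start < budgetMs) {
    const { done, value } = await iterator.next();
    if (done) return true;  // walk finished
    onEntry(value);
  }
  return false;             // budget spent; caller reschedules
}

async function walkAcrossFrames(iterable, budgetMs, onEntry) {
  const it = iterable[Symbol.asyncIterator]();
  while (!(await drainWithBudget(it, budgetMs, onEntry))) {
    await new Promise(r => setTimeout(r, 0)); // yield to the next frame
  }
}
```

With an observable source there is no equivalent knob: entries keep arriving whether or not the frame budget is exhausted, and the best the consumer can do is buffer them.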

@domenic
Author

domenic commented May 30, 2015

In general, making performance arguments without data has led down very dark paths. Talking in the abstract about allocations is quite disingenuous given generational GCs and the like. Calling a function versus allocating a short-lived object and calling its method are in the abstract very different amounts of work, but in the real world very similar ones. So in particular claims like

> If there are compelling use cases, an asynchronous iterator might be made more efficient if we made the granularity of the API more coarse. (AsyncIterator of Promise of Array of File)

are much better phrased as "I would be interested in benchmarking file-wise async iterator versus array-of-files async iterator." (Although that particular idea doesn't really make sense in terms of how it fails to map to readdir etc.; the added indirection loses the very control and reactivity we seek to gain.)

@jhusain

jhusain commented May 30, 2015

Inclined to agree I/O will be the likely bottleneck here. Also agree that we should rely on performance data. Probably worth measuring perf in our use case rather than extrapolating from data collected in different use cases. Should be easy to confirm that all those Promise allocations are indeed chewed up efficiently by V8 during a directory walk. Otherwise GCs could detract from framerate as well. Do you think a directory walk in Node with AsyncIterator vs. Observable would be a reasonable test?

@domenic
Author

domenic commented Jun 1, 2015

> Do you think a directory walk in Node with AsyncIterator vs. Observable would be a reasonable test?

Unfortunately Node doesn't expose the necessary primitives here---it only does array-returning directory traversal, and doesn't give you access to readdir and friends. So testing this would require a native addon.

I will open a libuv issue to see if we can get those in libuv at least.

@arunranga
Contributor

> I will open a libuv issue to see if we can get those in libuv at least.

D: if these issue(s) are set up, I'd love a pointer to them. I've trawled libuv/libuv and come up empty.

We're going to need an interim API in the places where Observable is used until platform considerations stabilize. I think I'm coming around to the idea that AsyncIterator might be more useful here. It might be confusing if the platform had both, since the specific messaging would be drowned out by "this makes asynchronous programming better and easier."

In particular, here's the MSFT proposal, in which directory iteration returns a promise for a sequence-parameterized type:
https://microsoftedge.github.io/directory-upload/proposal.html#faq

I'm going to rewrite the API a bit in Directory (and I'll file a separate issue for that, using this issue as the driver) so that we can have an interim API. Doing that closes this issue for now.

@domenic
Author

domenic commented Jul 16, 2015

@arunranga
Contributor

THANK you.

@jhusain

jhusain commented Jul 31, 2015

Pinging @wycats

@wycats

wycats commented Jul 31, 2015

I just want to make sure I understand the constraints here:

  1. You want the general benefits of a composable, first-class object representing the directory list. (all of the proposed options satisfy this requirement, as far as I can tell)
  2. You want the producer to stop producing values if the consumer isn't ready for more. A consumer can use this infrastructure to take values one at a time, or to cache the values until it reaches a maximum number of entries it's willing to cache.
  3. You want this control to be composable across the usual combinators. The consumer of directories.map(cb) should be able to influence the original producer.
  4. The producer has some way of pausing production that is more efficient than a userland buffer. This may involve allowing the kernel to buffer the data, asking an upstream producer to stop producing data, ceasing to make pull requests of an upstream producer, or a userland strategy that aggregates the buffers of multiple lists.

Did I miss anything?

@arunranga
Contributor

@wycats: in comparing my wishlist with your list above, I'd say 1, 2, and 4 are the main ones. I asked for 3 as a bonus, and perhaps made the mistake of assuming that composition came out of the box (Domenic corrects me above). It's a nice-to-have.

Till we get something better, we're using a promise for a one-and-done sequence-parameterized type for file/directory results.

@domenic
Author

domenic commented Jul 31, 2015

@wycats I don't really find reasoning from constraints that compelling in this case. Rather, I'd prefer a primitive that maps directly to the underlying OS operations, thus allowing maximum flexibility. You could even imagine an awkward API that does a direct translation (using some kind of "directory handle" JS object). That IMO would be better than any proposal that tries to fire events at you continuously. It's a nice fact that async iterator maps to the OS operations directly.
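The "awkward but direct" handle shape can be sketched in a few lines, along with the thin async-iterable layer over it. Everything here is hypothetical: `openDirectory`, `read`, `close`, and `iterate` are illustrative names mirroring opendir/readdir/closedir, not a real proposal.

```javascript
// Sketch: a hypothetical "directory handle" mirroring the OS calls
// one-for-one. `entries` stands in for actual filesystem state.
function openDirectory(entries) {        // stands in for opendir()
  let i = 0, closed = false;
  return {
    async read() {                       // stands in for readdir()
      if (closed) throw new Error('handle closed');
      return i < entries.length ? entries[i++] : null;
    },
    close() { closed = true; },          // stands in for closedir()
  };
}

// An async iterable is a thin, ergonomic layer over exactly this shape.
function iterate(handle) {
  return {
    [Symbol.asyncIterator]() {
      return {
        async next() {
          const e = await handle.read();
          if (e === null) { handle.close(); return { done: true, value: undefined }; }
          return { done: false, value: e };
        },
      };
    },
  };
}
```

The point is that nothing is lost in the translation: every `next()` is one `read()`, and finishing the iteration closes the handle, just as the raw OS sequence would.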
