Fix performance regressions and float parsing #74
Conversation
force-pushed 6a9aa57 → 3a9e3ec
Alright, getting closer. Still about a factor 5 slower than on julia 0.6, but I still have a bunch of things that I know need fixing. TODO list for myself:
force-pushed 3a9e3ec → bfef725
Still not done, but this PR now also fixes #30.
force-pushed 0e5f508 → b2dac5c
force-pushed b2dac5c → 52b39d4
force-pushed 52b39d4 → 769a999
@sashi I think this is ready to be reviewed and then merged. High-level summary: this makes TextParse fast again (currently it really is unusably slow on Julia 1.0) and it fixes the float parsing issue described in #30. It might be easiest to review this commit by commit.

The first two commits add a micro benchmark suite. Broadly speaking, this PR improves almost all of those micro benchmarks to numbers that are better than what we had on Julia 0.6. Relative to the currently released version of TextParse, the improvements are on the order of a factor of 100–500 or something crazy like that.

The next commit adds branches that check for `nothing` after calls to `iterate`. The next commit uses […]. The next commit fixes the float parsing bug. I used the code here as a starting point, fixed some bugs, adapted it to the structure here, etc. I added that to this PR because I didn't want to fix performance problems in a routine that had to be replaced sooner rather than later in any case. @simonbyrne, if you could look over that part it would be great. It all seems very solid to me, and it is fast, but you know much more about floating point stuff than I ever will :)

The final commit adds a number of methods that have faster implementations for UTF8 strings, i.e. when the input is of type […].

I also compared overall performance of this new TextParse.jl with some typical other CSV readers, and it is in great shape again. Essentially R's fread still beats it in most cases, but other than that it is very competitive and faster in almost all situations than the other competing packages, except for very small files. The benchmark I ran was 20 columns of random data for the uniform files, and the mixed files had one column for each data type. Here are the results:
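The `iterate`-related fix can be illustrated with a small sketch (hypothetical code, not TextParse internals): on Julia 1.0, `iterate` on a string returns `Union{Nothing, Tuple{Char, Int}}`, and an explicit `!== nothing` branch lets the compiler union-split that result into type-stable code.

```julia
# Hypothetical example, not TextParse code: manual iteration over a
# string with an explicit branch on `nothing`. The `y !== nothing`
# check lets the compiler split Union{Nothing, Tuple{Char,Int}} into
# two concrete cases instead of handling a boxed union in the loop.
function count_chars(s::String)
    n = 0
    y = iterate(s)
    while y !== nothing
        _, state = y
        n += 1
        y = iterate(s, state)
    end
    return n
end
```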
benchmark/benchmarks.jl
Outdated
SUITE["util"]["tryparsenext"]["Percentage"] = @benchmarkable TextParse.tryparsenext(TextParse.Percentage(), $percentagestring, 1, $percentagestringlen, TextParse.default_opts)
SUITE["util"]["tryparsenext"]["StringToken"] = @benchmarkable TextParse.tryparsenext(TextParse.StringToken(String), $somestring, 1, $somestringlen, TextParse.default_opts)
SUITE["util"]["tryparsenext"]["DateTimeToken"] = @benchmarkable TextParse.tryparsenext(tok, $datetimestr, 1, $datetimestrlen, opts)
SUITE["util"]["tryparsenext"]["QuotedStringToken"] = @benchmarkable TextParse.tryparsenext(Quoted(String, quotechar='"', escapechar='"'), somequotedstring)
Awesome!

Do you need `$(Quoted(String, quotechar='"', escapechar='"'))` here, so as to not time its construction?
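For reference, the suggested change would look something like this (reusing `SUITE` and `somequotedstring` from the benchmark file above, so this fragment is not self-contained): with `$(...)`, the token is constructed once at setup time rather than inside the timed expression.

```julia
# `$(Quoted(...))` interpolates the already-constructed token into the
# benchmark, so @benchmarkable times only tryparsenext, not the call
# to the Quoted constructor. `somequotedstring` is assumed defined as
# in the benchmark file above.
SUITE["util"]["tryparsenext"]["QuotedStringToken"] =
    @benchmarkable TextParse.tryparsenext(
        $(Quoted(String, quotechar='"', escapechar='"')), $somequotedstring)
```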
end
@label done
return R(convert(F, sign*(x+f))), i
return R(convert(F, f)), i

@label error
return R(), i
I probably won't be able to review this meaningfully. @simonbyrne if you can review, that'd be neat! Otherwise, @davidanthoff please merge this PR in a day or two.
could you share how the benchmarks were generated? I'm planning on working some more on CSV.jl performance this week.
@label error
return R(), i
end
I love this file, there's something nice about code like this. :-)
Apart from the one minor comment, I have nothing to add here. I've invited you as a collaborator; please accept it and merge the PR when you see fit. Thanks for this heroic feat!
Sure: https://github.com/davidanthoff/csv-comparison. I think there are still some hard-coded paths in there :) It also is a beast to run... I have comparison results from a Mac somewhere as well; it was broadly a similar story. The bars in the graphs always show the minimum runtime, and each tick is one sample (I ran everything five times). More test cases, corrections or any other PRs for that repo are very welcome! It might be nice to keep it up to date as a performance comparison tool.
Is there any way we can get the best of both worlds? I.e. support DataStreams, DataValue, Union-based missings, and the best-performing code from both packages in one? What's the main blocker for that happening?
My main, very strong preference is that we don't do any big refactorings, rearchitectures or design changes in TextParse.jl. I think it is excellent the way it is. From my (or Queryverse's) point of view, stability in the packages that we depend on is really a major benefit/objective, and TextParse.jl has been fantastic in that (and lots of other) aspects in the last years. Small, incremental PRs that improve things here and there (like this PR) would of course be fantastic, but other than that I would prefer not to make big changes, in particular because I don't see any significant problems with the current design. I do plan to add support for returning […].

In terms of integration with Tables.jl/TableTraits.jl/DataStreams.jl, here is my view: first, DataStreams.jl is no more, as far as I understand it; I think Tables.jl is really the next version of that. My preference is that a package like TextParse.jl does not provide integration with either TableTraits.jl or Tables.jl, but instead stays a low-level parsing package that does not hook into any of these ecosystems. For TableTraits.jl/Queryverse, the integration with everything else is in CSVFiles.jl (which has a dependency on TextParse.jl), and that kind of design (a low-level parsing package with a user-facing high-level package that integrates with the ecosystem) has worked extremely well in the last couple of years. I believe @quinnj is not a fan of that design, though :)

So that leaves the question whether CSV.jl and TextParse.jl should be unified in some form. In my mind it is excellent that CSV.jl is trying out new things in this space. The core designs and approaches in TextParse.jl and CSV.jl seem distinct enough to me that it makes sense to keep both around, improve them separately and see how things turn out.

From my point of view, I'm neither convinced enough of CSV.jl's new approach to be willing to give up on the TextParse.jl approach, nor am I convinced that the approach in TextParse.jl is so clearly superior that I wouldn't want to see how the design in CSV.jl fares. I think this is a situation where it would be premature to bet on one horse; sometimes pursuing multiple options simultaneously, and thereby hedging to some extent, is the right move, and I think that is the case here.

Having said that, I think there is ample opportunity to learn from each other, share certain things, etc. Some examples that come to mind: it would be great to have a shared repo with "tough" CSV files (or code that creates tough CSV files) that can serve as a testbed. Similarly with specific algorithms: if the float parsing in this PR is fine, there might be some ideas that can just be copied directly to Parsers.jl (my sense is that parsing of long floats is really the major thing that slows CSV.jl down right now). I think the performance test repo that I created is also a good candidate for something that we jointly keep updated.

If folks disagree with what I wrote here and want to discuss it, we should probably move the discussion somewhere else. Right now I mostly would like to just merge this PR so that things are functional again :)
end
end
return Float64(x)
end
@simonbyrne it would be great if you could indicate whether you will be able to review this function in particular in the near future. Totally understood if not, but in that case I think I would just merge the PR, given that it passes all tests, and passes more parsing tests than the current master version. Even if there is something wrong with this code, I think it would still be better than the status quo. BUT a review from you would of course be really great!

And if you have too much time (haha), you could also take a quick look at the function that calls this one; I think at that point you would have reviewed all there is to this float parsing.
I had a bit of a look over it: although it looks like it will help a bit, it won't be able to completely solve the problem (there are some cases which are beyond double-double).

It will also likely be much slower: ideally you only want to utilise the excess precision when you need to. Also, it probably makes sense to handle the powers of 10 via a lookup table (possibly two separate ones: one where 10.0^k is exact, and another in higher precision).
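The lookup-table idea could look roughly like this (a sketch with assumed names, not code from either package). `10.0^k` is exactly representable as a `Float64` for `k = 0:22`, since `10^k = 2^k * 5^k` and `5^22 < 2^53`; entries beyond that range would need the higher-precision table.

```julia
# Exact powers of ten: 5^22 < 2^53, so these 23 entries are all
# exactly representable Float64 values.
const POW10_EXACT = Float64[10.0^k for k in 0:22]

# Returns the exact Float64 power, or `nothing` if the caller must
# fall back to a higher-precision (e.g. double-double) table.
pow10_exact(k::Integer) = 0 <= k <= 22 ? POW10_EXACT[k + 1] : nothing
```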
> although it looks like it will help a bit

Well, the issue previously was that if you just generated a bunch of random numbers, wrote them out to a string (with the default number of digits in Julia's write methods) and then read them back in, about 10% of those were read incorrectly. In my mind that was a large fraction... This PR seems to mostly solve that; at least for the random number streams I tested, this goes to 0 errors.
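The round-trip check described here is easy to reproduce in a few lines (a sketch using Base's `parse` as a stand-in; the actual test would call TextParse's `tryparsenext` on the string instead):

```julia
# Count how many random Float64s fail to survive a write/read round
# trip. Julia's default printing produces a shortest round-trip
# representation, so a correctly rounded parser should report 0 errors.
function roundtrip_errors(parsefn, n::Integer)
    errors = 0
    for _ in 1:n
        x = rand() * 10.0^rand(-100:100)
        errors += parsefn(string(x)) != x
    end
    return errors
end

roundtrip_errors(s -> parse(Float64, s), 10_000)  # expected to be 0
```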
> it won't be able to completely solve the problem
Thanks, that is good to know. So my take is that we now go from reading about 10% incorrectly to a number of tricky corner cases. We should try to fix them as well, but for now I'm happy enough with the improvements here :)
I guess the main question here is whether there is an easy way to detect situations where even the double double code path is not good enough and we need to switch to something else for those corner cases...
> It will also likely be much slower: ideally you only want to utilise excess precision only when you need to.
So the double double code path is only taken if needed; otherwise a faster code path runs (see here). But even the double double case shows good performance: if you look at the aggregate performance for the `uniform_data_float64.csv` file in the benchmark I posted, performance in this PR (using the double double code path) is actually better than in TextParse.jl on Julia 0.6 (which didn't use any extended precision at all). Relative to all the other packages, only R's fread beats it (but that beats us in almost all cases).
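For readers following along, the building blocks underneath a double-double slow path are the standard error-free transforms (textbook Knuth/Dekker constructions, shown here for illustration; the PR's actual routine is more involved):

```julia
# two_sum: s + e == a + b exactly, with s = fl(a + b) (Knuth's
# branch-free version; no assumption on |a| vs |b|).
function two_sum(a::Float64, b::Float64)
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e
end

# two_prod: p + e == a * b exactly, using fused multiply-add to
# recover the rounding error of the product.
function two_prod(a::Float64, b::Float64)
    p = a * b
    e = fma(a, b, -p)
    return p, e
end
```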
> Also it probably makes sense to handle the powers of 10 via a lookup table (possible two separate ones: one where 10.0^k is exact, and another in higher precision).
I played around with that at some point and didn't see any improvements. I didn't do it super carefully, might well have been that I got something wrong and it would have helped if done more carefully.
> So the double double code path is only taken if needed, otherwise a faster code path runs
Ah, I missed how that worked. In that case, it is certainly a net improvement.
I agree with most of that. I'm not after any big redesigns here; I just want to solve two problems: (1) having 2+ CSV packages is a bad user experience because people don't know which to use, and (2) we need to pool our efforts in one place. Sometimes TextParse performance edges ahead, and sometimes CSV.jl's does. There are also cases where we need to pick a default, e.g. if you go through the FileIO interface. So I believe we would be better served by a single, monotonically improving codebase. If nothing else, the names CSV.jl and CSVFiles.jl are confusing; they sound like exactly the same thing. 💯 to moving common low-level routines to Parsers.jl, and to the CSV testbed repo idea.
@JeffBezanson just some quick reaction: I don't share your worry about having multiple user facing CSV packages. We have multiple table packages as well, right? In general, I think if there are different designs on the table, it is better to have multiple packages that try out things rather than spend a lot of time and energy in settling on one or another approach. I think I broadly always cheer if I see multiple attacks on a similar problem, I think it is a sign of a healthy ecosystem. I think user confusion can be solved via documentation or meta packages. For example, I think the Queryverse file IO story is pretty simple at this point (see here).
Actually, that was not what I meant. In my mind the approaches in TextParse.jl and Parsers.jl are so different that settling on one or the other would amount to a major rewrite of either TextParse.jl or CSV.jl. I meant that we can copy ideas and algorithms from each other and help out, but not try to settle on one code base. Maybe down the road, if it becomes clear that one approach is better than the other, but I don't think that is the case right now.
I guess I am a bit surprised that there are approaches to e.g. parsing floats that are so different it is impossible to share code, even when using the same algorithm. At the end of the day, something has to read a sequence of bytes and return a float. But I guess it could be that, at least in the short term, the effort of porting ideas over is lower than the effort of integrating the codebases. I'm just frustrated that if I use e.g. FileIO or JuliaDB, next week CSV.jl might get much faster CSV parsing and I won't benefit from it. Do we expect, for example, that for the foreseeable future some files will parse faster with one package than with another? If so, do we just need to work on interfaces to make it easy for the user to specify which parser to use?
Feel free to use the float parsing tests I have here; I spent a good 2-3 weeks last summer writing the float parser in Parsers.jl, and it's surprisingly hard to get to 100% accuracy while maintaining performance. The Parsers.jl implementation is accurate but currently slow for full-precision floats due to the use of BigInt/BigFloat. @simonbyrne and I have some ideas to work around that, but just haven't gotten around to doing the work yet.
I think the main difference right now is that TextParse.jl takes its input as …
Cool, they look super useful!
Is the difficult question how to detect when the double double approach used in this PR is no longer good enough and one has to go to the BigInt/BigFloat approach? My guess is that this would be rare, so if one used double double for most full-precision floats (like this PR does), one would get great performance in most cases, and in the few cases where even that is not good enough one could switch over to a third, even slower path?
Yeah, the tests in Parsers.jl are fairly extensive at this point; I basically went to the Rust, Go, and Boost native implementations and took all the test cases from there, plus a bunch from Stack Overflow questions where people mentioned seeing float parsing bugs.
Fixes #73. Well, hopefully, at some point.
This is still very WIP. For now it just tries to properly add branches after calls to `iterate`, plus a number of other tweaks. That fixes the most egregious performance issues, but we are still quite far from what we have on Julia 0.6 for this package.

I've added a whole bunch of micro benchmarks, and will add more as I try to fix more issues. This shows how performance degraded from the Julia 0.6 version to the current version on `master`. This shows how performance changes from the Julia 0.6 version to this PR. The short story is that we go from regressions on the order of a factor of a couple hundred to some improvements and something below a factor 10 regression. This is only true for these micro benchmarks; the story for an overall `readcsv` is still not good.

I think I essentially know what the next step is: the current model of returning `nothing, i` or `Some(x), i` from `tryparsenext` seems to be really slow. I went back to `Nullable` in one specific case and the performance for that part went back to the Julia 0.6 level. So unless someone has a better idea, I'm probably just going to go back to `Nullable` for `tryparsenext`.

Oh, and one more thing: I'm going to rebase this branch regularly, just be warned :)
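For context, the two return conventions being compared look schematically like this (illustrative only; `tryparsenext_digit` is a made-up name, not the TextParse API, and `Nullable` lives in the Nullables.jl package on Julia 1.0):

```julia
# Union-based convention: return (Union{Nothing, Some{T}}, next_index).
function tryparsenext_digit(str::String, i::Int)
    y = iterate(str, i)
    y === nothing && return nothing, i
    c, ii = y
    '0' <= c <= '9' || return nothing, i
    return Some(Int(c - '0')), ii
end

# Nullable-based convention (Julia 0.6 style) would instead return
# (Nullable{T}, next_index):
#     return Nullable{Int}(), i          # on failure
#     return Nullable(Int(c - '0')), ii  # on success
```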