Fix performance regressions and float parsing #74
Conversation
force-pushed 6a9aa57 → 3a9e3ec
Alright, getting closer. Still about a factor 5 slower than on julia 0.6, but I still have a bunch of things that I know need fixing. TODO list for myself:
force-pushed 3a9e3ec → bfef725
Still not done, but this PR now also fixes #30.
force-pushed 0e5f508 → b2dac5c
force-pushed b2dac5c → 52b39d4
force-pushed 52b39d4 → 769a999
@sashi I think this is ready to be reviewed and then merged. High-level summary: this makes TextParse fast again (currently it really is unusably slow on Julia 1.0) and it fixes the float parsing issue described in #30. It might be easiest to review this commit by commit.

The first two commits add a micro benchmark suite. Broadly speaking, this PR improves almost all of those micro benchmarks to numbers that are better than what we had on Julia 0.6. Relative to the currently released version of TextParse, the improvements are on the order of a factor of 100–500 or something crazy like that.

The next commit adds branches that check for `nothing` after calls to `iterate`. The next commit uses […]. The next commit fixes the float parsing bug. I used the code here as a starting point, fixed some bugs, adapted it to the structure here, etc. I added that to this PR because I didn't want to fix performance problems in a routine that had to be replaced sooner rather than later in any case. @simonbyrne, if you could look over that part it would be great. It all seems very solid to me, and it is fast, but you know much more about floating point stuff than I ever will :)

The final commit adds a number of methods that have faster implementations for UTF8 strings, i.e. when the input is of type […].

I also compared overall performance of this new TextParse.jl with some typical other CSV readers, and it is in great shape again. Essentially R's fread still beats it in most cases, but other than that it is very competitive and faster in almost all situations than the other competing packages, except for very small files. The benchmark I ran was 20 columns of random data for the uniform files, and the mixed files had one column for each data type. Here are the results:
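The `iterate`-related fix can be illustrated with a small sketch (hypothetical code, not TextParse internals): on Julia 1.0, `iterate` on a string returns `Union{Nothing, Tuple{Char, Int}}`, and an explicit `!== nothing` branch lets the compiler union-split that result into type-stable code.

```julia
# Hypothetical example, not TextParse code: manual iteration over a
# string with an explicit branch on `nothing`. The `y !== nothing`
# check lets the compiler split Union{Nothing, Tuple{Char,Int}} into
# two concrete cases instead of handling a boxed union in the loop.
function count_chars(s::String)
    n = 0
    y = iterate(s)
    while y !== nothing
        _, state = y
        n += 1
        y = iterate(s, state)
    end
    return n
end
```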
benchmark/benchmarks.jl
Outdated
SUITE["util"]["tryparsenext"]["Percentage"] = @benchmarkable TextParse.tryparsenext(TextParse.Percentage(), $percentagestring, 1, $percentagestringlen, TextParse.default_opts)
SUITE["util"]["tryparsenext"]["StringToken"] = @benchmarkable TextParse.tryparsenext(TextParse.StringToken(String), $somestring, 1, $somestringlen, TextParse.default_opts)
SUITE["util"]["tryparsenext"]["DateTimeToken"] = @benchmarkable TextParse.tryparsenext(tok, $datetimestr, 1, $datetimestrlen, opts)
SUITE["util"]["tryparsenext"]["QuotedStringToken"] = @benchmarkable TextParse.tryparsenext(Quoted(String, quotechar='"', escapechar='"'), somequotedstring)
Awesome!

Do you need `$(Quoted(String, quotechar='"', escapechar='"'))` here, so as to not time its construction?
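For reference, the suggested change would look something like this (reusing `SUITE` and `somequotedstring` from the benchmark file above, so this fragment is not self-contained): with `$(...)`, the token is constructed once at setup time rather than inside the timed expression.

```julia
# `$(Quoted(...))` interpolates the already-constructed token into the
# benchmark, so @benchmarkable times only tryparsenext, not the call
# to the Quoted constructor. `somequotedstring` is assumed defined as
# in the benchmark file above.
SUITE["util"]["tryparsenext"]["QuotedStringToken"] =
    @benchmarkable TextParse.tryparsenext(
        $(Quoted(String, quotechar='"', escapechar='"')), $somequotedstring)
```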
end
@label done
return R(convert(F, sign*(x+f))), i
return R(convert(F, f)), i

@label error
return R(), i
I probably won't be able to review this meaningfully. @simonbyrne if you can review, that'd be neat! Otherwise, @davidanthoff please merge this PR in a day or two.
could you share how the benchmarks were generated? I'm planning on working some more on CSV.jl performance this week.
@label error
return R(), i
end
I love this file, there's something nice about code like this. :-)
Apart from the one minor comment, I have nothing to add here. I've invited you as a collaborator; please accept it and merge the PR when you see fit. Thanks for this heroic feat!
Sure: https://github.com/davidanthoff/csv-comparison. I think there are still some hard-coded paths in there :) It also is a beast to run... I have comparison results from a Mac somewhere as well; it was broadly a similar story. The bars in the graphs always show the minimum runtime, and each tick is one sample (I ran everything five times). More test cases, corrections or any other PRs for that repo are very welcome! It might be nice to keep it up to date as a performance comparison tool.
Is there any way we can get the best of both worlds? I.e. support DataStreams, DataValue, Union-based missings, and the best-performing code from both packages in one? What's the main blocker for that happening?
My main, very strong preference is that we don't do any big refactorings, rearchitectures or design changes in TextParse.jl. I think it is excellent the way it is. From my (or Queryverse's) point of view, stability in the packages that we depend on is really a major benefit/objective, and TextParse.jl has been fantastic in that (and lots of other) aspects in the last years. Small, incremental PRs that improve things here and there (like this PR) would of course be fantastic, but other than that I would prefer not to make big changes, in particular because I don't see any significant problems with the current design. I do plan to add support for returning […].

In terms of integration with Tables.jl/TableTraits.jl/DataStreams.jl, here is my view: first, DataStreams.jl is no more, as far as I understand it; I think Tables.jl is really the next version of that. My preference is that a package like TextParse.jl does not provide integration with either TableTraits.jl or Tables.jl, but instead stays a low-level parsing package that does not hook into any of these ecosystems. For TableTraits.jl/Queryverse, the integration with everything else is in CSVFiles.jl (which has a dependency on TextParse.jl), and that kind of design (a low-level parsing package with a user-facing high-level package that integrates with the ecosystem) has worked extremely well in the last couple of years. I believe @quinnj is not a fan of that design, though :)

So that leaves the question whether CSV.jl and TextParse.jl should be unified in some form. In my mind it is excellent that CSV.jl is trying out new things in this space. The core designs and approaches in TextParse.jl and CSV.jl seem distinct enough to me that it makes sense to keep both around, improve them separately and see how things turn out.

From my point of view, I'm neither convinced enough of CSV.jl's new approach to be willing to give up on the TextParse.jl approach, nor am I convinced that the approach in TextParse.jl is so clearly superior that I wouldn't want to see how the design in CSV.jl fares. I think this is a situation where it would be premature to bet on one horse; sometimes pursuing multiple options simultaneously, and thereby hedging to some extent, is the right move, and I think that is the case here.

Having said that, I think there is ample opportunity to learn from each other, share certain things, etc. Some examples that come to mind: it would be great to have a shared repo with "tough" CSV files (or code that creates tough CSV files) that can serve as a testbed. Similarly with specific algorithms: if the float parsing in this PR is fine, there might be some ideas that can just be copied directly to Parsers.jl (my sense is that parsing of long floats is really the major thing that slows CSV.jl down right now). I think the performance test repo that I created is also a good candidate for something that we jointly keep updated.

If folks disagree with what I wrote here and want to discuss it, we should probably move the discussion somewhere else. Right now I mostly would like to just merge this PR so that things are functional again :)
end
end
return Float64(x)
end
@simonbyrne it would be great if you could indicate whether you will be able to review this function in particular in the near future. Totally understood if not, but in that case I think I would just merge the PR, given that it passes all tests, and passes more parsing tests than the current master version. Even if there is something wrong with this code, I think it would still be better than the status quo. BUT a review from you would of course be really great!

And if you have too much time (haha), you could also take a quick look at the function that calls this one; I think at that point you would have reviewed all there is to this float parsing.
I had a bit of a look over it: although it looks like it will help a bit, it won't be able to completely solve the problem (there are some cases which are beyond double-double).

It will also likely be much slower: ideally you only want to utilise the excess precision when you need to. Also, it probably makes sense to handle the powers of 10 via a lookup table (possibly two separate ones: one where 10.0^k is exact, and another in higher precision).
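The lookup-table idea could look roughly like this (a sketch with assumed names, not code from either package). `10.0^k` is exactly representable as a `Float64` for `k = 0:22`, since `10^k = 2^k * 5^k` and `5^22 < 2^53`; entries beyond that range would need the higher-precision table.

```julia
# Exact powers of ten: 5^22 < 2^53, so these 23 entries are all
# exactly representable Float64 values.
const POW10_EXACT = Float64[10.0^k for k in 0:22]

# Returns the exact Float64 power, or `nothing` if the caller must
# fall back to a higher-precision (e.g. double-double) table.
pow10_exact(k::Integer) = 0 <= k <= 22 ? POW10_EXACT[k + 1] : nothing
```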
> although it looks like it will help a bit

Well, the issue previously was that if you just generated a bunch of random numbers, wrote them out to a string (with the default number of digits in Julia's write methods) and then read them back in, about 10% of those were read incorrectly. In my mind that was a large fraction... This PR seems to mostly solve that; at least for the random number streams I tested, this goes to 0 errors.
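The round-trip check described here is easy to reproduce in a few lines (a sketch using Base's `parse` as a stand-in; the actual test would call TextParse's `tryparsenext` on the string instead):

```julia
# Count how many random Float64s fail to survive a write/read round
# trip. Julia's default printing produces a shortest round-trip
# representation, so a correctly rounded parser should report 0 errors.
function roundtrip_errors(parsefn, n::Integer)
    errors = 0
    for _ in 1:n
        x = rand() * 10.0^rand(-100:100)
        errors += parsefn(string(x)) != x
    end
    return errors
end

roundtrip_errors(s -> parse(Float64, s), 10_000)  # expected to be 0
```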
> it won't be able to completely solve the problem
Thanks, that is good to know. So my take is that we now go from reading about 10% incorrectly to a number of tricky corner cases. We should try to fix them as well, but for now I'm happy enough with the improvements here :)
I guess the main question here is whether there is an easy way to detect situations where even the double double code path is not good enough and we need to switch to something else for those corner cases...
> It will also likely be much slower: ideally you only want to utilise excess precision only when you need to.
So the double double code path is only taken if needed; otherwise a faster code path runs (see here). But even the double double case shows good performance: if you look at the aggregate performance for the `uniform_data_float64.csv` file in the benchmark I posted, performance in this PR (using the double double code path) is actually better than in TextParse.jl on Julia 0.6 (which didn't use any extended precision at all). Relative to all the other packages, only R's fread beats it (but that beats us in almost all cases).
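For readers following along, the building blocks underneath a double-double slow path are the standard error-free transforms (textbook Knuth/Dekker constructions, shown here for illustration; the PR's actual routine is more involved):

```julia
# two_sum: s + e == a + b exactly, with s = fl(a + b) (Knuth's
# branch-free version; no assumption on |a| vs |b|).
function two_sum(a::Float64, b::Float64)
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e
end

# two_prod: p + e == a * b exactly, using fused multiply-add to
# recover the rounding error of the product.
function two_prod(a::Float64, b::Float64)
    p = a * b
    e = fma(a, b, -p)
    return p, e
end
```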
> Also it probably makes sense to handle the powers of 10 via a lookup table (possible two separate ones: one where 10.0^k is exact, and another in higher precision).
I played around with that at some point and didn't see any improvements. I didn't do it super carefully, might well have been that I got something wrong and it would have helped if done more carefully.
> So the double double code path is only taken if needed, otherwise a faster code path runs
Ah, I missed how that worked. In that case, it is certainly a net improvement.
I agree with most of that. I'm not after any big redesigns here; I just want to solve two problems: (1) having 2+ CSV packages is a bad user experience because people don't know which to use, and (2) we need to pool our efforts in one place. Sometimes TextParse performance edges ahead, and sometimes CSV.jl's does. There are also cases where we need to pick a default, e.g. if you go through the FileIO interface. So I believe we would be better served by a single, monotonically improving codebase. If nothing else, the names CSV.jl and CSVFiles.jl are confusing; they sound like exactly the same thing. 💯 to moving common low-level routines to Parsers.jl, and to the CSV testbed repo idea.
@JeffBezanson just some quick reaction: I don't share your worry about having multiple user facing CSV packages. We have multiple table packages as well, right? In general, I think if there are different designs on the table, it is better to have multiple packages that try out things rather than spend a lot of time and energy in settling on one or another approach. I think I broadly always cheer if I see multiple attacks on a similar problem, I think it is a sign of a healthy ecosystem. I think user confusion can be solved via documentation or meta packages. For example, I think the Queryverse file IO story is pretty simple at this point (see here).
Actually, that was not what I meant. In my mind the approaches in TextParse.jl and Parsers.jl are so different that settling on one or the other would amount to a major rewrite of either TextParse.jl or CSV.jl. I meant that we can copy ideas and algorithms from each other and help out, but not try to settle on one code base. Maybe down the road, if it becomes clear that one approach is better than the other, but I don't think that is the case right now.
I guess I am a bit surprised that there are approaches to e.g. parsing floats that are so different it is impossible to share code, even when using the same algorithm. At the end of the day, something has to read a sequence of bytes and return a float. But I guess it could be that, at least in the short term, the effort of porting ideas over is lower than the effort of integrating the codebases. I'm just frustrated that if I use e.g. FileIO or JuliaDB, next week CSV.jl might get much faster CSV parsing and I won't benefit from it. Do we expect, for example, that for the foreseeable future some files will parse faster with one package than with another? If so, do we just need to work on interfaces to make it easy for the user to specify which parser to use?
Feel free to use the float parsing tests I have here; I spent a good 2-3 weeks last summer writing the float parser in Parsers.jl, and it's surprisingly hard to get to 100% accuracy while maintaining performance. The Parsers.jl implementation is accurate but currently slow for full-precision floats due to the use of BigInt/BigFloat. @simonbyrne and I have some ideas to work around that, but just haven't gotten around to doing the work yet.
I think the main difference right now is that TextParse.jl takes its input as …
Cool, they look super useful!
Is the difficult question how to detect when the double double approach used in this PR is no longer good enough and one has to go to the BigInt/BigFloat approach? My guess is that this would be rare, so if one used double double for most full-precision floats (like this PR does), one would get great performance in most cases, and in the few cases where even that is not good enough one could switch over to a third, even slower path?
Yeah, the tests in Parsers.jl are fairly extensive at this point; I basically went to the Rust, Go, and Boost native implementations and took all the test cases from there, plus a bunch from Stack Overflow questions where people mentioned seeing float parsing bugs.
Fixes #73. Well, hopefully, at some point.
This is still very WIP. For now it just tries to properly add branches after calls to `iterate`, plus a number of other tweaks. That fixes the most egregious performance issues, but we are still quite far from what we have on Julia 0.6 for this package.

I've added a whole bunch of micro benchmarks, and will add more as I try to fix more issues. This shows how performance degraded from the Julia 0.6 version to the current version on `master`. This shows how performance changes from the Julia 0.6 version to this PR. The short story is that we go from regressions on the order of a factor of a couple hundred to some improvements and something below a factor 10 regression. This is only true for these micro benchmarks; the story for an overall `readcsv` is still not good.

I think I essentially know what the next step is: the current model of returning `nothing, i` or `Some(x), i` from `tryparsenext` seems to be really slow. I went back to `Nullable` in one specific case and the performance for that part went back to the Julia 0.6 level. So unless someone has a better idea, I'm probably just going to go back to `Nullable` for `tryparsenext`.

Oh, and one more thing: I'm going to rebase this branch regularly, just be warned :)
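For context, the two return conventions being compared look schematically like this (illustrative only; `tryparsenext_digit` is a made-up name, not the TextParse API, and `Nullable` lives in the Nullables.jl package on Julia 1.0):

```julia
# Union-based convention: return (Union{Nothing, Some{T}}, next_index).
function tryparsenext_digit(str::String, i::Int)
    y = iterate(str, i)
    y === nothing && return nothing, i
    c, ii = y
    '0' <= c <= '9' || return nothing, i
    return Some(Int(c - '0')), ii
end

# Nullable-based convention (Julia 0.6 style) would instead return
# (Nullable{T}, next_index):
#     return Nullable{Int}(), i          # on failure
#     return Nullable(Int(c - '0')), ii  # on success
```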