Skip to content

implemented Data.streamfrom for Data.field #45

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

ExpandingMan
Copy link
Collaborator

Ok, this still has a couple of issues (in particular, I haven't yet figured out how to properly deserialize the Bools), but it is a good start. This allows you to pull individual fields without loading the whole file (at least I know it does that in the case where memory mapping is used, not quite sure what happens otherwise).

"But ExpandingMan, clearly the right thing to do is to implement the ability to pull arbitrarily small subsets of rows from columns, and the Data.Field implementation should be based on that."
Well, yes, true, but unfortunately the DataStreams interface standard doesn't support that yet (I think), though it should definitely be expanded to. As is, there is a hell of a lot of extra overhead in pulling individual fields. It'll be pretty simple to implement pulling subsets of columns once it is added to the standard. The Data.Field stuff I wrote will still be useful (and on a personal note, it helped me to learn more about how these types of formats work).

Ok, if you guys can let me know how this can be improved, I'd be happy to make some changes so that this can be merged. At the moment, I think that the only problem is with the Bools. Thanks!

@ExpandingMan
Copy link
Collaborator Author

There seem to be a couple of simple errors. Unfortunately I won't get a chance to fix them until Sunday or Monday, but I'll address them then.

@ExpandingMan
Copy link
Collaborator Author

Well that was embarrassing. Apparently when I moved around some of the functions before making the original commit I had deleted a couple of the Data.streamfrom methods. Easily fixed.

Also, it seems that there were a couple of weird issues with Arrow.Date that I fixed. Sometimes Ints were getting passed to unix2date, which is no good because it tries to do x.value on those Ints. I'm not sure why this wasn't fixed before.

Tests all seem to pass now. I still don't know how to get Bools so what's in here is still a hack until I figure out the right way of doing it. Also, if there is any interest in updating the DataStreams standard to stream partial columns, I will add the appropriate methods.

Please take a look. Thanks!

@codecov-io
Copy link

codecov-io commented Apr 17, 2017

Codecov Report

Merging #45 into master will decrease coverage by 6.26%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
- Coverage   83.21%   76.94%   -6.27%     
==========================================
  Files           3        3              
  Lines         274      321      +47     
==========================================
+ Hits          228      247      +19     
- Misses         46       74      +28
Impacted Files Coverage Δ
src/Arrow.jl 73.33% <100%> (+16.19%) ⬆️
src/Feather.jl 76.58% <58.33%> (-7.61%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d8a1f9...99317cf. Read the comment docs.

@ExpandingMan
Copy link
Collaborator Author

I think the Travis test fails because of the use of workspace(). Note that the nightlies count as VERSION < v"0.6.0" so workspace()` still gets called. For whatever reason, it causes Julia to hang while eating up an entire CPU core indefinitely.

@quinnj
Copy link
Member

quinnj commented Apr 18, 2017

Thanks for the PR! Little slammed at the moment, but I'll try to review properly in the next few days.

@ExpandingMan
Copy link
Collaborator Author

As per our discussion here what I have implemented here is Data.Column streaming, but one field at a time. I'll wait until changes to DataStreams are settled on before making the appropriate changes to this PR (hopefully I'll also then be able to add streaming of partial columns).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants