ENH: select column/coordinates/multiple with start/stop/selection #6177

Merged
merged 1 commit into from
Feb 9, 2014

@wabu wabu commented Jan 29, 2014

select_as_multiple/column/coordinate and remove behave strangely on combinations of where clauses and start/stop keyword arguments, so I started fixing this.

I changed select_as_multiple to select coordinates inside the TableIterator, so it will work on large tables. Moreover, select_as_multiple now always raises a KeyError when either a key or the selector is not available in the file.
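
A rough plain-Python model of the chunked idea described above, not the pandas implementation: coordinates are decided by the selector table only, and each chunk of coordinates is then read from every table, so only one chunk of rows is held at a time. The names `iter_multiple`, `tables`, `predicate`, and `chunksize` are hypothetical, and lists stand in for on-disk tables.

```python
from itertools import islice


def iter_multiple(tables, selector, predicate, chunksize=3):
    """Yield merged rows chunk by chunk (hypothetical model, not pandas code).

    `tables` maps a key to a list of row values, a stand-in for on-disk tables.
    """
    # Coordinates (row numbers) are decided lazily by the selector table only.
    coords = (i for i, value in enumerate(tables[selector]) if predicate(value))
    while True:
        chunk = list(islice(coords, chunksize))
        if not chunk:
            return
        # Read the same coordinates from every table and merge them per row.
        yield [{key: rows[i] for key, rows in tables.items()} for i in chunk]


tables = {"a": [0.05, 0.4, 0.9, 0.02, 0.7], "b": [10, 11, 12, 13, 14]}
chunks = list(iter_multiple(tables, "a", lambda v: v > 0.1, chunksize=2))
```

With `chunksize=2`, the three matching rows arrive as one chunk of two rows followed by one chunk of one row.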

Closes #4835.

jreback commented Jan 29, 2014

Can you show (at the top of this PR) e.g. an IPython session of some use cases? You can use stripped-down versions of your test cases.

And before/after if it matters (e.g. for the API change / bug on the errors).

jreback commented Jan 29, 2014

Does this close issue #4835 and/or #3307? (If so, please put it in the header.)

jreback commented Jan 29, 2014

As an aside, I've been meaning to fix/completely rewrite the API for select_as_multiple. If you store a small amount of metadata, it is easy to reconstruct these linked tables. Furthermore, the way of specifying the key / join tables could be a lot better (e.g. a function/object-oriented interface).

Love to hear ideas.

wabu commented Feb 8, 2014

I did not have much time this week. I fixed the bug with PyTables < 3.0 (which showed up in the Python 2.7 Travis build), but the Travis build no longer seems to work for 2.7. I'll post examples later. Mainly it's about using start/stop in all select_... methods and in remove, and applying the selection filter for read_coordinates.

jreback commented Feb 8, 2014

just force it to build again

git commit --amend -C HEAD then force push

wabu commented Feb 8, 2014

Here's an example showing how select_as_multiple handled start before the commit, in contrast to select:

In [3]: data = pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
In [4]: store = pd.HDFStore('test.h5')
In [5]: store.append_to_multiple({'a': ['a'], 'b': ['b']}, data, 'a')
In [6]: store.select('a', where='a>.1', start=5)
Out[6]: 
          a
5  0.386593
6  0.363150
7  0.247858
8  0.628002
9  0.785359
[5 rows x 1 columns]

In [7]: store.select_as_multiple(['a','b'], where='a>.1', start=5)
Out[7]: 
Empty DataFrame
Columns: [a, b]
Index: []

[0 rows x 2 columns]

The result is empty because the where is applied first and the start is then applied to the filtered result.
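
To make the ordering issue concrete, here is a small plain-Python model of one reading of that sentence. It is only a sketch and does not reproduce the session above or the pandas internals; the function names and the sample values are made up for illustration. The sample data has only four matching rows, so slicing the filtered result at `start=5` leaves nothing, while restricting the row range first still finds matches.

```python
def where_then_start(rows, predicate, start):
    # Filter first, then slice the *filtered* result.
    matches = [(i, v) for i, v in enumerate(rows) if predicate(v)]
    return matches[start:]


def start_then_where(rows, predicate, start):
    # Restrict to the row range first, then filter within that range.
    return [(i, v) for i, v in enumerate(rows) if i >= start and predicate(v)]


# Made-up values: rows 2, 5, 7 and 9 satisfy the condition.
rows = [0.05, 0.02, 0.5, 0.01, 0.03, 0.6, 0.07, 0.8, 0.04, 0.9]


def above(v):
    return v > 0.1


filtered_first = where_then_start(rows, above, start=5)
range_first = start_then_where(rows, above, start=5)
```

Here `filtered_first` is empty, while `range_first` keeps the matching rows at positions 5, 7 and 9.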

select_as_coordinates misbehaves when the where clause results in a filter expression:

In [10]: sel = np.arange(5,1000)
In [11]: store.select_as_coordinates('a', where='index = sel')
Out[11]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

Here, the filter expression is ignored, so everything is returned.

After the fix, we have:

In [4]: store.select('a', where='a>.1', start=5)
Out[4]: 
          a
5  0.386593
6  0.363150
7  0.247858
8  0.628002
9  0.785359
[5 rows x 1 columns]
In [5]: store.select_as_multiple(['a','b'], where='a>.1', start=5)
Out[5]: 
          a         b
5  0.386593  0.258102
6  0.363150  0.345453
7  0.247858  0.841031
8  0.628002  0.437058
9  0.785359  0.520087
[5 rows x 2 columns]

In [6]: sel = np.arange(5,1000)
In [7]: store.select_as_coordinates('a', where='index = sel')
Out[7]: Int64Index([5, 6, 7, 8, 9], dtype='int64')
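
The expected result of the `index = sel` condition can be modelled without pandas as a membership test on the index labels. This is a hedged sketch with a hypothetical helper name, assuming the condition means "index label is one of the values in sel":

```python
def coordinates_where_index_in(index, sel):
    """Return row positions whose index label appears in `sel` (hypothetical helper)."""
    wanted = set(sel)
    return [pos for pos, label in enumerate(index) if label in wanted]


index = list(range(10))   # the ten-row table from the session
sel = range(5, 1000)      # same values as np.arange(5, 1000)
coords = coordinates_where_index_in(index, sel)
```

For the ten-row table this yields positions 5 through 9, matching the fixed output, whereas ignoring the filter would return all ten positions.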

@@ -2195,6 +2195,68 @@ def test_remove_where(self):
# self.assertRaises(ValueError, store.remove,
# 'wp2', [('column', ['A', 'D'])])

def test_remove_startstop(self):

Can you put a reference to the issue number here (or the PR if there isn't an associated issue)?

jreback commented Feb 8, 2014

This needs a release note mentioning these changes (make 2 notes: one in the API section about the TypeError changing to KeyError, one in the Bug Fix section). Reference the issue (or this PR if there isn't an associated issue).

Squash them all down to 1-2 commits, then OK to merge.

wabu commented Feb 9, 2014

How should I reference the PR? With :pull:, :pr: or :issue:?

jreback commented Feb 9, 2014

issue

This commit fixes issues with start/stop and selection. Furthermore,
select_as_multiple is implemented differently:
- select_column and remove now handle start/stop arguments
- select_as_multiple handles start/stop the same way as select
- select_as_coordinates and select_column work with where expressions
  that result in filters. This also fixes select_as_multiple with filters
- select_as_multiple table iterator is rewritten, so memory usage is
  constant.

select_as_multiple now will always throw a KeyError if either a key or
the selector is not found.

Also closes GH4835
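
The KeyError rule from the commit message can be sketched as an up-front check. This is an illustrative model only: `check_keys` and its error message are hypothetical and are not taken from the pandas source.

```python
def check_keys(available, keys, selector):
    """Raise KeyError if any requested key, or the selector, is missing.

    Hypothetical helper; `available` is the set of keys present in the file.
    """
    for name in list(keys) + [selector]:
        if name not in available:
            raise KeyError(f"no table found for key {name!r}")


available = {"a", "b"}
check_keys(available, ["a", "b"], selector="a")  # passes silently

try:
    check_keys(available, ["a", "missing"], selector="a")
    missing_key_error = None
except KeyError as exc:
    missing_key_error = exc
```

Checking the selector together with the keys gives one consistent exception type regardless of which name is absent.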
jreback added a commit that referenced this pull request Feb 9, 2014
ENH: select column/coordinates/multiple with start/stop/selection
@jreback jreback merged commit 5adf4ea into pandas-dev:master Feb 9, 2014

jreback commented Feb 9, 2014

@wabu thanks...this is really great!

I suppose if the user specifies start and/or stop covering the complete table range, then all rows will be deleted but the table will still exist, which I think is OK. If you find that a problem, you can come back and address it later (e.g. you could change start=0 to start=None, and stop=nrows to stop=None). But it's a minor point.
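
The suggested follow-up could look roughly like the sketch below. `normalize_range` is a hypothetical helper, not an existing pandas function; it only maps a start/stop pair that spans the whole table to None, as described above.

```python
def normalize_range(start, stop, nrows):
    """Map start=0 to None and stop=nrows to None (hypothetical helper)."""
    if start == 0:
        start = None
    if stop == nrows:
        stop = None
    return start, stop


full = normalize_range(0, 10, nrows=10)
partial = normalize_range(2, 10, nrows=10)
```

A caller could then treat `(None, None)` as "remove the whole table" and anything else as a row-range delete.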

@wabu wabu deleted the selection-start-stop-fixes branch February 9, 2014 13:59
Labels
Bug IO HDF5 read_hdf, HDFStore