-
Notifications
You must be signed in to change notification settings - Fork 375
Open
Description
This issue is meant to replace: #2215 #2148 #3066 #2422 #2414 #1839
The proposed improved API for stack
is:
stack(df::AbstractDataFrame,
measure_vars,
id_vars;
variable_name=:variable,
name_value=identity,
value_name=:value, # or function
variable_eltype::Type=String,
fill=missing,
view::Bool=false)
Questions to discuss:
- should we rename
measure_vars
andid_vars
arguments (I was thinking of something better but could not come with anything better); variable_name
- no change herename_value
- I propose to allow passing a function that takesmeasure_vars
column names as strings and produces values in thevariable_name
column where the name of themeasure_vars
will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting instack
?; also the question is what name would be best here)value_name
- I propose to allow passing a function that takesmeasure_vars
column names as strings and produces name of the column where the name of the values will be storedfill
ifvariable_name
/value_name
combination is missing what value use to fill data- do we need to provide an option to specify sorting order of produced data frame? (I now assume
measure_vars
columns are processed left to right) view
-true
will be disallowed ifvariable_name
orvalue_name
is a function- if
variable_name
/value_name
combination produces duplicate I assume we throw an error (but maybe we want some other behavior also?)
Example. Input df
:
2×5 DataFrame
│ Row │ ID │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │ 2 │ 2 │
Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):
4×4 DataFrame
│ Row │ ID │ Year │ varA │ varB │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2018 │ 1 │ 1 │
│ 2 │ 1 │ 2019 │ 1 │ 1 │
│ 3 │ 2 │ 2018 │ 2 │ 2 │
│ 4 │ 2 │ 2019 │ 2 │ 2 │
(so the general idea is to allow for dynamic generation of variable_name
and value_name
based on measure_vars
column names)
The proposed improved API for unstack
is:
unstack(df::AbstractDataFrame,
row_keys,
col_keys,
values;
renamecols::Function=(x...) -> join(x, "_"),
allowmissing::Bool=false,
combine=values isa ColumnIndex ? only : (x...) -> only.(x),
fill=missing,
threads::Bool=true)
Questions to discuss:
- rename
rowkeys
torow_keys
- rename
colkey
tocol_keys
; start allowing passing multiple columns ascol_keys
- rename
value
tovalues
; start allowing passing multiple columns asvalues
renamecols
: takes as many positional arguments as there arecol_keys
columns; by default joins them with_
combine
: takes as many positional arguments as there arevalues
columns; by default if a single column is passedonly
, and if multiple a tuple of captured values is produced (but anything can be computed and returned here). Internally it will bevalues => combine
transformation in operation specification syntax. Note that I propose, as opposed to, e.g. dplyr, that we do not create multiple columns here, but instead a value in a cell if a function of multiple columns (by default a tuple of matching values)
Example. Input df
:
8×5 DataFrame
Row │ id n1 n2 v1 v2
│ Int64 String String Int64 Char
─────┼────────────────────────────────────
1 │ 1 a x 1 a
2 │ 1 a y 2 b
3 │ 1 b x 3 c
4 │ 1 b y 4 d
5 │ 2 a x 5 e
6 │ 2 a y 6 f
7 │ 2 b x 7 g
8 │ 2 b y 8 h
Output of unstack(df, :id, r"n", r"v")
(with default renamescols
and combine
):
2×5 DataFrame
Row │ id a_x a_y b_x b_y
│ Int64 Tuple… Tuple… Tuple… Tuple…
─────┼───────────────────────────────────────────────
1 │ 1 (1, 'a') (2, 'b') (3, 'c') (4, 'd')
2 │ 2 (5, 'e') (6, 'f') (7, 'g') (8, 'h')
Output of unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1]))
:
2×5 DataFrame
Row │ id ax ay bx by
│ Int64 String String String String
─────┼───────────────────────────────────────
1 │ 1 1a 2b 3c 4d
2 │ 2 5e 6f 7g 8h
MatthewRGonzalez, chentoast and Mattriks