Redesign of `stack` and `unstack`

This issue is meant to replace: https://github.com/JuliaData/DataFrames.jl/issues/2215 https://github.com/JuliaData/DataFrames.jl/issues/2148 https://github.com/JuliaData/DataFrames.jl/issues/3066 https://github.com/JuliaData/DataFrames.jl/issues/2422 https://github.com/JuliaData/DataFrames.jl/issues/2414 https://github.com/JuliaData/DataFrames.jl/issues/1839

The proposed improved API for `stack` is:
```
stack(df::AbstractDataFrame,
    measure_vars,
    id_vars;
    variable_name=:variable,
    name_value=identity,
    value_name=:value, # or function
    variable_eltype::Type=String,
    fill=missing,
    view::Bool=false)
```
Questions to discuss:
* should we rename `measure_vars` and `id_vars` arguments (I was thinking of something better but could not come with anything better);
* `variable_name` - no change here
* `name_value` - I propose to allow passing a function that takes `measure_vars` column names as strings and produces values in the `variable_name` column where the name of the `measure_vars` will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in `stack`?; also the question is what name would be best here)
* `value_name` - I propose to allow passing a function that takes `measure_vars` column names as strings and produces name of the column where the name of the values will be stored
* `fill` if `variable_name`/`value_name` combination is missing what value use to fill data
* do we need to provide an option to specify sorting order of produced data frame? (I now assume `measure_vars` columns are processed left to right)
* `view` - `true` will be disallowed if `variable_name` or `value_name` is a function
* if `variable_name`/`value_name` combination produces duplicate I assume we throw an error (but maybe we want some other behavior also?)

Example. Input `df`:
```
2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │
```
Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):
```
4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │
```
(so the general idea is to allow for dynamic generation of `variable_name` and `value_name` based on `measure_vars` column names)

The proposed improved API for `unstack` is:
```
unstack(df::AbstractDataFrame,
      row_keys,
      col_keys,
      values;
      renamecols::Function=(x...) -> join(x, "_"),
      allowmissing::Bool=false,
      combine=values isa ColumnIndex ? only : (x...) -> only.(x),
      fill=missing,
      threads::Bool=true)
```

Questions to discuss:
* rename `rowkeys` to `row_keys`
* rename `colkey` to `col_keys`; start allowing passing multiple columns as `col_keys`
* rename `value` to `values`; start allowing passing multiple columns as `values`
* `renamecols`: takes as many positional arguments as there are `col_keys` columns; by default joins them with `_`
* `combine`: takes as many positional arguments as there are `values` columns; by default if a single column is passed `only`, and if multiple a tuple of captured values is produced (but anything can be computed and returned here). Internally it will be `values => combine` transformation in operation specification syntax. Note that I propose, as opposed to, e.g. dplyr, that we do not create multiple columns here, but instead a value in a cell if a function of multiple columns (by default a tuple of matching values)

Example. Input `df`:
```
8×5 DataFrame
 Row │ id     n1      n2      v1     v2
     │ Int64  String  String  Int64  Char
─────┼────────────────────────────────────
   1 │     1  a       x           1  a
   2 │     1  a       y           2  b
   3 │     1  b       x           3  c
   4 │     1  b       y           4  d
   5 │     2  a       x           5  e
   6 │     2  a       y           6  f
   7 │     2  b       x           7  g
   8 │     2  b       y           8  h
```
Output of `unstack(df, :id, r"n", r"v")` (with default `renamescols` and `combine`):
```
2×5 DataFrame
 Row │ id     a_x       a_y       b_x       b_y
     │ Int64  Tuple…    Tuple…    Tuple…    Tuple…
─────┼───────────────────────────────────────────────
   1 │     1  (1, 'a')  (2, 'b')  (3, 'c')  (4, 'd')
   2 │     2  (5, 'e')  (6, 'f')  (7, 'g')  (8, 'h')
```
Output of `unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1]))`:
```
2×5 DataFrame
 Row │ id     ax      ay      bx      by
     │ Int64  String  String  String  String
─────┼───────────────────────────────────────
   1 │     1  1a      2b      3c      4d
   2 │     2  5e      6f      7g      8h
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Redesign of `stack` and `unstack` #3237

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Redesign of stack and unstack #3237

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Redesign of `stack` and `unstack` #3237