Skip to content

Redesign of stack and unstack #3237

@bkamins

Description

@bkamins

This issue is meant to replace: #2215 #2148 #3066 #2422 #2414 #1839

The proposed improved API for stack is:

stack(df::AbstractDataFrame,
    measure_vars,
    id_vars;
    variable_name=:variable,
    name_value=identity,
    value_name=:value, # or function
    variable_eltype::Type=String,
    fill=missing,
    view::Bool=false)

Questions to discuss:

  • should we rename measure_vars and id_vars arguments (I was thinking of something better but could not come with anything better);
  • variable_name - no change here
  • name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)
  • value_name - I propose to allow passing a function that takes measure_vars column names as strings and produces name of the column where the name of the values will be stored
  • fill if variable_name/value_name combination is missing what value use to fill data
  • do we need to provide an option to specify sorting order of produced data frame? (I now assume measure_vars columns are processed left to right)
  • view - true will be disallowed if variable_name or value_name is a function
  • if variable_name/value_name combination produces duplicate I assume we throw an error (but maybe we want some other behavior also?)

Example. Input df:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

(so the general idea is to allow for dynamic generation of variable_name and value_name based on measure_vars column names)

The proposed improved API for unstack is:

unstack(df::AbstractDataFrame,
      row_keys,
      col_keys,
      values;
      renamecols::Function=(x...) -> join(x, "_"),
      allowmissing::Bool=false,
      combine=values isa ColumnIndex ? only : (x...) -> only.(x),
      fill=missing,
      threads::Bool=true)

Questions to discuss:

  • rename rowkeys to row_keys
  • rename colkey to col_keys; start allowing passing multiple columns as col_keys
  • rename value to values; start allowing passing multiple columns as values
  • renamecols: takes as many positional arguments as there are col_keys columns; by default joins them with _
  • combine: takes as many positional arguments as there are values columns; by default if a single column is passed only, and if multiple a tuple of captured values is produced (but anything can be computed and returned here). Internally it will be values => combine transformation in operation specification syntax. Note that I propose, as opposed to, e.g. dplyr, that we do not create multiple columns here, but instead a value in a cell if a function of multiple columns (by default a tuple of matching values)

Example. Input df:

8×5 DataFrame
 Row │ id     n1      n2      v1     v2
     │ Int64  String  String  Int64  Char
─────┼────────────────────────────────────
   1 │     1  a       x           1  a
   2 │     1  a       y           2  b
   3 │     1  b       x           3  c
   4 │     1  b       y           4  d
   5 │     2  a       x           5  e
   6 │     2  a       y           6  f
   7 │     2  b       x           7  g
   8 │     2  b       y           8  h

Output of unstack(df, :id, r"n", r"v") (with default renamescols and combine):

2×5 DataFrame
 Row │ id     a_x       a_y       b_x       b_y
     │ Int64  Tuple…    Tuple…    Tuple…    Tuple…
─────┼───────────────────────────────────────────────
   1 │     1  (1, 'a')  (2, 'b')  (3, 'c')  (4, 'd')
   2 │     2  (5, 'e')  (6, 'f')  (7, 'g')  (8, 'h')

Output of unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1])):

2×5 DataFrame
 Row │ id     ax      ay      bx      by
     │ Int64  String  String  String  String
─────┼───────────────────────────────────────
   1 │     1  1a      2b      3c      4d
   2 │     2  5e      6f      7g      8h

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions