Creating columns in DataFrame via loops

Hi all,
Stuck on what is probably a super simple problem. It’s the end of the week at 4pm on a Friday so my brain is just no longer working :blush:

I want to create several columns on an existing data frame, each filled with a specified value.

Here is my code:

using DataFrames

df = DataFrame("A"=>1:10)
category = ["cat1","cat2","cat3","cat4","cat5"]
cat_values = [0.203842327,0.210149485,0,0.070243409,0.034921919]

for i in category, x in cat_values
    df[:,i] .= cat_values[x]
end
#this produces error: ERROR: ArgumentError: invalid index: 0.203842327 of type Float64

I expected it to behave like this (e.g. after a single iteration):

df[:,:cat1] .= cat_values[1]
return(df)

Appreciate if anyone can point me in the right direction. It’s been a long day.

for (i,x) in zip(category, cat_values)
  df[:,i] = repeat([x], 10)  # 10 values required to match column A
end

10×6 DataFrame
 Row │ A      cat1      cat2      cat3     cat4       cat5
     │ Int64  Float64   Float64   Float64  Float64    Float64
─────┼──────────────────────────────────────────────────────────
   1 │     1  0.203842  0.210149      0.0  0.0702434  0.0349219
   2 │     2  0.203842  0.210149      0.0  0.0702434  0.0349219
   3 │     3  0.203842  0.210149      0.0  0.0702434  0.0349219
   4 │     4  0.203842  0.210149      0.0  0.0702434  0.0349219
   5 │     5  0.203842  0.210149      0.0  0.0702434  0.0349219
   6 │     6  0.203842  0.210149      0.0  0.0702434  0.0349219
   7 │     7  0.203842  0.210149      0.0  0.0702434  0.0349219
   8 │     8  0.203842  0.210149      0.0  0.0702434  0.0349219
   9 │     9  0.203842  0.210149      0.0  0.0702434  0.0349219
  10 │    10  0.203842  0.210149      0.0  0.0702434  0.0349219
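
The hard-coded 10 in the loop above only works while df has exactly 10 rows. A minimal sketch that reads the length from the frame instead, assuming DataFrames.jl is installed (the loop variable names name and value are just illustrative):

```julia
using DataFrames  # assumes DataFrames.jl is installed

df = DataFrame("A" => 1:10)
category = ["cat1", "cat2", "cat3", "cat4", "cat5"]
cat_values = [0.203842327, 0.210149485, 0, 0.070243409, 0.034921919]

for (name, value) in zip(category, cat_values)
    # fill builds a vector of nrow(df) copies of value;
    # df[!, name] = v stores that vector as the column without copying it
    df[!, name] = fill(value, nrow(df))
end
```

This should give the same table as above, and it keeps working if the number of rows in A changes.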

Another method (might be slower due to use of function closure):

foreach((n,v)->df[!,n] .= v, category, cat_values)
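
If indexing into cat_values is what you had in mind in the first place, looping over the shared indices is one more option. A rough sketch, again assuming DataFrames.jl is installed (k is just an illustrative index name):

```julia
using DataFrames  # assumes DataFrames.jl is installed

df = DataFrame("A" => 1:10)
category = ["cat1", "cat2", "cat3", "cat4", "cat5"]
cat_values = [0.203842327, 0.210149485, 0, 0.070243409, 0.034921919]

# eachindex with two arrays yields their common indices
# and throws if the arrays do not have matching indices
for k in eachindex(category, cat_values)
    # k is an integer position, so cat_values[k] is a valid lookup
    df[!, category[k]] .= cat_values[k]
end
```

Here the loop variable really is a position, so indexing with it is fine, and a length mismatch between the two vectors raises an error instead of being silently truncated the way zip would.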

To walk through your code…

Let’s check that against your first iteration (that’s not enough debugging generally but it’s enough here). The first i in category is "cat1", and the first x in cat_values is 0.203842327, so the first iteration does:

julia> df[:, "cat1"] .= cat_values[0.203842327]
ERROR: ArgumentError: invalid index: 0.203842327 of type Float64

The same error, and it’s apparent we didn’t mean to index cat_values a 2nd time: x is already an element of cat_values, not a position in it. Assigning the value directly works:

julia> df[:, "cat1"] .= 0.203842327
10-element Vector{Float64}:
 0.203842327
 0.203842327
 0.203842327
...

So let’s make that change in your loop and check the resulting df:

julia> for i in category, x in cat_values
         df[:,i] .= x
       end

julia> df
10×6 DataFrame
 Row │ A      cat1       cat2       cat3       cat4       cat5
     │ Int64  Float64    Float64    Float64    Float64    Float64
─────┼──────────────────────────────────────────────────────────────
   1 │     1  0.0349219  0.0349219  0.0349219  0.0349219  0.0349219
   2 │     2  0.0349219  0.0349219  0.0349219  0.0349219  0.0349219
   3 │     3  0.0349219  0.0349219  0.0349219  0.0349219  0.0349219
...

Well that’s not what we want either. We’re iterating through category columns and cat_values values, so what’s the problem? Let’s reference the loop docs:

Multiple nested for loops can be combined into a single outer loop, forming the cartesian product of its iterables:

julia> for i = 1:2, j = 3:4
           println((i, j))
       end
(1, 3)
(1, 4)
(2, 3)
(2, 4)

So instead of 5 iterations of category and cat_values in parallel, we had 5×5 = 25 iterations over the Cartesian product of their elements. Because the inner loop runs over cat_values, each column was overwritten with every value in turn and ended up holding the last one, 0.0349219. To iterate 2 or more sequences in parallel (until the shortest one is exhausted), we can use zip:

julia> for (i,x) in zip(category, cat_values)
           df[:,i] .= x
       end

julia> df
10×6 DataFrame
 Row │ A      cat1      cat2      cat3     cat4       cat5
     │ Int64  Float64   Float64   Float64  Float64    Float64
─────┼──────────────────────────────────────────────────────────
   1 │     1  0.203842  0.210149      0.0  0.0702434  0.0349219
   2 │     2  0.203842  0.210149      0.0  0.0702434  0.0349219
   3 │     3  0.203842  0.210149      0.0  0.0702434  0.0349219
...

Seems right.
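
If you want to see the difference between the two loop forms without a DataFrame in the way, here is a small Base-only sketch. The three names and values are made up for illustration, and a Dict stands in for the columns:

```julia
category = ["cat1", "cat2", "cat3"]
cat_values = [0.1, 0.2, 0.3]

# Comma form: nested loops, so every name meets every value.
# Each key is overwritten three times and keeps only the last value.
nested = Dict{String,Float64}()
nested_count = 0
for i in category, x in cat_values
    nested[i] = x
    global nested_count += 1
end

# zip form: elements are paired position by position.
paired = Dict{String,Float64}()
paired_count = 0
for (i, x) in zip(category, cat_values)
    paired[i] = x
    global paired_count += 1
end

println(nested_count)    # 9
println(paired_count)    # 3
println(nested["cat1"])  # 0.3
println(paired["cat1"])  # 0.1
```

The nested version ends with every key holding the last value, which is the same thing that happened to the columns earlier.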
