How to get ranges of valid data in vectors or dataframes?

Hi,

using DataFrames

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])
12×2 DataFrame
 Row │ col1     col2    
     │ Float64  Int64?  
─────┼──────────────────
   1 │     1.0  missing 
   2 │   NaN          2
   3 │     3.0        3
   4 │     4.0        4
   5 │   NaN          5
   6 │     6.0        6
   7 │     7.0  missing 
   8 │     8.0        8
   9 │     9.0        9
  10 │    10.0       10
  11 │   NaN         11
  12 │   NaN         12

So for either column, I would like to get the maximum ranges of contiguous valid data, e.g. the lines (indexes or actual lines) range 6 to 10 for column 1.
Or for missings 2 to 6 and 8 to 11 for column 2.
I started to think about running through the lines etc. but I guess that there’s a more native way to do that.

Here would be an example of how I would do it

function max_valid_run(v)
    max_len = 0
    curr_len = 0
    for x in v
        if x !== missing && !isnan(x)
            curr_len += 1
            max_len = max(max_len, curr_len)
        else
            curr_len = 0
        end
    end
    return max_len
end

mapcols(col -> max_valid_run(col), df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5     5

Probably is a more efficient way of checking the columns but have to pay some care as isnan(missing) gives missing as opposed to false so !isnan(x) && x !== missing would throw a TypeError

Same idea, compressed to one liner:

julia> mapcols(c->maximum(diff(findall(
  (x->ifelse(ismissing(x),true,isnan(x))).(c))))-1, df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5      5

The code below outputs vectors of unit ranges as requested:

using DataFrames

function max_valid_ranges(v)
    ix = findall(in((NaN, missing)), v)
    (1 ∉ ix) && pushfirst!(ix, 0)
    n = length(v)
    (n ∉ ix) && append!(ix, n + 1)
    d = diff(ix)
    iy = findall(==(maximum(d)), d)
    return [(ix[i]+1):(ix[i]+d[i]-1) for i in iy]
end

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

max_valid_ranges(df.col1)   # [6:10]
max_valid_ranges(df.col2)   # [2:6; 8:12]
1 Like