How to get ranges of valid data in vectors or dataframes?

Ju_ska · June 18, 2025, 9:49am

Hi,

using DataFrames

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

12×2 DataFrame
 Row │ col1     col2    
     │ Float64  Int64?  
─────┼──────────────────
   1 │     1.0  missing 
   2 │   NaN          2
   3 │     3.0        3
   4 │     4.0        4
   5 │   NaN          5
   6 │     6.0        6
   7 │     7.0  missing 
   8 │     8.0        8
   9 │     9.0        9
  10 │    10.0       10
  11 │   NaN         11
  12 │   NaN         12

So for either column, I would like to get the maximum ranges of contiguous valid data, e.g. the lines (indexes or actual lines) range 6 to 10 for column 1.
Or for missings 2 to 6 and 8 to 11 for column 2.
I started to think about running through the lines etc. but I guess that there’s a more native way to do that.

lukketotte · June 18, 2025, 2:26pm

Here would be an example of how I would do it

function max_valid_run(v)
    max_len = 0
    curr_len = 0
    for x in v
        if x !== missing && !isnan(x)
            curr_len += 1
            max_len = max(max_len, curr_len)
        else
            curr_len = 0
        end
    end
    return max_len
end

mapcols(col -> max_valid_run(col), df)

1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5     5

Probably is a more efficient way of checking the columns but have to pay some care as isnan(missing) gives missing as opposed to false so !isnan(x) && x !== missing would throw a TypeError

Dan · June 19, 2025, 12:15am

Same idea, compressed to one liner:

julia> mapcols(c->maximum(diff(findall(
  (x->ifelse(ismissing(x),true,isnan(x))).(c))))-1, df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5      5

rafael.guerra · June 19, 2025, 7:38am

The code below outputs vectors of unit ranges as requested:

using DataFrames

function max_valid_ranges(v)
    ix = findall(in((NaN, missing)), v)
    (1 ∉ ix) && pushfirst!(ix, 0)
    n = length(v)
    (n ∉ ix) && append!(ix, n + 1)
    d = diff(ix)
    iy = findall(==(maximum(d)), d)
    return [(ix[i]+1):(ix[i]+d[i]-1) for i in iy]
end

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

max_valid_ranges(df.col1)   # [6:10]
max_valid_ranges(df.col2)   # [2:6; 8:12]

Topic		Replies	Views
Finding DataFrame rows with `missing` values in specific columns? General Usage dataframes , missing-values	12	1063	February 7, 2022
Check if value is missing in DataFrame indexed by values (not by numerical indices) General Usage dataframes	1	1133	September 11, 2018
Iterate over all numeric columns in DataFrames Data	21	4809	February 11, 2018
How to single out valid data from DataArray? Data question	4	2237	June 8, 2017
Get the maximum value by row in a DataFrame containing missing values Data statistics , benchmark , dataframes , missing-values	10	144	August 30, 2024

How to get ranges of valid data in vectors or dataframes?

Related topics