DOC: example of DataFrame export to HDF5 and import into R #9636

Closed
@joschkazj

When searching the web I didn't find any working example of transferring data from pandas to R via HDF5 files, even though the pandas documentation mentions that the HDF5 format it uses "can easily be imported into R using the rhdf5 library". The pandas export works as expected, and I inspected the resulting file layout with the HDF Group's viewer (HDFView).
After some experimentation I have a working sample that exports a DataFrame from Python/pandas and imports it into R. It could be added to the documentation to help future users:

# Example of HDF5 export for R

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"first": np.random.rand(100),
                   "second": np.random.rand(100),
                   "class": np.random.randint(0, 2, (100,))},
                   index=range(100))

print(df.head())

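# Write in the default "fixed" format: the R loader reads the
# block*_values / block*_items nodes that this format produces.
with pd.HDFStore("transfer.hdf5", mode="w", complib="zlib", complevel=5) as store:
    store.put("dataframe", df, format="fixed")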

Output:

   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535
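
For anyone without HDFView at hand, the node layout of the exported file can also be listed from Python. This is only a minimal sketch, assuming PyTables (`tables`) is installed, which pandas needs for `HDFStore` anyway. `list_hdf5_nodes` is a hypothetical helper name, not part of pandas, and the sketch writes its own small file to a temporary directory instead of reusing transfer.hdf5:

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tables  # PyTables, the backend pandas uses for HDFStore


def list_hdf5_nodes(path):
    """Return (node path, shape) pairs for every leaf node in the file.

    Hypothetical helper for illustration; not part of pandas or PyTables.
    """
    with tables.open_file(path, mode="r") as h5:
        return [(leaf._v_pathname, leaf.shape)
                for leaf in h5.walk_nodes("/", classname="Leaf")]


# Write a small frame the same way as the export above, then list its nodes.
df = pd.DataFrame({"first": np.random.rand(5),
                   "second": np.random.rand(5),
                   "class": np.random.randint(0, 2, 5)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transfer.hdf5")
    with pd.HDFStore(path, mode="w", complib="zlib", complevel=5) as store:
        store.put("dataframe", df, format="fixed")
    nodes = list_hdf5_nodes(path)

for name, shape in nodes:
    print(name, shape)
```

The listing should contain one `*_values` and one `*_items` node per dtype block under /dataframe, plus the axis nodes. Those `*_values` / `*_items` pairs are what the R function picks up.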

# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.

library(rhdf5)

loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)

  # Find all data nodes: values are stored in *_values and the corresponding
  # column titles in *_items
  data_nodes <- grep("_values$", listing$name)
  name_nodes <- grep("_items$", listing$name)

  data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns <- list()
  for (idx in seq_along(data_paths)) {
    # h5read returns each block with its dimensions reversed, so transpose it
    # back to rows x columns before attaching the column names
    entry <- data.frame(t(h5read(h5File, data_paths[idx])))
    colnames(entry) <- as.character(h5read(h5File, name_paths[idx]))
    columns <- append(columns, entry)
  }

  # Combine the columns of all blocks into a single data.frame
  data.frame(columns)
}

Now you can import the DataFrame:

> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1
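
To cross-check the R result, the same block-by-block reassembly that loadhdf5data performs can be sketched in Python by reading the raw nodes with PyTables instead of `pd.read_hdf`. This is only an illustration under assumptions: `load_fixed_blocks` is a hypothetical name, the key "dataframe" matches the export above, and like the R function it ignores the stored row index:

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tables  # PyTables, the backend pandas uses for HDFStore


def load_fixed_blocks(path, key="dataframe"):
    """Rebuild a DataFrame from the block*_values / block*_items nodes.

    Hypothetical helper mirroring the R loader; the row index is not restored.
    """
    frames = []
    with tables.open_file(path, mode="r") as h5:
        nodes = {node._v_name: node for node in h5.list_nodes("/" + key)}
        for name in sorted(n for n in nodes if n.endswith("_values")):
            block = nodes[name].read()  # stored as rows x columns
            items = nodes[name.replace("_values", "_items")].read()
            # Column names are stored as byte strings; decode them.
            cols = [c.decode("utf-8") if isinstance(c, bytes) else str(c)
                    for c in items]
            frames.append(pd.DataFrame(block, columns=cols))
    # Place the blocks side by side, one column group per dtype.
    return pd.concat(frames, axis=1)


# Round trip on a small frame written the same way as the export above.
df = pd.DataFrame({"first": np.random.rand(5),
                   "second": np.random.rand(5),
                   "class": np.random.randint(0, 2, 5)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transfer.hdf5")
    with pd.HDFStore(path, mode="w", complib="zlib", complevel=5) as store:
        store.put("dataframe", df, format="fixed")
    loaded = load_fixed_blocks(path)

print(loaded.head())
```

As in the R output, the column order follows the dtype blocks in the file rather than the order used when the DataFrame was created.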

I hope this helps someone. :-)
