Comparing various CSV reading packages

Overview

A CSV reader benchmark suite. Results are regularly published on the Queryverse benchmark page.

Setup

You need Julia 1.1.0, Julia 0.6.4, Python 3.X and R installed.

For Julia 0.6.4, you need to install the CSV.jl and TextParse.jl package.

For Julia 1.1.0, activate the project in the root folder and instantiate it.

For Python, install pandas and pyarrow.

For R, install data.table and readr.

On Windows, run deps/build.jl to download EmptyStandbyList.exe executable.

Generating test data

Run src/writedata.jl to generate the input data.

WARNING: The writedata.jl script generates a lot of data, currently about 150 GB. Make sure you have enough disc space!

You can configure the details of the write process by adding a file src/local_writeconfig.jl, and then adding any of the following lines to configure various aspects:

# Configure what row number cases you want to generate
ns = [100, 10_000, 1_000_000]

# Configure what column number cases you want to generate
cs = [20, 200]

# Share of missing values in columns with missing data
missing_share = 0.5

# Configure how many rows one very large test file should have
largefile_rown = 150_000_000

# Configure what colun types you want to generate
uniform_types = [
    :float64,      # Float64 with all digits
    :float64short, # Float64 with only 4 digist after the decimal point
    :int64,        # Int64
    :datetime,     # DateTime
    :string,       # Random string
    :stringcat,    # Randomly pick one out of 5 strings for each cell
    :stringescaped # A string that includes an escaped quote char
]

Running the benchmarks

Run the src/main.jl script to run the benchmarks. You need to run the script in elevated mode, so on Windows as an administrator, on Mac and Linux with sudo.

You can configure the details of the benchmarking process by adding a file src/local_config.jl, and then adding any of the following lines to configure various aspects:

# Configure which row cases you want to benchmark
ns = [100, 10_000, 1_000_000]

# Configure which column cases you want to benchmark
cs = [20, 200]

# Configure how many samples you want to take per benchmarking case
samples = 5

# Configure which packages to run
tests_to_run = [
    :textparse,       # TextParse.jl
    :csvfiles,        # CSVFiles.jl
    :textparse06,     # TextParse.jl on Julia 0.6
    :csv,             # CSV.jl
    :csv06,           # CSV.jl on Julia 0.6
    # :csvreader,     # CSVReader.jl
    # :tablereader,   # TableReader.jl
    :pandas,          # Pandas.jl
    :rfreads,         # R data.table single threaded
    :rfreadp,         # R data.table parallel
    :rreadr,          # R readr
    :dataframes,      # DataFrames.jl
    :pythonpandas,    # Python pandas
    :pythonarrows,    # Python pyarrow single threaded
    :pythonarrowp     # Python pyarrow parallel
]

# Path to Julia 0.6 binary
jl06bin = "julia-0.6"

# Path to Julia 1.1 binary
jlbin = "julia"

# Path to Rscript binary
rbin = "Rscript"

# Path to Python binary
pythonbin = "python3"

Contributing

Issues and PRs with contributions are most welcome!

Name		Name	Last commit message	Last commit date
Latest commit History 129 Commits
data		data
deps		deps
src		src
.gitignore		.gitignore
LICENSE		LICENSE
Manifest.toml		Manifest.toml
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Comparing various CSV reading packages

Overview

Setup

Generating test data

Running the benchmarks

Contributing

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

davidanthoff/csv-comparison

Folders and files

Latest commit

History

Repository files navigation

Comparing various CSV reading packages

Overview

Setup

Generating test data

Running the benchmarks

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages