Open source Python libraries for ETL pipelines
If you’re familiar with the Python programming language, you’re probably already acquainted with Pandas and NumPy, two of Python’s most widely used modules for processing data from various sources. For those of you who are less acquainted, we have provided a brief overview of both packages below.
Pandas
In the wild, giant pandas evolved vertical pupils (similar to cats) that give them excellent night vision. It’s useful to think of Python modules, such as Pandas, in the same way, as evolutionary adaptations. Modules such as Pandas are specific augmentations to a programming language such as Python that make tasks not only easier to complete but typically clearer to read and shorter to write.
Similar to its furry counterpart, the Pandas Python module is powerful and flexible, and it was designed to be as close to a one-stop shop for processing data files as reasonably possible. Conventionally imported under the pd alias, the Pandas library provides import functions for the most common data sources (such as CSV, TXT, and Excel) with simple, human-readable commands such as pd.read_csv(file_path). It is also well suited to merging data sources together or aggregating data to the desired level.
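As a rough illustration of the import, merge, and aggregate workflow described above, here is a minimal sketch. The column names and the in-memory CSV data are hypothetical stand-ins for real files; in practice you would pass a file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV data; in a real pipeline these would be file paths.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,101,25.0\n"
    "2,102,40.0\n"
    "3,101,15.5\n"
)
customers_csv = io.StringIO(
    "customer_id,region\n"
    "101,North\n"
    "102,South\n"
)

# pd.read_csv accepts a file path or any file-like object.
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Merge the two sources on their shared key column.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate order amounts up to the region level.
totals = merged.groupby("region", as_index=False)["amount"].sum()
print(totals)
```

The left merge keeps every order even if a customer record is missing, which is often the safer default when combining sources of uneven quality.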
Pandas is a plug-and-play module that is easily installed directly in your local environment (though for the purposes of this book, we highly recommend installing it within your virtual pipenv environment; more on that later).
It’s important to keep in mind that most Python modules rely on the CPUs and memory available on your local device, which means that one limitation of Pandas is its processing capacity. When data is imported with Pandas, it is held in your device’s local memory for the duration of the script. This becomes problematic very quickly as multiple copies of larger and larger datasets are created, even if only during a single run of the script.
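To make the memory cost concrete, the following sketch measures how much memory a DataFrame occupies before and after a simple transformation. The dataset and the column names a and b are hypothetical, chosen only to show that the transformed result is a second, full-size object held alongside the original.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one million rows of 64-bit floats in two columns.
df = pd.DataFrame({
    "a": np.arange(1_000_000, dtype="float64"),
    "b": np.ones(1_000_000, dtype="float64"),
})

# Bytes held in memory by the column data (index excluded).
original_bytes = int(df.memory_usage(index=False, deep=True).sum())

# A transformation returns a new DataFrame; the original stays in memory
# for as long as the name `df` still refers to it.
scaled = df * 2
scaled_bytes = int(scaled.memory_usage(index=False, deep=True).sum())

print(f"original: {original_bytes / 1e6:.1f} MB")
print(f"scaled:   {scaled_bytes / 1e6:.1f} MB")
print(f"both:     {(original_bytes + scaled_bytes) / 1e6:.1f} MB")
```

Each intermediate result kept in a variable adds to the total, so a script with several transformation steps can hold many times the size of the input data in memory.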
Feel free to comb through the documentation for more information about Pandas: https://pypi.org/project/pandas/.
Within your virtual environment, execute the following command to install Pandas into your project environment. We have provided the full installation output so you can become familiar with what pipenv outputs for a successful package installation:
(Project) usr@project % pipenv install pandas
Installing pandas...
Adding pandas to Pipfile's [packages]..
Installation Succeeded
Pipfile.lock not found, creating
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
Success!
Updated Pipfile.lock (950830)!
Installing dependencies from Pipfile.lock (950830)
NumPy
When it comes to crunching numbers, NumPy is your guy. NumPy is short for “Numerical Python” and was designed by mathematicians and statisticians who like to keep the mathematical jargon (and therefore integrity) under the pretty hood of a lovely np abbreviation. Like Pandas, NumPy is a quick package to install when starting a new data processing script, and it can be used for anything from defining and converting data types within a data structure to merging multiple data sources into one clear, mathematically correct aggregate. Also like Pandas, NumPy runs easily in your local environment. For a more in-depth overview of NumPy, feel free to reference the documentation: https://pypi.org/project/numpy/.
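The two uses mentioned above, converting data types and combining sources into a single aggregate, can be sketched as follows. The values and variable names are hypothetical; they stand in for readings that arrive as strings from two separate sources.

```python
import numpy as np

# Hypothetical readings that arrived as strings from two sources.
source_a = np.array(["1.5", "2.5", "3.0"])
source_b = np.array(["4.0", "5.0"])

# Convert the string data to 64-bit floats.
a = source_a.astype(np.float64)
b = source_b.astype(np.float64)

# Combine both sources into one array, then aggregate.
combined = np.concatenate([a, b])
total = combined.sum()
mean = combined.mean()

print(combined.dtype, total, mean)
```

Setting the data type explicitly before aggregating avoids surprises such as string concatenation or silent integer truncation further down the pipeline.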
In your virtual environment, execute the following command to install NumPy into your project environment. As with the Pandas installation above, the Pipfile.lock file is updated as part of each new package installation:
(Project) usr@project % pipenv install numpy
Installing numpy
Adding numpy to Pipfile's [packages]
Installation Succeeded
Pipfile.lock (950830) out of date, updating to (1c7351)
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
Success!
Updated Pipfile.lock (1c7351)!
Installing dependencies from Pipfile.lock (1c7351)
Pandas and NumPy are both incredibly powerful and useful Python packages, but they have their limitations. Both generally store data in contiguous blocks of memory, so even simple data manipulation operations become costly, because each new version of the data needs its own contiguous block. Even in large-capacity environments, both packages can perform poorly (and unbearably slowly) once a dataset approaches the size of the available memory. Since the premise of this book is creating reliable, scalable data pipelines, restricting our code base to Pandas and NumPy simply won’t do.
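The contiguous-memory behavior can be observed directly in NumPy. The following is a small sketch, using a hypothetical array of one million integers, that checks whether the array is stored contiguously and confirms that an element-wise operation allocates a separate array rather than reusing the original memory.

```python
import numpy as np

# Hypothetical array of one million 64-bit integers.
arr = np.arange(1_000_000, dtype=np.int64)

# A freshly created array occupies a single contiguous block of memory.
is_contiguous = bool(arr.flags["C_CONTIGUOUS"])

# An element-wise operation allocates a second array of the same size.
doubled = arr * 2
shares = np.shares_memory(arr, doubled)

print("contiguous:", is_contiguous)
print("shares memory with original:", shares)
print("bytes held by both arrays:", arr.nbytes + doubled.nbytes)
```

Because the result does not share memory with the input, every step of a multi-step transformation needs its own block as large as the data itself.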