Building ETL Pipelines with Python: Create and deploy enterprise-ready ETL pipelines by employing modern methods

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781804615256

Length 246 pages

Edition 1st Edition

Languages

Python

Tools

AWS

Concepts

Data Streaming

Authors (2):

Brij Kishore Pandey

Emily Ro Schoof

Table of Contents (22) Chapters

Preface

1. Part 1: Introduction to ETL, Data Pipelines, and Design Principles

2. Chapter 1: A Primer on Python and the Development Environment

3. Chapter 2: Understanding the ETL Process and Data Pipelines

4. Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines

5. Part 2: Designing ETL Pipelines with Python

6. Chapter 4: Sourcing Insightful Data and Data Extraction Strategies

7. Chapter 5: Data Cleansing and Transformation

8. Chapter 6: Loading Transformed Data

9. Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python

10. Chapter 8: Powerful ETL Libraries and Tools in Python

11. Part 3: Creating ETL Pipelines in AWS

12. Chapter 9: A Primer on AWS Tools for ETL Processes

13. Chapter 10: Tutorial – Creating an ETL Pipeline in AWS

14. Chapter 11: Building Robust Deployment Pipelines in AWS

15. Part 4: Automating and Scaling ETL Pipelines

16. Chapter 12: Orchestration and Scaling in ETL Pipelines

17. Chapter 13: Testing Strategies for ETL Pipelines

18. Chapter 14: Best Practices for ETL Pipelines

19. Chapter 15: Use Cases and Further Reading

20. Index

21. Other Books You May Enjoy

Open source Python libraries for ETL pipelines

If you’re familiar with the Python programming language, you’re probably already acquainted with Pandas and NumPy, two of Python’s most widely used packages for processing data from various sources. For those who are less acquainted, a brief overview of both packages follows.

Pandas

In the wild, giant pandas evolved vertical pupils (similar to those of cats) that give them excellent night vision. It’s useful to think of Python modules, such as Pandas, in the same way as evolutionary adaptations: they are specific augmentations to a programming language such as Python that make tasks easier to complete, typically with more clarity and less code.

Similar to its furry counterpart, the Pandas Python module is powerful and flexible, and it was designed to be as close to a one-stop shop for processing data files as reasonably possible. Conventionally imported under the pd alias, the Pandas library contains import functions for the most common data sources (such as CSV, TXT, and Excel) with simple, human-readable commands such as pd.read_csv(file_path). It is also highly effective for merging data sources together or aggregating data to the desired level.
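As a minimal sketch of the import, merge, and aggregate workflow described above, the following example reads two small CSV sources, joins them, and sums a column by group. The data, column names, and variable names are hypothetical and exist only for illustration; in-memory text buffers stand in for files on disk so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for two files on disk.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,100,25.0\n"
    "2,101,40.0\n"
    "3,100,15.5\n"
)
customers_csv = io.StringIO(
    "customer_id,region\n"
    "100,East\n"
    "101,West\n"
)

# pd.read_csv accepts a file path or any file-like object.
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Merge the two sources on their shared key column.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate the order amounts up to the region level.
totals = merged.groupby("region", as_index=False)["amount"].sum()
print(totals)
```

With real files, you would pass a path string to pd.read_csv in place of the text buffers used here.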

Pandas is a nice plug-and-play module that is easily installed directly in your local environment (though for the purposes of this book, we highly recommend installing it within your virtual pipenv environment, which we cover later).

It’s important to keep in mind that most Python modules rely on the CPU and memory available on your local device, which means that one limitation of Pandas is its processing capacity. When data is imported with Pandas, it is held in your device’s local memory for the duration of the script. This becomes problematic very quickly as multiple copies of larger and larger datasets are created, even if only during a single run of the script.
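To make the memory concern concrete, here is a small sketch that measures how many bytes a DataFrame occupies and shows that a full copy of it occupies the same amount again. The dataset is hypothetical (one million randomly generated rows), and the exact figures on your machine may differ slightly by Pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one million rows of random floats and integers.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "value": rng.random(1_000_000),
    "group": rng.integers(0, 10, size=1_000_000),
})

# Bytes held in memory by the original DataFrame (index included).
original_bytes = int(df.memory_usage(deep=True).sum())

# A deep copy duplicates the underlying data rather than sharing it.
snapshot = df.copy(deep=True)
snapshot_bytes = int(snapshot.memory_usage(deep=True).sum())

print(f"original: {original_bytes / 1e6:.1f} MB")
print(f"copy:     {snapshot_bytes / 1e6:.1f} MB")
print(f"combined: {(original_bytes + snapshot_bytes) / 1e6:.1f} MB")
```

Both objects stay in memory for as long as they are referenced, so a script that keeps several intermediate copies of a large dataset needs a multiple of the dataset's size in RAM.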

Feel free to comb through the documentation for more information about Pandas: https://pypi.org/project/pandas/.

Within your virtual environment, please execute the following command to install Pandas into your project environment. We have included the full installation output so that you can become familiar with what pipenv prints for a successful package installation:

(Project) usr@project % pipenv install pandas
Installing pandas...
Adding pandas to Pipfile's [packages]...
 Installation Succeeded
Pipfile.lock not found, creating
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
 Success!
Updated Pipfile.lock (950830)!
Installing dependencies from Pipfile.lock (950830)
      0/0 — 0

NumPy

When it comes to crunching numbers, NumPy is your go-to. The name is short for “Numerical Python,” and the package was designed by mathematicians and statisticians who like to keep the mathematical jargon (and therefore integrity) under the pretty hood of a lovely np abbreviation. Like Pandas, NumPy is a quick package to install when starting a new data processing script, and it can be used for anything from defining and converting data types within a data structure to merging multiple data sources into one clear, mathematically correct aggregate. Also like Pandas, NumPy runs easily in your local environment. For a more in-depth overview of NumPy, feel free to reference the documentation: https://pypi.org/project/numpy/.
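The following sketch illustrates two of the uses mentioned above: converting the data type of an array and combining two sources into a single aggregate. The values and variable names are hypothetical and chosen only to keep the example short.

```python
import numpy as np

# Hypothetical readings that arrived as strings from a raw source.
raw = np.array(["1.5", "2.0", "3.5", "4.0"])

# Convert the data type from string to 64-bit float.
readings = raw.astype(np.float64)

# Combine with a second (hypothetical) source, then aggregate.
more_readings = np.array([5.0, 6.0])
combined = np.concatenate([readings, more_readings])

print(combined.dtype)
print(combined.sum())
print(combined.mean())
```

Because the conversion happens once on the whole array, the later arithmetic runs on native floats rather than on Python strings.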

In your virtual environment, please execute the following command to install NumPy into your project environment. As with the earlier Pandas installation, the Pipfile.lock file is updated as part of each new package installation:

(Project) usr@project % pipenv install numpy
Installing numpy
Adding numpy to Pipfile's [packages]
 Installation Succeeded
Pipfile.lock (950830) out of date, updating to (1c7351)
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
 Success!
Updated Pipfile.lock (1c7351)!
Installing dependencies from Pipfile.lock (1c7351)
      0/0 — 0

Pandas and NumPy are both incredibly powerful and useful Python packages, but they have their limitations. Both require contiguous allocations of memory, so even simple data manipulation operations become costly, as each new version of the data is stored in its own contiguous block of memory. Even in large-capacity environments, both packages perform rather poorly (and unbearably slowly) on large datasets. Since the premise of this book is creating reliable, scalable data pipelines, restricting our code base to Pandas and NumPy simply won’t do.
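As a rough illustration of the cost described above, this sketch checks that a NumPy array sits in one contiguous block and shows that a simple arithmetic operation allocates a second block of the same size. The array size and variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical array of one million 64-bit floats (8 bytes each).
data = np.arange(1_000_000, dtype=np.float64)

# A freshly created array occupies a single contiguous block of memory.
print(data.flags["C_CONTIGUOUS"])
print(data.nbytes)

# A simple operation returns a new array in its own block of memory.
doubled = data * 2
print(np.shares_memory(data, doubled))

# Both arrays are alive at once, so the footprint has doubled.
total_bytes = data.nbytes + doubled.nbytes
print(total_bytes)
```

The same pattern repeats for every intermediate result that a script keeps, which is why memory pressure grows quickly on large datasets.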