Building ETL Pipelines with Python

You're reading from Building ETL Pipelines with Python: Create and deploy enterprise-ready ETL pipelines by employing modern methods

Product type: Paperback
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781804615256
Length: 246 pages
Edition: 1st Edition
Authors (2): Brij Kishore Pandey, Emily Ro Schoof
Table of Contents (22)

Preface
Part 1: Introduction to ETL, Data Pipelines, and Design Principles
Chapter 1: A Primer on Python and the Development Environment
Chapter 2: Understanding the ETL Process and Data Pipelines
Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
Part 2: Designing ETL Pipelines with Python
Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
Chapter 5: Data Cleansing and Transformation
Chapter 6: Loading Transformed Data
Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python
Chapter 8: Powerful ETL Libraries and Tools in Python
Part 3: Creating ETL Pipelines in AWS
Chapter 9: A Primer on AWS Tools for ETL Processes
Chapter 10: Tutorial – Creating an ETL Pipeline in AWS
Chapter 11: Building Robust Deployment Pipelines in AWS
Part 4: Automating and Scaling ETL Pipelines
Chapter 12: Orchestration and Scaling in ETL Pipelines
Chapter 13: Testing Strategies for ETL Pipelines
Chapter 14: Best Practices for ETL Pipelines
Chapter 15: Use Cases and Further Reading
Index
Other Books You May Enjoy

Open source Python libraries for ETL pipelines

If you’re familiar with the Python programming language, you’re probably already acquainted with Pandas and NumPy, two of Python’s most widely used modules for processing data from various sources. For those of you who are less acquainted, we have provided a brief overview of both packages next.

Pandas

In the wild, giant pandas evolved vertical pupils (like cats), which give them excellent night vision. It’s useful to think of Python modules, such as Pandas, in the same context as evolutionary adaptations. Modules such as Pandas are specific augmentations to programming languages such as Python that make tasks not only easier to perform but typically achievable with more clarity and less code.

Similar to its furry counterpart, the Pandas Python module is powerful and flexible, and it was designed to be as close to a one-stop shop for processing data files as reasonably possible. Conventionally imported as pd, the Pandas library contains import functions for the most common data sources (such as CSV, JSON, and Excel) with simple, human-readable commands such as pd.read_csv(file_path). It’s also highly effective for merging data sources together or aggregating data to the desired level.
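As a rough sketch of those human-readable commands, the snippet below builds two small DataFrames in memory (so the example is self-contained; in practice you would start from something like pd.read_csv(file_path)), then merges and aggregates them. The column and table names here are illustrative, not from the book:

```python
import pandas as pd

# In practice you would load data with pd.read_csv(file_path);
# small in-memory frames are used here so the sketch runs as-is.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ana", "ben", "ana", "cara"],
    "amount": [10.0, 25.5, 7.25, 40.0],
})
customers = pd.DataFrame({
    "customer": ["ana", "ben", "cara"],
    "region": ["east", "west", "east"],
})

# Merge two data sources on a shared key...
merged = orders.merge(customers, on="customer", how="left")

# ...then aggregate to the desired level.
totals = merged.groupby("region")["amount"].sum()
print(totals["east"])  # 57.25
```

The merge-then-groupby pattern shown here is the typical Pandas idiom for the "merging and aggregating" work described above.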

Pandas is a nice plug-and-play module that is easily installed directly in your local environment (though for the purposes of this book, we highly recommend installing it within your virtual pipenv environment – more on that later).

It’s important to keep in mind that most Python modules rely on the CPUs available on your local device, which means that one limitation of Pandas is its processing capacity. When data is imported with Pandas, it is stored in your device’s local memory for the duration of the script. This becomes problematic very quickly, as multiple copies of larger and larger datasets are created, even if only for a script’s lifetime.
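One rough way to see this in-memory footprint (an illustration, not a profiler): every derived version of a DataFrame is a fresh copy that lives in local memory alongside the original.

```python
import numpy as np
import pandas as pd

# A modest frame: 1 million float64 values is ~8 MB of local memory.
df = pd.DataFrame({"x": np.arange(1_000_000, dtype=np.float64)})
print(df.memory_usage(deep=True)["x"])  # 8000000 (bytes)

# Each derived version is a separate copy held for the life of the
# script, so the footprint grows with every transformation.
df2 = df * 2
print(df2.memory_usage(deep=True)["x"])  # another 8000000 bytes
```

With a handful of intermediate copies of a multi-gigabyte dataset, this is exactly how a script exhausts local memory.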

Feel free to comb through the documentation for more information about Pandas: https://pypi.org/project/pandas/.

Within your virtual environment, please execute the following command to install Pandas into your project environment. We provided the full installation output so you can get more familiar with what pipenv outputs for successful package installation:

(Project) usr@project % pipenv install pandas
Installing pandas...
Adding pandas to Pipfile's [packages]...
 Installation Succeeded
Pipfile.lock not found, creating
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
 Success!
Updated Pipfile.lock (950830)!
Installing dependencies from Pipfile.lock (950830)
      0/0 — 0

NumPy

When it comes to crunching numbers, NumPy is your guy. NumPy is short for “Numerical Python” and was designed by mathematicians and statisticians who like to keep the mathematical jargon (and therefore integrity) under the pretty hood of a lovely np abbreviation. Like Pandas, NumPy is a quick package install when starting a new script for data processing, as it can be used for anything ranging from defining and converting data types within a data structure to merging multiple data sources into one clear, mathematically correct aggregate. Also like Pandas, NumPy runs easily in your local environment. For a more in-depth overview of NumPy, feel free to reference the documentation: https://pypi.org/project/numpy/.
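A minimal sketch of the two NumPy tasks just mentioned, dtype conversion and merging sources into one aggregate (the values here are made up for illustration):

```python
import numpy as np

# Define and convert data types within a data structure:
readings = np.array(["1.5", "2.5", "3.0"])  # strings from a raw source
values = readings.astype(np.float64)        # explicit dtype conversion

# Merge multiple sources into one mathematically sound aggregate:
more_values = np.array([4.0, 5.0])
combined = np.concatenate([values, more_values])
print(combined.mean())  # 3.2
```

astype and concatenate are the standard NumPy calls for exactly the conversion-and-merge workflow described above.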

In your virtual environment, please execute the following command to install NumPy into your project environment. Similar to the Pandas installation previously, the Pipfile.Lock file is updated as part of each new package installation:

(Project) usr@project % pipenv install numpy
Installing numpy...
Adding numpy to Pipfile's [packages]...
 Installation Succeeded
Pipfile.lock (950830) out of date, updating to (1c7351)
Locking [dev-packages] dependencies
Locking [packages] dependencies
Building requirements
Resolving dependencies
 Success!
Updated Pipfile.lock (1c7351)!
Installing dependencies from Pipfile.lock (1c7351)
      0/0 — 0

Pandas and NumPy are both incredibly powerful and useful Python packages, but they have their limitations. Both require a contiguous allocation of memory; thus, simple data manipulation operations become quite costly, as each new version of the data is stored in a fresh contiguous block of memory. Even in large-capacity environments, both packages perform rather poorly (and unbearably slowly) on large datasets. Since the premise of this book is creating reliable, scalable data pipelines, restricting our code base to Pandas and NumPy simply won’t do.
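The contiguous-memory behavior is easy to observe directly in NumPy. This sketch only illustrates the allocation pattern; it is not a benchmark:

```python
import numpy as np

# NumPy arrays live in one contiguous block of memory.
a = np.arange(1_000_000, dtype=np.float64)
print(a.flags["C_CONTIGUOUS"])  # True

# Even a simple elementwise operation allocates a second, equally
# large contiguous buffer rather than updating values in place.
b = a * 2
print(b.nbytes)               # 8000000
print(np.shares_memory(a, b)) # False: a separate copy was made
```

Scaled up to datasets larger than available RAM, each such copy is another full-size contiguous allocation, which is the cost the paragraph above describes.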
