PETL is an expressive ETL (Extract, Transform and Load) library for Python data pipelines.
Writing data pipelines can be repetitive work, and a good library that helps with the process can really speed things up, particularly when those pipelines are automated and a failure has to be diagnosed at an inconvenient time. ETL (Extract, Transform, Load) tools specialise in taking data from one place, manipulating it and sending it somewhere else. PETL is just such a tool, and it's probably the first one I reach for when I'm doing any data pipeline work in Python.
PETL pipelines are chains of method calls, each of which typically has a single purpose. These can get quite complex: for illustrative purposes, I'll keep it simple. The pipeline below:

- reads a CSV file,
- converts the date column into a proper Python datetime,
- converts the remaining values to numbers,
- selects the rows with a count greater than 10, and
- writes the result out as an HTML table.

import petl
(
    petl.fromcsv('input.csv')
    .convert('date', petl.datetimeparser('%Y-%m-%d'))
    .convertnumbers()
    .selectgt('count', 10)
    .tohtml('output.html')
)
PETL supports a wide variety of common formats such as delimited files (like CSV), JSON files, Excel and Google Sheets, as well as any database supported by SQLAlchemy.
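As a quick sketch (the file names here are invented for illustration), moving data between formats is mostly a matter of swapping the from*/to* calls:

import petl

# Hypothetical file names, for illustration only.
table = petl.fromjson('records.json')  # a JSON array of objects
petl.toxlsx(table, 'records.xlsx')     # Excel output (needs openpyxl)
petl.tocsv(table, 'records.csv')       # plain CSV output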
It's also possible to create custom I/O sources and targets to wrap up anything else you need to read from or write to.
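One such helper that ships with PETL is MemorySource, which wraps an in-memory bytes buffer so it can be read like a file; a minimal sketch:

import petl

# Wrap raw bytes so fromcsv can read them as if from a file.
csv_bytes = b"name,count\nalpha,12\nbeta,7\n"
table = petl.fromcsv(petl.MemorySource(csv_bytes))
print(petl.look(table))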
It also supports remote sources and targets via the fsspec library, enabling it to read from and write to a range of widely used storage services such as AWS S3, Azure Blob Storage and SFTP.
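With fsspec and the relevant backend installed (s3fs in this case), a remote location is addressed simply by URL; the bucket and paths below are made up:

import petl

# Requires fsspec plus the s3fs backend; bucket and keys are hypothetical.
table = petl.fromcsv('s3://example-bucket/raw/input.csv')
petl.tocsv(table, 's3://example-bucket/processed/output.csv')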
The core of the PETL library is its built-in transformations, which operate on rows or columns. There is a wide range of them, from simple operations such as filtering, subsetting columns, converting values and sorting, through to more complex operations like reshaping, aggregation, and database-style joins. If the built-in transformations are not quite up to the task, it is usually possible to build what you need using the generic row-transforming methods such as rowmap(). It's also possible to convert a PETL table into a pandas DataFrame, which is particularly useful when there is some really complex processing to perform.
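Here's a hedged sketch chaining a few of those transformations together (the column and file names are invented for the example):

import petl

summary = (
    petl.fromcsv('sales.csv')            # hypothetical input
    .convertnumbers()
    .cut('region', 'amount')             # subset columns
    .selectgt('amount', 0)               # filter rows
    .aggregate('region', sum, 'amount')  # total amount per region
    .sort('value', reverse=True)         # aggregate's output column is 'value'
)

df = summary.todataframe()  # hand off to pandas for the really complex work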
You can take a reference to a partially built pipeline and apply different processing to it: perhaps saving different files with the results of each branch. PETL attempts to be memory efficient, and doesn't read all the data into memory until the pipeline calls for it, either by writing the data out or by performing an operation that needs the whole table, such as aggregation. Where a table is reused, you can avoid unnecessary recalculation with a call to the cache() method.
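For instance, one cached upstream stage can feed two branches; a sketch with invented file names:

import petl

base = (
    petl.fromcsv('input.csv')  # hypothetical input file
    .convertnumbers()
    .cache()  # avoid re-reading and re-converting for each branch
)

base.selectgt('count', 10).tocsv('high.csv')
base.selectle('count', 10).tocsv('low.csv')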