The Docker Compose of ETL: Meerschaum Compose

This article is about Meerschaum Compose, a tool for defining ETL pipelines in YAML and a plugin for the data engineering framework…

The Docker Compose of ETL: Meerschaum Compose
Photo by CHUTTERSNAP on Unsplash
This article is about Meerschaum Compose, a tool for defining ETL pipelines in YAML and a plugin for the data engineering framework Meerschaum.

Docker was a game-changer, revolutionizing the way we design, build, and run our cloud applications. Pretty early on, however, developers realized its flexibility made collaboration difficult, so docker-compose became to the tool of choice for managing environments and multi-container projects.

In a similar vein, this issue of consistent environments also emerged for the ETL framework Meerschaum. As more data engineers built their pipelines using the platform, the dynamic nature of pipes meant that a solution was needed to provide project-level isolation.

Inspired by Docker Compose, this solution came in the form of a project called Meerschaum Compose. I use Compose daily at work and for my personal projects to build and manage my data pipelines, and today I’d like to show how you can build your ETL projects with Compose.


How Compose Tames the Engine

With great power comes great responsibility, and in Docker’s case, this responsibility is handled by a manifest file called docker-compose.yml which describes how services within an application should run. This file acts as living documentation that facilitates prototyping and describes expected environments to CloudOps. Through a simple standard, Docker Compose fills gaps in the development process by providing a convenient way to standardize and share environments for multi-service projects.

Meerschaum Compose serves a similar purpose: in a mrsm-compose.yml file, you specify everything you might need for a project: the expected environment, plugins, pipes, and connectors.

Commands

When you encounter a new Compose project (using the Tech Slam ‘N Eggs demo project in the screenshots below), try these commands to get your bearings:

Note: see this repository if you’d like to run Compose in a Docker container.

mrsm compose run
Registers the pipes and syncs them one-by-one. This is great for ensuring execution order and updating the pipes’ parameters with the database. One common pattern is to chain together various stages in separate compose files as part of a larger ETL process:

mrsm compose run --file mrsm-compose-00-extract.yaml && \ 
mrsm compose run --file mrsm-compose-01-transform.yaml && \ 
mrsm compose run --file mrsm-compose-02-load.yaml
Note: the command mrsm compose sync pipes is parallelized and is best used if the pipes within the compose file are independent.
The compose run command syncs pipes one-at-a-time.

mrsm compose explain
Parses the Compose file and prints the current environment and status of defined pipes. This is helpful when troubleshooting and understanding the structure of a project.

The compose explain command prints the status of defined pipes.

mrsm compose up --dry
Registers and updates the remote pipes’ parameters. This is implicitly run as part of mrsm compose run and should be run before standard Meerschaum actions like mrsm compose sync pipes. The --dry flag prevents syncing jobs from running.

Note: Compose will tag pipes with the project name. Set the key project_name if using multiple compose files within a project.

mrsm compose down -v
Stops jobs and deletes pipes. This is analogous to docker compose down -v (i.e. -v for “volumes”).

mrsm compose <action>
Executes standard Meerschaum actions from the project environment (e.g. sync pipes, delete pipes, custom actions). I often run mrsm compose python to jump into a REPL within the project environment.

Whenever you run an action with mrsm compose, the flag --tags {project_name} is appended (unless overridden) to ensure that you only interact with the pipes within the project.


An exampe Meerschaum Compose file.
An example Meerschaum Compose project for ETL on weather data.

Pipes

The humble pipe is Meerschaum’s abstraction for incremental ETL. Pipes have input and output connectors and store parameters to configure the behavior of their syncing processes. This may be as simple as a SQL query or may include custom keys for use in your plugins.

Meerschaum pipes created by Meerschaum Compose.
Pipes from the above Compose project displayed by the web UI

Because pipes’ metadata are stored alongside their tables, they’re easily editable (whether via edit pipes or on the web UI), which facilitates prototyping. But this dynamic nature introduces the same problem described at the beginning of this article: in order to scale development, a Compose file is needed to define a project’s components in a way that can be easily version-controlled.

According to the Meerschaum Compose specification, pipes are defined in a list under the keys sync:pipes. Each item defines the keys and parameters needed to construct the pipe, like a blueprint for what you expect the pipes in the database to reflect.

For example, the following snippet would define a pipe that would sync a table weather from a remote PostgreSQL database (defined below as sql:source) to a local SQLite file (sql:dest in this project).

sync: 
  pipes: 
    - connector: "sql:source" 
      metric: "weather" 
      target: "weather" 
      columns: 
        datetime: "timestamp" 
        station: "station" 
      parameters: 
        fetch: 
          backtrack_minutes: 1440 
        query: |- 
          SELECT timestamp, station, temperature 
          FROM weather 
 
 
config: 
  meerschaum: 
    instance: "sql:dest" 
    connectors: 
      sql: 
        source: "postgresql://user:pass@host:5432/db" 
        dest: "sqlite:////tmp/dest.db"

This example would incrementally update a table named weather using the datetime axis timestamp for range bounding (1 day backtracking), and this column plus the ID column station together would make up a composite primary key used for de-duplication.

The URI is written literally just as an example; if you are committing a compose file, either reference an environment variable (e.g. $SECRET_URI) or your host Meerschaum configuration (e.g. MRSM{meerschaum:connectors:sql:source}).

Connectors

First, a quick refresher on Meerschaum connectors: you can define connectors through several ways, the most popular of which being through environment variables. Suppose you define your connection secrets in an environment file:

export MRSM_SQL_REMOTE='postgresql://user:pass@host:5432/db' 
export MRSM_FOO_BAR='{ 
    "user": "abc", 
    "password": "def" 
}'

The first environment variable MRSM_SQL_REMOTE would define the connector sql:remote. If you sourced this file, you could verify this connector with the command mrsm show connectors sql:remote.

The second variable is an example of how to define a custom FooConnector, which you could create using the @make_connector decorator in a plugin. Custom connectors are a powerful tool, but for now, here’s the basic structure:

from meerschaum.connectors import make_connector, Connector 
 
@make_connector 
class FooConnector(Connector): 
    REQUIRED_ATTRIBUTES = ['username', 'password'] 
 
    def fetch(pipe, **kwargs): 
        docs = [] 
        return docs

So we’ve just reviewed how to define connectors in our host environment. Let’s see how to make these host connectors available in a Meerschaum project. In the compose file, all of the connectors we need for our project are defined under config:meerschaum:connectors. Use the MRSM{} syntax to reference the keys from your host environment and pass them into the project.

config: 
  meerschaum: 
    instance: "sql:app" 
    connectors: 
      sql: 
        app: MRSM{meerschaum:connectors:sql:remote} 
      foo: 
        bar: MRSM{meerschaum:connectors:foo:bar}

Plugins

Meerschaum is easily extendable via plugins, which are Python modules. Plugins may fetch data, implement custom connectors, and/or extend Meerschaum (e.g. custom actions, flags, API endpoints, etc.).

Meerschaum supports multiple plugins directories (via MRSM_PLUGINS_DIR), which may be set under the plugins_dir key in mrsm-compose.yaml (the default is a directory plugins).

Storing your plugins within a Compose project makes it clear how you expect your plugins to be used. For example, the Compose file within the MongoDBConnector project demonstrates how the custom connector is used as both a connector and as an instance.

Package Management

When you first start using Meerschaum Compose, the first thing you’ll notice is that it will start installing a fair amount of Python packages. Don’t worry about your environment ― everything is installed into virtual environments within your project’s root subdirectory (a bit ironic, right?). You can install your plugins’ dependencies with mrsm compose init.

To share packages between projects, set the key root_dir in mrsm-compose.yml to a new path. Deleting this root directory will effectively uninstall all of the packages that Compose downloaded, keeping your host environment intact.


Conclusion

Meerschaum Compose has become my go-to tool for building out my medium-scale ETL projects. It has a similar workflow to Meltano or dbt but with a lower barrier to entry and more dynamic control over the ETL process. It’s a neat workflow to organize plugins, connectors, and pipes in a way that works nicely in a team.

You can quickly kickstart a new project with the Meerschaum Compose template repository ― see the MongoDBConnector plugin or Tech Slam ‘N Eggs demo for practical examples.

Feel free to add your project to the Awesome Meerschaum list!