Getting Started

Prerequisites

Installation

  • via PyPI: pip install python-pachyderm

  • via source

    • download the desired version from the releases page (under Assets)

    • run make init from the root-level folder
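After either installation route, you may want to confirm that the package is visible to your interpreter. The following is a minimal sketch using only the standard library (Python 3.8+ for importlib.metadata); the helper name installed_version is hypothetical and not part of python-pachyderm.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "python-pachyderm" is the distribution name used in the pip command above
version = installed_version("python-pachyderm")
print(version if version else "python-pachyderm is not installed")
```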

Hello World example

Creates a repo, commits data to the repo (versioning the data), and reads the data back from the repo.

import python_pachyderm

# Connects to a pachyderm cluster on localhost:30650
# For other connection options, see the API docs
client = python_pachyderm.Client()

# Create a pachyderm repo called `test`
client.create_repo("test")

# Create a file in (repo="test", branch="master") at `/dir_a/data.txt`
# Similar to `pachctl put file test@master:/dir_a/data.txt`
with client.commit("test", "master") as commit:
    client.put_file_bytes(commit, "/dir_a/data.txt", b"hello world")

# Get the file
f = client.get_file(("test", "master"), "/dir_a/data.txt")
print(f.read())  # >>> b"hello world"

Since client.get_file() returns a file-like object, you can pass it into your favorite analysis packages.

import pandas as pd

f = client.get_file(("my_repo", "my_branch"), "/path_to/my_data.csv")
df = pd.read_csv(f)
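Some packages expect text rather than bytes. The sketch below shows one way to handle that with only the standard library: it reads the bytes, decodes them, and parses them as CSV. It uses io.BytesIO as a stand-in for the object returned by client.get_file(), assuming only that the object has a read() method returning bytes; the helper name rows_from_binary_file and the sample data are hypothetical.

```python
import csv
import io


def rows_from_binary_file(f, encoding="utf-8"):
    """Parse CSV rows from any object whose read() returns bytes."""
    text = f.read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Stand-in for client.get_file(...); the contents are made up for illustration
fake_file = io.BytesIO(b"word,count\nhello,3\nworld,1\n")
rows = rows_from_binary_file(fake_file)
print(rows[0]["word"], rows[0]["count"])  # prints: hello 3
```

Reading the whole file into memory keeps the sketch simple; for large files a streaming approach would be preferable.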

Hello World example with Pachyderm Pipelines

Creates a data-driven pipeline that transforms the data and outputs the results to a new repo. This example pipeline counts the occurrences of the word hello in the test repo. It continues from the example above.

from python_pachyderm.service import pps_proto

# Create a pipeline that logs frequency of the word "hello" in `test`
# repo to a file in the `word_count` repo (which is created automatically)
# Any time data is committed to the `test` repo, this pipeline will
# automatically trigger.
client.create_pipeline(
    "word_count",
    transform=pps_proto.Transform(
        cmd=["bash"],
        stdin=[
            "grep -roh hello /pfs/test/ | wc -w > /pfs/out/count.txt"
        ]
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(repo="test", branch="master", glob="/")
    )
)

# Wait for new commit, triggered by pipeline run, to finish
client.wait_commit(("word_count", "master"))

# Check `count.txt` for "hello" count
f = client.get_file(("word_count", "master"), "count.txt")
print(f.read())  # >>> b"1\n" (wc ends its output with a newline)

# Add more data to the `test` repo
with client.commit("test", "master") as commit:
    client.put_file_bytes(commit, "/data2.txt", b"hello hello from the top of the world")

# Wait for commit to finish
client.wait_commit(commit.id)

# Check `count.txt` for "hello" count
f = client.get_file(("word_count", "master"), "count.txt")
print(f.read())  # >>> b"3\n" (wc ends its output with a newline)
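To see where the counts of 1 and 3 come from, the pipeline's shell command can be approximated locally without a cluster. This is only a sketch: count_hello and the in-memory repo dict are hypothetical stand-ins for the pipeline and the test repo, and are not part of python-pachyderm.

```python
def count_hello(files):
    """Approximate `grep -roh hello <dir> | wc -w` over in-memory files.

    `files` maps a path to its byte contents. Each non-overlapping match
    of b"hello" counts once, because every match printed by `grep -o`
    is a single word to `wc -w`.
    """
    return sum(content.count(b"hello") for content in files.values())


# Stand-in for the `test` repo after the first commit
repo = {"/dir_a/data.txt": b"hello world"}
first = count_hello(repo)

# Stand-in for the `test` repo after the second commit
repo["/data2.txt"] = b"hello hello from the top of the world"
second = count_hello(repo)

print(first, second)  # prints: 1 3
```

Because the glob pattern is "/", the pipeline sees the whole repo as one unit, which is why the second count covers both files rather than only the newly added one.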

For more sophisticated examples, see the examples directory. To learn more about what you can do with Pachyderm, check out the docs.