Open source core · Used by 12 teams in private beta

Batch LLM transformations
for your data pipelines

Define AI data transformations in YAML. Classify, extract, and generate at scale with type-safe outputs, automatic retries, and rate-limit handling.

$ pip install affinebox
~/project
$ affinebox run classify_tickets.yaml --source tickets.csv
AffineBox v0.3.1
---
▸ Loading pipeline: classify_tickets.yaml
▸ Source: tickets.csv (2,847 rows)
▸ Model: gpt-4o-mini | Batch size: 50 | Retries: 3
---
Processing ████████████████████ 2847/2847 [3m 42s]
---
✓ Done. 2,847 rows classified.
Categories: billing (847), technical (1,203), feature_request (612), other (185)
Output: ./output/tickets_classified.jsonl
Schema validation: 2,847/2,847 passed
Cost: $1.24 (gpt-4o-mini)

Everything is a YAML config

Define your data source, transformations, and output schema in a single file. AffineBox handles batching, retries, rate limits, and output validation.

classify_tickets.yaml
name: "classify_support_tickets"
version: "1"

source:
  type: csv
  path: "./data/tickets.csv"
  columns: [id, subject, body, created_at]

transform:
  model: "gpt-4o-mini"
  temperature: 0
  prompt: |
    Classify this support ticket into exactly one category.
    Subject: {{ row.subject }}
    Body: {{ row.body }}

  output_schema:
    category:
      type: enum
      values: [billing, technical, feature_request, other]
    confidence:
      type: float
      min: 0
      max: 1
    reasoning:
      type: string
      max_length: 200

settings:
  batch_size: 50
  max_retries: 3
  concurrency: 10

output:
  type: postgres
  connection: "$DATABASE_URL"
  table: "ticket_classifications"
  on_conflict: upsert
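The `{{ row.subject }}` placeholders in the prompt are Jinja-style per-row substitutions. As a rough illustration of what happens for each row (a simplified stand-in, not AffineBox's actual templating engine):

```python
import re

def render_prompt(template: str, row: dict) -> str:
    # Replace each "{{ row.<field> }}" placeholder with the matching
    # value from the row dict. Illustrative only.
    return re.sub(
        r"\{\{\s*row\.(\w+)\s*\}\}",
        lambda m: str(row[m.group(1)]),
        template,
    )

template = "Subject: {{ row.subject }}\nBody: {{ row.body }}"
row = {"subject": "Refund not received", "body": "I was charged twice last month."}
print(render_prompt(template, row))
```

Each rendered prompt is sent to the model alongside the output schema, one call per row (batched per the settings block).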

What people are building

Real pipelines our beta users run in production.

classification

Label 10K support tickets in 4 minutes

A SaaS team routes their Zendesk export through AffineBox to auto-classify tickets by category and urgency. Results feed back into their routing rules.

10,000 rows · gpt-4o-mini · ~$4.20 · 3m 48s
extraction

Extract structured data from 500 PDFs

An ops team pulls key fields (vendor, amount, date, line items) from scanned invoices. Schema validation ensures every row has the right types before hitting the DB.

500 documents · gpt-4o · ~$8.50 · 12m
generation

Generate product descriptions from specs

An e-commerce company generates SEO-friendly descriptions from a CSV of product attributes. Outputs are validated against a max-length schema and tone guide.

3,200 products · gpt-4o-mini · ~$2.10 · 6m

Built for production data pipelines

Not a prompt playground. AffineBox is infrastructure for running LLM transformations on real data at scale.

$ schema

Type-safe outputs

Define output schemas with types, enums, min/max, regex. Every LLM response is validated before writing. Failed validations trigger retries automatically.
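In plain Python terms, validation against the ticket schema above amounts to checks like the following (a minimal sketch mirroring that YAML schema, not AffineBox's validator):

```python
def validate_row(output: dict) -> list[str]:
    """Return a list of validation errors for one LLM response.
    An empty list means the row passes and can be written to the sink."""
    errors = []
    if output.get("category") not in {"billing", "technical", "feature_request", "other"}:
        errors.append("category: not in enum")
    conf = output.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append("confidence: must be a float in [0, 1]")
    reasoning = output.get("reasoning", "")
    if not isinstance(reasoning, str) or len(reasoning) > 200:
        errors.append("reasoning: must be a string of at most 200 chars")
    return errors
```

A row that fails any check is retried rather than written, which is why the run summary can report a separate schema-validation pass count.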

$ connect

Any source, any sink

Read from CSV, Postgres, S3, BigQuery, or HTTP APIs. Write to any database, file, or webhook. Connection strings from env vars.

$ batch

Smart batching

Automatic rate-limit detection and backoff. Configurable concurrency and batch sizes. Resume from where you left off on failures.
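The backoff strategy described here is the standard exponential-backoff-with-jitter pattern. A generic sketch of the idea (the `RateLimitError` class and parameters are illustrative, not AffineBox internals):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 / rate-limit error."""

def with_backoff(call, max_retries=3, base=1.0, cap=30.0):
    """Invoke `call`, retrying on rate-limit errors.

    Waits base * 2**attempt seconds (capped at `cap`), scaled by random
    jitter, before each retry; re-raises after `max_retries` retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            time.sleep(min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0))
```

Jitter spreads retries out so that concurrent workers hitting the same limit don't all retry in lockstep.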

$ model

Any model provider

OpenAI, Anthropic, Google, or any OpenAI-compatible endpoint. Switch models by changing one line. Cost tracking per run.

$ sdk

Python SDK

For complex logic, use the Python SDK instead of YAML. Same engine under the hood. Compose pipelines programmatically.

$ log

Observability built in

Every run is logged with inputs, outputs, latency, cost, and error rates. Export to your own observability stack or use the local dashboard.

Simple, usage-based

Free during beta. Paid tiers launching Q3 2026. Pricing is per record processed through a pipeline (you bring your own LLM API keys).

Free
$0 / mo
For experimentation and small jobs
  • 100 records / day
  • All connectors
  • Community support
  • Local CLI only
  • BYO API keys
Get started
Scale
$499 / mo
Unlimited volume, priority support
  • Unlimited records
  • Self-hosted option
  • SSO & audit logs
  • Dedicated support
  • Custom connectors
Contact us

Recent updates

We ship weekly. Here is what landed recently.

May 2, 2026 v0.3.1

Added resume-from-checkpoint for interrupted runs. Fixed a bug where Postgres output would silently drop rows with NULL primary keys. Improved rate-limit backoff for Anthropic API.

Apr 24, 2026 v0.3.0

Python SDK now in public beta. New connector: BigQuery (read & write). Output schema now supports nested objects and arrays. Breaking: renamed transform.schema to transform.output_schema.

Apr 15, 2026 v0.2.8

Added cost tracking per run (prints total $ spent at end). New flag: --dry-run to validate pipeline config without processing. S3 connector now supports IAM role auth.

Apr 5, 2026 v0.2.6

Initial support for Anthropic Claude models. Added --concurrency flag override. Fixed template rendering for nested row fields.

Start processing data in 5 minutes

Install the CLI, write a pipeline YAML, and run it. No sign-up required for the free tier.