I Built a Data Engineer Agent to Set Up My Customer.io Pipeline

How I used Claude Code to build a Customer.io → GCS → BigQuery pipeline with Cloud Functions, scheduled merges, and cost controls—without being a data engineer.

Jason Macht

Founder @ White Space

January 12, 2026
12 min read

I needed to get Customer.io data into BigQuery. Sounds simple enough. But here's the catch: Customer.io doesn't pipe directly into BigQuery. It drops Parquet files into a Google Cloud Storage bucket every 15 minutes. So the real work starts after the files land.

And I'm not a data engineer.

I know enough to be dangerous with GCP, but building production-grade pipelines—with proper IAM, scheduled jobs, deduplication, and cost controls—that's not my wheelhouse. So I built a Claude Code agent to do it for me.

The result: a repeatable pipeline pattern I can now deploy for any client in under an hour. What used to take a data engineering team half a day per client now takes me a fraction of that.

Let's go ahead and jump into it.

The Problem: Parquet Files Piling Up

Customer.io's BigQuery integration works like this: every 15 minutes, it exports your messaging data—campaigns, events, deliveries, opens, clicks—as Parquet files into a GCS bucket you control.

That's where the "integration" ends. From there, you're on your own.

The challenges stack up fast:

Volume explosion. At 15-minute intervals, you're looking at 96 file drops per day. Multiply that by the number of data types (subjects, outputs, metrics), and you've got hundreds of files landing daily. Left unchecked, that's millions of records in weeks.

Deduplication and idempotency. Files can contain overlapping data. The same record might appear in multiple exports. If you're not careful, you'll double-count everything.

Schema drift. Customer.io updates their export format periodically. Their current v5 schema uses a naming convention like <name>_v5_<workspace_id>_<sequence>.parquet. If your ingestion logic assumes v4, things break.

Cost control. BigQuery charges for storage and queries. Keep everything forever, and you'll burn money. But you also need audit trails and the ability to debug historical issues.

I'd seen clients try to solve this with third-party ETL tools like Fivetran or Airbyte. Those work, but they add monthly costs and another dependency. I wanted something I controlled end-to-end.

The Architecture

Before I started building, I mapped out what the pipeline needed to look like:

Customer.io → GCS bucket (Parquet file drops every 15 minutes)

Cloud Function triggered on new file arrival

  • Validates file format and schema version
  • Loads into BigQuery "landing" tables

Scheduled merge queries (every 15-30 minutes)

  • Deduplicates and upserts into "curated" tables
  • These are what downstream reporting actually uses

Lifecycle and retention controls

  • GCS lifecycle rules expire raw Parquet files after N days
  • BigQuery partition expiration keeps storage costs predictable

The key insight: separate raw from curated. Raw tables are your audit log and debugging safety net. Curated tables are clean, fast, and what your BI tools query.

What the Agent Actually Did

I gave Claude Code a clear brief: set up this entire pipeline in my GCP project, with proper security, cost controls, and documentation for the next person who has to maintain it.

Here's the step-by-step of what it built.

Step 1: GCP Authentication and Context

First, the agent confirmed my GCP environment. It ran gcloud config list to verify the active project and region, then checked that I had the necessary APIs enabled (Cloud Functions, BigQuery, Cloud Storage, Cloud Scheduler).

This is the kind of thing I would have skipped and then spent 30 minutes debugging later. The agent made sure the foundation was solid before building anything.
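For reference, the same checks from a terminal look roughly like this (the project ID is a placeholder, not my actual setup):

# Confirm the active project and default region
gcloud config list

# Enable the APIs the pipeline depends on (safe to re-run if already enabled)
gcloud services enable \
  cloudfunctions.googleapis.com \
  bigquery.googleapis.com \
  storage.googleapis.com \
  cloudscheduler.googleapis.com \
  --project=my-gcp-project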

Step 2: Create the Storage Landing Zone

The agent created the GCS bucket for Customer.io exports with:

  • Regional storage class (cheaper than multi-regional for this use case)
  • Lifecycle rule: delete Parquet files older than 30 days
  • Naming convention that maps to the client workspace

The lifecycle rule is critical. Without it, you're paying to store files you'll never look at again. Thirty days gives you enough runway to investigate issues while keeping costs in check.
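Here's a rough sketch of that setup with the gcloud CLI; bucket name, region, and file paths are placeholders rather than the exact commands the agent ran:

# Create the regional landing bucket
gcloud storage buckets create gs://acme-customerio-exports \
  --location=us-central1 \
  --default-storage-class=STANDARD

# lifecycle.json: delete objects older than 30 days
# {"rule": [{"action": {"type": "Delete"}, "condition": {"age": 30}}]}
gcloud storage buckets update gs://acme-customerio-exports \
  --lifecycle-file=lifecycle.json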

Step 3: IAM and Service Account Setup

This is where things usually go wrong. GCP permissions are easy to mess up, and you end up with either too much access (security risk) or too little (nothing works).

The agent created a dedicated service account for the Customer.io integration with exactly the permissions it needs:

  • storage.objects.create on the target bucket
  • storage.objects.get for read-back verification
  • Nothing else

It then generated the JSON key file that Customer.io needs, with a clear checklist of "paste this into the Customer.io UI" steps.

Least privilege from the start. No sweeping storage.admin roles that would let Customer.io touch other buckets.
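A minimal sketch of that setup, using the predefined roles closest to those two permissions (roles/storage.objectCreator and roles/storage.objectViewer); account, project, and bucket names are placeholders:

# Dedicated service account for the Customer.io export
gcloud iam service-accounts create customerio-export \
  --display-name="Customer.io export writer"

# Bucket-scoped bindings only; no project-level roles
gcloud storage buckets add-iam-policy-binding gs://acme-customerio-exports \
  --member="serviceAccount:customerio-export@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectCreator"
gcloud storage buckets add-iam-policy-binding gs://acme-customerio-exports \
  --member="serviceAccount:customerio-export@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# JSON key to paste into the Customer.io UI
gcloud iam service-accounts keys create customerio-key.json \
  --iam-account=customerio-export@my-gcp-project.iam.gserviceaccount.com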

Step 4: BigQuery Dataset and Table Structure

The agent set up the BigQuery destination with three logical layers:

Landing tables — Raw Parquet loads land here. These are append-only and partitioned by ingestion date.

Curated tables — Clean, deduplicated data with proper clustering for query performance. This is what dashboards and downstream agents use.

Archive tables (optional) — Long-term storage for compliance, with 90-day partition expiration.

For each Customer.io data type (subjects, outputs, deliveries, etc.), the agent created corresponding tables with the v5 schema. It even flagged fields that commonly cause issues—like timestamp formats that differ between Parquet and BigQuery.
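As a sketch with the bq CLI (dataset, table, and schema-file names are placeholders; expiration values are in seconds, so 604800 matches the 7-day landing retention described in Step 7):

# Datasets for the landing and curated layers
bq --location=US mk --dataset my-gcp-project:cio_landing
bq --location=US mk --dataset my-gcp-project:cio_curated

# Ingestion-date partitioned landing table; schema file derived from the v5 export docs
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_expiration=604800 \
  my-gcp-project:cio_landing.deliveries ./schemas/deliveries_v5.json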

Step 5: The Cloud Function

This is the workhorse. Every time a new Parquet file lands in GCS, a Cloud Function fires and loads it into BigQuery.

The agent scaffolded a Go-based function (matching Customer.io's sample code structure) that:

  1. Validates the file name — Checks it matches the expected <name>_v5_<workspace_id>_<sequence>.parquet pattern
  2. Extracts metadata — Determines which table to load into based on the file prefix
  3. Runs the BigQuery load job — Uses the WRITE_APPEND disposition for landing tables
  4. Handles idempotency — Tracks processed files to avoid double-loading if the function retries
  5. Logs structured output — Every load includes file name, row count, and duration for debugging

The idempotency piece is crucial. Cloud Functions can retry on transient failures. Without deduplication logic, you'd load the same file multiple times and corrupt your data.

The agent also set up a dead-letter pattern: files that fail validation get moved to an /errors prefix for manual review instead of silently disappearing.
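Deploying the function itself is a single command. A hedged sketch of what it can look like, where the function name, entry point, bucket, and runtime service account are all placeholders:

# Fire on every new object finalized in the landing bucket
gcloud functions deploy cio-parquet-loader \
  --gen2 \
  --runtime=go121 \
  --region=us-central1 \
  --entry-point=LoadParquet \
  --trigger-bucket=acme-customerio-exports \
  --service-account=cio-loader@my-gcp-project.iam.gserviceaccount.com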

Step 6: Scheduled Merge Queries

Loading into landing tables is only half the job. The real value is in the curated layer—clean data that analysts can query without worrying about duplicates.

The agent created scheduled queries that run every 15 minutes and:

  1. MERGE new records from landing into curated tables
  2. Deduplicate on stable keys (like message ID + customer ID)
  3. Handle upserts — If a record exists, update it; if not, insert it

Here's the pattern it used:

MERGE curated.deliveries AS target
USING (
  SELECT * FROM landing.deliveries
  WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id, customer_id ORDER BY timestamp DESC) = 1
) AS source
ON target.message_id = source.message_id AND target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...

The QUALIFY clause handles deduplication within the source. The MERGE handles deduplication against existing data. Clean, efficient, and predictable.
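To put that on the 15-minute cadence, the MERGE gets registered as a BigQuery scheduled query. One way to do that from the CLI looks roughly like this (the display name is a placeholder and the query body is elided):

# Register the MERGE above as a scheduled query
bq mk --transfer_config \
  --data_source=scheduled_query \
  --display_name="merge cio deliveries" \
  --schedule="every 15 minutes" \
  --params='{"query": "<the MERGE statement above>"}'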

Step 7: Retention and Cost Controls

This is where most DIY pipelines fall apart. People set everything up, it works great for a month, then the BigQuery bill arrives and they panic.

The agent configured:

Partition expiration on landing tables — 7 days. You don't need raw loads older than that once they've been merged.

Partition expiration on curated tables — 90 days for operational data, configurable per table.

GCS lifecycle rules — Delete Parquet files after 30 days.

Query cost controls — BigQuery custom query quotas to cap daily usage, plus a billing budget alert when spend crosses a threshold.

The result: predictable, bounded costs. The pipeline can run forever without someone remembering to clean up.
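For reference, the table-level retention settings translate to bq commands along these lines (project, dataset, and table names are placeholders; values are in seconds):

# 7 days on landing, 90 days on curated
bq update --time_partitioning_expiration=604800 my-gcp-project:cio_landing.deliveries
bq update --time_partitioning_expiration=7776000 my-gcp-project:cio_curated.deliveries

# Default expiration for partitioned tables created later in the dataset
bq update --default_partition_expiration=604800 my-gcp-project:cio_landing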

The Agentified Outcome

Here's what made this worth the effort: it's not a one-off script.

I now have a repeatable set of skills I can run for any client:

  1. Connect to their GCP project
  2. Provision buckets, datasets, tables with proper naming
  3. Set up IAM and generate service account keys
  4. Deploy the Cloud Function
  5. Configure scheduled queries
  6. Set retention policies

That means improvements compound. When Customer.io releases schema v6, I update the agent's template once and roll it out everywhere.

Results

Metric                  | Before (Manual Setup) | After (Agent-Assisted)
Setup time per client   | 4-8 hours             | < 1 hour
IAM mistakes            | Frequent              | Zero (so far)
Forgotten cost controls | Always                | Never
Documentation           | Spotty                | Comprehensive

The biggest win isn't even speed—it's consistency. Every client gets the same battle-tested pattern. No more "I set this one up differently six months ago" mysteries.

What Didn't Go Perfectly

I'm not going to pretend this was flawless on the first run.

Schema surprises. Customer.io's Parquet exports include nested fields that BigQuery handles differently than I expected. The agent's first attempt at table creation missed some type mappings. We iterated.

Retry logic edge cases. The initial Cloud Function didn't handle the case where BigQuery's load job timed out but eventually succeeded. That caused one file to load twice before I caught it in reconciliation.

GCP permissioning is fiddly. Even with the agent's help, I hit two instances where a missing IAM binding broke things. The error messages from GCP are notoriously unhelpful. "Permission denied" tells you almost nothing.

Scheduled query cold starts. BigQuery scheduled queries have a startup latency. If files land faster than queries complete, you can get merge conflicts. We added a 5-minute buffer.

The agent helped me identify and fix all of these. But it took iteration. This wasn't a single-prompt-and-done project.

Guardrails and Best Practices

For anyone trying to build something similar, here's what I learned about doing this responsibly:

Security

  • Least-privilege service accounts, always
  • Rotate keys periodically; scope keys per client
  • Never paste secrets into prompts—use environment variables and Secret Manager

Reliability

  • Idempotent loads (track what you've processed)
  • Retries with exponential backoff
  • Dead-letter queues for failed files

Cost control

  • GCS lifecycle expiration (don't store files forever)
  • BigQuery partition expiration
  • Separate raw vs. curated datasets with different retention

Data quality

  • Row count reconciliation between source and destination (see the sketch after this list)
  • Schema drift detection and alerting
  • "Known-good" totals to catch discrepancies early

Why This Matters

We're at an inflection point with AI and infrastructure. The tools are now good enough that a generalist can build systems that used to require specialists.

I'm not replacing data engineers. Complex pipelines with multi-source orchestration, streaming requirements, and real-time SLAs still need deep expertise.

But for the 80% of pipelines that are "get data from A to B, keep it clean, don't spend too much"—an AI agent with the right guidance can handle that. And it can handle it consistently, across dozens of clients, without the human forgetting that one cost control setting.

McKinsey's research suggests that 60-70% of work activities could be augmented by current AI capabilities. Data pipeline setup is squarely in that bucket. The skills involved—reading documentation, writing configuration, debugging errors—are exactly what these models excel at.

The Playbook

If you want to replicate this approach:

  1. Start with one pipeline. Don't try to boil the ocean. Pick a single integration (Customer.io → BigQuery, Stripe → Snowflake, whatever) and get it working.

  2. Document your constraints. What are your cost limits? Retention requirements? Performance needs? The clearer you are upfront, the better the agent performs.

  3. Iterate in small steps. Don't ask for the whole pipeline in one prompt. Break it into: authentication → storage → IAM → tables → ingestion → scheduling → monitoring.

  4. Validate everything. Run reconciliation checks. Compare row counts. Query the data and make sure it looks right.

  5. Make it repeatable. Once it works for one client, templatize it. The goal is to reduce marginal effort on the next deployment to near zero.

FAQ

Q: Does this work for other data sources, or just Customer.io?

The pattern is identical. Stripe webhooks, HubSpot exports, Salesforce dumps—any source that lands files in cloud storage can use this architecture. The specifics of the Cloud Function change, but the structure doesn't.

Q: Why not just use Fivetran or Airbyte?

Those are great tools, and for some clients they're the right answer. But they add monthly cost ($500-2000/month is typical for this scale) and another vendor dependency. For clients who want full control and lower ongoing costs, the DIY approach makes sense.

Q: What if Customer.io changes their schema?

This is a real risk. They've moved from v1 through v5 already. The agent-built pipeline includes schema version detection, so we'll know when a new version appears. Updating the tables and ingestion logic is a few hours of work, not a rebuild.

Q: How do I debug when something goes wrong?

Structured logging is your friend. Every Cloud Function execution logs the file name, table target, row count, and duration. If something's off, you can trace exactly which file caused the issue.

Q: Can non-technical people use this?

They can use the dashboards that sit on top of it. Setting up the pipeline still requires comfort with GCP, BigQuery, and debugging cloud infrastructure. But once it's running, it's hands-off.

Ready to Build Your Data Pipeline?

If you're sitting on a "data integration" backlog because it feels too heavy, this is your sign. The combination of AI agents and modern cloud infrastructure makes these projects dramatically more accessible.

Start with one pipeline. Agentify it. Then reuse the pattern everywhere.

We've implemented this stack for several clients now—from early-stage startups to established businesses with complex data landscapes. The approach scales, and the ROI is immediate.

Check out our data automation services or see how we built a complete reporting stack with AI agents for another example of what's possible.

That's all I got for now. Until next time.

Want to get more out of your business with automation and AI?

Let's talk about how we can streamline your operations and save you time.