CASE STUDY

AI Trends Pipeline
Built for Scale.

A three-phase pipeline: deterministic TypeScript for data sourcing and validation, with a single constrained Haiku turn in the middle for scoring and selection. ~$0.03/run, zero human intervention.

How it works

A GitHub Actions cron job triggers a three-phase TypeScript pipeline every morning. Phase 1 fetches articles from HN and NewsData.io deterministically. Phase 2 hands a numbered pool to a single constrained Haiku 4.5 turn (zero tools, single turn) for scoring and headline selection. Phase 3 validates every URL against the pre-built allowlist, merges the result into trends-data.json, and pushes to main. The website's AI Trends dashboard is updated every morning before business hours — ~$0.03/run, ~26 seconds end-to-end.

LLMs for Judgment Only
Scoring news significance and selecting headlines requires judgment. Fetching APIs, parsing JSON, writing files, and running git commands does not. The LLM only touches Phase 2.
Zero Tools, Single Turn
allowedTools: [] and maxTurns: 1 — the agent cannot call WebSearch or any tool. It scores and selects from a pre-verified article pool only.
Validate Deterministically
An allowlist check is cheaper and more reliable than asking the LLM not to hallucinate. Every URL in the output is validated against the pre-fetched pool.
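A minimal sketch of that check, assuming an illustrative Headline shape (the pipeline's actual field names may differ):

```typescript
// Split agent output into allowed and rejected headlines against the
// pre-fetched URL pool. Headline is an assumed, simplified shape.
interface Headline { title: string; url: string; }

function filterToAllowlist(
  headlines: Headline[],
  allowlist: Set<string>,
): { kept: Headline[]; rejected: Headline[] } {
  const kept: Headline[] = [];
  const rejected: Headline[] = [];
  for (const h of headlines) {
    (allowlist.has(h.url) ? kept : rejected).push(h);
  }
  return { kept, rejected };
}
```
Rejected URLs are logged rather than silently dropped, so a hallucinating run is visible in telemetry.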
TRIGGER
GitHub Actions Cron
0 11 * * * (6:00 AM EST, daily)
pipeline repo
DATA SOURCES
News Ingestion
Two complementary data feeds
HN Firebase API
Top 100 stories, 24h timestamp filter, AI keyword regex
Developer tools, open-source AI, foundation models, AI agents · Free
NewsData.io REST API
6 parallel topic queries, 24h pubDate filter
AI general, regulation, safety, infra, enterprise, multimodal · Free (200 credits/day)
All URLs pre-verified before agent sees them
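The HN side of this ingestion can be sketched in a few lines against the public HN Firebase API. The keyword regex below is illustrative, not the pipeline's actual one:

```typescript
// Fetch HN top stories, keep AI-related items from the last 24h.
// AI_RE is an assumed stand-in for the pipeline's keyword regex.
const HN = "https://hacker-news.firebaseio.com/v0";
const AI_RE = /\b(AI|LLM|GPT|Claude|machine learning|neural)\b/i; // assumed

async function fetchHnAiStories(): Promise<{ title: string; url: string }[]> {
  const ids: number[] = await (await fetch(`${HN}/topstories.json`)).json();
  const cutoff = Date.now() / 1000 - 24 * 3600;            // 24h timestamp filter
  const items = await Promise.all(
    ids.slice(0, 100).map(async (id) =>                    // top 100 only
      (await fetch(`${HN}/item/${id}.json`)).json(),
    ),
  );
  return items
    .filter((s) => s && s.time > cutoff && s.url && AI_RE.test(s.title))
    .map((s) => ({ title: s.title, url: s.url }));
}
```
Every URL returned here goes into the allowlist before the agent ever sees it.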
Phase 1: build unified article pool (deterministic TS)
PHASE 1
Deterministic Data Sourcing
TypeScript, no LLM
~3 seconds
zero API cost
1a
Fetch HN
Firebase API top 100, 24h timestamp filter, AI keyword regex match.
1b
Fetch NewsData
6 parallel queries (AI general, regulation, safety, infra, enterprise, multimodal).
1c
Build Pool
Merge HN + NewsData, dedup by URL, tag origin, build allowlist Set<string>.
1d
Cross-Day Dedup
Load last 3 entries' URLs, remove already-used articles, re-index 1..N.
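Steps 1c and 1d together can be sketched as one pure function; the Article shape is illustrative:

```typescript
// Merge both feeds, dedup by URL, drop URLs already used in recent
// entries, re-index 1..N, and build the allowlist in one pass.
interface Article { title: string; url: string; origin: "hn" | "newsdata"; }

function buildPool(
  hn: Article[],
  newsdata: Article[],
  recentUrls: Set<string>,            // URLs from the last 3 entries
): { pool: (Article & { index: number })[]; allowlist: Set<string> } {
  const seen = new Set<string>();
  const pool: (Article & { index: number })[] = [];
  for (const a of [...hn, ...newsdata]) {
    if (seen.has(a.url) || recentUrls.has(a.url)) continue; // dedup + cross-day
    seen.add(a.url);
    pool.push({ ...a, index: pool.length + 1 });            // re-index 1..N
  }
  return { pool, allowlist: new Set(pool.map((a) => a.url)) };
}
```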
Phase 2: agent scores and selects (Haiku 4.5, zero tools)
HAIKU 4.5
Agent Scoring
Single turn, zero tools, constrained to pool
model: claude-haiku-4-5-20251001
allowedTools: []
maxTurns: 1
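Phase 2 reduces to a single SDK call. A hedged sketch, assuming the Claude Agent SDK's query() interface; the result-collection details are assumptions, while the three option values are the ones shown above:

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Run one constrained scoring turn and return the agent's final text.
// scorePool and its collection logic are illustrative, not the real code.
async function scorePool(scoringPrompt: string): Promise<string> {
  let finalText = "";
  for await (const message of query({
    prompt: scoringPrompt,              // numbered article pool + rubric
    options: {
      model: "claude-haiku-4-5-20251001",
      allowedTools: [],                 // no WebSearch, no file or bash tools
      maxTurns: 1,                      // one scoring turn, no drift
    },
  })) {
    if (message.type === "result" && message.subtype === "success") {
      finalText = message.result;       // final assistant text containing JSON
    }
  }
  return finalText;
}
```
With zero tools registered, no tool schemas are serialized into the request, which keeps the token count, and the cost, minimal.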
2
Score Categories
Present numbered article pool. Agent scores 10 categories 0–100 based on volume, significance, and source breadth.
3
Select Headlines
Agent selects 1–5 headlines per category with title, source, URL, and summary. From the pre-verified pool only.
SAFEGUARDS
Anti-Hallucination Measures
1
Allowlist Validation
Every URL in agent output checked against pre-fetched pool. Unknown URLs rejected and logged.
2
Cross-Day Dedup
URLs from last 3 entries removed before agent sees pool. Prevents day-over-day recycling.
3
Zero Tools
allowedTools: [] — agent cannot call WebSearch or any tool.
4
Single Turn
maxTurns: 1 — no multi-turn drift or negotiation.
5
Score Zeroing
Categories with all headlines rejected get score set to 0.
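Safeguard 5 is a one-pass map over the validated output; the Category shape is illustrative:

```typescript
// Zero the score of any category whose headlines were all rejected,
// so a hallucinated category cannot survive with a high score.
interface Category { name: string; score: number; headlines: { url: string }[]; }

function zeroEmptyCategories(categories: Category[]): Category[] {
  return categories.map((c) =>
    c.headlines.length === 0 ? { ...c, score: 0 } : c,
  );
}
```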
Phase 3: validate, write, and push (deterministic TS)
PHASE 3
Validate + Write
TypeScript, no LLM
5
Parse Agent Output
Extract JSON from agent response (code block extraction + fallback parsing).
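A sketch of that two-stage extraction: prefer a fenced code block, fall back to the outermost brace span. The exact fallback strategy in the real pipeline may differ:

```typescript
// Pull JSON out of the agent's free-text response.
function extractJson(response: string): unknown {
  const fenced = response.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return JSON.parse(fenced[1]);          // code block extraction
  const start = response.indexOf("{");               // fallback: outermost braces
  const end = response.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON found in response");
  return JSON.parse(response.slice(start, end + 1));
}
```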
6
Allowlist Validation
Reject any URL not present in the pre-fetched pool Set<string>.
7
Dedup Validation
Remove cross-category duplicate headlines within the same entry.
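Step 7 can be sketched as a first-seen-wins filter across categories; shapes are illustrative:

```typescript
// Keep a headline only the first time its URL appears across the
// categories of a single day's entry.
interface Cat { name: string; headlines: { url: string }[]; }

function dedupAcrossCategories(cats: Cat[]): Cat[] {
  const seen = new Set<string>();
  return cats.map((c) => ({
    ...c,
    headlines: c.headlines.filter((h) => {
      if (seen.has(h.url)) return false;
      seen.add(h.url);
      return true;
    }),
  }));
}
```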
8
Zero Empty Scores
Categories with no valid headlines remaining get score set to 0.
9
Merge + Log
Upsert into trends-data.json by date. Append run log to pipeline-runs.jsonl.
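The upsert is keyed by date: replace any same-day entry, append otherwise, keep the file sorted. The Entry shape is illustrative:

```typescript
// Upsert one day's entry into the trends-data array by date.
interface Entry { date: string; categories: unknown[]; }

function upsertByDate(existing: Entry[], entry: Entry): Entry[] {
  const rest = existing.filter((e) => e.date !== entry.date);
  return [...rest, entry].sort((a, b) => a.date.localeCompare(b.date));
}
```
Re-running the pipeline on the same day overwrites that day's entry instead of duplicating it, which makes the whole run idempotent.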
GIT SAFETY
Safe Deployment Workflow
Deterministic TS — handles all edge cases
WORKFLOW SEQUENCE
git stash --include-untracked → git pull --rebase → git add trends-data.json → verify staged → git commit → git push (retry on conflict) → git stash pop
  1. Dirty working tree — git stash --include-untracked before pull, git stash pop after push
  2. Conflict on push — catches push failure, runs git pull --rebase, retries push
  3. No-op detection — git diff --cached --quiet exits 0 when nothing staged, skips commit
  4. Selective staging — only stages trends-data.json, never commits unrelated files
  5. Stash recovery — always pops stash in both success and error paths
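The five cases above compress into a short script. A condensed sh sketch, not the pipeline's actual code; the file path and commit message are assumptions:

```shell
#!/bin/sh
set -e
# Sketch of the git safety sequence. Assumes it runs inside the
# website repo checkout with trends-data.json already written.
git stash --include-untracked               # 1. park any dirty state
trap 'git stash pop || true' EXIT           # 5. always restore, success or error
git pull --rebase
git add data/trends-data.json               # 4. stage only the data file
if git diff --cached --quiet; then
  echo "no changes, skipping commit"        # 3. no-op detection
else
  git commit -m "chore: update trends data"
  git push || { git pull --rebase && git push; }  # 2. retry once on conflict
fi
```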
pipeline produces three outputs
trends-data.json
Public scores + headlines per day. Imported by the Next.js /trends page at build time. Single source of truth.
data/trends-data.json
pipeline-signals.json
Internal telemetry: cost, duration, article pool stats, warnings. Not rendered on the public page.
data/pipeline-signals.json
Vercel Auto-Deploy
Push to main → Vercel webhook → next build → CDN deploy. Live within ~2 minutes.
GitHub → Vercel auto-deploy
COST
Estimated Cost Per Run
Haiku 4.5 — scoring + selection
Single turn, zero tools, ~8K tokens
~$0.03
~$0.03 per run · ~$0.90/month · ~26s · Hard cap: $2.00 (MAX_BUDGET_USD)
PRINCIPLES
Design Principles
  1. LLMs for judgment only — scoring and selecting requires judgment; fetching and writing does not.
  2. Zero tools = minimal overhead — allowedTools: [] means no tool schemas per turn.
  3. Validate deterministically — an allowlist check beats asking the LLM not to hallucinate.
  4. Custom prompt over presets — a 2-sentence system prompt replaces 10K tokens of unused instructions.
REPO SEPARATION
The three-phase pipeline runs in its own repository via GitHub Actions and reaches into the website repo only to write trends-data.json. Two repos: the pipeline (TypeScript + GitHub Actions workflow) and the website (Next.js + Vercel). The git safety workflow ensures the pipeline never touches other files, never commits staged user changes, and always restores the working tree via stash/pop.