CASE STUDY

AI Trends Pipeline
Built for Scale.

A three-phase pipeline: deterministic TypeScript for data sourcing and validation, with a single constrained Haiku turn in the middle for scoring and selection. ~$0.03/run, zero human intervention.

How it works

A GitHub Actions cron job triggers a three-phase TypeScript pipeline every morning. Phase 1 fetches articles from HN and NewsData.io deterministically. Phase 2 hands a numbered pool to a single constrained Haiku 4.5 turn (zero tools, single turn) for scoring and headline selection. Phase 3 validates every URL against the pre-built allowlist, merges the result into trends-data.json, and pushes to main. The website's AI Trends dashboard is updated every morning before business hours — ~$0.03/run, ~26 seconds end-to-end.

LLMs for Judgment Only
Scoring news significance and selecting headlines requires judgment. Fetching APIs, parsing JSON, writing files, and running git commands does not. The LLM only touches Phase 2.
Zero Tools, Single Turn
allowedTools: [] and maxTurns: 1 — the agent cannot call WebSearch or any tool. It scores and selects from a pre-verified article pool only.
Validate Deterministically
An allowlist check is cheaper and more reliable than asking the LLM not to hallucinate. Every URL in the output is validated against the pre-fetched pool.
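A minimal sketch of that check, assuming an illustrative Headline shape (the pipeline's actual field names may differ):

```typescript
// Split agent output into allowed and rejected headlines against the
// pre-fetched URL pool. Headline is an assumed, simplified shape.
interface Headline { title: string; url: string; }

function filterToAllowlist(
  headlines: Headline[],
  allowlist: Set<string>,
): { kept: Headline[]; rejected: Headline[] } {
  const kept: Headline[] = [];
  const rejected: Headline[] = [];
  for (const h of headlines) {
    (allowlist.has(h.url) ? kept : rejected).push(h);
  }
  return { kept, rejected };
}
```
Rejected URLs are logged rather than silently dropped, so a hallucinating run is visible in telemetry.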
TRIGGER
GitHub Actions Cron
0 11 * * * (6:00 AM EST, daily)
pipeline repo
DATA SOURCES
News Ingestion
Two complementary data feeds
HN Firebase API
Top 100 stories, 24h timestamp filter, AI keyword regex
Developer tools, open-source AI, foundation models, AI agents · Free
NewsData.io REST API
6 parallel topic queries, 24h pubDate filter
AI general, regulation, safety, infra, enterprise, multimodal · Free (200 credits/day)
All URLs pre-verified before agent sees them
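The HN side of this ingestion can be sketched in a few lines against the public HN Firebase API. The keyword regex below is illustrative, not the pipeline's actual one:

```typescript
// Fetch HN top stories, keep AI-related items from the last 24h.
// AI_RE is an assumed stand-in for the pipeline's keyword regex.
const HN = "https://hacker-news.firebaseio.com/v0";
const AI_RE = /\b(AI|LLM|GPT|Claude|machine learning|neural)\b/i; // assumed

async function fetchHnAiStories(): Promise<{ title: string; url: string }[]> {
  const ids: number[] = await (await fetch(`${HN}/topstories.json`)).json();
  const cutoff = Date.now() / 1000 - 24 * 3600;            // 24h timestamp filter
  const items = await Promise.all(
    ids.slice(0, 100).map(async (id) =>                    // top 100 only
      (await fetch(`${HN}/item/${id}.json`)).json(),
    ),
  );
  return items
    .filter((s) => s && s.time > cutoff && s.url && AI_RE.test(s.title))
    .map((s) => ({ title: s.title, url: s.url }));
}
```
Every URL returned here goes into the allowlist before the agent ever sees it.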
Phase 1: build unified article pool (deterministic TS)
PHASE 1
Deterministic Data Sourcing
TypeScript, no LLM
~3 seconds
zero API cost
1a
Fetch HN
Firebase API top 100, 24h timestamp filter, AI keyword regex match.
1b
Fetch NewsData
6 parallel queries (AI general, regulation, safety, infra, enterprise, multimodal).
1c
Build Pool
Merge HN + NewsData, dedup by URL, tag origin, build allowlist Set<string>.
1d
Cross-Day Dedup
Load last 3 entries' URLs, remove already-used articles, re-index 1..N.
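Steps 1c and 1d together can be sketched as one pure function; the Article shape is illustrative:

```typescript
// Merge both feeds, dedup by URL, drop URLs already used in recent
// entries, re-index 1..N, and build the allowlist in one pass.
interface Article { title: string; url: string; origin: "hn" | "newsdata"; }

function buildPool(
  hn: Article[],
  newsdata: Article[],
  recentUrls: Set<string>,            // URLs from the last 3 entries
): { pool: (Article & { index: number })[]; allowlist: Set<string> } {
  const seen = new Set<string>();
  const pool: (Article & { index: number })[] = [];
  for (const a of [...hn, ...newsdata]) {
    if (seen.has(a.url) || recentUrls.has(a.url)) continue; // dedup + cross-day
    seen.add(a.url);
    pool.push({ ...a, index: pool.length + 1 });            // re-index 1..N
  }
  return { pool, allowlist: new Set(pool.map((a) => a.url)) };
}
```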
Phase 2: agent scores and selects (Haiku 4.5, zero tools)
HAIKU 4.5
Agent Scoring
Single turn, zero tools, constrained to pool
model: claude-haiku-4-5-20251001
allowedTools: []
maxTurns: 1
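Phase 2 reduces to a single SDK call. A hedged sketch, assuming the Claude Agent SDK's query() interface; the result-collection details are assumptions, while the three option values are the ones shown above:

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Run one constrained scoring turn and return the agent's final text.
// scorePool and its collection logic are illustrative, not the real code.
async function scorePool(scoringPrompt: string): Promise<string> {
  let finalText = "";
  for await (const message of query({
    prompt: scoringPrompt,              // numbered article pool + rubric
    options: {
      model: "claude-haiku-4-5-20251001",
      allowedTools: [],                 // no WebSearch, no file or bash tools
      maxTurns: 1,                      // one scoring turn, no drift
    },
  })) {
    if (message.type === "result" && message.subtype === "success") {
      finalText = message.result;       // final assistant text containing JSON
    }
  }
  return finalText;
}
```
With zero tools registered, no tool schemas are serialized into the request, which keeps the token count, and the cost, minimal.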
2
Score Categories
Present numbered article pool. Agent scores 10 categories 0–100 based on volume, significance, and source breadth.
3
Select Headlines
Agent selects 1–5 headlines per category with title, source, URL, and summary. From the pre-verified pool only.
SAFEGUARDS
Anti-Hallucination Measures
1
Allowlist Validation
Every URL in agent output checked against pre-fetched pool. Unknown URLs rejected and logged.
2
Cross-Day Dedup
URLs from last 3 entries removed before agent sees pool. Prevents day-over-day recycling.
3
Zero Tools
allowedTools: [] — agent cannot call WebSearch or any tool.
4
Single Turn
maxTurns: 1 — no multi-turn drift or negotiation.
5
Score Zeroing
Categories with all headlines rejected get score set to 0.
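Safeguard 5 is a one-pass map over the validated output; the Category shape is illustrative:

```typescript
// Zero the score of any category whose headlines were all rejected,
// so a hallucinated category cannot survive with a high score.
interface Category { name: string; score: number; headlines: { url: string }[]; }

function zeroEmptyCategories(categories: Category[]): Category[] {
  return categories.map((c) =>
    c.headlines.length === 0 ? { ...c, score: 0 } : c,
  );
}
```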
Phase 3: validate, write, and push (deterministic TS)
PHASE 3
Validate + Write
TypeScript, no LLM
5
Parse Agent Output
Extract JSON from agent response (code block extraction + fallback parsing).
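A sketch of that two-stage extraction: prefer a fenced code block, fall back to the outermost brace span. The exact fallback strategy in the real pipeline may differ:

```typescript
// Pull JSON out of the agent's free-text response.
function extractJson(response: string): unknown {
  const fenced = response.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return JSON.parse(fenced[1]);          // code block extraction
  const start = response.indexOf("{");               // fallback: outermost braces
  const end = response.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON found in response");
  return JSON.parse(response.slice(start, end + 1));
}
```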
6
Allowlist Validation
Reject any URL not present in the pre-fetched pool Set<string>.
7
Dedup Validation
Remove cross-category duplicate headlines within the same entry.
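Step 7 can be sketched as a first-seen-wins filter across categories; shapes are illustrative:

```typescript
// Keep a headline only the first time its URL appears across the
// categories of a single day's entry.
interface Cat { name: string; headlines: { url: string }[]; }

function dedupAcrossCategories(cats: Cat[]): Cat[] {
  const seen = new Set<string>();
  return cats.map((c) => ({
    ...c,
    headlines: c.headlines.filter((h) => {
      if (seen.has(h.url)) return false;
      seen.add(h.url);
      return true;
    }),
  }));
}
```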
8
Zero Empty Scores
Categories with no valid headlines remaining get score set to 0.
9
Merge + Log
Upsert into trends-data.json by date. Append run log to pipeline-runs.jsonl.
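The upsert is keyed by date: replace any same-day entry, append otherwise, keep the file sorted. The Entry shape is illustrative:

```typescript
// Upsert one day's entry into the trends-data array by date.
interface Entry { date: string; categories: unknown[]; }

function upsertByDate(existing: Entry[], entry: Entry): Entry[] {
  const rest = existing.filter((e) => e.date !== entry.date);
  return [...rest, entry].sort((a, b) => a.date.localeCompare(b.date));
}
```
Re-running the pipeline on the same day overwrites that day's entry instead of duplicating it, which makes the whole run idempotent.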
GIT SAFETY
Safe Deployment Workflow
Deterministic TS — handles all edge cases
WORKFLOW SEQUENCE
git stash --include-untracked → git pull --rebase → git add trends-data.json → verify staged → git commit → git push (retry on conflict) → git stash pop
  1. Dirty working tree — git stash --include-untracked before pull, git stash pop after push
  2. Conflict on push — catches push failure, runs git pull --rebase, retries push
  3. No-op detection — git diff --cached --quiet exits 0 when nothing staged, skips commit
  4. Selective staging — only stages trends-data.json, never commits unrelated files
  5. Stash recovery — always pops stash in both success and error paths
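The five cases above compress into a short script. A condensed sh sketch, not the pipeline's actual code; the file path and commit message are assumptions:

```shell
#!/bin/sh
set -e
# Sketch of the git safety sequence. Assumes it runs inside the
# website repo checkout with trends-data.json already written.
git stash --include-untracked               # 1. park any dirty state
trap 'git stash pop || true' EXIT           # 5. always restore, success or error
git pull --rebase
git add data/trends-data.json               # 4. stage only the data file
if git diff --cached --quiet; then
  echo "no changes, skipping commit"        # 3. no-op detection
else
  git commit -m "chore: update trends data"
  git push || { git pull --rebase && git push; }  # 2. retry once on conflict
fi
```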
pipeline produces three outputs
trends-data.json
Public scores + headlines per day. Imported by the Next.js /trends page at build time. Single source of truth.
data/trends-data.json
pipeline-signals.json
Internal telemetry: cost, duration, article pool stats, warnings. Not rendered on the public page.
data/pipeline-signals.json
Vercel Auto-Deploy
Push to main → Vercel webhook → next build → CDN deploy. Live within ~2 minutes.
GitHub → Vercel auto-deploy
COST
Estimated Cost Per Run
Haiku 4.5 — scoring + selection
Single turn, zero tools, ~8K tokens
~$0.03
~$0.03 per run · ~$0.90/month · ~26s · Hard cap: $2.00 (MAX_BUDGET_USD)
PRINCIPLES
Design Principles
  1. LLMs for judgment only — scoring and selecting requires judgment; fetching and writing does not.
  2. Zero tools = minimal overhead — allowedTools: [] means no tool schemas per turn.
  3. Validate deterministically — an allowlist check beats asking the LLM not to hallucinate.
  4. Custom prompt over presets — a 2-sentence system prompt replaces 10K tokens of unused instructions.
REPO SEPARATION
The three-phase pipeline runs in its own repository via GitHub Actions and reaches into the website repo only to write trends-data.json. Two repos: the pipeline (TypeScript + GitHub Actions workflow) and the website (Next.js + Vercel). The git safety workflow ensures the pipeline never touches other files, never commits staged user changes, and always restores the working tree via stash/pop.