Pipeline Latency Analysis — PodPedia Entity Extraction
Date: 2026-05-14
Scope: backend/handlers/ — transform_entity.go, llm_vertex.go, llm_openai_compatible.go, ratelimit.go, upload.go, ingest.go
Based on: ADR-001 through ADR-010, experiment data, code audit
1. Current Performance — Latency by Payload Size
Data from the ADR-001 live A/B experiment (staging, 10K threshold, n=15 per payload size):
| Payload Size | Without 10K Thresh (sequential) | With 10K Thresh (chunked/parallel) | Δ |
|---|---|---|---|
| 5K chars | ~14s (single chunk, no parallel benefit) | ~14s | None — under threshold |
| 15K chars | 45.8s mean | 37.0s mean | -19.2% (p=0.31, not stat. sig.) |
| 25K chars | 116.2s mean | 37.0s mean | -68.2% (p<0.001) |
ADR-010 production incident: 31-minute entity extraction job from a file upload (very large document, >400K chars). Root causes: (1) no LLM call timeout, (2) rate limiter weight=2 halving effective concurrency from 10→5, (3) no progress tracking.
Key takeaway: The 37s staging latency for a modest 15K document is dominated by a single factor — Vertex AI generation time. Even with parallel chunking, the pipeline is bottlenecked on LLM inference.
2. Pipeline Architecture — Summary
Source (TextUpload/Podcast)
└→ EntityExtractorTransform.Process(doc)
├─ doc.Content ≤ 10K chars → extractFullText() (single LLM call)
└─ doc.Content > 10K chars → extractChunked()
├─ Split into 10K-char chunks (200-char overlap)
├─ Fan out across alitto/pond pool (default 8 workers)
└─ Each worker:
├─ GraphRAGLimiter.Wait(token_estimate) ← rate limiter
├─ context.WithTimeout(120s) ← ADR-010 timeout
└─ LLM.Generate(system_prompt, chunk_prompt) ← Vertex AI
└─ Merge all chunk results
└─ For each entity: EntityResolver.ResolveAndInsert() ← O(n·m) string matching
└─ For each relationship: ResolveRelationship()
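In code form, the chunked branch reduces to roughly the following sketch (illustrative only, not the actual transform_entity.go implementation; limiterWait and callLLM stand in for GraphRAGLimiter.Wait and LLM.Generate, and the pool usage assumes the alitto/pond v1 API):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/alitto/pond"
)

// processChunks fans chunk prompts out across a bounded worker pool. Each task
// waits on the rate limiter, applies the 120s ADR-010 timeout, and then blocks
// on a synchronous LLM call.
func processChunks(
	ctx context.Context,
	chunks []string,
	limiterWait func(ctx context.Context, tokens int) error, // stands in for GraphRAGLimiter.Wait
	callLLM func(ctx context.Context, prompt string) (string, error), // stands in for LLM.Generate
) []string {
	results := make([]string, len(chunks))

	pool := pond.New(8, len(chunks)) // 8 workers, mirroring LLM_POOL_SIZE
	defer pool.StopAndWait()
	group := pool.Group()

	for i, chunk := range chunks {
		i, chunk := i, chunk // capture loop variables for the closure
		group.Submit(func() {
			// Rough token estimate (~4 chars per token) used as the limiter weight.
			if err := limiterWait(ctx, len(chunk)/4); err != nil {
				return
			}

			// Per-chunk timeout introduced by ADR-010.
			llmCtx, cancel := context.WithTimeout(ctx, 120*time.Second)
			defer cancel()

			// Synchronous, non-streaming call: the goroutine blocks here until
			// the full JSON response has been generated.
			out, err := callLLM(llmCtx, chunk)
			if err != nil {
				return // real code records the error and keeps the other chunks
			}
			results[i] = out
		})
	}

	group.Wait() // all chunk results are in; merging happens downstream
	return results
}

func main() {
	out := processChunks(context.Background(),
		[]string{"chunk one ...", "chunk two ..."},
		func(ctx context.Context, tokens int) error { return nil },            // no-op limiter
		func(ctx context.Context, p string) (string, error) { return p, nil }, // echo "LLM"
	)
	fmt.Println(len(out), "chunks processed")
}
```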
The LLM call path for Vertex AI:
vertexLLM.Generate()
├─ Allocate GenerativeModel(gemini-2.5-flash)
├─ Set system instruction (full prompt text, NOT cached)
├─ Set max output tokens = 65536
└─ model.GenerateContent(ctx, userPrompt) ← synchronous, blocks goroutine
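For reference, the blocking call shape looks roughly like this (a minimal sketch assuming the cloud.google.com/go/vertexai/genai SDK; the actual wrapper lives in llm_vertex.go and its exact signature may differ):

```go
package llmsketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/vertexai/genai"
)

// generate mirrors the call path above: one client, one model, and one
// synchronous GenerateContent call per chunk.
func generate(ctx context.Context, projectID, location, systemPrompt, userPrompt string) (string, error) {
	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return "", err
	}
	defer client.Close()

	model := client.GenerativeModel("gemini-2.5-flash")
	// The system instruction is re-sent as plain prompt text on every call;
	// this is exactly the cost ADR-002 context caching removes.
	model.SystemInstruction = &genai.Content{Parts: []genai.Part{genai.Text(systemPrompt)}}
	model.SetMaxOutputTokens(65536)

	// Synchronous: blocks until the complete JSON response has been generated.
	resp, err := model.GenerateContent(ctx, genai.Text(userPrompt))
	if err != nil {
		return "", err
	}
	if len(resp.Candidates) == 0 || len(resp.Candidates[0].Content.Parts) == 0 {
		return "", fmt.Errorf("empty response from model")
	}
	text, ok := resp.Candidates[0].Content.Parts[0].(genai.Text)
	if !ok {
		return "", fmt.Errorf("unexpected part type")
	}
	return string(text), nil
}
```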
3. Bottleneck Ranking — Estimated Contribution to Total Latency (15K Char Document)
🥇 #1: Vertex AI LLM Generation Time — 85-95% of total latency
Estimated contribution: 32-35s of the 37s total
Description: Each 10K-char chunk must be processed by gemini-2.5-flash to generate a JSON payload containing all entities and relationships. This is a heavy reasoning task — the model must understand the full text, identify entities, determine types, and infer relationships. The generation is entirely synchronous and non-streaming.
Evidence:
- ADR-001 baseline: ~14s for a single standard paragraph (1 chunk)
- ADR-004 estimates: flash-lite could bring this to <3s per chunk
- The Vertex AI API is called in blocking mode — the entire JSON response must be received before extractFullText() returns
- No streaming is used (ADR-003 is still Provisional)
Why 15K takes 37s despite 2 parallel chunks:
- 2 chunks × ~14-18s each, but they run concurrently in an 8-worker pool
- The wall-clock time should be ~max(chunk_latency), yet we see 37s
- This suggests that even with 2 parallel chunks, one chunk may be significantly larger (10K vs 5K+overlap), and the Vertex AI generation time scales non-linearly with prompt size
- Additionally, if Vertex AI's backend is under load, per-chunk latency can spike
Sub-factors:
| Sub-factor | Impact | Status |
|---|---|---|
| System prompt sent with every chunk | Adds TTFT per invocation | ✅ ADR-002 implemented on feature branch, not yet on staging |
| No streaming — full response must complete | No time-to-first-paint benefit | ⏳ ADR-003 Provisional |
| gemini-2.5-flash vs flash-lite | Flash-lite is 3-5× faster | ⏳ ADR-004 Provisional |
| Output token generation (large JSON payloads) | Varies by text complexity | Inherent to the task |
🥈 #2: Rate Limiter Queuing (Historical) — 5-15% historically, ~0% now
Estimated contribution (pre-ADR-010): 2-6s of queuing delay
Estimated contribution (current): <0.5s
Description: The GraphRAGLimiter uses a weighted semaphore + token bucket. Before ADR-010, the weight threshold was hardcoded at 5,000 tokens and max concurrency at 10 slots. A 10K-char chunk (~2,750 tokens) falls under this threshold (weight=1), but before ADR-001 lowered the chunk size from 20K to 10K, chunks were ~5,250 tokens and got weight=2 — halving effective concurrency to 5.
Historical bottleneck (caused 31-min incident):
- 20K-char chunks → ~5,250 tokens → weight=2 → 5 effective concurrent slots
- For large files with 40+ chunks: only 5 chunks process simultaneously
- Each chunk takes ~14-18s → 40 chunks / 5 concurrent × 15s = 120s minimum (but 31 min observed due to head-of-line blocking and no timeouts)
Current state (post ADR-001 + ADR-010):
- 10K-char chunks → ~2,750 tokens → weight=1
- Default 20 concurrent slots (up from 10)
- Weight threshold now 10,000 tokens (up from 5,000, since it's TPM/40)
- Token bucket: 400K TPM → 6,667 tokens/sec → negligible wait for 2,750 tokens
Residual queuing risk: If many jobs fire simultaneously, the semaphore can still saturate. With 20 slots and 8 pool workers, the pool is the limiting factor before the rate limiter is.
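For reference, the weight and token-bucket behavior described above reduces to roughly the following (a sketch assuming golang.org/x/sync/semaphore and golang.org/x/time/rate; the real implementation is ratelimit.go, and the struct layout here is an assumption):

```go
package ratelimitsketch

import (
	"context"

	"golang.org/x/sync/semaphore"
	"golang.org/x/time/rate"
)

// GraphRAGLimiter sketch: a weighted semaphore caps concurrency, a token
// bucket caps tokens per minute.
type GraphRAGLimiter struct {
	slots           *semaphore.Weighted // LLM_MAX_CONCURRENCY slots (20 post-ADR-010)
	bucket          *rate.Limiter       // TPM budget converted to tokens/sec
	weightThreshold int                 // TPM/40, i.e. 10,000 tokens post-ADR-010
}

func NewGraphRAGLimiter(maxConcurrency, tpm int) *GraphRAGLimiter {
	return &GraphRAGLimiter{
		slots:           semaphore.NewWeighted(int64(maxConcurrency)),
		bucket:          rate.NewLimiter(rate.Limit(float64(tpm)/60.0), tpm), // 400K TPM ≈ 6,667 tok/s
		weightThreshold: tpm / 40,
	}
}

// Wait blocks until a concurrency slot and enough token budget are available.
// The returned release func must be called after the LLM call completes.
func (l *GraphRAGLimiter) Wait(ctx context.Context, estimatedTokens int) (release func(), err error) {
	weight := int64(1)
	if estimatedTokens > l.weightThreshold {
		weight = 2 // oversized requests take two slots; the pre-ADR-001 20K chunks hit this path
	}
	if err := l.slots.Acquire(ctx, weight); err != nil {
		return nil, err
	}
	if err := l.bucket.WaitN(ctx, estimatedTokens); err != nil {
		l.slots.Release(weight)
		return nil, err
	}
	return func() { l.slots.Release(weight) }, nil
}
```

With the current parameters, a 2,750-token chunk gets weight 1 and waits at most a few hundred milliseconds on the bucket, which matches the <0.5s estimate above.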
🥉 #3: No Context Caching (TTFT Penalty) — 1-3s per chunk
Estimated contribution: 1-3s per chunk (additive to LLM generation time)
Description: Vertex AI CachedContent (ADR-002) can pre-load the system prompt into Google's memory, reducing Time-To-First-Token from seconds to milliseconds. Without caching, every GenerateContent call must re-process the system prompt. With the proposed 36K-token Deep Ontology (56 golden few-shot examples), this penalty becomes severe — 36K tokens is about 3× the size of a 10K user chunk.
Current state:
- No context caching is active on staging (ADR-002 is implemented on the feature/adr-002-context-caching branch, not merged to development)
- The current system prompt is small (a few hundred tokens) — TTFT penalty is modest
- When the Deep Ontology is deployed without caching, TTFT per chunk could balloon significantly
Impact if Deep Ontology is deployed without caching first:
- Each chunk sends a 36K-token system prompt + 2.75K-token user prompt
- Total prompt size: ~39K tokens → Vertex AI must process this before any output
- TTFT could be 5-10s per chunk, making serial chunks very slow
Note: ADR-002 is explicitly designed to prevent this — caching must be deployed alongside the expanded ontology.
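For context, the caching flow should look roughly like this (a hedged sketch assuming the cloud.google.com/go/vertexai/genai CachedContent API; exact field names and the TTL policy may differ from what is actually on the feature/adr-002-context-caching branch):

```go
package cachesketch

import (
	"context"
	"time"

	"cloud.google.com/go/vertexai/genai"
)

// cachedModel uploads the system prompt once as server-side cached content and
// returns a model handle that references the cache on every GenerateContent call.
func cachedModel(ctx context.Context, client *genai.Client, systemPrompt string) (*genai.GenerativeModel, error) {
	cc, err := client.CreateCachedContent(ctx, &genai.CachedContent{
		Model: "gemini-2.5-flash",
		SystemInstruction: &genai.Content{
			// e.g. the 36K-token Deep Ontology with its golden few-shot examples
			Parts: []genai.Part{genai.Text(systemPrompt)},
		},
		Expiration: genai.ExpireTimeOrTTL{TTL: time.Hour},
	})
	if err != nil {
		return nil, err
	}

	// Each chunk now pays only for the ~2.75K-token user prompt instead of
	// re-processing the full system prompt on every request.
	return client.GenerativeModelFromCachedContent(cc), nil
}
```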
#4: Entity Resolution (Serial, Post-LLM) — <1s for typical docs, grows with graph size
Estimated contribution: <500ms for a 15K doc on an empty graph; 2-5s on a large graph
Description: After the LLM returns JSON, each entity goes through ResolveAndInsert(), which compares the normalized name against ALL existing nodes of the same type using a weighted ensemble (JaroWinkler + Levenshtein + JaccardToken). This is O(n_entities × m_existing_nodes). For an empty graph, this is just O(n) inserts. For a graph with 500K nodes, every new entity pays the price of string comparison against all nodes of its type.
Code path:
// transform_entity.go Process() — after LLM returns
for _, e := range result.Entities {
resolvedID := res.ResolveAndInsert(e.ID, e.Type) // O(m) per entity
...
}
for _, r := range result.Relationships {
res.ResolveRelationship(r.Source, r.Target, ...) // calls ResolveAndInsert ×2
...
}
Scaling concern: The entity resolver has no indexing — it's a linear scan. At 50K Person nodes, each new Person entity requires 50K × 3 algorithm comparisons. The cost-benefit analysis notes this: "monitor resolution span durations and consider indexing or sharding when Person nodes exceed 50K."
For a 15K doc on a fresh instance: negligible (<100ms, ~10-30 entities against an empty graph).
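To make the cost model concrete, the resolution loop is effectively the following (an illustrative stand-in, not the actual resolver; ensembleSimilarity is a placeholder for the weighted JaroWinkler + Levenshtein + JaccardToken score):

```go
package resolversketch

import "strings"

type node struct {
	id   string
	name string // normalized name
}

type resolver struct {
	byType    map[string][]node // all existing nodes, keyed by entity type
	threshold float64
}

// resolveAndInsert returns the ID of an existing node whose name is similar
// enough, or inserts a new node. Cost: O(m) similarity computations, where m
// is the number of existing nodes of the same type.
func (r *resolver) resolveAndInsert(name, entityType string) string {
	normalized := strings.ToLower(strings.TrimSpace(name))

	for _, existing := range r.byType[entityType] { // linear scan, no index
		if ensembleSimilarity(normalized, existing.name) >= r.threshold {
			return existing.id // resolved to an existing node
		}
	}

	n := node{id: entityType + ":" + normalized, name: normalized}
	r.byType[entityType] = append(r.byType[entityType], n)
	return n.id
}

// Placeholder for the weighted ensemble; the real implementation combines
// three string metrics with tuned weights.
func ensembleSimilarity(a, b string) float64 {
	if a == b {
		return 1.0
	}
	return 0.0
}
```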
#5: Chunk Splitting and Result Merging — <100ms
Estimated contribution: <100ms total
Description: Text splitting via langchaingo/textsplitter and JSON merging of chunk results happen entirely in-process with no I/O. Negligible relative to LLM latency.
#6: Missing Progress Tracking on Upload Handler (User-Perceived Latency)
Estimated contribution: 0s actual latency, but unacceptably poor UX
Description: The upload handler (upload.go:132) creates the EntityExtractorTransform without setting ProgressCallback. The ingest handler (ingest.go:108-118) correctly wires progress tracking. For a 31-minute upload job, the frontend shows a static "Extracting entities from uploaded file..." spinner with no progress updates. Even after completion, the message remains stale because CompleteJob() was not updating the Message field (ADR-010 identified this bug).
Status: ADR-010 was Accepted and marked as "implemented on development", but the upload handler code still lacks the ProgressCallback wiring. This may need verification — the fix may be on a different branch.
4. Summary of Bottlenecks — 15K Char Document (37s baseline)
| Bottleneck | Estimated Contribution | Cumulative | Fixable? |
|---|---|---|---|
| Vertex AI LLM generation | 32-35s (86-95%) | 86-95% | ⚠️ Hard — model inference time |
| Context caching (TTFT penalty) | 1-3s (3-8%) | 89-98% | ✅ ADR-002 (not deployed) |
| Rate limiter queuing | <0.5s (<1%) | 90-99% | ✅ Fixed by ADR-001+ADR-010 |
| Entity resolution (post-LLM) | <0.5s (<1%) | 91-100% | ✅ Negligible at current scale |
| Chunk splitting + merging | <0.1s | ~100% | ✅ Negligible |
| TOTAL | ~37s | ~100% | |
5. Quick Wins — Highest Impact per Engineering Hour
Win #1: Deploy ADR-002 Context Caching to Staging
Effort: Low (already implemented on feature/adr-002-context-caching branch)
Impact: Reduces TTFT from ~1-3s to <100ms per chunk. When combined with the 36K Deep Ontology, this also prevents a latency regression (without caching, the larger system prompt would increase per-chunk TTFT to 5-10s).
Estimated latency gain: -2-3s per chunk (5-8% total reduction)
Win #2: Implement ADR-004 Flash-Lite Model Routing (After Quality Benchmarks)
Effort: Medium (2-4 hours + benchmark time)
Impact: Potential 3-5× speedup per chunk. gemini-2.5-flash-lite has 3-5× faster TTFT and higher generation throughput. ADR-004 estimates latency drops from 14s to <3s per chunk.
Risk: Quality degradation. Flash-lite is weaker at multi-hop reasoning. Must pass quality benchmarks first (≥90% recall vs flash).
Estimated latency gain: -25-30s (70-80% reduction) if quality holds
Win #3: Implement ADR-003 SSE Streaming (De-prioritize)
Effort: Very High (8-15 hours for a custom streaming JSON parser)
Impact: Changes perceived latency from ~37s to <1s time-to-first-paint. Does NOT reduce actual extraction time — the same LLM computation happens. UX improvement only.
Recommended: Defer until ADR-001 + ADR-002 + ADR-004 are explored. If actual latency can be brought to <5s, the streaming complexity is not justified.
Win #4: Fix Upload Handler Progress Tracking (ADR-010 Gap)
Effort: ~15 minutes (one-line ProgressCallback wiring)
Impact: Fixes UX for file upload users. Shows chunk progress during 31-minute jobs instead of a static spinner. Already done in the ingest handler — just copy the pattern.
Win #5: Optimize Entity Resolution with Type Index
Effort: Medium (2-3 hours)
Impact: For large graphs (100K+ nodes), reduces ResolveAndInsert from an O(n·m) full scan to comparisons against a small candidate set, e.g. via a per-type trie or Bloom-filter pre-screen. Not urgent for current scale.
Recommended: Defer until Person nodes exceed 50K (per the cost-benefit analysis recommendation). A rough indexing sketch follows.
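A minimal sketch of one possible index (an assumption, not an ADR'd design): bucket existing nodes by type plus a cheap blocking key, so each new entity runs the full similarity ensemble only against a small candidate set. Blocking trades a little recall (names that disagree on the key are never compared) for a large reduction in comparisons.

```go
package indexsketch

import "strings"

// typeIndex buckets node IDs by entity type and a cheap blocking key.
type typeIndex struct {
	// type -> blocking key (first token of the normalized name) -> node IDs
	buckets map[string]map[string][]string
}

func blockingKey(normalizedName string) string {
	if i := strings.IndexByte(normalizedName, ' '); i > 0 {
		return normalizedName[:i]
	}
	return normalizedName
}

// candidates returns only the nodes sharing the blocking key; callers then run
// the full weighted ensemble on this small set instead of every node of the type.
func (idx *typeIndex) candidates(entityType, normalizedName string) []string {
	return idx.buckets[entityType][blockingKey(normalizedName)]
}

func (idx *typeIndex) insert(entityType, normalizedName, id string) {
	if idx.buckets[entityType] == nil {
		idx.buckets[entityType] = make(map[string][]string)
	}
	key := blockingKey(normalizedName)
	idx.buckets[entityType][key] = append(idx.buckets[entityType][key], id)
}
```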
6. Target Latency — What Should Be Achievable
| Scenario | Current (staging) | After ADR-002 | After ADR-002 + ADR-004 (flash-lite) |
|---|---|---|---|
| 5K chars (1 chunk) | ~14s | ~12s | ~3-5s |
| 15K chars (2 chunks, 8-worker pool) | ~37s | ~32-34s | ~5-8s |
| 25K chars (3 chunks, 8-worker pool) | ~37s | ~32-34s | ~6-10s |
| Large file (40+ chunks) | 31 min (pre-fix) | ~2-5 min | ~30-60s |
Realistic target for 15K docs: 5-10 seconds is achievable with:
- ADR-002 context caching deployed (TTFT near-zero)
- ADR-004 flash-lite quality benchmarks passed
- Current rate limiter and pool configuration (20 concurrent, 8 workers)
Without flash-lite (flash-only): 10-15 seconds for 15K docs is a realistic target if context caching is deployed and the rate limiter is not the bottleneck. The 37s staging measurement is an outlier — it reflects the environment's Vertex AI performance characteristics, not the architecture's theoretical limit.
7. Architecture Notes — What the Code Tells Us
The Dominant Pattern: LLM-Bound Synchronous Pipeline
Every byte of latency in this pipeline comes from waiting on Vertex AI. The Go architecture is well-designed for parallelism — pond worker pool, weighted semaphore, token bucket — but it all converges on a single synchronous API call:
// transform_entity.go extractFullText()
result, err := t.LLM.Generate(llmCtx, systemPrompt, prompt, 0.2, true)
This call blocks the goroutine until Vertex AI returns the complete JSON response. All parallelism infrastructure is downstream of this single call. No other component contributes meaningfully to latency at current scale.
The Concurrency Model
pond pool (8 workers, configurable via LLM_POOL_SIZE)
├─ Each worker acquires rate limiter semaphore (weight=1 for 10K chunks)
├─ Token bucket check (negligible at 400K TPM)
└─ Blocks on LLM.Generate() for 12-18s per chunk
Rate limiter: 20 concurrent slots (LLM_MAX_CONCURRENCY)
└─ Pool size (8) is the actual bottleneck, not the rate limiter (20)
The pool creates 8 goroutines. With 20 semaphore slots, each goroutine
acquires a slot instantly. The pool size limits true parallelism.
Key insight: Pool size (8) < Rate limiter semaphore (20) — the pool is the binding constraint on parallelism. For a 2-chunk job, both chunks process concurrently (pool = 8 > 2). For a 40-chunk job, batches of 8 chunks run concurrently, with 5 total batches.
Missing: ADR-010 Upload Handler Fix
The upload handler at upload.go:132 creates:
&EntityExtractorTransform{LLM: a.LLM} // No ProgressCallback!
The ingest handler at ingest.go:108-118 correctly wires:
entityExtractor.ProgressCallback = func(stage string, current, total int) { ... }
This means the ADR-010 decision to add progress tracking parity for file uploads is not yet reflected in the upload handler code. The stale message bug is also present — CompleteJob() is called with the string "File upload complete" instead of a node/edge summary.
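The missing wiring is a direct copy of the ingest pattern; a sketch (the callback body here is illustrative, and the real ingest callback updates the job record rather than logging):

```go
entityExtractor := &EntityExtractorTransform{LLM: a.LLM}
entityExtractor.ProgressCallback = func(stage string, current, total int) {
	// Illustrative body: the actual fix should persist stage/current/total to
	// the job status (as ingest.go does) so the frontend can render progress.
	slog.Info("upload extraction progress", "stage", stage, "current", current, "total", total)
}
```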
8. Recommendations
| Priority | Action | Effort | Impact | Blocker? |
|---|---|---|---|---|
| P0 | Deploy ADR-002 context caching to staging | Merge feature branch | -2-3s per chunk, enables Deep Ontology | None |
| P0 | Run ADR-004 flash-lite quality benchmarks | 1 day | Gate decision on 3-5× speedup | Quality unknown |
| P1 | Fix upload handler ProgressCallback | 15 min | UX parity for file uploads | None |
| P1 | Fix stale CompleteJob message | 5 min | Shows actual graph stats on completion | None |
| P2 | Profile Vertex AI latency with/without caching | 1 hour | Data-driven decision on ADR-002 value | None |
| P3 | Add entity resolver index (when graph >50K nodes) | 2-3 hours | Prevents O(n²) resolution bottleneck | Not urgent |
| Defer | ADR-003 SSE streaming | 8-15 hours | UX only — re-evaluate if latency >5s after fixes | Parser risk |
Appendix: Rate Limiter Parameter Evolution
| Parameter | Pre-ADR-001/010 (incident) | Current (post-ADR-010) |
|---|---|---|
| Chunk size | 20,000 chars | 10,000 chars |
| Prompt tokens per chunk | ~5,250 | ~2,750 |
| Weight threshold | 5,000 tokens | 10,000 tokens (TPM/40) |
| Weight per chunk | 2 (halved slots) | 1 (full slots) |
| Max concurrency | 10 slots | 20 slots |
| Effective concurrency | 5 (10/2) | 20 (20/1) |
| TPM limit | 200,000 | 400,000 |
| Token bucket rate | 3,333 tok/s | 6,667 tok/s |
| Pool size | 8 (hardcoded) | 8 (configurable via LLM_POOL_SIZE) |
| LLM request timeout | None (30 min Cloud Run) | 120s per chunk (LLM_REQUEST_TIMEOUT) |
| Logging | stdout JSON (not in Cloud Logging) | slog-gcp (structured, queryable) |