Pipeline Latency Analysis — PodPedia Entity Extraction
Date: 2026-05-14
Scope: backend/handlers/ — transform_entity.go, llm_vertex.go, llm_openai_compatible.go, ratelimit.go, upload.go, ingest.go
Based on: ADR-001 through ADR-010, experiment data, code audit
1. Current Performance — Latency by Payload Size
Data from the ADR-001 live A/B experiment (staging, 10K threshold, n=15 per payload size):
| Payload Size | Without 10K Thresh (sequential) | With 10K Thresh (chunked/parallel) | Δ |
|---|---|---|---|
| 5K chars | ~14s (single chunk, no parallel benefit) | ~14s | None — under threshold |
| 15K chars | 45.8s mean | 37.0s mean | -19.2% (p=0.31, not stat. sig.) |
| 25K chars | 116.2s mean | 37.0s mean | -68.2% (p<0.001) |
ADR-010 production incident: 31-minute entity extraction job from a file upload (very large document, >400K chars). Root causes: (1) no LLM call timeout, (2) rate limiter weight=2 halving effective concurrency from 10→5, (3) no progress tracking.
Key takeaway: The 37s staging latency for a modest 15K document is dominated by a single factor — Vertex AI generation time. Even with parallel chunking, the pipeline is bottlenecked on LLM inference.
2. Pipeline Architecture — Summary
Source (TextUpload/Podcast)
└→ EntityExtractorTransform.Process(doc)
├─ doc.Content ≤ 10K chars → extractFullText() (single LLM call)
└─ doc.Content > 10K chars → extractChunked()
├─ Split into 10K-char chunks (200-char overlap)
├─ Fan out across alitto/pond pool (default 8 workers)
└─ Each worker:
├─ GraphRAGLimiter.Wait(token_estimate) ← rate limiter
├─ context.WithTimeout(120s) ← ADR-010 timeout
└─ LLM.Generate(system_prompt, chunk_prompt) ← Vertex AI
└─ Merge all chunk results
└─ For each entity: EntityResolver.ResolveAndInsert() ← O(n·m) string matching
└─ For each relationship: ResolveRelationship()
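In code form, the chunked branch reduces to roughly the following sketch (illustrative only, not the actual transform_entity.go implementation; limiterWait and callLLM stand in for GraphRAGLimiter.Wait and LLM.Generate, and the pool usage assumes the alitto/pond v1 API):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/alitto/pond"
)

// processChunks fans chunk prompts out across a bounded worker pool. Each task
// waits on the rate limiter, applies the 120s ADR-010 timeout, and then blocks
// on a synchronous LLM call.
func processChunks(
	ctx context.Context,
	chunks []string,
	limiterWait func(ctx context.Context, tokens int) error, // stands in for GraphRAGLimiter.Wait
	callLLM func(ctx context.Context, prompt string) (string, error), // stands in for LLM.Generate
) []string {
	results := make([]string, len(chunks))

	pool := pond.New(8, len(chunks)) // 8 workers, mirroring LLM_POOL_SIZE
	defer pool.StopAndWait()
	group := pool.Group()

	for i, chunk := range chunks {
		i, chunk := i, chunk // capture loop variables for the closure
		group.Submit(func() {
			// Rough token estimate (~4 chars per token) used as the limiter weight.
			if err := limiterWait(ctx, len(chunk)/4); err != nil {
				return
			}

			// Per-chunk timeout introduced by ADR-010.
			llmCtx, cancel := context.WithTimeout(ctx, 120*time.Second)
			defer cancel()

			// Synchronous, non-streaming call: the goroutine blocks here until
			// the full JSON response has been generated.
			out, err := callLLM(llmCtx, chunk)
			if err != nil {
				return // real code records the error and keeps the other chunks
			}
			results[i] = out
		})
	}

	group.Wait() // all chunk results are in; merging happens downstream
	return results
}

func main() {
	out := processChunks(context.Background(),
		[]string{"chunk one ...", "chunk two ..."},
		func(ctx context.Context, tokens int) error { return nil },            // no-op limiter
		func(ctx context.Context, p string) (string, error) { return p, nil }, // echo "LLM"
	)
	fmt.Println(len(out), "chunks processed")
}
```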
The LLM call path for Vertex AI:
vertexLLM.Generate()
├─ Allocate GenerativeModel(gemini-2.5-flash)
├─ Set system instruction (full prompt text, NOT cached)
├─ Set max output tokens = 65536
└─ model.GenerateContent(ctx, userPrompt) ← synchronous, blocks goroutine
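For reference, the blocking call shape looks roughly like this (a minimal sketch assuming the cloud.google.com/go/vertexai/genai SDK; the actual wrapper lives in llm_vertex.go and its exact signature may differ):

```go
package llmsketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/vertexai/genai"
)

// generate mirrors the call path above: one client, one model, and one
// synchronous GenerateContent call per chunk.
func generate(ctx context.Context, projectID, location, systemPrompt, userPrompt string) (string, error) {
	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return "", err
	}
	defer client.Close()

	model := client.GenerativeModel("gemini-2.5-flash")
	// The system instruction is re-sent as plain prompt text on every call;
	// this is exactly the cost ADR-002 context caching removes.
	model.SystemInstruction = &genai.Content{Parts: []genai.Part{genai.Text(systemPrompt)}}
	model.SetMaxOutputTokens(65536)

	// Synchronous: blocks until the complete JSON response has been generated.
	resp, err := model.GenerateContent(ctx, genai.Text(userPrompt))
	if err != nil {
		return "", err
	}
	if len(resp.Candidates) == 0 || len(resp.Candidates[0].Content.Parts) == 0 {
		return "", fmt.Errorf("empty response from model")
	}
	text, ok := resp.Candidates[0].Content.Parts[0].(genai.Text)
	if !ok {
		return "", fmt.Errorf("unexpected part type")
	}
	return string(text), nil
}
```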
3. Bottleneck Ranking — Estimated Contribution to Total Latency (15K Char Document)
🥇 #1: Vertex AI LLM Generation Time — 85-95% of total latency
Estimated contribution: 32-35s of the 37s total
Description: Each 10K-char chunk must be processed by gemini-2.5-flash to generate a JSON payload containing all entities and relationships. This is a heavy reasoning task — the model must understand the full text, identify entities, determine types, and infer relationships. The generation is entirely synchronous and non-streaming.
Evidence:
- ADR-001 baseline: ~14s for a single standard paragraph (1 chunk)
- ADR-004 estimates: flash-lite could bring this to <3s per chunk
- The Vertex AI API is called in blocking mode — the entire JSON response must be received before extractFullText() returns
- No streaming is used (ADR-003 is still Provisional)
Why 15K takes 37s despite 2 parallel chunks:
- 2 chunks × ~14-18s each, but they run concurrently in an 8-worker pool
- The wall-clock time should be ~max(chunk_latency), yet we see 37s
- This suggests that even with 2 parallel chunks, one chunk may be significantly larger (10K vs 5K+overlap), and the Vertex AI generation time scales non-linearly with prompt size
- Additionally, if Vertex AI's backend is under load, per-chunk latency can spike
Sub-factors:
| Sub-factor | Impact | Status |
|---|---|---|
| System prompt sent with every chunk | Adds TTFT per invocation | ✅ ADR-002 implemented on feature branch, not yet on staging |
| No streaming — full response must complete | No time-to-first-paint benefit | ⏳ ADR-003 Provisional |
| gemini-2.5-flash vs flash-lite | Flash-lite is 3-5× faster | ⏳ ADR-004 Provisional |
| Output token generation (large JSON payloads) | Varies by text complexity | Inherent to the task |
🥈 #2: Rate Limiter Queuing (Historical) — 5-15% historically, ~0% now
Estimated contribution (pre-ADR-010): 2-6s of queuing delay
Estimated contribution (current): <0.5s
Description: The GraphRAGLimiter uses a weighted semaphore + token bucket. Before ADR-010, the weight threshold was hardcoded at 5,000 tokens and max concurrency at 10 slots. A 10K-char chunk (~2,750 tokens) falls under this threshold (weight=1), but before ADR-001 lowered the chunk size from 20K to 10K, chunks were ~5,250 tokens and got weight=2 — halving effective concurrency to 5.
Historical bottleneck (caused 31-min incident):
- 20K-char chunks → ~5,250 tokens → weight=2 → 5 effective concurrent slots
- For large files with 40+ chunks: only 5 chunks process simultaneously
- Each chunk takes ~14-18s → 40 chunks / 5 concurrent × 15s = 120s minimum (but 31 min observed due to head-of-line blocking and no timeouts)
Current state (post ADR-001 + ADR-010):
- 10K-char chunks → ~2,750 tokens → weight=1
- Default 20 concurrent slots (up from 10)
- Weight threshold now 10,000 tokens (up from 5,000, since it's TPM/40)
- Token bucket: 400K TPM → 6,667 tokens/sec → negligible wait for 2,750 tokens
Residual queuing risk: If many jobs fire simultaneously, the semaphore can still saturate. With 20 slots and 8 pool workers, the pool is the limiting factor before the rate limiter is.
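For reference, the weight and token-bucket behavior described above reduces to roughly the following (a sketch assuming golang.org/x/sync/semaphore and golang.org/x/time/rate; the real implementation is ratelimit.go, and the struct layout here is an assumption):

```go
package ratelimitsketch

import (
	"context"

	"golang.org/x/sync/semaphore"
	"golang.org/x/time/rate"
)

// GraphRAGLimiter sketch: a weighted semaphore caps concurrency, a token
// bucket caps tokens per minute.
type GraphRAGLimiter struct {
	slots           *semaphore.Weighted // LLM_MAX_CONCURRENCY slots (20 post-ADR-010)
	bucket          *rate.Limiter       // TPM budget converted to tokens/sec
	weightThreshold int                 // TPM/40, i.e. 10,000 tokens post-ADR-010
}

func NewGraphRAGLimiter(maxConcurrency, tpm int) *GraphRAGLimiter {
	return &GraphRAGLimiter{
		slots:           semaphore.NewWeighted(int64(maxConcurrency)),
		bucket:          rate.NewLimiter(rate.Limit(float64(tpm)/60.0), tpm), // 400K TPM ≈ 6,667 tok/s
		weightThreshold: tpm / 40,
	}
}

// Wait blocks until a concurrency slot and enough token budget are available.
// The returned release func must be called after the LLM call completes.
func (l *GraphRAGLimiter) Wait(ctx context.Context, estimatedTokens int) (release func(), err error) {
	weight := int64(1)
	if estimatedTokens > l.weightThreshold {
		weight = 2 // oversized requests take two slots; the pre-ADR-001 20K chunks hit this path
	}
	if err := l.slots.Acquire(ctx, weight); err != nil {
		return nil, err
	}
	if err := l.bucket.WaitN(ctx, estimatedTokens); err != nil {
		l.slots.Release(weight)
		return nil, err
	}
	return func() { l.slots.Release(weight) }, nil
}
```

With the current parameters, a 2,750-token chunk gets weight 1 and waits at most a few hundred milliseconds on the bucket, which matches the <0.5s estimate above.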
🥉 #3: No Context Caching (TTFT Penalty) — 1-3s per chunk
Estimated contribution: 1-3s per chunk (additive to LLM generation time)
Description: Vertex AI CachedContent (ADR-002) can pre-load the system prompt into Google's memory, reducing Time-To-First-Token from seconds to milliseconds. Without caching, every GenerateContent call must re-process the system prompt. With the proposed 36K-token Deep Ontology (56 golden few-shot examples), this penalty becomes severe — 36K tokens is about 3× the size of a 10K user chunk.
Current state:
- No context caching is active on staging (ADR-002 is implemented on the feature/adr-002-context-caching branch, not merged to development)
- The current system prompt is small (a few hundred tokens) — TTFT penalty is modest
- When the Deep Ontology is deployed without caching, TTFT per chunk could balloon significantly
Impact if Deep Ontology is deployed without caching first:
- Each chunk sends a 36K-token system prompt + 2.75K-token user prompt
- Total prompt size: ~39K tokens → Vertex AI must process this before any output
- TTFT could be 5-10s per chunk, making serial chunks very slow
Note: ADR-002 is explicitly designed to prevent this — caching must be deployed alongside the expanded ontology.
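For context, the caching flow should look roughly like this (a hedged sketch assuming the cloud.google.com/go/vertexai/genai CachedContent API; exact field names and the TTL policy may differ from what is actually on the feature/adr-002-context-caching branch):

```go
package cachesketch

import (
	"context"
	"time"

	"cloud.google.com/go/vertexai/genai"
)

// cachedModel uploads the system prompt once as server-side cached content and
// returns a model handle that references the cache on every GenerateContent call.
func cachedModel(ctx context.Context, client *genai.Client, systemPrompt string) (*genai.GenerativeModel, error) {
	cc, err := client.CreateCachedContent(ctx, &genai.CachedContent{
		Model: "gemini-2.5-flash",
		SystemInstruction: &genai.Content{
			// e.g. the 36K-token Deep Ontology with its golden few-shot examples
			Parts: []genai.Part{genai.Text(systemPrompt)},
		},
		Expiration: genai.ExpireTimeOrTTL{TTL: time.Hour},
	})
	if err != nil {
		return nil, err
	}

	// Each chunk now pays only for the ~2.75K-token user prompt instead of
	// re-processing the full system prompt on every request.
	return client.GenerativeModelFromCachedContent(cc), nil
}
```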
#4: Entity Resolution (Serial, Post-LLM) — <1s for typical docs, grows with graph size
Estimated contribution: <500ms for a 15K doc on an empty graph; 2-5s on a large graph
Description: After the LLM returns JSON, each entity goes through ResolveAndInsert(), which compares the normalized name against ALL existing nodes of the same type using a weighted ensemble (JaroWinkler + Levenshtein + JaccardToken). This is O(n_entities × m_existing_nodes). For an empty graph, this is just O(n) inserts. For a graph with 500K nodes, every new entity pays the price of string comparison against all nodes of its type.
Code path:
// transform_entity.go Process() — after LLM returns
for _, e := range result.Entities {
resolvedID := res.ResolveAndInsert(e.ID, e.Type) // O(m) per entity
...
}
for _, r := range result.Relationships {
res.ResolveRelationship(r.Source, r.Target, ...) // calls ResolveAndInsert ×2
...
}
Scaling concern: The entity resolver has no indexing — it's a linear scan. At 50K Person nodes, each new Person entity requires 50K × 3 algorithm comparisons. The cost-benefit analysis notes this: "monitor resolution span durations and consider indexing or sharding when Person nodes exceed 50K."
For a 15K doc on a fresh instance: negligible (<100ms, ~10-30 entities against an empty graph).
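To make the cost model concrete, the resolution loop is effectively the following (an illustrative stand-in, not the actual resolver; ensembleSimilarity is a placeholder for the weighted JaroWinkler + Levenshtein + JaccardToken score):

```go
package resolversketch

import "strings"

type node struct {
	id   string
	name string // normalized name
}

type resolver struct {
	byType    map[string][]node // all existing nodes, keyed by entity type
	threshold float64
}

// resolveAndInsert returns the ID of an existing node whose name is similar
// enough, or inserts a new node. Cost: O(m) similarity computations, where m
// is the number of existing nodes of the same type.
func (r *resolver) resolveAndInsert(name, entityType string) string {
	normalized := strings.ToLower(strings.TrimSpace(name))

	for _, existing := range r.byType[entityType] { // linear scan, no index
		if ensembleSimilarity(normalized, existing.name) >= r.threshold {
			return existing.id // resolved to an existing node
		}
	}

	n := node{id: entityType + ":" + normalized, name: normalized}
	r.byType[entityType] = append(r.byType[entityType], n)
	return n.id
}

// Placeholder for the weighted ensemble; the real implementation combines
// three string metrics with tuned weights.
func ensembleSimilarity(a, b string) float64 {
	if a == b {
		return 1.0
	}
	return 0.0
}
```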
#5: Chunk Splitting and Result Merging — <100ms
Estimated contribution: <100ms total
Description: Text splitting via langchaingo/textsplitter and JSON merging of chunk results happen entirely in-process with no I/O. Negligible relative to LLM latency.
#6: Missing Progress Tracking on Upload Handler (User-Perceived Latency)
Estimated contribution: 0s actual latency, but unacceptably poor UX
Description: The upload handler (upload.go:132) creates the EntityExtractorTransform without setting ProgressCallback. The ingest handler (ingest.go:108-118) correctly wires progress tracking. For a 31-minute upload job, the frontend shows a static "Extracting entities from uploaded file..." spinner with no progress updates. Even after completion, the message remains stale because CompleteJob() was not updating the Message field (ADR-010 identified this bug).
Status: ADR-010 was Accepted and marked as "implemented on development", but the upload handler code still lacks the ProgressCallback wiring. This may need verification — the fix may be on a different branch.
4. Summary of Bottlenecks — 15K Char Document (37s baseline)
| Bottleneck | Estimated Contribution | Cumulative | Fixable? |
|---|---|---|---|
| Vertex AI LLM generation | 32-35s (86-95%) | 86-95% | ⚠️ Hard — model inference time |
| Context caching (TTFT penalty) | 1-3s (3-8%) | 89-98% | ✅ ADR-002 (not deployed) |
| Rate limiter queuing | <0.5s (<1%) | 90-99% | ✅ Fixed by ADR-001+ADR-010 |
| Entity resolution (post-LLM) | <0.5s (<1%) | 91-100% | ✅ Negligible at current scale |
| Chunk splitting + merging | <0.1s | ~100% | ✅ Negligible |
| TOTAL | ~37s | ~100% | |
5. Quick Wins — Highest Impact per Engineering Hour
Win #1: Deploy ADR-002 Context Caching to Staging
Effort: Low (already implemented on feature/adr-002-context-caching branch)
Impact: Reduces TTFT from ~1-3s to <100ms per chunk. When combined with the 36K Deep Ontology, this also prevents a latency regression (without caching, the larger system prompt would increase per-chunk TTFT to 5-10s).
Estimated latency gain: -2-3s per chunk (5-8% total reduction)
Win #2: Implement ADR-004 Flash-Lite Model Routing (After Quality Benchmarks)
Effort: Medium (2-4 hours + benchmark time)
Impact: Potential 3-5× speedup per chunk. gemini-2.5-flash-lite has 3-5× faster TTFT and higher generation throughput. ADR-004 estimates latency drops from 14s to <3s per chunk.
Risk: Quality degradation. Flash-lite is weaker at multi-hop reasoning. Must pass quality benchmarks first (≥90% recall vs flash).
Estimated latency gain: -25-30s (70-80% reduction) if quality holds
Win #3: Implement ADR-003 SSE Streaming (De-prioritize)
Effort: Very High (8-15 hours for a custom streaming JSON parser)
Impact: Changes perceived latency from ~37s to <1s time-to-first-paint. Does NOT reduce actual extraction time — the same LLM computation happens. UX improvement only.
Recommended: Defer until ADR-001 + ADR-002 + ADR-004 are explored. If actual latency can be brought to <5s, the streaming complexity is not justified.
Win #4: Fix Upload Handler Progress Tracking (ADR-010 Gap)
Effort: ~15 minutes (one-line ProgressCallback wiring)
Impact: Fixes UX for file upload users. Shows chunk progress during 31-minute jobs instead of a static spinner. Already done in the ingest handler — just copy the pattern.
Win #5: Optimize Entity Resolution with Type Index
Effort: Medium (2-3 hours)
Impact: For large graphs (100K+ nodes), reduces ResolveAndInsert from an O(n·m) full scan to comparisons against a small candidate set, e.g. via a per-type trie or Bloom-filter pre-screen. Not urgent for current scale.
Recommended: Defer until Person nodes exceed 50K (per the cost-benefit analysis recommendation). A rough indexing sketch follows.
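A minimal sketch of one possible index (an assumption, not an ADR'd design): bucket existing nodes by type plus a cheap blocking key, so each new entity runs the full similarity ensemble only against a small candidate set. Blocking trades a little recall (names that disagree on the key are never compared) for a large reduction in comparisons.

```go
package indexsketch

import "strings"

// typeIndex buckets node IDs by entity type and a cheap blocking key.
type typeIndex struct {
	// type -> blocking key (first token of the normalized name) -> node IDs
	buckets map[string]map[string][]string
}

func blockingKey(normalizedName string) string {
	if i := strings.IndexByte(normalizedName, ' '); i > 0 {
		return normalizedName[:i]
	}
	return normalizedName
}

// candidates returns only the nodes sharing the blocking key; callers then run
// the full weighted ensemble on this small set instead of every node of the type.
func (idx *typeIndex) candidates(entityType, normalizedName string) []string {
	return idx.buckets[entityType][blockingKey(normalizedName)]
}

func (idx *typeIndex) insert(entityType, normalizedName, id string) {
	if idx.buckets[entityType] == nil {
		idx.buckets[entityType] = make(map[string][]string)
	}
	key := blockingKey(normalizedName)
	idx.buckets[entityType][key] = append(idx.buckets[entityType][key], id)
}
```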
6. Target Latency — What Should Be Achievable
| Scenario | Current (staging) | After ADR-002 | After ADR-002 + ADR-004 (flash-lite) |
|---|---|---|---|
| 5K chars (1 chunk) | ~14s | ~12s | ~3-5s |
| 15K chars (2 chunks, 8-worker pool) | ~37s | ~32-34s | ~5-8s |
| 25K chars (3 chunks, 8-worker pool) | ~37s | ~32-34s | ~6-10s |
| Large file (40+ chunks) | 31 min (pre-fix) | ~2-5 min | ~30-60s |
Realistic target for 15K docs: 5-10 seconds is achievable with:
- ADR-002 context caching deployed (TTFT near-zero)
- ADR-004 flash-lite quality benchmarks passed
- Current rate limiter and pool configuration (20 concurrent, 8 workers)
Without flash-lite (flash-only): 10-15 seconds for 15K docs is a realistic target if context caching is deployed and the rate limiter is not the bottleneck. The 37s staging measurement is an outlier — it reflects the environment's Vertex AI performance characteristics, not the architecture's theoretical limit.
7. Architecture Notes — What the Code Tells Us
The Dominant Pattern: LLM-Bound Synchronous Pipeline
Every byte of latency in this pipeline comes from waiting on Vertex AI. The Go architecture is well-designed for parallelism — pond worker pool, weighted semaphore, token bucket — but it all converges on a single synchronous API call:
// transform_entity.go extractFullText()
result, err := t.LLM.Generate(llmCtx, systemPrompt, prompt, 0.2, true)
This call blocks the goroutine until Vertex AI returns the complete JSON response. All parallelism infrastructure is downstream of this single call. No other component contributes meaningfully to latency at current scale.
The Concurrency Model
pond pool (8 workers, configurable via LLM_POOL_SIZE)
├─ Each worker acquires rate limiter semaphore (weight=1 for 10K chunks)
├─ Token bucket check (negligible at 400K TPM)
└─ Blocks on LLM.Generate() for 12-18s per chunk
Rate limiter: 20 concurrent slots (LLM_MAX_CONCURRENCY)
└─ Pool size (8) is the actual bottleneck, not the rate limiter (20)
The pool creates 8 goroutines. With 20 semaphore slots, each goroutine
acquires a slot instantly. The pool size limits true parallelism.
Key insight: Pool size (8) < Rate limiter semaphore (20) — the pool is the binding constraint on parallelism. For a 2-chunk job, both chunks process concurrently (pool = 8 > 2). For a 40-chunk job, batches of 8 chunks run concurrently, with 5 total batches.
Missing: ADR-010 Upload Handler Fix
The upload handler at upload.go:132 creates:
&EntityExtractorTransform{LLM: a.LLM} // No ProgressCallback!
The ingest handler at ingest.go:108-118 correctly wires:
entityExtractor.ProgressCallback = func(stage string, current, total int) { ... }
This means the ADR-010 decision to add progress tracking parity for file uploads is not yet reflected in the upload handler code. The stale message bug is also present — CompleteJob() is called with the string "File upload complete" instead of a node/edge summary.
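The missing wiring is a direct copy of the ingest pattern; a sketch (the callback body here is illustrative, and the real ingest callback updates the job record rather than logging):

```go
entityExtractor := &EntityExtractorTransform{LLM: a.LLM}
entityExtractor.ProgressCallback = func(stage string, current, total int) {
	// Illustrative body: the actual fix should persist stage/current/total to
	// the job status (as ingest.go does) so the frontend can render progress.
	slog.Info("upload extraction progress", "stage", stage, "current", current, "total", total)
}
```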
8. Recommendations
| Priority | Action | Effort | Impact | Blocker? |
|---|---|---|---|---|
| P0 | Deploy ADR-002 context caching to staging | Merge feature branch | -2-3s per chunk, enables Deep Ontology | None |
| P0 | Run ADR-004 flash-lite quality benchmarks | 1 day | Gate decision on 3-5× speedup | Quality unknown |
| P1 | Fix upload handler ProgressCallback | 15 min | UX parity for file uploads | None |
| P1 | Fix stale CompleteJob message | 5 min | Shows actual graph stats on completion | None |
| P2 | Profile Vertex AI latency with/without caching | 1 hour | Data-driven decision on ADR-002 value | None |
| P3 | Add entity resolver index (when graph >50K nodes) | 2-3 hours | Prevents O(n²) resolution bottleneck | Not urgent |
| Defer | ADR-003 SSE streaming | 8-15 hours | UX only — re-evaluate if latency >5s after fixes | Parser risk |
Appendix: Rate Limiter Parameter Evolution
| Parameter | Pre-ADR-001/010 (incident) | Current (post-ADR-010) |
|---|---|---|
| Chunk size | 20,000 chars | 10,000 chars |
| Prompt tokens per chunk | ~5,250 | ~2,750 |
| Weight threshold | 5,000 tokens | 10,000 tokens (TPM/40) |
| Weight per chunk | 2 (halved slots) | 1 (full slots) |
| Max concurrency | 10 slots | 20 slots |
| Effective concurrency | 5 (10/2) | 20 (20/1) |
| TPM limit | 200,000 | 400,000 |
| Token bucket rate | 3,333 tok/s | 6,667 tok/s |
| Pool size | 8 (hardcoded) | 8 (configurable via LLM_POOL_SIZE) |
| LLM request timeout | None (30 min Cloud Run) | 120s per chunk (LLM_REQUEST_TIMEOUT) |
| Logging | stdout JSON (not in Cloud Logging) | slog-gcp (structured, queryable) |