Experiment Report: ADR-001 — 10K Character Threshold Live A/B Latency Test
| Field | Value |
|---|---|
| Experiment ID | ADR-001-live-ab |
| Date | 2026-05-13 |
| Author | @coding-agent |
| PR | feature/experiment-adr001 |
| Commit (baseline) | Production deployment (podpedia-pipeline, 20K threshold) |
| Commit (experiment) | Staging deployment (podpedia-pipeline-staging, 10K threshold) |
| Status | ⚠️ inconclusive — low trial count |
1. Hypothesis
"Lowering the parallel graph extraction chunking threshold from 20K characters to 10K characters (ADR-001) will reduce wall-clock extraction latency for medium-to-large documents."
Rationale (from ADR-001)
Documents exceeding 10K characters will trigger parallel chunking across the 8-worker pool instead of being processed sequentially. The 10K window provides a sufficient semantic aperture (≈1,500–2,000 words) for the LLM to resolve relationships without orphaned nodes.
Success Criteria (quantified)
- Statistically significant latency reduction (KS test p < 0.05) for documents at or above the 10K threshold
- Mean latency reduction ≥ 15% at the 15K payload size (between old and new thresholds)
- No regression for documents below the 10K threshold (5K payload)
Failure Criteria
- Latency increases at any payload size
- Inconclusive KS test (p ≥ 0.05) for the critical 15K payload after sufficient trials
2. Variables
Independent Variable
Chunking threshold: Production (20K threshold, sequential for documents up to 20K) vs. Staging (10K threshold, parallel for documents over 10K).
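As a minimal sketch of how the two thresholds route the three test payloads, assuming chunking triggers when a document's length exceeds the threshold (the wording used in the ADR-001 rationale). `uses_parallel_chunking` and `routing_table` are hypothetical helpers for illustration, not the pipeline's actual functions:

```python
# Hypothetical helpers: the real pipeline's routing code is not shown in this report.
PRODUCTION_THRESHOLD = 20_000  # characters (baseline)
STAGING_THRESHOLD = 10_000     # characters (ADR-001)


def uses_parallel_chunking(doc_chars: int, threshold: int) -> bool:
    """Assumption: documents longer than the threshold are chunked in parallel."""
    return doc_chars > threshold


def routing_table(payload_sizes):
    """Map each payload size to the path it would take in each environment."""
    table = {}
    for size in payload_sizes:
        table[size] = {
            "production": "parallel" if uses_parallel_chunking(size, PRODUCTION_THRESHOLD) else "sequential",
            "staging": "parallel" if uses_parallel_chunking(size, STAGING_THRESHOLD) else "sequential",
        }
    return table


if __name__ == "__main__":
    for size, paths in routing_table([5_000, 15_000, 25_000]).items():
        print(size, paths)
```

Under this assumption only the 15K payload takes a different path in the two environments, which is why it is treated as the critical test.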
Dependent Variables (metrics measured)
- p50 latency (ms per request)
- p95 latency (ms per request)
- p99 latency (ms per request)
- Mean latency (ms per request)
Controlled Variables
- Payload sizes: 5,000 / 15,000 / 25,000 characters — representing below threshold, between thresholds, and above both thresholds
- Environment: Cloud Run (production: podpedia-pipeline-n6xek74vya-uc.a.run.app, staging: podpedia-pipeline-staging-446893038996.us-central1.run.app)
- Concurrency: 15 requests in flight at once per configuration (dispatched sequentially, executed in parallel)
- Endpoint: Full pipeline ingestion (ingest → entity resolution → graph → sink)
Uncontrolled Variables
- Environment differences: Production and staging may run on different Cloud Run instance profiles (CPU, concurrency, cold starts). This is a live A/B, not a controlled benchmark.
- Trial count: Only 15 trials per payload/environment combination. Low statistical power.
- Ordering: Trials were fired concurrently, so late arrivals may reflect queuing effects rather than pure extraction latency.
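One way to check for the ordering effect described above is a rank correlation between trial number and latency. The sketch below is illustrative only: it assumes the trial number reflects dispatch order, and it uses the 25K latencies from the Trial Data section. `spearman_rho` is a hypothetical helper written for this report, not part of the harness.

```python
def spearman_rho(values):
    """Spearman rank correlation between trial order (1..n) and the values.

    Uses the no-ties shortcut rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    so it requires all latencies to be distinct.
    """
    n = len(values)
    if len(set(values)) != n:
        raise ValueError("shortcut formula assumes no tied values")
    # rank_of[v] is the 1-based rank of latency v among all trials
    rank_of = {v: r for r, v in enumerate(sorted(values), start=1)}
    d_squared = sum((order - rank_of[v]) ** 2
                    for order, v in enumerate(values, start=1))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# 25K-character latencies in trial order (ms), copied from the Trial Data section
production_25k = [32601, 40547, 52119, 78735, 72996, 93859, 108864, 113539,
                  122871, 146412, 147644, 165812, 182594, 191788, 192148]
staging_25k = [38801, 35371, 33101, 47478, 41060, 46438, 31908, 33061,
               38669, 31664, 43573, 29338, 42584, 29100, 32356]

print("production rho:", round(spearman_rho(production_25k), 3))
print("staging rho:", round(spearman_rho(staging_25k), 3))
```

A coefficient close to 1 means later trials were consistently slower, which would be the signature of queuing rather than of extraction cost.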
3. Methodology
Benchmark Harness
A custom HTTP load test fired 15 concurrent requests at each payload size, targeting both the production and staging endpoints. Each request submitted a text payload of the specified character count and measured wall-clock latency from dispatch to full response.
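The harness itself is not included in this report. A minimal sketch of the approach described above might look like the following, where `run_trials` and the caller-supplied `send` function are hypothetical names and the network call is replaced by a stand-in:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_trials(send: Callable[[str], None], payload: str,
               n_requests: int = 15) -> List[float]:
    """Fire n_requests concurrently and return wall-clock latencies in ms.

    `send` is supplied by the caller and must block until the full response
    has arrived (for example, a wrapper around an HTTP POST).
    """
    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send(payload)
        return (time.perf_counter() - start) * 1000.0

    # One worker per request so that all requests are in flight together.
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        return list(pool.map(timed_call, range(n_requests)))


if __name__ == "__main__":
    # Stand-in for the real endpoint call: sleeps instead of doing network I/O.
    def fake_send(text: str) -> None:
        time.sleep(0.01)

    latencies = run_trials(fake_send, "x" * 5_000)
    print(len(latencies), "trials, fastest", round(min(latencies), 1), "ms")
```

Because every request starts at roughly the same time, each measured latency includes any time the request spends queued on the server, which is relevant to the ordering caveat listed under Uncontrolled Variables.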
Sample Size
- Trials per configuration: 15
- Total data points: 15 trials × 3 payloads × 2 environments = 90
Statistical Tests Applied
- Two-sample KS test (approximate) on latency distributions
- Cohen's d for effect size
- Percentile analysis: p50, p95, p99
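The analysis script is not part of this report. The sketch below shows one way the tests listed above could be computed using only the standard library. It assumes the "approximate" KS p-value refers to the asymptotic series with the small-sample correction given in Numerical Recipes, and that Cohen's d uses the pooled sample standard deviation; both are assumptions, and the function names are illustrative.

```python
import math
from statistics import mean, variance


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    d = 0.0
    for x in list(a) + list(b):
        fa = sum(v <= x for v in a) / len(a)  # fraction of a that is <= x
        fb = sum(v <= x for v in b) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue_approx(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value with a small-sample correction."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        return 1.0  # the series converges poorly here and the p-value is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def cohens_d(experiment, baseline):
    """Mean difference divided by the pooled sample standard deviation."""
    n_e, n_b = len(experiment), len(baseline)
    pooled_var = ((n_e - 1) * variance(experiment)
                  + (n_b - 1) * variance(baseline)) / (n_e + n_b - 2)
    return (mean(experiment) - mean(baseline)) / math.sqrt(pooled_var)


# 25K-character latencies (ms), copied from the Trial Data section
production_25k = [32601, 40547, 52119, 78735, 72996, 93859, 108864, 113539,
                  122871, 146412, 147644, 165812, 182594, 191788, 192148]
staging_25k = [38801, 35371, 33101, 47478, 41060, 46438, 31908, 33061,
               38669, 31664, 43573, 29338, 42584, 29100, 32356]

d_stat = ks_statistic(staging_25k, production_25k)
print("KS stat:", round(d_stat, 4))
print("KS p (approx):", ks_pvalue_approx(d_stat, 15, 15))
print("Cohen's d:", round(cohens_d(staging_25k, production_25k), 2))
```

A negative Cohen's d here means the experiment (staging) sample has the lower mean latency, matching the sign convention used in the Delta Analysis table.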
Trial Data Files
- trials/2026-05-13-adr001-live-ab/production-5000.json: Production, 5K chars (15 trials)
- trials/2026-05-13-adr001-live-ab/production-15000.json: Production, 15K chars (15 trials)
- trials/2026-05-13-adr001-live-ab/production-25000.json: Production, 25K chars (15 trials)
- trials/2026-05-13-adr001-live-ab/staging-5000.json: Staging, 5K chars (15 trials)
- trials/2026-05-13-adr001-live-ab/staging-15000.json: Staging, 15K chars (15 trials)
- trials/2026-05-13-adr001-live-ab/staging-25000.json: Staging, 25K chars (15 trials)
4. Results
Raw Summary Table
| Payload | Env | n | Mean (ms) | p50 (ms) | p95 (ms) | p99 (ms) | StdDev (ms) |
|---|---|---|---|---|---|---|---|
| 5K | Production | 15 | 50,128 | 45,715 | 75,890 | 75,890 | 15,864 |
| 5K | Staging | 15 | 35,649 | 35,229 | 49,037 | 49,037 | 5,416 |
| 15K | Production | 15 | 45,763 | 40,623 | 81,751 | 81,751 | 14,675 |
| 15K | Staging | 15 | 36,961 | 36,700 | 48,186 | 48,186 | 6,326 |
| 25K | Production | 15 | 116,169 | 113,539 | 192,148 | 192,148 | 53,377 |
| 25K | Staging | 15 | 36,967 | 35,371 | 47,478 | 47,478 | 5,675 |
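In every row of the table the p95 and p99 columns are identical and equal the slowest trial. That is what a nearest-rank percentile gives at n=15, since both ranks round up to 15. The sketch below assumes nearest-rank is the method used (the report does not say) and reproduces the 5K production row from the raw latencies:

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 5K-character production latencies (ms), copied from the Trial Data section
production_5k = [40596, 42140, 30828, 40160, 59820, 31073, 46199, 29917,
                 36897, 45715, 62547, 75890, 65287, 73812, 71044]

for pct in (50, 95, 99):
    print(f"p{pct}:", percentile_nearest_rank(production_5k, pct))
# p50: 45715
# p95: 75890
# p99: 75890
```

With 15 trials the tail percentiles are therefore single observations, so the p95 and p99 deltas should be read as differences between the worst trials rather than as stable tail estimates.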
Delta Analysis (Staging vs. Production)
| Payload | Δ Mean (ms) | Δ Mean % | Δ p50 (ms) | Δ p95 (ms) | KS stat | KS p-value | Cohen's d | Direction |
|---|---|---|---|---|---|---|---|---|
| 5K | -14,479 | -28.9% | -10,486 | -26,853 | 0.5333 | 0.017* | -1.18 | Staging faster |
| 15K | -8,802 | -19.2% | -3,923 | -33,565 | 0.3333 | 0.308 | -0.70 | Staging faster |
| 25K | -79,202 | -68.2% | -78,168 | -144,670 | 0.8667 | <0.001* | -2.06 | Staging faster |
Key Observations
📊 5K Payload (Below Both Thresholds)
- Staging is 29% faster (mean: 35.6s vs 50.1s)
- KS test significant (p=0.017), large effect size (d=-1.18)
- Unexpected: Both environments should process this sequentially. The gap likely reflects environment-level differences (instance sizing, cold start, background load) rather than the threshold change.
- Staging also shows much lower variance (std: 5.4s vs 15.9s), suggesting a more consistent runtime environment.
📊 15K Payload (Between Thresholds — the Critical Test)
- Staging is 19% faster (mean: 37.0s vs 45.8s)
- p50 difference is moderate (-3.9s), but p95 delta is large (-33.6s)
- KS test NOT significant (p=0.308) — with n=15 we cannot reject the null hypothesis
- Cohen's d = -0.70 (medium-to-large effect, but not significant at this n)
- The directional evidence is consistent with the hypothesis, but we lack the statistical power to confirm it.
📊 25K Payload (Above Both Thresholds — Massive Improvement)
- Staging is 68% faster (mean: 37.0s vs 116.2s)
- KS test highly significant (p<0.001), enormous effect size (d=-2.06)
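- Secondary effect: With a 10K threshold, a 25K document splits into ~2-3 chunks distributed across the worker pool. A 25K document also exceeds the 20K threshold, so production should chunk it as well, presumably into fewer and larger chunks; the threshold change alone is therefore unlikely to explain a 68% gap. The 25K production mean (116s) is more than double the 5K production mean (50s), and production latencies at 25K climb almost monotonically with trial number (32.6s on trial 1 to 192.1s on trial 15). Both point to something else happening, possibly queuing effects or the production environment being under different load.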
5. Conclusion
Hypothesis Assessment
- [ ] Confirmed: All success criteria met with statistical significance.
- [ ] Rejected: One or more success criteria failed, or significant regression detected.
- [x] Inconclusive: Insufficient statistical power, but directional evidence is strong.
Effect Size
Cohen's d on primary metric: -0.70 (15K payload, the critical test)
Interpretation: Medium-to-large improvement in the experiment branch. With n=15, the KS test did not reach significance (p=0.31), but the effect size and the consistent direction across all three payload sizes provide converging evidence.
Massive effect at 25K: -2.06 Cohen's d, 68% reduction, p<0.001. This is a compelling signal, though some of the improvement may be environmental.
Confidence
Medium-Low — The results are directionally consistent with the hypothesis across all three payload sizes, and the effect at 25K is statistically overwhelming. However:
- n=15 is too small for reliable inference on the critical 15K payload
- The production environment showed much higher variance (especially at 5K and 25K), suggesting environmental confounds
- The 5K result (29% improvement where none was expected) reveals baseline differences between environments
Alternative Interpretation
The staging environment may simply run on faster/larger Cloud Run instances. The fact that all three payload sizes show improvement (including 5K where there should be no parallelization benefit) suggests a non-trivial environment confound. The true ADR-001 effect is likely smaller than the raw numbers suggest — possibly in the 10–20% range for 15K documents.
6. Decision
Select one:
- [ ] Merge: Hypothesis confirmed. Proceed with merge.
- [ ] Revert: Hypothesis rejected with significant regression. Do not merge.
- [x] Iterate: Results are promising but insufficient. Revise approach and re-run.
- [ ] Abandon: Approach is fundamentally flawed or cost exceeds benefit.
Rationale
The ADR-001 threshold change (20K → 10K) is supported by the direction of the effect at all three payload sizes, most strongly at 25K where the difference is overwhelming. However:
- The critical case (15K) did not reach significance (p=0.31). With only 15 trials, we can't rule out chance.
- Environmental confounds (staging consistently faster at all payloads) mean the raw numbers overestimate the true effect.
- Recommendation: Re-run the experiment with a controlled benchmark (same Cloud Run instance configuration, minimum 30 trials per configuration). If the 15K payload shows p<0.05 with Cohen's d ≤ -0.5, merge with confidence.
ADR-001 itself was already accepted, so this experiment serves to validate (or at least fail to disprove) that decision. The directional evidence supports proceeding, but a high-confidence validation would require more trials.
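To put the suggested minimum of 30 trials in context, the sketch below estimates how many trials per configuration a re-run would need. It uses the textbook normal approximation for a two-sided comparison of two means, which is an assumption: the report uses a KS test, which is generally less powerful against a pure shift in location, so these figures are better read as a rough lower bound. `trials_per_group` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def trials_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / abs(effect_size_d)) ** 2)


for d in (0.5, 0.7):
    print(f"d={d}: about {trials_per_group(d)} trials per configuration")
# d=0.5: about 63 trials per configuration
# d=0.7: about 33 trials per configuration
```

Under these assumptions, 30 trials is close to sufficient only if the true effect at 15K is as large as the observed d of -0.70. If the environment confound has inflated that estimate, detecting an effect near the d = -0.5 merge bar would need roughly twice as many trials.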
7. Trial Data
All trial data is in experiments/trials/2026-05-13-adr001-live-ab/.
Raw Latencies (ms)
5K Characters
| Trial | Production | Staging |
|---|---|---|
| 1 | 40,596 | 33,094 |
| 2 | 42,140 | 36,563 |
| 3 | 30,828 | 34,149 |
| 4 | 40,160 | 41,151 |
| 5 | 59,820 | 37,291 |
| 6 | 31,073 | 36,896 |
| 7 | 46,199 | 27,620 |
| 8 | 29,917 | 27,393 |
| 9 | 36,897 | 35,229 |
| 10 | 45,715 | 30,194 |
| 11 | 62,547 | 37,916 |
| 12 | 75,890 | 42,615 |
| 13 | 65,287 | 49,037 |
| 14 | 73,812 | 32,628 |
| 15 | 71,044 | 32,961 |
15K Characters
| Trial | Production | Staging |
|---|---|---|
| 1 | 32,658 | 42,747 |
| 2 | 34,494 | 38,456 |
| 3 | 42,084 | 29,548 |
| 4 | 31,457 | 48,186 |
| 5 | 32,409 | 36,700 |
| 6 | 32,606 | 43,711 |
| 7 | 40,438 | 31,937 |
| 8 | 26,429 | 35,450 |
| 9 | 43,323 | 42,232 |
| 10 | 40,623 | 27,192 |
| 11 | 54,780 | 44,001 |
| 12 | 74,087 | 33,718 |
| 13 | 81,751 | 25,381 |
| 14 | 61,532 | 39,425 |
| 15 | 57,770 | 35,734 |
25K Characters
| Trial | Production | Staging |
|---|---|---|
| 1 | 32,601 | 38,801 |
| 2 | 40,547 | 35,371 |
| 3 | 52,119 | 33,101 |
| 4 | 78,735 | 47,478 |
| 5 | 72,996 | 41,060 |
| 6 | 93,859 | 46,438 |
| 7 | 108,864 | 31,908 |
| 8 | 113,539 | 33,061 |
| 9 | 122,871 | 38,669 |
| 10 | 146,412 | 31,664 |
| 11 | 147,644 | 43,573 |
| 12 | 165,812 | 29,338 |
| 13 | 182,594 | 42,584 |
| 14 | 191,788 | 29,100 |
| 15 | 192,148 | 32,356 |