
Experiment Report: ADR-001 — 10K Character Threshold Live A/B Latency Test

| Field | Value |
|---|---|
| Experiment ID | ADR-001-live-ab |
| Date | 2026-05-13 |
| Author | @coding-agent |
| PR | feature/experiment-adr001 |
| Commit (baseline) | Production deployment (podpedia-pipeline, 20K threshold) |
| Commit (experiment) | Staging deployment (podpedia-pipeline-staging, 10K threshold) |
| Status | ⚠️ inconclusive — low trial count |

1. Hypothesis

"Lowering the parallel graph extraction chunking threshold from 20K characters to 10K characters (ADR-001) will reduce wall-clock extraction latency for medium-to-large documents."

Rationale (from ADR-001)

Documents exceeding 10K characters will trigger parallel chunking across the 8-worker pool instead of being processed sequentially. The 10K window provides a sufficient semantic aperture (≈1,500–2,000 words) for the LLM to resolve relationships without orphaned nodes.
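The dispatch logic described above can be sketched as follows. This is illustrative only: the function and constant names are stand-ins, not the pipeline's actual API, and a real splitter would presumably respect sentence boundaries rather than cutting at fixed character offsets.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_THRESHOLD = 10_000  # ADR-001 staging value; the production baseline uses 20_000
MAX_WORKERS = 8           # size of the worker pool described in ADR-001

def chunk(text: str, size: int = CHUNK_THRESHOLD) -> list[str]:
    """Split text into fixed-size windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_graph(doc: str, extract_chunk) -> list:
    """Short documents are processed sequentially; documents at or over
    the threshold fan out across the worker pool."""
    if len(doc) < CHUNK_THRESHOLD:
        return [extract_chunk(doc)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(extract_chunk, chunk(doc)))
```

Under this sketch, a 25K-character document yields three chunks (10,000, 10,000, and 5,000 characters) extracted concurrently, while a 5K document never touches the pool.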

Success Criteria (quantified)

Failure Criteria


2. Variables

Independent Variable

Chunking threshold: Production (20K threshold, sequential for docs under 20K) vs. Staging (10K threshold, parallel for docs of 10K and above).

Dependent Variables (metrics measured)

Controlled Variables

Uncontrolled Variables


3. Methodology

Benchmark Harness

A custom HTTP load test fired 15 concurrent requests at each payload size, targeting both the production and staging endpoints. Each request submitted a text payload of the specified character count and measured wall-clock latency from dispatch to full response.
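A minimal version of such a harness might look like the sketch below. The `send_request` callable is a stand-in for the actual HTTP POST to each endpoint; the real harness's code and endpoint URLs are not reproduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_trials(send_request, payload: str, n_trials: int = 15) -> list[float]:
    """Fire n_trials concurrent requests with the same payload and return
    each request's wall-clock latency (dispatch to full response) in ms."""
    def timed(_trial: int) -> float:
        start = time.perf_counter()
        send_request(payload)  # e.g. an HTTP POST to the pipeline endpoint
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=n_trials) as pool:
        return list(pool.map(timed, range(n_trials)))
```

The harness would call this once per (payload size, environment) pair, e.g. `run_trials(post_to_staging, "x" * 15_000)`, where `post_to_staging` is whatever request function targets the staging endpoint.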

Sample Size

Statistical Tests Applied

  1. Two-sample Kolmogorov–Smirnov (KS) test (approximate) on latency distributions
  2. Cohen's d for effect size
  3. Percentile analysis: p50, p95, p99
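For reference, all three statistics can be reproduced from the raw 15K-character trials in Section 7 with nothing beyond the standard library. This sketch assumes Cohen's d with a pooled sample standard deviation and nearest-rank percentiles, which is consistent with the reported values.

```python
import math
from bisect import bisect_right

def ks_stat(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
               for x in points)

def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

def percentile(sample, p):
    """Nearest-rank percentile (p in 0..100)."""
    s = sorted(sample)
    return s[math.ceil(p / 100 * len(s)) - 1]

# 15K-character trials, copied from Section 7
prod_15k = [32658, 34494, 42084, 31457, 32409, 32606, 40438, 26429,
            43323, 40623, 54780, 74087, 81751, 61532, 57770]
stg_15k = [42747, 38456, 29548, 48186, 36700, 43711, 31937, 35450,
           42232, 27192, 44001, 33718, 25381, 39425, 35734]

print(round(ks_stat(prod_15k, stg_15k), 4))  # 0.3333
print(cohens_d(stg_15k, prod_15k))           # ≈ -0.70 (staging minus production)
print(percentile(prod_15k, 50))              # 40623
```

With n=15 per arm, the nearest-rank p95 and p99 both land on the sample maximum, which is why those two columns coincide in the results tables below.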

Trial Data Files


4. Results

Raw Summary Table

| Payload | Env | n | Mean (ms) | p50 (ms) | p95 (ms) | p99 (ms) | StdDev (ms) |
|---|---|---|---|---|---|---|---|
| 5K | Production | 15 | 50,128 | 45,715 | 75,890 | 75,890 | 15,864 |
| 5K | Staging | 15 | 35,649 | 35,229 | 49,037 | 49,037 | 5,416 |
| 15K | Production | 15 | 45,763 | 40,623 | 81,751 | 81,751 | 14,675 |
| 15K | Staging | 15 | 36,961 | 36,700 | 48,186 | 48,186 | 6,326 |
| 25K | Production | 15 | 116,169 | 113,539 | 192,148 | 192,148 | 53,377 |
| 25K | Staging | 15 | 36,967 | 35,371 | 47,478 | 47,478 | 5,675 |

Delta Analysis (Staging vs. Production)

| Payload | Δ Mean (ms) | Δ Mean % | Δ p50 (ms) | Δ p95 (ms) | KS stat | KS p-value | Cohen's d | Direction |
|---|---|---|---|---|---|---|---|---|
| 5K | -14,479 | -28.9% | -10,486 | -26,853 | 0.5333 | 0.017* | -1.18 | Staging faster |
| 15K | -8,802 | -19.2% | -3,923 | -33,565 | 0.3333 | 0.308 | -0.70 | Staging faster |
| 25K | -79,202 | -68.2% | -78,168 | -144,670 | 0.8667 | <0.001* | -2.06 | Staging faster |

Key Observations

📊 5K Payload (Below Both Thresholds)

📊 15K Payload (Between Thresholds — the Critical Test)

📊 25K Payload (Above Both Thresholds — Massive Improvement)


5. Conclusion

Hypothesis Assessment

Effect Size

Cohen's d on the primary metric: -0.70 (15K payload — the critical test). Interpretation: a medium-to-large improvement in the experiment branch. With n=15, the KS test did not reach significance (p=0.31), but the effect size and the consistent direction across all three payload sizes provide converging evidence.

Massive effect at 25K: -2.06 Cohen's d, 68% reduction, p<0.001. This is a compelling signal, though some of the improvement may be environmental.

Confidence

Medium-Low — The results are directionally consistent with the hypothesis across all three payload sizes, and the effect at 25K is statistically overwhelming. However:

  1. n=15 is too small for reliable inference on the critical 15K payload
  2. The production environment showed much higher variance (especially at 5K and 25K), suggesting environmental confounds
  3. The 5K result (29% improvement where none was expected) reveals baseline differences between environments

Alternative Interpretation

The staging environment may simply run on faster/larger Cloud Run instances. The fact that all three payload sizes show improvement (including 5K where there should be no parallelization benefit) suggests a non-trivial environment confound. The true ADR-001 effect is likely smaller than the raw numbers suggest — possibly in the 10–20% range for 15K documents.


6. Decision

Select one:

Rationale

The ADR-001 threshold change (20K → 10K) is strongly supported by the directional evidence, especially at 25K where the effect is overwhelming. However:

  1. The critical case (15K) did not reach significance (p=0.31). With only 15 trials, we can't rule out chance.
  2. Environmental confounds (staging consistently faster at all payloads) mean the raw numbers overestimate the true effect.
  3. Recommendation: Re-run the experiment with a controlled benchmark (same Cloud Run instance configuration, minimum 30 trials per configuration). If the 15K payload shows p<0.05 with Cohen's d ≤ -0.5, merge with confidence.

**ADR-001 itself was already accepted** — this experiment validates (or fails to disprove) the decision. The directional evidence supports proceeding, but a high-confidence validation would require more trials.


7. Trial Data

All trial data is in experiments/trials/2026-05-13-adr001-live-ab/.

Raw Latencies (ms)

5K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 40,596 | 33,094 |
| 2 | 42,140 | 36,563 |
| 3 | 30,828 | 34,149 |
| 4 | 40,160 | 41,151 |
| 5 | 59,820 | 37,291 |
| 6 | 31,073 | 36,896 |
| 7 | 46,199 | 27,620 |
| 8 | 29,917 | 27,393 |
| 9 | 36,897 | 35,229 |
| 10 | 45,715 | 30,194 |
| 11 | 62,547 | 37,916 |
| 12 | 75,890 | 42,615 |
| 13 | 65,287 | 49,037 |
| 14 | 73,812 | 32,628 |
| 15 | 71,044 | 32,961 |

15K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 32,658 | 42,747 |
| 2 | 34,494 | 38,456 |
| 3 | 42,084 | 29,548 |
| 4 | 31,457 | 48,186 |
| 5 | 32,409 | 36,700 |
| 6 | 32,606 | 43,711 |
| 7 | 40,438 | 31,937 |
| 8 | 26,429 | 35,450 |
| 9 | 43,323 | 42,232 |
| 10 | 40,623 | 27,192 |
| 11 | 54,780 | 44,001 |
| 12 | 74,087 | 33,718 |
| 13 | 81,751 | 25,381 |
| 14 | 61,532 | 39,425 |
| 15 | 57,770 | 35,734 |

25K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 32,601 | 38,801 |
| 2 | 40,547 | 35,371 |
| 3 | 52,119 | 33,101 |
| 4 | 78,735 | 47,478 |
| 5 | 72,996 | 41,060 |
| 6 | 93,859 | 46,438 |
| 7 | 108,864 | 31,908 |
| 8 | 113,539 | 33,061 |
| 9 | 122,871 | 38,669 |
| 10 | 146,412 | 31,664 |
| 11 | 147,644 | 43,573 |
| 12 | 165,812 | 29,338 |
| 13 | 182,594 | 42,584 |
| 14 | 191,788 | 29,100 |
| 15 | 192,148 | 32,356 |