
Experiment Report: ADR-001 — 10K Character Threshold Live A/B Latency Test

| Field | Value |
|---|---|
| Experiment ID | ADR-001-live-ab |
| Date | 2026-05-13 |
| Author | @coding-agent |
| PR | feature/experiment-adr001 |
| Commit (baseline) | Production deployment (podpedia-pipeline, 20K threshold) |
| Commit (experiment) | Staging deployment (podpedia-pipeline-staging, 10K threshold) |
| Status | ⚠️ inconclusive — low trial count |

1. Hypothesis

"Lowering the parallel graph extraction chunking threshold from 20K characters to 10K characters (ADR-001) will reduce wall-clock extraction latency for medium-to-large documents."

Rationale (from ADR-001)

Documents exceeding 10K characters will trigger parallel chunking across the 8-worker pool instead of being processed sequentially. The 10K window provides a sufficient semantic aperture (≈1,500–2,000 words) for the LLM to resolve relationships without orphaned nodes.
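The dispatch logic described above can be sketched as follows. This is illustrative only: the function and constant names are stand-ins, not the pipeline's actual API, and a real splitter would presumably respect sentence boundaries rather than cutting at fixed character offsets.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_THRESHOLD = 10_000  # ADR-001 staging value; the production baseline uses 20_000
MAX_WORKERS = 8           # size of the worker pool described in ADR-001

def chunk(text: str, size: int = CHUNK_THRESHOLD) -> list[str]:
    """Split text into fixed-size windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_graph(doc: str, extract_chunk) -> list:
    """Short documents are processed sequentially; documents at or over
    the threshold fan out across the worker pool."""
    if len(doc) < CHUNK_THRESHOLD:
        return [extract_chunk(doc)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(extract_chunk, chunk(doc)))
```

Under this sketch, a 25K-character document yields three chunks (10,000, 10,000, and 5,000 characters) extracted concurrently, while a 5K document never touches the pool.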

Success Criteria (quantified)

Failure Criteria


2. Variables

Independent Variable

Chunking threshold: Production (20K threshold, sequential for docs under 20K) vs. Staging (10K threshold, parallel for docs of 10K and above).

Dependent Variables (metrics measured)

Controlled Variables

Uncontrolled Variables


3. Methodology

Benchmark Harness

A custom HTTP load test fired 15 concurrent requests at each payload size, targeting both the production and staging endpoints. Each request submitted a text payload of the specified character count and measured wall-clock latency from dispatch to full response.
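A minimal version of such a harness might look like the sketch below. The `send_request` callable is a stand-in for the actual HTTP POST to each endpoint; the real harness's code and endpoint URLs are not reproduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_trials(send_request, payload: str, n_trials: int = 15) -> list[float]:
    """Fire n_trials concurrent requests with the same payload and return
    each request's wall-clock latency (dispatch to full response) in ms."""
    def timed(_trial: int) -> float:
        start = time.perf_counter()
        send_request(payload)  # e.g. an HTTP POST to the pipeline endpoint
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=n_trials) as pool:
        return list(pool.map(timed, range(n_trials)))
```

The harness would call this once per (payload size, environment) pair, e.g. `run_trials(post_to_staging, "x" * 15_000)`, where `post_to_staging` is whatever request function targets the staging endpoint.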

Sample Size

Statistical Tests Applied

  1. Two-sample Kolmogorov–Smirnov (KS) test (approximate) on latency distributions
  2. Cohen's d for effect size
  3. Percentile analysis: p50, p95, p99
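For reference, all three statistics can be reproduced from the raw 15K-character trials in Section 7 with nothing beyond the standard library. This sketch assumes Cohen's d with a pooled sample standard deviation and nearest-rank percentiles, which is consistent with the reported values.

```python
import math
from bisect import bisect_right

def ks_stat(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
               for x in points)

def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

def percentile(sample, p):
    """Nearest-rank percentile (p in 0..100)."""
    s = sorted(sample)
    return s[math.ceil(p / 100 * len(s)) - 1]

# 15K-character trials, copied from Section 7
prod_15k = [32658, 34494, 42084, 31457, 32409, 32606, 40438, 26429,
            43323, 40623, 54780, 74087, 81751, 61532, 57770]
stg_15k = [42747, 38456, 29548, 48186, 36700, 43711, 31937, 35450,
           42232, 27192, 44001, 33718, 25381, 39425, 35734]

print(round(ks_stat(prod_15k, stg_15k), 4))  # 0.3333
print(cohens_d(stg_15k, prod_15k))           # ≈ -0.70 (staging minus production)
print(percentile(prod_15k, 50))              # 40623
```

With n=15 per arm, the nearest-rank p95 and p99 both land on the sample maximum, which is why those two columns coincide in the results tables below.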

Trial Data Files


4. Results

Raw Summary Table

| Payload | Env | n | Mean (ms) | p50 (ms) | p95 (ms) | p99 (ms) | StdDev (ms) |
|---|---|---|---|---|---|---|---|
| 5K | Production | 15 | 50,128 | 45,715 | 75,890 | 75,890 | 15,864 |
| 5K | Staging | 15 | 35,649 | 35,229 | 49,037 | 49,037 | 5,416 |
| 15K | Production | 15 | 45,763 | 40,623 | 81,751 | 81,751 | 14,675 |
| 15K | Staging | 15 | 36,961 | 36,700 | 48,186 | 48,186 | 6,326 |
| 25K | Production | 15 | 116,169 | 113,539 | 192,148 | 192,148 | 53,377 |
| 25K | Staging | 15 | 36,967 | 35,371 | 47,478 | 47,478 | 5,675 |

Delta Analysis (Staging vs. Production)

| Payload | Δ Mean (ms) | Δ Mean % | Δ p50 (ms) | Δ p95 (ms) | KS stat | KS p-value | Cohen's d | Direction |
|---|---|---|---|---|---|---|---|---|
| 5K | -14,479 | -28.9% | -10,486 | -26,853 | 0.5333 | 0.017* | -1.18 | Staging faster |
| 15K | -8,802 | -19.2% | -3,923 | -33,565 | 0.3333 | 0.308 | -0.70 | Staging faster |
| 25K | -79,202 | -68.2% | -78,168 | -144,670 | 0.8667 | <0.001* | -2.06 | Staging faster |

Key Observations

📊 5K Payload (Below Both Thresholds)

📊 15K Payload (Between Thresholds — the Critical Test)

📊 25K Payload (Above Both Thresholds — Massive Improvement)


5. Conclusion

Hypothesis Assessment

Effect Size

Cohen's d on the primary metric: -0.70 (15K payload — the critical test). Interpretation: a medium-to-large improvement in the experiment branch. With n=15, the KS test did not reach significance (p=0.31), but the effect size and the consistent direction across all three payload sizes provide converging evidence.

Massive effect at 25K: -2.06 Cohen's d, 68% reduction, p<0.001. This is a compelling signal, though some of the improvement may be environmental.

Confidence

Medium-Low — The results are directionally consistent with the hypothesis across all three payload sizes, and the effect at 25K is statistically overwhelming. However:

  1. n=15 is too small for reliable inference on the critical 15K payload
  2. The production environment showed much higher variance (especially at 5K and 25K), suggesting environmental confounds
  3. The 5K result (29% improvement where none was expected) reveals baseline differences between environments

Alternative Interpretation

The staging environment may simply run on faster/larger Cloud Run instances. The fact that all three payload sizes show improvement (including 5K where there should be no parallelization benefit) suggests a non-trivial environment confound. The true ADR-001 effect is likely smaller than the raw numbers suggest — possibly in the 10–20% range for 15K documents.


6. Decision

Select one:

Rationale

The ADR-001 threshold change (20K → 10K) is strongly supported by the directional evidence, especially at 25K where the effect is overwhelming. However:

  1. The critical case (15K) did not reach significance (p=0.31). With only 15 trials, we can't rule out chance.
  2. Environmental confounds (staging consistently faster at all payloads) mean the raw numbers overestimate the true effect.
  3. Recommendation: Re-run the experiment with a controlled benchmark (same Cloud Run instance configuration, minimum 30 trials per configuration). If the 15K payload shows p<0.05 with Cohen's d ≤ -0.5, merge with confidence.

**ADR-001 itself was already accepted** — this experiment validates (or fails to disprove) the decision. The directional evidence supports proceeding, but a high-confidence validation would require more trials.


7. Trial Data

All trial data is in experiments/trials/2026-05-13-adr001-live-ab/.

Raw Latencies (ms)

5K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 40,596 | 33,094 |
| 2 | 42,140 | 36,563 |
| 3 | 30,828 | 34,149 |
| 4 | 40,160 | 41,151 |
| 5 | 59,820 | 37,291 |
| 6 | 31,073 | 36,896 |
| 7 | 46,199 | 27,620 |
| 8 | 29,917 | 27,393 |
| 9 | 36,897 | 35,229 |
| 10 | 45,715 | 30,194 |
| 11 | 62,547 | 37,916 |
| 12 | 75,890 | 42,615 |
| 13 | 65,287 | 49,037 |
| 14 | 73,812 | 32,628 |
| 15 | 71,044 | 32,961 |

15K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 32,658 | 42,747 |
| 2 | 34,494 | 38,456 |
| 3 | 42,084 | 29,548 |
| 4 | 31,457 | 48,186 |
| 5 | 32,409 | 36,700 |
| 6 | 32,606 | 43,711 |
| 7 | 40,438 | 31,937 |
| 8 | 26,429 | 35,450 |
| 9 | 43,323 | 42,232 |
| 10 | 40,623 | 27,192 |
| 11 | 54,780 | 44,001 |
| 12 | 74,087 | 33,718 |
| 13 | 81,751 | 25,381 |
| 14 | 61,532 | 39,425 |
| 15 | 57,770 | 35,734 |

25K Characters

| Trial | Production | Staging |
|---|---|---|
| 1 | 32,601 | 38,801 |
| 2 | 40,547 | 35,371 |
| 3 | 52,119 | 33,101 |
| 4 | 78,735 | 47,478 |
| 5 | 72,996 | 41,060 |
| 6 | 93,859 | 46,438 |
| 7 | 108,864 | 31,908 |
| 8 | 113,539 | 33,061 |
| 9 | 122,871 | 38,669 |
| 10 | 146,412 | 31,664 |
| 11 | 147,644 | 43,573 |
| 12 | 165,812 | 29,338 |
| 13 | 182,594 | 42,584 |
| 14 | 191,788 | 29,100 |
| 15 | 192,148 | 32,356 |