# Experiment Report: ADR-007/010 — Pipeline Regression Experiment
| Field | Value |
|---|---|
| Experiment ID | ADR-007-010-pipeline |
| Date | 2026-05-12 |
| Author | @coding-agent |
| PR | feature/experiment-adr007-010 |
| Commit (baseline) | e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 (development) |
| Commit (experiment) | dc1fbe51c83a020cdad6a99325d65a83076bf0c5 (feature/experiment-adr007-010) |
| Status | ⚠️ inconclusive — methodology caveat |
## 1. Hypothesis

"The merged ADR-007 + ADR-010 changes introduce no statistically significant regression in graph snapshot latency, graph traversal latency, or memory allocations."

### Success Criteria (quantified)

- No statistically significant regression (KS test p ≥ 0.05, or Cohen's d not indicating a large regression) across all benchmark variants
- Memory allocations (allocs/op, B/op) unchanged or improved (within ±5%)
- No single variant shows >10% degradation in mean latency

### Failure Criteria

- Any variant with KS test p < 0.05 and Cohen's d indicating a large regression → experiment rejected
- Any variant with >10% increase in mean ns/op → rejected
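Both criteria sets can be applied mechanically per variant. Below is a minimal sketch, assuming the per-variant statistics have already been computed; `classify_variant`, its parameters, and the d ≥ 0.8 "large" cutoff (Cohen's conventional threshold) are illustrative assumptions, not the actual analyze.py interface.

```python
# Hypothetical helper applying the success/failure criteria to one variant.
# All names and thresholds here are illustrative, not taken from analyze.py.
def classify_variant(delta_pct: float, ks_p: float, cohens_d: float,
                     alloc_delta_pct: float) -> str:
    """Classify one benchmark variant as 'pass' or 'reject'.

    delta_pct:       mean ns/op change, (experiment - baseline) / baseline * 100
                     (positive means the experiment branch is slower)
    ks_p:            two-sample KS test p-value
    cohens_d:        effect size (magnitude is what matters here)
    alloc_delta_pct: allocs/op (or B/op) change in percent
    """
    LARGE_D = 0.8  # conventional "large" effect-size cutoff

    # Failure: statistically significant large regression.
    if delta_pct > 0 and ks_p < 0.05 and abs(cohens_d) >= LARGE_D:
        return "reject"
    # Failure: >10% increase in mean ns/op.
    if delta_pct > 10.0:
        return "reject"
    # Success requires allocations within ±5%.
    if abs(alloc_delta_pct) > 5.0:
        return "reject"
    return "pass"
```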
## 2. Variables

### Independent Variable

The code under test: the baseline `development` branch versus the merged ADR-007 + ADR-010 changes on `feature/experiment-adr007-010`. In practice a confound was introduced on top of this: the baseline was measured with `-benchtime=30s -count=30`, while the experiment was measured with `-benchtime=1s -count=10` due to runtime constraints. This is a critical methodology caveat — the shorter benchtime may produce systematically different results (less GC pressure, less warmup, smaller per-run iteration counts). Any observed differences should be interpreted with this caveat.
### Dependent Variables (metrics measured)

- Mean ns/op (from `-benchmem` output)
- p50, p95, p99 latency (calculated from raw trial data)
- Allocs/op (from `-benchmem` output)
- Bytes/op (from `-benchmem` output)
### Controlled Variables

- Hardware: AMD Ryzen 7 3700X 8-Core Processor
- Go version: go1.26.0 linux/amd64
- Baseline benchmark flags: `-benchmem -benchtime=30s -count=30`
- Experiment benchmark flags: `-benchmem -benchtime=1s -count=10`
- Data corpus: in-memory graphs at scale={100, 1000, 5000}, synthetic linear-chain topology
- Background load: none (stale processes cleaned before the experiment run)
## 3. Methodology

### Benchmark Harness

```bash
# Baseline (main/development branch)
go test -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
    -benchmem -benchtime=30s -count=30 ./handlers/

# Experiment (feature/experiment-adr007-010 branch)
go test -timeout 0 -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
    -benchmem -benchtime=1s -count=10 ./handlers/
```
### Sample Size

- Baseline: 30 iterations per variant × 12 variants = 360 data points
- Experiment: 10 iterations per variant × 12 variants = 119 data points (one iteration missing for SQLite Traversal scale=5000 due to a timeout)
### Benchmarks Tested

All benchmarks from `backend/handlers/graph_benchmark_test.go`:

- `BenchmarkGraphSnapshot` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- `BenchmarkGraphTraversal` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
### Statistical Tests Applied

- Two-sample KS test (`scipy.stats.ks_2samp`) on the latency distributions
- Cohen's d for effect size
- Percentile analysis: p50, p95, p99
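For reference, here is a minimal sketch of how these tests combine for a single variant, assuming the raw ns/op samples for both branches are already loaded as arrays; `compare_variant` and its return fields are illustrative names, not the actual analyze.py interface.

```python
# Minimal sketch of the per-variant statistical comparison.
# Function and field names are illustrative, not taken from analyze.py.
import numpy as np
from scipy import stats

def compare_variant(baseline: np.ndarray, experiment: np.ndarray) -> dict:
    """Compare raw ns/op samples for one benchmark variant."""
    # Two-sample KS test: do the two latency distributions differ?
    _, ks_p = stats.ks_2samp(baseline, experiment)

    # Cohen's d with the pooled standard deviation; positive d means
    # the experiment branch is faster (lower mean latency).
    n1, n2 = len(baseline), len(experiment)
    pooled_sd = np.sqrt(((n1 - 1) * baseline.std(ddof=1) ** 2 +
                         (n2 - 1) * experiment.std(ddof=1) ** 2) / (n1 + n2 - 2))
    cohens_d = (baseline.mean() - experiment.mean()) / pooled_sd

    return {
        # Negative delta_pct means the experiment branch is faster.
        "delta_pct": (experiment.mean() - baseline.mean()) / baseline.mean() * 100,
        "p50": np.percentile(experiment, 50),
        "p95": np.percentile(experiment, 95),
        "p99": np.percentile(experiment, 99),
        "ks_p": ks_p,
        "cohens_d": cohens_d,
    }
```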
### Trial Data Files

- `experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt` (baseline)
- `experiments/trials/2026-05-12-adr007-010-pipeline/dev-branch.txt` (experiment)
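The trial files hold raw `go test -bench` output. A sketch of loading them into per-variant sample lists, assuming the standard benchmark output line format; `load_samples` is an illustrative helper, not part of analyze.py.

```python
# Sketch of parsing raw `go test -bench` output into per-variant ns/op
# samples. Assumes the standard benchmark line format.
import re
from collections import defaultdict

BENCH_LINE = re.compile(r"^(Benchmark\S+)\s+(\d+)\s+([\d.]+) ns/op")

def load_samples(path: str) -> dict[str, list[float]]:
    """Collect every iteration's ns/op value, keyed by benchmark name."""
    samples: dict[str, list[float]] = defaultdict(list)
    with open(path) as f:
        for line in f:
            m = BENCH_LINE.match(line)
            if m:
                samples[m.group(1)].append(float(m.group(3)))
    return dict(samples)

# With -count=30, each of the 12 variants should yield 30 samples here.
baseline = load_samples(
    "experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt")
```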
## 4. Results

### ⚠️ Methodology Caveat

The baseline used `-benchtime=30s -count=30` while the experiment used `-benchtime=1s -count=10`. The original plan was to run identical parameters, but `-benchtime=30s -count=30` proved infeasible for the dev branch run (estimated at 7+ hours due to the system constraints described in the appendix). Shorter benchtimes can produce systematically lower latencies because:

- There is less GC pressure (fewer allocations accumulate per run)
- Warmup differs (Go compiles ahead of time, so there is no JIT, but caches and branch predictors benefit from longer runs)
- The benchmarking framework calibrates a different iteration count per run

All results below should be interpreted with this caveat. A follow-up experiment with identical benchtime parameters is recommended to confirm the findings.
### Summary Table

Δ and Δ% are computed as experiment − baseline, so negative values mean the experiment branch measured faster (e.g., Snapshot Mem/100: (3,439 − 3,652) / 3,652 ≈ −5.83%).

| Variant | Baseline (ns/op) | Experiment (ns/op) | Δ (ns/op) | Δ% | Cohen's d | KS p-value |
|---|---|---|---|---|---|---|
| Snapshot Mem/100 | 3,652 | 3,439 | -213 | -5.83% | 1.70 (large) | 0.000032 |
| Snapshot Mem/1000 | 40,921 | 37,139 | -3,782 | -9.24% | 10.30 (large) | <0.000001 |
| Snapshot Mem/5000 | 226,034 | 212,945 | -13,089 | -5.79% | 2.46 (large) | <0.000001 |
| Snapshot SQL/100 | 225,087 | 214,553 | -10,533 | -4.68% | 4.80 (large) | <0.000001 |
| Snapshot SQL/1000 | 2,059,968 | 1,952,934 | -107,034 | -5.20% | 6.23 (large) | <0.000001 |
| Snapshot SQL/5000 | 11,332,231 | 10,121,212 | -1,211,019 | -10.69% | 2.09 (large) | <0.000001 |
| Traversal Mem/100 | 6,558 | 6,332 | -226 | -3.45% | 1.05 (large) | 0.000032 |
| Traversal Mem/1000 | 87,157 | 82,828 | -4,329 | -4.97% | 1.93 (large) | <0.000001 |
| Traversal Mem/5000 | 557,967 | 529,905 | -28,062 | -5.03% | 1.93 (large) | <0.000001 |
| Traversal SQL/100 | 205,573 | 201,148 | -4,425 | -2.15% | 1.12 (large) | 0.0001 |
| Traversal SQL/1000 | 1,532,563 | 1,465,552 | -67,011 | -4.37% | 1.94 (large) | <0.000001 |
| Traversal SQL/5000 | 7,612,253 | 6,926,938 | -685,316 | -9.00% | 1.81 (large) | <0.000001 |
### Memory Analysis

| Variant | Baseline allocs/op | Experiment allocs/op | Δ allocs | Baseline B/op | Experiment B/op | Δ bytes |
|---|---|---|---|---|---|---|
| Snapshot Mem/100 | 2 | 2 | 0 | 11,392 | 11,392 | 0 |
| Snapshot Mem/1000 | 2 | 2 | 0 | 114,688 | 114,688 | 0 |
| Snapshot Mem/5000 | 2 | 2 | 0 | 573,440 | 573,440 | 0 |
| Snapshot SQL/100 | 2,229 | 2,229 | 0 | 69,768 | 69,768 | 0 |
| Snapshot SQL/1000 | 22,035 | 22,035 | 0 | 612,840 | 612,840 | 0 |
| Snapshot SQL/5000 | 110,045 | 110,045 | 0 | 3,832,047 | 3,832,046 | -1 |
| Traversal Mem/100 | 2 | 2 | 0 | 128 | 128 | 0 |
| Traversal Mem/1000 | 7 | 7 | 0 | 704 | 704 | 0 |
| Traversal Mem/5000 | 34 | 34 | 0 | 9,912 | 9,912 | 0 |
| Traversal SQL/100 | 75 | 75 | 0 | 2,729 | 2,729 | 0 |
| Traversal SQL/1000 | 107 | 107 | 0 | 3,618 | 3,618 | 0 |
| Traversal SQL/5000 | 368 | 368 | 0 | 12,213 | 12,213 | 0 |
Memory behavior is effectively identical: allocs/op is unchanged in every variant, and bytes/op differs by at most 1 byte (Snapshot SQL/5000), which is within the reporting precision of `-benchmem`.
### Key Observations

1. All 12 benchmark variants show statistically significant improvements (KS test p < 0.05, Cohen's d large). No regressions were detected.
2. Improvements range from 2.15% to 10.69% across variants, with effect sizes (Cohen's d) all in the large range (1.05 to 10.30).
3. Memory allocations are unchanged — zero change in allocs/op for all variants, confirming that the ADR-007/010 changes introduce no new allocation paths.
4. SQLite Snapshot at scale=5000 shows the largest improvement (−10.69%, d=2.09), while SQLite Traversal at scale=100 shows the smallest (−2.15%, d=1.12).
5. The uniform improvement across all variants is suspicious and most likely reflects the benchmark methodology difference (1s vs. 30s benchtime) rather than a genuine code improvement.
## 5. Conclusion

### Hypothesis Assessment

- [ ] Confirmed — all success criteria met with statistical significance.
- [ ] Rejected — one or more success criteria failed, or a significant regression was detected.
- [x] Inconclusive — the methodology caveat precludes a definitive conclusion.
### Effect Size

Cohen's d on the primary metric (mean ns/op): range 1.05–10.30, all "large".

Interpretation: every variant shows a large apparent improvement (experiment faster). However, this is almost certainly an artifact of the different `-benchtime` parameters (1s vs. 30s), not a genuine performance improvement from the ADR-007/010 changes.
### Confidence

Low — the benchmark parameter mismatch invalidates direct comparison. The consistent 3–10% improvement across all variants strongly suggests a systematic effect of the shorter benchtime (less GC amortization, different warmup behavior) rather than a code-level optimization.
### What We Can Conclude

- Memory allocations are unchanged — ADR-007/010 introduces no new allocation paths. Allocs/op and B/op are effectively identical between baseline and experiment.
- No catastrophic regressions are visible — even with the benchtime artifact favoring the experiment branch, no regressions were detected. The code at least does not make things dramatically worse.
- Methodology needs refinement — future experiments must use identical `-benchtime` and `-count` parameters.
## 6. Decision

- [ ] Merge — hypothesis confirmed; proceed with merge.
- [ ] Revert — hypothesis rejected with a significant regression; do not merge.
- [x] Iterate — results are promising but insufficient; revise the approach and re-run.
- [ ] Abandon — the approach is fundamentally flawed or its cost exceeds the benefit.
### Rationale

The experiment detected no regressions and no memory-allocation changes, which is consistent with the hypothesis. However, the benchmark methodology difference (`-benchtime=1s` vs. `-benchtime=30s`) invalidates direct comparison of the latency metrics. The observed 3–10% "improvement" is almost certainly a benchtime artifact.

Recommendation: re-run with identical `-benchtime=30s -count=30` on a machine with sufficient CPU (no competing processes) and a 4+ hour time budget. Alternatively, use `-benchtime=3s -count=30` as a practical compromise that still provides meaningful warmup.

The allocs/op and B/op comparison is valid (these metrics are unaffected by benchtime) and shows zero change — a positive signal for the ADR-007/010 changes.
## 7. Commit

### Measurement Commits

- Baseline: `e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978` — "fix: resolve merge artifacts — CompleteJob signature, NewGraphDB args, restore waitForLitestream"
- Experiment: `dc1fbe51c83a020cdad6a99325d65a83076bf0c5` — "bench: add ADR-007/010 regression benchmarks for experiment"
### Verification

```bash
# Verify the baseline commit is on development
git branch --contains e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 | grep development

# Verify the experiment commit is the feature branch HEAD
cd ...podpedia-app.feature-experiment-adr007-010 && git rev-parse HEAD
# dc1fbe51c83a020cdad6a99325d65a83076bf0c5
```
### Trial Data

- `experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt` — 360 data points (30 iterations × 12 variants)
- `experiments/trials/2026-05-12-adr007-010-pipeline/dev-branch.txt` — 119 data points (10 iterations × 12 variants, minus the one iteration lost to the SQLite Traversal scale=5000 timeout)
### Analysis Script

- `experiments/benchmarks/analyze.py` — KS test, Cohen's d, percentile analysis
## 8. Appendix: Experiment Run Notes

### Issues Encountered

- Zombie benchmark processes (PIDs 3030704 and 3176859) from earlier attempts were consuming ~300% CPU, slowing the initial run to roughly 1 data point per 2 minutes.
- The initial run with `-benchtime=30s -count=30` was estimated at 7+ hours due to the zombie-process CPU contention and the scale=5000 SQLite variants taking ~2 minutes per iteration.
- The Go test timeout (`-test.timeout=10m0s`) killed an early run. Mitigated with `-timeout 0`.
- An exec timeout killed two runs. The final run used `timeout=1800` (30 min) but still timed out at 119/120 data points.
- Stale process log output — the `tee` output was heavily buffered, making real-time progress monitoring unreliable.
### Cleanup Performed

```bash
kill -9 3030704 3176859  # zombie benchmark processes
```
### Benchmark Code

All benchmarks live in `backend/handlers/graph_benchmark_test.go`:

- `BenchmarkGraphSnapshot` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- `BenchmarkGraphTraversal` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- Additional benchmarks exist (`BenchmarkPipelineAssembly`, `BenchmarkGraphBulkInsert`, `BenchmarkNewGraphDB`, `BenchmarkGraphStats`, `BenchmarkEntityResolution`) but were outside the scope of this experiment.