# Experiment Report: ADR-007/010 — Pipeline Regression Experiment
| Field | Value |
|---|---|
| Experiment ID | ADR-007-010-pipeline |
| Date | 2026-05-12 |
| Author | @coding-agent |
| PR | feature/experiment-adr007-010 |
| Commit (baseline) | e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 (development) |
| Commit (experiment) | dc1fbe51c83a020cdad6a99325d65a83076bf0c5 (feature/experiment-adr007-010) |
| Status | ⚠️ inconclusive — methodology caveat |
## 1. Hypothesis

"The merged ADR-007 + ADR-010 changes introduce no statistically significant regression in graph snapshot latency, graph traversal latency, or memory allocations."

### Success Criteria (quantified)

- No statistically significant regression (KS test p ≥ 0.05, or Cohen's d not indicating a large regression) across all benchmark variants
- Memory allocations (allocs/op, B/op) unchanged or improved (within ±5%)
- No single variant shows >10% degradation in mean latency

### Failure Criteria

- Any variant with KS test p < 0.05 and Cohen's d indicating a large regression → experiment rejected
- Any variant with >10% increase in mean ns/op → rejected
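Both criteria sets can be applied mechanically per variant. Below is a minimal sketch, assuming the per-variant statistics have already been computed; `classify_variant`, its parameters, and the d ≥ 0.8 "large" cutoff (Cohen's conventional threshold) are illustrative assumptions, not the actual analyze.py interface.

```python
# Hypothetical helper applying the success/failure criteria to one variant.
# All names and thresholds here are illustrative, not taken from analyze.py.
def classify_variant(delta_pct: float, ks_p: float, cohens_d: float,
                     alloc_delta_pct: float) -> str:
    """Classify one benchmark variant as 'pass' or 'reject'.

    delta_pct:       mean ns/op change, (experiment - baseline) / baseline * 100
                     (positive means the experiment branch is slower)
    ks_p:            two-sample KS test p-value
    cohens_d:        effect size (magnitude is what matters here)
    alloc_delta_pct: allocs/op (or B/op) change in percent
    """
    LARGE_D = 0.8  # conventional "large" effect-size cutoff

    # Failure: statistically significant large regression.
    if delta_pct > 0 and ks_p < 0.05 and abs(cohens_d) >= LARGE_D:
        return "reject"
    # Failure: >10% increase in mean ns/op.
    if delta_pct > 10.0:
        return "reject"
    # Success requires allocations within ±5%.
    if abs(alloc_delta_pct) > 5.0:
        return "reject"
    return "pass"
```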
## 2. Variables

### Independent Variable

The code under test: the baseline `development` branch versus the merged ADR-007 + ADR-010 changes on `feature/experiment-adr007-010`. In practice a confound was introduced on top of this: the baseline was measured with `-benchtime=30s -count=30`, while the experiment was measured with `-benchtime=1s -count=10` due to runtime constraints. This is a critical methodology caveat — the shorter benchtime may produce systematically different results (less GC pressure, less warmup, smaller per-run iteration counts). Any observed differences should be interpreted with this caveat.
### Dependent Variables (metrics measured)

- Mean ns/op (from `-benchmem` output)
- p50, p95, p99 latency (calculated from raw trial data)
- Allocs/op (from `-benchmem` output)
- Bytes/op (from `-benchmem` output)
### Controlled Variables

- Hardware: AMD Ryzen 7 3700X 8-Core Processor
- Go version: go1.26.0 linux/amd64
- Baseline benchmark flags: `-benchmem -benchtime=30s -count=30`
- Experiment benchmark flags: `-benchmem -benchtime=1s -count=10`
- Data corpus: in-memory graphs at scale={100, 1000, 5000}, synthetic linear-chain topology
- Background load: none (stale processes cleaned before the experiment run)
## 3. Methodology

### Benchmark Harness

```bash
# Baseline (main/development branch)
go test -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
    -benchmem -benchtime=30s -count=30 ./handlers/

# Experiment (feature/experiment-adr007-010 branch)
go test -timeout 0 -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
    -benchmem -benchtime=1s -count=10 ./handlers/
```
### Sample Size

- Baseline: 30 iterations per variant × 12 variants = 360 data points
- Experiment: 10 iterations per variant × 12 variants = 119 data points (one iteration missing for SQLite Traversal scale=5000 due to a timeout)
### Benchmarks Tested

All benchmarks from `backend/handlers/graph_benchmark_test.go`:

- `BenchmarkGraphSnapshot` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- `BenchmarkGraphTraversal` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
### Statistical Tests Applied

- Two-sample KS test (`scipy.stats.ks_2samp`) on the latency distributions
- Cohen's d for effect size
- Percentile analysis: p50, p95, p99
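For reference, here is a minimal sketch of how these tests combine for a single variant, assuming the raw ns/op samples for both branches are already loaded as arrays; `compare_variant` and its return fields are illustrative names, not the actual analyze.py interface.

```python
# Minimal sketch of the per-variant statistical comparison.
# Function and field names are illustrative, not taken from analyze.py.
import numpy as np
from scipy import stats

def compare_variant(baseline: np.ndarray, experiment: np.ndarray) -> dict:
    """Compare raw ns/op samples for one benchmark variant."""
    # Two-sample KS test: do the two latency distributions differ?
    _, ks_p = stats.ks_2samp(baseline, experiment)

    # Cohen's d with the pooled standard deviation; positive d means
    # the experiment branch is faster (lower mean latency).
    n1, n2 = len(baseline), len(experiment)
    pooled_sd = np.sqrt(((n1 - 1) * baseline.std(ddof=1) ** 2 +
                         (n2 - 1) * experiment.std(ddof=1) ** 2) / (n1 + n2 - 2))
    cohens_d = (baseline.mean() - experiment.mean()) / pooled_sd

    return {
        # Negative delta_pct means the experiment branch is faster.
        "delta_pct": (experiment.mean() - baseline.mean()) / baseline.mean() * 100,
        "p50": np.percentile(experiment, 50),
        "p95": np.percentile(experiment, 95),
        "p99": np.percentile(experiment, 99),
        "ks_p": ks_p,
        "cohens_d": cohens_d,
    }
```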
### Trial Data Files

- `experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt` (baseline)
- `experiments/trials/2026-05-12-adr007-010-pipeline/dev-branch.txt` (experiment)
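The trial files hold raw `go test -bench` output. A sketch of loading them into per-variant sample lists, assuming the standard benchmark output line format; `load_samples` is an illustrative helper, not part of analyze.py.

```python
# Sketch of parsing raw `go test -bench` output into per-variant ns/op
# samples. Assumes the standard benchmark line format.
import re
from collections import defaultdict

BENCH_LINE = re.compile(r"^(Benchmark\S+)\s+(\d+)\s+([\d.]+) ns/op")

def load_samples(path: str) -> dict[str, list[float]]:
    """Collect every iteration's ns/op value, keyed by benchmark name."""
    samples: dict[str, list[float]] = defaultdict(list)
    with open(path) as f:
        for line in f:
            m = BENCH_LINE.match(line)
            if m:
                samples[m.group(1)].append(float(m.group(3)))
    return dict(samples)

# With -count=30, each of the 12 variants should yield 30 samples here.
baseline = load_samples(
    "experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt")
```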
## 4. Results

### ⚠️ Methodology Caveat

The baseline used `-benchtime=30s -count=30` while the experiment used `-benchtime=1s -count=10`. The original plan was to run identical parameters, but `-benchtime=30s -count=30` proved infeasible for the dev branch run (estimated at 7+ hours due to the system constraints described in the appendix). Shorter benchtimes can produce systematically lower latencies because:

- There is less GC pressure (fewer allocations accumulate per run)
- Warmup differs (Go compiles ahead of time, so there is no JIT, but caches and branch predictors benefit from longer runs)
- The benchmarking framework calibrates a different iteration count per run

All results below should be interpreted with this caveat. A follow-up experiment with identical benchtime parameters is recommended to confirm the findings.
### Summary Table

Δ and Δ% are computed as experiment − baseline, so negative values mean the experiment branch measured faster (e.g., Snapshot Mem/100: (3,439 − 3,652) / 3,652 ≈ −5.83%).

| Variant | Baseline (ns/op) | Experiment (ns/op) | Δ (ns/op) | Δ% | Cohen's d | KS p-value |
|---|---|---|---|---|---|---|
| Snapshot Mem/100 | 3,652 | 3,439 | -213 | -5.83% | 1.70 (large) | 0.000032 |
| Snapshot Mem/1000 | 40,921 | 37,139 | -3,782 | -9.24% | 10.30 (large) | <0.000001 |
| Snapshot Mem/5000 | 226,034 | 212,945 | -13,089 | -5.79% | 2.46 (large) | <0.000001 |
| Snapshot SQL/100 | 225,087 | 214,553 | -10,533 | -4.68% | 4.80 (large) | <0.000001 |
| Snapshot SQL/1000 | 2,059,968 | 1,952,934 | -107,034 | -5.20% | 6.23 (large) | <0.000001 |
| Snapshot SQL/5000 | 11,332,231 | 10,121,212 | -1,211,019 | -10.69% | 2.09 (large) | <0.000001 |
| Traversal Mem/100 | 6,558 | 6,332 | -226 | -3.45% | 1.05 (large) | 0.000032 |
| Traversal Mem/1000 | 87,157 | 82,828 | -4,329 | -4.97% | 1.93 (large) | <0.000001 |
| Traversal Mem/5000 | 557,967 | 529,905 | -28,062 | -5.03% | 1.93 (large) | <0.000001 |
| Traversal SQL/100 | 205,573 | 201,148 | -4,425 | -2.15% | 1.12 (large) | 0.0001 |
| Traversal SQL/1000 | 1,532,563 | 1,465,552 | -67,011 | -4.37% | 1.94 (large) | <0.000001 |
| Traversal SQL/5000 | 7,612,253 | 6,926,938 | -685,316 | -9.00% | 1.81 (large) | <0.000001 |
### Memory Analysis

| Variant | Baseline allocs/op | Experiment allocs/op | Δ allocs | Baseline B/op | Experiment B/op | Δ bytes |
|---|---|---|---|---|---|---|
| Snapshot Mem/100 | 2 | 2 | 0 | 11,392 | 11,392 | 0 |
| Snapshot Mem/1000 | 2 | 2 | 0 | 114,688 | 114,688 | 0 |
| Snapshot Mem/5000 | 2 | 2 | 0 | 573,440 | 573,440 | 0 |
| Snapshot SQL/100 | 2,229 | 2,229 | 0 | 69,768 | 69,768 | 0 |
| Snapshot SQL/1000 | 22,035 | 22,035 | 0 | 612,840 | 612,840 | 0 |
| Snapshot SQL/5000 | 110,045 | 110,045 | 0 | 3,832,047 | 3,832,046 | -1 |
| Traversal Mem/100 | 2 | 2 | 0 | 128 | 128 | 0 |
| Traversal Mem/1000 | 7 | 7 | 0 | 704 | 704 | 0 |
| Traversal Mem/5000 | 34 | 34 | 0 | 9,912 | 9,912 | 0 |
| Traversal SQL/100 | 75 | 75 | 0 | 2,729 | 2,729 | 0 |
| Traversal SQL/1000 | 107 | 107 | 0 | 3,618 | 3,618 | 0 |
| Traversal SQL/5000 | 368 | 368 | 0 | 12,213 | 12,213 | 0 |
Memory behavior is effectively identical: allocs/op is unchanged in every variant, and bytes/op differs by at most 1 byte (Snapshot SQL/5000), which is within the reporting precision of `-benchmem`.
### Key Observations

1. All 12 benchmark variants show statistically significant improvements (KS test p < 0.05, Cohen's d large). No regressions were detected.
2. Improvements range from 2.15% to 10.69% across variants, with effect sizes (Cohen's d) all in the large range (1.05 to 10.30).
3. Memory allocations are unchanged — zero change in allocs/op for all variants, confirming that the ADR-007/010 changes introduce no new allocation paths.
4. SQLite Snapshot at scale=5000 shows the largest improvement (−10.69%, d=2.09), while SQLite Traversal at scale=100 shows the smallest (−2.15%, d=1.12).
5. The uniform improvement across all variants is suspicious and most likely reflects the benchmark methodology difference (1s vs. 30s benchtime) rather than a genuine code improvement.
## 5. Conclusion

### Hypothesis Assessment

- [ ] Confirmed — all success criteria met with statistical significance.
- [ ] Rejected — one or more success criteria failed, or a significant regression was detected.
- [x] Inconclusive — the methodology caveat precludes a definitive conclusion.
### Effect Size

Cohen's d on the primary metric (mean ns/op): range 1.05–10.30, all "large".

Interpretation: every variant shows a large apparent improvement (experiment faster). However, this is almost certainly an artifact of the different `-benchtime` parameters (1s vs. 30s), not a genuine performance improvement from the ADR-007/010 changes.
### Confidence

Low — the benchmark parameter mismatch invalidates direct comparison. The consistent 3–10% improvement across all variants strongly suggests a systematic effect of the shorter benchtime (less GC amortization, different warmup behavior) rather than a code-level optimization.
### What We Can Conclude

- Memory allocations are unchanged — ADR-007/010 introduces no new allocation paths. Allocs/op and B/op are effectively identical between baseline and experiment.
- No catastrophic regressions are visible — even with the benchtime artifact favoring the experiment branch, no regressions were detected. The code at least does not make things dramatically worse.
- Methodology needs refinement — future experiments must use identical `-benchtime` and `-count` parameters.
## 6. Decision

- [ ] Merge — hypothesis confirmed; proceed with merge.
- [ ] Revert — hypothesis rejected with a significant regression; do not merge.
- [x] Iterate — results are promising but insufficient; revise the approach and re-run.
- [ ] Abandon — the approach is fundamentally flawed or its cost exceeds the benefit.
### Rationale

The experiment detected no regressions and no memory-allocation changes, which is consistent with the hypothesis. However, the benchmark methodology difference (`-benchtime=1s` vs. `-benchtime=30s`) invalidates direct comparison of the latency metrics. The observed 3–10% "improvement" is almost certainly a benchtime artifact.

Recommendation: re-run with identical `-benchtime=30s -count=30` on a machine with sufficient CPU (no competing processes) and a 4+ hour time budget. Alternatively, use `-benchtime=3s -count=30` as a practical compromise that still provides meaningful warmup.

The allocs/op and B/op comparison is valid (these metrics are unaffected by benchtime) and shows zero change — a positive signal for the ADR-007/010 changes.
## 7. Commit

### Measurement Commits

- Baseline: `e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978` — "fix: resolve merge artifacts — CompleteJob signature, NewGraphDB args, restore waitForLitestream"
- Experiment: `dc1fbe51c83a020cdad6a99325d65a83076bf0c5` — "bench: add ADR-007/010 regression benchmarks for experiment"
### Verification

```bash
# Verify the baseline commit is on development
git branch --contains e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 | grep development

# Verify the experiment commit is the feature branch HEAD
cd ...podpedia-app.feature-experiment-adr007-010 && git rev-parse HEAD
# dc1fbe51c83a020cdad6a99325d65a83076bf0c5
```
### Trial Data

- `experiments/trials/2026-05-12-adr007-010-pipeline/main-branch.txt` — 360 data points (30 iterations × 12 variants)
- `experiments/trials/2026-05-12-adr007-010-pipeline/dev-branch.txt` — 119 data points (10 iterations × 12 variants, minus the one iteration lost to the SQLite Traversal scale=5000 timeout)
### Analysis Script

- `experiments/benchmarks/analyze.py` — KS test, Cohen's d, percentile analysis
## 8. Appendix: Experiment Run Notes

### Issues Encountered

- Zombie benchmark processes (PIDs 3030704 and 3176859) from earlier attempts were consuming ~300% CPU, slowing the initial run to roughly 1 data point per 2 minutes.
- The initial run with `-benchtime=30s -count=30` was estimated at 7+ hours due to the zombie-process CPU contention and the scale=5000 SQLite variants taking ~2 minutes per iteration.
- The Go test timeout (`-test.timeout=10m0s`) killed an early run. Mitigated with `-timeout 0`.
- An exec timeout killed two runs. The final run used `timeout=1800` (30 min) but still timed out at 119/120 data points.
- Stale process log output — the `tee` output was heavily buffered, making real-time progress monitoring unreliable.
### Cleanup Performed

```bash
kill -9 3030704 3176859  # zombie benchmark processes
```
### Benchmark Code

All benchmarks live in `backend/handlers/graph_benchmark_test.go`:

- `BenchmarkGraphSnapshot` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- `BenchmarkGraphTraversal` — MemoryGraphDB and SQLiteGraphDB at scale=100, 1000, 5000
- Additional benchmarks exist (`BenchmarkPipelineAssembly`, `BenchmarkGraphBulkInsert`, `BenchmarkNewGraphDB`, `BenchmarkGraphStats`, `BenchmarkEntityResolution`) but were outside the scope of this experiment.