
Experiment Report: ADR-007/010 — Pipeline Regression Experiment

| Field | Value |
| --- | --- |
| Experiment ID | ADR-007-010-pipeline |
| Date | 2026-05-12 |
| Author | @coding-agent |
| PR | feature/experiment-adr007-010 |
| Commit (baseline) | e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 (development) |
| Commit (experiment) | dc1fbe51c83a020cdad6a99325d65a83076bf0c5 (feature/experiment-adr007-010) |
| Status | ⚠️ inconclusive — methodology caveat |

1. Hypothesis

"The merged ADR-007 + ADR-010 changes introduce no statistically significant regression in graph snapshot latency, graph traversal latency, or memory allocations."

Success Criteria (quantified)

Failure Criteria


2. Variables

Independent Variable

Benchmark configuration difference: The baseline was measured with -benchtime=30s -count=30, while the experiment was measured with -benchtime=1s -count=10 due to runtime constraints. This is a critical methodology caveat — the shorter benchtime may produce systematically different results (less accumulated GC pressure, less cache and scheduler warm-up, smaller iteration counts). Any observed differences should be interpreted with this caveat.

Dependent Variables (metrics measured)

Controlled Variables


3. Methodology

Benchmark Harness

# Baseline (main/development branch)
go test -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
  -benchmem -benchtime=30s -count=30 ./handlers/

# Experiment (feature/experiment-adr007-010 branch)
go test -timeout 0 -run '^$' -bench='BenchmarkGraphSnapshot|BenchmarkGraphTraversal' \
  -benchmem -benchtime=1s -count=10 ./handlers/
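Before any statistics can be run, the raw `go test -bench` output from each run has to be reduced to per-run ns/op samples keyed by benchmark variant. A minimal stdlib-only sketch of that parsing step (the line format is the standard Go benchmark output; the function name is illustrative, not the actual analysis script):

```python
import re
from collections import defaultdict

# Matches standard `go test -bench` result lines, e.g.:
#   BenchmarkGraphSnapshot/Mem/100-8   1000000   3652 ns/op   11392 B/op   2 allocs/op
# The trailing -N (GOMAXPROCS suffix) is stripped from the benchmark name.
BENCH_LINE = re.compile(
    r"^(Benchmark\S+?)(?:-\d+)?\s+(\d+)\s+([\d.]+)\s+ns/op"
    r"(?:\s+(\d+)\s+B/op\s+(\d+)\s+allocs/op)?"
)

def parse_bench_output(text):
    """Collect one ns/op sample per result line, keyed by benchmark name."""
    samples = defaultdict(list)
    for line in text.splitlines():
        m = BENCH_LINE.match(line)
        if m:
            samples[m.group(1)].append(float(m.group(3)))
    return samples
```

With -count=30 each benchmark name accumulates 30 samples, which is what the KS test and Cohen's d below are computed over.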

Sample Size

Baseline: 30 runs per variant (-count=30). Experiment: 10 runs per variant (-count=10); the final run captured 119 of 120 expected data points (see Appendix).

Benchmarks Tested

All benchmarks from backend/handlers/graph_benchmark_test.go: BenchmarkGraphSnapshot and BenchmarkGraphTraversal, each against the Mem and SQL backends at scales 100, 1000, and 5000 (12 variants in total).

Statistical Tests Applied

  1. Two-sample KS test (scipy.stats.ks_2samp) on latency distributions
  2. Cohen's d for effect size
  3. Percentile analysis: p50, p95, p99
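The actual analysis used scipy.stats.ks_2samp; the three quantities can be sketched with only the standard library (illustrative, not the analysis script itself — the KS sketch returns only the D statistic, not the p-value):

```python
from statistics import mean, stdev

def cohens_d(a, b):
    # pooled-standard-deviation Cohen's d, as reported in the summary table
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

def ks_statistic(a, b):
    # two-sample KS D statistic: max vertical gap between the empirical CDFs
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(sum(v <= x for v in sa) / len(sa) - sum(v <= x for v in sb) / len(sb))
        for x in sa + sb
    )

def percentile(data, p):
    # nearest-rank percentile, for the p50/p95/p99 analysis
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]
```

scipy.stats.ks_2samp additionally converts D into the p-values shown in the summary table.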

Trial Data Files


4. Results

⚠️ Methodology Caveat

The baseline used -benchtime=30s -count=30 while the experiment used -benchtime=1s -count=10. The original plan was to run identical parameters, but -benchtime=30s -count=30 proved infeasible (estimated ~7+ hours for the dev branch run due to system constraints). Shorter benchtimes can produce systematically lower latencies because:

  1. less cumulative GC pressure builds up inside each short timed window
  2. warm-up effects (caches, scheduler state) are distributed differently across short and long runs
  3. smaller iteration counts yield noisier per-run ns/op averages

All results below should be interpreted with this caveat. A follow-up experiment with identical benchtime parameters is recommended to confirm findings.
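For reference, rough feasibility arithmetic behind the infeasibility estimate (illustrative; the variant count and flag values are taken from this report):

```python
# Each of the 12 variants is run `count` times, and each run lasts at
# least `benchtime` seconds, giving an absolute wall-clock floor.
variants, count, benchtime_s = 12, 30, 30
floor_hours = variants * count * benchtime_s / 3600
print(floor_hours)  # 3.0
# The scale=5000 SQLite variants, whose single iterations (~2 min) exceed
# benchtime, plus CPU contention, push the estimate toward the ~7+ hours
# cited above.
```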

Summary Table

| Variant | Baseline (ns/op) | Experiment (ns/op) | Δ | Δ% | Cohen's d | KS p-value |
| --- | ---: | ---: | ---: | ---: | --- | --- |
| Snapshot Mem/100 | 3,652 | 3,439 | -213 | -5.83% | 1.70 (large) | 0.000032 |
| Snapshot Mem/1000 | 40,921 | 37,139 | -3,782 | -9.24% | 10.30 (large) | <0.000001 |
| Snapshot Mem/5000 | 226,034 | 212,945 | -13,089 | -5.79% | 2.46 (large) | <0.000001 |
| Snapshot SQL/100 | 225,087 | 214,553 | -10,533 | -4.68% | 4.80 (large) | <0.000001 |
| Snapshot SQL/1000 | 2,059,968 | 1,952,934 | -107,034 | -5.20% | 6.23 (large) | <0.000001 |
| Snapshot SQL/5000 | 11,332,231 | 10,121,212 | -1,211,019 | -10.69% | 2.09 (large) | <0.000001 |
| Traversal Mem/100 | 6,558 | 6,332 | -226 | -3.45% | 1.05 (large) | 0.000032 |
| Traversal Mem/1000 | 87,157 | 82,828 | -4,329 | -4.97% | 1.93 (large) | <0.000001 |
| Traversal Mem/5000 | 557,967 | 529,905 | -28,062 | -5.03% | 1.93 (large) | <0.000001 |
| Traversal SQL/100 | 205,573 | 201,148 | -4,425 | -2.15% | 1.12 (large) | 0.0001 |
| Traversal SQL/1000 | 1,532,563 | 1,465,552 | -67,011 | -4.37% | 1.94 (large) | <0.000001 |
| Traversal SQL/5000 | 7,612,253 | 6,926,938 | -685,316 | -9.00% | 1.81 (large) | <0.000001 |
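The Δ and Δ% columns can be recomputed directly from the two mean columns; for example, using the Snapshot Mem/1000 row above:

```python
def delta_pct(baseline_ns, experiment_ns):
    # percentage change relative to baseline, as in the Δ% column
    return (experiment_ns - baseline_ns) / baseline_ns * 100.0

print(round(delta_pct(40_921, 37_139), 2))  # -9.24
```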

Memory Analysis

| Variant | Baseline Allocs | Experiment Allocs | Δ | Baseline Bytes | Experiment Bytes | Δ |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Snapshot Mem/100 | 2 | 2 | 0 | 11,392 | 11,392 | 0 |
| Snapshot Mem/1000 | 2 | 2 | 0 | 114,688 | 114,688 | 0 |
| Snapshot Mem/5000 | 2 | 2 | 0 | 573,440 | 573,440 | 0 |
| Snapshot SQL/100 | 2,229 | 2,229 | 0 | 69,768 | 69,768 | 0 |
| Snapshot SQL/1000 | 22,035 | 22,035 | 0 | 612,840 | 612,840 | 0 |
| Snapshot SQL/5000 | 110,045 | 110,045 | 0 | 3,832,047 | 3,832,046 | -1 |
| Traversal Mem/100 | 2 | 2 | 0 | 128 | 128 | 0 |
| Traversal Mem/1000 | 7 | 7 | 0 | 704 | 704 | 0 |
| Traversal Mem/5000 | 34 | 34 | 0 | 9,912 | 9,912 | 0 |
| Traversal SQL/100 | 75 | 75 | 0 | 2,729 | 2,729 | 0 |
| Traversal SQL/1000 | 107 | 107 | 0 | 3,618 | 3,618 | 0 |
| Traversal SQL/5000 | 368 | 368 | 0 | 12,213 | 12,213 | 0 |

Memory allocations are effectively identical: zero change in allocs/op for every variant, and bytes/op differs only by a single byte in one variant (Snapshot SQL/5000), which is within measurement noise.

Key Observations

  1. All 12 benchmark variants show statistically significant improvements (KS test p < 0.05, Cohen's d large). No regressions detected.

  2. Improvements range from 2.15% to 10.69% across variants, with effect sizes (Cohen's d) in the large range (1.05 to 10.30).

  3. Memory allocations are identical — zero change in allocs/op for all variants, confirming ADR-007/010 changes introduce no new allocation paths.

  4. SQLite Snapshot at scale=5000 shows the largest improvement (-10.69%, d=2.09), while SQLite Traversal at scale=100 shows the smallest (-2.15%, d=1.12).

  5. The consistent improvement across all variants is suspicious and likely reflects the benchmark methodology difference (1s vs 30s benchtime) rather than actual code improvement.


5. Conclusion

Hypothesis Assessment

Effect Size

Cohen's d on primary metric: range 1.05–10.30 (all "large").

Interpretation: all variants show a large improvement (experiment faster). However, this is almost certainly an artifact of the different -benchtime parameters (1s vs 30s), not a genuine performance improvement from the ADR-007/010 changes.

Confidence

Low — the benchmark parameter mismatch invalidates direct comparison. The consistent 3–10% improvement across ALL variants strongly suggests a systematic effect of the shorter benchtime (less GC pressure per timed window, different warm-up behavior) rather than code-level optimization.

What We CAN Conclude


6. Decision

Select one:

Rationale

The experiment detected no regressions and no memory allocation changes, which is consistent with the hypothesis. However, the benchmark methodology difference (-benchtime=1s vs -benchtime=30s) invalidates direct comparisons of latency metrics. The observed 3–10% "improvement" is almost certainly a benchtime artifact.

Recommendation: Re-run with identical -benchtime=30s -count=30 on a machine with sufficient CPU (no competing processes) and a 4+ hour time budget. Alternatively, use -benchtime=3s -count=30 as a practical compromise that still provides meaningful warmup.

The allocs/op and B/op comparison is valid (unaffected by benchtime) and shows zero change — this is a positive signal for the ADR-007/010 changes.


7. Commit

Measurement Commits

Verification

# Verify baseline commit is on development
git branch --contains e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 | grep development

# Verify experiment commit is the feature branch HEAD
cd ...podpedia-app.feature-experiment-adr007-010 && git rev-parse HEAD
# dc1fbe51c83a020cdad6a99325d65a83076bf0c5

Trial Data

Analysis Script


8. Appendix: Experiment Run Notes

Issues Encountered

  1. Orphaned benchmark processes (PID 3030704, 3176859) from earlier attempts were consuming ~300% CPU, slowing the initial run to ~1 data point per 2 minutes.
  2. Initial run with -benchtime=30s -count=30 was estimated at 7+ hours due to the orphaned-process CPU contention and the scale=5000 SQLite variants taking ~2 minutes per iteration.
  3. Go test timeout (-test.timeout=10m0s) killed an early run. Mitigated with -timeout 0.
  4. Exec timeout killed two runs. Final run used timeout=1800 (30 min) but still timed out at 119/120 data points.
  5. Stale process log output — the tee output was significantly buffered, making real-time progress monitoring unreliable.

Cleanup Performed

kill -9 3030704 3176859  # orphaned benchmark processes from earlier attempts

Benchmark Code

All benchmarks in backend/handlers/graph_benchmark_test.go:
