# Experiment Index
Searchable table of all PodPedia experiments. Each entry links to a full report in this directory.
## Quick Reference
| ID | Date | Hypothesis | Conclusion | Commit | Status |
|---|---|---|---|---|---|
| ADR-001-live-ab | 2026-05-13 | 10K chunk threshold (ADR-001) reduces pipeline latency vs. 20K threshold | Promising but inconclusive (n=15, environment confound) | Production vs. Staging | ⚪ inconclusive |
| ADR-007-010-pipeline | 2026-05-12 | Merged ADR-007 + ADR-010 introduce no regression in graph latency or memory | Inconclusive (benchtime mismatch: 1s vs 30s) | e5d631c / dc1fbe5 | ⚪ inconclusive |
| EXP-001 | 2026-05-11 | Formalized experiment tracking will improve performance decision rigor and reduce regression incidents | Pending (deadline 2026-05-25) | — | 🟡 planned |
| EXP-002 | — | Ollama-based LLM handler achieves comparable p95 latency to Vertex AI for <500 token prompts | — | — | ⬜ planned |
| EXP-003 | — | SQLite-backed graph storage outperforms in-memory graph for entity counts >10,000 | — | — | ⬜ planned |
| EXP-004 | — | Batched entity resolution (chunk size 50) reduces p99 latency vs. sequential resolution by ≥30% | — | — | ⬜ planned |
| EXP-005 | — | Token-bucket rate limiter sustains 2x throughput vs. sliding-window under bursty ingest patterns | — | — | ⬜ planned |
## Status Legend
| Icon | Meaning |
|---|---|
| 🟢 | Confirmed |
| 🔴 | Rejected |
| 🟡 | In progress / planned with deadline |
| ⬜ | Planned (no start date) |
| ⚪ | Inconclusive |
## ADR-001-live-ab — 10K Character Threshold Live A/B Latency Test
| Field | Value |
|---|---|
| ID | ADR-001-live-ab |
| Type | Live A/B (production vs. staging) |
| Date | 2026-05-13 |
| Author | @coding-agent |
| Status | ⚪ inconclusive — low trial count |
### Hypothesis
Lowering the parallel graph extraction chunking threshold from 20K → 10K characters (ADR-001) reduces wall-clock extraction latency.
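As a sketch of the mechanism under test, the snippet below splits a document into threshold-sized chunks for parallel extraction. All names here (`splitForExtraction`, `chunkThreshold`) are hypothetical illustrations, not the real pipeline code:

```go
package main

import "fmt"

// chunkThreshold is the ADR-001 value under test; the baseline is 20_000.
const chunkThreshold = 10_000

// splitForExtraction cuts documents longer than the threshold into
// threshold-sized chunks that can be extracted in parallel; shorter
// documents pass through as a single chunk.
func splitForExtraction(text string, threshold int) []string {
	if len(text) <= threshold {
		return []string{text}
	}
	var chunks []string
	for start := 0; start < len(text); start += threshold {
		end := start + threshold
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
	}
	return chunks
}

func main() {
	doc := make([]byte, 25_000) // a 25K-char payload, as in the A/B test
	fmt.Println(len(splitForExtraction(string(doc), chunkThreshold))) // 3
}
```

Under this model a 25K payload yields three parallelizable chunks at the 10K threshold but only two at 20K, which is consistent with the largest observed speedup appearing at 25K chars.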
### Results Summary
| Payload | Production (20K) | Staging (10K) | Δ | KS p-value |
|---|---|---|---|---|
| 5K chars | 50.1s | 35.6s | -28.9% | 0.017* |
| 15K chars | 45.8s | 37.0s | -19.2% | 0.308 |
| 25K chars | 116.2s | 37.0s | -68.2% | <0.001* |

\* significant at p < 0.05
### Conclusion
Directional evidence strongly favors staging (10K threshold), especially at 25K chars (68% faster, d=-2.06, p<0.001). However, the 5K result reveals an environment confound: staging is faster even where no parallelization benefit exists. The critical 15K payload also did not reach significance (p=0.31, n=15). Next iteration: re-run both thresholds in a controlled environment with ≥30 trials.
Report: [2026-05-13-adr001-live-ab.md](2026-05-13-adr001-live-ab.md)
## ADR-007-010-pipeline — Pipeline Regression Experiment
| Field | Value |
|---|---|
| ID | ADR-007-010-pipeline |
| Type | Regression Test |
| Target | backend/handlers/graph.go |
| Date | 2026-05-12 |
| Author | @coding-agent |
| PR | feature/experiment-adr007-010 |
| Commit (baseline) | e5d631ca266d22d1efd2bc9e6d0537c7a6f4b978 |
| Commit (experiment) | dc1fbe51c83a020cdad6a99325d65a83076bf0c5 |
| Status | ⚪ inconclusive — benchtime mismatch |
### Hypothesis
Merged ADR-007 + ADR-010 changes introduce no statistically significant regression in graph snapshot latency, graph traversal latency, or memory allocations.
### Results
- All 12 benchmark variants showed statistically significant improvements (2.15% to 10.69%)
- Memory allocations: zero change (identical allocs/op and bytes/op across all variants)
- However: baseline used `-benchtime=30s -count=30`, experiment used `-benchtime=1s -count=10`
- The consistent improvement across all variants strongly suggests a benchtime artifact, not an actual code improvement
### Conclusion
Inconclusive. The shorter benchtime invalidates any direct latency comparison. Memory analysis, which is valid regardless of benchtime, shows zero allocation changes: a positive signal. A re-run with identical benchmark parameters is recommended.
Report: [2026-05-12-adr007-010-pipeline.md](2026-05-12-adr007-010-pipeline.md)
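One way to keep the re-run comparable is to drive both commits through the same in-process harness via `testing.Benchmark`, so the settings cannot drift between runs. `buildSnapshot` below is a hypothetical stand-in for the real snapshot path in `backend/handlers/graph.go`:

```go
package main

import (
	"fmt"
	"testing"
)

// buildSnapshot is a hypothetical placeholder for the real graph
// snapshot path; only the benchmarking shape matters here.
func buildSnapshot(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i * i
	}
	return s
}

func main() {
	// CLI equivalent, run identically against BOTH commits:
	//   go test -bench=. -benchtime=30s -count=30 ./backend/handlers/...
	// testing.Benchmark applies the same in-process settings to whichever
	// commit is checked out, so the results remain directly comparable.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = buildSnapshot(1024)
		}
	})
	fmt.Println(res.N >= 1) // true: the benchmark ran at least once
}
```

Comparing the two resulting benchmark logs with `benchstat` then gives the significance test the original comparison lacked.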
## EXP-001 — Meta-Experiment: Formalized Tracking Efficacy
| Field | Value |
|---|---|
| ID | EXP-001 |
| Type | Meta-experiment |
| Start Date | 2026-05-11 |
| Deadline | 2026-05-25 |
| Author | @team |
| Status | 🟡 planned |
### Hypothesis
Formalized experiment tracking (ADR-009) will reduce the number of performance regressions merged to main by ≥50% compared to the 3 months prior, and will increase the rate at which performance questions are answered with data rather than intuition.
### Success Criteria
- ≥50% reduction in performance regression incidents (counted as reverts or hotfixes due to perf)
- ≥80% of performance-sensitive PRs have a linked experiment report
- Developer survey shows increased confidence in performance decisions
### Methodology
This is a process meta-experiment. Over a 2-week trial period, all performance-sensitive PRs must comply with ADR-009. Data collected:
- Count of performance regressions merged before/after
- Count of PRs with/without linked experiments
- Qualitative developer feedback
### Report
Report file: EXP-001-formalized-tracking-efficacy.md (pending)
## EXP-002 — Ollama vs. Vertex AI LLM Latency Comparison
| Field | Value |
|---|---|
| ID | EXP-002 |
| Type | A/B Comparison |
| Target Component | backend/handlers/llm_ollama.go, backend/handlers/llm_vertex.go |
| Status | ⬜ planned |
### Hypothesis
The Ollama-based LLM handler achieves p50 and p95 latency within ±10% of the Vertex AI handler for prompts under 500 tokens.
### Success Criteria
- p95 Ollama latency ≤ 1.10 × p95 Vertex latency for <500 token prompts
- p99 Ollama latency ≤ 1.20 × p99 Vertex latency
- No significant regression in output quality (measured via structured output validation)
### Key Metrics
- p50, p95, p99 latency per handler
- Token throughput (tokens/sec)
- Memory allocations per invocation
## EXP-003 — SQLite Graph vs. In-Memory Graph at Scale
| Field | Value |
|---|---|
| ID | EXP-003 |
| Type | Optimization Experiment |
| Target Component | backend/handlers/graph_sqlite.go, backend/handlers/graph.go |
| Status | ⬜ planned |
### Hypothesis
SQLite-backed graph storage outperforms in-memory graph for entity counts exceeding 10,000, reducing p95 traversal latency by ≥40% while adding <10% overhead for small graphs (<1,000 entities).
### Success Criteria
- p95 traversal latency reduced by ≥40% for >10K entity graphs
- p95 latency for <1K entity graphs not increased by >10%
- Memory usage reduced by ≥50% for >10K entity graphs
### Key Metrics
- Graph traversal latency (p50, p95, p99)
- Memory usage (RSS at steady state)
- Insert/update latency for incremental graph mutations
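A fair comparison requires both backends to sit behind one API so the benchmark harness is identical. The interface and `memStore` below are hypothetical sketches, not the real types in `graph.go` / `graph_sqlite.go`:

```go
package main

import "fmt"

// GraphStore is a hypothetical abstraction that lets the benchmark swap
// the in-memory and SQLite-backed implementations behind one API.
type GraphStore interface {
	AddEntity(id string) error
	Neighbors(id string) ([]string, error)
}

// memStore is a minimal in-memory implementation for illustration only.
type memStore struct {
	edges map[string][]string
}

func (m *memStore) AddEntity(id string) error {
	if m.edges == nil {
		m.edges = map[string][]string{}
	}
	if _, ok := m.edges[id]; !ok {
		m.edges[id] = nil
	}
	return nil
}

func (m *memStore) Neighbors(id string) ([]string, error) {
	return m.edges[id], nil
}

func main() {
	var s GraphStore = &memStore{} // the SQLite variant would satisfy the same interface
	_ = s.AddEntity("ep-42")
	n, _ := s.Neighbors("ep-42")
	fmt.Println(len(n)) // 0: freshly added entity has no neighbors
}
```

With both implementations behind `GraphStore`, a single table-driven benchmark can run the 1K and 10K entity scenarios against each backend without duplicated measurement code.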
## EXP-004 — Batched vs. Sequential Entity Resolution
| Field | Value |
|---|---|
| ID | EXP-004 |
| Type | Optimization Experiment |
| Target Component | backend/handlers/entity_resolver.go |
| Status | ⬜ planned |
### Hypothesis
Batched entity resolution with chunk size 50 reduces p99 resolution latency by ≥30% compared to sequential resolution, without increasing resolution error rate by more than 1 percentage point.
### Success Criteria
- p99 latency reduced by ≥30% (KS test p < 0.05)
- Resolution error rate Δ ≤ 1.0 percentage point
- Memory usage not increased by >20%
### Key Metrics
- Entity resolution latency (p50, p95, p99)
- Resolution accuracy (precision/recall vs. gold standard)
- Memory allocations per resolution batch
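A sketch of the batched strategy under test: mentions are grouped into fixed-size batches that fan out to concurrent workers. `resolveBatch` is a hypothetical placeholder for one resolution round trip, not the real resolver in `entity_resolver.go`:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// resolveBatch stands in for one round trip that resolves a whole batch
// of entity mentions at once (hypothetical placeholder).
func resolveBatch(mentions []string) []string {
	out := make([]string, len(mentions))
	for i, m := range mentions {
		out[i] = strings.ToLower(m) // trivial "canonical form" for illustration
	}
	return out
}

// resolveAll fans batches of the given size (50 in the experiment) out
// to concurrent workers; each worker writes into a disjoint slice range.
func resolveAll(mentions []string, batchSize int) []string {
	out := make([]string, len(mentions))
	var wg sync.WaitGroup
	for start := 0; start < len(mentions); start += batchSize {
		end := start + batchSize
		if end > len(mentions) {
			end = len(mentions)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			copy(out[start:end], resolveBatch(mentions[start:end]))
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(resolveAll([]string{"Alice", "ALICE", "Bob"}, 2))
	// prints: [alice alice bob]
}
```

The sequential baseline is the same code with `batchSize` set to 1 and the goroutine removed, which keeps the two arms of the experiment structurally comparable.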
## EXP-005 — Token-Bucket vs. Sliding-Window Rate Limiter
| Field | Value |
|---|---|
| ID | EXP-005 |
| Type | A/B Comparison |
| Target Component | backend/handlers/ratelimit.go |
| Status | ⬜ planned |
### Hypothesis
The token-bucket rate limiter sustains 2x the throughput of the sliding-window limiter under bursty ingest patterns (burst size 100 requests, 1s window), while maintaining equivalent fairness (measured by max single-client starvation time).
### Success Criteria
- Token-bucket throughput ≥ 2.0 × sliding-window throughput under bursty load
- Max single-client starvation time ≤ sliding-window baseline
- p99 request latency under token-bucket ≤ p99 under sliding-window
### Key Metrics
- Sustained throughput (req/sec)
- Request latency distribution (p50, p95, p99)
- Client fairness (max consecutive denials per client)