Performance
This document records the benchmark and profiling path for the engine.
For the structural explanation of the matching engine through a performance lens, read `docs/engine-performance.md`.
Benchmark command
Run:
cargo bench --bench throughput -- --sample-size 10cargo bench --bench price_level_prototypes -- --sample-size 10
For a lightweight local profile wrapper:
./scripts/profile_bench.sh
On macOS the helper uses /usr/bin/time -l; on Linux it uses /usr/bin/time -v.
Current baseline
The most recent local benchmark pass produced approximately:
submit_resting_limit_orders:7.80 msto7.83 mscross_seeded_book:395.8 µsto396.9 µscancel_populated_book:4.00 msto4.02 mssweep_single_price_fifo:8.02 msto8.04 ms
The benchmark suite also includes lean-path comparisons:
submit_resting_limit_orders_minimalsubmit_resting_limit_orders_minimal_no_invariantscross_seeded_book_minimalcross_seeded_book_minimal_no_invariantscancel_populated_book_minimalcancel_populated_book_minimal_no_invariantscancel_deep_single_price_levelcancel_deep_single_price_level_minimalcancel_deep_single_price_level_minimal_no_invariants
The repository also includes prototype-only price-level structure comparisons:
prototype_enqueue_vecdeque_1000prototype_enqueue_slab_1000prototype_enqueue_dense_1000prototype_fifo_sweep_vecdeque_1000prototype_fifo_sweep_slab_1000prototype_fifo_sweep_dense_1000prototype_deep_cancel_vecdeque_1000prototype_deep_cancel_slab_1000prototype_deep_cancel_dense_1000
Allocation profile
The most recent ./scripts/profile_bench.sh run on macOS reported approximately:
- maximum resident set size:
57.1 MB - peak memory footprint:
38.7 MB - retired instructions:
895M - elapsed cycles:
275M - wall time for the full benchmark suite:
38.0 s
These are whole-suite numbers, not per-benchmark isolated allocations. They are useful for regression tracking, not for attributing one code path precisely.
These numbers are environment-specific and should only be used as local regression references, not absolute performance claims.
Profiling guidance
Use the benchmark suite before attempting engine optimizations.
Suggested order:
- Run
./scripts/profile_bench.sh. - Compare against the stored Criterion baseline.
- Only optimize paths that show stable pressure across repeated runs.
Current likely hot paths:
- resting order insertion
- resting order insertion with rich result assembly
- resting order insertion with lean summaries
- post-mutation invariant walks
- best-level matching loops
- FIFO sweep execution
- cancellation on a populated book
- cancellation inside a deep single price level
- mixed insert-and-cross churn
Current read on optimization pressure:
- CPU hot spots appear more important than gross memory pressure at the current workload size.
- The full suite footprint is moderate for a benchmark process and does not yet justify structural redesign.
- If optimization continues, prioritize per-benchmark allocation tracing or CPU sampling around matching loops and event/top-of-book construction.