TA-Numba: Technical Analysis Library with Numba and Rust Acceleration
TA-Numba is a Python library for financial technical analysis that provides dependency-free installation and high-performance computation through four computation tiers: Numba JIT bulk processing, Rust/PyO3 bulk processing, Python streaming, and Rust/PyO3 streaming. Version 0.4.0 introduced full quadrant parity — all four tiers produce identical results for the same input data, validated by 83 cross-quadrant tests.
Below are the original research-paper benchmarks (v0.2.0, Numba only), followed by the v0.4.0 updates covering the Rust streaming backend and the four-quadrant architecture.
📊 Performance Comparison
Based on comprehensive benchmarks with 100,000 data points across multiple technical analysis libraries:
| Aspect | TA-Lib | ta-numba | ta | pandas | cython |
|---|---|---|---|---|---|
| Installation | C compiler required | pip install only | pip install only | pip install only | Compilation required |
| Average Performance | Fastest (baseline) | Slower than TA-Lib on 7 of 12 indicators | 857x slower than ta-numba | 94x slower than ta-numba | 2.5x slower than ta-numba |
| Best Cases | Fastest overall | MACD: 3.8x faster | All cases slower | All cases slower | Mixed results |
| Worst Cases | WMA, ADX fastest | WMA: 33x slower | PSAR: 8,837x slower | ATR: 13x slower | Variable performance |
| Dependency Issues | Frequent | None | None | Rare | Build-time only |
| Streaming Support | No | Yes (15.8x faster than bulk recalculation) | No | No | No |
⚡ Performance & Benchmarks
📊 Benchmark Methodology
Test Environment:
- Data Size: 100,000 price points
- Iterations: 3 runs per indicator per library
- Hardware: Standard development machine
- Libraries: ta-numba, ta-lib, ta, pandas, cython, NautilusTrader
Performance Analysis:
- ta-numba delivers substantial performance improvements over pure Python libraries
- TA-Lib maintains performance leadership in bulk processing
- ta-numba provides unique advantages in streaming scenarios
- Installation reliability varies significantly between libraries
📊 Comprehensive Benchmark Results (100K data points)
Complete Library Comparison:
Performance Comparison (Average Time per Run):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Indicator | ta | ta-numba | ta-lib | pandas | cython | nautilus | Speedup vs ta | Speedup vs talib | Speedup vs pandas | Speedup vs cython | Speedup vs nautilus
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SMA | 0.001196s | 0.001082s | 0.000087s | 0.000713s | 0.000058s | 0.105247s | 1.11x | 0.08x | 0.66x | 0.05x | 97.29x
EMA | 0.000577s | 0.000112s | 0.000332s | 0.000493s | 0.000168s | 0.011398s | 5.16x | 2.97x | 4.41x | 1.50x | 101.92x
RSI | 0.002789s | 0.001355s | 0.000433s | 0.002412s | 0.001946s | 0.062416s | 2.06x | 0.32x | 1.78x | 1.44x | 46.06x
MACD | 0.001635s | 0.000642s | 0.002456s | 0.001860s | 0.000666s | 0.012047s | 2.55x | 3.83x | 2.90x | 1.04x | 18.77x
ATR | 0.205986s | 0.000672s | 0.002262s | 0.008719s | 0.001687s | 0.018718s | 306.60x | 3.37x | 12.98x | 2.51x | 27.86x
Bollinger Upper | 0.002052s | 0.001432s | 0.000341s | 0.002129s | 0.006004s | 0.214716s | 1.43x | 0.24x | 1.49x | 4.19x | 149.92x
OBV | 0.000685s | 0.000066s | 0.000224s | N/A | 0.000275s | 14.146200s | 10.43x | 3.42x | N/A | 4.19x | 215376.26x
MFI | 0.482099s | 0.002581s | 0.002374s | 0.003096s | 0.006168s | 0.021110s | 186.77x | 0.92x | 1.20x | 2.39x | 8.18x
WMA | 2.456998s | 0.003013s | 0.000092s | 0.126318s | 0.002411s | 0.339517s | 815.56x | 0.03x | 41.93x | 0.80x | 112.70x
VWEMA | 0.000908s | 0.000822s | 0.029710s | 0.002095s | 0.004002s | 0.058675s | 1.10x | 36.13x | 2.55x | 4.87x | 71.35x
ADX | 0.407531s | 0.003533s | 0.000643s | 0.012459s | 0.009984s | 0.002930s | 115.34x | 0.18x | 3.53x | 2.83x | 0.83x
PSAR | 4.123320s | 0.000467s | 0.000346s | 0.449931s | 0.001659s | 0.007989s | 8837.04x | 0.74x | 964.29x | 3.56x | 17.12x
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Summary Statistics:
Average speedup vs ta: 857.10x
Average speedup vs ta-lib: 4.35x
Average speedup vs pandas: 94.34x
Average speedup vs cython: 2.45x
Average speedup vs nautilus: 18002.35x
Identical results vs ta: 11/12
Identical results vs ta-lib: 4/12
Identical results vs cython: 5/12
Identical results vs nautilus: 3/12
📈 Performance Summary
Benchmark Results Analysis:
vs Pure Python Libraries:
- ta library: 857x average speedup (range: 1.1x to 8,837x)
- pandas: 94x average speedup (range: 0.66x to 964x)
- Consistent performance advantage across most indicators
vs Compiled Libraries:
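- TA-Lib: faster than ta-numba on 7 of 12 indicators (SMA, RSI, Bollinger Upper, MFI, WMA, ADX, PSAR), by up to 33x on WMA; ta-numba is faster on the other 5, by up to 36x on VWEMA. The 4.35x figure in the summary statistics is the arithmetic mean of ta-numba's per-indicator speedup ratios and is dominated by the VWEMA result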
- cython: 2.5x average speedup (mixed results depending on indicator)
- Performance varies significantly by indicator complexity
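How much a single outlier moves the average can be checked directly from the "Speedup vs talib" column of the benchmark table. The sketch below copies those twelve ratios and compares the arithmetic mean with the geometric mean; the variable names are illustrative and not part of ta-numba.

```python
from statistics import fmean, geometric_mean

# "Speedup vs talib" column from the benchmark table (ta-lib time / ta-numba time).
speedup_vs_talib = {
    "SMA": 0.08, "EMA": 2.97, "RSI": 0.32, "MACD": 3.83,
    "ATR": 3.37, "Bollinger Upper": 0.24, "OBV": 3.42, "MFI": 0.92,
    "WMA": 0.03, "VWEMA": 36.13, "ADX": 0.18, "PSAR": 0.74,
}

ratios = list(speedup_vs_talib.values())
arithmetic = fmean(ratios)            # matches the reported "Average speedup vs ta-lib"
geometric = geometric_mean(ratios)    # far less sensitive to one large ratio
talib_faster = sorted(name for name, r in speedup_vs_talib.items() if r < 1.0)

print(f"arithmetic mean: {arithmetic:.2f}x")
print(f"geometric mean:  {geometric:.2f}x")
print(f"TA-Lib faster on {len(talib_faster)} of {len(ratios)}: {', '.join(talib_faster)}")
```

The geometric mean comes out below 1x, which is consistent with TA-Lib winning most of the individual comparisons even though the arithmetic mean is above 4x.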
Streaming Performance:
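- 15.8x faster on average than recomputing the indicator over the full history on every tick (10,000-tick simulation); the gap widens as history accumulates, from 2.3x at 10% of the run to 26.1x at the end
- Constant O(1) memory usage vs. O(n) growth
- Mean per-tick latency of 0.022 ms (22 µs), and 0.039 ms at the 99th percentile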
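The difference between the two approaches can be sketched with an exponential moving average. `StreamingEMA` and `bulk_ema` below are hypothetical names written for illustration, not ta-numba's API, and seeding the EMA with the first price is an assumption. The streaming object keeps a single float between ticks, while the bulk approach has to retain the whole history and walk it again on every tick.

```python
class StreamingEMA:
    """Exponential moving average updated one tick at a time (hypothetical class)."""

    def __init__(self, window: int) -> None:
        self.alpha = 2.0 / (window + 1)   # standard EMA smoothing factor
        self.value = None                 # the only state kept between ticks

    def update(self, price: float) -> float:
        if self.value is None:
            self.value = price            # assumption: seed with the first observation
        else:
            self.value += self.alpha * (price - self.value)
        return self.value


def bulk_ema(prices, window):
    """Recompute the EMA over the whole history and return the last value."""
    alpha = 2.0 / (window + 1)
    ema = prices[0]
    for p in prices[1:]:
        ema += alpha * (p - ema)
    return ema


prices = [100.0, 101.5, 99.8, 102.2, 103.1, 101.9, 104.4]
stream = StreamingEMA(window=3)
history = []
for p in prices:
    history.append(p)                      # bulk approach must retain every tick
    latest_stream = stream.update(p)       # O(1) work, O(1) memory
    latest_bulk = bulk_ema(history, 3)     # O(n) work on every tick
    assert abs(latest_stream - latest_bulk) < 1e-12
```

Both paths perform the same arithmetic in the same order, so they agree on every tick; only the cost per tick differs.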
Library Selection Criteria:
- Choose TA-Lib for: Maximum performance, stable environment, C compilation acceptable
- Choose ta-numba for: Reliable deployment, streaming requirements, Python-only environments
- Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows
Real-Time Streaming Performance (per tick):
🚀 REAL-TIME STREAMING COMPARISON
============================================================
Simulating live market data feed with continuous price updates...
📊 Generating 100 warmup ticks...
🔥 Warming up JIT compilation...
📈 Initializing streaming indicators...
🎯 SIMULATING 10,000 LIVE MARKET TICKS...
------------------------------------------------------------
Progress: 10% | Avg Bulk: 0.039ms | Avg Streaming: 0.017ms | Speedup: 2.3x
Progress: 20% | Avg Bulk: 0.103ms | Avg Streaming: 0.018ms | Speedup: 5.8x
Progress: 30% | Avg Bulk: 0.174ms | Avg Streaming: 0.019ms | Speedup: 9.0x
Progress: 40% | Avg Bulk: 0.244ms | Avg Streaming: 0.021ms | Speedup: 11.6x
Progress: 50% | Avg Bulk: 0.313ms | Avg Streaming: 0.023ms | Speedup: 13.5x
Progress: 60% | Avg Bulk: 0.378ms | Avg Streaming: 0.023ms | Speedup: 16.2x
Progress: 70% | Avg Bulk: 0.447ms | Avg Streaming: 0.024ms | Speedup: 18.7x
Progress: 80% | Avg Bulk: 0.516ms | Avg Streaming: 0.024ms | Speedup: 21.7x
Progress: 90% | Avg Bulk: 0.589ms | Avg Streaming: 0.024ms | Speedup: 24.3x
Progress: 100% | Avg Bulk: 0.671ms | Avg Streaming: 0.026ms | Speedup: 26.1x
📊 FINAL RESULTS
============================================================
Total ticks processed: 10,000
Lookback window size: 10000
⏱️ TIMING STATISTICS (per tick):
Method Mean Median 95%ile 99%ile
-------------------------------------------------------
Bulk 0.347ms 0.346ms 0.673ms 0.699ms
Streaming 0.022ms 0.022ms 0.028ms 0.039ms
🚀 PERFORMANCE IMPROVEMENT:
Average speedup: 15.8x faster
Median speedup: 15.9x faster
💾 MEMORY USAGE COMPARISON:
Bulk approach: O(n) = 10000 * 8 bytes * 7 indicators = 546.9 KB
Streaming approach: O(1) = ~1 KB total (constant)
Memory efficiency: 547x less memory
⚡ LATENCY ANALYSIS:
Bulk 99th percentile: 0.699ms
Streaming 99th percentile: 0.039ms
For HFT (<1ms requirement): ✅ Bulk passes, ✅ Streaming passes
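The memory and speedup figures in the output above follow from simple arithmetic, reproduced here as a check. The ratio of the two mean timings is used as a stand-in for the reported average speedup; whether the benchmark instead averages per-tick ratios is not stated, so treat that part as an assumption.

```python
# Figures copied from the run above.
ticks, bytes_per_float, indicators = 10_000, 8, 7
bulk_kib = ticks * bytes_per_float * indicators / 1024   # 560,000 bytes in KiB
print(f"bulk buffers: {bulk_kib:.1f} KB")

bulk_mean_ms, streaming_mean_ms = 0.347, 0.022
ratio_of_means = bulk_mean_ms / streaming_mean_ms
print(f"ratio of means: {ratio_of_means:.1f}x")
```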
v0.4.0: Four-Quadrant Architecture
Version 0.4.0 introduced a fundamental architectural shift: ta-numba now operates across four computation tiers (quadrants), each optimized for a specific use case. All four produce identical numerical results, validated by 83 cross-quadrant parity tests.
The Quadrant Model
| Quadrant | Backend | Indicator Count | Optimal For |
|---|---|---|---|
| Bulk (historical) | Numba JIT | 49 | Training, backtesting, feature stores (100K+ points) |
| Bulk (historical) | Rust/PyO3 | 50 | Available but Numba preferred at batch scale |
| Streaming (live) | Python | 55 | Fallback, environments without Rust |
| Streaming (live) | Rust/PyO3 | 54 | Production live trading, sub-50 µs latency |
Why Numba Wins Bulk
At production batch sizes (100,000+ data points), Numba JIT compilation operates directly on NumPy memory without any Foreign Function Interface (FFI) boundary crossing. Rust accessed through PyO3 pays a per-call data marshalling cost — converting Python objects to Rust types and back — that becomes the dominant factor at large array sizes. Comprehensive benchmarks (50 iterations, 100K data points) showed Numba averaging 9x faster than Rust/PyO3 for bulk array operations across all 44 indicators.
This was a counter-intuitive finding: at small data sizes (3,000 points, typical of a single instrument's feature window), Rust showed 2-12x speedups over Numba. The FFI overhead only dominates when the array is large enough that the per-call marshalling cost exceeds the per-element computation savings.
Why Rust Wins Streaming
For streaming indicators — per-tick updates with O(1) state maintenance — the relationship inverts. Each update involves a single scalar value (the new price tick) crossing the FFI boundary, making marshalling cost negligible. Rust excels at maintaining complex internal state machines: rolling windows, peak tracking, multi-stage smoothing, and iterative accumulation — all without garbage collection pauses.
The average Rust streaming speedup across all 45 classes was 2.6x, with complex indicators reaching up to 13.3x (Ulcer Index). For simple indicators where the Python computation is trivial (OBV, DailyReturn), Rust was slower because the fixed FFI overhead outweighs the minimal computation.
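To make "peak tracking" over a rolling window concrete, here is a minimal Python sketch of the kind of per-tick state such an indicator maintains. It is not the library's Rust code: `RollingMax` is a hypothetical helper, and the monotonic-deque technique is one common way to get amortized O(1) updates.

```python
from collections import deque


class RollingMax:
    """Rolling-window maximum with amortized O(1) updates (hypothetical helper)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.count = 0
        self.candidates = deque()   # (index, value) pairs, values strictly decreasing

    def update(self, value: float) -> float:
        # Drop candidates the new value dominates; they can never be the max again.
        while self.candidates and self.candidates[-1][1] <= value:
            self.candidates.pop()
        self.candidates.append((self.count, value))
        # Expire the front candidate once it falls out of the window.
        if self.candidates[0][0] <= self.count - self.window:
            self.candidates.popleft()
        self.count += 1
        return self.candidates[0][1]


highs = [9.0, 2.0, 1.0, 1.0, 5.0, 4.0, 3.0, 2.0]
peak = RollingMax(window=3)
rolling_peaks = [peak.update(h) for h in highs]
# Cross-check against the naive O(window) definition.
naive = [max(highs[max(0, i - 2): i + 1]) for i in range(len(highs))]
assert rolling_peaks == naive
```

Every update allocates tuples and mutates a deque, which is the sort of per-tick object churn the paragraph above attributes to the Python tier.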
Quadrant Parity Guarantee
All four quadrants produce numerically identical results for the same input data:
- Numba bulk SMA(close, 20) = Rust bulk SMA(close, 20) = Python streaming SMA after warmup = Rust streaming SMA after warmup
- Validated across all shared indicators with max tolerance of 1e-10
- 83 cross-quadrant tests pass in CI
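A parity check of this kind can be sketched in pure Python. The functions and class below are hypothetical stand-ins for one bulk and one streaming implementation, not the project's actual test suite; the NaN-during-warmup convention and the use of an absolute tolerance are assumptions.

```python
import math
from collections import deque


def bulk_sma(values, window):
    """Bulk SMA: NaN until `window` samples are available (assumed convention)."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            out.append(sum(values[i + 1 - window: i + 1]) / window)
    return out


class StreamingSMA:
    """Streaming SMA backed by a running sum (hypothetical class)."""

    def __init__(self, window):
        self.window = window
        self.buf = deque()
        self.total = 0.0

    def update(self, value):
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()
        return self.total / self.window if len(self.buf) == self.window else math.nan


def assert_parity(bulk, streamed, warmup, tol=1e-10):
    """Compare two outputs element-wise once the warmup period has passed."""
    for i in range(warmup, len(bulk)):
        assert math.isclose(bulk[i], streamed[i], rel_tol=0.0, abs_tol=tol), i


close = [100.0 + 0.37 * i - 0.011 * i * i for i in range(200)]
sma = StreamingSMA(20)
streamed = [sma.update(c) for c in close]
assert_parity(bulk_sma(close, 20), streamed, warmup=19)
```

The two implementations sum the window in different orders, so their outputs differ in the last few bits. This is why a parity test compares against a tolerance and not with exact equality.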
Rust Streaming Benchmarks
Benchmarked with 10,000 ticks, 10 iterations, comparing Rust/PyO3 streaming classes against Numba/Python streaming baselines:
| Indicator | Rust (ms/10K ticks) | Numba (ms/10K ticks) | Speedup |
|---|---|---|---|
| Ulcer Index | 16.9 | 225.2 | 13.3x |
| Stochastic RSI | 10.3 | 94.0 | 9.2x |
| Awesome Oscillator | 6.7 | 54.1 | 8.1x |
| Bollinger Bands | 11.2 | 81.2 | 7.2x |
| Ultimate Oscillator | 15.8 | 109.0 | 6.9x |
| CCI | 8.5 | 52.8 | 6.2x |
| TSI | 5.0 | 28.7 | 5.8x |
| DPO | 5.2 | 26.9 | 5.2x |
| KAMA | 8.5 | 42.4 | 5.0x |
| SMA | 5.1 | 25.6 | 5.0x |
| MACD | 5.3 | 20.6 | 3.9x |
| RSI | 5.0 | 13.9 | 2.8x |
| ATR | 8.6 | 17.9 | 2.1x |
| ADX | 11.2 | 22.2 | 2.0x |
Complex indicators with internal state machines (Ulcer Index, Stochastic RSI, Bollinger Bands) show the largest speedups because Rust avoids Python object allocation and garbage collection overhead during iterative state updates.
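A per-10K-tick figure like those in the table can be measured with a harness along the following lines. This is a sketch, not the project's benchmark script: `StreamingReturn` is a hypothetical stand-in for any class exposing `update(value)`, and reporting the best of the iterations (as opposed to the mean) is an assumption.

```python
import math
import time


class StreamingReturn:
    """Stand-in indicator: simple return versus the previous tick (hypothetical)."""

    def __init__(self) -> None:
        self.prev = None

    def update(self, price: float) -> float:
        result = math.nan if self.prev is None else price / self.prev - 1.0
        self.prev = price
        return result


def ms_per_run(make_indicator, ticks, iterations=10):
    """Best wall-clock time, in milliseconds, to feed `ticks` through a fresh indicator."""
    best = float("inf")
    for _ in range(iterations):
        indicator = make_indicator()
        start = time.perf_counter()
        for t in ticks:
            indicator.update(t)
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


ticks = [100.0 + (i % 50) * 0.1 for i in range(10_000)]
baseline_ms = ms_per_run(StreamingReturn, ticks)
print(f"StreamingReturn: {baseline_ms:.1f} ms per 10K ticks")
# A speedup is then baseline_ms / candidate_ms, with candidate_ms measured the same way.
```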
New Generic Indicators (v0.4.0)
Version 0.4.0 added four new generic indicators available across all four quadrants:
| Indicator | Description | Use Case |
|---|---|---|
| volume_ratio | Volume relative to N-period average | Volume spike detection |
| rolling_zscore | Z-score within rolling window | Mean-reversion signals |
| linear_regression_slope | OLS slope over rolling window | Trend strength measurement |
| rolling_percentile | Percentile rank within rolling window | Relative position indicators |
These indicators were designed for quantitative feature engineering and are used in Trade-Matrix's production feature pipeline, in build_features.py (bulk) and queue_based_features.py (streaming).
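The four definitions in the table can be sketched in pure Python as follows. These are illustrative reference versions only: the `_sketch` function names, their signatures, the use of a population standard deviation, the inclusion of the current bar in the window, and the "less than or equal" percentile convention are all assumptions and may differ from what ta-numba implements.

```python
import math
from statistics import fmean, pstdev


def _windows(values, window):
    """Yield (index, trailing window) for every index that has a full window."""
    for i in range(window - 1, len(values)):
        yield i, values[i + 1 - window: i + 1]


def volume_ratio_sketch(volume, window):
    """Current volume divided by the mean of the trailing window."""
    out = [math.nan] * len(volume)
    for i, w in _windows(volume, window):
        out[i] = volume[i] / fmean(w)
    return out


def rolling_zscore_sketch(values, window):
    """(x - window mean) / window population std; NaN where the std is zero."""
    out = [math.nan] * len(values)
    for i, w in _windows(values, window):
        sd = pstdev(w)
        out[i] = (values[i] - fmean(w)) / sd if sd > 0 else math.nan
    return out


def linear_regression_slope_sketch(values, window):
    """OLS slope of the window's values against x = 0..window-1 (window >= 2)."""
    x_mean = (window - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in range(window))
    out = [math.nan] * len(values)
    for i, w in _windows(values, window):
        y_mean = fmean(w)
        out[i] = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(w)) / sxx
    return out


def rolling_percentile_sketch(values, window):
    """Percent of window values less than or equal to the current value."""
    out = [math.nan] * len(values)
    for i, w in _windows(values, window):
        out[i] = 100.0 * sum(v <= values[i] for v in w) / window
    return out


series = [10.0, 12.0, 14.0, 16.0, 18.0, 17.0]
slopes = linear_regression_slope_sketch(series, 3)
```

With a window of 3, the steadily rising prefix of `series` gives a slope of 2.0 per bar, and the final window (16, 18, 17) gives 0.5.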
Updated Performance Comparison
With the Rust streaming tier added in v0.4.0, ta-numba compares with the alternative libraries as follows:
| Aspect | TA-Lib | ta-numba (Numba bulk) | ta-numba (Rust streaming) | ta | pandas |
|---|---|---|---|---|---|
| Installation | C compiler required | pip install only | pip install only (wheels) | pip install only | pip install only |
| Bulk Performance | Fastest (baseline) | Slower than TA-Lib on 7 of 12 indicators | 9x slower than Numba | 857x slower than ta-numba | 94x slower than ta-numba |
| Streaming Performance | No streaming | 15.8x vs bulk recalc | 2.6x faster than Numba streaming | No streaming | No streaming |
| Complex Streaming | N/A | Baseline | Up to 13.3x (Ulcer Index) | N/A | N/A |
| Dependency Issues | Frequent | None | None (pre-built wheels) | None | Rare |
| Memory (streaming) | N/A | O(1) constant | O(1) constant | N/A | N/A |
Updated Library Selection Criteria
- Choose TA-Lib for: Maximum bulk performance, stable environment, C compilation acceptable
- Choose ta-numba (Numba) for: Reliable bulk deployment, Python-only environments, training/backtesting at 100K+ points
- Choose ta-numba (Rust streaming) for: Production live trading, state-machine indicators, sub-50 µs per-tick latency requirements
- Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows, prototyping
