TA-Numba: Technical Analysis Library with Numba and Rust Acceleration

TA-Numba is a Python library for financial technical analysis that provides dependency-free installation and high-performance computation through four computation tiers: Numba JIT bulk processing, Rust/PyO3 bulk processing, Python streaming, and Rust/PyO3 streaming. Version 0.4.0 introduced full quadrant parity — all four tiers produce identical results for the same input data, validated by 83 cross-quadrant tests.

Below are the benchmarks from the original research paper (v0.2.0, Numba-only), followed by the v0.4.0 updates covering the Rust streaming backend and the four-quadrant architecture.

📊 Performance Comparison

Based on comprehensive benchmarks with 100,000 data points across multiple technical analysis libraries:

Aspect              | TA-Lib                       | ta-numba                      | ta                  | pandas                | cython
-------------------------------------------------------------------------------------------------------------------------------------------------
Installation        | C compiler required          | pip install only              | pip install only    | pip install only      | Compilation required
Average Performance | Faster on 7 of 12 indicators | Baseline                      | 857x slower         | 94x slower            | 2.5x slower
Best Cases          | Fastest overall              | MACD: 3.8x faster than TA-Lib | All cases slower    | All slower except SMA | Mixed results
Worst Cases         | WMA, ADX fastest             | WMA: 33x slower than TA-Lib   | PSAR: 8,837x slower | PSAR: 964x slower     | Variable performance
Dependency Issues   | Frequent                     | None                          | None                | Rare                  | Build-time only
Streaming Support   | No                           | Yes (15.8x vs bulk recalc)    | No                  | No                    | No

⚡ Performance & Benchmarks

📊 Benchmark Methodology

Test Environment:

  • Data Size: 100,000 price points
  • Iterations: 3 runs per indicator per library
  • Hardware: Standard development machine
  • Libraries: ta-numba, ta-lib, ta, pandas, cython, NautilusTrader

Performance Analysis:

  • ta-numba delivers substantial performance improvements over pure Python libraries
  • TA-Lib maintains performance leadership in bulk processing
  • ta-numba provides unique advantages in streaming scenarios
  • Installation reliability varies significantly between libraries

📊 Comprehensive Benchmark Results (100K data points)

Complete Library Comparison:

Performance Comparison (Average Time per Run):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Indicator    | ta           | ta-numba     | ta-lib       | pandas       | cython       | nautilus     | Speedup vs ta | Speedup vs talib | Speedup vs pandas | Speedup vs cython | Speedup vs nautilus
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SMA          | 0.001196s | 0.001082s | 0.000087s    | 0.000713s    | 0.000058s    | 0.105247s    | 1.11x       | 0.08x           | 0.66x            | 0.05x            | 97.29x
EMA          | 0.000577s | 0.000112s | 0.000332s    | 0.000493s    | 0.000168s    | 0.011398s    | 5.16x       | 2.97x           | 4.41x            | 1.50x            | 101.92x
RSI          | 0.002789s | 0.001355s | 0.000433s    | 0.002412s    | 0.001946s    | 0.062416s    | 2.06x       | 0.32x           | 1.78x            | 1.44x            | 46.06x
MACD         | 0.001635s | 0.000642s | 0.002456s    | 0.001860s    | 0.000666s    | 0.012047s    | 2.55x       | 3.83x           | 2.90x            | 1.04x            | 18.77x
ATR          | 0.205986s | 0.000672s | 0.002262s    | 0.008719s    | 0.001687s    | 0.018718s    | 306.60x       | 3.37x           | 12.98x           | 2.51x            | 27.86x
Bollinger Upper | 0.002052s | 0.001432s | 0.000341s    | 0.002129s    | 0.006004s    | 0.214716s    | 1.43x       | 0.24x           | 1.49x            | 4.19x            | 149.92x
OBV          | 0.000685s | 0.000066s | 0.000224s    | N/A          | 0.000275s    | 14.146200s   | 10.43x       | 3.42x           | N/A              | 4.19x            | 215376.26x
MFI          | 0.482099s | 0.002581s | 0.002374s    | 0.003096s    | 0.006168s    | 0.021110s    | 186.77x       | 0.92x           | 1.20x            | 2.39x            | 8.18x
WMA          | 2.456998s | 0.003013s | 0.000092s    | 0.126318s    | 0.002411s    | 0.339517s    | 815.56x       | 0.03x           | 41.93x           | 0.80x            | 112.70x
VWEMA        | 0.000908s | 0.000822s | 0.029710s    | 0.002095s    | 0.004002s    | 0.058675s    | 1.10x       | 36.13x          | 2.55x            | 4.87x            | 71.35x
ADX          | 0.407531s | 0.003533s | 0.000643s    | 0.012459s    | 0.009984s    | 0.002930s    | 115.34x       | 0.18x           | 3.53x            | 2.83x            | 0.83x
PSAR         | 4.123320s | 0.000467s | 0.000346s    | 0.449931s    | 0.001659s    | 0.007989s    | 8837.04x       | 0.74x           | 964.29x          | 3.56x            | 17.12x
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Summary Statistics:
Average speedup vs ta: 857.10x
Average speedup vs ta-lib: 4.35x
Average speedup vs pandas: 94.34x
Average speedup vs cython: 2.45x
Average speedup vs nautilus: 18002.35x
Identical results vs ta: 11/12
Identical results vs ta-lib: 4/12
Identical results vs cython: 5/12
Identical results vs nautilus: 3/12
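
The averages above are arithmetic means of the per-indicator speedup ratios, so a single large ratio can dominate them. As a quick check, the sketch below recomputes the ta-lib figure from the "Speedup vs talib" column and sets it beside the geometric mean and the number of indicators where TA-Lib is faster (ratio below 1). The variable names are illustrative and not part of ta-numba.

```python
import math

# "Speedup vs talib" column, copied from the benchmark table above.
speedup_vs_talib = {
    "SMA": 0.08, "EMA": 2.97, "RSI": 0.32, "MACD": 3.83,
    "ATR": 3.37, "Bollinger Upper": 0.24, "OBV": 3.42, "MFI": 0.92,
    "WMA": 0.03, "VWEMA": 36.13, "ADX": 0.18, "PSAR": 0.74,
}

ratios = list(speedup_vs_talib.values())

# Arithmetic mean: what the summary statistics report.
arithmetic_mean = sum(ratios) / len(ratios)

# Geometric mean: less sensitive to a single outlier ratio.
geometric_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# A ratio below 1 means TA-Lib was faster for that indicator.
talib_faster = [name for name, r in speedup_vs_talib.items() if r < 1.0]

print(f"arithmetic mean: {arithmetic_mean:.2f}x")
print(f"geometric mean:  {geometric_mean:.2f}x")
print(f"TA-Lib faster on {len(talib_faster)} of {len(ratios)} indicators")
```

The arithmetic mean reproduces the 4.35x reported above, while the geometric mean falls below 1x because VWEMA alone contributes 36.13x.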

📈 Performance Summary

Benchmark Results Analysis:

vs Pure Python Libraries:

  • ta library: 857x average speedup (range: 1.1x to 8,837x)
  • pandas: 94x average speedup (range: 0.66x to 964x)
  • Consistent performance advantage across most indicators

vs Compiled Libraries:

  • TA-Lib: faster than ta-numba on 7 of 12 indicators; the 4.35x average speedup in ta-numba's favor is driven largely by VWEMA (36.13x)
  • cython: 2.5x average speedup (mixed results depending on indicator)
  • Performance varies significantly by indicator complexity

Streaming Performance:

  • 15.8x faster than bulk recalculation methods
  • Constant O(1) memory usage vs. O(n) growth
  • Microsecond-level latency for real-time applications
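
To illustrate the O(1) versus O(n) distinction, here is a minimal sketch of a streaming EMA next to a bulk recalculation. `StreamingEMA` and `bulk_ema_last` are hypothetical names written for this example, not ta-numba's API, and seeding the EMA with the first price is an assumption.

```python
import numpy as np


class StreamingEMA:
    """Hypothetical streaming EMA: the state is one float, so memory is O(1)."""

    def __init__(self, window: int):
        self.alpha = 2.0 / (window + 1)
        self.value = None  # no estimate until the first tick arrives

    def update(self, price: float) -> float:
        if self.value is None:
            self.value = price  # assumption: seed with the first observation
        else:
            self.value += self.alpha * (price - self.value)
        return self.value


def bulk_ema_last(prices: np.ndarray, window: int) -> float:
    """Recompute the EMA over the whole history: O(n) work on every tick."""
    alpha = 2.0 / (window + 1)
    ema = float(prices[0])
    for p in prices[1:]:
        ema += alpha * (float(p) - ema)
    return ema


rng = np.random.default_rng(0)
prices = 100.0 + np.cumsum(rng.normal(0.0, 0.5, size=1_000))

stream = StreamingEMA(window=20)
for price in prices:
    latest = stream.update(float(price))  # one scalar in, one scalar out

# Both approaches agree on the final value; only the cost per tick differs.
assert abs(latest - bulk_ema_last(prices, 20)) < 1e-10
```

The streaming object never stores the price history, which is why its memory use stays constant as more ticks arrive.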

Library Selection Criteria:

  • Choose TA-Lib for: Maximum performance, stable environment, C compilation acceptable
  • Choose ta-numba for: Reliable deployment, streaming requirements, Python-only environments
  • Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows

Real-Time Streaming Performance (per tick):

🚀 REAL-TIME STREAMING COMPARISON
============================================================
Simulating live market data feed with continuous price updates...

📊 Generating 100 warmup ticks...
🔥 Warming up JIT compilation...
📈 Initializing streaming indicators...

🎯 SIMULATING 10,000 LIVE MARKET TICKS...
------------------------------------------------------------
Progress:  10% | Avg Bulk:  0.039ms | Avg Streaming:  0.017ms | Speedup:   2.3x
Progress:  20% | Avg Bulk:  0.103ms | Avg Streaming:  0.018ms | Speedup:   5.8x
Progress:  30% | Avg Bulk:  0.174ms | Avg Streaming:  0.019ms | Speedup:   9.0x
Progress:  40% | Avg Bulk:  0.244ms | Avg Streaming:  0.021ms | Speedup:  11.6x
Progress:  50% | Avg Bulk:  0.313ms | Avg Streaming:  0.023ms | Speedup:  13.5x
Progress:  60% | Avg Bulk:  0.378ms | Avg Streaming:  0.023ms | Speedup:  16.2x
Progress:  70% | Avg Bulk:  0.447ms | Avg Streaming:  0.024ms | Speedup:  18.7x
Progress:  80% | Avg Bulk:  0.516ms | Avg Streaming:  0.024ms | Speedup:  21.7x
Progress:  90% | Avg Bulk:  0.589ms | Avg Streaming:  0.024ms | Speedup:  24.3x
Progress: 100% | Avg Bulk:  0.671ms | Avg Streaming:  0.026ms | Speedup:  26.1x

📊 FINAL RESULTS
============================================================
Total ticks processed: 10,000
Lookback window size: 10000

⏱️  TIMING STATISTICS (per tick):
Method                Mean     Median     95%ile     99%ile
-------------------------------------------------------
Bulk                0.347ms     0.346ms     0.673ms     0.699ms
Streaming           0.022ms     0.022ms     0.028ms     0.039ms

🚀 PERFORMANCE IMPROVEMENT:
Average speedup: 15.8x faster
Median speedup: 15.9x faster

💾 MEMORY USAGE COMPARISON:
Bulk approach: O(n) = 10000 * 8 bytes * 7 indicators = 546.9 KB
Streaming approach: O(1) = ~1 KB total (constant)
Memory efficiency: 547x less memory

⚡ LATENCY ANALYSIS:
Bulk 99th percentile: 0.699ms
Streaming 99th percentile: 0.039ms
For HFT (<1ms requirement): ✅ Bulk passes, ✅ Streaming passes
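
The per-tick statistics above can be gathered with a small timing harness. The sketch below is an assumption about how such a comparison might be structured, not the actual benchmark script: it times an SMA recomputed over the full history against a running-sum update, then reports mean, median, and tail percentiles. All function names are hypothetical.

```python
import time
from collections import deque

import numpy as np


def per_tick_stats(samples_ms):
    """Mean, median, 95th and 99th percentile of per-tick timings (ms)."""
    arr = np.asarray(samples_ms, dtype=float)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }


def run_comparison(ticks, window=20):
    """Time bulk recalculation and streaming updates of an SMA for each tick."""
    history = []                  # bulk: every tick seen so far
    buf = deque(maxlen=window)    # streaming: only the current window
    running_sum = 0.0
    bulk_ms, stream_ms = [], []
    bulk_value = stream_value = float("nan")

    for price in ticks:
        history.append(price)

        # Bulk: recompute the SMA over the full history, keep the last value.
        t0 = time.perf_counter()
        if len(history) >= window:
            arr = np.asarray(history)
            bulk_value = np.convolve(arr, np.ones(window) / window, mode="valid")[-1]
        bulk_ms.append((time.perf_counter() - t0) * 1e3)

        # Streaming: adjust a running sum by the tick entering and leaving.
        t0 = time.perf_counter()
        if len(buf) == window:
            running_sum -= buf[0]
        buf.append(price)
        running_sum += price
        if len(buf) == window:
            stream_value = running_sum / window
        stream_ms.append((time.perf_counter() - t0) * 1e3)

    return per_tick_stats(bulk_ms), per_tick_stats(stream_ms), bulk_value, stream_value


rng = np.random.default_rng(1)
ticks = (100.0 + np.cumsum(rng.normal(0.0, 0.5, size=2_000))).tolist()
bulk_stats, stream_stats, bulk_value, stream_value = run_comparison(ticks)
print("bulk     ", bulk_stats)
print("streaming", stream_stats)
```

Because the bulk path reprocesses a history that grows with every tick, its per-tick time rises over the run, which is consistent with the increasing "Avg Bulk" column in the progress log.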

v0.4.0: Four-Quadrant Architecture

Version 0.4.0 introduced a fundamental architectural shift: ta-numba now operates across four computation tiers (quadrants), each optimized for a specific use case. All four produce identical numerical results, validated by 83 cross-quadrant parity tests.

The Quadrant Model

Quadrant          | Backend   | Indicator Count | Optimal For
------------------------------------------------------------------------------------------------------
Bulk (historical) | Numba JIT | 49              | Training, backtesting, feature stores (100K+ points)
Bulk (historical) | Rust/PyO3 | 50              | Available but Numba preferred at batch scale
Streaming (live)  | Python    | 55              | Fallback, environments without Rust
Streaming (live)  | Rust/PyO3 | 54              | Production live trading, sub-50 µs latency

Why Numba Wins Bulk

At production batch sizes (100,000+ data points), Numba JIT compilation operates directly on NumPy memory without any Foreign Function Interface (FFI) boundary crossing. Rust accessed through PyO3 pays a per-call data marshalling cost — converting Python objects to Rust types and back — that becomes the dominant factor at large array sizes. Comprehensive benchmarks (50 iterations, 100K data points) showed Numba averaging 9x faster than Rust/PyO3 for bulk array operations across all 44 indicators.

The finding is counter-intuitive because at small data sizes (3,000 points, typical of a single instrument's feature window) Rust showed 2-12x speedups over Numba. The marshalling cost grows with the size of the array being converted, so the FFI overhead dominates only once the array is large enough that this cost exceeds the per-element computation savings.
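
As an illustration of a JIT-compiled loop reading a NumPy array in place, here is a minimal sketch of a rolling SMA kernel. `sma_bulk` is a hypothetical function written for this example and is not taken from ta-numba's source; the no-op fallback decorator is only there so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without Numba (much slower, same results).
    def njit(func):
        return func


@njit
def sma_bulk(close, window):
    """Rolling simple moving average using a running sum over a float array."""
    n = close.shape[0]
    out = np.full(n, np.nan)      # positions before the first full window stay NaN
    running = 0.0
    for i in range(n):
        running += close[i]       # value entering the window
        if i >= window:
            running -= close[i - window]  # value leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


close = np.arange(1.0, 11.0)      # 1.0, 2.0, ..., 10.0
print(sma_bulk(close, 3))
```

The compiled loop indexes the array buffer directly, so no per-element conversion between Python and native types takes place inside the kernel.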

Why Rust Wins Streaming

For streaming indicators — per-tick updates with O(1) state maintenance — the relationship inverts. Each update involves a single scalar value (the new price tick) crossing the FFI boundary, making marshalling cost negligible. Rust excels at maintaining complex internal state machines: rolling windows, peak tracking, multi-stage smoothing, and iterative accumulation — all without garbage collection pauses.

The average Rust streaming speedup across all 45 classes was 2.6x, with complex indicators reaching significantly higher. Simple indicators where the Python computation is trivial (OBV, DailyReturn) were slower in Rust because the fixed FFI overhead outweighs the minimal computation.

Quadrant Parity Guarantee

All four quadrants produce numerically identical results for the same input data:

  • Numba bulk SMA(close, 20) = Rust bulk SMA(close, 20) = Python streaming SMA after warmup = Rust streaming SMA after warmup
  • Validated across all shared indicators with max tolerance of 1e-10
  • 83 cross-quadrant tests pass in CI
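
A parity check of this kind can be sketched as follows. The two implementations here are stand-ins written for illustration (a NumPy bulk SMA and a pure-Python streaming SMA), not ta-numba's own classes; only the 1e-10 tolerance comes from the text above.

```python
from collections import deque

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sma_bulk(close: np.ndarray, window: int) -> np.ndarray:
    """Bulk SMA: NaN during warmup, then the mean of each trailing window."""
    out = np.full(close.shape[0], np.nan)
    out[window - 1:] = sliding_window_view(close, window).mean(axis=1)
    return out


class StreamingSMA:
    """Streaming SMA that keeps only the last `window` prices."""

    def __init__(self, window: int):
        self.window = window
        self.buf = deque(maxlen=window)

    def update(self, price: float) -> float:
        self.buf.append(price)
        if len(self.buf) < self.window:
            return float("nan")   # still warming up
        return sum(self.buf) / self.window


rng = np.random.default_rng(42)
close = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=5_000))
window = 20

bulk = sma_bulk(close, window)
stream = StreamingSMA(window)
streamed = np.array([stream.update(float(p)) for p in close])

# Compare only after warmup, using the tolerance quoted above.
warmup = window - 1
np.testing.assert_allclose(streamed[warmup:], bulk[warmup:], rtol=0.0, atol=1e-10)
```

Comparing only the post-warmup region matters because the bulk and streaming paths may represent the warmup period differently.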

Rust Streaming Benchmarks

The benchmarks used 10,000 ticks and 10 iterations, comparing the Rust/PyO3 streaming classes against the Numba/Python streaming baselines:

Indicator           | Rust (ms/10K ticks) | Numba (ms/10K ticks) | Speedup
--------------------------------------------------------------------------
Ulcer Index         | 16.9                | 225.2                | 13.3x
Stochastic RSI      | 10.3                | 94.0                 | 9.2x
Awesome Oscillator  | 6.7                 | 54.1                 | 8.1x
Bollinger Bands     | 11.2                | 81.2                 | 7.2x
Ultimate Oscillator | 15.8                | 109.0                | 6.9x
CCI                 | 8.5                 | 52.8                 | 6.2x
TSI                 | 5.0                 | 28.7                 | 5.8x
DPO                 | 5.2                 | 26.9                 | 5.2x
KAMA                | 8.5                 | 42.4                 | 5.0x
SMA                 | 5.1                 | 25.6                 | 5.0x
MACD                | 5.3                 | 20.6                 | 3.9x
RSI                 | 5.0                 | 13.9                 | 2.8x
ATR                 | 8.6                 | 17.9                 | 2.1x
ADX                 | 11.2                | 22.2                 | 2.0x

Complex indicators with internal state machines (Ulcer Index, Stochastic RSI, Bollinger Bands) show the largest speedups because Rust avoids Python object allocation and garbage collection overhead during iterative state updates.
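
The table reports totals per 10,000 ticks; dividing by the tick count gives the mean cost of a single update. The short sketch below does that conversion for a few rows copied from the table (`per_tick_us` is an illustrative helper, not a library function).

```python
TICKS = 10_000

# (Rust ms, Numba ms) per 10,000 ticks, copied from the table above.
results = {
    "Ulcer Index": (16.9, 225.2),
    "Bollinger Bands": (11.2, 81.2),
    "SMA": (5.1, 25.6),
    "ADX": (11.2, 22.2),
}


def per_tick_us(total_ms, ticks=TICKS):
    """Convert a total in milliseconds into microseconds per tick."""
    return total_ms * 1_000.0 / ticks


for name, (rust_ms, numba_ms) in results.items():
    print(
        f"{name:16s} Rust {per_tick_us(rust_ms):5.2f} µs/tick | "
        f"Numba {per_tick_us(numba_ms):5.2f} µs/tick | "
        f"speedup {numba_ms / rust_ms:4.1f}x"
    )
```

The slowest Rust entry (Ulcer Index, 16.9 ms per 10K ticks) works out to roughly 1.7 µs per tick. These are mean values, so they say nothing about tail latency.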


New Generic Indicators (v0.4.0)

Version 0.4.0 added four new generic indicators available across all four quadrants:

Indicator               | Description                           | Use Case
--------------------------------------------------------------------------------------------
volume_ratio            | Volume relative to N-period average   | Volume spike detection
rolling_zscore          | Z-score within rolling window         | Mean-reversion signals
linear_regression_slope | OLS slope over rolling window         | Trend strength measurement
rolling_percentile      | Percentile rank within rolling window | Relative position indicators

These indicators were designed for quantitative feature engineering and are used in Trade-Matrix's production feature pipeline, in build_features.py (bulk) and queue_based_features.py (streaming).
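
For readers who want a concrete picture of the four descriptions, here is one possible NumPy reading of each. These are assumptions based only on the one-line descriptions in the table: the window is taken to include the current bar, the z-score uses the population standard deviation, and a flat window yields a z-score of 0. The function bodies and signatures are illustrative and may differ from ta-numba's.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _windows(x, n):
    """All trailing windows of length n as rows of a 2-D view."""
    return sliding_window_view(np.asarray(x, dtype=float), n)


def _pad(values, n, total):
    """Left-pad with NaN so the output aligns with the input length."""
    out = np.full(total, np.nan)
    out[n - 1:] = values
    return out


def volume_ratio(volume, n):
    w = _windows(volume, n)
    return _pad(w[:, -1] / w.mean(axis=1), n, len(volume))


def rolling_zscore(x, n):
    w = _windows(x, n)
    std = w.std(axis=1)                       # population std (ddof=0)
    safe = np.where(std > 0, std, 1.0)        # avoid dividing by zero
    z = np.where(std > 0, (w[:, -1] - w.mean(axis=1)) / safe, 0.0)
    return _pad(z, n, len(x))


def linear_regression_slope(x, n):
    w = _windows(x, n)
    t = np.arange(n) - (n - 1) / 2.0          # centred time index
    return _pad((w @ t) / (t @ t), n, len(x))  # OLS slope: cov(t, y) / var(t)


def rolling_percentile(x, n):
    w = _windows(x, n)
    rank = (w <= w[:, -1:]).mean(axis=1) * 100.0  # share of window <= current
    return _pad(rank, n, len(x))


x = np.arange(10.0)
print(linear_regression_slope(x, 5)[-1])   # slope of a unit ramp
print(rolling_percentile(x, 5)[-1])        # newest value is the window maximum
```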


Updated Performance Comparison

With v0.4.0's Rust streaming tier, ta-numba compares with the alternative libraries as follows:

Aspect                | TA-Lib                       | ta-numba (Numba bulk) | ta-numba (Rust streaming)        | ta               | pandas
-----------------------------------------------------------------------------------------------------------------------------------------------------
Installation          | C compiler required          | pip install only      | pip install only (wheels)        | pip install only | pip install only
Bulk Performance      | Faster on 7 of 12 indicators | Baseline              | 9x slower than Numba             | 857x slower      | 94x slower
Streaming Performance | No streaming                 | 15.8x vs bulk recalc  | 2.6x faster than Numba streaming | No streaming     | No streaming
Complex Streaming     | N/A                          | Baseline              | Up to 13.3x (Ulcer Index)        | N/A              | N/A
Dependency Issues     | Frequent                     | None                  | None (pre-built wheels)          | None             | Rare
Memory (streaming)    | N/A                          | O(1) constant         | O(1) constant                    | N/A              | N/A

Updated Library Selection Criteria

  • Choose TA-Lib for: Maximum bulk performance, stable environment, C compilation acceptable
  • Choose ta-numba (Numba) for: Reliable bulk deployment, Python-only environments, training/backtesting at 100K+ points
  • Choose ta-numba (Rust streaming) for: Production live trading, state-machine indicators, sub-50 µs per-tick latency requirements
  • Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows, prototyping