High-Performance Technical Analysis with TA-Numba

TA-Numba is a Python library for financial technical analysis with four computation tiers: Numba JIT bulk, Rust/PyO3 bulk, Python streaming, and Rust streaming. Version 0.4.0 achieves full quadrant parity across 49-55 indicators per tier, with Rust streaming delivering up to 13.3x speedups for complex stateful indicators.

TA-Numba: Technical Analysis Library with Numba and Rust Acceleration

TA-Numba is a Python library for financial technical analysis that provides dependency-free installation and high-performance computation through four computation tiers: Numba JIT bulk processing, Rust/PyO3 bulk processing, Python streaming, and Rust/PyO3 streaming. Version 0.4.0 introduced full quadrant parity — all four tiers produce identical results for the same input data, validated by 83 cross-quadrant tests.

Below are the original research-paper benchmarks (v0.2.0, Numba-only), followed by the v0.4.0 updates covering the Rust streaming backend and the four-quadrant architecture.

📊 Performance Comparison

Based on comprehensive benchmarks with 100,000 data points across multiple technical analysis libraries:

Aspect	TA-Lib	ta-numba	ta	pandas	cython
Installation	C compiler required	pip install only	pip install only	pip install only	Compilation required
Average Performance	Faster than ta-numba on 7 of 12 indicators	Reference for the ratios in this row	857x slower	94x slower	2.5x slower
Best Cases	Fastest overall	MACD: 3.8x faster	All cases slower	All cases slower	Mixed results
Worst Cases	WMA, ADX fastest	WMA: 33x slower	PSAR: 8,837x slower	PSAR: 964x slower	Variable performance
Dependency Issues	Frequent	None	None	Rare	Build-time only
Streaming Support	No	Yes (15.8x faster)	No	No	No

⚡ Performance & Benchmarks

📊 Benchmark Methodology

Test Environment:

Data Size: 100,000 price points
Iterations: 3 runs per indicator per library
Hardware: Standard development machine
Libraries: ta-numba, ta-lib, ta, pandas, cython, NautilusTrader

Performance Analysis:

ta-numba delivers substantial performance improvements over pure Python libraries
TA-Lib leads ta-numba in bulk processing on 7 of the 12 benchmarked indicators
ta-numba provides unique advantages in streaming scenarios
Installation reliability varies significantly between libraries

📊 Comprehensive Benchmark Results (100K data points)

Complete Library Comparison:

Performance Comparison (Average Time per Run):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Indicator    | ta           | ta-numba     | ta-lib       | pandas       | cython       | nautilus     | Speedup vs ta | Speedup vs talib | Speedup vs pandas | Speedup vs cython | Speedup vs nautilus
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SMA          | 0.001196s | 0.001082s | 0.000087s    | 0.000713s    | 0.000058s    | 0.105247s    | 1.11x       | 0.08x           | 0.66x            | 0.05x            | 97.29x
EMA          | 0.000577s | 0.000112s | 0.000332s    | 0.000493s    | 0.000168s    | 0.011398s    | 5.16x       | 2.97x           | 4.41x            | 1.50x            | 101.92x
RSI          | 0.002789s | 0.001355s | 0.000433s    | 0.002412s    | 0.001946s    | 0.062416s    | 2.06x       | 0.32x           | 1.78x            | 1.44x            | 46.06x
MACD         | 0.001635s | 0.000642s | 0.002456s    | 0.001860s    | 0.000666s    | 0.012047s    | 2.55x       | 3.83x           | 2.90x            | 1.04x            | 18.77x
ATR          | 0.205986s | 0.000672s | 0.002262s    | 0.008719s    | 0.001687s    | 0.018718s    | 306.60x       | 3.37x           | 12.98x           | 2.51x            | 27.86x
Bollinger Upper | 0.002052s | 0.001432s | 0.000341s    | 0.002129s    | 0.006004s    | 0.214716s    | 1.43x       | 0.24x           | 1.49x            | 4.19x            | 149.92x
OBV          | 0.000685s | 0.000066s | 0.000224s    | N/A          | 0.000275s    | 14.146200s   | 10.43x       | 3.42x           | N/A              | 4.19x            | 215376.26x
MFI          | 0.482099s | 0.002581s | 0.002374s    | 0.003096s    | 0.006168s    | 0.021110s    | 186.77x       | 0.92x           | 1.20x            | 2.39x            | 8.18x
WMA          | 2.456998s | 0.003013s | 0.000092s    | 0.126318s    | 0.002411s    | 0.339517s    | 815.56x       | 0.03x           | 41.93x           | 0.80x            | 112.70x
VWEMA        | 0.000908s | 0.000822s | 0.029710s    | 0.002095s    | 0.004002s    | 0.058675s    | 1.10x       | 36.13x          | 2.55x            | 4.87x            | 71.35x
ADX          | 0.407531s | 0.003533s | 0.000643s    | 0.012459s    | 0.009984s    | 0.002930s    | 115.34x       | 0.18x           | 3.53x            | 2.83x            | 0.83x
PSAR         | 4.123320s | 0.000467s | 0.000346s    | 0.449931s    | 0.001659s    | 0.007989s    | 8837.04x       | 0.74x           | 964.29x          | 3.56x            | 17.12x
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Summary Statistics:
Average speedup vs ta: 857.10x
Average speedup vs ta-lib: 4.35x
Average speedup vs pandas: 94.34x
Average speedup vs cython: 2.45x
Average speedup vs nautilus: 18002.35x
Identical results vs ta: 11/12
Identical results vs ta-lib: 4/12
Identical results vs cython: 5/12
Identical results vs nautilus: 3/12
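The summary averages appear to be arithmetic means of the per-indicator ratios, so a single outlier can dominate them. The sketch below takes the "Speedup vs talib" column from the table above and compares the arithmetic mean with the median and geometric mean. The variable names are illustrative and not part of ta-numba.

```python
import statistics

# "Speedup vs talib" values copied from the table above
# (ta-lib time divided by ta-numba time; above 1.0 means ta-numba is faster).
speedup_vs_talib = {
    "SMA": 0.08, "EMA": 2.97, "RSI": 0.32, "MACD": 3.83,
    "ATR": 3.37, "Bollinger Upper": 0.24, "OBV": 3.42, "MFI": 0.92,
    "WMA": 0.03, "VWEMA": 36.13, "ADX": 0.18, "PSAR": 0.74,
}

ratios = list(speedup_vs_talib.values())
arithmetic_mean = statistics.mean(ratios)
median = statistics.median(ratios)
geometric_mean = statistics.geometric_mean(ratios)
faster = [name for name, ratio in speedup_vs_talib.items() if ratio > 1.0]

print(f"arithmetic mean: {arithmetic_mean:.2f}x")
print(f"median:          {median:.2f}x")
print(f"geometric mean:  {geometric_mean:.2f}x")
print(f"ta-numba faster on {len(faster)}/{len(ratios)} indicators: {faster}")
```

The arithmetic mean reproduces the 4.35x summary figure, but it is pulled up by the single VWEMA ratio, which is why the per-indicator rows are worth reading alongside the average.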

📈 Performance Summary

Benchmark Results Analysis:

vs Pure Python Libraries:

ta library: 857x average speedup (range: 1.1x to 8,837x)
pandas: 94x average speedup (range: 0.66x to 964x)
Consistent performance advantage across most indicators

vs Compiled Libraries:

TA-Lib: 4.35x mean speedup ratio, but the mean is skewed by VWEMA (36.13x); ta-numba is faster on 5 of 12 indicators and slower on 7, with a median ratio of 0.83x
cython: 2.5x average speedup (mixed results depending on indicator)
Performance varies significantly by indicator complexity

Streaming Performance:

15.8x faster than bulk recalculation methods
Constant O(1) memory usage vs. O(n) growth
Microsecond-level latency for real-time applications
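The difference between streaming updates and bulk recalculation can be sketched with a simple moving average. `StreamingSMA` and `bulk_recalc_sma` below are hypothetical names written for illustration and are not ta-numba's API. The streaming version keeps a bounded buffer and a running sum, while the bulk version rebuilds an array from the full history on every tick.

```python
from collections import deque

import numpy as np


class StreamingSMA:
    """Hypothetical O(1)-per-tick simple moving average (not ta-numba's API)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self._values = deque(maxlen=window)  # bounded buffer, independent of history length
        self._sum = 0.0

    def update(self, price: float) -> float:
        # Subtract the value that is about to be evicted from the buffer.
        if len(self._values) == self.window:
            self._sum -= self._values[0]
        self._values.append(price)
        self._sum += price
        if len(self._values) < self.window:
            return float("nan")  # still warming up
        return self._sum / self.window


def bulk_recalc_sma(history: list, window: int) -> float:
    """Naive alternative: convert the whole history to an array on every tick."""
    arr = np.asarray(history, dtype=np.float64)  # O(n) work per tick
    if arr.size < window:
        return float("nan")
    return float(arr[-window:].mean())


rng = np.random.default_rng(0)
prices = 100.0 + rng.standard_normal(500).cumsum()

sma = StreamingSMA(window=20)
history = []
for p in prices:
    history.append(float(p))
    streamed = sma.update(float(p))
    recalculated = bulk_recalc_sma(history, 20)
    if not np.isnan(streamed):
        # Both approaches agree; only the per-tick cost differs.
        assert abs(streamed - recalculated) < 1e-9
```

The streaming state is bounded by the window size, whereas the bulk path does work proportional to the history length on each tick, which matches the growing bulk times in the tick simulation that follows.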

Library Selection Criteria:

Choose TA-Lib for: Maximum performance, stable environment, C compilation acceptable
Choose ta-numba for: Reliable deployment, streaming requirements, Python-only environments
Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows

Real-Time Streaming Performance (per tick):

🚀 REAL-TIME STREAMING COMPARISON
============================================================
Simulating live market data feed with continuous price updates...

📊 Generating 100 warmup ticks...
🔥 Warming up JIT compilation...
📈 Initializing streaming indicators...

🎯 SIMULATING 10,000 LIVE MARKET TICKS...
------------------------------------------------------------
Progress:  10% | Avg Bulk:  0.039ms | Avg Streaming:  0.017ms | Speedup:   2.3x
Progress:  20% | Avg Bulk:  0.103ms | Avg Streaming:  0.018ms | Speedup:   5.8x
Progress:  30% | Avg Bulk:  0.174ms | Avg Streaming:  0.019ms | Speedup:   9.0x
Progress:  40% | Avg Bulk:  0.244ms | Avg Streaming:  0.021ms | Speedup:  11.6x
Progress:  50% | Avg Bulk:  0.313ms | Avg Streaming:  0.023ms | Speedup:  13.5x
Progress:  60% | Avg Bulk:  0.378ms | Avg Streaming:  0.023ms | Speedup:  16.2x
Progress:  70% | Avg Bulk:  0.447ms | Avg Streaming:  0.024ms | Speedup:  18.7x
Progress:  80% | Avg Bulk:  0.516ms | Avg Streaming:  0.024ms | Speedup:  21.7x
Progress:  90% | Avg Bulk:  0.589ms | Avg Streaming:  0.024ms | Speedup:  24.3x
Progress: 100% | Avg Bulk:  0.671ms | Avg Streaming:  0.026ms | Speedup:  26.1x

📊 FINAL RESULTS
============================================================
Total ticks processed: 10,000
Lookback window size: 10000

⏱️  TIMING STATISTICS (per tick):
Method                Mean     Median     95%ile     99%ile
-------------------------------------------------------
Bulk                0.347ms     0.346ms     0.673ms     0.699ms
Streaming           0.022ms     0.022ms     0.028ms     0.039ms

🚀 PERFORMANCE IMPROVEMENT:
Average speedup: 15.8x faster
Median speedup: 15.9x faster

💾 MEMORY USAGE COMPARISON:
Bulk approach: O(n) = 10000 * 8 bytes * 7 indicators = 546.9 KB
Streaming approach: O(1) = ~1 KB total (constant)
Memory efficiency: 547x less memory

⚡ LATENCY ANALYSIS:
Bulk 99th percentile: 0.699ms
Streaming 99th percentile: 0.039ms
For HFT (<1ms requirement): ✅ Bulk passes, ✅ Streaming passes
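The bulk memory figure in the output above can be checked by hand. A short sketch of the arithmetic follows, assuming each stored value is a 64-bit float and that "KB" in the output means 1,024 bytes:

```python
ticks = 10_000        # lookback window size from the output above
bytes_per_value = 8   # one float64 per stored value (assumption)
indicators = 7        # indicator count from the output above

bulk_bytes = ticks * bytes_per_value * indicators
bulk_kib = bulk_bytes / 1024
streaming_kib = 1.0   # the "~1 KB total" estimate from the output above

print(f"{bulk_bytes:,} bytes = {bulk_kib:.1f} KiB")      # 560,000 bytes = 546.9 KiB
print(f"ratio: {round(bulk_kib / streaming_kib)}x")      # ratio: 547x
```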

v0.4.0: Four-Quadrant Architecture

Version 0.4.0 introduced a fundamental architectural shift: ta-numba now operates across four computation tiers (quadrants), each optimized for a specific use case. All four produce identical numerical results, validated by 83 cross-quadrant parity tests.

The Quadrant Model

Quadrant	Backend	Indicator Count	Optimal For
Bulk (historical)	Numba JIT	49	Training, backtesting, feature stores (100K+ points)
Bulk (historical)	Rust/PyO3	50	Available but Numba preferred at batch scale
Streaming (live)	Python	55	Fallback, environments without Rust
Streaming (live)	Rust/PyO3	54	Production live trading, sub-50 µs latency

Why Numba Wins Bulk

At production batch sizes (100,000+ data points), Numba JIT compilation operates directly on NumPy memory without any Foreign Function Interface (FFI) boundary crossing. Rust accessed through PyO3 pays a per-call data marshalling cost — converting Python objects to Rust types and back — that becomes the dominant factor at large array sizes. Comprehensive benchmarks (50 iterations, 100K data points) showed Numba averaging 9x faster than Rust/PyO3 for bulk array operations across all 44 indicators.

This was a counter-intuitive finding: at small data sizes (3,000 points, typical of a single instrument's feature window), Rust showed 2-12x speedups over Numba. The FFI overhead only dominates when the array is large enough that the per-call marshalling cost exceeds the per-element computation savings.
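To make the bulk path concrete, here is a minimal sketch of a Numba-compiled rolling mean that loops directly over a NumPy array. The function name `rolling_sma` is hypothetical and this is not ta-numba's implementation. The `try`/`except` fallback is only there so the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback: a no-op decorator so the sketch runs as plain Python.
    def njit(func):
        return func


@njit
def rolling_sma(close, window):
    """Single-pass rolling mean over a float64 array."""
    n = close.shape[0]
    out = np.full(n, np.nan)
    running = 0.0
    for i in range(n):
        running += close[i]
        if i >= window:
            running -= close[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


rng = np.random.default_rng(42)
close = 100.0 + rng.standard_normal(100_000).cumsum()

sma_20 = rolling_sma(close, 20)  # first call includes JIT compilation time
print(sma_20[-1])
```

The compiled loop reads and writes the array buffers in place, which is the property the paragraph above credits for Numba's advantage at large batch sizes. When timing such a function, the first call should be excluded because it includes compilation.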

Why Rust Wins Streaming

For streaming indicators — per-tick updates with O(1) state maintenance — the relationship inverts. Each update involves a single scalar value (the new price tick) crossing the FFI boundary, making marshalling cost negligible. Rust excels at maintaining complex internal state machines: rolling windows, peak tracking, multi-stage smoothing, and iterative accumulation — all without garbage collection pauses.

The average Rust streaming speedup across all 45 classes was 2.6x, with complex indicators reaching up to 13.3x (Ulcer Index). Simple indicators where the Python computation is trivial (OBV, DailyReturn) ran slower in Rust because the fixed FFI overhead dominates the minimal computation.
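The kind of per-tick state described above (a rolling window, peak tracking, iterative accumulation) can be sketched in Python with an Ulcer Index. The class below is hypothetical and uses a common formulation: the percentage drawdown from the rolling maximum close, then the root mean square over the window. Whether ta-numba uses exactly this formulation and warm-up behaviour is an assumption.

```python
import math
from collections import deque


class StreamingUlcerIndex:
    """Hypothetical streaming Ulcer Index (not ta-numba's actual class)."""

    def __init__(self, window: int = 14) -> None:
        self.window = window
        self._closes = deque(maxlen=window)        # rolling window for peak tracking
        self._sq_drawdowns = deque(maxlen=window)  # rolling squared drawdowns
        self._sq_sum = 0.0                         # running sum of squared drawdowns

    def update(self, close: float) -> float:
        self._closes.append(close)
        peak = max(self._closes)  # scans at most `window` values
        drawdown_pct = 100.0 * (close - peak) / peak
        sq = drawdown_pct * drawdown_pct

        # Keep the running sum in step with the bounded buffer.
        if len(self._sq_drawdowns) == self.window:
            self._sq_sum -= self._sq_drawdowns[0]
        self._sq_drawdowns.append(sq)
        self._sq_sum += sq

        if len(self._sq_drawdowns) < self.window:
            return float("nan")  # still warming up
        # Guard against tiny negative values from floating-point drift.
        return math.sqrt(max(self._sq_sum, 0.0) / self.window)


ui = StreamingUlcerIndex(window=3)
for price in [100.0, 90.0, 100.0, 95.0]:
    print(price, ui.update(price))
```

Every tick touches two buffers, a running sum, and several temporary floats. That per-update bookkeeping is the work the text attributes to Python object allocation overhead, and it is why stateful indicators like this one benefit most from a Rust implementation.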

Quadrant Parity Guarantee

All four quadrants produce numerically identical results for the same input data:

Numba bulk SMA(close, 20) = Rust bulk SMA(close, 20) = Python streaming SMA after warmup = Rust streaming SMA after warmup
Validated across all shared indicators with max tolerance of 1e-10
83 cross-quadrant tests pass in CI
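A parity check of this kind can be sketched as follows. The example uses an EMA with illustrative `bulk_ema` and `StreamingEMA` implementations, seeded with the first value. These are hypothetical stand-ins, not ta-numba's functions or its actual test suite. Only the 1e-10 tolerance comes from the text above.

```python
import numpy as np


def bulk_ema(close: np.ndarray, span: int) -> np.ndarray:
    """Reference bulk EMA seeded with the first value (illustrative convention)."""
    alpha = 2.0 / (span + 1.0)
    out = np.empty_like(close, dtype=np.float64)
    out[0] = close[0]
    for i in range(1, close.shape[0]):
        out[i] = alpha * close[i] + (1.0 - alpha) * out[i - 1]
    return out


class StreamingEMA:
    """Hypothetical per-tick EMA holding a single float of state."""

    def __init__(self, span: int) -> None:
        self.alpha = 2.0 / (span + 1.0)
        self.value = None

    def update(self, price: float) -> float:
        if self.value is None:
            self.value = price
        else:
            self.value = self.alpha * price + (1.0 - self.alpha) * self.value
        return self.value


def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Largest absolute difference over positions where both values are finite."""
    mask = np.isfinite(a) & np.isfinite(b)
    return float(np.max(np.abs(a[mask] - b[mask])))


rng = np.random.default_rng(7)
close = 100.0 + rng.standard_normal(5_000).cumsum()

bulk = bulk_ema(close, span=20)
ema = StreamingEMA(span=20)
streamed = np.array([ema.update(float(p)) for p in close])

TOLERANCE = 1e-10  # max tolerance quoted above
assert max_abs_diff(bulk, streamed) <= TOLERANCE
```

In a real cross-quadrant test the two sides would be the library's own bulk and streaming implementations, with the comparison restricted to values after warm-up.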

Rust Streaming Benchmarks

Benchmarked with 10,000 ticks, 10 iterations, comparing Rust/PyO3 streaming classes against Numba/Python streaming baselines:

Indicator	Rust (ms/10K ticks)	Numba (ms/10K ticks)	Speedup
Ulcer Index	16.9	225.2	13.3x
Stochastic RSI	10.3	94.0	9.2x
Awesome Oscillator	6.7	54.1	8.1x
Bollinger Bands	11.2	81.2	7.2x
Ultimate Oscillator	15.8	109.0	6.9x
CCI	8.5	52.8	6.2x
TSI	5.0	28.7	5.8x
DPO	5.2	26.9	5.2x
KAMA	8.5	42.4	5.0x
SMA	5.1	25.6	5.0x
MACD	5.3	20.6	3.9x
RSI	5.0	13.9	2.8x
ATR	8.6	17.9	2.1x
ADX	11.2	22.2	2.0x

Complex indicators with internal state machines (Ulcer Index, Stochastic RSI, Bollinger Bands) show the largest speedups because Rust avoids Python object allocation and garbage collection overhead during iterative state updates.

New Generic Indicators (v0.4.0)

Version 0.4.0 added four new generic indicators available across all four quadrants:

Indicator	Description	Use Case
`volume_ratio`	Volume relative to N-period average	Volume spike detection
`rolling_zscore`	Z-score within rolling window	Mean-reversion signals
`linear_regression_slope`	OLS slope over rolling window	Trend strength measurement
`rolling_percentile`	Percentile rank within rolling window	Relative position indicators

These indicators were designed for quantitative feature engineering and are used in Trade-Matrix's production feature pipeline: build_features.py (bulk) and queue_based_features.py (streaming).
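The table's one-line descriptions can be read as the NumPy sketch below. The function names mirror the table, but the signatures and conventions are assumptions: population standard deviation, the current bar included in the window, percentile rank as the share of window values at or below the current value, and NaN padding during warm-up. ta-numba's actual definitions may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _pad(values: np.ndarray, window: int) -> np.ndarray:
    """Left-pad with NaN so the output has the same length as the input."""
    return np.concatenate([np.full(window - 1, np.nan), values])


def volume_ratio(volume: np.ndarray, window: int) -> np.ndarray:
    """Current volume divided by the mean volume of the trailing window."""
    windows = sliding_window_view(volume, window)
    return _pad(volume[window - 1:] / windows.mean(axis=1), window)


def rolling_zscore(x: np.ndarray, window: int) -> np.ndarray:
    """(current - window mean) / window std, using population std (ddof=0)."""
    windows = sliding_window_view(x, window)
    mean = windows.mean(axis=1)
    std = windows.std(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (x[window - 1:] - mean) / std  # constant windows give NaN or inf
    return _pad(z, window)


def linear_regression_slope(x: np.ndarray, window: int) -> np.ndarray:
    """OLS slope of the window values against 0..window-1 (needs window >= 2)."""
    windows = sliding_window_view(x, window)
    t = np.arange(window, dtype=np.float64)
    t_centered = t - t.mean()
    y_centered = windows - windows.mean(axis=1, keepdims=True)
    slope = (y_centered @ t_centered) / (t_centered @ t_centered)
    return _pad(slope, window)


def rolling_percentile(x: np.ndarray, window: int) -> np.ndarray:
    """Percentage of window values less than or equal to the current value."""
    windows = sliding_window_view(x, window)
    current = x[window - 1:, None]
    return _pad((windows <= current).mean(axis=1) * 100.0, window)


rng = np.random.default_rng(1)
close = 100.0 + rng.standard_normal(250).cumsum()
volume = rng.integers(1_000, 5_000, size=250).astype(np.float64)

features = np.column_stack([
    volume_ratio(volume, 20),
    rolling_zscore(close, 20),
    linear_regression_slope(close, 20),
    rolling_percentile(close, 20),
])
print(features.shape)  # (250, 4)
```

Stacking the four outputs column-wise gives one feature row per bar, which is the typical shape for a bulk feature pipeline.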

Updated Performance Comparison

With v0.4.0's Rust streaming tier, ta-numba compares with the alternative libraries as follows:

Aspect	TA-Lib	ta-numba (Numba bulk)	ta-numba (Rust streaming)	ta	pandas
Installation	C compiler required	pip install only	pip install only (wheels)	pip install only	pip install only
Bulk Performance	Faster than ta-numba on 7 of 12 indicators	Reference for the ratios in this row	Rust bulk is 9x slower than Numba	857x slower	94x slower
Streaming Performance	No streaming	15.8x vs bulk recalc	2.6x faster than Numba streaming	No streaming	No streaming
Complex Streaming	N/A	Baseline	Up to 13.3x (Ulcer Index)	N/A	N/A
Dependency Issues	Frequent	None	None (pre-built wheels)	None	Rare
Memory (streaming)	N/A	O(1) constant	O(1) constant	N/A	N/A

Updated Library Selection Criteria

Choose TA-Lib for: Maximum bulk performance, stable environment, C compilation acceptable
Choose ta-numba (Numba) for: Reliable bulk deployment, Python-only environments, training/backtesting at 100K+ points
Choose ta-numba (Rust streaming) for: Production live trading, state-machine indicators, sub-50 µs per-tick latency requirements
Choose ta/pandas for: Simplicity, small datasets, existing pandas workflows, prototyping