Trade-Matrix System Architecture - Institutional-Grade Trading Platform

Metric	Value	Details	Tag
CI/CD Cost	$0/mo	CI/CD automation (GitHub Actions PRO free tier). Total infra: ~$96/mo including Azure VM.	CI/CD Free Tier
ML Inference Latency	<5ms	CPU-only sklearn inference (RF/XGBoost)	CPU-Only
Deployment Time	8min	Weekly model updates with zero downtime	CI/CD Optimized
Orchestration	K3S	Kubernetes orchestration with liveness/readiness probes and rolling updates	Self-Hosted
Monitoring Metrics	413+	71 base metric families × label cardinality (instrument, strategy, status)	Full Visibility
Model Training Time	65min	Transfer Learning with Walk-Forward Validation	Weekly Updates

⚖️ Technical Design Decisions

Design Philosophy: Trade-Matrix prioritizes architectural correctness and research rigor over scale. Every design decision is justified by quantitative research standards, not marketing claims.

Decision	Trade-Matrix Choice	Rationale
CI/CD Automation	GitHub Actions (PRO free tier)	$0/mo CI/CD cost within PRO limits (300/3,000 mins, 1.3/100GB bandwidth). Total infra ~$96/mo including Azure VM + storage.
ML Inference	<5ms (CPU-only, sklearn RF/XGBoost)	Lightweight models enable sub-5ms inference without GPU. Suitable for 4H bar frequency (6 inferences/day per instrument).
Model Updates	Weekly automated (Transfer Learning)	Balances adaptation speed with stability for mid-frequency crypto trading. Preserves old model knowledge via frozen trees.
Risk Management	4-Tier Fallback + Circuit Breaker	Graceful degradation through four tiers: FULL_RL → BLENDED → PURE_KELLY → EMERGENCY_FLAT. A drawdown above 5% opens the circuit breaker and triggers automatic position flattening.
Position Sizing	RL-based with regime-adaptive Kelly baseline	Adapts to 4 market regimes (Bear: 25%, Neutral: 50%, Bull: 67%, Crisis: 17% Kelly fraction).
Feature Selection	Boruta Selection (9-13 features per instrument)	Automated wrapper method selects all statistically significant features. Reduces overfitting vs. hand-picked feature sets.
Walk-Forward Validation	200-bar purge gap (40 weekly windows)	Exceeds López de Prado's recommendation (h ≈ 0.01T ≈ 70 bars). Prevents data leakage between train/test folds.
Deployment	GHCR-only (6.24GB base + 319MB models)	Split architecture enables weekly model updates (319MB) without re-deploying base (6.24GB). Zero-downtime rolling updates.
Observability	413+ Prometheus time series + Loki logs	71 base metric families × label cardinality. 4 Grafana dashboards (Trading Cockpit, Market Analysis, Institutional Analytics, Infrastructure).
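
The walk-forward row above can be illustrated with a small index splitter. This is a minimal sketch, not the project's actual trainer: the function names, the expanding training window, and the 42-bar test window (one week of 4H bars) are assumptions made for illustration. Only the 40 windows, the 200-bar purge gap, and the h ≈ 0.01T rule come from the table.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_bars: int,
    n_windows: int = 40,
    test_size: int = 42,   # assumption: one week of 4H bars (7 days x 6 bars)
    purge_gap: int = 200,  # bars dropped between the end of train and the start of test
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges, oldest window first.

    Training uses an expanding window that stops `purge_gap` bars
    before the test window begins, so no training label overlaps the test fold.
    """
    first_test_start = n_bars - n_windows * test_size
    if first_test_start - purge_gap <= 0:
        raise ValueError("not enough bars for the requested windows and purge gap")
    for k in range(n_windows):
        test_start = first_test_start + k * test_size
        train_end = test_start - purge_gap
        yield range(0, train_end), range(test_start, test_start + test_size)


def recommended_purge(n_bars: int, fraction: float = 0.01) -> int:
    """Rule-of-thumb purge length h = fraction * T."""
    return round(fraction * n_bars)


# With roughly 7,000 bars of history, the rule of thumb gives 70 bars,
# so a 200-bar gap is the more conservative choice.
splits = list(walk_forward_splits(n_bars=7000))
```

Each yielded pair can be used to slice a feature matrix by position; the purge gap is simply never assigned to either side.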

Current Constraints: Trade-Matrix operates on a single Azure B2als_v2 VM (2 vCPU, 4GB RAM, ~$36/mo). It runs CPU-only inference at 4H bar frequency on 3 instruments (BTC, ETH, SOL) and a single exchange (Bybit). The architecture supports scaling to GPU inference, tick data, and multiple exchanges, but the current scope is intentionally constrained for R&D validation.

🏗️ System Architecture

High-Level System Components

📊

Data Ingestion Layer

Bybit Exchange Integration (live trading)
Deribit DVOL (volatility data)
3+ years historical OHLCV (2022-2025)
4-hour bar resolution for institutional-grade analysis

🤖

ML/RL Intelligence

Transfer Learning Models (BTC, ETH, SOL)
4-Tier RL Position Sizing
Regime Detection (4-state HMM)
Weekly automated updates with validation gates

🛡️

Risk Management

HRAA v2 Algorithm (hierarchical allocation)
Circuit Breaker (3-state FSM)
Position Limits (per-instrument)
Kelly Criterion Baseline (regime-adaptive)

⚡

Event-Driven Core

NautilusTrader Framework
MessageBus Architecture (sub-ms routing)
Real-time Order Management
Portfolio tracking with tick-level precision

💾

Data Storage

PostgreSQL 15 + TimescaleDB 2.14
Redis 7.2 (feature cache, pub/sub)
MinIO (ML artifacts, models)
MLflow (model registry, experiments)

📈

Monitoring Stack

Prometheus 2.48 (413+ time series)
Grafana 10.2 (real-time dashboards)
Loki 2.9 (log aggregation)
30-day retention for forensic analysis

graph TB
    subgraph "Data Sources"
        BYBIT[Bybit Exchange<br/>Live Trading]
        DERIBIT[Deribit<br/>DVOL Volatility]
        HISTORICAL[Historical Data<br/>2022-2025 OHLCV]
    end
    subgraph "Trade-Matrix Core Platform"
        subgraph "Intelligence Layer"
            ML[Transfer Learning Models<br/>BTC/ETH/SOL]
            RL[RL Position Sizing<br/>4-Tier Fallback]
            REGIME[Regime Detection<br/>4-State HMM]
        end
        subgraph "Trading Engine"
            MSGBUS[MessageBus<br/>Event Router]
            RISK[Risk Engine<br/>HRAA v2 + Circuit Breaker]
            EXEC[Execution Engine<br/>Order Management]
            PORTFOLIO[Portfolio Engine<br/>Position Tracking]
        end
        subgraph "Data Layer"
            REDIS[(Redis 7.2<br/>Cache & Pub/Sub)]
            POSTGRES[(PostgreSQL + TimescaleDB<br/>Time Series Data)]
            MINIO[(MinIO<br/>ML Artifacts)]
            MLFLOW[(MLflow<br/>Model Registry)]
        end
    end
    subgraph "Infrastructure"
        K3S[K3S Cluster<br/>Production Orchestration]
        GITHUB[GitHub Actions<br/>CI/CD Pipeline]
        PROMETHEUS[Prometheus + Grafana<br/>Monitoring Stack]
    end
    BYBIT -->|WebSocket| MSGBUS
    DERIBIT -->|API| ML
    HISTORICAL -->|Batch| ML
    MSGBUS --> ML
    ML --> RL
    RL --> RISK
    RISK --> EXEC
    EXEC --> PORTFOLIO
    MSGBUS <--> REDIS
    ML <--> MLFLOW
    PORTFOLIO --> POSTGRES
    MLFLOW <--> MINIO
    EXEC -->|Orders| BYBIT
    GITHUB -->|Deploy| K3S
    K3S -->|Runs| MSGBUS
    PROMETHEUS -->|Monitor| K3S
    style ML fill:#00d4ff,stroke:#000,stroke-width:2px,color:#000
    style RL fill:#00ff88,stroke:#000,stroke-width:2px,color:#000
    style RISK fill:#ffd93d,stroke:#000,stroke-width:2px,color:#000
    style MSGBUS fill:#ff6b6b,stroke:#000,stroke-width:3px,color:#fff

Complete System Architecture

Real-time Data Flow

Batch/Historical Flow

Configuration/Control

Monitoring/Metrics

graph TB
    subgraph ExternalData["External Data Sources"]
        BYBIT_EX[Bybit Exchange - WebSocket]
        DERIBIT_EX[Deribit Exchange - DVOL]
        HISTORICAL_S3[Historical Data - MinIO]
    end
    subgraph NautilusTrader["NautilusTrader Core"]
        MSGBUS[MessageBus]
        DATAENGINE[DataEngine]
        RISKENGINE[RiskEngine - HRAA v2]
        EXECENGINE[ExecutionEngine]
        PORTFOLIO_ENG[PortfolioEngine]
        CACHE[Cache]
        CATALOG[DataCatalog]
        STRATEGIES[ML Strategies]
    end
    subgraph MLServices["ML Services"]
        ML_INFERENCE[Signal Generator]
        RL_AGENT[RL Position Sizer]
        REGIME_DETECT[Regime Detector]
        TL_TRAINER[TL Model Trainer]
        RL_TRAINER[RL Agent Trainer]
        FEATURE_ENG[Feature Engineer]
    end
    subgraph StorageLayer["Storage Layer"]
        REDIS[(Redis)]
        POSTGRES[(PostgreSQL)]
        MINIO[(MinIO)]
        MLFLOW_DB[(MLflow)]
    end
    subgraph Monitoring["Monitoring"]
        PROMETHEUS[Prometheus]
        GRAFANA[Grafana]
        LOKI[Loki]
    end
    subgraph Deployment["Deployment"]
        K3S[K3S Cluster]
        GHCR[GitHub Registry]
        GITHUB_ACTIONS[GitHub Actions]
    end
    BYBIT_EX --> DATAENGINE
    DATAENGINE --> MSGBUS
    MSGBUS --> CACHE
    CACHE --> ML_INFERENCE
    ML_INFERENCE --> RL_AGENT
    RL_AGENT --> STRATEGIES
    STRATEGIES --> MSGBUS
    MSGBUS --> RISKENGINE
    RISKENGINE --> EXECENGINE
    EXECENGINE --> BYBIT_EX
    BYBIT_EX --> PORTFOLIO_ENG
    HISTORICAL_S3 -.-> FEATURE_ENG
    DERIBIT_EX -.-> FEATURE_ENG
    FEATURE_ENG -.-> TL_TRAINER
    TL_TRAINER -.-> MLFLOW_DB
    MLFLOW_DB -.-> ML_INFERENCE
    FEATURE_ENG -.-> RL_TRAINER
    RL_TRAINER -.-> MLFLOW_DB
    MLFLOW_DB -.-> RL_AGENT
    CACHE -.-> REDIS
    PORTFOLIO_ENG -.-> POSTGRES
    CATALOG -.-> MINIO
    TL_TRAINER -.-> MINIO
    MSGBUS -.-> PROMETHEUS
    RISKENGINE -.-> PROMETHEUS
    ML_INFERENCE -.-> PROMETHEUS
    PROMETHEUS -.-> GRAFANA
    K3S -.-> LOKI
    GITHUB_ACTIONS --> GHCR
    GHCR --> K3S
    K3S --> MSGBUS
    style MSGBUS fill:#ff6b6b,stroke:#000,stroke-width:3px,color:#fff
    style ML_INFERENCE fill:#00d4ff,stroke:#000,stroke-width:2px,color:#000
    style RL_AGENT fill:#00ff88,stroke:#000,stroke-width:2px,color:#000
    style RISKENGINE fill:#ffd93d,stroke:#000,stroke-width:2px,color:#000

Component Details & Specifications

NautilusTrader Core Components

MessageBus: Event-driven routing between components. Pub/sub pattern for decoupled component communication with zero message loss guarantees.
DataEngine: Normalizes market data from multiple sources into a unified format. Currently processes 4H bars from Bybit and DVOL volatility data from Deribit.
RiskEngine: Implements HRAA v2 with per-instrument position limits, portfolio-level constraints, and circuit breaker integration.
ExecutionEngine: Order lifecycle management with fill tracking and reconciliation. Manages order submission, execution monitoring, and position updates.
PortfolioEngine: Real-time position tracking with mark-to-market PnL updates. Calculates Sharpe ratio, maximum drawdown, and other performance metrics on-the-fly.
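
The circuit breaker that the RiskEngine integrates with can be pictured as a small state machine driven by drawdown from peak equity. The sketch below is illustrative only: the page names the CLOSED and OPEN states and the 5% drawdown trigger, while the HALF_OPEN state, the 2% recovery threshold, and the class and method names are assumptions.

```python
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"        # trading allowed
    OPEN = "open"            # positions flattened, no new risk taken
    HALF_OPEN = "half_open"  # assumption: probation state before full recovery


class CircuitBreaker:
    def __init__(self, max_drawdown: float = 0.05, recovery_drawdown: float = 0.02):
        self.max_drawdown = max_drawdown            # 5% trigger from the design table
        self.recovery_drawdown = recovery_drawdown  # assumption, not from the source
        self.state = BreakerState.CLOSED
        self.peak_equity = 0.0

    def update(self, equity: float) -> BreakerState:
        """Feed the latest mark-to-market equity and return the new state."""
        self.peak_equity = max(self.peak_equity, equity)
        drawdown = 1.0 - equity / self.peak_equity if self.peak_equity > 0 else 0.0
        if drawdown > self.max_drawdown:
            self.state = BreakerState.OPEN
        elif self.state is BreakerState.OPEN and drawdown <= self.recovery_drawdown:
            self.state = BreakerState.HALF_OPEN
        elif self.state is BreakerState.HALF_OPEN and drawdown == 0.0:
            # Equity is back at its peak, so normal trading resumes.
            self.state = BreakerState.CLOSED
        return self.state
```

A position sizer would read `state` before every order and return a zero target while the breaker is OPEN.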

ML/RL Services

Unified Signal Generator: Ensemble of 3 TL models (BTC, ETH, SOL) with 4-tier resilient loading. Sub-5ms inference via feature caching and optimized sklearn pipelines.
RL Position Sizer: Reinforcement Learning agent trained via curriculum learning. 4-tier fallback: FULL_RL → BLENDED (50/50 with Kelly) → PURE_KELLY → EMERGENCY_FLAT (0% on circuit breaker OPEN).
Regime Detector: 4-state Hidden Markov Model with Markov-Switching GARCH. Classifies market as Bear/Neutral/Bull/Crisis. Kelly fractions: 25%/50%/67%/17% respectively.
TL Model Trainer: Automated weekly training pipeline with Walk-Forward Validation (40 windows, 200-bar purge gap). Boruta feature selection locks 9-13 features per instrument to prevent overfitting.
RL Agent Trainer: Soft Actor-Critic (SAC) with curriculum learning. Trains in 45 minutes (vs 120 minutes without curriculum). Environment: Bybit 4H bars, transaction cost model, slippage simulation.
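
The 4-tier fallback and the regime-adaptive Kelly baseline described above can be combined in a few lines. This is a hedged sketch rather than the production sizer: the tier names, the 50/50 blend, and the Kelly fractions come from this page, and the upper thresholds (confidence 0.50, IC 0.05) are one reading of the labels in the real-time workflow diagram. The lower thresholds and all function names are hypothetical.

```python
from enum import Enum


class Regime(Enum):
    BEAR = "bear"
    NEUTRAL = "neutral"
    BULL = "bull"
    CRISIS = "crisis"


# Fraction of the full Kelly bet applied in each regime, as listed above.
KELLY_FRACTION = {
    Regime.BEAR: 0.25,
    Regime.NEUTRAL: 0.50,
    Regime.BULL: 0.67,
    Regime.CRISIS: 0.17,
}


def select_tier(
    confidence: float,
    ic: float,
    breaker_open: bool,
    conf_hi: float = 0.50,  # reading of "High Confidence 50+"
    ic_hi: float = 0.05,    # reading of "High IC 05+"
    conf_lo: float = 0.30,  # hypothetical lower bound for the blended tier
    ic_lo: float = 0.02,    # hypothetical lower bound for the blended tier
) -> str:
    if breaker_open:
        return "EMERGENCY_FLAT"
    if confidence >= conf_hi and ic >= ic_hi:
        return "FULL_RL"
    if confidence >= conf_lo and ic >= ic_lo:
        return "BLENDED"
    return "PURE_KELLY"


def position_size(tier: str, rl_size: float, full_kelly: float, regime: Regime) -> float:
    """Return the target position as a fraction of capital."""
    kelly_size = KELLY_FRACTION[regime] * full_kelly
    if tier == "EMERGENCY_FLAT":
        return 0.0
    if tier == "FULL_RL":
        return rl_size
    if tier == "BLENDED":
        return 0.5 * rl_size + 0.5 * kelly_size
    return kelly_size
```

Keeping tier selection separate from sizing means the circuit breaker check always wins, regardless of how confident the model is.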

Storage Systems

Redis 7.2: Feature cache (TTL-based), pub/sub for ML signals, session persistence.
PostgreSQL 15 + TimescaleDB 2.14: Time-series storage for OHLCV bars, ML predictions, portfolio snapshots. Hypertable compression enabled.
MinIO: S3-compatible object store for ML models (200-500MB per model), training datasets, and backtest results. Organized by instrument and version.
MLflow: Model registry with lifecycle management (Staging → Production), experiment tracking and artifact versioning. Tag-based promotion workflow.
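
The TTL-based feature cache can be illustrated with an in-memory stand-in. This sketch does not talk to Redis; it only mimics expiry semantics so the idea is easy to test. The class name, key format, and the 4-hour TTL in the usage lines are assumptions. In Redis itself, the comparable behavior comes from setting a key with an expiry (for example `SET key value EX seconds`).

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLFeatureCache:
    """In-memory stand-in for a TTL-based feature cache."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are evicted lazily on read.
            del self._store[key]
            return None
        return value


# Hypothetical usage: cache one bar's features until the next 4H bar closes.
cache = TTLFeatureCache(ttl_seconds=4 * 60 * 60)
cache.set("features:BTC:4h", {"rsi_14": 55.2})
```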

Monitoring Stack

Prometheus 2.48: Collects 413+ time series metrics (71 base families × instrument/strategy/status labels). Retention: 30 days. Scrape interval: 15 seconds.
Grafana 10.2: 4 specialized dashboards (Trading Cockpit, Market Analysis, Institutional Analytics, Infrastructure). Auto-refresh: 5 seconds.
Loki 2.9: Log aggregation with 30-day retention. Indexes: service, level, instrument, strategy. Query performance: <1s for 10M log lines via LogQL.
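
The relationship between metric families and time series can be made concrete: each family contributes one series per observed combination of label values, so the total is bounded by the sum of the per-family products of label cardinalities. The sketch below shows that arithmetic. The family names and cardinalities are hypothetical, not the real metric catalog.

```python
from math import prod
from typing import Mapping


def series_count(families: Mapping[str, Mapping[str, int]]) -> int:
    """Upper bound on time series: sum over families of the product of label cardinalities.

    A family with no labels contributes exactly one series.
    """
    return sum(prod(labels.values()) for labels in families.values())


# Hypothetical families for illustration only.
example_families = {
    "orders_total": {"instrument": 3, "status": 4},  # 3 x 4 = 12 series
    "signal_confidence": {"instrument": 3},          # 3 series
    "process_uptime_seconds": {},                    # 1 series
}
total = series_count(example_families)  # 12 + 3 + 1 = 16
```

The same calculation explains how 71 families can expand into several hundred series once instrument, strategy, and status labels are attached.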

Hybrid Deployment Architecture

Cost Optimization: Trade-Matrix keeps CI/CD automation costs at $0/month by staying within the GitHub PRO free tier. Total infrastructure cost is approximately $96/month (Azure VM ~$36, storage ~$10, electricity/internet ~$50).

📦

GitHub Container Registry (GHCR)

Base Image: 6.24GB (Python 3.12, dependencies, vendored NautilusTrader)
Model Layer: 319MB (TL models, RL policies, feature configs)
Total Size: ~6.5GB combined (6.24GB base + 319MB models)
Update Frequency: Weekly models, monthly base
Bandwidth: 1.3GB/month (within PRO 100GB/month limit)

⚙️

GitHub Actions CI/CD

Weekly Pipeline: 73 minutes (training + deployment)
Compute Minutes: ~300/month (within PRO 3,000 limit)
Automation: 15-step validation pipeline
Zero Human Intervention

☸️

K3S Production Cluster

Orchestration: Lightweight Kubernetes (K3S 1.28)
Auto-scaling: Horizontal pod autoscaling
Health Checks: Liveness + readiness probes
Zero-Downtime: Rolling updates (max surge 1)

⚡

Azure VMSS Ephemeral Workers

Instance Type: Standard_D2s_v3 (8GB RAM, 2 vCPU)
Scaling: 0 → 1 for 15-25 min seeding tasks
Annual Savings: $418/year vs always-on instance
Use Case: Signal history pre-calculation

sequenceDiagram
    participant DEV as Developer/PM
    participant GITHUB as GitHub Actions
    participant GHCR as GitHub Container Registry
    participant K3S as K3S Production Cluster
    participant TRADE as Trading System
    Note over DEV,TRADE: Weekly Model Update Workflow (Every Sunday)
    DEV->>GITHUB: git push (trigger weekly pipeline)
    rect rgb(0, 50, 100)
        Note over GITHUB: Phase 1: Training (65 min)
        GITHUB->>GITHUB: Fetch data from Bybit
        GITHUB->>GITHUB: Feature engineering (Boruta)
        GITHUB->>GITHUB: Train TL models (3 instruments)
        GITHUB->>GITHUB: Train RL agents (curriculum)
        GITHUB->>GITHUB: Validate (IC at least 0.03, Sharpe over 0.5)
    end
    rect rgb(0, 100, 50)
        Note over GITHUB: Phase 2: Package Models (3 min)
        GITHUB->>GITHUB: Export MLflow artifacts
        GITHUB->>GITHUB: Build combined container (6.54GB)
        GITHUB->>GHCR: Push to GHCR (within free tier)
    end
    rect rgb(100, 50, 0)
        Note over K3S: Phase 3: Deployment (5 min)
        K3S->>GHCR: Pull new image (6.54GB, cached layers)
        K3S->>K3S: Rolling update (zero downtime)
        K3S->>TRADE: Deploy new trading pods
        TRADE->>TRADE: Health checks pass
        K3S->>TRADE: Route traffic to new pods
        K3S->>K3S: Terminate old pods
    end
    rect rgb(80, 0, 80)
        Note over K3S: Phase 3.5: Signal History Seeding (15-25 min, as needed)
        K3S->>K3S: Scale Azure VMSS 0→1 (Standard_D2s_v3)
        K3S->>TRADE: Run signal pre-calculation (200 bars)
        TRADE->>TRADE: Seed PostgreSQL with historical signals
        K3S->>K3S: Scale VMSS 1→0 (terminate worker)
    end
    TRADE-->>DEV: Deployment complete notification
    DEV->>K3S: Verify metrics (Grafana)
    Note over DEV,TRADE: Total Time: ~73 minutes | CI/CD Cost: $0 (GitHub Actions PRO)
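
The validation step at the end of Phase 1 (IC at least 0.03, Sharpe over 0.5) amounts to a promotion gate. A minimal sketch follows, assuming the gate is applied per instrument and that Sharpe is annualized over 4H bars (6 bars × 365 days = 2,190 bars per year). Both are assumptions, as are all names in the code.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class ValidationReport:
    ic: float      # information coefficient on out-of-sample folds
    sharpe: float  # annualized Sharpe ratio of the validated strategy


def annualized_sharpe(returns: Sequence[float], bars_per_year: int = 2190) -> float:
    """Mean over standard deviation of per-bar returns, scaled to a yearly horizon."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(bars_per_year)


def passes_gate(report: ValidationReport, min_ic: float = 0.03, min_sharpe: float = 0.5) -> bool:
    # "IC at least 0.03" is inclusive; "Sharpe over 0.5" is strict.
    return report.ic >= min_ic and report.sharpe > min_sharpe


def should_promote(reports: Mapping[str, ValidationReport]) -> bool:
    """Assumption: every instrument must pass before a new image is packaged."""
    return all(passes_gate(r) for r in reports.values())
```

If any instrument fails, the pipeline would keep the previously deployed models rather than proceed to packaging.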

Cost Breakdown vs Traditional Cloud Deployments

Trade-Matrix (GitHub PRO Optimization)

Compute: $0/month (300 mins/month ÷ 3,000 free mins = 10% utilization)
Container Storage: $0/month (1.5GB ÷ 100GB/month free = 1.5% utilization)
Bandwidth: $0/month (1.3GB ÷ 100GB/month free = 1.3% utilization)
CI/CD Total: $0/month
Infrastructure Total: ~$96/month (Azure VM ~$36, storage ~$10, electricity/internet ~$50)

Equivalent AWS Setup

EC2 Compute: t3.large (2 vCPU, 8GB RAM) × 2 = $120/month
EKS Cluster: Control plane = $73/month
ECR Storage: 10GB containers = $1/month
S3 + RDS: Storage + backups = $80/month
Data Transfer: 100GB/month = $9/month
CloudWatch: Monitoring + logs = $30/month
Total: $313/month ($3,756/year)

Equivalent GCP Setup

GCE Compute: n1-standard-2 × 2 = $100/month
GKE Cluster: Control plane = $73/month
Container Registry: 10GB = $2/month
Cloud Storage + SQL: = $90/month
Network Egress: 100GB/month = $12/month
Stackdriver: Monitoring + logs = $40/month
Total: $317/month ($3,804/year)

Annual Savings

~$2,600/year savings

vs. equivalent AWS setup (($313/mo - $96/mo) × 12 = $217/mo × 12 = $2,604/year)
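
The cost comparison can be checked with a few lines of arithmetic. The line items below are copied from the breakdowns above; the dictionary keys and function names are illustrative.

```python
from typing import Mapping

# Monthly line items in USD, taken from the cost breakdowns above.
TRADE_MATRIX = {"azure_vm": 36, "storage": 10, "electricity_internet": 50}
AWS = {"ec2": 120, "eks": 73, "ecr": 1, "s3_rds": 80, "data_transfer": 9, "cloudwatch": 30}
GCP = {"gce": 100, "gke": 73, "registry": 2, "storage_sql": 90, "egress": 12, "stackdriver": 40}


def monthly_total(items: Mapping[str, int]) -> int:
    return sum(items.values())


def annual_savings(baseline: Mapping[str, int], ours: Mapping[str, int]) -> int:
    """Yearly difference between a baseline setup and the current deployment."""
    return (monthly_total(baseline) - monthly_total(ours)) * 12


aws_savings = annual_savings(AWS, TRADE_MATRIX)  # (313 - 96) * 12 = 2604
gcp_savings = annual_savings(GCP, TRADE_MATRIX)  # (317 - 96) * 12 = 2652
```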

Scalability Note: While current deployment minimizes costs (~$96/month total), the architecture is designed to scale to managed cloud infrastructure (AWS/GCP/Azure) if trading volume requires additional compute. The hybrid container strategy (large base + small models) remains optimal for bandwidth efficiency at any scale.

⚡ Real-Time Trading Workflow

sequenceDiagram
    participant BYBIT as Bybit Exchange
    participant DC as DataClient
    participant MB as MessageBus
    participant DE as DataEngine
    participant C as Cache
    participant ML as ML Inference (Sub-5ms)
    participant RL as RL Position Sizer (4-Tier)
    participant S as Strategy
    participant RE as RiskEngine (HRAA v2)
    participant CB as Circuit Breaker
    participant EE as ExecEngine
    participant P as Portfolio
    Note over BYBIT,P: Live Trading Flow (Typical Latency: <50ms end-to-end)
    BYBIT->>DC: Market Data (WebSocket) - BTC-USDT 4H Bar Close
    DC->>MB: Publish BarEvent
    MB->>DE: Route to DataEngine
    DE->>C: Update Cache
    par Feature Computation
        C->>ML: Extract Features (9-11 Boruta-selected)
        ML->>ML: Model Inference - 4-Tier Resilient Load
        ML->>ML: IC Validation (threshold >= 0.05)
    end
    ML->>RL: Signal + Confidence (e.g., BUY, conf=0.73)
    alt High Confidence 50+ AND High IC 05+
        RL->>RL: TIER 1: FULL_RL - 100 percent RL Policy
    else Medium Confidence OR Medium IC
        RL->>RL: TIER 2: BLENDED - 50 percent RL + 50 percent Kelly
    else Low Confidence OR IC Failure
        RL->>RL: TIER 3: PURE_KELLY - 100 percent Kelly
    end
    RL->>CB: Check Circuit Breaker Status
    alt Circuit Breaker OPEN Drawdown over 5 percent
        CB->>RL: EMERGENCY_FLAT - 0 percent Position Size
        RL->>S: Flatten Position
    else Circuit Breaker CLOSED
        CB->>RL: OK
        RL->>S: Position Size (e.g., 15 percent capital)
    end
    S->>MB: Submit Order (Market/Limit)
    MB->>RE: Risk Validation
    RE->>RE: Check Position Limits - Per-Instrument + Portfolio
    RE->>RE: Calculate VaR Impact
    alt Risk Checks Pass
        RE->>MB: Order Approved
        MB->>EE: Execute Order
        EE->>BYBIT: Place Order
        BYBIT->>EE: Order Acknowledged
        BYBIT-->>EE: Fill Event
        EE->>MB: Broadcast Fill
        MB->>P: Update Position
        P->>P: Calculate PnL - Mark-to-Market
    else Risk Checks Fail
        RE->>MB: Order Rejected
        MB->>S: Rejection Notice
    end
    Note over BYBIT,P: Position Monitored in Real-Time for Circuit Breaker Triggers

Performance Notes: End-to-end latency from market data receipt to order placement is typically under 50ms, with ML inference contributing less than 5ms on CPU-only sklearn models. For a 4-hour bar trading strategy, latency is not a competitive differentiator; signal quality and risk management are the primary alpha sources.
