PR #1425

open

Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10

by dentity007
val_bpb: 1.4479
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 8.21 MB

Training Techniques

Architecture
parallel residuals
Dual-stream transformer blocks with separate attention and MLP residual streams, learnable route vector, and lane merge.
parameters: {"start_layer":6}
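A minimal NumPy sketch of the dual-stream idea as described above (the sublayer stubs, the `route` initialization, and the exact merge rule are assumptions for illustration, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width

W_attn = rng.normal(0, 0.02, (d, d))
W_in = rng.normal(0, 0.02, (d, 4 * d))
W_out = rng.normal(0, 0.02, (4 * d, d))
route = np.zeros(d)  # learnable per-channel route vector (assumed shape)

def attn_stub(x):
    # stand-in for an attention sublayer
    return x @ W_attn

def mlp_stub(x):
    # stand-in for an MLP sublayer
    return np.maximum(x @ W_in, 0.0) @ W_out

def parallel_residual_block(x):
    # Two residual lanes computed from the same input, then merged
    # per channel with a sigmoid gate g = sigmoid(route) ("lane merge").
    lane_attn = x + attn_stub(x)
    lane_mlp = x + mlp_stub(x)
    g = 1.0 / (1.0 + np.exp(-route))
    return g * lane_attn + (1.0 - g) * lane_mlp

x = rng.normal(size=(4, d))
y = parallel_residual_block(x)
```

With `route` initialized to zero, the gate starts at 0.5 and the block reduces to an even blend of the two lanes; training then learns a per-channel routing preference.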
SLOT
Per-batch delta optimization at the last hidden layer during evaluation.
parameters: null
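A toy sketch of per-batch delta optimization at the last hidden layer: a shared offset `delta` is added to the final hidden states and tuned by a few gradient steps on the batch's own tokens, with the model frozen. The frozen-head setup, step count, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, V, T = 8, 20, 6                   # hidden size, vocab size, tokens (toy)
W = rng.normal(0, 0.2, (d, V))       # frozen output head
h = rng.normal(size=(T, d))          # last-layer hidden states for one batch
targets = rng.integers(0, V, T)

def nll_and_grad(delta):
    # Cross-entropy of the batch with delta added to every hidden state,
    # plus the analytic gradient with respect to the shared delta.
    logits = (h + delta) @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    nll = -np.log(p[np.arange(T), targets]).mean()
    dlogits = p.copy()
    dlogits[np.arange(T), targets] -= 1.0
    grad = (dlogits @ W.T).mean(axis=0)
    return nll, grad

delta = np.zeros(d)
initial_loss, _ = nll_and_grad(delta)
for _ in range(30):                  # few eval-time gradient steps
    _, g = nll_and_grad(delta)
    delta -= 0.2 * g
final_loss, _ = nll_and_grad(delta)
```

Because the loss is convex in `delta` (cross-entropy composed with a linear map), a handful of steps reliably lowers the batch loss.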
U-Net skip connections
Sigmoid-gated U-Net style skip connections in the transformer architecture.
parameters: null
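A sketch of sigmoid-gated U-Net style skips in a transformer stack: the first half of the layers saves activations, and the second half merges them back in last-in-first-out order through a learnable sigmoid gate. The pairing scheme and one-gate-per-skip choice are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_layers = 16, 6
Ws = [rng.normal(0, 0.1, (d, d)) for _ in range(n_layers)]
gates = np.zeros(n_layers // 2)   # one learnable scalar gate per skip (assumed)

def layer(x, i):
    # stand-in for a transformer block at depth i
    return x + 0.1 * np.tanh(x @ Ws[i])

def unet_forward(x):
    saved = []
    for i in range(n_layers // 2):          # "encoder" half: save activations
        x = layer(x, i)
        saved.append(x)
    for j in range(n_layers // 2, n_layers):  # "decoder" half: gated merge
        skip = saved.pop()                    # pair deepest-saved with first merge
        g = 1.0 / (1.0 + np.exp(-gates[j - n_layers // 2]))
        x = layer(x + g * skip, j)
    return x

x = rng.normal(size=(4, d))
y = unet_forward(x)
```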
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
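With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A minimal NumPy sketch of that grouping (head dimension and sequence length are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
T, hd = 5, 4
n_heads, n_kv = 8, 4              # 8 query heads share 4 KV heads

q = rng.normal(size=(n_heads, T, hd))
k = rng.normal(size=(n_kv, T, hd))
v = rng.normal(size=(n_kv, T, hd))

def gqa(q, k, v):
    group = n_heads // n_kv           # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)   # expand KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

out = gqa(q, k, v)
```

The KV cache stores only 4 heads instead of 8, which is the usual motivation for GQA.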
XSA
XSA attention used across all layers.
parameters: {"layers":11}
Quantization
mixed int5/int6
bits: 5
scope: middle MLP layers
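A sketch of the symmetric signed quantization implied by mixed INT5/INT6 (per-tensor scales; the PR may use per-channel scales or a different rounding scheme, this is an assumption):

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Per-tensor symmetric quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1            # 15 for INT5, 31 for INT6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(0, 0.02, (64, 64)).astype(np.float32)

q5, s5 = quantize_symmetric(w, 5)   # e.g. the middle MLP layers
q6, s6 = quantize_symmetric(w, 6)   # e.g. the remaining quantized layers
err5 = np.abs(dequantize(q5, s5) - w).max()
err6 = np.abs(dequantize(q6, s6) - w).max()
```

The worst-case round-trip error is half a quantization step, so INT6 roughly halves the error of INT5 while costing one extra bit per weight, which is the trade-off behind mixing the two.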
Test-Time Training
score-first TTT
parameters: {"enabled":1}
Evaluation
sliding window eval
parameters: null
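A sketch of sliding-window evaluation: each step re-feeds a fixed context window but only counts loss on the newly advanced tokens, so every scored token keeps left context. The window/stride values and the scorer are placeholders, not the PR's settings:

```python
def sliding_window_eval(tokens, score_fn, window=8, stride=4):
    # Slide a fixed-size context over the sequence; score only the final
    # `stride` tokens of each window so every token has left context.
    n = len(tokens)
    total, counted = 0.0, 0
    pos = 0
    while pos < n:
        start = max(0, pos + stride - window)
        chunk = tokens[start:pos + stride]
        new = min(stride, n - pos)           # tokens scored this step
        total += score_fn(chunk, new)
        counted += new
        pos += stride
    return total / counted                    # mean loss per token

# Hypothetical scorer: charge 1 unit of loss per newly scored token,
# just to verify the bookkeeping covers every token exactly once.
score_fn = lambda chunk, new: float(new)
mean_loss = sliding_window_eval(list(range(10)), score_fn)
```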
Weight Averaging
EMA
parameters: {"decay":0.997}
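The EMA update with the stated decay of 0.997, sketched per tensor (the dict-of-arrays layout is illustrative):

```python
import numpy as np

def ema_update(ema, weights, decay=0.997):
    # ema <- decay * ema + (1 - decay) * weights, applied per tensor
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}

ema = {"w": np.zeros(3)}
weights = {"w": np.ones(3)}
for _ in range(1000):
    ema = ema_update(ema, weights)
# after n steps from zero toward a constant: ema = 1 - decay**n
```

With decay 0.997 the averaging horizon is roughly 1/(1-0.997) ≈ 333 steps.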
Optimizer
AdamW
weight_decay: 0.085
momentum: null
other_params: {"lr":0.02}
LR Schedule
warmdown
parameters: {"warmdown_steps":0.667}
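Reading `warmdown_steps: 0.667` as a fraction of total training (an assumption), the schedule holds the base LR constant and then decays linearly to zero over the last two thirds of training:

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_frac=0.667):
    # Constant LR, then a linear "warmdown" to zero over the final
    # warmdown_frac of training. Fraction interpretation is assumed.
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 3000
lrs = [warmdown_lr(s, total) for s in range(total)]
```

The base LR of 0.02 matches the AdamW `lr` listed above.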
Regularization
logit softcap
parameters: null
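Logit softcapping smoothly squashes logits into a bounded range via a scaled tanh; the cap value below is an assumption, since the PR lists no parameters for it:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Bound logits to (-cap, cap) smoothly: cap * tanh(logits / cap).
    # Near zero this is approximately the identity; extremes saturate.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = softcap(x)
```

Because tanh is near-linear around the origin, typical logits pass through almost unchanged while outliers are clamped, which regularizes the loss.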

Novel Contributions

  • Parallel residuals with dual-stream attention/MLP routing
  • Mixed INT5/INT6 quantization for artifact size reduction
  • Score-first TTT implementation
  • SLOT evaluation-time delta optimization
  • Ablation showing parallel residuals as the dominant improvement
  • Reported 2.3x throughput improvement on DGX Spark GB10