PR #1425
Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10
by dentity007
val_bpb
1.4479
Architecture
Transformer
Optimizer
AdamW
Artifact Size
8.21 MB
Training Techniques
Architecture
parallel residuals
Dual-stream transformer blocks with separate attention and MLP residual streams, learnable route vector, and lane merge.
parameters: {"start_layer":6}
SLOT
Per-batch delta optimization at the last hidden layer during evaluation.
parameters: null
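A sketch of what per-batch delta optimization at the last hidden layer could look like; `model.backbone`, `model.lm_head`, and the step count and learning rate are hypothetical names and values, not taken from the PR.

```python
import torch
import torch.nn.functional as F

@torch.enable_grad()
def slot_logits(model, input_ids, steps=3, lr=0.01):
    """SLOT-style sketch: freeze the model, then optimize a small additive
    delta on the last hidden states per batch, minimizing next-token loss
    on the batch itself, and score with the adjusted states."""
    for p in model.parameters():
        p.requires_grad_(False)                # only the delta is trained
    hidden = model.backbone(input_ids)         # (B, T, D) last hidden states
    delta = torch.zeros(1, 1, hidden.size(-1), device=hidden.device,
                        requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        logits = model.lm_head(hidden + delta)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.lm_head(hidden + delta.detach())
```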
U-Net skip connections
Sigmoid-gated U-Net-style skip connections in the transformer architecture.
parameters: null
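One common way to wire sigmoid-gated U-Net skips into a decoder stack is sketched below: the first half of the layers push activations, the second half pop them back in through learned gates. The exact layer pairing and gate placement here are assumptions.

```python
import torch
import torch.nn as nn

class GatedUNetStack(nn.Module):
    """Sketch of sigmoid-gated U-Net-style skips over transformer blocks.
    Encoder-half layers save their inputs; decoder-half layers mix the
    saved activations back in through a learned sigmoid gate."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gates = nn.Parameter(torch.zeros(len(blocks) // 2))  # sigmoid(0)=0.5

    def forward(self, x):
        stack, half = [], len(self.blocks) // 2
        for i, block in enumerate(self.blocks):
            if i < half:
                stack.append(x)                    # first half: save skip
            elif stack:
                g = torch.sigmoid(self.gates[len(stack) - 1])
                x = x + g * stack.pop()            # second half: gated merge
            x = block(x)
        return x
```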
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
XSA
XSA attention used across all layers.
parameters: {"layers":11}
Quantization
mixed INT5/INT6
bits: 5
scope: middle MLP layers
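The card records 5 bits scoped to the middle MLP layers; reading the mixed scheme as INT5 there and INT6 elsewhere (an assumption), a symmetric fake-quantization sketch looks like this:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-output-channel fake quantization to `bits` bits.
    Sketch only; the PR's actual packing/serialization is not shown."""
    qmax = 2 ** (bits - 1) - 1          # 15 for INT5, 31 for INT6
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                     # dequantized weights

def quantize_mlp_stack(mlp_weights, mid_lo, mid_hi):
    """Assumed split: INT5 for the middle MLP layers in [mid_lo, mid_hi),
    INT6 for the rest. The layer boundaries are placeholders."""
    return [quantize_symmetric(w, 5 if mid_lo <= i < mid_hi else 6)
            for i, w in enumerate(mlp_weights)]
```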
Test-Time Training
score-first TTT
parameters: {"enabled":1}
Evaluation
sliding window eval
parameters: null
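No parameters are recorded for the sliding-window eval; a generic version that scores each token with full left context looks roughly like this (window, stride, and bytes-per-token are placeholders, batch size 1 assumed):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, window=1024, stride=512, bpt=1.0):
    """Slide a fixed context window over the token stream, scoring only
    tokens not covered by the previous window so each is predicted with
    adequate left context. `bpt` is average bytes per token."""
    total_nll, total_tok, scored = 0.0, 0, 0
    T = ids.size(1)
    for start in range(0, T - 1, stride):
        end = min(start + window, T - 1)
        logits = model(ids[:, start:end])              # (1, L, V)
        nll = F.cross_entropy(logits[0], ids[0, start + 1:end + 1],
                              reduction="none")
        new = nll[scored - start:]       # skip already-scored positions
        total_nll += new.sum().item()
        total_tok += new.numel()
        scored = end
        if end == T - 1:
            break
    return total_nll / (total_tok * bpt * math.log(2))   # bits per byte
```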
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
AdamW
weight_decay: 0.085
momentum: null
other_params: {"lr":0.02}
LR Schedule
warmdown
parameters: {"warmdown_steps":0.667}
Regularization
logit softcap
parameters: null
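No cap value is recorded; the usual tanh-based softcap (as popularized by Gemma 2) with a placeholder cap:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap) via tanh; the cap value here
    is a placeholder, not recorded on this card."""
    return cap * torch.tanh(logits / cap)
```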
Novel Contributions
- Parallel residuals with dual-stream attention/MLP routing
- Mixed INT5/INT6 quantization for artifact size reduction
- Score-first TTT implementation
- SLOT evaluation-time delta optimization
- Ablation showing parallel residuals as the dominant improvement
- Reported 2.3x throughput improvement on DGX Spark GB10