PR #615
openRecord: Residual Input Mixing + mixed int6 GPTQ + grouped TTT + MLP 3.5x
by danialht
val_bpb
1.1169
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.6 MB
Training Techniques
Quantization
mixed int6 GPTQ
bits: 6
scope: per-row
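A minimal pure-Python sketch of what per-row symmetric int6 quantization looks like. This only illustrates the per-row scaling granularity; GPTQ proper additionally uses Hessian-aware, error-compensated rounding, which is not shown here, and all names are illustrative.

```python
def quantize_row_int6(row):
    """Quantize one weight row to signed int6 with a single per-row scale."""
    qmax = 31  # signed 6-bit range is [-32, 31]
    amax = max(abs(w) for w in row) or 1.0
    scale = amax / qmax
    # round to the nearest int6 level, clamping to the representable range
    q = [max(-32, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int6 codes and the row scale."""
    return [v * scale for v in q]

row = [0.12, -0.5, 0.33, 0.02]
q, s = quantize_row_int6(row)
approx = dequantize_row(q, s)
```

With a per-row scale, the worst-case reconstruction error for any weight in the row is half of one quantization step (`scale / 2`).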
Architecture
Residual Input Mixing
Each transformer block sees a learned mix of the current stream, earlier block outputs, and the original x0, creating a denser residual path and enabling reuse of longer-range intermediate features.
parameters: {"layers":11,"dimension":512,"MHA":"8/8","MLP":"3.5x (1792)","BigramHash":8192,"XSA":"all layers","mixed residuals":"each layer from 2 previous layers"}
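The residual mixing described above can be sketched as follows. Per the parameters, each layer mixes the current stream with the outputs of the 2 previous layers plus the original input x0; the uniform mixing weights and the stand-in `block` function below are placeholders for what the model actually learns.

```python
def block(x):
    # stand-in for a transformer block: any function of its input stream
    return [v * 0.5 + 0.1 for v in x]

def mix(streams, weights):
    # weighted sum of candidate inputs: current stream, prior outputs, x0
    n = len(streams[0])
    return [sum(w * s[i] for w, s in zip(weights, streams)) for i in range(n)]

def forward(x0, n_layers=4):
    outputs = [x0]  # x0 plus each block's output, kept for later reuse
    x = x0
    for _ in range(n_layers):
        # candidates: current stream, outputs of the 2 previous layers, and x0
        candidates = [x] + outputs[-2:] + [x0]
        # uniform weights here; in the model these coefficients are learned
        weights = [1.0 / len(candidates)] * len(candidates)
        y = block(mix(candidates, weights))
        x = [a + b for a, b in zip(x, y)]  # standard residual add
        outputs.append(y)
    return x

y = forward([1.0, 2.0])
```

The effect is a denser residual path: later blocks can read intermediate features directly rather than only through the accumulated stream.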
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"grouped":true,"stronger matrix/head adaptation":true,"standard clipping restored":true,"per-chunk warmup removed":true}
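"Grouped" here means AdamW parameter groups with per-group settings, e.g. a higher learning rate for matrix and head parameters than for the rest. A hypothetical sketch of the grouping logic; the group names and learning rates below are illustrative, not the PR's actual values.

```python
def build_groups(params, lrs={"matrix": 3e-4, "head": 1e-3, "other": 1e-4}):
    """Bucket parameters by kind into optimizer groups with per-group LRs."""
    groups = {}
    for name, kind in params.items():
        groups.setdefault(kind, {"lr": lrs[kind], "params": []})
        groups[kind]["params"].append(name)
    return list(groups.values())

# illustrative parameter-name -> kind mapping
params = {
    "attn.qkv.weight": "matrix",
    "mlp.fc1.weight": "matrix",
    "lm_head.weight": "head",
    "ln1.bias": "other",
}
groups = build_groups(params)
```

Each resulting group would be passed to AdamW as a separate entry, giving matrix/head parameters stronger adaptation without touching the others.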
Weight Averaging
EMA
parameters: {"decay":0.997}
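The EMA update itself is a one-liner per parameter; the decay below matches the card, and the weights are toy values.

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average of model weights, decay 0.997 as in the card."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

After k steps toward a constant weight w from 0, the EMA equals `w * (1 - 0.997**k)`, so with decay 0.997 the average moves slowly and smooths out training noise.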
Test-Time Training
score-first TTT
parameters: {"chunk":131072,"last 2 blocks plus control params unfrozen":true,"optimizer":"Legal score-first AdamW"}
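A sketch of the chunking and freezing logic implied by the parameters above: the token stream is split into 131072-token chunks, and only the last 2 of the 11 blocks plus control parameters are trainable. "Score-first" means each chunk is scored for evaluation before being used for an update. Naming conventions here are hypothetical.

```python
def chunks(tokens, chunk=131072):
    """Split the token stream into fixed-size TTT chunks (chunk size per the card)."""
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]

def trainable(name, n_layers=11):
    """Unfreeze only the last 2 blocks plus control params (names illustrative)."""
    unfrozen = {f"block{n_layers - 1}", f"block{n_layers - 2}", "control"}
    return name.split(".")[0] in unfrozen

# during TTT, each chunk would be scored first, then used for an update step
```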
Evaluation
stride-based eval
parameters: {"stride":64}
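Stride-based eval slides the context window forward by the stride and scores only the tokens not yet covered, so each token is evaluated with long left context instead of being scored at a chunk boundary. A sketch of which spans get scored; the window size is illustrative, the stride matches the card.

```python
def eval_spans(n_tokens, window=1024, stride=64):
    """Return (begin, end) spans of tokens scored as the window slides by `stride`."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((prev_end, end))  # only tokens new to this window are scored
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = eval_spans(2048, window=1024, stride=64)
```

Every token is scored exactly once, and after the first window each span is only `stride` tokens wide, so those tokens see at least `window - stride` tokens of context.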
Novel Contributions
- Changed TTT from a flat optimizer to grouped AdamW with stronger matrix/head adaptation, restoring standard clipping and removing per-chunk warmup.
- Modified architecture to have denser residual connections by mixing inputs from current stream, earlier block outputs, and original input x0 at each transformer block.
- Applied mixed int6 per-row GPTQ quantization with clip_range=15 combined with Early QAT (threshold 0.5) and EMA 0.997.
- Used MLP expansion of 3.5x (1792) and BigramHash 8192 with XSA in all layers.