PR #790
openRecord: Residual Input Mixing + mixed int6 GPTQ + grouped TTT + MLP 3.5x
by danialht
val_bpb
1.1172
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.5 MB
Training Techniques
Quantization
mixed int6 GPTQ
bits: 6
scope: per-row weights
QAT
bits: null
scope: mixed int6 GPTQ with early QAT
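The per-row int6 scheme above can be illustrated with a minimal round-to-nearest sketch. This omits GPTQ's Hessian-based error compensation and the QAT stage; it only shows what "per-row, 6-bit, symmetric" means for the weight layout. Function names are illustrative, not from the PR.

```python
import torch

def quantize_rows_int6(w: torch.Tensor):
    """Per-row symmetric int6 round-to-nearest quantization sketch.
    GPTQ additionally compensates quantization error column-by-column
    using second-order (Hessian) information; that step is omitted here."""
    qmax = 2 ** (6 - 1) - 1  # int6 range is [-32, 31]; use +/-31 symmetric
    # one scale per output row, from that row's absolute maximum
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale  # int6 values stored in an int8 tensor

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Per-row scales keep the quantization error bounded by half a step of each row's own dynamic range, which is why they are preferred over a single per-tensor scale.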
Architecture
residual mixing
Each transformer block receives a learned mix of the current residual stream, the outputs of earlier blocks, and the original embedding x0, creating denser residual connections and reusing longer-range intermediate features.
parameters: {"layers":11,"dimensions":512,"mlp_multiplier":3.5,"mha":"8/8","bigramhash":8192,"xsa":"all layers"}
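The residual input mixing described above can be sketched as follows. This is a hypothetical reconstruction: it assumes one learned softmax-normalized weight per input (x0, each earlier block output, and the current stream), and uses a trivial stand-in for the block body; the PR's actual parameterization may differ.

```python
import torch
import torch.nn as nn

class ResidualMixBlock(nn.Module):
    """Sketch: block i mixes the current stream, all earlier block
    outputs, and the original embedding x0 with learned weights,
    then applies its body on the mixed input (assumed form)."""
    def __init__(self, dim: int, block_index: int):
        super().__init__()
        # one logit for x0, one per earlier block output, one for the stream
        self.mix_logits = nn.Parameter(torch.zeros(block_index + 2))
        # stand-in for the real attention + MLP body
        self.body = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, stream, x0, earlier_outputs):
        w = torch.softmax(self.mix_logits, dim=0)
        mixed = w[0] * x0 + w[-1] * stream
        for wi, h in zip(w[1:-1], earlier_outputs):
            mixed = mixed + wi * h
        return mixed + self.body(mixed)
```

Compared with a plain residual stream, each block can directly re-weight long-range intermediate features instead of relying on them surviving additively through every intervening block.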
MLP3.5x
Expanded MLP width to 3.5x the model dimension.
parameters: {"multiplier":3.5,"hidden_size":1792}
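For concreteness, a minimal MLP with the stated 3.5x width (512 * 3.5 = 1792); the activation choice here is an assumption, not taken from the PR.

```python
import torch
import torch.nn as nn

def make_mlp(dim: int = 512, multiplier: float = 3.5) -> nn.Sequential:
    """Feed-forward block with hidden width = multiplier * dim.
    With dim=512 and multiplier=3.5 this gives hidden_size=1792,
    matching the parameters above. GELU is an illustrative choice."""
    hidden = int(dim * multiplier)
    return nn.Sequential(
        nn.Linear(dim, hidden),
        nn.GELU(),
        nn.Linear(hidden, dim),
    )
```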
BigramHash
Uses a BigramHash component in the architecture.
parameters: {"dimensions":8192}
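The PR does not spell out the BigramHash internals; a common construction, sketched here as an assumption, hashes each (previous token, current token) pair into a fixed-size table of 8192 buckets and adds the looked-up vector to the token embedding. The multiplier constant and class name are illustrative.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Assumed sketch: hash each (prev, cur) token pair into one of
    `num_buckets` slots and look up an additive feature vector."""
    def __init__(self, num_buckets: int = 8192, dim: int = 512):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # no previous token at the first position
        # cheap multiplicative hash of the pair into the bucket range
        idx = (prev * 1000003 + tokens) % self.num_buckets
        return self.table(idx)
```

Hash collisions are accepted by design: with only 8192 buckets the table stays tiny while still giving frequent bigrams a dedicated feature.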
XSA
XSA is enabled in all layers.
parameters: {"layers":"all"}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"grouped_params":true,"groups":["matrices","control weights"],"standard_clipping":true,"per_chunk_warmup_removed":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
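The EMA with decay 0.997 amounts to the standard update below, applied to a shadow copy of the model after each optimizer step; the helper name is illustrative.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module,
               decay: float = 0.997) -> None:
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p.detach(), alpha=1.0 - decay)

# usage: ema_model = copy.deepcopy(model), then call
# ema_update(ema_model, model) after every optimizer step;
# evaluate with ema_model rather than model
```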
Evaluation
stride-based eval
parameters: {"stride":64}
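A common form of stride-based evaluation, sketched here as an assumption about what the stride=64 setting means: windows slide forward by the stride and only the newly covered targets are scored, so every scored token keeps a long left context. The function name, context length, and nats-to-bits conversion are illustrative.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def stride_eval_bpb(logits_fn, tokens, ctx_len: int = 512, stride: int = 64):
    """Sliding-window eval sketch: advance by `stride`, score only the
    targets not covered by a previous window. `logits_fn` maps a (1, T)
    token tensor to (1, T, V) logits. Returns bits per token (bits per
    byte if the tokens are bytes)."""
    total_nll, total_tok, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.size(0) - 1, stride):
        end = min(begin + ctx_len, tokens.size(0) - 1)
        ids = tokens[begin:end + 1].unsqueeze(0)
        logits = logits_fn(ids[:, :-1])
        targets = ids[:, 1:]
        n_new = end - prev_end  # targets not yet scored
        nll = F.cross_entropy(
            logits[0, -n_new:], targets[0, -n_new:], reduction="sum")
        total_nll += nll.item()
        total_tok += n_new
        prev_end = end
        if end == tokens.size(0) - 1:
            break
    return total_nll / total_tok / math.log(2)  # nats -> bits
```

A smaller stride costs more forward passes but gives each scored token more context, typically lowering the measured bpb relative to non-overlapping chunks.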
Test-Time Training
score-first AdamW TTT
parameters: {"chunk":131072,"unfrozen":"last 2 blocks plus control params","grouped_optimizer":true}
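The grouped TTT optimizer can be sketched as two AdamW parameter groups, splitting matrices from small control tensors, with standard gradient clipping in the step (per the contributions list below, per-chunk warmup was removed). The learning rates, the ndim-based split rule, and the clip norm of 1.0 are illustrative placeholders, not the PR's values.

```python
import torch
import torch.nn as nn

def make_ttt_optimizer(model: nn.Module, lr_matrix: float = 1e-4,
                       lr_control: float = 1e-3) -> torch.optim.AdamW:
    """Two AdamW groups: >=2-D weight matrices vs small 'control'
    tensors (norm gains, mixing weights, biases). Splitting by ndim
    is an assumed heuristic for the matrix/control distinction."""
    params = [p for p in model.parameters() if p.requires_grad]
    matrices = [p for p in params if p.ndim >= 2]
    control = [p for p in params if p.ndim < 2]
    return torch.optim.AdamW([
        {"params": matrices, "lr": lr_matrix, "weight_decay": 0.0},
        {"params": control, "lr": lr_control, "weight_decay": 0.0},
    ])

def ttt_step(model: nn.Module, opt, loss: torch.Tensor) -> None:
    """One TTT update with standard global-norm clipping, no warmup."""
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
```

In practice only the unfrozen subset (here, the last two blocks plus control parameters) would be passed to the optimizer; the sketch takes the whole model for brevity.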
Other
other
GPTQ calibration time is counted within the 600s training budget, requiring a slight reduction in training wall-clock time to stay under the limit.
parameters: {"time_limit_seconds":600}
Novel Contributions
- Fixed the prior bug so GPTQ calibration time counts toward the 600s training budget.
- Reduced training wall-clock time slightly to remain under the time limit.
- Switched TTT from a flat optimizer to grouped AdamW with separate matrix and control-weight parameter groups.
- Strengthened matrix/head adaptation in TTT while restoring standard clipping and removing per-chunk warmup.
- Introduced denser residual input mixing so each block sees a learned mix of current stream, earlier block outputs, and x0.
- Used mixed int6 per-row GPTQ with early QAT and EMA.
- Expanded the MLP to 3.5x width.