PR #615
openRecord: Residual Input Mixing + mixed int6 GPTQ + grouped TTT + MLP 3.5x
by danialht
val_bpb
1.1169
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.6 MB
Training Techniques
Quantization
mixed int6 GPTQ
bits: 6
scope: per-row
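A minimal pure-Python sketch of what per-row symmetric int6 quantization looks like. This only illustrates the per-row scaling granularity; GPTQ proper additionally uses Hessian-aware, error-compensated rounding, which is not shown here, and all names are illustrative.

```python
def quantize_row_int6(row):
    """Quantize one weight row to signed int6 with a single per-row scale."""
    qmax = 31  # signed 6-bit range is [-32, 31]
    amax = max(abs(w) for w in row) or 1.0
    scale = amax / qmax
    # round to the nearest int6 level, clamping to the representable range
    q = [max(-32, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int6 codes and the row scale."""
    return [v * scale for v in q]

row = [0.12, -0.5, 0.33, 0.02]
q, s = quantize_row_int6(row)
approx = dequantize_row(q, s)
```

With a per-row scale, the worst-case reconstruction error for any weight in the row is half of one quantization step (`scale / 2`).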
Architecture
Residual Input Mixing
Each transformer block sees a learned mix of the current stream, earlier block outputs, and the original x0, creating a denser residual path and enabling reuse of longer-range intermediate features.
parameters: {"layers":11,"dimension":512,"MHA":"8/8","MLP":"3.5x (1792)","BigramHash":8192,"XSA":"all layers","mixed residuals":"each layer from 2 previous layers"}
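The residual mixing described above can be sketched as follows. Per the parameters, each layer mixes the current stream with the outputs of the 2 previous layers plus the original input x0; the uniform mixing weights and the stand-in `block` function below are placeholders for what the model actually learns.

```python
def block(x):
    # stand-in for a transformer block: any function of its input stream
    return [v * 0.5 + 0.1 for v in x]

def mix(streams, weights):
    # weighted sum of candidate inputs: current stream, prior outputs, x0
    n = len(streams[0])
    return [sum(w * s[i] for w, s in zip(weights, streams)) for i in range(n)]

def forward(x0, n_layers=4):
    outputs = [x0]  # x0 plus each block's output, kept for later reuse
    x = x0
    for _ in range(n_layers):
        # candidates: current stream, outputs of the 2 previous layers, and x0
        candidates = [x] + outputs[-2:] + [x0]
        # uniform weights here; in the model these coefficients are learned
        weights = [1.0 / len(candidates)] * len(candidates)
        y = block(mix(candidates, weights))
        x = [a + b for a, b in zip(x, y)]  # standard residual add
        outputs.append(y)
    return x

y = forward([1.0, 2.0])
```

The effect is a denser residual path: later blocks can read intermediate features directly rather than only through the accumulated stream.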
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"grouped":true,"stronger matrix/head adaptation":true,"standard clipping restored":true,"per-chunk warmup removed":true}
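"Grouped" here means AdamW parameter groups with per-group settings, e.g. a higher learning rate for matrix and head parameters than for the rest. A hypothetical sketch of the grouping logic; the group names and learning rates below are illustrative, not the PR's actual values.

```python
def build_groups(params, lrs={"matrix": 3e-4, "head": 1e-3, "other": 1e-4}):
    """Bucket parameters by kind into optimizer groups with per-group LRs."""
    groups = {}
    for name, kind in params.items():
        groups.setdefault(kind, {"lr": lrs[kind], "params": []})
        groups[kind]["params"].append(name)
    return list(groups.values())

# illustrative parameter-name -> kind mapping
params = {
    "attn.qkv.weight": "matrix",
    "mlp.fc1.weight": "matrix",
    "lm_head.weight": "head",
    "ln1.bias": "other",
}
groups = build_groups(params)
```

Each resulting group would be passed to AdamW as a separate entry, giving matrix/head parameters stronger adaptation without touching the others.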
Weight Averaging
EMA
parameters: {"decay":0.997}
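The EMA update itself is a one-liner per parameter; the decay below matches the card, and the weights are toy values.

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average of model weights, decay 0.997 as in the card."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

After k steps toward a constant weight w from 0, the EMA equals `w * (1 - 0.997**k)`, so with decay 0.997 the average moves slowly and smooths out training noise.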
Test-Time Training
score-first TTT
parameters: {"chunk":131072,"last 2 blocks plus control params unfrozen":true,"optimizer":"Legal score-first AdamW"}
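A sketch of the chunking and freezing logic implied by the parameters above: the token stream is split into 131072-token chunks, and only the last 2 of the 11 blocks plus control parameters are trainable. "Score-first" means each chunk is scored for evaluation before being used for an update. Naming conventions here are hypothetical.

```python
def chunks(tokens, chunk=131072):
    """Split the token stream into fixed-size TTT chunks (chunk size per the card)."""
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]

def trainable(name, n_layers=11):
    """Unfreeze only the last 2 blocks plus control params (names illustrative)."""
    unfrozen = {f"block{n_layers - 1}", f"block{n_layers - 2}", "control"}
    return name.split(".")[0] in unfrozen

# during TTT, each chunk would be scored first, then used for an update step
```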
Evaluation
stride-based eval
parameters: {"stride":64}
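Stride-based eval slides the context window forward by the stride and scores only the tokens not yet covered, so each token is evaluated with long left context instead of being scored at a chunk boundary. A sketch of which spans get scored; the window size is illustrative, the stride matches the card.

```python
def eval_spans(n_tokens, window=1024, stride=64):
    """Return (begin, end) spans of tokens scored as the window slides by `stride`."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((prev_end, end))  # only tokens new to this window are scored
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = eval_spans(2048, window=1024, stride=64)
```

Every token is scored exactly once, and after the first window each span is only `stride` tokens wide, so those tokens see at least `window - stride` tokens of context.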
Novel Contributions
- Changed TTT from a flat optimizer to grouped AdamW with stronger matrix/head adaptation, restoring standard clipping and removing per-chunk warmup.
- Modified architecture to have denser residual connections by mixing inputs from current stream, earlier block outputs, and original input x0 at each transformer block.
- Applied mixed int6 per-row GPTQ quantization with clip_range=15 combined with Early QAT (threshold 0.5) and EMA 0.997.
- Used MLP expansion of 3.5x (1792) and BigramHash 8192 with XSA in all layers.