PR #790
openRecord: Residual Input Mixing + mixed int6 GPTQ + grouped TTT + MLP 3.5x
by danialht
val_bpb
1.1172
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.5 MB
Training Techniques
Quantization
mixed int6 GPTQ
bits: 6
scope: per-row weights
QAT
bits: null
scope: mixed int6 GPTQ with early QAT
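The per-row int6 scheme above can be illustrated with a minimal round-to-nearest sketch. This omits GPTQ's Hessian-based error compensation and the QAT stage; it only shows what "per-row, 6-bit, symmetric" means for the weight layout. Function names are illustrative, not from the PR.

```python
import torch

def quantize_rows_int6(w: torch.Tensor):
    """Per-row symmetric int6 round-to-nearest quantization sketch.
    GPTQ additionally compensates quantization error column-by-column
    using second-order (Hessian) information; that step is omitted here."""
    qmax = 2 ** (6 - 1) - 1  # int6 range is [-32, 31]; use +/-31 symmetric
    # one scale per output row, from that row's absolute maximum
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale  # int6 values stored in an int8 tensor

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Per-row scales keep the quantization error bounded by half a step of each row's own dynamic range, which is why they are preferred over a single per-tensor scale.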
Architecture
residual mixing
Each transformer block receives a learned mix of the current residual stream, the outputs of earlier blocks, and the original embedding x0, creating denser residual connections and reusing longer-range intermediate features.
parameters: {"layers":11,"dimensions":512,"mlp_multiplier":3.5,"mha":"8/8","bigramhash":8192,"xsa":"all layers"}
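The residual input mixing described above can be sketched as follows. This is a hypothetical reconstruction: it assumes one learned softmax-normalized weight per input (x0, each earlier block output, and the current stream), and uses a trivial stand-in for the block body; the PR's actual parameterization may differ.

```python
import torch
import torch.nn as nn

class ResidualMixBlock(nn.Module):
    """Sketch: block i mixes the current stream, all earlier block
    outputs, and the original embedding x0 with learned weights,
    then applies its body on the mixed input (assumed form)."""
    def __init__(self, dim: int, block_index: int):
        super().__init__()
        # one logit for x0, one per earlier block output, one for the stream
        self.mix_logits = nn.Parameter(torch.zeros(block_index + 2))
        # stand-in for the real attention + MLP body
        self.body = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, stream, x0, earlier_outputs):
        w = torch.softmax(self.mix_logits, dim=0)
        mixed = w[0] * x0 + w[-1] * stream
        for wi, h in zip(w[1:-1], earlier_outputs):
            mixed = mixed + wi * h
        return mixed + self.body(mixed)
```

Compared with a plain residual stream, each block can directly re-weight long-range intermediate features instead of relying on them surviving additively through every intervening block.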
MLP3.5x
Expanded MLP width to 3.5x the model dimension.
parameters: {"multiplier":3.5,"hidden_size":1792}
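For concreteness, a minimal MLP with the stated 3.5x width (512 * 3.5 = 1792); the activation choice here is an assumption, not taken from the PR.

```python
import torch
import torch.nn as nn

def make_mlp(dim: int = 512, multiplier: float = 3.5) -> nn.Sequential:
    """Feed-forward block with hidden width = multiplier * dim.
    With dim=512 and multiplier=3.5 this gives hidden_size=1792,
    matching the parameters above. GELU is an illustrative choice."""
    hidden = int(dim * multiplier)
    return nn.Sequential(
        nn.Linear(dim, hidden),
        nn.GELU(),
        nn.Linear(hidden, dim),
    )
```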
BigramHash
Uses a BigramHash component in the architecture.
parameters: {"dimensions":8192}
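The PR does not spell out the BigramHash internals; a common construction, sketched here as an assumption, hashes each (previous token, current token) pair into a fixed-size table of 8192 buckets and adds the looked-up vector to the token embedding. The multiplier constant and class name are illustrative.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Assumed sketch: hash each (prev, cur) token pair into one of
    `num_buckets` slots and look up an additive feature vector."""
    def __init__(self, num_buckets: int = 8192, dim: int = 512):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # no previous token at the first position
        # cheap multiplicative hash of the pair into the bucket range
        idx = (prev * 1000003 + tokens) % self.num_buckets
        return self.table(idx)
```

Hash collisions are accepted by design: with only 8192 buckets the table stays tiny while still giving frequent bigrams a dedicated feature.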
XSA
XSA is enabled in all layers.
parameters: {"layers":"all"}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"grouped_params":true,"groups":["matrices","control weights"],"standard_clipping":true,"per_chunk_warmup_removed":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
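The EMA with decay 0.997 amounts to the standard update below, applied to a shadow copy of the model after each optimizer step; the helper name is illustrative.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module,
               decay: float = 0.997) -> None:
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p.detach(), alpha=1.0 - decay)

# usage: ema_model = copy.deepcopy(model), then call
# ema_update(ema_model, model) after every optimizer step;
# evaluate with ema_model rather than model
```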
Evaluation
stride-based eval
parameters: {"stride":64}
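A common form of stride-based evaluation, sketched here as an assumption about what the stride=64 setting means: windows slide forward by the stride and only the newly covered targets are scored, so every scored token keeps a long left context. The function name, context length, and nats-to-bits conversion are illustrative.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def stride_eval_bpb(logits_fn, tokens, ctx_len: int = 512, stride: int = 64):
    """Sliding-window eval sketch: advance by `stride`, score only the
    targets not covered by a previous window. `logits_fn` maps a (1, T)
    token tensor to (1, T, V) logits. Returns bits per token (bits per
    byte if the tokens are bytes)."""
    total_nll, total_tok, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.size(0) - 1, stride):
        end = min(begin + ctx_len, tokens.size(0) - 1)
        ids = tokens[begin:end + 1].unsqueeze(0)
        logits = logits_fn(ids[:, :-1])
        targets = ids[:, 1:]
        n_new = end - prev_end  # targets not yet scored
        nll = F.cross_entropy(
            logits[0, -n_new:], targets[0, -n_new:], reduction="sum")
        total_nll += nll.item()
        total_tok += n_new
        prev_end = end
        if end == tokens.size(0) - 1:
            break
    return total_nll / total_tok / math.log(2)  # nats -> bits
```

A smaller stride costs more forward passes but gives each scored token more context, typically lowering the measured bpb relative to non-overlapping chunks.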
Test-Time Training
score-first AdamW TTT
parameters: {"chunk":131072,"unfrozen":"last 2 blocks plus control params","grouped_optimizer":true}
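The grouped TTT optimizer can be sketched as two AdamW parameter groups, splitting matrices from small control tensors, with standard gradient clipping in the step (per the contributions list below, per-chunk warmup was removed). The learning rates, the ndim-based split rule, and the clip norm of 1.0 are illustrative placeholders, not the PR's values.

```python
import torch
import torch.nn as nn

def make_ttt_optimizer(model: nn.Module, lr_matrix: float = 1e-4,
                       lr_control: float = 1e-3) -> torch.optim.AdamW:
    """Two AdamW groups: >=2-D weight matrices vs small 'control'
    tensors (norm gains, mixing weights, biases). Splitting by ndim
    is an assumed heuristic for the matrix/control distinction."""
    params = [p for p in model.parameters() if p.requires_grad]
    matrices = [p for p in params if p.ndim >= 2]
    control = [p for p in params if p.ndim < 2]
    return torch.optim.AdamW([
        {"params": matrices, "lr": lr_matrix, "weight_decay": 0.0},
        {"params": control, "lr": lr_control, "weight_decay": 0.0},
    ])

def ttt_step(model: nn.Module, opt, loss: torch.Tensor) -> None:
    """One TTT update with standard global-norm clipping, no warmup."""
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
```

In practice only the unfrozen subset (here, the last two blocks plus control parameters) would be passed to the optimizer; the sketch takes the whole model for brevity.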
Other
other
GPTQ calibration time is counted within the 600s training budget, requiring a slight reduction in training wall-clock time to stay under the limit.
parameters: {"time_limit_seconds":600}
Novel Contributions
- Fixed the prior bug so GPTQ calibration time counts toward the 600s training budget.
- Reduced training wall-clock time slightly to remain under the time limit.
- Switched TTT from a flat optimizer to grouped AdamW with separate matrix and control-weight parameter groups.
- Strengthened matrix/head adaptation in TTT while restoring standard clipping and removing per-chunk warmup.
- Introduced denser residual input mixing so each block sees a learned mix of current stream, earlier block outputs, and x0.
- Used mixed int6 per-row GPTQ with early QAT and EMA.
- Expanded the MLP to 3.5x width.