PR #1422

open

Non-record: Depth Recurrence + GPTQ + SGD TTT (1.1172, 1xH100)

by swapp1990
val_bpb: 1.1172
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB

Training Techniques

Architecture
  • depth recurrence: shares middle transformer blocks across multiple layers to create 13 effective layers from 7 unique blocks. parameters: {"layers":13,"unique_blocks":7}
  • XSA: cross-sample attention applied to all unique blocks. parameters: {"blocks":7}
  • SmearGate: used in the SwiGLU MLP.
  • U-Net skip connections: U-Net-style skip connections added to the transformer.
  • weight tying: tied input and output embeddings.
  • MLP3x: 3x MLP expansion with SwiGLU. parameters: {"multiplier":3}
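The depth-recurrence idea above can be sketched as a layer schedule that maps 13 effective layers onto 7 unique blocks. The PR only states the counts and that the middle blocks are shared; the exact repeat pattern below is an assumption.

```python
# Hypothetical depth-recurrence schedule. The PR states only that 7 unique
# blocks yield 13 effective layers by sharing the middle blocks; the exact
# repeat pattern built here is an assumption.
def make_schedule(unique_blocks=7, effective_layers=13):
    """Build a layer schedule that reuses the middle blocks."""
    first = [0, 1]                                 # unshared entry blocks
    last = [unique_blocks - 2, unique_blocks - 1]  # unshared exit blocks
    middle = list(range(2, unique_blocks - 2))     # shared middle blocks
    schedule = list(first)
    while len(schedule) + len(last) < effective_layers:
        for b in middle:
            if len(schedule) + len(last) >= effective_layers:
                break
            schedule.append(b)
    return schedule + last

def forward(x, blocks, schedule):
    """Apply the (possibly repeated) blocks in schedule order."""
    for idx in schedule:
        x = blocks[idx](x)
    return x
```

Because the middle blocks appear multiple times in the schedule, parameter count stays at 7 blocks while compute depth is 13.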
Optimizer
  • Muon: weight_decay: null, momentum: null, other_params: {"matrix_lr":0.04,"scalar_lr":0.04}

Weight Averaging
  • EMA: parameters: {"during":"warmdown phase"}
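A minimal sketch of EMA weight averaging restricted to the warmdown phase. The decay constant (0.999) and the warmdown boundary are assumptions; the PR states only that EMA runs during warmdown.

```python
# Sketch of EMA weight averaging enabled only during the warmdown phase.
# The decay constant 0.999 and the warmdown_start boundary are assumptions.
def ema_update(ema, weights, decay=0.999):
    """One exponential-moving-average step over a dict of parameters."""
    return {k: decay * ema[k] + (1 - decay) * weights[k] for k in ema}

def train_with_warmdown_ema(weights, steps, warmdown_start, step_fn):
    """Run step_fn each step; track an EMA only once warmdown begins."""
    ema = None
    for step in range(steps):
        weights = step_fn(weights, step)
        if step >= warmdown_start:  # averaging starts at warmdown
            ema = dict(weights) if ema is None else ema_update(ema, weights)
    return ema if ema is not None else weights
```

Averaging only near the end of training keeps the EMA close to the final, well-converged weights instead of mixing in early, noisy iterates.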
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
Quantization
  • GPTQ: scope: all
  • mixed int5/int6/int8: scope: MLP, attention, embeddings
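As a stand-in sketch of the quantization step: plain round-to-nearest symmetric quantization with per-group bit widths. Real GPTQ additionally compensates rounding error using second-order (Hessian) information, which is omitted here; the bit assignment per group is an assumption loosely matching the mixed int5/int6/int8 scheme.

```python
# Round-to-nearest symmetric quantization (GPTQ's Hessian-compensated
# rounding is omitted). The BITS assignment per group is an assumption.
def quantize_symmetric(w, bits):
    """Quantize a list of floats to signed `bits`-bit ints plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(x) for x in w)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

# Hypothetical bit widths per tensor group.
BITS = {"mlp": 6, "attention": 5, "embeddings": 8}
```

Lower bit widths shrink the artifact (and compress better under zstd) at the cost of larger per-weight rounding error, which the GPTQ update partially cancels in the real pipeline.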
Compression
  • zstd: level: 22
Test-Time Training
  • full TTT: parameters: {"optimizer":"SGD","learning_rate":0.005,"momentum":0.9,"chunk_size":2048,"all_weights":true}

Novel Contributions

  • Depth recurrence with a 7-block shared-layer schedule to create 13 effective layers
  • GPTQ post-training quantization with Hessian-compensated rounding
  • Mixed-bit quantization across MLP, attention, and embeddings
  • SGD all-weights test-time training on 2048-token chunks
  • Combination of GPTQ and TTT that preserves gains when stacked
  • Use of XSA, SmearGate, and U-Net skip connections in a compact Transformer