PR #1422

open

Non-record: Depth Recurrence + GPTQ + SGD TTT (1.1172, 1xH100)

by swapp1990
val_bpb: 1.1172
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB

Training Techniques

Architecture
  • depth recurrence: shares middle transformer blocks across multiple layers to create 13 effective layers from 7 unique blocks. parameters: {"layers":13,"unique_blocks":7}
  • XSA: cross-sample attention applied to all unique blocks. parameters: {"blocks":7}
  • SmearGate: used in the SwiGLU MLP.
  • U-Net skip connections: U-Net-style skip connections added to the transformer.
  • weight tying: tied input and output embeddings.
  • MLP3x: 3x MLP expansion with SwiGLU. parameters: {"multiplier":3}
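The depth-recurrence idea above can be sketched as a layer schedule that maps 13 effective layers onto 7 unique blocks. The PR only states the counts and that the middle blocks are shared; the exact repeat pattern below is an assumption.

```python
# Hypothetical depth-recurrence schedule. The PR states only that 7 unique
# blocks yield 13 effective layers by sharing the middle blocks; the exact
# repeat pattern built here is an assumption.
def make_schedule(unique_blocks=7, effective_layers=13):
    """Build a layer schedule that reuses the middle blocks."""
    first = [0, 1]                                 # unshared entry blocks
    last = [unique_blocks - 2, unique_blocks - 1]  # unshared exit blocks
    middle = list(range(2, unique_blocks - 2))     # shared middle blocks
    schedule = list(first)
    while len(schedule) + len(last) < effective_layers:
        for b in middle:
            if len(schedule) + len(last) >= effective_layers:
                break
            schedule.append(b)
    return schedule + last

def forward(x, blocks, schedule):
    """Apply the (possibly repeated) blocks in schedule order."""
    for idx in schedule:
        x = blocks[idx](x)
    return x
```

Because the middle blocks appear multiple times in the schedule, parameter count stays at 7 blocks while compute depth is 13.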
Optimizer
  • Muon: weight_decay: null, momentum: null, other_params: {"matrix_lr":0.04,"scalar_lr":0.04}

Weight Averaging
  • EMA: parameters: {"during":"warmdown phase"}
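A minimal sketch of EMA weight averaging restricted to the warmdown phase. The decay constant (0.999) and the warmdown boundary are assumptions; the PR states only that EMA runs during warmdown.

```python
# Sketch of EMA weight averaging enabled only during the warmdown phase.
# The decay constant 0.999 and the warmdown_start boundary are assumptions.
def ema_update(ema, weights, decay=0.999):
    """One exponential-moving-average step over a dict of parameters."""
    return {k: decay * ema[k] + (1 - decay) * weights[k] for k in ema}

def train_with_warmdown_ema(weights, steps, warmdown_start, step_fn):
    """Run step_fn each step; track an EMA only once warmdown begins."""
    ema = None
    for step in range(steps):
        weights = step_fn(weights, step)
        if step >= warmdown_start:  # averaging starts at warmdown
            ema = dict(weights) if ema is None else ema_update(ema, weights)
    return ema if ema is not None else weights
```

Averaging only near the end of training keeps the EMA close to the final, well-converged weights instead of mixing in early, noisy iterates.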
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
Quantization
  • GPTQ: scope: all
  • mixed int5/int6/int8: scope: MLP, attention, embeddings
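As a stand-in sketch of the quantization step: plain round-to-nearest symmetric quantization with per-group bit widths. Real GPTQ additionally compensates rounding error using second-order (Hessian) information, which is omitted here; the bit assignment per group is an assumption loosely matching the mixed int5/int6/int8 scheme.

```python
# Round-to-nearest symmetric quantization (GPTQ's Hessian-compensated
# rounding is omitted). The BITS assignment per group is an assumption.
def quantize_symmetric(w, bits):
    """Quantize a list of floats to signed `bits`-bit ints plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(x) for x in w)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

# Hypothetical bit widths per tensor group.
BITS = {"mlp": 6, "attention": 5, "embeddings": 8}
```

Lower bit widths shrink the artifact (and compress better under zstd) at the cost of larger per-weight rounding error, which the GPTQ update partially cancels in the real pipeline.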
Compression
  • zstd: level: 22
Test-Time Training
  • full TTT: parameters: {"optimizer":"SGD","learning_rate":0.005,"momentum":0.9,"chunk_size":2048,"all_weights":true}

Novel Contributions

  • Depth recurrence with a 7-block shared-layer schedule to create 13 effective layers
  • GPTQ post-training quantization with Hessian-compensated rounding
  • Mixed-bit quantization across MLP, attention, and embeddings
  • SGD all-weights test-time training on 2048-token chunks
  • Combination of GPTQ and TTT that preserves gains when stacked
  • Use of XSA, SmearGate, and U-Net skip connections in a compact Transformer