val_bpb: 1.1574
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 16.41 MB
Training Techniques
Architecture
XSA
Cross-layer shared attention (XSA): attention weights are shared between layers; the reported experiment uses the all-layers variant.
parameters: {"layers":"all"}
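A minimal sketch of the idea, assuming XSA means one set of attention projections reused by every layer (`"layers": "all"`); the single-head setup and dimensions are illustrative, not the experiment's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq = 64, 4, 8

# One set of attention projections, shared by every layer.
Wq = rng.normal(0, 0.02, (d_model, d_model))
Wk = rng.normal(0, 0.02, (d_model, d_model))
Wv = rng.normal(0, 0.02, (d_model, d_model))

def shared_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    # causal mask: each position attends only to itself and earlier positions
    scores = np.where(np.tril(np.ones((seq, seq), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

x = rng.normal(size=(seq, d_model))
for _ in range(n_layers):  # every layer reuses the same projections
    x = x + shared_attention(x)
```

Sharing the projections removes per-layer attention parameters, which is what makes the technique attractive for small artifacts.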
Value Embeddings
Shared value embeddings are added into the V stream at later attention layers to reinject token identity.
parameters: {"ve_dim":128,"layers":[10,11]}
MLP3x
Three-times wider MLP with LeakyReLU activation.
parameters: {"multiplier":3,"activation":"LeakyReLU"}
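The MLP block then looks like the following sketch (model width is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, multiplier = 64, 3
d_hidden = multiplier * d_model  # hidden layer 3x wider than d_model

W1 = rng.normal(0, 0.02, (d_model, d_hidden))
W2 = rng.normal(0, 0.02, (d_hidden, d_model))

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def mlp3x(x):
    return leaky_relu(x @ W1) @ W2

y = mlp3x(rng.normal(size=(8, d_model)))
```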
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
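A sketch of partial RoPE under the common rotate-the-first-dims convention: 16 of the 64 head dimensions are rotated, the remaining 48 pass through untouched.

```python
import numpy as np

head_dim, rot_dim = 64, 16  # "16/64": rotate only the first 16 dims

def partial_rope(x, positions, base=10000.0):
    """Rotate the first rot_dim dims of each head; pass the rest through."""
    half = rot_dim // 2
    freqs = base ** (-np.arange(half) / half)    # (half,)
    angles = positions[:, None] * freqs[None, :] # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dim:]], axis=-1)

x = np.ones((8, head_dim))
out = partial_rope(x, np.arange(8))
```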
U-Net skip connections
Skip connections added in a U-Net style across the network.
parameters: null
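A sketch of the U-Net pattern over transformer blocks: outputs of the first half of the layers are stacked and added back into the mirrored layers of the second half. The stand-in block is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 8, 32
x = rng.normal(size=(4, d))

def block(h):
    return h + 0.1 * np.tanh(h)  # stand-in for a transformer block

stack, h = [], x
for _ in range(n_layers // 2):   # encoder half: save activations
    h = block(h)
    stack.append(h)
for _ in range(n_layers // 2, n_layers):  # decoder half: add them back
    h = h + stack.pop()          # skip from the mirrored layer
    h = block(h)
```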
BigramHash
Hashed bigram embedding table used as an input component.
parameters: {"buckets":10240}
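A sketch of a hashed bigram embedding with the reported 10240 buckets; the hash multiplier and embedding width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
BUCKETS, d_model = 10240, 64
bigram_table = rng.normal(0, 0.02, (BUCKETS, d_model))

def bigram_hash(prev_tok, tok):
    """Hash a (previous, current) token pair into one of BUCKETS rows."""
    return (prev_tok * 1_000_003 + tok) % BUCKETS  # illustrative hash

def bigram_embed(tokens):
    # position 0 has no previous token; pair it with a sentinel 0
    prev = np.concatenate([[0], tokens[:-1]])
    return bigram_table[bigram_hash(prev, tokens)]

emb = bigram_embed(np.array([5, 17, 17, 9]))
```

The hashed table adds local bigram information to the input without a full vocab-squared table; collisions are accepted as noise.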
Weight Averaging
EMA (exponential moving average of weights)
parameters: {"decay":0.997,"qat_reset":true}
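A sketch of the EMA with the reported decay and the QAT reset: when quantization-aware training switches on, the pre-quantization average is discarded so the EMA tracks the quantized regime.

```python
class EMA:
    """Exponential moving average of weights (decay = 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v

    def qat_reset(self, params):
        # "qat_reset": true — restart the average at QAT activation
        self.shadow = dict(params)

ema = EMA({"w": 0.0})
for step in range(100):
    ema.update({"w": 1.0})
pre_reset = ema.shadow["w"]  # still dragged toward the old regime
ema.qat_reset({"w": 1.0})    # snap to the current (quantized) weights
```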
Quantization
mixed int6/int4
bits: 6
scope: attn/MLP/bigram
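A sketch of symmetric per-tensor integer quantization at the two reported bit widths; which tensors get 4 vs 6 bits is not specified here, so the assignment below is illustrative:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q6, s6 = quantize(w, bits=6)  # e.g. attention / MLP weights
q4, s4 = quantize(w, bits=4)  # e.g. lower-precision tensors
err6 = np.abs(dequantize(q6, s6) - w).max()
err4 = np.abs(dequantize(q4, s4) - w).max()
```

The quantized integer tensors are then byte-packed and zstd-compressed to produce the small artifact.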
Compression
zstd
level: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"async_reduce_scatter":true,"no_ddp":true}
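Muon's core step orthogonalizes the momentum/gradient matrix with a Newton–Schulz iteration; the "parallel" part (async reduce-scatter over banked gradients, no DDP) is distributed plumbing omitted from this sketch. The quintic coefficients below are the ones used by common Muon implementations; this is an assumption, not this run's exact code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G, as in Muon's update rule."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 32))
O = newton_schulz_orthogonalize(G)
s = np.linalg.svd(O, compute_uv=False)  # singular values driven toward 1
```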
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"legal":true,"score_first":true}
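A toy sketch of full test-time training with the reported learning rate and epoch count: the model is fine-tuned on the test sequence itself, and scoring before adapting loosely mirrors `score_first`. The linear model and MSE loss are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr, epochs = 16, 0.002, 3            # lr/epochs from the reported config

W = rng.normal(0, 0.1, (d, d))          # "pretrained" weight (stand-in)
X = rng.normal(size=(32, d))            # the test sequence itself
Y = X @ rng.normal(0, 0.1, (d, d))      # next-token targets (stand-in)

def loss(w):
    return np.mean((X @ w - Y) ** 2)

before = loss(W)                        # score first, then adapt
for _ in range(epochs):                 # full TTT: update all weights
    grad = 2 * X.T @ (X @ W - Y) / len(X)
    W = W - lr * grad
after = loss(W)
```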
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
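The formula fixes each layer's LayerNorm output scale by depth rather than learning it:

```python
import math

def ln_scale(layer):
    """Per-layer LayerNorm output scale: 1 / sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(l) for l in range(4)]  # decays with depth
```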
Other
other
Late QAT trigger based on wallclock fraction of the training budget.
parameters: {"fraction":0.65}
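The trigger can be sketched as a wallclock check against the training budget; function and argument names are illustrative:

```python
import time

QAT_FRACTION = 0.65  # from the reported config

def qat_active(start, budget_seconds, now=None):
    """Enable QAT once 65% of the wallclock budget has elapsed."""
    now = time.time() if now is None else now
    return (now - start) / budget_seconds >= QAT_FRACTION

start = 1000.0
a = qat_active(start, 100.0, now=1050.0)  # 50% of budget elapsed
b = qat_active(start, 100.0, now=1070.0)  # 70% of budget elapsed
```

Triggering on wallclock rather than step count keeps the QAT phase a fixed share of the time budget even if throughput varies.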
Novel Contributions
- Value embeddings injected into later attention layers
- Parallel Muon with async reduce-scatter on banked gradients
- Banked model tensors for qo/kv/mlp_up/mlp_down
- EMA with QAT-reset at quantization activation
- Mixed INT4/INT6 quantization with zstd compression
- XSA-all experiment achieving a new best validation quality (val_bpb)