PR #1453

open

Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean)

by iverbovoy
val_bpb
1.1324
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.40 MB

Training Techniques

Architecture
depth recurrence
Three shared transformer blocks repeated 4 times for 12 effective layers, with cross-repeat skip connections and per-repeat loop embeddings.
parameters: {"layers":3,"repeats":4,"effective_layers":12}
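The depth-recurrence idea can be sketched in a few lines: the same 3 blocks are applied 4 times, with a loop embedding telling the shared weights which repeat they are in. The exact form of the cross-repeat skip and loop embeddings is not specified in the PR, so the additive versions below are assumptions; the toy affine "blocks" are purely illustrative.

```python
import numpy as np

def depth_recurrent_forward(x, blocks, repeats, loop_embeds):
    """Apply the same weight-shared blocks `repeats` times.

    3 blocks x 4 repeats = 12 effective layers. Additive loop
    embeddings and cross-repeat skips are assumed forms, not the
    submission's exact implementation.
    """
    prev = None
    for r in range(repeats):
        x = x + loop_embeds[r]      # loop embedding marks the repeat index
        if prev is not None:
            x = x + prev            # cross-repeat skip (assumed additive)
        prev = x
        for block in blocks:        # the same 3 blocks, weights shared
            x = block(x)
    return x

# Toy usage: 3 "blocks" as small affine maps, repeated 4 times.
rng = np.random.default_rng(0)
dim = 8
blocks = []
for _ in range(3):
    W = rng.standard_normal((dim, dim))
    blocks.append(lambda h, W=W: h + 0.01 * (h @ W))
loop_embeds = 0.01 * rng.standard_normal((4, dim))
out = depth_recurrent_forward(np.zeros(dim), blocks, repeats=4,
                              loop_embeds=loop_embeds)
print(out.shape)  # (8,)
```

The key property is that only 3 blocks' worth of parameters are stored while the forward pass has depth 12, which is what lets the model fit the artifact budget.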
MLP3x
Wider 3x MLP configuration used to increase model capacity.
parameters: {"multiplier":3,"hidden_dim":2640}
XSA
Exclusive self-attention applied to the last 4 effective layers to prevent attention collapse.
parameters: {"last_n":4}
LeakyReLU
Squared LeakyReLU activation, LeakyReLU(x, slope=0.5)^2, used for better gradient flow in deep recurrent models.
parameters: {"slope":0.5}
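The activation is simple enough to write out directly. A plain elementwise square of the LeakyReLU output is assumed here (so negative inputs also map to positive values):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise.

    The slope keeps a nonzero gradient for negative inputs, and the
    square smooths the kink at zero.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```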
Quantization
mixed int7/int5
bits: 7 (attention), 5 (MLP)
scope: attention and MLP weights
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.6}
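The SWA parameters translate to: starting at 60% of training, add a checkpoint to a running weight average every 30 iterations. A sketch under that reading (the incremental-mean update is the standard SWA formulation, not copied from the submission):

```python
def swa_checkpoints(total_iters, every=30, start_frac=0.6):
    """Iterations at which weights enter the stochastic weight average."""
    start = int(total_iters * start_frac)
    return [t for t in range(total_iters)
            if t >= start and (t - start) % every == 0]

def swa_update(avg, w, n):
    """Incremental mean: fold checkpoint w into an average of n checkpoints."""
    return (avg * n + w) / (n + 1)

print(swa_checkpoints(300))  # [180, 210, 240, 270]

# Scalar toy: averaging checkpoints 1.0, 3.0, 5.0 gives 3.0.
avg = 1.0
for i, w in enumerate([3.0, 5.0]):
    avg = swa_update(avg, w, i + 1)
print(avg)  # 3.0
```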
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.018,"scalar_lr":0.018,"tied_embed_lr":0.021}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
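Sliding-window evaluation feeds overlapping windows to the model but scores each token only once, so every scored token sees substantial left context. The PR leaves the parameters unspecified, so the window and stride below are illustrative assumptions:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, score_from) pairs for sliding-window eval.

    Window [start, start + window) is fed to the model, but only
    tokens from `score_from` onward contribute to val_bpb, so each
    token is scored exactly once. Window/stride are assumed values.
    """
    start = 0
    while start < n_tokens:
        score_from = 0 if start == 0 else start + (window - stride)
        yield start, score_from
        if start + window >= n_tokens:
            break
        start += stride

print(list(sliding_windows(2048)))  # [(0, 0), (512, 1024), (1024, 1536)]
```

Here the three windows score token ranges [0, 1024), [1024, 1536), and [1536, 2048): full coverage with no double counting.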
hedge mixer
parameters: {"experts":5,"parallel_gpus":8}
LR Schedule
warmdown
parameters: {"iters":3000}
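"Warmdown" here means the learning rate is held constant and then annealed over the final 3000 iterations. A linear ramp to zero is assumed (the PR does not state the shape), and the 6000-step total below is a hypothetical value for illustration; 0.018 is the submission's matrix_lr.

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    `warmdown_iters` steps (assumed linear shape)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

print(lr_at(0, 6000, 0.018))     # 0.018 (constant phase)
print(lr_at(4500, 6000, 0.018))  # 0.009 (halfway through warmdown)
```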
Regularization
logit softcap
parameters: {"value":30}
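Logit softcapping with value 30 bounds the logits smoothly rather than clipping them, using the standard cap * tanh(logits / cap) form:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly squash logits into (-cap, cap).

    Near zero this is approximately the identity; large logits
    saturate toward +/-cap, keeping the loss well-conditioned.
    """
    return cap * np.tanh(logits / cap)

capped = softcap(np.array([0.0, 30.0, 1000.0]))
print(capped)
```

A logit of 0 passes through unchanged, 30 maps to about 22.85, and extreme values like 1000 saturate at the cap.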
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Int7 attention with Int5 MLP mixed quantization to fit a wider model within the 16 MB budget
  • Depth recurrence with 3 shared blocks repeated 4 times for 12 effective layers
  • Parallelized hedge mixer evaluation across 8 GPUs to reduce eval time
  • Wider 3x MLP enabled by the size budget saved through mixed quantization
  • Progressive depth training with earlier phase transitions and SWA