PR #325

open

Add Looped Transformer Design non-record submission (non tuned)

by Aum08Desai
val_bpb: 1.1462
Architecture: Looped Transformer
Optimizer: Muon
Artifact Size: 15,589,099 bytes

Training Techniques

Architecture
depth recurrence / looped transformer
Transformer whose shared recurrent core is executed repeatedly (looped) to increase effective depth.
parameters: {"num_layers":6,"loop_core_layers":2,"loop_repeats":5,"loop_attn_every":2,"effective_executed_layers":14}
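The listed parameters are consistent with a schedule where the 4 layers outside the 2-layer shared core run once and the core runs 5 times. A minimal sketch of that depth arithmetic (helper name is my own, not from the submission):

```python
def effective_executed_layers(num_layers, loop_core_layers, loop_repeats):
    # Layers outside the shared core execute once; the core executes loop_repeats times.
    unique_once = num_layers - loop_core_layers
    return unique_once + loop_core_layers * loop_repeats

# Matches the submission's "effective_executed_layers": 14
print(effective_executed_layers(num_layers=6, loop_core_layers=2, loop_repeats=5))
```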
RoPE
Partial rotary positional embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16}
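With `dimensions: 16`, only the first 16 dimensions of each head vector are rotated and the rest pass through unchanged. A minimal pure-Python sketch of partial RoPE under that assumption (the frequency schedule and dimension layout are standard RoPE conventions, not confirmed by the submission):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first rot_dims dimensions (in pairs); leave the rest untouched.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))  # standard RoPE frequencies over the rotated slice
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Because the rotation is norm-preserving on the rotated slice and the identity elsewhere, attention scores beyond the first 16 dimensions are position-independent.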
KV head count
Uses fewer KV heads than query heads (grouped-query attention).
parameters: {"num_heads":10,"num_kv_heads":5}
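With 10 query heads and 5 KV heads, each KV head is shared by a group of 2 query heads. A sketch of the usual grouped-query head mapping (the grouping is the standard contiguous scheme; the submission does not specify it):

```python
def kv_head_for_query(q_head, num_heads=10, num_kv_heads=5):
    # Each KV head serves a contiguous group of query heads,
    # halving the KV-cache and KV projection size here.
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head
    return q_head // group_size
```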
XSA
Includes XSA attention extras over the last few tokens.
parameters: {"last_n":4}
Bigram features
Adds token-side bigram vocabulary and embedding features.
parameters: {"bigram_vocab_size":2048,"bigram_dim":128}
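The bigram feature maps each (previous, current) token pair into a 2048-entry bigram vocabulary whose 128-d embeddings augment the token features. The submission does not state the pair-to-bucket mapping; a plausible hashed sketch:

```python
def bigram_ids(tokens, bigram_vocab_size=2048):
    # Hash each (previous, current) token pair into the bigram vocab;
    # the looked-up 128-d bigram embedding would be added to the token features.
    # The hash scheme here is illustrative, not the submission's.
    ids = [0]  # first token has no predecessor; reserve bucket 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(hash((prev, cur)) % bigram_vocab_size)
    return ids
```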
Quantization
int6 QAT
bits: 6
scope: all
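int6 QAT keeps weights in float during training but snaps them to the signed 6-bit grid in the forward pass (gradients typically flow through via a straight-through estimator). A minimal symmetric per-tensor fake-quantization sketch; the actual scaling scheme is an assumption:

```python
def fake_quant_int6(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to the signed
    # int6 grid [-32, 31] but keep them as floats, as in a QAT forward pass.
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in w) / qmax
    if scale == 0:
        return list(w)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in w]
```

At int6, each weight needs at most 6 bits before entropy coding, which is what lets the artifact fit the size budget after zstd.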
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
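EMA with decay 0.997 maintains a slow-moving average of the weights, typically used for evaluation instead of the raw training weights. A minimal per-step update sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of weights: the EMA copy trails the
    # training weights and is usually the copy that gets evaluated.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```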
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"exact":true}
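`"exact": true` presumably means every token is scored with substantial left context rather than cold-started at chunk boundaries. A sketch of the usual sliding-window scheduling (the `stride` value and helper name are mine, not from the submission):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=256):
    # Each step scores `stride` new tokens while the window supplies up to
    # window - stride tokens of left context, so after the first chunk no
    # token is scored without context.
    spans = []  # (ctx_start, score_start, score_end) triples
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```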
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
layerwise LN scale
parameters: {"ln_scale":1}
Other
other
Late quantization-aware training applied after initial training.
parameters: {"late_qat":1,"qat_threshold":0.1}

Novel Contributions

  • Looped transformer with a shared recurrent core
  • Partial RoPE with LN scaling
  • Late QAT for int6 artifact fitting
  • XSA attention over the last 4 tokens
  • Bigram token-side features
  • Demonstrates a non-record recurrent-depth design point under the 10-minute and 16MB constraints