PR #290 (open)

Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354)

by ibarrajo

val_bpb: 1.1354
Architecture: 11L Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.85 MB

Training Techniques

Quantization: int6 (bits: 6, scope: all)
Compression: zstd (level: 22)
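The weights are stored as 6-bit integers and then zstd-compressed at level 22. A minimal sketch of symmetric int6 quantization, assuming a single per-tensor scale (the PR may quantize per-channel; bit-packing and the zstd step are omitted here):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: codes lie in [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

In the artifact the 6-bit codes would additionally be bit-packed before zstd compression, which is what brings the 11-layer model under the 16MB limit.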
Architecture
XSA: partial exclusive self-attention applied only to the last 3 layers to debias self-attention efficiently in a GQA-aware way (layers: 3)
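"Exclusive" self-attention is not defined in this summary; one plausible reading is that the diagonal is masked so a token never attends to itself, removing the self-token bias. A sketch under that assumption (single head, no GQA), together with the partial application to the last 3 of 11 layers:

```python
import numpy as np

def attention(q, k, v, exclude_self=False):
    """Causal softmax attention. With exclude_self=True the diagonal is
    masked so a token cannot attend to itself (assumed reading of XSA;
    the first token keeps its diagonal since it has no other target)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal mask
    if exclude_self:
        np.fill_diagonal(mask, False)
        mask[0, 0] = True                          # token 0 must attend somewhere
    scores = np.where(mask, scores, -1e30)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Partial application: only the last 3 of 11 layers use XSA.
use_xsa = [layer >= 11 - 3 for layer in range(11)]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(q, k, v, exclude_self=True)  # row 1 can only see position 0
```

The diagonal-masking interpretation is an assumption; the PR's actual XSA may differ, and a GQA-aware version would share key/value heads across query groups.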
RoPE: extended positional encoding using a larger RoPE base (base: 50000)
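Raising the RoPE base from the usual 10000 to 50000 slows the rotation of every frequency band, stretching the positional wavelengths. A minimal sketch of the rotation (pairwise layout assumed; some implementations split halves instead):

```python
import numpy as np

def rope_freqs(head_dim, base):
    """Per-pair rotation frequencies; a larger base gives longer wavelengths."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, base=50000):
    """x: (seq, head_dim). Rotate consecutive dimension pairs by
    position-dependent angles; norms are preserved."""
    seq, d = x.shape
    angles = np.outer(np.arange(seq), rope_freqs(d, base))  # (seq, d//2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
y = apply_rope(x)
```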
SmearGate: custom gating mechanism used in the base architecture
BigramHash: bigram hashing into 2048 buckets, used in the base architecture
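Bigram hashing maps each (previous token, current token) pair into one of 2048 buckets, typically to index an auxiliary embedding table. A sketch with a multiplicative hash (the multiplier is an arbitrary odd constant, not taken from the PR):

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Hash a (prev, current) token pair into a fixed number of buckets.
    2654435761 is Knuth's multiplicative constant (illustrative choice)."""
    return (prev_token * 2654435761 + token) % buckets

# One bucket id per adjacent token pair in a sequence.
ids = [bigram_bucket(a, b) for a, b in zip(range(100), range(1, 101))]
```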
Test-Time Training: full TTT (epochs: 3, learning_rate: 0.002, freeze_blocks: 2)
Optimizer
Muon: weight_decay 0.04, momentum 0.99 (warmed up from 0.92 over 1500 steps)
AdamW: weight_decay 0.04, learning_rate 0.025
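The Muon momentum is not fixed at 0.99 but warmed up from 0.92 over the first 1500 steps. A sketch of that schedule, assuming linear interpolation (Muon's Newton-Schulz orthogonalization step is not shown):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm the momentum coefficient from `start` to `end`
    over `warmup_steps`, then hold it constant."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```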
Weight Averaging: SWA (checkpoints_averaged: 7)
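SWA here uniformly averages the parameters of 7 checkpoints from late in training. A minimal sketch over parameter dicts (uniform weighting assumed):

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of parameter dicts from the last k checkpoints."""
    keys = checkpoints[0].keys()
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints) for k in keys}

# 7 toy checkpoints whose single tensor is filled with 0..6; average is 3.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```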
Evaluation: sliding window eval (stride: 64)
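Sliding-window evaluation advances a 2048-token window by 64 tokens at a time and scores only the newly exposed tokens, so almost every token is evaluated with near-full left context. A sketch of the span bookkeeping (the forward passes themselves are omitted):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, n_scored) spans: each window ends `stride` tokens
    past the previous one; only the new tokens are scored, the rest of the
    window serves as context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(5000)  # every token is scored exactly once
```

The smaller the stride, the more forward passes are needed, but the better the context each scored token sees.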
Sequence Length: train 2048, eval 2048
LR Schedule: warmdown (warmdown_iters: 3000, warmup_steps: 1500)
Initialization: OrthoInit (orthogonal initialization used in the base architecture)

Novel Contributions

  • Partial XSA applied to the last 3 layers
  • Test-time training with 3-epoch full-model SGD and early block freezing
  • Batch size optimization to 524K tokens for more gradient updates
  • RoPE base increased to 50K
  • Sliding-window evaluation with stride 64
  • Int6 quantization with zstd-22 compression under the 16MB limit
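The 524K-token batches (524288 = 2^19 tokens, i.e. 256 sequences of length 2048) would normally be reached via gradient accumulation. A sketch of the arithmetic; the per-device micro-batch size of 32 is a hypothetical value, not stated in the PR:

```python
def accumulation_steps(target_tokens=524288, seq_len=2048, micro_batch=32):
    """Gradient-accumulation steps needed to reach ~524K tokens per update.
    micro_batch (sequences per forward pass) is an assumed value."""
    tokens_per_micro = seq_len * micro_batch   # 2048 * 32 = 65536 tokens
    return target_tokens // tokens_per_micro
```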