PR #1127
openRecord: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed)
by dentity007
val_bpb: 1.1311
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.5 MB
Training Techniques
Architecture
XSA
Applied XSA to the deepest 4 layers.
parameters: {"layers":4}
Partial RoPE
Applied rotary positional embeddings to only the first 16 of the 64 per-head dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":"16/64"}
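A minimal sketch of partial RoPE on a single head vector, assuming the standard interleaved-pair rotation and the record's 16/64 split; the PR's actual implementation (and its frequency base) is not shown, so `base=10000.0` is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of a per-head vector `x`
    (a list of floats) by position-dependent angles; dims beyond
    `rot_dims` pass through unchanged -- the "partial" in Partial RoPE."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        # standard RoPE frequency schedule over the rotated sub-space
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because rotation is norm-preserving, only the rotated 16 dims change with position while the other 48 stay position-independent.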
BigramHash
Added bigram hash embeddings with SmearGate.
parameters: {"buckets":8192,"dim":128}
SmearGate
Enabled SmearGate alongside BigramHash.
parameters: null
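A hedged sketch of the bigram-hash idea with the record's 8192 buckets and 128-dim table: hash each (previous, current) token pair into a bucket, look up an embedding, and add it through a gate. The hash multiplier, Gaussian init, and scalar gate are all illustrative assumptions; the PR does not show SmearGate's actual mechanics:

```python
import random

def bigram_bucket(prev_id, cur_id, buckets=8192):
    # Illustrative multiplicative hash of the token pair; the PR's
    # actual hash function is not shown.
    return ((prev_id * 1000003) ^ cur_id) % buckets

class BigramHashEmbedding:
    """Hashed bigram embedding table: 8192 buckets x 128 dims per the record."""
    def __init__(self, buckets=8192, dim=128, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(buckets)]

    def lookup(self, prev_id, cur_id):
        return self.table[bigram_bucket(prev_id, cur_id, self.buckets)]

def smear_gate_add(h, bigram_vec, gate):
    # Assumed SmearGate behavior: a gate in [0, 1] scales how much of the
    # bigram-hash embedding is mixed into the hidden state.
    return [hi + gate * bi for hi, bi in zip(h, bigram_vec)]
```

In training the table entries and the gate would be learned; here they are fixed for illustration.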
MLP3x
Widened the MLP hidden layer to roughly 3x the model dimension.
parameters: {"hidden_size":1536}
KV head count
Set the number of key/value heads to 4, so multiple query heads share each KV head.
parameters: {"heads":4}
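Sharing 4 KV heads across a larger set of query heads is the grouped-query-attention pattern; a minimal sketch of the head expansion (the query-head count here is an assumption, not stated in the record):

```python
def expand_kv(kv_heads, n_q_heads):
    """Grouped-query attention sharing: repeat each of the KV heads so
    that every group of query heads attends against the same KV head."""
    group = n_q_heads // len(kv_heads)  # query heads per KV head
    return [h for h in kv_heads for _ in range(group)]
```

With 4 KV heads this shrinks the KV projection (and KV cache) to 4/n_q of the multi-head size.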
U-Net skip connections
Used U-Net style skip connections in the 11-layer architecture.
parameters: {"layers":11}
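A sketch of U-Net skips over an odd layer count such as 11: first-half activations are saved on a stack and added back into the mirrored second-half layers, with the middle layer unpaired. How the PR combines the skip (add vs. learned mix) is not shown, so the plain addition is an assumption:

```python
def unet_forward(x, layers):
    """U-Net style skip connections over len(layers) blocks (e.g. 11):
    save outputs of the first half, add them back in reverse order to the
    inputs of the mirrored second-half blocks; the middle block has no skip."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            x = layer(x)
            saved.append(x)      # stash for the mirrored layer
        elif i == half:
            x = layer(x)         # middle layer, unpaired
        else:
            x = layer(x + saved.pop())  # add mirrored activation
    return x
```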
Weight Averaging
EMA
parameters: {"decay":0.9985}
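The EMA update with the record's decay of 0.9985, as a minimal sketch over flat parameter lists (real implementations track tensors per-parameter, but the arithmetic is the same):

```python
class EMA:
    """Exponential moving average of model weights: shadow weights are
    blended toward the live weights by (1 - decay) each step."""
    def __init__(self, params, decay=0.9985):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

Evaluation then uses `shadow` in place of the raw weights; at decay 0.9985 the average has an effective horizon of roughly 1/(1-0.9985) ≈ 667 steps.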
Quantization
late QAT
bits: 6
scope: model weights
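The forward pass of int6 fake quantization, sketched for a single scalar weight: quantize to the 6-bit signed range [-32, 31], then dequantize. In QAT the backward pass uses a straight-through estimator, passing gradients through the rounding; how the PR's STE threshold of 0.15 is applied (likely a clipping bound on the pass-through gradient) is not specified, so it is omitted here:

```python
def fake_quant_int6(w, scale):
    """Int6 fake quantization forward pass: round w/scale to the nearest
    integer, clip to the signed 6-bit range [-32, 31], and dequantize.
    Training sees quantized values while weights stay in float."""
    q = max(-32, min(31, round(w / scale)))
    return q * scale
```

Values outside ±32·scale saturate at the range limits, which is what the clipping below demonstrates.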
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"epochs":1}
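A sketch of the LoRA structure behind the test-time-training step: a frozen base weight plus a rank-8 adapter `B @ A`, where only A and B would be updated during the single pass (lr 0.01, 1 epoch per the record). The init below zeroes A so the adapter starts as a no-op (conventionally B is the zeroed side; either works); the update loop itself is omitted since the PR's TTT objective is not shown:

```python
class LoRALinear:
    """Linear layer with a rank-r LoRA adapter: y = W x + B (A x).
    The base weight W is frozen; only A and B train at test time."""
    def __init__(self, weight, rank=8):
        rows, cols = len(weight), len(weight[0])
        self.W = weight
        self.A = [[0.0] * cols for _ in range(rank)]    # zeroed side: adapter
        self.B = [[0.01] * rank for _ in range(rows)]   # starts as a no-op

    def forward(self, x):
        base = [sum(wr[j] * x[j] for j in range(len(x))) for wr in self.W]
        ax = [sum(ar[j] * x[j] for j in range(len(x))) for ar in self.A]
        delta = [sum(br[k] * ax[k] for k in range(len(ax))) for br in self.B]
        return [b + d for b, d in zip(base, delta)]
```

At rank 8 the adapter adds only r·(rows + cols) trainable values per layer, which keeps the one-pass update cheap.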
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
cosine decay
parameters: null
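The cosine decay schedule, as a minimal sketch: the LR falls from its peak at step 0 to a floor at the final step along a half cosine. Warmup and the actual peak/floor values are not given in the record, so they are left as parameters:

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr=0.0):
    """Cosine decay: max_lr at step 0, min_lr at total_steps, following
    half a cosine wave in between."""
    t = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```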
Novel Contributions
- 11-layer compressed Transformer (MODEL_DIM=480) fitting under the 16 MB artifact limit
- EMA with decay 0.9985
- Partial RoPE using 16/64 dimensions
- Late int6 QAT with STE threshold 0.15
- Single-pass LoRA test-time training
- XSA on the deepest 4 layers
- BigramHash with SmearGate
- int6 plus zstd-22 artifact compression