PR #1938
openS0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713)
by lijuncheng16
val_bpb
1.0713
Architecture
Transformer
Optimizer
Muon
Artifact Size
16.09 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
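A minimal sketch of weight tying: one embedding matrix serves as both the input lookup and, transposed, the output projection, so no separate unembedding block is stored. Sizes here are illustrative, not the PR's actual dimensions.

```python
import numpy as np

class TiedLM:
    """Tied input/output embeddings: self.E is used for both lookup
    and the final logit projection (transposed)."""
    def __init__(self, vocab=256, d_model=32, rng=None):
        rng = rng or np.random.default_rng(0)
        self.E = rng.standard_normal((vocab, d_model)) * 0.02

    def embed(self, ids):
        return self.E[ids]            # (len(ids), d_model)

    def logits(self, h):
        return h @ self.E.T           # tied unembedding -> (..., vocab)

lm = TiedLM()
h = lm.embed([1, 2, 3])
out = lm.logits(h)
```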
Partial RoPE
Applied rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Repeated selected layers during the forward pass to increase effective depth.
parameters: {"layers":[3,4,5],"loops":2}
U-Net skip connections
Added skip connections from later layers in a U-Net-like pattern.
parameters: {"start_layer":8}
Squared LeakyReLU
Used a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
SmearGate
Mixed neighboring token representations with a sliding window gate.
parameters: {"gate_window":12}
Gated Attention
Used sparse attention gates to selectively prune attention patterns.
parameters: null
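The card does not specify where the attention gates sit; a common variant applies a learned per-head sigmoid gate to the attention output, letting the model down-weight (effectively prune) unhelpful heads. A sketch under that assumption:

```python
import numpy as np

def gated_attention_output(attn_out, gate_logits):
    """Per-head sigmoid gate on attention output.
    attn_out: (seq_len, n_heads, head_dim); gate_logits: (n_heads,)."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid, one gate per head
    return attn_out * gates[None, :, None]

out = gated_attention_output(np.ones((2, 4, 8)), np.zeros(4))
```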
KV head count
Used grouped-query style attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Regularization
logit softcap
Applied a tanh-based soft cap to bound the output logits.
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"backend_steps":5,"matrix_lr":0.026,"embed_lr":0.6,"scalar_lr":0.02,"gradient_clip":0.3}
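Muon's core step orthogonalizes each 2-D gradient (or momentum buffer) with a quintic Newton-Schulz iteration before applying it at `matrix_lr`. The sketch below uses the coefficients from the commonly used Muon implementation; mapping `backend_steps: 5` to the iteration count is an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 (approximate
    orthogonalization) via 5 quintic Newton-Schulz steps.
    Momentum (0.97 here) would be folded into G beforehand;
    the update is then W -= matrix_lr * result."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):                # backend_steps = 5 in this PR
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
O = newton_schulz_orthogonalize(G)
```

After five steps the singular values cluster near 1, which is why the result behaves like an orthogonalized update direction.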
AdamW
weight_decay: 0.085
momentum: null
other_params: {"scope":"embeddings"}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"scalars"}
Quantization
GPTQ
bits: 6
scope: attention and MLP matrices
GPTQ
bits: 8
scope: tied embeddings
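For intuition about the bit widths above, here is a per-channel round-to-nearest quantizer as a simple stand-in. GPTQ proper additionally compensates rounding error column by column using second-order activation statistics, so this sketch only shows the storage format, not GPTQ's error correction:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Per-row (per-output-channel) symmetric round-to-nearest
    quantization to `bits` signed integer levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return Q.astype(np.int8), scale

def dequantize(Q, scale):
    return Q * scale

W = np.random.default_rng(0).standard_normal((4, 16))
Q, s = quantize_rtn(W, bits=6)
W_hat = dequantize(Q, s)
```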
LQER
bits: 4
scope: top-3 highest-error layers
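LQER corrects quantization by factorizing the residual error with a truncated SVD, so the deployed layer computes `Q_deq + A @ B` instead of the quantized weights alone. A minimal sketch for one weight matrix (the PR applies this to the three highest-error layers; rank and sizes here are illustrative):

```python
import numpy as np

def lqer_correct(W, bits=4, rank=8):
    """Quantize W to `bits`, then approximate the quantization error
    with a rank-`rank` factorization A @ B via truncated SVD."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q_deq = np.round(W / scale) * scale          # dequantized 4-bit weights
    E = W - Q_deq                                # quantization error
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # low-rank error factors
    B = Vt[:rank]
    return Q_deq, A, B

W = np.random.default_rng(1).standard_normal((32, 32))
Q_deq, A, B = lqer_correct(W)
err_plain = np.linalg.norm(W - Q_deq)
err_lqer = np.linalg.norm(W - (Q_deq + A @ B))
```

By the Eckart-Young theorem the rank-`rank` SVD is the best possible low-rank correction in Frobenius norm, so the corrected error is never worse than plain quantization.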
Compression
Brotli
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.001,"phases":1,"prefix_docs":2000}
LR Schedule
cosine decay
parameters: {"peak_lr":0.001}
warmdown
parameters: {"warmdown_fraction":0.75}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Sequence Length
sequence_length
train_length: null
eval_length: 48000
Novel Contributions
- Cap tokenizer: lowercased text plus a special capitalization marker token
- LQER low-rank quantization error correction on top-error layers
- Global phased test-time training with LoRA adaptation
- GPTQ int6/int8 mixed quantization with Brotli compression
- Depth recurrence, U-Net skip connections, and SmearGate-style architecture tweaks
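The Cap tokenizer idea above can be sketched as a reversible pre-tokenization pass: lowercase everything and emit a marker before each originally-capitalized word, so the vocabulary never needs separate cased variants. The marker name and word-level granularity are assumptions, and this simple version does not handle all-caps words:

```python
def cap_encode(text, cap_token="<CAP>"):
    """Lowercase the text, emitting `cap_token` before each word whose
    first letter was uppercase."""
    out = []
    for word in text.split():
        if word[:1].isupper():
            out.append(cap_token)
        out.append(word.lower())
    return out

def cap_decode(tokens, cap_token="<CAP>"):
    """Invert cap_encode: re-capitalize words that follow the marker."""
    words, capitalize = [], False
    for tok in tokens:
        if tok == cap_token:
            capitalize = True
        else:
            words.append(tok.capitalize() if capitalize else tok)
            capitalize = False
    return " ".join(words)

toks = cap_encode("The Cap tokenizer helps a lot")
```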