PR #1938
openS0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713)
by lijuncheng16
val_bpb
1.0713
Architecture
Transformer
Optimizer
Muon
Artifact Size
16.09 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
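A minimal sketch of weight tying: one embedding matrix serves as both the input lookup and, transposed, the output projection, so no separate unembedding block is stored. Sizes here are illustrative, not the PR's actual dimensions.

```python
import numpy as np

class TiedLM:
    """Tied input/output embeddings: self.E is used for both lookup
    and the final logit projection (transposed)."""
    def __init__(self, vocab=256, d_model=32, rng=None):
        rng = rng or np.random.default_rng(0)
        self.E = rng.standard_normal((vocab, d_model)) * 0.02

    def embed(self, ids):
        return self.E[ids]            # (len(ids), d_model)

    def logits(self, h):
        return h @ self.E.T           # tied unembedding -> (..., vocab)

lm = TiedLM()
h = lm.embed([1, 2, 3])
out = lm.logits(h)
```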
Partial RoPE
Applied rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Repeated selected layers during the forward pass to increase effective depth.
parameters: {"layers":[3,4,5],"loops":2}
U-Net skip connections
Added skip connections from later layers in a U-Net-like pattern.
parameters: {"start_layer":8}
Squared LeakyReLU
Used a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
SmearGate
Mixed neighboring token representations with a sliding window gate.
parameters: {"gate_window":12}
Gated Attention
Used sparse attention gates to selectively prune attention patterns.
parameters: null
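The card does not specify where the attention gates sit; a common variant applies a learned per-head sigmoid gate to the attention output, letting the model down-weight (effectively prune) unhelpful heads. A sketch under that assumption:

```python
import numpy as np

def gated_attention_output(attn_out, gate_logits):
    """Per-head sigmoid gate on attention output.
    attn_out: (seq_len, n_heads, head_dim); gate_logits: (n_heads,)."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid, one gate per head
    return attn_out * gates[None, :, None]

out = gated_attention_output(np.ones((2, 4, 8)), np.zeros(4))
```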
KV head count
Used grouped-query style attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Regularization
logit softcap
Applied a tanh-based soft cap to bound the output logits.
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"backend_steps":5,"matrix_lr":0.026,"embed_lr":0.6,"scalar_lr":0.02,"gradient_clip":0.3}
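Muon's core step orthogonalizes each 2-D gradient (or momentum buffer) with a quintic Newton-Schulz iteration before applying it at `matrix_lr`. The sketch below uses the coefficients from the commonly used Muon implementation; mapping `backend_steps: 5` to the iteration count is an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 (approximate
    orthogonalization) via 5 quintic Newton-Schulz steps.
    Momentum (0.97 here) would be folded into G beforehand;
    the update is then W -= matrix_lr * result."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):                # backend_steps = 5 in this PR
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
O = newton_schulz_orthogonalize(G)
```

After five steps the singular values cluster near 1, which is why the result behaves like an orthogonalized update direction.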
AdamW
weight_decay: 0.085
momentum: null
other_params: {"scope":"embeddings"}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"scalars"}
Quantization
GPTQ
bits: 6
scope: attention and MLP matrices
GPTQ
bits: 8
scope: tied embeddings
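For intuition about the bit widths above, here is a per-channel round-to-nearest quantizer as a simple stand-in. GPTQ proper additionally compensates rounding error column by column using second-order activation statistics, so this sketch only shows the storage format, not GPTQ's error correction:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Per-row (per-output-channel) symmetric round-to-nearest
    quantization to `bits` signed integer levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return Q.astype(np.int8), scale

def dequantize(Q, scale):
    return Q * scale

W = np.random.default_rng(0).standard_normal((4, 16))
Q, s = quantize_rtn(W, bits=6)
W_hat = dequantize(Q, s)
```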
LQER
bits: 4
scope: top-3 highest-error layers
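LQER corrects quantization by factorizing the residual error with a truncated SVD, so the deployed layer computes `Q_deq + A @ B` instead of the quantized weights alone. A minimal sketch for one weight matrix (the PR applies this to the three highest-error layers; rank and sizes here are illustrative):

```python
import numpy as np

def lqer_correct(W, bits=4, rank=8):
    """Quantize W to `bits`, then approximate the quantization error
    with a rank-`rank` factorization A @ B via truncated SVD."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q_deq = np.round(W / scale) * scale          # dequantized 4-bit weights
    E = W - Q_deq                                # quantization error
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # low-rank error factors
    B = Vt[:rank]
    return Q_deq, A, B

W = np.random.default_rng(1).standard_normal((32, 32))
Q_deq, A, B = lqer_correct(W)
err_plain = np.linalg.norm(W - Q_deq)
err_lqer = np.linalg.norm(W - (Q_deq + A @ B))
```

By the Eckart-Young theorem the rank-`rank` SVD is the best possible low-rank correction in Frobenius norm, so the corrected error is never worse than plain quantization.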
Compression
Brotli
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.001,"phases":1,"prefix_docs":2000}
LR Schedule
cosine decay
parameters: {"peak_lr":0.001}
warmdown
parameters: {"warmdown_fraction":0.75}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Sequence Length
sequence_length
train_length: null
eval_length: 48000
Novel Contributions
- Cap tokenizer: lowercased text plus a special capitalization marker token
- LQER low-rank quantization error correction on top-error layers
- Global phased test-time training with LoRA adaptation
- GPTQ int6/int8 mixed quantization with Brotli compression
- Depth recurrence, U-Net skip connections, and SmearGate-style architecture tweaks
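The Cap tokenizer idea above can be sketched as a reversible pre-tokenization pass: lowercase everything and emit a marker before each originally-capitalized word, so the vocabulary never needs separate cased variants. The marker name and word-level granularity are assumptions, and this simple version does not handle all-caps words:

```python
def cap_encode(text, cap_token="<CAP>"):
    """Lowercase the text, emitting `cap_token` before each word whose
    first letter was uppercase."""
    out = []
    for word in text.split():
        if word[:1].isupper():
            out.append(cap_token)
        out.append(word.lower())
    return out

def cap_decode(tokens, cap_token="<CAP>"):
    """Invert cap_encode: re-capitalize words that follow the marker."""
    words, capitalize = [], False
    for tok in tokens:
        if tok == cap_token:
            capitalize = True
        else:
            words.append(tok.capitalize() if capitalize else tok)
            capitalize = False
    return " ".join(words)

toks = cap_encode("The Cap tokenizer helps a lot")
```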