PR #1127
openRecord: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed)
by dentity007
val_bpb: 1.1311
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.5 MB
Training Techniques
Architecture
XSA
Applied XSA to the deepest 4 layers.
parameters: {"layers":4}
Partial RoPE
Applied rotary positional embeddings to only the first 16 of the 64 per-head dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":"16/64"}
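A minimal sketch of partial RoPE on a single head vector, assuming the standard interleaved-pair rotation and the record's 16/64 split; the PR's actual implementation (and its frequency base) is not shown, so `base=10000.0` is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of a per-head vector `x`
    (a list of floats) by position-dependent angles; dims beyond
    `rot_dims` pass through unchanged -- the "partial" in Partial RoPE."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        # standard RoPE frequency schedule over the rotated sub-space
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because rotation is norm-preserving, only the rotated 16 dims change with position while the other 48 stay position-independent.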
BigramHash
Added bigram hash embeddings with SmearGate.
parameters: {"buckets":8192,"dim":128}
SmearGate
Enabled SmearGate alongside BigramHash.
parameters: null
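A hedged sketch of the bigram-hash idea with the record's 8192 buckets and 128-dim table: hash each (previous, current) token pair into a bucket, look up an embedding, and add it through a gate. The hash multiplier, Gaussian init, and scalar gate are all illustrative assumptions; the PR does not show SmearGate's actual mechanics:

```python
import random

def bigram_bucket(prev_id, cur_id, buckets=8192):
    # Illustrative multiplicative hash of the token pair; the PR's
    # actual hash function is not shown.
    return ((prev_id * 1000003) ^ cur_id) % buckets

class BigramHashEmbedding:
    """Hashed bigram embedding table: 8192 buckets x 128 dims per the record."""
    def __init__(self, buckets=8192, dim=128, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(buckets)]

    def lookup(self, prev_id, cur_id):
        return self.table[bigram_bucket(prev_id, cur_id, self.buckets)]

def smear_gate_add(h, bigram_vec, gate):
    # Assumed SmearGate behavior: a gate in [0, 1] scales how much of the
    # bigram-hash embedding is mixed into the hidden state.
    return [hi + gate * bi for hi, bi in zip(h, bigram_vec)]
```

In training the table entries and the gate would be learned; here they are fixed for illustration.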
MLP3x
Widened the MLP hidden layer to roughly 3x the model dimension.
parameters: {"hidden_size":1536}
KV head count
Set the number of key/value heads to 4, so multiple query heads share each KV head.
parameters: {"heads":4}
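Sharing 4 KV heads across a larger set of query heads is the grouped-query-attention pattern; a minimal sketch of the head expansion (the query-head count here is an assumption, not stated in the record):

```python
def expand_kv(kv_heads, n_q_heads):
    """Grouped-query attention sharing: repeat each of the KV heads so
    that every group of query heads attends against the same KV head."""
    group = n_q_heads // len(kv_heads)  # query heads per KV head
    return [h for h in kv_heads for _ in range(group)]
```

With 4 KV heads this shrinks the KV projection (and KV cache) to 4/n_q of the multi-head size.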
U-Net skip connections
Used U-Net style skip connections in the 11-layer architecture.
parameters: {"layers":11}
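A sketch of U-Net skips over an odd layer count such as 11: first-half activations are saved on a stack and added back into the mirrored second-half layers, with the middle layer unpaired. How the PR combines the skip (add vs. learned mix) is not shown, so the plain addition is an assumption:

```python
def unet_forward(x, layers):
    """U-Net style skip connections over len(layers) blocks (e.g. 11):
    save outputs of the first half, add them back in reverse order to the
    inputs of the mirrored second-half blocks; the middle block has no skip."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            x = layer(x)
            saved.append(x)      # stash for the mirrored layer
        elif i == half:
            x = layer(x)         # middle layer, unpaired
        else:
            x = layer(x + saved.pop())  # add mirrored activation
    return x
```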
Weight Averaging
EMA
parameters: {"decay":0.9985}
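The EMA update with the record's decay of 0.9985, as a minimal sketch over flat parameter lists (real implementations track tensors per-parameter, but the arithmetic is the same):

```python
class EMA:
    """Exponential moving average of model weights: shadow weights are
    blended toward the live weights by (1 - decay) each step."""
    def __init__(self, params, decay=0.9985):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

Evaluation then uses `shadow` in place of the raw weights; at decay 0.9985 the average has an effective horizon of roughly 1/(1-0.9985) ≈ 667 steps.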
Quantization
late QAT
bits: 6
scope: model weights
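The forward pass of int6 fake quantization, sketched for a single scalar weight: quantize to the 6-bit signed range [-32, 31], then dequantize. In QAT the backward pass uses a straight-through estimator, passing gradients through the rounding; how the PR's STE threshold of 0.15 is applied (likely a clipping bound on the pass-through gradient) is not specified, so it is omitted here:

```python
def fake_quant_int6(w, scale):
    """Int6 fake quantization forward pass: round w/scale to the nearest
    integer, clip to the signed 6-bit range [-32, 31], and dequantize.
    Training sees quantized values while weights stay in float."""
    q = max(-32, min(31, round(w / scale)))
    return q * scale
```

Values outside ±32·scale saturate at the range limits, which is what the clipping below demonstrates.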
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"epochs":1}
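A sketch of the LoRA structure behind the test-time-training step: a frozen base weight plus a rank-8 adapter `B @ A`, where only A and B would be updated during the single pass (lr 0.01, 1 epoch per the record). The init below zeroes A so the adapter starts as a no-op (conventionally B is the zeroed side; either works); the update loop itself is omitted since the PR's TTT objective is not shown:

```python
class LoRALinear:
    """Linear layer with a rank-r LoRA adapter: y = W x + B (A x).
    The base weight W is frozen; only A and B train at test time."""
    def __init__(self, weight, rank=8):
        rows, cols = len(weight), len(weight[0])
        self.W = weight
        self.A = [[0.0] * cols for _ in range(rank)]    # zeroed side: adapter
        self.B = [[0.01] * rank for _ in range(rows)]   # starts as a no-op

    def forward(self, x):
        base = [sum(wr[j] * x[j] for j in range(len(x))) for wr in self.W]
        ax = [sum(ar[j] * x[j] for j in range(len(x))) for ar in self.A]
        delta = [sum(br[k] * ax[k] for k in range(len(ax))) for br in self.B]
        return [b + d for b, d in zip(base, delta)]
```

At rank 8 the adapter adds only r·(rows + cols) trainable values per layer, which keeps the one-pass update cheap.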
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
cosine decay
parameters: null
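The cosine decay schedule, as a minimal sketch: the LR falls from its peak at step 0 to a floor at the final step along a half cosine. Warmup and the actual peak/floor values are not given in the record, so they are left as parameters:

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr=0.0):
    """Cosine decay: max_lr at step 0, min_lr at total_steps, following
    half a cosine wave in between."""
    t = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```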
Novel Contributions
- 11-layer compressed Transformer (MODEL_DIM=480) fitting under the 16 MB artifact limit
- EMA with decay 0.9985
- Partial RoPE using 16/64 dimensions
- Late int6 QAT with STE threshold 0.15
- Single-pass LoRA test-time training
- XSA on the deepest 4 layers
- BigramHash with SmearGate
- int6 plus zstd-22 artifact compression