PR #327
Submission: TrigramHash + PartialRoPE + HeadTemp + stride32 (val_bpb: 1.1450)
by Ananddna
val_bpb
1.1450
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB
Training Techniques
Architecture
TrigramHash
Adds learned hashed embeddings for consecutive token triplets to capture 3-token patterns as atomic units.
parameters: {"buckets":8192,"dim":64}
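A minimal sketch of how a trigram hash embedding like this could look in PyTorch, using the listed buckets=8192 and dim=64; the hash mixing constants, the `d_model` projection, and the wrap-around handling at the sequence start are assumptions, not the submission's actual code:

```python
import torch
import torch.nn as nn

class TrigramHashEmbedding(nn.Module):
    """Hash each consecutive token triplet into one of `buckets` learned
    embeddings and project the result onto the residual stream.
    Sketch only: hash constants and projection shape are assumptions."""

    def __init__(self, buckets=8192, dim=64, d_model=768):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, d_model, bias=False)

    def forward(self, tokens):  # tokens: (B, T) int64
        t0 = tokens
        t1 = torch.roll(tokens, shifts=1, dims=1)  # previous token
        t2 = torch.roll(tokens, shifts=2, dims=1)  # token before that
        # Cheap multiplicative hash of the triplet (t-2, t-1, t) into a bucket id.
        h = (t0 * 1000003 + t1 * 8191 + t2) % self.buckets
        return self.proj(self.emb(h))  # (B, T, d_model), added to token embeddings
```

The BigramHash feature listed below would follow the same pattern with pairs instead of triplets.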
Partial RoPE
Applies rotary position embeddings to only part of each attention head dimension, leaving the rest position-free.
parameters: {"fraction":0.5}
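A sketch of partial RoPE with fraction=0.5, rotating only the first half of each head dimension and passing the rest through unrotated; the exact split and the precomputed `cos`/`sin` layout are assumptions:

```python
import torch

def partial_rope(x, cos, sin, fraction=0.5):
    """Apply rotary position embeddings to only the first `fraction` of each
    head dimension.  x: (B, H, T, D); cos, sin: (T, rot_dim // 2) precomputed."""
    D = x.size(-1)
    rot_dim = int(D * fraction)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)              # pair up dims to rotate
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    # The remaining dims carry no positional signal at all.
    return torch.cat([rotated, x_pass], dim=-1)
```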
Per-head temperature scaling
Learns a separate temperature parameter for each attention head to vary attention sharpness.
parameters: null
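One way to implement per-head temperature, assuming a learned log-temperature per head applied to the attention logits before softmax (initialization and placement are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerHeadTemperature(nn.Module):
    """One learned temperature per attention head; scaling the logits lets
    each head be individually sharper or softer."""

    def __init__(self, n_heads):
        super().__init__()
        self.log_temp = nn.Parameter(torch.zeros(n_heads))  # temp = 1.0 at init

    def forward(self, attn_logits):  # attn_logits: (B, H, T, T)
        temp = self.log_temp.exp().view(1, -1, 1, 1)
        return F.softmax(attn_logits / temp, dim=-1)
```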
BigramHash
Uses hashed embeddings for token pairs as a complementary n-gram feature.
parameters: {"buckets":10240}
SmearGate
Gating component used in the model architecture.
parameters: null
MLP3x
Uses a 3x expanded MLP hidden size.
parameters: {"expansion":3}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
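With heads=8 and kv_heads=4, each K/V head serves two query heads. A standard way to realize this is to expand the K/V tensors before the attention matmul:

```python
import torch

def repeat_kv(x, n_rep):
    """Expand KV heads to match the query head count for grouped-query attention.
    x: (B, kv_heads, T, d_head) -> (B, kv_heads * n_rep, T, d_head)."""
    return x.repeat_interleave(n_rep, dim=1)

# heads=8, kv_heads=4  ->  each K/V head is shared by 8 // 4 = 2 query heads:
# k, v = repeat_kv(k, 2), repeat_kv(v, 2)
```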
tied embeddings
Shares the token embedding weights with the output (unembedding) projection.
parameters: null
U-Net skip connections
Adds skip connections in a U-Net-like pattern across layers.
parameters: null
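A rough sketch of U-Net-style skips across transformer layers: the first half of the blocks cache their outputs, and mirrored blocks in the second half add them back. The learned per-skip scalar gates and the even-layer-count assumption are mine, not confirmed by the submission:

```python
import torch
import torch.nn as nn

class UNetTransformer(nn.Module):
    """Outputs of the first half of layers are cached and added (with learned
    scalar gates, an assumption) to the mirrored layers in the second half."""

    def __init__(self, layers):
        super().__init__()
        assert len(layers) % 2 == 0, "sketch assumes an even number of layers"
        self.layers = nn.ModuleList(layers)
        self.half = len(layers) // 2
        self.skip_w = nn.Parameter(torch.ones(self.half))

    def forward(self, x):
        skips = []
        for layer in self.layers[:self.half]:      # "encoder" half
            x = layer(x)
            skips.append(x)
        for i, layer in enumerate(self.layers[self.half:]):  # "decoder" half
            x = x + self.skip_w[i] * skips[-(i + 1)]          # mirror-paired skip
            x = layer(x)
        return x
```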
Evaluation
sliding window eval
parameters: {"stride":32}
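A minimal sketch of sliding-window evaluation with stride 32: each window scores only its last `stride` targets, so most tokens are predicted with nearly the full context. It assumes byte-level tokens (so nats per token convert directly to bits per byte) and skips the first `ctx_len` tokens, which the real harness may handle differently:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, ctx_len=1024, stride=32):
    """Bits-per-byte with a window sliding by `stride`; only the last `stride`
    targets of each window are counted, so windows never double-score a token."""
    total_nll, total_tokens = 0.0, 0
    for end in range(ctx_len, tokens.size(0), stride):
        window = tokens[end - ctx_len:end + 1]          # ctx_len + 1 tokens
        logits = model(window[:-1].unsqueeze(0))[0]     # (ctx_len, vocab)
        nll = F.cross_entropy(logits[-stride:], window[1:][-stride:],
                              reduction="sum")
        total_nll += nll.item()
        total_tokens += stride
    return total_nll / total_tokens / math.log(2)       # nats -> bits
```

Halving the stride from 64 to 32 doubles the number of forward passes but gives each scored token more preceding context on average.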
Test-Time Training
LoRA TTT
parameters: {"rank":4}
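A standard rank-4 LoRA wrapper of the kind such test-time training would use; freezing scheme, init scale, and `alpha` are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank adapter; only A and B are
    trained during test-time training."""

    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta = 0 at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```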
Weight Averaging
SWA
parameters: {"frac":0.4,"every":50}
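Reading frac=0.4 and every=50 as "average snapshots taken every 50 steps over the last 40% of training", a running-mean SWA helper could look like this (buffer handling and the exact schedule are assumptions):

```python
import copy
import torch

class SWA:
    """Running average of model weights, updated every `every` steps once
    training has passed (1 - frac) of the total steps."""

    def __init__(self, model, total_steps, frac=0.4, every=50):
        self.start = int((1 - frac) * total_steps)
        self.every = every
        self.avg = copy.deepcopy(model).eval()
        self.n = 0

    @torch.no_grad()
    def update(self, model, step):
        if step < self.start or step % self.every != 0:
            return
        self.n += 1
        for p_avg, p in zip(self.avg.parameters(), model.parameters()):
            p_avg.add_((p - p_avg) / self.n)   # incremental running mean
```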
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
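For context, the core of a Muon update is an approximate orthogonalization of the (momentum-accumulated) gradient matrix via a Newton-Schulz iteration; the sketch below follows the public Muon reference implementation and is not this submission's code, whose only stated setting is weight_decay=0.04:

```python
import torch

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a 2D gradient matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon reference)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Update sketch with decoupled weight decay (assumed form):
# p -= lr * (newton_schulz_orth(momentum_buffer) + 0.04 * p)
```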
Compression
zstd
level: 22
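Compressing the artifact at zstd level 22 (the highest standard level) is a one-liner with the `zstandard` package, assuming the artifact is a single serialized file:

```python
import zstandard

def compress_artifact(path_in, path_out):
    """Compress a serialized model file with zstd at level 22."""
    with open(path_in, "rb") as f:
        data = f.read()
    with open(path_out, "wb") as f:
        f.write(zstandard.ZstdCompressor(level=22).compress(data))
```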
Novel Contributions
- TrigramHashEmbedding for hashing token triplets into learned embeddings
- Partial RoPE applied to only 50% of head dimensions
- Per-head temperature scaling in attention
- Reduced evaluation stride from 64 to 32
- LoRA-based test-time training infrastructure