PR #327

open

Submission: TrigramHash + PartialRoPE + HeadTemp + stride32 (val_bpb: 1.1450)

by Ananddna
val_bpb
1.1450
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
TrigramHash
Adds learned hashed embeddings for consecutive token triplets to capture 3-token patterns as atomic units.
parameters: {"buckets":8192,"dim":64}
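The idea can be sketched in a few lines: hash each consecutive token triplet into one of 8192 buckets and look up a learned 64-dim embedding for it (the bucket count and dim come from this submission's parameters; the hash mixing constants and the random initialisation below are assumptions, since the PR does not specify them).

```python
import random

BUCKETS, DIM = 8192, 64  # parameters from this submission

# Stand-in for a learned table of shape (BUCKETS, DIM); in the real model
# these rows are trained, here they are just randomly initialised.
random.seed(0)
trigram_table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def trigram_bucket(t0, t1, t2):
    # Mix the three token ids with distinct multipliers, then mod into buckets.
    # The exact mixing constants are an assumption; any cheap hash works.
    return (t0 * 1000003 + t1 * 10007 + t2 * 101) % BUCKETS

def trigram_features(tokens):
    # One DIM-dim feature per position, added to the usual token embedding.
    # The first two positions have no full trigram, so they get zero vectors.
    feats = [[0.0] * DIM, [0.0] * DIM]
    for i in range(2, len(tokens)):
        feats.append(trigram_table[trigram_bucket(tokens[i-2], tokens[i-1], tokens[i])])
    return feats

feats = trigram_features([5, 17, 42, 42, 9])
```

Hash collisions are accepted by design: with 8192 buckets, unrelated triplets occasionally share an embedding, trading exactness for a small table.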
Partial RoPE
Applies rotary position embeddings to only part of each attention head dimension, leaving the rest position-free.
parameters: {"fraction":0.5}
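A minimal sketch of the partial variant, operating on a single head vector: with fraction 0.5 (this submission's setting), only the first half of the dims is rotated and the rest pass through untouched. The head size and rotary base below are assumptions.

```python
import math

def partial_rope(x, pos, fraction=0.5, base=10000.0):
    """Rotate the first `fraction` of the head dims by position-dependent
    angles (standard RoPE pairs); leave the remaining dims position-free.
    `x` is one head's query or key vector as a list of floats."""
    d = len(x)
    d_rot = int(d * fraction)
    d_rot -= d_rot % 2              # rotary pairs need an even count
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = pos / (base ** (i / d_rot))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

v = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
rotated = partial_rope(v, pos=3)   # first 4 dims rotated, last 4 unchanged
```

The untouched dims give attention a channel that ignores relative position entirely, which the rotated dims cannot provide.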
Per-head temperature scaling
Learns a separate temperature parameter for each attention head to vary attention sharpness.
parameters: null
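Mechanically this is a per-head divisor on the pre-softmax scores. A small sketch (the temperature values below are illustrative stand-ins for learned scalars; the PR reports no parameters for this technique):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    z = sum(es)
    return [e / z for e in es]

# Hypothetical learned temperatures, one per head (typically initialised to 1.0).
head_temps = [1.0, 0.5, 2.0, 1.3]

def attn_weights(scores, head):
    # Lower temperature -> sharper attention; higher -> more uniform.
    t = head_temps[head]
    return softmax([s / t for s in scores])

sharp = attn_weights([1.0, 2.0, 3.0], head=1)   # temp 0.5: peaked
soft  = attn_weights([1.0, 2.0, 3.0], head=2)   # temp 2.0: flatter
```

This lets some heads specialise in near-hard selection while others keep broad, averaging attention, instead of forcing one global 1/sqrt(d) scale on every head.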
BigramHash
Uses hashed embeddings for token pairs as a complementary n-gram feature.
parameters: {"buckets":10240}
SmearGate
Gating component used in the model architecture.
parameters: null
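The PR does not spell out SmearGate's mechanism. One plausible reading, offered purely as an assumption, is a learned gate that "smears" each position's representation with its predecessor's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

smear_gate = -2.0   # hypothetical learned scalar; sigmoid(-2) ~= 0.12

def smear(xs):
    # Blend each position with the previous one via the learned gate.
    # Position 0 has no predecessor and passes through unchanged.
    g = sigmoid(smear_gate)
    out = [list(xs[0])]
    for prev, cur in zip(xs, xs[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```

If the actual component gates something else (e.g. a residual branch), the shape is the same: a learned sigmoid scalar controlling how much of an auxiliary signal is mixed in.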
MLP3x
Uses a 3x expanded MLP hidden size.
parameters: {"expansion":3}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
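With 8 query heads over 4 KV heads (this submission's setting), each pair of query heads reads the same K/V projection, halving the KV cache. The grouping rule is just integer division:

```python
HEADS, KV_HEADS = 8, 4        # from this submission
GROUP = HEADS // KV_HEADS     # query heads per KV head

def kv_head_for(q_head):
    # Consecutive groups of GROUP query heads share one KV head.
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```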
tied embeddings
Shares weights between the input token embedding and the output (unembedding) projection, saving one vocab-sized matrix.
parameters: null
U-Net skip connections
Adds skip connections in a U-Net-like pattern across layers.
parameters: null
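A common realisation of this pattern (the pairing scheme below is an assumption, since the PR gives no parameters): layers in the first half of the stack save their outputs, and mirrored layers in the second half add them back before running.

```python
def unet_forward(x, layers):
    """U-Net-style skips over a stack of layers (callables here).
    Layer i in the first half pushes its output; the mirrored layer
    in the second half pops it and adds it to its input."""
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - n // 2 and saved:
            x = x + saved.pop()
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x

# Toy check with scalar "activations" and +1 "layers".
y = unet_forward(0, [lambda v: v + 1] * 4)
```

The long skips give late layers direct access to early, less-processed features, much like the encoder-decoder shortcuts in an image U-Net.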
Evaluation
sliding window eval
parameters: {"stride":32}
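The stride-32 evaluation (reduced from 64 per the notes below) can be sketched as window planning: each window advances by `stride`, the model sees the whole window, but loss is counted only on tokens not yet scored, so nearly every token is evaluated with close to a full window of context. The window length of 1024 here is an assumed context size, not stated in the PR.

```python
def eval_spans(n_tokens, window=1024, stride=32):
    """Return (begin, end, score_from) triples: the model reads tokens
    [begin:end] but loss is counted only on [score_from:end]. Smaller
    stride -> more windows, but more context per scored token."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Halving the stride roughly doubles evaluation compute, which is the price paid here for the bpb improvement.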
Test-Time Training
LoRA TTT
parameters: {"rank":4}
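The rank-4 setting means each adapted weight W is augmented as W + A @ B, where A and B are thin matrices trained at test time while W stays frozen. A pure-Python sketch (the layer dimensions are illustrative; only the rank comes from this submission):

```python
import random

random.seed(0)
D_OUT, D_IN, RANK = 16, 16, 4   # rank from this submission; dims assumed

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Frozen base weight plus a trainable low-rank pair.
W = [[random.gauss(0, 0.02) for _ in range(D_IN)] for _ in range(D_OUT)]
A = [[0.0] * RANK for _ in range(D_OUT)]  # zero init: adapter starts as a no-op
B = [[random.gauss(0, 0.02) for _ in range(D_IN)] for _ in range(RANK)]

def effective_weight():
    delta = matmul(A, B)
    return [[W[i][j] + delta[i][j] for j in range(D_IN)] for i in range(D_OUT)]
```

Only A and B (2 * 16 * 4 = 128 values here, vs 256 in W) are updated during test-time training, keeping the adaptation cheap and the base weights untouched.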
Weight Averaging
SWA
parameters: {"frac":0.4,"every":50}
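Reading the parameters as "average snapshots taken every 50 steps over the final 40% of training", the averaging itself is a running mean over parameter vectors. A minimal sketch (the schedule, i.e. when `update` is called, lives outside this class):

```python
class SWA:
    """Running average of parameter snapshots (flat lists of floats).
    With this submission's settings, update() would be called every 50
    optimizer steps during the last 40% of training."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # Incremental mean: avg += (params - avg) / n, elementwise.
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]
```

The averaged weights are what gets evaluated (and shipped); the raw final checkpoint is discarded.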
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
zstd
level: 22

Novel Contributions

  • TrigramHashEmbedding for hashing token triplets into learned embeddings
  • Partial RoPE applied to only 50% of head dimensions
  • Per-head temperature scaling in attention
  • Reduced evaluation stride from 64 to 32
  • LoRA-based test-time training infrastructure