PR #418

open

Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)

by yashverms
val_bpb: 1.1715
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.59 MB

Training Techniques

Architecture
DiffTransformer V2
Differential attention in the last 2 layers using two softmax maps and subtraction to cancel noise.
parameters: {"layers":2}
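The differential-attention idea can be sketched in a few lines: compute two softmax attention maps and subtract the second, scaled by a mixing coefficient, so noise common to both maps cancels. A minimal sketch on plain Python lists; the `lam` value is illustrative, not taken from this submission.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def diff_attention_weights(scores1, scores2, lam=0.8):
    """Differential attention: subtract a second softmax map scaled by
    lam so attention noise shared by both maps cancels. Sketch only;
    lam is a learned scalar in DiffTransformer, fixed here for clarity."""
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [p - lam * q for p, q in zip(a1, a2)]
```

Note that the resulting weights sum to `1 - lam` rather than 1, which is why the real architecture renormalizes the output.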
TrigramHash
Adds a trigram hash table to capture three-token patterns alongside BigramHash.
parameters: {"buckets":2048,"dimensions":64}
BigramHash
Bigram n-gram memory component used with context-aware gating.
parameters: {"buckets":2048,"dimensions":128}
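Both hash components above follow the same pattern: hash the n-gram ending at each position into a fixed number of buckets and look up a learned vector there. A sketch with an illustrative polynomial hash (the submission's exact hash function is not shown); `n=3, buckets=2048` matches TrigramHash, `n=2` BigramHash.

```python
def ngram_bucket(tokens, buckets):
    """Hash a tuple of token ids into one of `buckets` slots
    (illustrative 32-bit polynomial hash, not the PR's exact one)."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + t) & 0xFFFFFFFF
    return h % buckets

def ngram_buckets(ids, n, buckets):
    """Bucket index for the n-gram ending at each position >= n-1;
    each index selects a learned embedding row in the real model."""
    return [ngram_bucket(ids[i - n + 1:i + 1], buckets)
            for i in range(n - 1, len(ids))]
```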
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
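Partial RoPE with these parameters rotates only the first 16 of 64 head dimensions and passes the remaining 48 through unchanged. A sketch assuming the adjacent-pair rotation convention; the submission's exact pairing is not shown.

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries
    of a per-head vector x, leaving the rest untouched. Adjacent pairs
    (2i, 2i+1) are rotated by a position-dependent angle."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```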
XSA
Uses XSA attention in the last 6 layers.
parameters: {"layers":6}
SmearGate
Includes SmearGate in the architecture.
parameters: null
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
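With 8 query heads over 4 KV heads, grouped-query attention shares each KV head between two consecutive query heads. The mapping is just integer division:

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Grouped-query attention head mapping: consecutive query heads
    share one KV head (groups of heads // kv_heads, i.e. 2 here)."""
    return query_head // (heads // kv_heads)
```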
MLP3x
Expanded MLP with 3x hidden size and ReLU² activation.
parameters: {"expansion":3}
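The MLP3x block expands to a 3x-wide hidden layer and applies squared ReLU. A sketch on plain Python lists, biases omitted for brevity; this is illustrative, not the submission's code.

```python
def relu2(v):
    """Squared ReLU: max(0, v)^2."""
    return max(0.0, v) ** 2

def mlp3x(x, w_in, w_out):
    """Two-layer MLP with ReLU^2. w_in has 3*len(x) rows (the 3x
    expansion); w_out projects the hidden vector back to len(x)."""
    h = [relu2(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * hv for w, hv in zip(row, h)) for row in w_out]
```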
U-Net skips
Uses U-Net style skip connections.
parameters: null
Optimizer
NorMuon
weight_decay: 0.02
momentum: 0.95
other_params: {"beta2":0.95,"lr":0.04}
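The per-neuron normalization that distinguishes NorMuon from plain Muon can be sketched as: after Newton-Schulz orthogonalization of the momentum matrix, each output-neuron row of the update is rescaled toward unit RMS. A simplified sketch; the actual optimizer also tracks a beta2=0.95 second-moment estimate per row rather than normalizing instantaneously.

```python
import math

def neuron_norm(update, eps=1e-8):
    """Per-neuron (row-wise) normalization of an orthogonalized update:
    rescale each output-neuron row to unit RMS so every neuron takes a
    comparably sized step. `update` is a list of rows (nested lists)."""
    out = []
    for row in update:
        rms = math.sqrt(sum(v * v for v in row) / len(row))
        out.append([v / (rms + eps) for v in row])
    return out
```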
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
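The layerwise LN scale given above is a one-liner: the LayerNorm gain of layer i is set to 1/sqrt(i+1), shrinking later layers' contributions.

```python
import math

def ln_gain(layer_index):
    """Layerwise LayerNorm scale from the regularization recipe:
    gain = 1/sqrt(layer+1), so layer 0 -> 1.0, layer 3 -> 0.5, etc."""
    return 1.0 / math.sqrt(layer_index + 1)
```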
weight decay
parameters: {"matrices":0.02,"embeddings_scalars":0.01}
Weight Averaging
SWA
parameters: {"every_steps":200}
Quantization
int6
bits: 6
scope: MLP and attention weight matrices
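A minimal view of the int6 step: quantize each weight tensor to integer codes in a signed 6-bit range with one float scale, then dequantize at load time. A sketch assuming symmetric per-tensor scaling; the PR does not specify its grouping scheme.

```python
def quantize_int6(ws):
    """Symmetric int6 quantization sketch: one scale per tensor,
    codes clamped to [-31, 31] (symmetric signed 6-bit range)."""
    scale = max(abs(w) for w in ws) / 31 or 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in ws]
    return codes, scale

def dequantize_int6(codes, scale):
    """Recover approximate float weights from codes and scale."""
    return [c * scale for c in codes]
```

Round-to-nearest bounds the per-weight error by half the scale, which is why the quantization is confined to the (more redundant) MLP and attention matrices.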
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_iters":1200,"warmup_steps":20}
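With warmup_steps=20 and warmdown_iters=1200 this is a trapezoidal schedule: linear warmup, flat plateau, then a linear ramp to zero over the final 1200 iterations. A sketch; `total_steps` is assumed, as the PR does not state the run length.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_iters=1200):
    """Trapezoidal LR schedule: linear warmup over warmup_steps,
    constant at 1.0, then linear warmdown over the last warmdown_iters."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters
    return 1.0
```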
Other
other
Late QAT enabled when learning-rate scale drops below 0.1.
parameters: {"threshold":0.1}
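The late-QAT trigger can be sketched as a gate on the schedule scale: once it falls below 0.1, the forward pass runs weights through their quantization round-trip so training adapts to the int6 grid. An illustrative straight-through-style sketch, not the submission's exact implementation.

```python
def maybe_fake_quantize(ws, schedule_scale, threshold=0.1):
    """Late-QAT gate: below the LR-scale threshold, replace weights with
    their int6 round-trip (symmetric per-tensor scale, codes in
    [-31, 31]); above it, pass weights through unchanged."""
    if schedule_scale >= threshold:
        return list(ws)
    scale = max(abs(w) for w in ws) / 31 or 1.0
    return [max(-31, min(31, round(w / scale))) * scale for w in ws]
```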

Novel Contributions

  • DiffTransformer V2 attention in the last 2 layers
  • NorMuon optimizer with per-neuron row normalization after Newton-Schulz orthogonalization
  • TrigramHash with context-aware n-gram gating
  • First submission using differential attention in the competition
  • First submission using NorMuon optimizer
  • First submission with context-aware n-gram gating