PR #418

open

Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)

by yashverms
val_bpb: 1.1715
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.59 MB

Training Techniques

Architecture
DiffTransformer V2
Differential attention in the last 2 layers using two softmax maps and subtraction to cancel noise.
parameters: {"layers":2}
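The differential-attention idea can be sketched in a few lines: compute two softmax attention maps and subtract the second, scaled by a mixing coefficient, so noise common to both maps cancels. A minimal sketch on plain Python lists; the `lam` value is illustrative, not taken from this submission.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def diff_attention_weights(scores1, scores2, lam=0.8):
    """Differential attention: subtract a second softmax map scaled by
    lam so attention noise shared by both maps cancels. Sketch only;
    lam is a learned scalar in DiffTransformer, fixed here for clarity."""
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [p - lam * q for p, q in zip(a1, a2)]
```

Note that the resulting weights sum to `1 - lam` rather than 1, which is why the real architecture renormalizes the output.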
TrigramHash
Adds a trigram hash table to capture three-token patterns alongside BigramHash.
parameters: {"buckets":2048,"dimensions":64}
BigramHash
Bigram n-gram memory component used with context-aware gating.
parameters: {"buckets":2048,"dimensions":128}
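Both hash components above follow the same pattern: hash the n-gram ending at each position into a fixed number of buckets and look up a learned vector there. A sketch with an illustrative polynomial hash (the submission's exact hash function is not shown); `n=3, buckets=2048` matches TrigramHash, `n=2` BigramHash.

```python
def ngram_bucket(tokens, buckets):
    """Hash a tuple of token ids into one of `buckets` slots
    (illustrative 32-bit polynomial hash, not the PR's exact one)."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + t) & 0xFFFFFFFF
    return h % buckets

def ngram_buckets(ids, n, buckets):
    """Bucket index for the n-gram ending at each position >= n-1;
    each index selects a learned embedding row in the real model."""
    return [ngram_bucket(ids[i - n + 1:i + 1], buckets)
            for i in range(n - 1, len(ids))]
```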
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
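Partial RoPE with these parameters rotates only the first 16 of 64 head dimensions and passes the remaining 48 through unchanged. A sketch assuming the adjacent-pair rotation convention; the submission's exact pairing is not shown.

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries
    of a per-head vector x, leaving the rest untouched. Adjacent pairs
    (2i, 2i+1) are rotated by a position-dependent angle."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```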
XSA
Uses XSA attention in the last 6 layers.
parameters: {"layers":6}
SmearGate
Includes SmearGate in the architecture.
parameters: null
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
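With 8 query heads over 4 KV heads, grouped-query attention shares each KV head between two consecutive query heads. The mapping is just integer division:

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Grouped-query attention head mapping: consecutive query heads
    share one KV head (groups of heads // kv_heads, i.e. 2 here)."""
    return query_head // (heads // kv_heads)
```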
MLP3x
Expanded MLP with 3x hidden size and ReLU² activation.
parameters: {"expansion":3}
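The MLP3x block expands to a 3x-wide hidden layer and applies squared ReLU. A sketch on plain Python lists, biases omitted for brevity; this is illustrative, not the submission's code.

```python
def relu2(v):
    """Squared ReLU: max(0, v)^2."""
    return max(0.0, v) ** 2

def mlp3x(x, w_in, w_out):
    """Two-layer MLP with ReLU^2. w_in has 3*len(x) rows (the 3x
    expansion); w_out projects the hidden vector back to len(x)."""
    h = [relu2(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * hv for w, hv in zip(row, h)) for row in w_out]
```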
U-Net skips
Uses U-Net style skip connections.
parameters: null
Optimizer
NorMuon
weight_decay: 0.02
momentum: 0.95
other_params: {"beta2":0.95,"lr":0.04}
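The per-neuron normalization that distinguishes NorMuon from plain Muon can be sketched as: after Newton-Schulz orthogonalization of the momentum matrix, each output-neuron row of the update is rescaled toward unit RMS. A simplified sketch; the actual optimizer also tracks a beta2=0.95 second-moment estimate per row rather than normalizing instantaneously.

```python
import math

def neuron_norm(update, eps=1e-8):
    """Per-neuron (row-wise) normalization of an orthogonalized update:
    rescale each output-neuron row to unit RMS so every neuron takes a
    comparably sized step. `update` is a list of rows (nested lists)."""
    out = []
    for row in update:
        rms = math.sqrt(sum(v * v for v in row) / len(row))
        out.append([v / (rms + eps) for v in row])
    return out
```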
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
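The layerwise LN scale given above is a one-liner: the LayerNorm gain of layer i is set to 1/sqrt(i+1), shrinking later layers' contributions.

```python
import math

def ln_gain(layer_index):
    """Layerwise LayerNorm scale from the regularization recipe:
    gain = 1/sqrt(layer+1), so layer 0 -> 1.0, layer 3 -> 0.5, etc."""
    return 1.0 / math.sqrt(layer_index + 1)
```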
weight decay
parameters: {"matrices":0.02,"embeddings_scalars":0.01}
Weight Averaging
SWA
parameters: {"every_steps":200}
Quantization
int6
bits: 6
scope: MLP and attention weight matrices
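A minimal view of the int6 step: quantize each weight tensor to integer codes in a signed 6-bit range with one float scale, then dequantize at load time. A sketch assuming symmetric per-tensor scaling; the PR does not specify its grouping scheme.

```python
def quantize_int6(ws):
    """Symmetric int6 quantization sketch: one scale per tensor,
    codes clamped to [-31, 31] (symmetric signed 6-bit range)."""
    scale = max(abs(w) for w in ws) / 31 or 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in ws]
    return codes, scale

def dequantize_int6(codes, scale):
    """Recover approximate float weights from codes and scale."""
    return [c * scale for c in codes]
```

Round-to-nearest bounds the per-weight error by half the scale, which is why the quantization is confined to the (more redundant) MLP and attention matrices.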
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_iters":1200,"warmup_steps":20}
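With warmup_steps=20 and warmdown_iters=1200 this is a trapezoidal schedule: linear warmup, flat plateau, then a linear ramp to zero over the final 1200 iterations. A sketch; `total_steps` is assumed, as the PR does not state the run length.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_iters=1200):
    """Trapezoidal LR schedule: linear warmup over warmup_steps,
    constant at 1.0, then linear warmdown over the last warmdown_iters."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters
    return 1.0
```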
Other
other
Late QAT enabled when learning-rate scale drops below 0.1.
parameters: {"threshold":0.1}
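The late-QAT trigger can be sketched as a gate on the schedule scale: once it falls below 0.1, the forward pass runs weights through their quantization round-trip so training adapts to the int6 grid. An illustrative straight-through-style sketch, not the submission's exact implementation.

```python
def maybe_fake_quantize(ws, schedule_scale, threshold=0.1):
    """Late-QAT gate: below the LR-scale threshold, replace weights with
    their int6 round-trip (symmetric per-tensor scale, codes in
    [-31, 31]); above it, pass weights through unchanged."""
    if schedule_scale >= threshold:
        return list(ws)
    scale = max(abs(w) for w in ws) / 31 or 1.0
    return [max(-31, min(31, round(w / scale))) * scale for w in ws]
```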

Novel Contributions

  • DiffTransformer V2 attention in the last 2 layers
  • NorMuon optimizer with per-neuron row normalization after Newton-Schulz orthogonalization
  • TrigramHash with context-aware n-gram gating
  • First submission using differential attention in the competition
  • First submission using NorMuon optimizer
  • First submission with context-aware n-gram gating