PR #1252 (open)
Record(?): WARP (Word-Aware Representation Priors) — val_bpb 1.0713 | 1xH100 10min | 13.65 MB
by ahmetdenizyilmaz
val_bpb: 1.0713
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.65 MB
Training Techniques
Architecture
LeakyReLU
LeakyReLU(0.5) squared MLP activation in the base transformer stack.
parameters: {"squared":true,"negative_slope":0.5}
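A minimal sketch of the activation as described, assuming "squared" means a plain square applied after the LeakyReLU (rather than a sign-preserving variant); the slope value 0.5 comes from the parameters above.

```python
def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU(negative_slope) followed by squaring. The squaring is
    # assumed to be a plain square, which makes the output non-negative
    # on both branches while the slope keeps gradient flowing for x < 0.
    y = x if x > 0 else negative_slope * x
    return y * y
```

For example, an input of -2.0 passes through the leaky branch as -1.0 and squares to 1.0.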
BigramHash
Bigram hash embedding/bucket mechanism retained in the architecture.
parameters: {"buckets":2816}
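A sketch of one common form of bigram hash embedding, using the 2816-bucket count above; the hash multiplier, the start-of-sequence placeholder, and the additive combination with the token embedding are illustrative assumptions, not the PR's exact scheme.

```python
import numpy as np

def bigram_hash_embed(token_ids, table, buckets=2816, mult=1000003):
    # Hash each (previous, current) token-ID pair into one of `buckets`
    # buckets and gather an auxiliary embedding row, typically added to
    # the ordinary token embedding at the input.
    out = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed start-of-sequence placeholder ID
    for i, cur in enumerate(token_ids):
        bucket = (prev * mult + cur) % buckets
        out[i] = table[bucket]
        prev = cur
    return out
```

Hashing keeps the table small and collision-tolerant: 2816 buckets cover all bigrams without a quadratic vocab-squared table.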
XSA
XSA attention/sequence architecture used across all layers.
parameters: {"layers":11}
SmearGate
SmearGate enabled in the model.
parameters: null
Partial RoPE
Partial rotary position embedding used in the base architecture.
parameters: null
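A sketch of partial RoPE: rotary embedding applied to only a leading fraction of each head's dimensions, with the rest passed through unrotated. The 0.5 fraction and the rotate-half pairing convention are assumptions for illustration.

```python
import numpy as np

def partial_rope(x, rotary_frac=0.5, base=10000.0):
    # x: (T, head_dim). Rotate only the first rotary_frac of the head
    # dimension with standard RoPE; remaining dims pass through unchanged.
    T, D = x.shape
    r = int(D * rotary_frac) // 2 * 2          # rotated dims, rounded to even
    half = r // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)     # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:r]         # rotate-half pairing
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, r:]], axis=1)
```

The unrotated tail gives the model position-independent channels alongside the position-encoded ones.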
WARP-Len
Word length embedding injected at layer 0 based on BPE word length.
parameters: {"params":6657}
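A sketch of the WARP-Len idea: each token looks up an embedding keyed by the BPE-token length of the word it belongs to, and the result is added to the layer-0 residual stream. The table size and length clipping are illustrative assumptions.

```python
import numpy as np

def warp_len_embed(word_lengths, table, max_len=16):
    # word_lengths: (T,) per-token length (in BPE tokens) of the enclosing
    # word; table: (max_len, d_model) learned embeddings. Lengths beyond
    # the table are clipped to the last row. Output is added at layer 0.
    idx = np.minimum(word_lengths, max_len - 1)
    return table[idx]
```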
WARP-Pos
Word position bias applied to Q and K before RoPE using within-word position embeddings.
parameters: {"params":1035}
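A sketch of WARP-Pos as described: a learned embedding indexed by a token's position within its word, added to queries and keys before RoPE is applied. The table size and the shared (Q and K use the same vector) design are assumptions; the small parameter count above suggests a compact table.

```python
import numpy as np

def warp_pos_bias(q, k, within_word_pos, pos_table, max_pos=8):
    # q, k: (T, head_dim); within_word_pos: (T,) index of each token inside
    # its word; pos_table: (max_pos, head_dim) learned vectors. The bias is
    # added to both Q and K prior to rotary position embedding.
    idx = np.minimum(within_word_pos, max_pos - 1)
    bias = pos_table[idx]
    return q + bias, k + bias
```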
WARP-Type
Word type logit bias module using a classifier and learned type-vocabulary bias matrix.
parameters: {"params":176128,"types":64}
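A sketch of the WARP-Type mechanism as described: a small classifier predicts a distribution over 64 word types from the hidden state, and that distribution mixes the rows of a learned type-to-vocabulary bias matrix into an additive logit offset. Soft (probability-weighted) mixing is an assumption; a hard argmax lookup would also fit the description.

```python
import numpy as np

def warp_type_logit_bias(h, W_cls, B_type):
    # h: (T, d) hidden states; W_cls: (d, n_types) type classifier weights;
    # B_type: (n_types, vocab) learned per-type vocabulary bias matrix.
    z = h @ W_cls
    z -= z.max(axis=-1, keepdims=True)           # numerically stable softmax
    p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
    return p @ B_type                            # (T, vocab), added to logits
```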
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adam_split":true}
Weight Averaging
EMA
EMA weight averaging disabled for this run (decay 0).
parameters: {"decay":0,"disabled":true}
Quantization
GPTQ
6-bit GPTQ quantization applied to the whole model.
bits: 6
scope: model
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
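A sketch of strided sliding-window evaluation with the stride-64 setting above: each window scores only its last `stride` tokens, with the rest serving as context, so every token is scored exactly once with near-full context. The 2048 default window (matching the train length) is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Yield (context_start, end, score_from) spans: loss is computed only
    # on tokens in [score_from, end); tokens in [context_start, score_from)
    # are context. Advancing by `stride` covers every token exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Smaller strides trade more forward passes for more context per scored token.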
Test-Time Training
score-first TTT
parameters: {"epochs":2,"freeze_blocks":2,"learning_rate":0.002}
SLOT
parameters: {"learning_rate":0.005,"steps":8,"context_only":true}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":250}
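A sketch of the trapezoidal schedule these parameters describe: linear warmup over 20 steps, a flat plateau, then a linear "warmdown" to zero over the final 250 iterations. The exact interpolation endpoints are assumptions.

```python
def lr_at(step, base_lr, total_steps, warmup_steps=20, warmdown_iters=250):
    # Trapezoidal LR: ramp up linearly, hold at base_lr, then decay
    # linearly to zero over the last warmdown_iters steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return base_lr * remaining / warmdown_iters
    return base_lr
```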
Regularization
logit softcap
parameters: null
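Logit softcapping smoothly bounds logits instead of clipping them. A minimal sketch, with the cap value assumed (the entry does not state one):

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly bound logits to (-cap, cap) via tanh; near zero the map is
    # approximately the identity, so small logits are barely affected.
    return cap * math.tanh(logit / cap)
```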
Novel Contributions
- WARP-Len: word length embeddings injected at the input layer
- WARP-Pos: word position bias applied to queries and keys before RoPE
- WARP-Type: word type logit bias module at the output layer
- compute_word_boundary_maps() derives word boundaries from token IDs using SentencePiece leading-space conventions
- Disabling EMA for short runs improved validation performance
- Context-only SLOT variant that excludes new tokens from the loss
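The word-boundary convention underlying the WARP modules can be sketched as follows. This assumes decoded piece strings are available (the actual helper works from token IDs) and uses the SentencePiece convention that a leading "▁" (U+2581) marks the start of a new word; the function name follows the contribution list above, but the signature is illustrative.

```python
def compute_word_boundary_maps(pieces):
    # Map each token to a word index: a piece starting with the
    # SentencePiece whitespace marker U+2581 opens a new word, and the
    # first piece always starts word 0 even without the marker.
    word_ids, wid = [], -1
    for i, p in enumerate(pieces):
        if p.startswith("\u2581") or i == 0:
            wid += 1
        word_ids.append(wid)
    return word_ids
```

For example, the pieces ["▁Hello", "▁wor", "ld"] map to word indices [0, 1, 1], grouping "▁wor" and "ld" into one word.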