PR #1252 (open)
Record(?): WARP (Word-Aware Representation Priors) — val_bpb 1.0713 | 1xH100 10min | 13.65 MB
by ahmetdenizyilmaz
val_bpb: 1.0713
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.65 MB
Training Techniques
Architecture
LeakyReLU
LeakyReLU(0.5) squared MLP activation in the base transformer stack.
parameters: {"squared":true,"negative_slope":0.5}
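A minimal sketch of the activation as described, assuming "squared" means a plain square applied after the LeakyReLU (rather than a sign-preserving variant); the slope value 0.5 comes from the parameters above.

```python
def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU(negative_slope) followed by squaring. The squaring is
    # assumed to be a plain square, which makes the output non-negative
    # on both branches while the slope keeps gradient flowing for x < 0.
    y = x if x > 0 else negative_slope * x
    return y * y
```

For example, an input of -2.0 passes through the leaky branch as -1.0 and squares to 1.0.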
BigramHash
Bigram hash embedding/bucket mechanism retained in the architecture.
parameters: {"buckets":2816}
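A sketch of one common form of bigram hash embedding, using the 2816-bucket count above; the hash multiplier, the start-of-sequence placeholder, and the additive combination with the token embedding are illustrative assumptions, not the PR's exact scheme.

```python
import numpy as np

def bigram_hash_embed(token_ids, table, buckets=2816, mult=1000003):
    # Hash each (previous, current) token-ID pair into one of `buckets`
    # buckets and gather an auxiliary embedding row, typically added to
    # the ordinary token embedding at the input.
    out = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed start-of-sequence placeholder ID
    for i, cur in enumerate(token_ids):
        bucket = (prev * mult + cur) % buckets
        out[i] = table[bucket]
        prev = cur
    return out
```

Hashing keeps the table small and collision-tolerant: 2816 buckets cover all bigrams without a quadratic vocab-squared table.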
XSA
XSA attention/sequence architecture used across all layers.
parameters: {"layers":11}
SmearGate
SmearGate enabled in the model.
parameters: null
Partial RoPE
Partial rotary position embedding used in the base architecture.
parameters: null
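A sketch of partial RoPE: rotary embedding applied to only a leading fraction of each head's dimensions, with the rest passed through unrotated. The 0.5 fraction and the rotate-half pairing convention are assumptions for illustration.

```python
import numpy as np

def partial_rope(x, rotary_frac=0.5, base=10000.0):
    # x: (T, head_dim). Rotate only the first rotary_frac of the head
    # dimension with standard RoPE; remaining dims pass through unchanged.
    T, D = x.shape
    r = int(D * rotary_frac) // 2 * 2          # rotated dims, rounded to even
    half = r // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)     # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:r]         # rotate-half pairing
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, r:]], axis=1)
```

The unrotated tail gives the model position-independent channels alongside the position-encoded ones.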
WARP-Len
Word length embedding injected at layer 0 based on BPE word length.
parameters: {"params":6657}
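A sketch of the WARP-Len idea: each token looks up an embedding keyed by the BPE-token length of the word it belongs to, and the result is added to the layer-0 residual stream. The table size and length clipping are illustrative assumptions.

```python
import numpy as np

def warp_len_embed(word_lengths, table, max_len=16):
    # word_lengths: (T,) per-token length (in BPE tokens) of the enclosing
    # word; table: (max_len, d_model) learned embeddings. Lengths beyond
    # the table are clipped to the last row. Output is added at layer 0.
    idx = np.minimum(word_lengths, max_len - 1)
    return table[idx]
```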
WARP-Pos
Word position bias applied to Q and K before RoPE using within-word position embeddings.
parameters: {"params":1035}
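A sketch of WARP-Pos as described: a learned embedding indexed by a token's position within its word, added to queries and keys before RoPE is applied. The table size and the shared (Q and K use the same vector) design are assumptions; the small parameter count above suggests a compact table.

```python
import numpy as np

def warp_pos_bias(q, k, within_word_pos, pos_table, max_pos=8):
    # q, k: (T, head_dim); within_word_pos: (T,) index of each token inside
    # its word; pos_table: (max_pos, head_dim) learned vectors. The bias is
    # added to both Q and K prior to rotary position embedding.
    idx = np.minimum(within_word_pos, max_pos - 1)
    bias = pos_table[idx]
    return q + bias, k + bias
```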
WARP-Type
Word type logit bias module using a classifier and learned type-vocabulary bias matrix.
parameters: {"params":176128,"types":64}
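A sketch of the WARP-Type mechanism as described: a small classifier predicts a distribution over 64 word types from the hidden state, and that distribution mixes the rows of a learned type-to-vocabulary bias matrix into an additive logit offset. Soft (probability-weighted) mixing is an assumption; a hard argmax lookup would also fit the description.

```python
import numpy as np

def warp_type_logit_bias(h, W_cls, B_type):
    # h: (T, d) hidden states; W_cls: (d, n_types) type classifier weights;
    # B_type: (n_types, vocab) learned per-type vocabulary bias matrix.
    z = h @ W_cls
    z -= z.max(axis=-1, keepdims=True)           # numerically stable softmax
    p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
    return p @ B_type                            # (T, vocab), added to logits
```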
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adam_split":true}
Weight Averaging
EMA
EMA weight averaging disabled for this run (decay 0).
parameters: {"decay":0,"disabled":true}
Quantization
GPTQ
6-bit GPTQ quantization applied to the whole model.
bits: 6
scope: model
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
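A sketch of strided sliding-window evaluation with the stride-64 setting above: each window scores only its last `stride` tokens, with the rest serving as context, so every token is scored exactly once with near-full context. The 2048 default window (matching the train length) is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Yield (context_start, end, score_from) spans: loss is computed only
    # on tokens in [score_from, end); tokens in [context_start, score_from)
    # are context. Advancing by `stride` covers every token exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Smaller strides trade more forward passes for more context per scored token.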
Test-Time Training
score-first TTT
parameters: {"epochs":2,"freeze_blocks":2,"learning_rate":0.002}
SLOT
parameters: {"learning_rate":0.005,"steps":8,"context_only":true}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":250}
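A sketch of the trapezoidal schedule these parameters describe: linear warmup over 20 steps, a flat plateau, then a linear "warmdown" to zero over the final 250 iterations. The exact interpolation endpoints are assumptions.

```python
def lr_at(step, base_lr, total_steps, warmup_steps=20, warmdown_iters=250):
    # Trapezoidal LR: ramp up linearly, hold at base_lr, then decay
    # linearly to zero over the last warmdown_iters steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return base_lr * remaining / warmdown_iters
    return base_lr
```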
Regularization
logit softcap
parameters: null
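Logit softcapping smoothly bounds logits instead of clipping them. A minimal sketch, with the cap value assumed (the entry does not state one):

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly bound logits to (-cap, cap) via tanh; near zero the map is
    # approximately the identity, so small logits are barely affected.
    return cap * math.tanh(logit / cap)
```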
Novel Contributions
- WARP-Len: word length embeddings injected at the input layer
- WARP-Pos: word position bias applied to queries and keys before RoPE
- WARP-Type: word type logit bias module at the output layer
- compute_word_boundary_maps() derives word boundaries from token IDs using SentencePiece leading-space conventions
- Disabling EMA for short runs improved validation performance
- Context-only SLOT variant that excludes new tokens from the loss
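The word-boundary convention underlying the WARP modules can be sketched as follows. This assumes decoded piece strings are available (the actual helper works from token IDs) and uses the SentencePiece convention that a leading "▁" (U+2581) marks the start of a new word; the function name follows the contribution list above, but the signature is illustrative.

```python
def compute_word_boundary_maps(pieces):
    # Map each token to a word index: a piece starting with the
    # SentencePiece whitespace marker U+2581 opens a new word, and the
    # first piece always starts word 0 even without the marker.
    word_ids, wid = [], -1
    for i, p in enumerate(pieces):
        if p.startswith("\u2581") or i == 0:
            wid += 1
        word_ids.append(wid)
    return word_ids
```

For example, the pieces ["▁Hello", "▁wor", "ld"] map to word indices [0, 1, 1], grouping "▁wor" and "ld" into one word.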