val_bpb: 1.1659
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,070,662 bytes
Training Techniques
Architecture
Memory Tokens
64 learnable embedding vectors overwrite the first K positions of each sequence, giving the model a shared global-context scratchpad that every later position can read via causal attention.
parameters: {"tokens":64}
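A minimal sketch of the overwrite variant described above; names and shapes are illustrative, not the submission's actual code:

```python
import numpy as np

def apply_memory_tokens(x, memory):
    # x: (seq, dim) token embeddings; memory: (K, dim) learned vectors.
    # The first K positions are overwritten with the shared scratchpad;
    # causal attention then lets every later position attend to it.
    out = x.copy()
    out[: memory.shape[0]] = memory
    return out
```

In training, `memory` would be a learnable parameter updated by the optimizer like any other embedding.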
MLP3x
The MLP hidden width is 3x the model dimension, rather than the conventional 4x.
parameters: {"multiplier":3}
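The block reduces to a standard two-matrix MLP with a narrower hidden layer; this sketch uses ReLU as a stand-in for whatever activation the submission actually uses:

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    # w_in: (d, 3*d), w_out: (3*d, d) -- hidden width is 3x d_model
    # instead of the conventional 4x.
    return np.maximum(x @ w_in, 0.0) @ w_out
```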
BigramHash
Hashes each consecutive token pair to an index into a BigramHashEmbedding table, injecting local bigram context at the embedding level.
parameters: {"vocab_size":10240}
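One plausible indexing scheme, assuming a simple multiplicative hash (the constant and the sentinel for position 0 are illustrative assumptions):

```python
def bigram_hash(prev_tok, tok, table_size=10240):
    # Illustrative multiplicative hash; the constant is arbitrary,
    # not necessarily what the submission uses.
    return ((prev_tok * 1000003) ^ tok) % table_size

def bigram_indices(tokens, table_size=10240):
    # Position 0 has no predecessor, so pair it with a sentinel id 0.
    return [bigram_hash(tokens[i - 1] if i > 0 else 0, tokens[i], table_size)
            for i in range(len(tokens))]
```

Each index then selects a row of the 10240-entry embedding table, which is added to the token's regular embedding.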
SmearGate
Blends each token's embedding with the previous token's embedding through a learned gate.
parameters: null
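A sketch of one plausible form of the gate; the exact blend (additive vs. convex, per-dimension vs. scalar) is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, g):
    # x: (seq, dim) embeddings; g: (dim,) learned gate logits.
    # Each position mixes in a gated fraction of the previous
    # token's embedding; position 0 has no predecessor.
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + sigmoid(g) * prev
```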
Partial RoPE
Applies rotary position encoding to only a subset of each head's dimensions (16 of 64); the remaining dimensions carry no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
tied embeddings
Input and output embeddings are tied.
parameters: null
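Tying means a single matrix serves as both the input lookup table and the output projection, halving embedding storage (which matters for the artifact size above). A minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64
W = rng.normal(size=(vocab, dim)) / np.sqrt(dim)  # one matrix, two roles

def embed(token_ids):
    return W[token_ids]          # lookup: ids -> (n, dim)

def unembed(h):
    return h @ W.T               # project: (n, dim) -> (n, vocab) logits
```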
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
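With 8 attention heads and 4 KV heads, consecutive pairs of query heads share one KV head, halving the KV projection weights and KV cache. The standard grouping rule:

```python
def kv_head_for_query(q_head, attention_heads=8, kv_heads=4):
    # Consecutive groups of attention_heads // kv_heads query heads
    # share a single KV head.
    group = attention_heads // kv_heads
    return q_head // group
```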
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"scope":"embed/scalar"}
Weight Averaging
EMA
parameters: {"decay":0.997,"every_steps":10}
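The averaged weights track the live parameters with an exponential moving average; with `every_steps: 10`, this update would run once every 10 optimizer steps rather than every step. A minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step over a dict of parameter tensors/scalars.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```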
Quantization
mixed int5/int6 QAT
bits: 5 (MLP weights), 6 (attention weights)
scope: MLP weights and attention weights
fp16
bits: 16
scope: tied embeddings and small tensors
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":128,"seq_len":1024,"batched_windows":256}
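A sketch of the window bookkeeping implied by these parameters: each 1024-token window slides by 128, scores only its last 128 positions (earlier positions are warm-up context already scored), and windows are then batched (e.g. 256 at a time) through the compiled forward pass. The exact scheduling is an assumption:

```python
def eval_plan(n_tokens, seq_len=1024, stride=128):
    # Window i feeds tokens[i*stride : i*stride + seq_len]. The first
    # window scores every predictable position; later windows score
    # only their last `stride` positions.
    plan, i = [], 0
    while True:
        start = i * stride
        end = min(start + seq_len, n_tokens)
        lo = 1 if i == 0 else max(plan[-1][2], end - stride)
        plan.append((start, lo, end))   # feed [start:end], score [lo:end)
        if end == n_tokens:
            return plan
        i += 1
```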
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
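A common form of warmdown is a constant learning rate followed by a linear decay to zero over the final `warmdown_steps`; this sketch assumes that shape (`total_steps` is illustrative):

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    # 1.0 for most of training, then linear decay to 0 over the
    # final warmdown_steps.
    steps_left = total_steps - step
    return max(0.0, min(1.0, steps_left / warmdown_steps))
```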
Regularization
weight decay
parameters: {"memory_tokens_exempt":true,"weight_decay":0.04}
Other
other
Late QAT: fake int6 quantization with a straight-through estimator (STE) is enabled once the LR schedule's scale drops below 0.1.
parameters: {"lr_scale_threshold":0.1,"quant_bits":6}
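Fake quantization snaps weights to an integer grid while keeping them in float, so the forward pass sees quantized values; with an STE, the backward pass treats the rounding as identity. A minimal symmetric per-tensor sketch (the scaling scheme is an assumption):

```python
import numpy as np

def fake_quant(w, bits=6):
    # Symmetric per-tensor fake quantization: round to a bits-wide
    # integer grid, then rescale back to float.
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```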
Novel Contributions
- Memory tokens: 64 learnable embedding vectors used as a global context scratchpad.
- A/B test: memory tokens reduce val_bpb by 0.014 versus an identical configuration without them.
- Mixed quantization scheme using int5 for MLP weights and int6 for attention weights.
- Batched sliding-window evaluation with compiled forward_logits.