PR #532 (closed)
Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487)
by NotADevIAmaMeatPopsicle
val_bpb
1.0487
Architecture
Transformer
Optimizer
AdamW
Artifact Size
14.12 MB
Training Techniques
Quantization
mixed int8/fp16 with custom codebook quantization
bits: 8
scope: all weights except tied embeddings; per-tensor codebook levels for MLP/QKV/proj
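As a sketch of how per-tensor codebook quantization could work (this is an illustration, not the submission's actual code; function names and the 1-D k-means details are assumptions):

```python
import numpy as np

def kmeans_codebook(w, k, iters=20):
    """Fit a k-level scalar codebook to one weight tensor with 1-D k-means,
    returning the codebook and the per-element codebook indices."""
    flat = w.ravel()
    # Initialize centroids at quantiles so every level starts populated.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                codebook[j] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    # k <= 256, so indices fit in one byte (matching "bits: 8" above).
    return codebook, idx.astype(np.uint8).reshape(w.shape)

def dequantize(codebook, idx):
    """Reconstruct an approximate weight tensor from codebook + indices."""
    return codebook[idx]
```

With the record's codebook sizes (48/80/64 levels), every index fits in 8 bits before entropy coding.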
Architecture
tied embeddings
Input and output embeddings are tied; token embedding kept in fp16 in the submission README.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
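A minimal numpy sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads); the shapes and masking here are illustrative assumptions, not the submission's code:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads. q: (T, n_heads, d); k, v: (T, n_kv_heads, d)."""
    group = n_heads // n_kv_heads
    # Repeat KV heads so they line up with the query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    T, _, d = q.shape
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask[None], -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

Halving the KV heads halves the KV cache and the KV projection weights, which also helps the artifact-size budget.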
Partial RoPE
Rotary positional embeddings applied to only a subset of the head dimensions (16 of 64); the rest pass through unrotated.
parameters: {"dimensions":"16/64"}
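A sketch of partial RoPE under the assumption that the rotated dimensions are the first 16 of each 64-dim head (which 16 dims are rotated is not stated in the record):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims of each head dimension;
    the remaining dims pass through unchanged. x: (T, n_heads, head_dim)."""
    T = x.shape[0]
    xr, xp = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = xr[..., :half], xr[..., half:]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, xp], axis=-1)
```

Rotation preserves norms, and position 0 is the identity, so only relative position information is injected.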
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
weight tying
Tied embeddings between token embedding and output projection.
parameters: null
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.03,"ema_decay":0.997,"hybrid_with":"NorMuon"}
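The record hybridizes AdamW with NorMuon (matrix_lr 0.03). The usual split routes 2-D matrix weights to the Muon-style optimizer and everything else to AdamW; the sketch below illustrates that grouping rule, which is an assumption (the record does not state how parameters are partitioned):

```python
import numpy as np

def split_param_groups(named_params, matrix_lr=0.03, adamw_lr=1e-3,
                       weight_decay=0.04):
    """Assumed grouping: 2-D non-embedding weights go to a NorMuon-style
    optimizer at matrix_lr; embeddings, norms, and biases go to AdamW."""
    matrix, other = [], []
    for name, p in named_params.items():
        if p.ndim == 2 and 'embed' not in name:
            matrix.append(name)
        else:
            other.append(name)
    return (
        {"optimizer": "NorMuon", "lr": matrix_lr, "params": matrix},
        {"optimizer": "AdamW", "lr": adamw_lr,
         "weight_decay": weight_decay, "params": other},
    )
```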
Weight Averaging
EMA
parameters: {"decay":0.997}
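EMA weight averaging with the listed decay reduces to a one-line update per parameter; a minimal sketch:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights (decay=0.997 per the record).
    The shadow copy, not the raw weights, is what gets evaluated/serialized."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

With decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.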
Compression
custom
level: null
Evaluation
sliding window eval
parameters: {"stride":32}
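Stride-32 sliding-window eval scores each window's new tokens only, so every token (after the first window) sees a long left context. A sketch of the window planning, assuming the standard stride-based scheme (window size is an assumption taken from the 2048 train length above):

```python
def sliding_windows(n_tokens, window=2048, stride=32):
    """Plan context windows for stride-based eval: the model sees tokens
    [begin:end) but loss is computed only on [score_from:end), so each
    scored token gets up to window - stride tokens of left context."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A small stride trades much more compute (one forward pass per 32 scored tokens) for a lower, more context-faithful bpb.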
Test-Time Training
AdamW TTT
parameters: {"epochs":10,"learning_rate":0.001,"grad_clip":1,"all_params_unfrozen":true}
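The TTT step is plain AdamW with the listed hyperparameters (lr 1e-3, grad clip 1.0, all parameters unfrozen). A minimal numpy sketch of one such update; the epoch loop and per-layer learning-rate groups mentioned under Novel Contributions would wrap this:

```python
import numpy as np

def adamw_step(w, g, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=0.04, grad_clip=1.0):
    """One AdamW update with global-norm gradient clipping."""
    norm = np.linalg.norm(g)
    if norm > grad_clip:
        g = g * (grad_clip / norm)
    state['t'] += 1
    state['m'] = betas[0] * state['m'] + (1 - betas[0]) * g
    state['v'] = betas[1] * state['v'] + (1 - betas[1]) * g * g
    m_hat = state['m'] / (1 - betas[0] ** state['t'])
    v_hat = state['v'] / (1 - betas[1] ** state['t'])
    # Decoupled weight decay: applied to w directly, not mixed into the gradient.
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
```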
Initialization
OrthoInit
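Orthogonal initialization draws each weight matrix with orthonormal rows or columns, typically via QR of a Gaussian matrix; a sketch (the gain and sign-correction details are the standard recipe, not taken from the submission):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix so the smaller
    dimension of the result is orthonormal."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    # Sign fix so the result is uniform over orthogonal matrices.
    q *= np.sign(np.diag(r))
    return gain * (q if tall else q.T)
```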
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"ttt_epochs":10}
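The record ties the cosine schedule to the 10 TTT epochs. A generic sketch of cosine decay (base and minimum learning rates here are placeholders, not the submission's values):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```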
Regularization
weight decay
parameters: {"weight_decay":0.04}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
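The layerwise LN scale damps each layer's normalized output by 1/sqrt(layer+1); a minimal sketch, assuming the scale multiplies the LayerNorm output (where exactly the factor is applied is not stated in the record):

```python
import numpy as np

def scaled_layernorm(x, layer, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1), so deeper
    layers contribute progressively smaller residual updates."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer + 1)
```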
Other
other
Codebook quantization with per-tensor k-means codebooks and Huffman entropy coding of the indices, followed by a final zstd (level 22) compression pass.
parameters: {"codebook_sizes":{"mlp":48,"qkv":80,"proj":64}}
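Since codebook indices are far from uniform (k-means centroids near the distribution's mode are used much more often), Huffman coding shrinks them well below 8 bits each. A self-contained sketch of the index-coding stage (the final zstd-22 pass over the packed output is not shown):

```python
import heapq
from collections import Counter

def huffman_code(indices):
    """Build a Huffman code over codebook indices."""
    freq = Counter(indices)
    if len(freq) == 1:
        return {next(iter(freq)): '0'}
    # Heap entries carry a unique id so ties never compare the symbol lists.
    heap = [(n, i, [sym]) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    codes = {sym: '' for sym in freq}
    uid = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1:
            codes[s] = '0' + codes[s]
        for s in s2:
            codes[s] = '1' + codes[s]
        heapq.heappush(heap, (n1 + n2, uid, s1 + s2))
        uid += 1
    return codes

def encode(indices, codes):
    return ''.join(codes[i] for i in indices)

def decode(bits, codes):
    inv = {v: k for k, v in codes.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ''
    return out
```

In a real artifact the bit string would be packed into bytes and then compressed once more with zstd at level 22.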
Novel Contributions
- Per-tensor k-means codebook quantization tuned across multiple experiments
- Huffman entropy coding of codebook indices to exploit non-uniform distributions
- Custom PCLL binary format with final zstd-22 compression
- AdamW test-time training with per-layer learning-rate groups
- Combining codebook compression with Huffman coding to make the artifact fit under the 16 MB cap