PR #532 (closed)
Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487)
by NotADevIAmaMeatPopsicle
val_bpb
1.0487
Architecture
Transformer
Optimizer
AdamW
Artifact Size
14.12 MB
Training Techniques
Quantization
mixed int8/fp16 with custom codebook quantization
bits: 8
scope: all weights except tied embeddings; per-tensor codebook levels for MLP/QKV/proj
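As a sketch of how per-tensor codebook quantization could work (this is an illustration, not the submission's actual code; function names and the 1-D k-means details are assumptions):

```python
import numpy as np

def kmeans_codebook(w, k, iters=20):
    """Fit a k-level scalar codebook to one weight tensor with 1-D k-means,
    returning the codebook and the per-element codebook indices."""
    flat = w.ravel()
    # Initialize centroids at quantiles so every level starts populated.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                codebook[j] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    # k <= 256, so indices fit in one byte (matching "bits: 8" above).
    return codebook, idx.astype(np.uint8).reshape(w.shape)

def dequantize(codebook, idx):
    """Reconstruct an approximate weight tensor from codebook + indices."""
    return codebook[idx]
```

With the record's codebook sizes (48/80/64 levels), every index fits in 8 bits before entropy coding.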
Architecture
tied embeddings
Input and output embeddings are tied; token embedding kept in fp16 in the submission README.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
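A minimal numpy sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads); the shapes and masking here are illustrative assumptions, not the submission's code:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads. q: (T, n_heads, d); k, v: (T, n_kv_heads, d)."""
    group = n_heads // n_kv_heads
    # Repeat KV heads so they line up with the query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    T, _, d = q.shape
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask[None], -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

Halving the KV heads halves the KV cache and the KV projection weights, which also helps the artifact-size budget.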
Partial RoPE
Rotary positional embeddings applied to only a subset of the head dimensions (16 of 64); the rest pass through unrotated.
parameters: {"dimensions":"16/64"}
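A sketch of partial RoPE under the assumption that the rotated dimensions are the first 16 of each 64-dim head (which 16 dims are rotated is not stated in the record):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims of each head dimension;
    the remaining dims pass through unchanged. x: (T, n_heads, head_dim)."""
    T = x.shape[0]
    xr, xp = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = xr[..., :half], xr[..., half:]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, xp], axis=-1)
```

Rotation preserves norms, and position 0 is the identity, so only relative position information is injected.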
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
weight tying
Tied embeddings between token embedding and output projection.
parameters: null
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.03,"ema_decay":0.997,"hybrid_with":"NorMuon"}
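The record hybridizes AdamW with NorMuon (matrix_lr 0.03). The usual split routes 2-D matrix weights to the Muon-style optimizer and everything else to AdamW; the sketch below illustrates that grouping rule, which is an assumption (the record does not state how parameters are partitioned):

```python
import numpy as np

def split_param_groups(named_params, matrix_lr=0.03, adamw_lr=1e-3,
                       weight_decay=0.04):
    """Assumed grouping: 2-D non-embedding weights go to a NorMuon-style
    optimizer at matrix_lr; embeddings, norms, and biases go to AdamW."""
    matrix, other = [], []
    for name, p in named_params.items():
        if p.ndim == 2 and 'embed' not in name:
            matrix.append(name)
        else:
            other.append(name)
    return (
        {"optimizer": "NorMuon", "lr": matrix_lr, "params": matrix},
        {"optimizer": "AdamW", "lr": adamw_lr,
         "weight_decay": weight_decay, "params": other},
    )
```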
Weight Averaging
EMA
parameters: {"decay":0.997}
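EMA weight averaging with the listed decay reduces to a one-line update per parameter; a minimal sketch:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights (decay=0.997 per the record).
    The shadow copy, not the raw weights, is what gets evaluated/serialized."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

With decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.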
Compression
custom
level: null
Evaluation
sliding window eval
parameters: {"stride":32}
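Stride-32 sliding-window eval scores each window's new tokens only, so every token (after the first window) sees a long left context. A sketch of the window planning, assuming the standard stride-based scheme (window size is an assumption taken from the 2048 train length above):

```python
def sliding_windows(n_tokens, window=2048, stride=32):
    """Plan context windows for stride-based eval: the model sees tokens
    [begin:end) but loss is computed only on [score_from:end), so each
    scored token gets up to window - stride tokens of left context."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A small stride trades much more compute (one forward pass per 32 scored tokens) for a lower, more context-faithful bpb.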
Test-Time Training
AdamW TTT
parameters: {"epochs":10,"learning_rate":0.001,"grad_clip":1,"all_params_unfrozen":true}
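The TTT step is plain AdamW with the listed hyperparameters (lr 1e-3, grad clip 1.0, all parameters unfrozen). A minimal numpy sketch of one such update; the epoch loop and per-layer learning-rate groups mentioned under Novel Contributions would wrap this:

```python
import numpy as np

def adamw_step(w, g, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=0.04, grad_clip=1.0):
    """One AdamW update with global-norm gradient clipping."""
    norm = np.linalg.norm(g)
    if norm > grad_clip:
        g = g * (grad_clip / norm)
    state['t'] += 1
    state['m'] = betas[0] * state['m'] + (1 - betas[0]) * g
    state['v'] = betas[1] * state['v'] + (1 - betas[1]) * g * g
    m_hat = state['m'] / (1 - betas[0] ** state['t'])
    v_hat = state['v'] / (1 - betas[1] ** state['t'])
    # Decoupled weight decay: applied to w directly, not mixed into the gradient.
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
```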
Initialization
OrthoInit
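Orthogonal initialization draws each weight matrix with orthonormal rows or columns, typically via QR of a Gaussian matrix; a sketch (the gain and sign-correction details are the standard recipe, not taken from the submission):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix so the smaller
    dimension of the result is orthonormal."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    # Sign fix so the result is uniform over orthogonal matrices.
    q *= np.sign(np.diag(r))
    return gain * (q if tall else q.T)
```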
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"ttt_epochs":10}
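The record ties the cosine schedule to the 10 TTT epochs. A generic sketch of cosine decay (base and minimum learning rates here are placeholders, not the submission's values):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```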
Regularization
weight decay
parameters: {"weight_decay":0.04}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
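The layerwise LN scale damps each layer's normalized output by 1/sqrt(layer+1); a minimal sketch, assuming the scale multiplies the LayerNorm output (where exactly the factor is applied is not stated in the record):

```python
import numpy as np

def scaled_layernorm(x, layer, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1), so deeper
    layers contribute progressively smaller residual updates."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer + 1)
```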
Other
other
Codebook quantization with per-tensor k-means codebooks and Huffman entropy coding of the indices, followed by a final zstd (level 22) compression pass.
parameters: {"codebook_sizes":{"mlp":48,"qkv":80,"proj":64}}
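Since codebook indices are far from uniform (k-means centroids near the distribution's mode are used much more often), Huffman coding shrinks them well below 8 bits each. A self-contained sketch of the index-coding stage (the final zstd-22 pass over the packed output is not shown):

```python
import heapq
from collections import Counter

def huffman_code(indices):
    """Build a Huffman code over codebook indices."""
    freq = Counter(indices)
    if len(freq) == 1:
        return {next(iter(freq)): '0'}
    # Heap entries carry a unique id so ties never compare the symbol lists.
    heap = [(n, i, [sym]) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    codes = {sym: '' for sym in freq}
    uid = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1:
            codes[s] = '0' + codes[s]
        for s in s2:
            codes[s] = '1' + codes[s]
        heapq.heappush(heap, (n1 + n2, uid, s1 + s2))
        uid += 1
    return codes

def encode(indices, codes):
    return ''.join(codes[i] for i in indices)

def decode(bits, codes):
    inv = {v: k for k, v in codes.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ''
    return out
```

In a real artifact the bit string would be packed into bytes and then compressed once more with zstd at level 22.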
Novel Contributions
- Per-tensor k-means codebook quantization tuned across multiple experiments
- Huffman entropy coding of codebook indices to exploit non-uniform distributions
- Custom PCLL binary format with final zstd-22 compression
- AdamW test-time training with per-layer learning-rate groups
- Combining codebook compression with Huffman coding to make the artifact fit under the 16 MB cap