PR #421
Non-record: 11L mixed int5/int6 + working QAT + TTT (val_bpb=1.1466)
by vytautas-bunevicius
val_bpb: 1.1466
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB
Training Techniques
Quantization
mixed int5/int6 QAT
bits: mixed (see scope)
scope: MLP int5, attention int6, embeddings int8
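A minimal sketch of what the mixed-bit QAT forward can look like, using symmetric per-tensor fake quantization with a straight-through estimator; the helper name and scaling scheme are assumptions, not the PR's code:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (an assumption)
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                   # STE: quantized forward, identity backward

# Bit widths per the PR's scope: MLP weights int5, attention int6, embeddings int8.
```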
Architecture
BigramHash
Increased bigram hash table size from 2048 to 10240 for token/context representation.
parameters: {"size":10240}
memory tokens
Added learnable global context tokens prepended during evaluation and masked during training.
parameters: {"tokens":64}
backout connection
Learned scalar connection subtracting encoder/decoder boundary state from final output.
parameters: {"parameters":1}
per-head temperature
Learned temperature parameter per attention head.
parameters: {"parameters":88}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
lr: 0.025
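For reference, the core of a Muon step orthogonalizes the momentum-averaged gradient of each 2D weight with a quintic Newton-Schulz iteration; this follows the public reference implementation, not necessarily this PR's variant:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (flatten its singular values toward 1)
    using the quintic Newton-Schulz iteration from the reference Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

# One step for a 2D weight W with this PR's settings (sketch):
#   buf = 0.99 * buf + W.grad                        # momentum = 0.99
#   W -= 0.025 * (newton_schulz(buf) + 0.04 * W)     # lr = 0.025, weight_decay = 0.04
```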
Weight Averaging
EMA
parameters: {"decay":0.997}
Evaluation
sliding window eval
parameters: {"stride":32}
Test-Time Training
full TTT
parameters: {"epochs":3,"optimizer":"SGD","time":"83s"}
Initialization
ortho+muP init
Orthogonal plus muP initialization.
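One common way to combine these two pieces, orthogonal init followed by muP-style width-dependent rescaling, shown only as a sketch of the idea rather than the PR's exact recipe:

```python
import torch.nn as nn

def ortho_mup_init_(linear: nn.Linear, width_mult: float = 1.0):
    """Orthogonal init, then scale hidden-layer weights by 1/sqrt(width
    multiplier) in the muP spirit; width_mult is relative to a base width."""
    nn.init.orthogonal_(linear.weight)
    linear.weight.data.mul_(width_mult ** -0.5)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```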
Regularization
layerwise LN scale
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Working QAT fix by swapping per-instance forward methods to avoid torch.compile constant folding (see the sketch after this list)
- Mixed int5 MLP / int6 attention quantization with 3% magnitude pruning
- Test-time training with post-quantization SGD on validation tokens
- Expanded BigramHash from 2048 to 10240
- Added 64 learnable memory tokens
- Added a learned backout connection
- Added per-head temperature parameters
- Reduced evaluation stride to 32
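
Per the first bullet above, a sketch of the per-instance forward swap: binding a quantized forward to each Linear instance, rather than branching on a flag inside a shared class method, keeps torch.compile from constant-folding the quantization away. `fake_quant` is the STE helper sketched in the Quantization section, and the module paths in the usage note are hypothetical:

```python
import types
import torch.nn as nn
import torch.nn.functional as F

def enable_qat(linear: nn.Linear, bits: int):
    """Bind a fake-quantized forward to this specific instance. Because the
    swap happens per instance, the compiled graph traces the quantized path
    instead of folding a constant 'quantize?' flag."""
    def qat_forward(self, x):
        return F.linear(x, fake_quant(self.weight, bits), self.bias)
    linear.forward = types.MethodType(qat_forward, linear)

# Usage sketch: int5 for MLP projections, int6 for attention projections.
#   enable_qat(block.mlp.fc, bits=5)
#   enable_qat(block.attn.qkv, bits=6)
```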