PR #937 (open)
[Non-Record Submission] CompressedUT CE + EMA Export + Export-Aligned Late QAT (1.4457 BPB)
by mihir-s-05
val_bpb: 1.4457
Architecture: Transformer
Optimizer: —
Artifact Size: 14,707,311 bytes
Training Techniques
Architecture
BigramHash
Uses hashed bigram features in the byte-level compressed_ut model.
parameters: {"dimensions":96}
Partial RoPE
Uses partial rotary position encoding in the transformer backbone.
parameters: {"dimensions":32}
weight tying
Not explicitly stated in the PR body, and no evidence of its use is present.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
late QAT
bits: 6
scope: exported artifact
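Late QAT at 6 bits means the final stretch of training runs the forward pass through the same quantize/dequantize rounding the exported artifact will see. A minimal fake-quantization sketch, assuming symmetric quantization with a ±31 integer range (the PR states only "bits: 6"; in real training gradients would flow through via a straight-through estimator):

```python
def fake_quant_int6(w: list[float], clip: float) -> list[float]:
    """Symmetric 6-bit fake quantization: clip, quantize to integers in
    [-31, 31], then dequantize, so the forward pass sees export rounding."""
    levels = 2 ** 5 - 1          # 31; hypothetical symmetric int6 range
    scale = clip / levels
    out = []
    for x in w:
        q = round(max(-clip, min(clip, x)) / scale)
        out.append(q * scale)
    return out
```

"Export-aligned" here means this fake quantizer matches the packer's quantizer exactly, so there is no train/export mismatch left to pay for at evaluation time.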
Compression
zlib
level: 9
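The artifact byte stream is compressed with zlib at level 9 (maximum compression). A minimal sketch using Python's standard zlib module; the function names are illustrative, not the PR's:

```python
import zlib

def pack_artifact(payload: bytes, level: int = 9) -> bytes:
    """Compress the packed weight bytes at zlib's maximum level."""
    return zlib.compress(payload, level)

def unpack_artifact(blob: bytes) -> bytes:
    """Lossless round-trip back to the original payload."""
    return zlib.decompress(blob)
```

Level 9 trades compression speed for size, which matters here since the artifact must fit a fixed byte budget.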
Sequence Length
sequence_length
train_length: 1536
eval_length: 32768
Evaluation
full validation eval
parameters: {"scope":"full FineWeb validation split"}
Other
other
Export-aligned quantization-aware training to match the quantizer used at artifact export.
parameters: {"threshold":0.05}
Novel Contributions
- EMA export weights for the shipped artifact
- Export-aligned late QAT to reduce quantization gap
- Stronger int6 clip-search during packing
- Larger compressed-UT capacity within the 16MB budget
- CE-only training for the compressed_ut path
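The "stronger int6 clip-search during packing" bullet refers to choosing, per tensor, a clipping threshold for the 6-bit quantizer. A minimal sketch of one common approach, minimizing reconstruction MSE over candidate clips (the PR's actual search criterion and candidate grid are not stated):

```python
def clip_search(w: list[float], candidates: list[float]) -> float:
    """Pick the clip value minimizing MSE of symmetric int6 fake quantization.
    Clipping outliers costs error on those values but shrinks the step size
    for the bulk of the distribution."""
    best_clip, best_err = candidates[0], float("inf")
    for clip in candidates:
        scale = clip / 31  # symmetric int6: integer range [-31, 31] (assumed)
        err = 0.0
        for x in w:
            q = round(max(-clip, min(clip, x)) / scale)
            err += (x - q * scale) ** 2
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```

A finer candidate grid or a better error criterion makes the search "stronger" at the cost of packing time only, since the search runs once at export.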