PR #937

open

[Non-Record Submission] CompressedUT CE + EMA Export + Export-Aligned Late QAT (1.4457 BPB)

by mihir-s-05
val_bpb
1.4457
Architecture
Transformer
Optimizer
Artifact Size
14,707,311 bytes

Training Techniques

Architecture
BigramHash
Uses hashed bigram features in the byte-level compressed_ut model.
parameters: {"dimensions":96}
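A hashed bigram feature works by hashing each adjacent byte pair into a fixed-size embedding table and adding the looked-up vector to that position's representation. A minimal sketch of the idea, assuming a simple multiplicative hash and a randomly initialized table (the PR's exact hash function and table size are not stated):

```python
import numpy as np

def bigram_hash_features(byte_seq, table_size, dim, rng_seed=0):
    """Hashed bigram features: each adjacent byte pair (b[i-1], b[i])
    is hashed into a fixed-size embedding table; position i receives
    the looked-up vector. Position 0 has no preceding byte and stays zero."""
    rng = np.random.default_rng(rng_seed)
    table = rng.standard_normal((table_size, dim)).astype(np.float32) * 0.02
    feats = np.zeros((len(byte_seq), dim), dtype=np.float32)
    for i in range(1, len(byte_seq)):
        # Hypothetical multiplicative hash of the bigram.
        h = (byte_seq[i - 1] * 257 + byte_seq[i]) % table_size
        feats[i] = table[h]
    return feats
```

Because the table is shared via hashing, repeated bigrams map to identical feature vectors regardless of position.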
Partial RoPE
Uses partial rotary position encoding in the transformer backbone.
parameters: {"dimensions":32}
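Partial RoPE rotates only a leading slice of each head's channels (32 here) and passes the rest through unrotated. A sketch, assuming a split-halves pair layout and the standard base of 10000 (layout and base are assumptions, not stated in the PR):

```python
import numpy as np

def partial_rope(x, rot_dims=32, base=10000.0):
    """Apply rotary position encoding to the first `rot_dims` channels
    of each position; remaining channels pass through unchanged.
    x: (seq_len, head_dim) array for a single head."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    ang = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    out = x.copy()
    out[:, :half] = x1 * cos - x2 * sin        # rotate each (x1, x2) pair
    out[:, half:rot_dims] = x1 * sin + x2 * cos
    return out                                  # channels >= rot_dims untouched
```

Rotation preserves the norm of each pair, so the overall vector norm is unchanged; only relative phase encodes position.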
weight tying
Not explicitly stated in the PR body, and no evidence of weight tying is present.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
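An EMA of the weights with decay 0.997 keeps a slowly moving shadow copy that is exported instead of the raw training weights. A minimal sketch of the per-step update:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * param.
    The EMA copy, not the live training weights, is what ships
    in the exported artifact."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

With decay 0.997 the effective averaging window is roughly 1 / (1 - 0.997) ≈ 333 steps.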
Quantization
late QAT
bits: 6
scope: exported artifact
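Late QAT means the final stretch of training runs the forward pass through the same quantizer the exporter will apply, so the model adapts to the 6-bit grid before packing. A sketch of the fake-quant step, assuming a symmetric signed int6 range ([-31, 31]) and a simple absmax scale (the PR's exact quantizer parameters are assumptions):

```python
import numpy as np

def fake_quant_int6(w, scale=None):
    """Export-aligned fake quantization: round weights onto the
    symmetric 6-bit grid used at export, then dequantize. During
    late QAT the forward pass sees these values; gradients pass
    straight through the rounding in the backward pass."""
    qmax = 31  # signed 6-bit symmetric range
    if scale is None:
        scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

Matching the training-time quantizer to the export-time quantizer is what closes the quantization gap: the loss is measured on exactly the weights the artifact will contain.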
Compression
zlib
level: 9
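The packed weights are compressed with zlib at the maximum level before being counted against the artifact budget. A minimal sketch using Python's standard library:

```python
import zlib

def pack_artifact(raw_bytes):
    """Compress the serialized artifact with zlib at level 9
    (maximum compression), as used for the shipped artifact."""
    return zlib.compress(raw_bytes, level=9)
```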
Sequence Length
sequence_length
train_length: 1536
eval_length: 32768
Evaluation
full validation eval
parameters: {"scope":"full FineWeb validation split"}
Other
other
Export-aligned quantization-aware training to match the quantizer used at artifact export.
parameters: {"threshold":0.05}

Novel Contributions

  • EMA export weights for the shipped artifact
  • Export-aligned late QAT to reduce quantization gap
  • Stronger int6 clip-search during packing
  • Larger compressed-UT capacity within the 16MB budget
  • CE-only training for the compressed_ut path
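The "stronger int6 clip-search during packing" bullet refers to searching over clipping thresholds rather than always scaling by the absolute maximum: clipping a few outliers can shrink the step size and reduce overall error. A sketch of the idea, assuming a small grid over fractions of absmax and MSE as the selection metric (the PR's exact grid and metric are assumptions):

```python
import numpy as np

def clip_search_int6(w, ratios=np.linspace(0.5, 1.0, 21)):
    """Clip-search for int6 packing: try several clip points
    (fractions of absmax), quantize to the symmetric 6-bit grid,
    and keep the scale that minimizes quantization MSE."""
    qmax = 31
    absmax = np.abs(w).max()
    best_scale, best_err = None, np.inf
    for r in ratios:
        scale = (r * absmax) / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        err = np.mean((q * scale - w) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err
```

Since the unclipped scale (r = 1.0) is in the grid, the search can never do worse than plain absmax quantization.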