PR #275

disqualified

Non-record: Paid Prefix Research (val_bpb=1.0539, ruled out-of-scope)

by ibarrajo
val_bpb: 1.0539
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.97MB

Training Techniques

Quantization
int6
bits: 6
scope: model weights
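A minimal sketch of symmetric int6 quantization of model weights with a per-tensor scale. The submission's exact scheme (per-channel scales, bit packing, etc.) is not specified, so this is illustrative only:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor 6-bit quantization (sketch; the submission's
    actual packing/scale granularity is not specified)."""
    scale = np.abs(w).max() / 31.0  # map the max magnitude to the int6 limit
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)  # reconstruction error is at most scale / 2
```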
Architecture
SmearGate
Transformer variant using SmearGate blocks as part of the model recipe.
parameters: {"layers":8}
BigramHash
Uses a bigram hash mechanism with a 2048-bucket vocabulary component.
parameters: {"buckets":2048}
tied embeddings
Input and output embeddings are tied.
parameters: null
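The BigramHash component above maps (previous token, current token) pairs into a fixed number of buckets; a sketch with the submission's 2048 buckets (the hash function itself is an assumption, since it is not specified):

```python
NUM_BUCKETS = 2048  # from the submission's parameters

def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a (prev, current) token pair into a fixed bucket id. The bucket
    would typically index a small learned table (e.g. a logit bias); the
    multiplicative mix below is illustrative, not the submission's hash."""
    h = (prev_token * 1_000_003 + token) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % num_buckets
```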
Initialization
OrthoInit
Orthogonal initialization with muP scaling.
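Orthogonal initialization is commonly implemented as the QR decomposition of a Gaussian matrix; a sketch below, with the muP scale left as a plain parameter since the submission's exact scaling rule is not given:

```python
import numpy as np

def ortho_init(fan_out: int, fan_in: int, scale: float = 1.0, seed: int = 0) -> np.ndarray:
    """Orthogonal init via QR of a Gaussian matrix (sketch). `scale` stands
    in for the muP multiplier, which this submission does not specify."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q *= np.sign(np.diag(r))  # fix column signs so the result is Haar-uniform
    w = q if fan_out >= fan_in else q.T
    return scale * w

w = ortho_init(8, 4)  # columns are orthonormal
```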
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
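Muon's core step orthogonalizes each weight matrix's momentum buffer with a Newton-Schulz iteration; a numpy sketch (the quintic coefficients follow the public Muon implementation, and the learning rates and weight decay listed above are applied outside this step):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient/momentum matrix, the core of a
    Muon update (sketch; coefficients from the public Muon implementation,
    not this submission's exact code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T  # iterate on the wide orientation for the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x

g = np.random.default_rng(0).standard_normal((8, 16))
x = newton_schulz_orthogonalize(g)  # singular values pushed toward 1
```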
Weight Averaging
SWA
parameters: {"checkpoints":6}
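SWA here amounts to a uniform average of the weights from the last six checkpoints; a minimal sketch over numpy state dicts:

```python
import numpy as np

def average_checkpoints(states):
    """Stochastic Weight Averaging: uniform average of checkpoint weights
    (sketch; this submission averages 6 checkpoints)."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in states[0].items()}
    for state in states:
        for k, v in state.items():
            avg[k] += v  # accumulate in float64 to limit rounding error
    return {k: (v / len(states)).astype(np.float32) for k, v in avg.items()}

# Toy example: six checkpoints whose "w" tensors are 0..5, averaging to 2.5.
checkpoints = [{"w": np.full((2, 2), float(i), dtype=np.float32)} for i in range(6)]
swa = average_checkpoints(checkpoints)
```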
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
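Sliding-window evaluation advances a fixed context window by a small stride and scores only the newly uncovered positions, so every token after the first window is predicted with close to the full context. A sketch of the span bookkeeping, using this submission's window of 2048 and stride of 64:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (context_start, score_from, score_to) spans for sliding-window
    eval: the first window scores everything, each later window scores only
    its newest positions (sketch of the general recipe)."""
    covered = 0
    start = 0
    while covered < n_tokens:
        end = min(start + window, n_tokens)
        yield start, covered, end  # score positions [covered, end)
        covered = end
        start += stride

spans = list(sliding_windows(5000))  # every position scored exactly once
```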
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
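A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps; a sketch using the submission's 3000 warmdown steps (the decay shape is the usual linear choice, not confirmed by the submission):

```python
def lr_at(step: int, total_steps: int, base_lr: float, warmdown_steps: int = 3000) -> float:
    """Constant-then-linear-decay ("warmdown") schedule: hold base_lr, then
    decay linearly to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 across the warmdown
    return base_lr * frac
```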
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Paid prefix / direct token storage: stores LZMA-compressed validation target tokens as part of the artifact so that the covered positions are paid for by storage rather than model prediction, improving overall BPB.
parameters: {"coverage":0.1,"prefix_size_mb":4.24}
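The paid-prefix idea above can be sketched as follows: LZMA-compress the first fraction of the validation tokens and ship the blob inside the artifact, so those positions cost storage bits instead of model bits. The serialization below is an assumption; only the LZMA compression and the 0.1 coverage come from the submission:

```python
import lzma
import numpy as np

def build_paid_prefix(val_tokens: np.ndarray, coverage: float = 0.1) -> bytes:
    """Store the first `coverage` fraction of validation tokens verbatim,
    LZMA-compressed (sketch; framing/serialization details are assumed)."""
    n = int(len(val_tokens) * coverage)
    prefix = val_tokens[:n].astype(np.uint16)  # raw uint16 before compression
    return lzma.compress(prefix.tobytes(), preset=9)

def load_paid_prefix(blob: bytes) -> np.ndarray:
    return np.frombuffer(lzma.decompress(blob), dtype=np.uint16)

tokens = np.random.default_rng(1).integers(0, 50257, size=1000).astype(np.uint16)
blob = build_paid_prefix(tokens)
restored = load_paid_prefix(blob)  # first 10% of tokens, losslessly recovered
```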

Novel Contributions

  • Paid prefix / direct token storage as a hybrid compression strategy
  • Empirical comparison of model capacity versus prefix coverage under the 16MB budget
  • Compression research comparing raw uint16, LZMA, pack10, and bigram-rank prefix encodings
  • Observation that prefix coverage can matter more than model quality in this regime
  • Proposal of bigram-rank + varint + LZMA encoding for higher prefix coverage
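The proposed bigram-rank + varint + LZMA encoding could look roughly like this: replace each token with its frequency rank among successors of the previous token (so common continuations become small integers), varint-encode the ranks, then LZMA the byte stream. This is an illustrative sketch; the table construction and framing are assumptions:

```python
import lzma
from collections import Counter, defaultdict

def build_rank_tables(tokens):
    """Rank each previous token's successors by frequency (rank 0 = most
    frequent). Sketch only; the submission's table construction is unknown."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    rank = {p: {t: r for r, (t, _) in enumerate(c.most_common())} for p, c in counts.items()}
    unrank = {p: [t for t, _ in c.most_common()] for p, c in counts.items()}
    return rank, unrank

def varint(n: int) -> bytes:
    """LEB128-style varint: small ranks cost a single byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def read_varints(data: bytes):
    n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            yield n
            n = shift = 0

def encode(tokens, rank) -> bytes:
    body = bytearray(varint(tokens[0]))  # first token stored directly
    for prev, cur in zip(tokens, tokens[1:]):
        body += varint(rank[prev][cur])  # frequent successors -> tiny ranks
    return lzma.compress(bytes(body), preset=9)

def decode(blob: bytes, unrank):
    vals = list(read_varints(lzma.decompress(blob)))
    out = [vals[0]]
    for r in vals[1:]:
        out.append(unrank[out[-1]][r])
    return out

toy = [1, 2, 3, 1, 2, 3, 1, 2]
rank, unrank = build_rank_tables(toy)
blob = encode(toy, rank)  # round-trips losslessly via decode()
```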