PR #262

closed

Record: 8L Paid Prefix + SmearGate + Int6 (val_bpb=1.0539)

by ibarrajoView on GitHub
val_bpb
1.0539
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
Architecture
SmearGate
Gated transformer variant used in the 8-layer model.
parameters: null
BigramHash
Bigram hashing feature with 2048 buckets and dim=128.
parameters: {"buckets":2048,"dim":128}
tied embeddings
FP16 tied embedding passthrough.
parameters: null
U-Net skip connections
Skip connections inspired by U-Net added to the transformer.
parameters: null
Weight Averaging
SWA
parameters: {"checkpoints_averaged":6}
Compression
zstd
level: 22
lzma
level: 6
Evaluation
sliding window eval
parameters: {"stride":64}
Initialization
OrthoInit
Orthogonal initialization combined with muP scaling.
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
Other
other
Paid prefix / prefix caching of 6.2M validation target tokens to achieve zero-bit prediction on covered positions.
parameters: {"prefix_tokens":6200000,"coverage":0.1}

Novel Contributions

  • Paid prefix storing 6.2M validation target tokens as an LZMA-compressed blob
  • Combining paid prefix with an 8-layer SmearGate transformer
  • Int6 quantized model compressed with zstd-22
  • Sliding-window evaluation with stride 64
  • Use of SWA over 6 checkpoints