PR #275

disqualified

Non-record: Paid Prefix Research (val_bpb=1.0539, ruled out-of-scope)

by ibarrajo
val_bpb: 1.0539
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.97MB

Training Techniques

Quantization
int6
bits: 6
scope: model weights
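A minimal sketch of symmetric int6 quantization of model weights with a per-tensor scale. The submission's exact scheme (per-channel scales, bit packing, etc.) is not specified, so this is illustrative only:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor 6-bit quantization (sketch; the submission's
    actual packing/scale granularity is not specified)."""
    scale = np.abs(w).max() / 31.0  # map the max magnitude to the int6 limit
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)  # reconstruction error is at most scale / 2
```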
Architecture
SmearGate
Transformer variant using SmearGate blocks as part of the model recipe.
parameters: {"layers":8}
BigramHash
Uses a bigram hash mechanism with a 2048-bucket vocabulary component.
parameters: {"buckets":2048}
tied embeddings
Input and output embeddings are tied.
parameters: null
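The BigramHash component above maps (previous token, current token) pairs into a fixed number of buckets; a sketch with the submission's 2048 buckets (the hash function itself is an assumption, since it is not specified):

```python
NUM_BUCKETS = 2048  # from the submission's parameters

def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a (prev, current) token pair into a fixed bucket id. The bucket
    would typically index a small learned table (e.g. a logit bias); the
    multiplicative mix below is illustrative, not the submission's hash."""
    h = (prev_token * 1_000_003 + token) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % num_buckets
```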
Initialization
OrthoInit
Orthogonal initialization with muP scaling.
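Orthogonal initialization is commonly implemented as the QR decomposition of a Gaussian matrix; a sketch below, with the muP scale left as a plain parameter since the submission's exact scaling rule is not given:

```python
import numpy as np

def ortho_init(fan_out: int, fan_in: int, scale: float = 1.0, seed: int = 0) -> np.ndarray:
    """Orthogonal init via QR of a Gaussian matrix (sketch). `scale` stands
    in for the muP multiplier, which this submission does not specify."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q *= np.sign(np.diag(r))  # fix column signs so the result is Haar-uniform
    w = q if fan_out >= fan_in else q.T
    return scale * w

w = ortho_init(8, 4)  # columns are orthonormal
```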
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
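Muon's core step orthogonalizes each weight matrix's momentum buffer with a Newton-Schulz iteration; a numpy sketch (the quintic coefficients follow the public Muon implementation, and the learning rates and weight decay listed above are applied outside this step):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient/momentum matrix, the core of a
    Muon update (sketch; coefficients from the public Muon implementation,
    not this submission's exact code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T  # iterate on the wide orientation for the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x

g = np.random.default_rng(0).standard_normal((8, 16))
x = newton_schulz_orthogonalize(g)  # singular values pushed toward 1
```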
Weight Averaging
SWA
parameters: {"checkpoints":6}
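SWA here amounts to a uniform average of the weights from the last six checkpoints; a minimal sketch over numpy state dicts:

```python
import numpy as np

def average_checkpoints(states):
    """Stochastic Weight Averaging: uniform average of checkpoint weights
    (sketch; this submission averages 6 checkpoints)."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in states[0].items()}
    for state in states:
        for k, v in state.items():
            avg[k] += v  # accumulate in float64 to limit rounding error
    return {k: (v / len(states)).astype(np.float32) for k, v in avg.items()}

# Toy example: six checkpoints whose "w" tensors are 0..5, averaging to 2.5.
checkpoints = [{"w": np.full((2, 2), float(i), dtype=np.float32)} for i in range(6)]
swa = average_checkpoints(checkpoints)
```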
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
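Sliding-window evaluation advances a fixed context window by a small stride and scores only the newly uncovered positions, so every token after the first window is predicted with close to the full context. A sketch of the span bookkeeping, using this submission's window of 2048 and stride of 64:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (context_start, score_from, score_to) spans for sliding-window
    eval: the first window scores everything, each later window scores only
    its newest positions (sketch of the general recipe)."""
    covered = 0
    start = 0
    while covered < n_tokens:
        end = min(start + window, n_tokens)
        yield start, covered, end  # score positions [covered, end)
        covered = end
        start += stride

spans = list(sliding_windows(5000))  # every position scored exactly once
```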
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
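A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps; a sketch using the submission's 3000 warmdown steps (the decay shape is the usual linear choice, not confirmed by the submission):

```python
def lr_at(step: int, total_steps: int, base_lr: float, warmdown_steps: int = 3000) -> float:
    """Constant-then-linear-decay ("warmdown") schedule: hold base_lr, then
    decay linearly to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 across the warmdown
    return base_lr * frac
```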
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Paid prefix / direct token storage: stores LZMA-compressed validation target tokens as part of the artifact so that the covered positions are paid for by storage rather than model prediction, improving overall BPB.
parameters: {"coverage":0.1,"prefix_size_mb":4.24}
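The paid-prefix idea above can be sketched as follows: LZMA-compress the first fraction of the validation tokens and ship the blob inside the artifact, so those positions cost storage bits instead of model bits. The serialization below is an assumption; only the LZMA compression and the 0.1 coverage come from the submission:

```python
import lzma
import numpy as np

def build_paid_prefix(val_tokens: np.ndarray, coverage: float = 0.1) -> bytes:
    """Store the first `coverage` fraction of validation tokens verbatim,
    LZMA-compressed (sketch; framing/serialization details are assumed)."""
    n = int(len(val_tokens) * coverage)
    prefix = val_tokens[:n].astype(np.uint16)  # raw uint16 before compression
    return lzma.compress(prefix.tobytes(), preset=9)

def load_paid_prefix(blob: bytes) -> np.ndarray:
    return np.frombuffer(lzma.decompress(blob), dtype=np.uint16)

tokens = np.random.default_rng(1).integers(0, 50257, size=1000).astype(np.uint16)
blob = build_paid_prefix(tokens)
restored = load_paid_prefix(blob)  # first 10% of tokens, losslessly recovered
```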

Novel Contributions

  • Paid prefix / direct token storage as a hybrid compression strategy
  • Empirical comparison of model capacity versus prefix coverage under the 16MB budget
  • Compression research comparing raw uint16, LZMA, pack10, and bigram-rank prefix encodings
  • Observation that prefix coverage can matter more than model quality in this regime
  • Proposal of bigram-rank + varint + LZMA encoding for higher prefix coverage
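The proposed bigram-rank + varint + LZMA encoding could look roughly like this: replace each token with its frequency rank among successors of the previous token (so common continuations become small integers), varint-encode the ranks, then LZMA the byte stream. This is an illustrative sketch; the table construction and framing are assumptions:

```python
import lzma
from collections import Counter, defaultdict

def build_rank_tables(tokens):
    """Rank each previous token's successors by frequency (rank 0 = most
    frequent). Sketch only; the submission's table construction is unknown."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    rank = {p: {t: r for r, (t, _) in enumerate(c.most_common())} for p, c in counts.items()}
    unrank = {p: [t for t, _ in c.most_common()] for p, c in counts.items()}
    return rank, unrank

def varint(n: int) -> bytes:
    """LEB128-style varint: small ranks cost a single byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def read_varints(data: bytes):
    n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            yield n
            n = shift = 0

def encode(tokens, rank) -> bytes:
    body = bytearray(varint(tokens[0]))  # first token stored directly
    for prev, cur in zip(tokens, tokens[1:]):
        body += varint(rank[prev][cur])  # frequent successors -> tiny ranks
    return lzma.compress(bytes(body), preset=9)

def decode(blob: bytes, unrank):
    vals = list(read_varints(lzma.decompress(blob)))
    out = [vals[0]]
    for r in vals[1:]:
        out.append(unrank[out[-1]][r])
    return out

toy = [1, 2, 3, 1, 2, 3, 1, 2]
rank, unrank = build_rank_tables(toy)
blob = encode(toy, rank)  # round-trips losslessly via decode()
```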