PR #262

closed

Record: 8L Paid Prefix + SmearGate + Int6 (val_bpb=1.0539)

by ibarrajoView on GitHub

val_bpb

1.0539

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.97 MB

Training Techniques

Quantization

int6

bits: 6

scope: all

Architecture

SmearGate

Gated transformer variant used in the 8-layer model.

parameters: null

BigramHash

Bigram hashing feature with 2048 buckets and dim=128.

parameters: {"buckets":2048,"dim":128}

tied embeddings

FP16 tied embedding passthrough.

parameters: null

U-Net skip connections

Skip connections inspired by U-Net added to the transformer.

parameters: null

Weight Averaging

SWA

parameters: {"checkpoints_averaged":6}

Compression

zstd

level: 22

lzma

level: 6

Evaluation

sliding window eval

parameters: {"stride":64}

Initialization

OrthoInit

Orthogonal initialization combined with muP scaling.

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}

AdamW

weight_decay: 0.04

momentum: null

other_params: null

LR Schedule

warmdown

parameters: {"warmdown_iters":3000,"warmup_steps":1500}

Other

other

Paid prefix / prefix caching of 6.2M validation target tokens to achieve zero-bit prediction on covered positions.

parameters: {"prefix_tokens":6200000,"coverage":0.1}

Novel Contributions

Paid prefix storing 6.2M validation target tokens as an LZMA-compressed blob
Combining paid prefix with an 8-layer SmearGate transformer
Int6 quantized model compressed with zstd-22
Sliding-window evaluation with stride 64
Use of SWA over 6 checkpoints