val_bpb: 1.1507
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.62 MB
Training Techniques
Quantization
mixed int5/int6 QAT
bits: 5 (MLP) / 6 (attention)
scope: MLP and attention
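The mixed-precision QAT can be sketched with a per-tensor symmetric fake-quantizer; the `scale` values below are illustrative, not taken from the report:

```python
def fake_quant(x: float, bits: int, scale: float) -> float:
    """Symmetric fake quantization: snap x to a signed `bits`-wide integer
    grid, then dequantize. During QAT the round() is wrapped in a
    straight-through estimator so gradients pass through unchanged."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

w_mlp = fake_quant(0.37, bits=5, scale=0.05)    # int5 grid for MLP weights
w_att = fake_quant(0.37, bits=6, scale=0.025)   # int6 grid for attention weights
```

Values outside the representable range are clamped to the grid's endpoints, which is what bounds the artifact's dynamic range per tensor.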
Architecture
BigramHash
Adds a bigram hash embedding/cache-like component to the model.
parameters: {"size":4096,"dim":128}
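A minimal sketch of the lookup path, using the stated size/dim; the hash mixing function is an assumption, since the report does not specify one:

```python
TABLE_SIZE, DIM = 4096, 128  # from parameters {"size": 4096, "dim": 128}

def bigram_slot(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing hash; the actual function is not specified.
    return ((prev_tok * 1000003) ^ tok) % TABLE_SIZE

def bigram_embedding(table, prev_tok, tok):
    # A learned vector keyed by the (previous, current) token pair,
    # used alongside the ordinary token embedding.
    return table[bigram_slot(prev_tok, tok)]

table = [[0.0] * DIM for _ in range(TABLE_SIZE)]
vec = bigram_embedding(table, 17, 42)
```

Hash collisions simply share a slot, which is acceptable at this table size and keeps the component's parameter count at 4096 × 128.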
SmearGate
Gating mechanism in the architecture; the name suggests it smears (mixes) adjacent token representations through a learned gate.
parameters: null
U-Net skips
Skip connections inspired by U-Net are added to the transformer blocks.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"layers":10,"heads":8,"kv_heads":4,"d_model":512}
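With these parameters the query-to-KV-head mapping is fixed; a sketch of the grouping:

```python
HEADS, KV_HEADS, D_MODEL = 8, 4, 512  # from the parameters above
HEAD_DIM = D_MODEL // HEADS           # 64
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def kv_head_for(query_head: int) -> int:
    # Consecutive query heads are mapped onto the same KV head, halving
    # the KV-cache size relative to full multi-head attention here.
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Beyond the cache saving, fewer KV heads also shrink the K/V projection matrices, which helps the artifact-size budget.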
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02}
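Muon's defining step orthogonalizes each 2-D update with a Newton-Schulz iteration before applying it at matrix_lr=0.02. A minimal numpy sketch, using the quintic coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration that pushes G's singular values
    toward 1, approximating the nearest semi-orthogonal matrix to G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm => singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

U = newton_schulz(np.diag([0.8, 0.3]))  # toy update matrix
```

The iteration does not converge exactly to 1 but lands all singular values in a band around it, which is sufficient for the optimizer's purposes.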
Weight Averaging
SWA
parameters: {"fraction":0.4}
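SWA with fraction 0.4 amounts to a parameter-wise mean over the tail of the checkpoint sequence; a sketch with a toy model:

```python
def swa_average(checkpoints, fraction=0.4):
    """Average parameter-wise over the last `fraction` of checkpoints,
    mirroring SWA over the final 40% of the warmdown phase."""
    k = max(1, int(len(checkpoints) * fraction))
    tail = checkpoints[-k:]
    return [sum(ws) / len(tail) for ws in zip(*tail)]

# Ten checkpoints of a toy 2-parameter model: the last 4 get averaged.
ckpts = [[float(i), float(2 * i)] for i in range(10)]
avg = swa_average(ckpts)
```
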
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
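Sliding-window evaluation with stride 64 re-runs overlapping windows and scores only the newly covered tokens, so each scored token sees (near) full left context. A sketch of the window bookkeeping, with toy window/stride sizes for illustration:

```python
def sliding_windows(n_tokens, window, stride):
    """Return (ctx_start, ctx_end, n_scored) spans: each window is fed to
    the model, but only tokens not covered by a previous window are
    scored, so every scored token gets up to `window` tokens of context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(n_tokens=14, window=8, stride=4)  # toy sizes
```

Every token is scored exactly once, so the per-token losses still sum to a well-defined bits-per-byte figure.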
Test-Time Training
LoRA TTT
parameters: {"rank":8}
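Per-document TTT fits a small low-rank adapter on each document at evaluation time and discards it afterwards; only the rank-8 factors are trained, the base weights stay frozen. A pure-Python sketch of the low-rank update (the alpha scaling and toy rank-1 sizes are assumptions):

```python
def matmul(X, Y):
    # Plain matrix multiply for the sketch.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def adapted_weight(W, B, A, alpha, rank):
    # Effective weight during test-time training: W + (alpha / rank) * B @ A.
    # Only B and A (the low-rank factors) receive gradients; W is frozen.
    delta = matmul(B, A)
    s = alpha / rank
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]
B, A = [[1.0], [0.0]], [[0.0, 2.0]]   # rank-1 toy factors (the report uses rank 8)
W2 = adapted_weight(W, B, A, alpha=2.0, rank=1)
```
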
Initialization
orthogonal init
Orthogonal weight initialization.
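The standard QR-based recipe for orthogonal initialization, sketched with numpy (the sign fix on diag(R) makes the draw uniform over orthogonal matrices):

```python
import numpy as np

def orthogonal_init(rows: int, cols: int, seed: int = 0) -> np.ndarray:
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q, with a
    sign correction so the result is uniformly distributed over
    orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))  # flip column signs where diag(R) < 0

W = orthogonal_init(4, 4)
```
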
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
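A warmdown schedule holds the LR constant and then decays it linearly to zero over the final warmdown_steps. The total step count and base LR below are assumptions for illustration (0.02 echoes matrix_lr from the optimizer section):

```python
TOTAL_STEPS = 10_000      # assumed; not stated in the report
WARMDOWN_STEPS = 3_000    # from parameters {"warmdown_steps": 3000}
BASE_LR = 0.02            # illustrative; matches matrix_lr above

def lr_at(step: int) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    remaining = TOTAL_STEPS - step
    return BASE_LR * min(1.0, remaining / WARMDOWN_STEPS)
```
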
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
neural cache
Neural cache used during evaluation to interpolate cached hidden-state predictions with model outputs.
parameters: null
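In the spirit of a neural cache (Grave et al.), recent (hidden state, next token) pairs are stored during evaluation, scored against the current hidden state, and the resulting cache distribution is mixed with the model's. The dot-product scoring and the mixing weight below are assumptions, since the report gives no parameters:

```python
import math

def cache_distribution(cache, query_h, vocab_size, theta=1.0):
    # cache: list of (hidden_state, next_token) pairs recorded during eval.
    # Softmax the similarity of each stored state to the current one, and
    # accumulate the probability mass onto the tokens the cache saw.
    scores = [theta * sum(q * k for q, k in zip(query_h, h)) for h, _ in cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [0.0] * vocab_size
    for (_, tok), e in zip(cache, exps):
        p[tok] += e / z
    return p

def interpolate(p_model, p_cache, lam=0.1):
    # Final next-token distribution: mix model and cache predictions.
    return [(1 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]

cache = [([1.0, 0.0], 2), ([0.0, 1.0], 1)]
p_cache = cache_distribution(cache, query_h=[1.0, 0.0], vocab_size=3)
p_mix = interpolate([0.5, 0.3, 0.2], p_cache)
```

Because both inputs are distributions, the interpolation is again a valid distribution; the cache adds mass to tokens that recently followed similar hidden states.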
Novel Contributions
- Reduced BigramHash size for reliable artifact size margin across seeds
- Mixed int5 MLP / int6 attention quantization with post-quantization roundtrip
- Stochastic Weight Averaging over the last 40% of warmdown
- Neural cache evaluation-time interpolation
- Per-document LoRA test-time training
- Quantization-aware training with STE fake quantization