PR #1311

open

Non-record: 11L LeakyReLU² + EMA + LZMA Int6 (val_bpb: 1.1303, 2-seed mean)

by htrung1105
val_bpb: 1.1303
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.87 MB

Training Techniques

Architecture
LeakyReLU
LeakyReLU(0.5)^2 MLP activation in the Transformer stack
parameters: {"slope":0.5,"squared":true}
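The listed activation can be sketched in a few lines: LeakyReLU with slope 0.5, then an element-wise square (a leaky variant of the ReLU² activation). This is a plain-Python, per-scalar sketch; the PR presumably applies it tensor-wise inside the MLP.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring, matching
    parameters {"slope": 0.5, "squared": true}.

    Negative inputs are leaked through at half strength and the
    square makes the output non-negative everywhere.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that squaring makes the function non-monotonic: f(-2) = 1 while f(0) = 0.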
U-Net skip connections
U-Net style skip connections across encoder/decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
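A minimal sketch of the skip wiring with 5 encoder and 6 decoder layers: encoder outputs are stacked and added back, in reverse order, to the inputs of later decoder layers. With one more decoder than encoder layer, some decoder layer goes without a skip; which one is an assumption here (this sketch leaves the earliest decoder layers skip-free).

```python
def unet_forward(x, encoder_layers, decoder_layers):
    """U-Net style skips across a stack of callable blocks.

    Encoder activations are pushed onto a stack; decoder layers pop
    them (latest first) and add them to their input. The first
    len(decoder) - len(encoder) decoder layers receive no skip.
    """
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    extra = len(decoder_layers) - len(encoder_layers)  # layers w/o skip
    for i, layer in enumerate(decoder_layers):
        if i >= extra:
            x = x + skips.pop()
        x = layer(x)
    return x
```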
XSA
Exclusive Self-Attention applied to the last layers
parameters: {"layers":4}
Partial RoPE
Partial rotary position embeddings with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
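Partial RoPE rotates only 16 of the 64 head dimensions and leaves the rest position-free. A sketch, assuming standard interleaved-pair rotation; the NTK-aware base rescaling factor is not listed in the PR, so `ntk_factor` below is an assumption (the usual base' = base · s^(d/(d-2)) form):

```python
import math

def partial_rope_freqs(rot_dims=16, base=10000.0, ntk_factor=1.0):
    """Inverse frequencies for the rotated slice of each head.

    NTK-aware scaling stretches the base for longer contexts; with
    ntk_factor=1.0 this reduces to vanilla RoPE frequencies.
    """
    scaled_base = base * ntk_factor ** (rot_dims / max(rot_dims - 2, 1))
    return [scaled_base ** (-2.0 * i / rot_dims) for i in range(rot_dims // 2)]

def apply_partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vector q at position pos;
    the remaining dimensions pass through unchanged."""
    inv = partial_rope_freqs(rot_dims, base)
    out = list(q)
    for i, f in enumerate(inv):
        a, b = q[2 * i], q[2 * i + 1]
        c, s = math.cos(pos * f), math.sin(pos * f)
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```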
BigramHash
Bigram hash embedding with reduced bucket count
parameters: {"buckets":2048,"dimensions":128}
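The bigram hash embedding maps each (previous, current) token pair into one of 2048 buckets, each carrying a 128-dim embedding. The PR only specifies the reduced bucket count and width; the mixing constants below are purely illustrative.

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Hash a token bigram into a bucket index in [0, buckets).

    Any cheap deterministic mix works; collisions are expected and
    absorbed by training the bucket embeddings.
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 13
    return h % buckets
```

With 2048 buckets × 128 dims the table costs only 256K parameters, which matters under the artifact-size budget.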
TrigramHash
Trigram hash disabled
parameters: {"enabled":false}
Gated Attention
Gated attention disabled
parameters: {"enabled":false}
Value Residual
Value residual disabled
parameters: {"enabled":false}
weight tying
Tied input and output embeddings
parameters: null
SmearGate
SmearGate component used in the model
parameters: null
KV head count
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
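With 8 query heads over 4 KV heads, each KV head serves a contiguous group of 2 query heads, halving the KV projections and cache. The mapping is just integer division:

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Grouped-query attention head mapping: each KV head serves
    n_heads // n_kv_heads consecutive query heads."""
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)
```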
Weight Averaging
EMA
parameters: {"decay":0.997,"start_frac":0.4}
SWA
parameters: {"every_steps":50}
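The two averaging schemes above can be sketched together: an EMA of the weights with decay 0.997, started at 40% of training, plus SWA snapshots every 50 steps. How the PR combines the two averages for the final checkpoint is not stated here.

```python
def ema_update(shadow, params, decay=0.997):
    """One EMA step over flat lists of parameter values,
    per the listed decay; typically begun at start_frac=0.4."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

def swa_snapshot_due(step, every_steps=50):
    """True on steps where an SWA weight snapshot is taken."""
    return step % every_steps == 0
```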
Quantization
GPTQ-lite
bits: 6
scope: all large weight matrices
late QAT
bits: 6
scope: model
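The "per-row clip search" listed under Novel Contributions can be sketched as follows: for each weight row, try a small grid of clip ratios, quantize symmetrically into the int6 range, and keep the ratio with the lowest reconstruction MSE. The real "GPTQ-lite" presumably adds error feedback across columns; this sketch keeps only the clip search, and the grid values are assumptions.

```python
def quantize_row_int6(row, clip_grid=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Symmetric int6 quantization of one weight row with a
    per-row clip search over candidate clip ratios.

    Returns (int codes in [-31, 31], scale) minimizing row MSE.
    """
    qmax = 31  # symmetric int6 range
    amax = max(abs(v) for v in row) or 1.0
    best = None
    for r in clip_grid:
        scale = (r * amax) / qmax
        q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
        mse = sum((v - qi * scale) ** 2 for v, qi in zip(row, q)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    return best[1], best[2]
```

Clipping trades a little error on the largest weight against finer resolution for the rest of the row, which is why a mild clip often beats ratio 1.0.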
Compression
lzma
level: 9
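The artifact compression swap (zlib/zstd → LZMA at level 9) maps directly onto Python's stdlib `lzma` module. Whether the PR also sets the EXTREME flag ("LZMA extreme compression" in the notes) is an assumption flagged below.

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress packed weight bytes with LZMA at preset 9.

    PRESET_EXTREME is an assumption based on the 'extreme
    compression' wording; drop the flag if plain level 9 is meant.
    """
    return lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
```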
Evaluation
sliding window eval
parameters: {"stride":16}
temperature scaling
parameters: {"temperature":0.9}
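The temperature-scaled evaluation divides logits by T = 0.9 before the softmax used to score targets (the sliding-window eval then scores positions with a stride-16 window; only the scaling is sketched here). Sharpening with T < 1 can lower measured bits-per-byte when the raw predictions are underconfident, without touching the weights.

```python
import math

def temp_scaled_logprob(logits, target, temperature=0.9):
    """Log-probability of `target` after dividing logits by T.

    Uses a numerically stable log-sum-exp; plain Python lists
    stand in for the model's logit tensor.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[target] - lse
```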
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_final":0.99,"momentum_warmup_steps":1500,"lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.035}
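The Muon entry lists a momentum warmup from 0.92 to a final 0.99 over 1500 steps. A linear ramp is an assumption; the listing only gives the endpoints and the warmup length.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum schedule for Muon: ramp from `start` to `final`
    over the first `warmup_steps` steps, then hold at `final`."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```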
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
logit softcap
parameters: {"value":30}
Initialization
OrthoInit
Orthogonal initialization used for model initialization
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
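A warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps. Linear decay to zero is an assumption; the listing gives only the warmdown length.

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to zero over the
    final `warmdown_steps` steps of training."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```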
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • LeakyReLU(0.5)^2 MLP activation
  • LZMA extreme compression replacing zlib/zstd
  • Temperature-scaled evaluation at T=0.90
  • GPTQ-lite per-row clip search for int6 quantization
  • Reduced BigramHash size to 2048 buckets
  • Combination of EMA, SWA, and late QAT in an 11-layer Transformer