PR #695 (open)

Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean)

by 0xNoramiya
val_bpb: 1.1360
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Architecture
XSA
Extends efficient partial XSA to the last 6 layers instead of the last 4.
parameters: {"layers":6}
BigramHash
Uses BigramHash for token/context representation.
parameters: {"buckets":2048,"dim":128}
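The BigramHash representation can be sketched as a hashed embedding lookup over (previous token, current token) pairs; only buckets=2048 and dim=128 come from this record, so the hash mix and the sentinel for the first position are illustrative assumptions:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # from the record's parameters

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(BUCKETS, DIM))

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Cheap multiplicative mix of the ordered pair; any fast hash works.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # The first token has no predecessor; use id 0 as a sentinel (assumption).
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    # One DIM-sized vector per position, typically added to the embeddings.
    return bigram_table[idx]

feats = bigram_features([5, 17, 17, 9])
print(feats.shape)  # (4, 128)
```

Collisions between bigrams sharing a bucket are accepted by design; the table stays small (2048 × 128) regardless of vocabulary size.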
SmearGate
Includes SmearGate in the architecture.
parameters: null
Partial RoPE
Applies partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":"16/64"}
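Partial RoPE with "16/64" means only the first 16 of 64 head dimensions are rotated; the rest pass through unchanged. A minimal sketch, where the base frequency and NTK scale factor are assumed values (the record gives only the 16/64 split):

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16          # rotate 16 of 64 dims, per the record
BASE, NTK_SCALE = 10000.0, 2.0      # illustrative assumptions

# NTK-aware scaling: stretch the base so low frequencies interpolate
# smoothly when the context is extended.
base = BASE * NTK_SCALE ** (ROT_DIM / (ROT_DIM - 2))
inv_freq = 1.0 / base ** (np.arange(0, ROT_DIM, 2) / ROT_DIM)

def apply_partial_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    # x: (seq, HEAD_DIM). Rotate consecutive pairs in the first ROT_DIM dims.
    rot, rest = x[:, :ROT_DIM], x[:, ROT_DIM:]
    ang = positions[:, None] * inv_freq[None, :]      # (seq, ROT_DIM // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, 0::2], rot[:, 1::2]
    rotated = np.empty_like(rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, rest], axis=1)

x = np.random.default_rng(1).normal(size=(8, HEAD_DIM))
y = apply_partial_rope(x, np.arange(8))
# Rotation preserves per-row norms; the untouched 48 dims are identical.
```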
tied embeddings
Input and output embeddings are tied.
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
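The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final 3000 steps (the trapezoidal shape is an assumption; only warmdown_steps=3000 is given):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3000, warmup_steps: int = 0) -> float:
    """Constant LR, then a linear warmdown to 0 over the last
    `warmdown_steps` iterations (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        return base_lr * steps_left / warmdown_steps
    return base_lr
```

With the record's Muon lr of 0.025 and, say, 10000 total steps, the LR stays at 0.025 until step 7000 and then falls linearly to 0.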
Quantization
STE QAT
bits: 6
scope: MLP and attention weights
int8
bits: 8
scope: embeddings
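STE QAT fake-quantizes weights in the forward pass while letting gradients flow through the rounding as if it were the identity (the straight-through estimator), so the full-precision master weights keep training. A minimal int6 fake-quant sketch, assuming symmetric per-tensor scaling (the record does not specify the granularity):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 6) -> np.ndarray:
    # Symmetric fake quantization: round to a signed int grid, then
    # dequantize. Under STE, backward treats round() as identity.
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # value used in the forward pass

w = np.array([0.31, -0.02, 0.005, -0.31])
wq = fake_quant(w)
# Error per weight is bounded by half a quantization step (scale / 2).
```

The same routine with bits=8 covers the int8 embedding path listed above.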
Weight Averaging
EMA
parameters: {"decay":0.997}
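The EMA entry (decay 0.997) amounts to keeping a shadow copy of each parameter, updated every step and used at evaluation time; a minimal sketch:

```python
# Exponential moving average of weights with decay 0.997, as listed above.
# Real trainers keep one shadow tensor per parameter; floats suffice here.
def ema_update(shadow: dict, params: dict, decay: float = 0.997) -> dict:
    for name, p in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * p
    return shadow

shadow = {"w": 0.0}
for _ in range(1000):
    ema_update(shadow, {"w": 1.0})
# After 1000 steps the shadow reaches 1 - 0.997**1000, roughly 0.95.
```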
Evaluation
sliding window eval
parameters: {"stride":64}
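Sliding-window evaluation slides a fixed context window over the validation stream with stride 64, scoring only the newest tokens of each window so that every token (after the first window) is predicted with near-full context. A sketch, where `score_fn` is a hypothetical callable returning one loss per position:

```python
def sliding_eval(tokens: list[int], window: int, stride: int, score_fn) -> float:
    losses = []
    for begin in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[begin : begin + window]
        per_tok = score_fn(chunk)                 # one loss per position
        # First window: keep every loss; later windows: only the new tail.
        keep = len(per_tok) if begin == 0 else stride
        losses.extend(per_tok[-keep:])
    return sum(losses) / len(losses)
```

A smaller stride costs proportionally more forward passes but gives each scored token more preceding context, which is why stride=64 is an evaluation-time knob rather than a training change.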
Initialization
OrthoInit
Orthogonal initialization used with muP-scaled output projections.
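Orthogonal initialization can be sketched via a QR decomposition of a Gaussian matrix; the muP-style gain on the output projection shown here (1/width) is an assumed convention, since the record does not state the exact factor:

```python
import numpy as np

def ortho_init(rows: int, cols: int, gain: float = 1.0, seed: int = 0) -> np.ndarray:
    # QR of a Gaussian matrix yields an orthonormal factor; fixing the signs
    # of R's diagonal makes the result uniformly distributed.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(rows, cols))
    q, r = np.linalg.qr(a if rows >= cols else a.T)
    q = q * np.sign(np.diag(r))
    return gain * (q if rows >= cols else q.T)

width = 64
w_out = ortho_init(width, width, gain=1.0 / width)  # muP-scaled output proj (assumed 1/width)
```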
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup":"0.92->0.99 over 1500 steps","lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
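Muon's core step orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it; the quintic coefficients below follow the commonly published version of the optimizer, and the whole snippet is an illustration rather than this PR's exact implementation:

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration: pushes the singular values of the
    # (Frobenius-normalized) input toward 1, approximating orthogonalization.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    if g.shape[0] > g.shape[1]:
        x = x.T                        # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

m = np.random.default_rng(2).normal(size=(16, 32))
o = newton_schulz(m)
# Singular values of `o` cluster near 1 after a handful of iterations.
```

AdamW then handles the parameters Muon is not suited to, matching the split above (embeddings and scalars on separate learning rates).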

Novel Contributions

  • Extended XSA from the last 4 layers to the last 6 layers
  • Shortened warmdown from 3500 to 3000 iterations
  • Raised late QAT threshold from 0.15 to 0.30
  • Selected hyperparameters via 37 local ablation experiments on an RTX 4060 Ti
  • Used STE int6 QAT for MLP and attention weights with int8 embeddings
  • Trained 2 seeds, reported the 2-seed mean, and submitted the best seed's checkpoint