PR #1528

open

Non-record: 11L s2048 4h on 1xA100 — 1.1104 BPB

by xiehuanyi
val_bpb
1.1104
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,040,603 bytes

Training Techniques

Architecture
XSA
XSA-all attention variant used throughout the layer stack
parameters: {"last_n":11}
BigramHash
Bigram hash embedding/feature component
parameters: {"vocab_size":2048}
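A bigram hash embedding can be sketched as hashing each (previous, current) token pair into a small bucket table and looking up a learned feature row. The multiplier constant and the zero-id start token below are illustrative assumptions; the PR only specifies vocab_size=2048.

```python
import numpy as np

def bigram_hash_ids(tokens, vocab_size=2048, mult=1000003):
    """Map each (prev, cur) token pair to a bucket in [0, vocab_size)."""
    ids = []
    prev = 0  # assumed start-of-sequence placeholder id
    for t in tokens:
        ids.append(((prev * mult) ^ t) % vocab_size)
        prev = t
    return ids

# Embedding table indexed by the hashed bigram id; its rows would be
# added to the usual token embedding as a cheap local-context feature.
emb = np.zeros((2048, 8))
feats = emb[bigram_hash_ids([5, 17, 17, 3])]  # shape (4, 8)
```

The same bigram always hashes to the same bucket, so collisions are the only source of feature sharing.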
Partial RoPE
Partial rotary positional embedding
parameters: {"dimensions":16,"denominator":64}
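Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unrotated. A minimal sketch, assuming the listed denominator=64 is the frequency base and dimensions=16 is the rotated prefix (the exact frequency layout in the PR is not specified):

```python
import numpy as np

def partial_rope(x, rot_dims=16, denom=64.0):
    """Rotate the first `rot_dims` dims of x (shape (T, D)) by
    position-dependent angles; leave dims rot_dims..D untouched."""
    T, D = x.shape
    half = rot_dims // 2
    inv_freq = 1.0 / (denom ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), inv_freq)          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Because each (x1, x2) pair is a pure rotation, the norm of the rotated prefix is preserved while the unrotated tail carries position-free content.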
SmearGate
SmearGate gating mechanism
parameters: null
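The PR gives no parameters for SmearGate. A common reading of "smear" gating, sketched here as an assumption rather than the PR's exact mechanism, is a learned gate that mixes each position's activation with the previous position's:

```python
import numpy as np

def smear_gate(x, gate):
    """x: (T, D) activations; gate: (T, 1) values in [0, 1]
    (e.g. a sigmoid output). Each position additively mixes in
    the previous position's activation, scaled by its gate."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + gate * prev
```

Position 0 has no predecessor, so it passes through unchanged.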
U-Net skip connections
U-Net style skip connections in the model
parameters: null
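U-Net style skips pair early layers with their mirrored late layers: first-half activations are pushed on a stack and reinjected before the matching second-half layer. The additive combine rule below is an assumption; the PR does not state how skip activations are merged.

```python
def unet_forward(x, layers):
    """Run an even-length layer list with U-Net pairing: stash the
    output of each first-half layer, then add it back (plain sum,
    assumed) before the mirrored second-half layer runs."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i >= n // 2 and skips:
            x = x + skips.pop()   # last-in, first-out pairing
        x = layer(x)
        if i < n // 2:
            skips.append(x)
    return x
```

The stack discipline means layer i is paired with layer n-1-i, so shallow features reach the deepest decoder layers.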
LeakyReLU
LeakyReLU squared activation
parameters: {"slope":0.5,"squared":true}
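Taking the listed parameters literally (slope=0.5, squared=true), the activation squares the output of a LeakyReLU. Note that a plain square folds the negative branch onto small positives; whether the PR restores the sign is not stated, so the literal version is shown:

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared.
    Negative inputs map to small positive outputs under this
    literal reading."""
    y = x if x > 0 else slope * x
    return y * y
```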
Regularization
LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw":true}
Weight Averaging
EMA
parameters: {"decay":0.997,"start_fraction":0.2}
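The deferred EMA (decay 0.997, starting 20% into training, per the PR's parameters and its note about avoiding random-init contamination) can be sketched as:

```python
class DeferredEMA:
    """Exponential moving average of weights that only begins after
    start_fraction of training, so near-random early weights never
    enter the average."""
    def __init__(self, decay=0.997, start_fraction=0.2, total_steps=1000):
        self.decay = decay
        self.start = int(start_fraction * total_steps)
        self.avg = None
    def update(self, step, params):
        if step < self.start:
            return                     # deferred: ignore early steps
        if self.avg is None:
            self.avg = list(params)    # seed from current weights
        else:
            self.avg = [self.decay * a + (1 - self.decay) * p
                        for a, p in zip(self.avg, params)]
```

On a 4-hour run the 20% threshold also keeps the EMA window meaningful when total step counts shrink.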
SWA
parameters: null
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: null
Compression
lzma
level: 9
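The artifact pipeline quantizes weights to 6 bits and compresses with LZMA at level 9. The sketch below uses per-row round-to-nearest as a stand-in for GPTQ (which additionally corrects rounding error column by column using second-order information); the symmetric range [-31, 31] is an assumed 6-bit layout.

```python
import lzma
import numpy as np

def quantize_int6(w):
    """Per-row symmetric 6-bit quantization: scale each row so its
    max magnitude maps to 31, then round and clip."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
q, scale = quantize_int6(w)
blob = lzma.compress(q.tobytes(), preset=9)  # level-9 LZMA, as in the PR
```

Quantized integers are far more compressible than float32, which is how the checkpoint fits under the size budget.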
Evaluation
sliding window eval
parameters: {"stride":64}
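Sliding-window evaluation with stride 64 scores each token with near-full left context: after the first window, the window slides by the stride and only the newly exposed tokens are scored. A planning sketch (the 2048 window matches the PR's eval length; the span bookkeeping is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, score_end) triples. The first
    window scores everything it covers; later windows slide by
    `stride` and score only the new `stride` tokens."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        step = window if score_start == 0 else stride
        score_end = min(score_start + step, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```

Every token is scored exactly once, and all tokens past the first window see at least window - stride tokens of context.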
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
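A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps. The constant-then-linear shape is an assumption; the PR lists only warmdown_steps=3500.

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR until the warmdown region, then linear decay to
    zero over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```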
Other
other
Deferred EMA start to avoid random-init contamination on shorter runs
parameters: {"start_fraction":0.2}
other
PyTorch SDP flash-backend fallback used when FA3 is unavailable
parameters: null
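The FA3-to-SDP fallback can be sketched as an import probe: use FlashAttention-3 when its package is present (it targets Hopper GPUs), otherwise fall back to PyTorch's scaled_dot_product_attention, which picks its own flash backend on supported GPUs such as the A100. The module name probed below is an assumption about how the PR detects FA3.

```python
def pick_attention():
    """Prefer FA3 when importable; otherwise fall back to PyTorch SDPA."""
    try:
        import flash_attn_interface  # assumed FA3 module name; noqa: F401
        return "fa3"
    except ImportError:
        return "torch_sdpa"
```

The actual attention call would then dispatch on the returned tag.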

Novel Contributions

  • Longer context training at seq_len=2048
  • Extended training time to 4 hours on a single A100
  • A100-compatible fallback from FA3 to PyTorch SDP flash backend
  • Deferred EMA start for shorter runs
  • Int6 GPTQ + LZMA compressed submission under 16 MiB
  • Sliding window evaluation with stride 64