PR #1070

Status: open

Non-record: Aweb Ultimate — 1.1190 BPB (10min 8×H100, independent PR #549 reproduction)

by manfromnowhere143
val_bpb: 1.1190
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,948,863 bytes

Training Techniques

Architecture
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
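A minimal sketch of the activation as the parameters describe it ({"squared": true, "negative_slope": 0.5}). Whether the PR preserves the sign of negative inputs after squaring is an assumption; this version does not:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: pass positives through, scale negatives by the slope.
    y = x if x >= 0.0 else negative_slope * x
    # "squared": true -- square the result, as in squared-ReLU activations.
    return y * y
```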
XSA
Cross-layer attention applied to the last layers
parameters: {"layers":4}
Partial RoPE
Rotary positional encoding applied to a subset of head dimensions
parameters: {"head_dims":16,"total_head_dims":64}
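Partial RoPE rotates only a slice of each head vector (here 16 of 64 dims) and leaves the rest position-agnostic. A sketch assuming the adjacent-pair, standard-frequency convention of vanilla RoPE; the PR's exact pairing and base are not shown here:

```python
import math

def partial_rope(q: list[float], pos: int, rot_dims: int = 16,
                 base: float = 10000.0) -> list[float]:
    """Rotary position encoding on the first `rot_dims` of a head vector.

    Dims are rotated in adjacent pairs; the remaining head dims pass
    through unchanged (the "partial" in Partial RoPE). Sketch only.
    """
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```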
SmearGate
Input enrichment gate
parameters: null
BigramHash
Bigram hash input feature
parameters: {"size":2048}
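BigramHash maps each (previous token, current token) pair into a fixed table of 2048 learned embeddings, giving the model a cheap n-gram feature. A sketch with illustrative hash constants (not taken from the PR):

```python
def bigram_bucket(prev_token: int, cur_token: int, size: int = 2048) -> int:
    # Mix the two token ids with a multiplicative hash, then reduce to a
    # table index. Constants are illustrative, not the PR's.
    h = (prev_token * 1000003 + cur_token) * 2654435761
    return (h & 0xFFFFFFFF) % size
```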
ValueEmbedding
Value embedding input enrichment
parameters: {"dimensions":128}
U-Net skip connections
Encoder-decoder skip connections with learned skip weights
parameters: null
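A learned-weight U-Net skip can be as simple as blending the mirrored encoder-half activation into the decoder stream. The single scalar weight per skip is an assumption; the PR may learn per-channel weights:

```python
def unet_skip(decoder_h: list[float], encoder_h: list[float],
              skip_weight: float) -> list[float]:
    # Blend the saved encoder activation into the decoder stream with a
    # trainable scalar gain (hypothetical parameterization).
    return [d + skip_weight * e for d, e in zip(decoder_h, encoder_h)]
```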
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
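The depth-dependent LayerNorm gain follows directly from the stated formula: deeper layers get smaller scales, damping residual growth.

```python
import math

def ln_scale(layer: int) -> float:
    # "scale": "1/sqrt(layer+1)" -- layer 0 keeps unit gain, deeper
    # layers are progressively attenuated.
    return 1.0 / math.sqrt(layer + 1)
```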
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
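Both averages can be maintained in one pass over training: an exponential moving average with the listed decay of 0.997, plus a plain running mean for SWA. Sketch over flat lists of floats; a real run averages full weight tensors:

```python
def update_averages(params, ema, swa, step, ema_decay=0.997):
    # One step of keeping both weight averages.
    for i, p in enumerate(params):
        ema[i] = ema_decay * ema[i] + (1.0 - ema_decay) * p  # EMA
        swa[i] = (swa[i] * step + p) / (step + 1)            # running mean
    return ema, swa
```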
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"phases":3,"overlapped_comms":true}
Quantization
GPTQ-lite
bits: 6
scope: MLP+attn
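GPTQ proper compensates rounding error with second-order (Hessian) information; a "lite" variant plausibly reduces to the 6-bit grid itself. A sketch of symmetric round-to-nearest int6 quantization; the error-compensation step is omitted, and its exact form in the PR is an assumption:

```python
def quantize_int6(w: list[float]) -> tuple[list[int], float]:
    # Symmetric per-group quantization: int6 covers [-32, 31].
    scale = max(abs(x) for x in w) / 31.0 or 1.0  # avoid zero scale
    q = [max(-32, min(31, round(x / scale))) for x in w]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]
```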
STE QAT
bits: 6
scope: late QAT
Compression
lzma
level: null
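With `level: null` the compression preset is unspecified; Python's standard-library `lzma` at the maximum preset is a reasonable stand-in for squeezing the serialized artifact:

```python
import lzma

def compress_artifact(blob: bytes,
                      preset: int = 9 | lzma.PRESET_EXTREME) -> bytes:
    # LZMA-compress the serialized weights. The preset is an assumption,
    # since the PR leaves the level unspecified.
    return lzma.compress(blob, preset=preset)
```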
Test-Time Training
score-first TTT
parameters: {"epochs":3,"optimizer":"SGD","learning_rate":0.002,"momentum":0.9}
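"Score-first" TTT scores each segment with the current weights before adapting on it, so the reported loss never sees weights trained on the tokens being scored (the "legal" part). A toy sketch with the listed settings (SGD, lr 0.002, momentum 0.9, 3 epochs); `score` and `grad` are caller-supplied callables, not part of the PR:

```python
def score_first_ttt(segments, score, grad, w,
                    epochs=3, lr=0.002, momentum=0.9):
    # Score each segment first, then adapt on it with SGD + momentum.
    v = [0.0] * len(w)
    total = 0.0
    for seg in segments:
        total += score(w, seg)       # score BEFORE adapting on this segment
        for _ in range(epochs):      # ...then fine-tune on it
            g = grad(w, seg)
            for i in range(len(w)):
                v[i] = momentum * v[i] + g[i]
                w[i] -= lr * v[i]
    return total / len(segments), w
```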
Evaluation
sliding window eval
parameters: {"stride":64}
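Sliding-window evaluation re-scores the document in overlapping windows, counting each token once: every window contributes only its last `stride` positions (the first window contributes everything, since nothing precedes it). A sketch of the index bookkeeping; handling of a trailing partial window is left out and may differ in the PR:

```python
def sliding_window_targets(n_tokens: int, window: int, stride: int = 64):
    """Yield (start, end, score_from): score tokens in [score_from, end)."""
    start = 0
    while start + window <= n_tokens or start == 0:
        end = min(start + window, n_tokens)
        # First window scores all its tokens; later windows score only the
        # last `stride`, using the earlier tokens purely as context.
        score_from = start if start == 0 else end - stride
        yield start, end, score_from
        if end == n_tokens:
            break
        start += stride
```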

Novel Contributions

  • Independent reproduction of PR #549 SOTA stack
  • 11-layer 512-dimensional Transformer with the full proven stack
  • LeakyReLU squared activation
  • XSA on the last 4 layers
  • Partial RoPE on 16/64 head dimensions
  • EMA plus SWA weight averaging
  • Parallel Muon optimizer with overlapped communications
  • GPTQ-lite mixed int6/int8 quantization with LZMA compression
  • SmearGate, BigramHash, and ValueEmbedding input enrichment
  • Legal score-first test-time training
  • U-Net skip connections with learned skip weights
  • Late QAT with int6 STE