PR #1442

closed

Non-record: No-FA3 stack combination — val_bpb 1.1854 (1-seed, 8xH100)

by akaiHuang
val_bpb: 1.1854
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.51 MB

Training Techniques

Architecture
XSA
Applied XSA attention across all 11 layers.
parameters: {"layers":11}
BigramHash
Added bigram hash embeddings.
parameters: {"buckets":3072,"dimensions":112}
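A bigram hash embedding can be sketched as below. Only the bucket count (3072) and dimension (112) come from the listed parameters; the hash function, mixing constant, and how the output is combined with the token embeddings are assumptions.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (prev, curr) token pair into a fixed bucket table.

    Sketch of the 'bigram hash embeddings' technique; the multiplicative
    hash below is an assumed stand-in, not the PR's exact function.
    """
    def __init__(self, buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        h = (prev * 1000003 + tokens) % self.buckets  # assumed hash
        return self.table(h)  # (B, T, dim), added to token embeddings
```

The output would typically be added to (or concatenated with) the regular token embeddings before the first block.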
LeakyReLU
Used LeakyReLU-based MLP activation.
parameters: {"mlp_multiplier":3}
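With mlp_multiplier 3, the MLP block presumably looks like the sketch below; only the 3x hidden width is from the listed parameters, and the negative slope is left at PyTorch's default (an assumption).

```python
import torch
import torch.nn as nn

def make_mlp(d_model: int, mult: int = 3) -> nn.Sequential:
    # 3x hidden width per the listed mlp_multiplier; LeakyReLU slope
    # stays at PyTorch's default 0.01 (an assumption).
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),
        nn.LeakyReLU(),
        nn.Linear(mult * d_model, d_model),
    )
```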
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adamw_scalars":true}
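Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it. A sketch using the commonly published quintic coefficients; the PR's parallel sharding and the adamw_scalars fallback (AdamW for scalar/1-D params) are not shown, and these exact settings are assumptions.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (Muon's core step).

    Coefficients are the widely used quintic variant; an assumption
    about this PR's exact configuration.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```

After the iteration, the singular values of the update sit near 1, so every direction of the gradient contributes at a similar scale.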
Weight Averaging
EMA
parameters: {"decay":0.997,"start_fraction":0.8}
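The EMA can be sketched as below. Decay 0.997 and the 0.8 start fraction are from the listed parameters; everything else (tracking weights verbatim before the start point, averaging parameters only) is an assumption.

```python
import torch

class EMA:
    """Exponential moving average of weights over the last 20% of training.

    Sketch matching the listed decay=0.997, start_fraction=0.8;
    not the PR's exact integration.
    """
    def __init__(self, model, decay=0.997, start_fraction=0.8, total_steps=3500):
        self.decay = decay
        self.start_step = int(start_fraction * total_steps)
        self.shadow = {k: p.detach().clone() for k, p in model.named_parameters()}

    def update(self, model, step: int):
        if step < self.start_step:
            # before the start fraction, just mirror the live weights
            for k, p in model.named_parameters():
                self.shadow[k].copy_(p.detach())
            return
        for k, p in model.named_parameters():
            self.shadow[k].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
```

At eval time the shadow weights would be loaded in place of the live ones.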
LR Schedule
warmdown
parameters: {"warmdown_steps":2000,"total_steps":3500}
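With warmdown_steps 2000 and total_steps 3500, the schedule holds the LR constant for the first 1500 steps and then decays linearly to zero. A sketch; base_lr and the linear decay shape are assumptions.

```python
def warmdown_lr(step: int, total_steps: int = 3500,
                warmdown_steps: int = 2000, base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to zero over the final warmdown_steps."""
    start = total_steps - warmdown_steps  # step 1500 with these settings
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```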
Quantization
mixed Q4/Q5/Q6
bits: null
scope: all weights
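Mixed Q4/Q5/Q6 presumably assigns a bit width per tensor or group. A minimal symmetric round-to-nearest quantizer for one group; the grouping granularity and the bit-allocation policy are assumptions.

```python
import torch

def quantize_group(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization of one weight group.

    Q4/Q5/Q6 codes all fit in an int8 container; packing to 4/5/6
    bits on disk is left out of this sketch.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_group(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Unlike GPTQ, this needs no calibration data: the rounding is data-free, which is what makes it the simpler alternative the contributions list mentions.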
Compression
lzma
level: 9
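LZMA at level 9 maps directly onto Python's standard library; a sketch of packing the serialized artifact bytes (the payload layout itself is an assumption).

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    # xz/LZMA at the listed level 9 (maximum compression).
    return lzma.compress(raw, preset=9)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```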
Evaluation
sliding window eval
parameters: {"stride":32,"temperature":0.9}
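Sliding-window eval with stride 32 scores each token with up to a full window of left context, advancing 32 tokens at a time, with logits divided by the 0.9 temperature. A sketch; the PR's exact loss accounting is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def sliding_window_bpb(model, tokens, window=1024, stride=32, temperature=0.9):
    """Bits-per-byte over one byte-token sequence with overlapping windows.

    Each target is scored exactly once, using the longest available
    left context; sketch, not the PR's exact eval code.
    """
    nll, scored = 0.0, 1  # 'scored' = next target index to score
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window + 1, len(tokens))
        x, y = tokens[start:end - 1], tokens[start + 1:end]
        logits = model(x.unsqueeze(0))[0] / temperature
        keep = scored - (start + 1)  # skip targets already scored
        nll += F.cross_entropy(logits[keep:], y[keep:], reduction="sum").item()
        scored = end
        if end == len(tokens):
            break
    return nll / (len(tokens) - 1) / math.log(2)  # nats/byte -> bits/byte
```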
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
logit softcap
parameters: null
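Logit softcap has no listed parameters; the common formulation (as in Gemma 2) is a tanh squash, shown below with an assumed cap of 15.

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squashes logits into [-cap, cap]; the cap value is an
    # assumption, since the PR lists parameters: null.
    return cap * torch.tanh(logits / cap)
```

This bounds the logit magnitude during training without the hard clipping that would zero out gradients.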

Novel Contributions

  • Demonstrates a legal stack that runs without Flash Attention 3 on the stock RunPod PyTorch container.
  • Uses mixed Q4/Q5/Q6 quantization as a simpler alternative to Full Hessian GPTQ with self-generated calibration.
  • Documents a step-based warmdown trigger bug and its fix.
  • Shows strong validation performance without SLOT, TTT, or validation-data access during eval.
  • Combines XSA-all, BigramHash, Parallel Muon, EMA, and sliding-window eval with temperature scaling.