PR #1442

closed

Non-record: No-FA3 stack combination — val_bpb 1.1854 (1-seed, 8xH100)

by akaiHuang
val_bpb: 1.1854
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.51 MB

Training Techniques

Architecture
XSA
Applied XSA attention across all 11 layers.
parameters: {"layers":11}
BigramHash
Added bigram hash embeddings.
parameters: {"buckets":3072,"dimensions":112}
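A bigram hash embedding can be sketched as below. Only the bucket count (3072) and dimension (112) come from the listed parameters; the hash function, mixing constant, and how the output is combined with the token embeddings are assumptions.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (prev, curr) token pair into a fixed bucket table.

    Sketch of the 'bigram hash embeddings' technique; the multiplicative
    hash below is an assumed stand-in, not the PR's exact function.
    """
    def __init__(self, buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        h = (prev * 1000003 + tokens) % self.buckets  # assumed hash
        return self.table(h)  # (B, T, dim), added to token embeddings
```

The output would typically be added to (or concatenated with) the regular token embeddings before the first block.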
LeakyReLU
Used LeakyReLU-based MLP activation.
parameters: {"mlp_multiplier":3}
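With mlp_multiplier 3, the MLP block presumably looks like the sketch below; only the 3x hidden width is from the listed parameters, and the negative slope is left at PyTorch's default (an assumption).

```python
import torch
import torch.nn as nn

def make_mlp(d_model: int, mult: int = 3) -> nn.Sequential:
    # 3x hidden width per the listed mlp_multiplier; LeakyReLU slope
    # stays at PyTorch's default 0.01 (an assumption).
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),
        nn.LeakyReLU(),
        nn.Linear(mult * d_model, d_model),
    )
```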
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adamw_scalars":true}
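Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it. A sketch using the commonly published quintic coefficients; the PR's parallel sharding and the adamw_scalars fallback (AdamW for scalar/1-D params) are not shown, and these exact settings are assumptions.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (Muon's core step).

    Coefficients are the widely used quintic variant; an assumption
    about this PR's exact configuration.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```

After the iteration, the singular values of the update sit near 1, so every direction of the gradient contributes at a similar scale.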
Weight Averaging
EMA
parameters: {"decay":0.997,"start_fraction":0.8}
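The EMA can be sketched as below. Decay 0.997 and the 0.8 start fraction are from the listed parameters; everything else (tracking weights verbatim before the start point, averaging parameters only) is an assumption.

```python
import torch

class EMA:
    """Exponential moving average of weights over the last 20% of training.

    Sketch matching the listed decay=0.997, start_fraction=0.8;
    not the PR's exact integration.
    """
    def __init__(self, model, decay=0.997, start_fraction=0.8, total_steps=3500):
        self.decay = decay
        self.start_step = int(start_fraction * total_steps)
        self.shadow = {k: p.detach().clone() for k, p in model.named_parameters()}

    def update(self, model, step: int):
        if step < self.start_step:
            # before the start fraction, just mirror the live weights
            for k, p in model.named_parameters():
                self.shadow[k].copy_(p.detach())
            return
        for k, p in model.named_parameters():
            self.shadow[k].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
```

At eval time the shadow weights would be loaded in place of the live ones.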
LR Schedule
warmdown
parameters: {"warmdown_steps":2000,"total_steps":3500}
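With warmdown_steps 2000 and total_steps 3500, the schedule holds the LR constant for the first 1500 steps and then decays linearly to zero. A sketch; base_lr and the linear decay shape are assumptions.

```python
def warmdown_lr(step: int, total_steps: int = 3500,
                warmdown_steps: int = 2000, base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to zero over the final warmdown_steps."""
    start = total_steps - warmdown_steps  # step 1500 with these settings
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```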
Quantization
mixed Q4/Q5/Q6
bits: null
scope: all weights
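Mixed Q4/Q5/Q6 presumably assigns a bit width per tensor or group. A minimal symmetric round-to-nearest quantizer for one group; the grouping granularity and the bit-allocation policy are assumptions.

```python
import torch

def quantize_group(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization of one weight group.

    Q4/Q5/Q6 codes all fit in an int8 container; packing to 4/5/6
    bits on disk is left out of this sketch.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_group(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Unlike GPTQ, this needs no calibration data: the rounding is data-free, which is what makes it the simpler alternative the contributions list mentions.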
Compression
lzma
level: 9
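LZMA at level 9 maps directly onto Python's standard library; a sketch of packing the serialized artifact bytes (the payload layout itself is an assumption).

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    # xz/LZMA at the listed level 9 (maximum compression).
    return lzma.compress(raw, preset=9)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```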
Evaluation
sliding window eval
parameters: {"stride":32,"temperature":0.9}
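Sliding-window eval with stride 32 scores each token with up to a full window of left context, advancing 32 tokens at a time, with logits divided by the 0.9 temperature. A sketch; the PR's exact loss accounting is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def sliding_window_bpb(model, tokens, window=1024, stride=32, temperature=0.9):
    """Bits-per-byte over one byte-token sequence with overlapping windows.

    Each target is scored exactly once, using the longest available
    left context; sketch, not the PR's exact eval code.
    """
    nll, scored = 0.0, 1  # 'scored' = next target index to score
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window + 1, len(tokens))
        x, y = tokens[start:end - 1], tokens[start + 1:end]
        logits = model(x.unsqueeze(0))[0] / temperature
        keep = scored - (start + 1)  # skip targets already scored
        nll += F.cross_entropy(logits[keep:], y[keep:], reduction="sum").item()
        scored = end
        if end == len(tokens):
            break
    return nll / (len(tokens) - 1) / math.log(2)  # nats/byte -> bits/byte
```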
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
logit softcap
parameters: null
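Logit softcap has no listed parameters; the common formulation (as in Gemma 2) is a tanh squash, shown below with an assumed cap of 15.

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squashes logits into [-cap, cap]; the cap value is an
    # assumption, since the PR lists parameters: null.
    return cap * torch.tanh(logits / cap)
```

This bounds the logit magnitude during training without the hard clipping that would zero out gradients.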

Novel Contributions

  • Demonstrates a legal stack that runs without Flash Attention 3 on the stock RunPod PyTorch container.
  • Uses mixed Q4/Q5/Q6 quantization as a simpler alternative to Full Hessian GPTQ with self-generated calibration.
  • Documents a step-based warmdown trigger bug and its fix.
  • Shows strong validation performance without SLOT, TTT, or validation-data access during eval.
  • Combines XSA-all, BigramHash, Parallel Muon, EMA, and sliding-window eval with temperature scaling.