PR #394 (open)
Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)
by greqone
val_bpb
1.1247
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,545,662 bytes
Training Techniques
Quantization
int6
bits: 6
scope: model artifact
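A minimal sketch of what the 6-bit artifact quantization could look like. The PR states only bits=6 over the model artifact, so the symmetric per-tensor scheme, rounding, and function names below are assumptions, not the submission's actual code:

```python
def quantize_int6(values, qmax=31):
    """Symmetric per-tensor quantization to 6-bit ints in [-32, 31].

    Hypothetical sketch: per-tensor max-abs scale, round-to-nearest.
    """
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [x * scale for x in q]

weights = [0.31, -0.07, 0.5, -0.5]
codes, scale = quantize_int6(weights)
approx = dequantize_int6(codes, scale)
```

At 6 bits the round-trip error per weight is bounded by half the scale, which is what makes packing the whole artifact under the 16MB cap feasible.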
Architecture
XSA
Uses an 11-layer PR315-style transformer with XSA applied in the last 4 layers.
parameters: {"layers":11,"xsa_last_n":4}
RoPE
Applies RoPE with reduced dimensions.
parameters: {"dimensions":16}
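The reduced-dimension RoPE (dimensions: 16) can be sketched as rotating only the first 16 entries of each head vector and passing the rest through unrotated. The adjacent-pair convention and base constant below are assumptions:

```python
import math

def rope_partial(x, pos, rope_dims=16, base=10000.0):
    """Apply RoPE to the first `rope_dims` entries of a head vector,
    leaving the remaining dimensions untouched. Pairs adjacent
    entries (i, i+1) for each rotation; this pairing is assumed."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / base ** (i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Restricting the rotation to 16 dimensions leaves most of the head as position-independent channels while still giving attention a relative-position signal.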
tied embeddings
Ties the input embedding and output projection weights, with a dedicated tied-embedding learning rate.
parameters: null
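Weight tying itself is just one matrix serving both roles; a toy sketch (all names hypothetical):

```python
import random

vocab, d = 8, 4
# Single weight matrix shared by the input embedding and the output head.
W = [[random.random() for _ in range(d)] for _ in range(vocab)]

def embed(token_id):
    """Input embedding: look up the token's row of the shared matrix."""
    return W[token_id]

def logits(h):
    """Output head: dot the hidden state with every row of the same
    shared matrix (the 'tied' part), one logit per vocab entry."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]
```

Because both directions read the same storage, any gradient step on `W` updates embedding and head together, which is why tied setups often get their own learning rate (here tied_embed_lr=0.035).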
BigramHash
Includes a bigram vocabulary component.
parameters: {"vocab_size":2048}
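One way a BigramHash component can map adjacent token pairs into a 2048-entry bigram vocabulary; the mixing constants below are illustrative, not the PR's:

```python
def bigram_bucket(prev_tok, tok, vocab_size=2048):
    """Hash an adjacent token pair into one of `vocab_size` bigram
    buckets. The multiply-and-xor mixing here is a hypothetical
    choice; the PR only specifies vocab_size=2048."""
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h ^ (h >> 16)) % vocab_size
```

Each bucket then indexes a small learned embedding added alongside the unigram embedding, giving the model cheap local-context features.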
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
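The momentum warmup (0.92 → 0.99 over 1500 steps) presumably follows a simple ramp; a linear-interpolation sketch, with the shape assumed:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold at `end`. The linear
    shape is an assumption; the PR gives only the endpoints."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates responsive while gradients are still noisy, then ramps to the heavier 0.99 setting for the bulk of training.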
Weight Averaging
EMA
parameters: {"decay":0.997}
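The EMA with decay 0.997 is the standard exponential moving average over parameters, updated once per step:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over flat parameter lists:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw weights, is typically what gets evaluated (and here packaged), since it smooths out late-training noise.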
Evaluation
sliding window eval
parameters: {"stride":64}
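Sliding-window eval with stride 64 typically scores only the trailing stride tokens of each window, so every position is predicted with near-full context; the exact bookkeeping below is an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: the first window is
    scored in full, each later window only on its final `stride`
    tokens. Together the scored ranges tile [0, n_tokens) exactly."""
    spans = [(0, min(window, n_tokens), 0)]
    while spans[-1][1] < n_tokens:
        end = min(spans[-1][1] + stride, n_tokens)
        spans.append((max(0, end - window), end, spans[-1][1]))
    return spans
```

A small stride (64) means many forward passes but gives each scored token close to the full 2048-token context, which is what makes the reported val_bpb comparable across submissions.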
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
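A warmdown schedule with warmdown_iters=3000 commonly holds the learning rate constant and then decays it linearly to zero over the final 3000 iterations; the exact shape is an assumption:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier on the base LR: 1.0 until the final
    `warmdown_iters` steps, then linear decay to 0. The linear
    tail is the usual 'warmdown' shape, assumed here."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```

This multiplier would apply uniformly to matrix_lr, scalar_lr, and tied_embed_lr.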
Regularization
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.04}
Other
other
Native Hopper FlashAttention and torch.compile were used for training efficiency.
parameters: {"flash_attn_backend":"native","torch_compile":true}
other
Backout residual subtraction from the mid-network hidden state.
parameters: {"backout_enabled":true,"backout_lambda_init":0.2,"backout_layer":-1}
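Reading "backout" as subtracting a scaled copy of a saved mid-network hidden state from the residual stream — an interpretation of the PR summary, not confirmed — with the scale initialized at backout_lambda_init=0.2:

```python
def apply_backout(hidden, saved_mid, lam=0.2):
    """Subtract a scaled copy of a previously saved mid-network
    hidden state from the current residual stream. `lam` would be
    a learnable scalar starting at backout_lambda_init; which layer
    supplies `saved_mid` (backout_layer=-1) is per the PR config."""
    return [h - lam * m for h, m in zip(hidden, saved_mid)]
```

The appeal as a "cheap orthogonal improvement" is that this costs one saved activation and one fused multiply-subtract per token.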
Novel Contributions
- Non-record 10-minute-track submission packaged under track_non_record_16mb
- Faithful RunPod 8xH100 SXM PR315-style run with native Hopper FlashAttention
- Backout residual subtraction added as a cheap orthogonal improvement
- Self-contained train_gpt.py with inlined flash_attn_interface helper
- Exact training log and submission artifacts packaged within the 16MB cap