PR #394 (open)
Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)
by greqone
val_bpb
1.1247
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,545,662 bytes
Training Techniques
Quantization
int6
bits: 6
scope: model artifact
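A minimal sketch of what the 6-bit artifact quantization could look like. The PR states only bits=6 over the model artifact, so the symmetric per-tensor scheme, rounding, and function names below are assumptions, not the submission's actual code:

```python
def quantize_int6(values, qmax=31):
    """Symmetric per-tensor quantization to 6-bit ints in [-32, 31].

    Hypothetical sketch: per-tensor max-abs scale, round-to-nearest.
    """
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [x * scale for x in q]

weights = [0.31, -0.07, 0.5, -0.5]
codes, scale = quantize_int6(weights)
approx = dequantize_int6(codes, scale)
```

At 6 bits the round-trip error per weight is bounded by half the scale, which is what makes packing the whole artifact under the 16MB cap feasible.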
Architecture
XSA
Uses an 11-layer PR315-style transformer with XSA applied in the last 4 layers.
parameters: {"layers":11,"xsa_last_n":4}
RoPE
Applies RoPE with reduced dimensions.
parameters: {"dimensions":16}
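The reduced-dimension RoPE (dimensions: 16) can be sketched as rotating only the first 16 entries of each head vector and passing the rest through unrotated. The adjacent-pair convention and base constant below are assumptions:

```python
import math

def rope_partial(x, pos, rope_dims=16, base=10000.0):
    """Apply RoPE to the first `rope_dims` entries of a head vector,
    leaving the remaining dimensions untouched. Pairs adjacent
    entries (i, i+1) for each rotation; this pairing is assumed."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / base ** (i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Restricting the rotation to 16 dimensions leaves most of the head as position-independent channels while still giving attention a relative-position signal.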
tied embeddings
Ties the input embedding and output projection weights, with a dedicated tied-embedding learning rate.
parameters: null
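Weight tying itself is just one matrix serving both roles; a toy sketch (all names hypothetical):

```python
import random

vocab, d = 8, 4
# Single weight matrix shared by the input embedding and the output head.
W = [[random.random() for _ in range(d)] for _ in range(vocab)]

def embed(token_id):
    """Input embedding: look up the token's row of the shared matrix."""
    return W[token_id]

def logits(h):
    """Output head: dot the hidden state with every row of the same
    shared matrix (the 'tied' part), one logit per vocab entry."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]
```

Because both directions read the same storage, any gradient step on `W` updates embedding and head together, which is why tied setups often get their own learning rate (here tied_embed_lr=0.035).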
BigramHash
Includes a bigram vocabulary component.
parameters: {"vocab_size":2048}
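One way a BigramHash component can map adjacent token pairs into a 2048-entry bigram vocabulary; the mixing constants below are illustrative, not the PR's:

```python
def bigram_bucket(prev_tok, tok, vocab_size=2048):
    """Hash an adjacent token pair into one of `vocab_size` bigram
    buckets. The multiply-and-xor mixing here is a hypothetical
    choice; the PR only specifies vocab_size=2048."""
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h ^ (h >> 16)) % vocab_size
```

Each bucket then indexes a small learned embedding added alongside the unigram embedding, giving the model cheap local-context features.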
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
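The momentum warmup (0.92 → 0.99 over 1500 steps) presumably follows a simple ramp; a linear-interpolation sketch, with the shape assumed:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold at `end`. The linear
    shape is an assumption; the PR gives only the endpoints."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates responsive while gradients are still noisy, then ramps to the heavier 0.99 setting for the bulk of training.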
Weight Averaging
EMA
parameters: {"decay":0.997}
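The EMA with decay 0.997 is the standard exponential moving average over parameters, updated once per step:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over flat parameter lists:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw weights, is typically what gets evaluated (and here packaged), since it smooths out late-training noise.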
Evaluation
sliding window eval
parameters: {"stride":64}
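Sliding-window eval with stride 64 typically scores only the trailing stride tokens of each window, so every position is predicted with near-full context; the exact bookkeeping below is an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: the first window is
    scored in full, each later window only on its final `stride`
    tokens. Together the scored ranges tile [0, n_tokens) exactly."""
    spans = [(0, min(window, n_tokens), 0)]
    while spans[-1][1] < n_tokens:
        end = min(spans[-1][1] + stride, n_tokens)
        spans.append((max(0, end - window), end, spans[-1][1]))
    return spans
```

A small stride (64) means many forward passes but gives each scored token close to the full 2048-token context, which is what makes the reported val_bpb comparable across submissions.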
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
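A warmdown schedule with warmdown_iters=3000 commonly holds the learning rate constant and then decays it linearly to zero over the final 3000 iterations; the exact shape is an assumption:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier on the base LR: 1.0 until the final
    `warmdown_iters` steps, then linear decay to 0. The linear
    tail is the usual 'warmdown' shape, assumed here."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```

This multiplier would apply uniformly to matrix_lr, scalar_lr, and tied_embed_lr.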
Regularization
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.04}
Other
other
Native Hopper FlashAttention and torch.compile were used for training efficiency.
parameters: {"flash_attn_backend":"native","torch_compile":true}
other
Backout residual subtraction from the mid-network hidden state.
parameters: {"backout_enabled":true,"backout_lambda_init":0.2,"backout_layer":-1}
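Reading "backout" as subtracting a scaled copy of a saved mid-network hidden state from the residual stream — an interpretation of the PR summary, not confirmed — with the scale initialized at backout_lambda_init=0.2:

```python
def apply_backout(hidden, saved_mid, lam=0.2):
    """Subtract a scaled copy of a previously saved mid-network
    hidden state from the current residual stream. `lam` would be
    a learnable scalar starting at backout_lambda_init; which layer
    supplies `saved_mid` (backout_layer=-1) is per the PR config."""
    return [h - lam * m for h, m in zip(hidden, saved_mid)]
```

The appeal as a "cheap orthogonal improvement" is that this costs one saved activation and one fused multiply-subtract per token.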
Novel Contributions
- Non-record 10-minute-track submission packaged under track_non_record_16mb
- Faithful RunPod 8xH100 SXM PR315-style run with native Hopper FlashAttention
- Backout residual subtraction added as a cheap orthogonal improvement
- Self-contained train_gpt.py with inlined flash_attn_interface helper
- Exact training log and submission artifacts packaged within the 16MB cap