PR #197
Non-record (open): staging profile (LAWA + slide eval) on 8xH100 (val_bpb=1.18926428)
by machdragon
val_bpb
1.1893
Architecture
GPT
Optimizer
Muon
Artifact Size
15,292,665 bytes
Training Techniques
Architecture
weight tying
The merged-baseline defaults tie the input embedding and output projection weights as part of the staging profile.
parameters: null
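Weight tying shares a single matrix between the token-embedding lookup and the output projection, halving that parameter count. A minimal NumPy sketch (names and shapes are illustrative, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 16, 8
W = rng.normal(size=(vocab, d_model))   # the one shared matrix

def embed(token_ids):
    # Input side: row lookup into the shared matrix.
    return W[token_ids]

def logits(hidden):
    # Output side: project against the same matrix (tied weights).
    return hidden @ W.T

hidden = embed(np.array([1, 2, 3]))
out = logits(hidden)
```

Any update to `W` from either side is seen by both, which is the point of the tie.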
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"adam_weight_decay":0.01}
Regularization
weight decay
parameters: {"muon_weight_decay":0.02,"adam_weight_decay":0.01}
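The two values above suggest decoupled weight decay applied per optimizer group: Muon-managed matrix parameters at 0.02 and Adam-managed parameters at 0.01. A hedged sketch of one decoupled-decay step (the grouping and update order are assumptions, not the PR's code):

```python
def decayed(params, lr, wd):
    # Decoupled weight decay: shrink each weight toward zero,
    # separately from the gradient step.
    return [p * (1.0 - lr * wd) for p in params]

# Hypothetical grouping mirroring the reported decay values.
groups = {
    "muon": {"wd": 0.02, "params": [1.0, -2.0]},  # 2-D matrix params
    "adam": {"wd": 0.01, "params": [0.5]},        # embeddings, scalars, etc.
}
lr = 0.1
for g in groups.values():
    g["params"] = decayed(g["params"], lr, g["wd"])
```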
Evaluation
sliding window eval
parameters: {"stride":512}
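With `stride=512`, sliding-window evaluation moves a full-length context window across the sequence and scores only the tokens not covered by the previous pass, so every token is scored exactly once with up to a window of left context. A sketch of the span bookkeeping (window size assumed to match `train_length=1024`; the PR's actual eval loop is not shown):

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Plan eval passes: each pass sees context [begin, end) but only
    tokens in [score_from, end) contribute to the loss."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(2000)
scored = sum(end - score_from for _, end, score_from in spans)
```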
Test-Time Training
LoRA TTT
parameters: null
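LoRA test-time training freezes the base weights and adapts only a low-rank delta `B @ A` during evaluation. A NumPy sketch of the forward path (rank, zero-init convention, and names are assumptions; the PR's TTT procedure is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # low-rank down-projection (trainable at test time)
B = np.zeros((d, r))          # low-rank up-projection, zero-init so the delta starts at 0

def forward(x):
    # y = x (W + B A)^T ; only A and B would receive test-time updates.
    return x @ (W + B @ A).T

x = rng.normal(size=(3, d))
out = forward(x)
```

Because `B` starts at zero, the adapted model initially matches the base model exactly; TTT then nudges only `A` and `B` on the eval-time objective.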
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
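`warmdown_iters=2500` indicates the learning rate is held and then decayed to zero over the final 2,500 iterations. A sketch of that multiplier (the constant phase and linear shape are assumptions consistent with common warmdown schedules, not confirmed by the PR):

```python
def lr_scale(step, total_iters, warmdown_iters=2500):
    """Constant LR, then linear decay to zero over the last
    `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```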
Other
other
Staging profile that injects merged-baseline defaults and enables LAWA for production-scale reproducible validation.
parameters: {"staging_profile":1,"lawa_enabled":1}
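LAWA (Latest Weight Averaging) evaluates with the element-wise mean of the last k checkpoints rather than the final weights alone. A minimal sketch (k and the checkpoint cadence are not stated in the PR; flat lists stand in for model tensors):

```python
from collections import deque

class LAWA:
    """Keep the last k checkpoints; evaluate with their mean."""
    def __init__(self, k=3):
        self.buf = deque(maxlen=k)

    def update(self, weights):
        # Called each time a checkpoint is taken.
        self.buf.append(list(weights))

    def averaged(self):
        n = len(self.buf)
        return [sum(ws) / n for ws in zip(*self.buf)]

lawa = LAWA(k=3)
for w in ([1.0], [2.0], [3.0], [4.0]):
    lawa.update(w)
avg = lawa.averaged()   # mean of the last 3 checkpoints
```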
Novel Contributions
- STAGING_PROFILE=1 merged-baseline recipe
- LAWA enabled
- Sliding-window evaluation with EVAL_STRIDE=512
- 8xH100 production-scale reproducible validation run
- Reported TTT LoRA evaluation alongside standard validation