PR #197
Non-record (open): staging profile (LAWA + slide eval) on 8xH100 (val_bpb=1.18926428)
by machdragon
val_bpb
1.1893
Architecture
GPT
Optimizer
Muon
Artifact Size
15,292,665 bytes
Training Techniques
Architecture
weight tying
The merged-baseline defaults tie the input embedding and output projection weights as part of the staging profile.
parameters: null
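Weight tying shares a single matrix between the token-embedding lookup and the output projection, halving that parameter count. A minimal NumPy sketch (names and shapes are illustrative, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 16, 8
W = rng.normal(size=(vocab, d_model))   # the one shared matrix

def embed(token_ids):
    # Input side: row lookup into the shared matrix.
    return W[token_ids]

def logits(hidden):
    # Output side: project against the same matrix (tied weights).
    return hidden @ W.T

hidden = embed(np.array([1, 2, 3]))
out = logits(hidden)
```

Any update to `W` from either side is seen by both, which is the point of the tie.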
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"adam_weight_decay":0.01}
Regularization
weight decay
parameters: {"muon_weight_decay":0.02,"adam_weight_decay":0.01}
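The two values above suggest decoupled weight decay applied per optimizer group: Muon-managed matrix parameters at 0.02 and Adam-managed parameters at 0.01. A hedged sketch of one decoupled-decay step (the grouping and update order are assumptions, not the PR's code):

```python
def decayed(params, lr, wd):
    # Decoupled weight decay: shrink each weight toward zero,
    # separately from the gradient step.
    return [p * (1.0 - lr * wd) for p in params]

# Hypothetical grouping mirroring the reported decay values.
groups = {
    "muon": {"wd": 0.02, "params": [1.0, -2.0]},  # 2-D matrix params
    "adam": {"wd": 0.01, "params": [0.5]},        # embeddings, scalars, etc.
}
lr = 0.1
for g in groups.values():
    g["params"] = decayed(g["params"], lr, g["wd"])
```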
Evaluation
sliding window eval
parameters: {"stride":512}
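With `stride=512`, sliding-window evaluation moves a full-length context window across the sequence and scores only the tokens not covered by the previous pass, so every token is scored exactly once with up to a window of left context. A sketch of the span bookkeeping (window size assumed to match `train_length=1024`; the PR's actual eval loop is not shown):

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Plan eval passes: each pass sees context [begin, end) but only
    tokens in [score_from, end) contribute to the loss."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(2000)
scored = sum(end - score_from for _, end, score_from in spans)
```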
Test-Time Training
LoRA TTT
parameters: null
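LoRA test-time training freezes the base weights and adapts only a low-rank delta `B @ A` during evaluation. A NumPy sketch of the forward path (rank, zero-init convention, and names are assumptions; the PR's TTT procedure is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # low-rank down-projection (trainable at test time)
B = np.zeros((d, r))          # low-rank up-projection, zero-init so the delta starts at 0

def forward(x):
    # y = x (W + B A)^T ; only A and B would receive test-time updates.
    return x @ (W + B @ A).T

x = rng.normal(size=(3, d))
out = forward(x)
```

Because `B` starts at zero, the adapted model initially matches the base model exactly; TTT then nudges only `A` and `B` on the eval-time objective.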
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
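`warmdown_iters=2500` indicates the learning rate is held and then decayed to zero over the final 2,500 iterations. A sketch of that multiplier (the constant phase and linear shape are assumptions consistent with common warmdown schedules, not confirmed by the PR):

```python
def lr_scale(step, total_iters, warmdown_iters=2500):
    """Constant LR, then linear decay to zero over the last
    `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```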
Other
other
Staging profile that injects merged-baseline defaults and enables LAWA for production-scale reproducible validation.
parameters: {"staging_profile":1,"lawa_enabled":1}
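LAWA (Latest Weight Averaging) evaluates with the element-wise mean of the last k checkpoints rather than the final weights alone. A minimal sketch (k and the checkpoint cadence are not stated in the PR; flat lists stand in for model tensors):

```python
from collections import deque

class LAWA:
    """Keep the last k checkpoints; evaluate with their mean."""
    def __init__(self, k=3):
        self.buf = deque(maxlen=k)

    def update(self, weights):
        # Called each time a checkpoint is taken.
        self.buf.append(list(weights))

    def averaged(self):
        n = len(self.buf)
        return [sum(ws) / n for ws in zip(*self.buf)]

lawa = LAWA(k=3)
for w in ([1.0], [2.0], [3.0], [4.0]):
    lawa.update(w)
avg = lawa.averaged()   # mean of the last 3 checkpoints
```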
Novel Contributions
- STAGING_PROFILE=1 merged-baseline recipe
- LAWA enabled
- Sliding-window evaluation with EVAL_STRIDE=512
- 8xH100 production-scale reproducible validation run
- Reported TTT LoRA evaluation alongside standard validation