val_bpb: 1.1697
Architecture: Transformer
Optimizer: AdamW
Artifact Size: —

Training Techniques: Test-Time Training

LoRA TTT
parameters: null

Sequence Length (sequence_length)
train_length: 2048
eval_length: null

Other (other)
Updated baseline hyperparameters for the 10min/16mb track, including an 8x768 configuration, a 262144-token training batch, lower learning rates, a lower logit softcap, and beta1=0.70.
parameters: {"num_layers": 8, "model_dim": 768, "train_batch_tokens": 262144, "logit_softcap": 10, "tied_embed_lr": 0.03, "matrix_lr": 0.02, "scalar_lr": 0.02, "beta1": 0.7}
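A logit softcap of 10 bounds the pre-softmax logits with a smooth tanh squash. A minimal sketch of how such a cap is typically applied (the function name and NumPy usage here are illustrative, not taken from the submitted train_gpt.py):

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 10.0) -> np.ndarray:
    """Smoothly bound logits to (-cap, cap) via cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)

# Small logits pass through almost unchanged; extreme logits
# saturate near +/-cap instead of growing without bound.
moderate = softcap(np.array([0.5, -1.0]))
extreme = softcap(np.array([100.0, -100.0]))

# The 262144-token training batch at train_length 2048 corresponds
# to 262144 / 2048 = 128 sequences per batch.
sequences_per_batch = 262144 // 2048
```

Lowering the cap tightens the bound on every logit, which interacts with the lowered learning rates listed above.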
Novel Contributions
- Adds the 2026-03-20_better_baseline_params record for the 10min/16mb track
- Keeps the same LoRA TTT evaluation path as 2026-03-17_LoRA_TTT
- Updates baseline defaults to an 8x768 configuration with 2048 sequence length and 262144-token training batch
- Uses lower learning rates, lower logit softcap, and beta1=0.70
- Includes submitted train_gpt.py, run logs, and aggregated submission.json
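The LoRA TTT evaluation path adapts the model at test time by training only low-rank adapters while the base weights stay frozen. A minimal NumPy sketch of the low-rank update underlying LoRA (the rank, scale, and variable names are illustrative; the actual evaluation path is in the submitted train_gpt.py):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 768, 8                   # model dim; rank r is illustrative
W = rng.normal(size=(d, d))     # frozen base weight
A = rng.normal(size=(r, d))     # LoRA "down" matrix (Gaussian init)
B = np.zeros((d, r))            # LoRA "up" matrix (zero init)
alpha = 16.0                    # illustrative scaling factor

def forward(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Base path plus low-rank residual: W x + (alpha / r) * B (A x).
    return W @ x + (alpha / r) * B @ (A @ x)

x = rng.normal(size=d)
y = forward(x, A, B)
# With B zero-initialized, the adapter contributes nothing at the start
# of test-time training; gradient steps on the eval-time objective then
# update only A and B (2 * d * r parameters), never W.
```

Zero-initializing B makes the adapted model exactly match the base model before any test-time updates, so TTT can only move away from the baseline as the eval-time loss dictates.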