PR #221

open

Submission: 10L + Sliding Window eval (mean val_bpb=1.1899)

by shajalahamedcse
val_bpb: 1.1899
Architecture:
Optimizer: Muon
Artifact Size: ≤ 16MB

Training Techniques

Sequence Length
sequence_length
train_length: 4096
eval_length: null
Evaluation
sliding window eval
parameters: {"stride":64}
Architecture
num_layers
10-layer model configuration
parameters: {"layers":10}
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"matrix_lr":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
Initialization
Overtone init
Regularization
weight decay
parameters: null
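
As an illustration of the warmdown entry above, here is a minimal sketch of a schedule that holds the learning rate and then decays it over the final 3600 steps. The listing gives only warmdown_steps and matrix_lr; the linear decay shape, the function name warmdown_lr, and the total_steps argument are assumptions made for this example and may not match the PR's code.

```python
def warmdown_lr(step: int, total_steps: int,
                base_lr: float = 0.04, warmdown_steps: int = 3600) -> float:
    """Hold base_lr, then decay linearly to zero over the last warmdown_steps.

    Assumption: linear decay. The source lists warmdown_steps=3600 and
    matrix_lr=0.04 but does not state the shape of the decay.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        # Still in the constant phase of the schedule.
        return base_lr
    # Inside the warmdown: scale by the fraction of warmdown remaining.
    return base_lr * steps_left / warmdown_steps


if __name__ == "__main__":
    # Hypothetical run length of 10000 steps, chosen only for the demo.
    for s in (0, 6400, 8200, 10000):
        print(s, warmdown_lr(s, total_steps=10000))
```

With a run shorter than warmdown_steps, this sketch starts below base_lr at step 0, since the whole run then falls inside the warmdown.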

Novel Contributions

  • Training on 4096-token sequences instead of 1024-token sequences
  • Using sliding window evaluation with stride 64
  • 10-layer configuration combined with long-sequence training
  • Consistent validation bpb across three random seeds (mean val_bpb = 1.1899)
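
To make the stride-64 sliding window evaluation concrete, the sketch below plans overlapping windows so that every token is scored exactly once, with each window after the first scoring only its last stride tokens. It then converts the summed loss to bits per byte. The names sliding_windows, sliding_window_bpb, and nll_fn are hypothetical, and the window length is left as a parameter because the listing gives eval_length as null. This is an assumption-laden illustration, not the submission's evaluation code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int, int]  # (begin, end, score_from)


def sliding_windows(num_tokens: int, window: int, stride: int = 64) -> List[Span]:
    """Plan windows tokens[begin:end]; only tokens[score_from:end] are scored."""
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans: List[Span] = []
    scored_up_to = 0
    while scored_up_to < num_tokens:
        # The first window scores everything it covers; later ones advance by stride.
        step = window if scored_up_to == 0 else stride
        end = min(scored_up_to + step, num_tokens)
        begin = max(0, end - window)
        spans.append((begin, end, scored_up_to))
        scored_up_to = end
    return spans


def sliding_window_bpb(tokens: Sequence[int],
                       nll_fn: Callable[[Sequence[int], int], List[float]],
                       total_bytes: int, window: int, stride: int = 64) -> float:
    """Sum per-token losses (in nats) over all windows and convert to bits per byte.

    nll_fn(window_tokens, first_scored) is a hypothetical model hook that returns
    one loss per token from index first_scored to the end of the window.
    """
    total_nll = 0.0
    for begin, end, score_from in sliding_windows(len(tokens), window, stride):
        losses = nll_fn(tokens[begin:end], score_from - begin)
        if len(losses) != end - score_from:
            raise ValueError("nll_fn returned the wrong number of losses")
        total_nll += sum(losses)
    return total_nll / (math.log(2) * total_bytes)


def uniform_nll(window_tokens: Sequence[int], first_scored: int) -> List[float]:
    """Toy stand-in model: uniform over 256 symbols, so ln(256) nats per token."""
    return [math.log(256)] * (len(window_tokens) - first_scored)


if __name__ == "__main__":
    toy_tokens = list(range(300))  # pretend each token is exactly one byte
    print(sliding_window_bpb(toy_tokens, uniform_nll, total_bytes=300, window=128))
```

A smaller stride gives each scored token more left context at the cost of more forward passes, which is the trade-off a stride of 64 sits on.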