PR #1307 (open)

Add 07c1 strict RunPod base submission

by amrayach
val_bpb: 1.1101
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB

Training Techniques

Architecture
weight tying
Tied embeddings are used in the model.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
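The GQA setting above (8 attention heads sharing 4 KV heads, so each KV head serves 2 query heads) can be sketched in NumPy. Shapes, names, and the plain softmax are illustrative, not taken from the PR:

```python
import numpy as np

def gqa(q, k, v, kv_heads):
    # q: (n_q_heads, seq, d); k, v: (kv_heads, seq, d)
    n_q, seq, d = q.shape
    group = n_q // kv_heads
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))
k = rng.standard_normal((4, 16, 32))
v = rng.standard_normal((4, 16, 32))
out = gqa(q, k, v, kv_heads=4)  # shape (8, 16, 32)
```

The memory saving comes from storing only 4 KV heads in the cache while still computing 8 query projections.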
LeakyReLU
Leaky ReLU squared MLP activation is used.
parameters: {"slope":0.5}
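A minimal sketch of a leaky-ReLU-squared MLP activation with the listed slope of 0.5. Note this reading simply squares the LeakyReLU output, which discards the sign on the negative branch; sign-preserving variants also exist, and the PR's exact form is not specified here:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring (assumed form; see lead-in).
    x = np.asarray(x, dtype=float)
    y = np.where(x > 0, x, slope * x)
    return y * y
```

For example, `leaky_relu_sq(2.0)` gives 4.0 and `leaky_relu_sq(-2.0)` gives 1.0 (since -2 × 0.5 = -1, squared).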
BigramHash
Bigram vocabulary component used in the architecture.
parameters: {"vocab_size":5120}
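One way a bigram vocabulary component of size 5120 can work is to hash each (previous token, current token) pair into a fixed-size bigram table. The multiplicative hash below is purely illustrative; the PR's actual hash function is not given in this listing:

```python
def bigram_hash_id(prev_id, cur_id, vocab_size=5120):
    # Illustrative multiplicative hash into the bigram embedding table
    # (the constant 1000003 is an assumption, not from the PR).
    return (prev_id * 1000003 + cur_id) % vocab_size
```

The resulting index would select an extra embedding that is added to (or concatenated with) the regular token embedding.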
VE128
Value embedding dimension setting used in the model.
parameters: {"dimensions":128}
Gated Attention
Windowed attention layers are used at selected depths.
parameters: {"layers":[2,4,6,8,10],"window_size":512}
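The windowed-attention layers listed above (layers 2, 4, 6, 8, 10 with window 512) restrict each position to attend only to the most recent `window` positions. A minimal mask construction, with the full-size numbers swapped for small ones in the usage example:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal (j <= i) and within
    # the last `window` positions (i - j < window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, 4)
```

Layers not in the listed set would presumably use the full causal mask instead.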
Optimizer
Muon
weight_decay: null
momentum: 0.985
other_params: null
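Muon maintains a momentum buffer (0.985 here) and approximately orthogonalizes each 2-D update via a quintic Newton-Schulz iteration. The sketch below uses the coefficients from the public Muon implementation; treat them, and the fixed 5 steps, as assumptions rather than details of this PR:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (push singular values toward 1).
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

In Muon the orthogonalized matrix, not the raw momentum buffer, is what gets scaled by the learning rate and applied to the weights.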
Evaluation
sliding window eval
parameters: {"stride":64}
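Sliding-window evaluation at stride 64 typically scores each token exactly once while giving it up to a full context window of history, advancing the scored region by the stride. A hedged sketch of the chunking logic (the window value is illustrative; only the stride of 64 comes from the listing):

```python
def sliding_eval_chunks(n_tokens, window, stride=64):
    """Return (ctx_start, score_start, end) triples covering every
    token once; only tokens in [score_start, end) are scored, with
    context drawn from [ctx_start, end)."""
    chunks = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        chunks.append((ctx_start, score_start, end))
        score_start = end
    return chunks
```

Summing per-token loss over the scored spans then yields a val_bpb where late tokens in each chunk are not penalized by short context.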
Test-Time Training
TTT
parameters: {"enabled":false}
Sequence Length
sequence_length
train_length: 2048
eval_length: 6144
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
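A warmdown schedule with `warmdown_iters: 4000` usually means the learning rate stays at its base value and then decays linearly to zero over the final 4000 iterations. A minimal sketch of that multiplier (the constant-then-linear shape is an assumption about this PR's exact schedule):

```python
def warmdown_lr_scale(step, total_iters, warmdown_iters=4000):
    # 1.0 until the final warmdown_iters, then linear decay to 0.
    if step < total_iters - warmdown_iters:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

For example, with 10000 total iterations the scale is 1.0 at step 0, 0.5 at step 8000, and 0.0 at the end.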
Quantization
int6
bits: 6
scope: per-row export
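Per-row int6 export typically means each weight row gets its own scale, with values rounded into a signed 6-bit range. A symmetric-quantization sketch (the symmetric scheme and the [-31, 31] clip are assumptions; the PR only states 6 bits, per-row scope):

```python
import numpy as np

def int6_per_row(w):
    qmax = 31  # symmetric signed 6-bit range, assumed [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Rounding error per element is at most half a quantization step, i.e. scale/2 for that row; the quantized tensor plus per-row scales is what would then be Brotli-compressed for the 15.73 MB artifact.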
Compression
Brotli
level: null

Novel Contributions

  • Strict RunPod H100 SXM base proof for the 07c1 line
  • Faithful reproduction of PR #1212 with eval/export hygiene fixes
  • Base-path-only submission with TTT disabled
  • Four-seed strict proof under a 598-second wallclock budget
  • Per-row int6 export with Brotli compression
  • Sliding-window evaluation at stride 64