PR #994

open

Add Kshitij submission (1x H100, val_bpb 1.4315, env-based config)

by singhaikshitijjain
val_bpb: 1.4315
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.63 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz orthogonalization":true}
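
To illustrate what Newton-Schulz orthogonalization does to an update matrix, here is a minimal pure-Python sketch. It uses the classical cubic iteration, which may differ from the polynomial and step count this submission's Muon implementation uses; the function names and the 2x2 input are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of a nonzero square matrix g."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which keeps the cubic iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # Classical cubic update (an assumption, not the PR's code):
        # X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

update = newton_schulz_orthogonalize([[2.0, 1.0], [0.0, 1.0]])
gram = matmul(update, transpose(update))  # should be close to the identity
```

Each step pushes the singular values toward 1 while leaving the singular vectors unchanged, so the result keeps the direction structure of the input but equalizes its scale.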
Quantization
int8
bits: 8
scope: all
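
The contributions list describes the int8 step as post-training quantization with per-row scaling. A rough sketch of one common symmetric scheme follows; the clamp range of [-127, 127], the rounding rule, and the function names are assumptions rather than details taken from the PR.

```python
def quantize_rows(weights):
    """Symmetric int8 quantization with one scale per row."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize_rows(quantized, scales):
    """Recover approximate float weights from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.03]]
q, scales = quantize_rows(weights)
restored = dequantize_rows(q, scales)
```

A separate scale per row lets rows with small weights keep their resolution instead of sharing a range dictated by the largest weight in the matrix.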
Compression
zlib
level: null
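
As a small illustration of the compression step, the sketch below packs a made-up list of int8 values into bytes and round-trips them through Python's `zlib`. The payload is hypothetical, and because the level is listed as null the default compression level is assumed.

```python
import zlib

# Hypothetical payload: int8 weights packed one byte each (two's complement).
int8_values = [0, 0, 1, -1, 0, 0, 127, -127] * 64
payload = bytes(v & 0xFF for v in int8_values)

# Default compression level is an assumption; the source does not state one.
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

# Map bytes back to signed values to confirm the round trip is lossless.
decoded = [b - 256 if b > 127 else b for b in restored]
```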
Evaluation
sliding window eval
parameters: null
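
One way to picture sliding-window evaluation: overlapping windows move across the validation tokens, and each window scores only the tokens that earlier windows have not scored yet, so later tokens are predicted with more context. The window size, stride, and function name in this sketch are hypothetical, since the entry lists no parameters.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans that score each token exactly once."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        # Tokens in [start, score_from) only serve as context for this window.
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(n_tokens=20, window=8, stride=4)
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes.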
Architecture
U-Net skip connections
U-Net style encoder-decoder transformer with skip connections.
parameters: null
weight tying
Tied embeddings.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":2}
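
With 8 attention heads and 2 KV heads, each KV head serves a group of 4 query heads. The sketch below shows one conventional assignment, in which contiguous query heads share a KV head; that this is the grouping the submission uses is an assumption, and the function name is hypothetical.

```python
def kv_head_for(query_head, heads=8, kv_heads=2):
    """Return the index of the KV head shared by the given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads  # 4 query heads per KV head here
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```

Storing 2 KV heads instead of 8 cuts the key and value projections, and the KV cache at inference time, to a quarter of their multi-head size.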
RoPE
Rotary position embeddings applied to only part of each head's dimensions (partial RoPE).
parameters: null
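
Partial RoPE usually means that only a leading slice of each head vector is rotated by position, while the remaining dimensions pass through unchanged. The sketch below assumes interleaved pairs, a base of 10000, and a hypothetical `rot_dim`; none of these values are given in the PR.

```python
import math

def partial_rope(vec, position, rot_dim, base=10000.0):
    """Rotate the first rot_dim entries of vec by position-dependent angles."""
    if rot_dim % 2 != 0 or rot_dim > len(vec):
        raise ValueError("rot_dim must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rot_dim, 2):
        # Each pair (i, i + 1) uses its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

head_vec = [1.0, 0.0, 0.5, -0.5, 2.0, 3.0]
rotated = partial_rope(head_vec, position=3, rot_dim=4)
```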
Weight Averaging
EMA
parameters: null
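
EMA weight averaging keeps a running blend of the weights seen during training and evaluates with that blend instead of the final iterate. A minimal sketch follows; the decay value and the dictionary layout are hypothetical, since the entry lists no parameters.

```python
def ema_update(ema, weights, decay=0.999):
    """Blend the current weights into the running average, in place."""
    for name, value in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

# Toy run with a large step (decay=0.5) so the effect is easy to follow.
ema = {"w": 0.0}
for step_weights in ({"w": 1.0}, {"w": 1.0}, {"w": 1.0}):
    ema_update(ema, step_weights, decay=0.5)
```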
LR Schedule
cosine decay
parameters: {"warmup":true,"warmdown":true}
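
The entry records only that the schedule has a warmup and a warmdown around cosine decay. One plausible shape is a linear warmup followed by cosine decay to a floor, sketched below; treating the cosine segment as the warmdown, and all step counts and rates, are assumptions.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1.0) for s in range(101)]
```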
Regularization
weight decay
parameters: {"type":"decoupled","style":"AdamW"}
gradient clipping
parameters: null
Sequence Length
sequence_length
train_length: 256
eval_length: null
Other
other
Environment-variable-based hyperparameter configuration visible in logs.
parameters: null
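
Environment-variable configuration can be sketched as reading each hyperparameter from `os.environ` with a fallback default. The variable names below are hypothetical, not the ones used in the PR; the defaults for sequence length, heads, and KV heads mirror values listed on this page, and the learning rate is a placeholder.

```python
import os
from dataclasses import dataclass

def env_int(name, default):
    """Read an integer hyperparameter from the environment."""
    return int(os.environ.get(name, default))

def env_float(name, default):
    """Read a float hyperparameter from the environment."""
    return float(os.environ.get(name, default))

@dataclass
class Config:
    seq_len: int
    num_heads: int
    num_kv_heads: int
    lr: float

def load_config():
    """Build a Config from hypothetical variable names; unset ones use defaults."""
    return Config(
        seq_len=env_int("DEMO_SEQ_LEN", 256),
        num_heads=env_int("DEMO_NUM_HEADS", 8),
        num_kv_heads=env_int("DEMO_NUM_KV_HEADS", 2),
        lr=env_float("DEMO_LR", 0.01),  # placeholder value, not from the PR
    )

config = load_config()
```

Under this scheme a run could be reconfigured from the shell, for example by setting `DEMO_SEQ_LEN=512` before launching the script, and printing the resolved `Config` at startup would make the settings visible in the logs.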
other
Flash attention for higher throughput and memory efficiency.
parameters: null
other
Distributed token streaming loader.
parameters: null

Novel Contributions

  • Muon optimizer with Newton-Schulz orthogonalization
  • Int8 post-training quantization with per-row scaling
  • Zlib-compressed artifact
  • Tokenizer-agnostic BPB evaluation
  • U-Net style encoder-decoder transformer with skip connections
  • Tied embeddings with custom learning rates
  • RMSNorm, rotary embeddings, GQA, and SwiGLU MLP
  • Distributed token streaming loader
  • EMA weight averaging
  • Environment-variable-driven hyperparameter configuration
  • Sliding-window validation
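
For the tokenizer-agnostic BPB item above, the usual idea is to normalize the summed negative log-likelihood by the number of UTF-8 bytes in the text rather than by the token count, so models with different tokenizers can be compared. A minimal sketch, assuming the loss is summed in nats; the function name is hypothetical.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed token NLL (in nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * n_bytes)

# Toy check: 4 bytes of text at exactly 1 bit of loss per byte.
example_bpb = bits_per_byte(4 * math.log(2), "abcd")
```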