PR #994
open
Add Kshitij submission (1x H100, val_bpb 1.4315, env-based config)
by singhaikshitijjain
val_bpb
1.4315
Architecture
Transformer
Optimizer
Muon
Artifact Size
10.63 MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz orthogonalization":true}
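For orientation, the orthogonalization step can be sketched with the classical cubic Newton-Schulz iteration. This is an illustrative stand-in, not the PR's code: Muon implementations commonly use a tuned higher-order polynomial with only a few steps, and the helper names, step count, and test matrix below are assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=30, eps=1e-7):
    """Approximate the orthogonal polar factor of g with the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X."""
    # Frobenius normalization keeps every singular value at or below 1,
    # which is inside the convergence region of the iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

# Hypothetical 2x2 "gradient"; the result should have orthonormal rows.
update = newton_schulz_orthogonalize([[2.0, 0.5], [0.1, 1.0]])
```

The iteration needs only matrix multiplies, which is why it suits GPU training better than an explicit SVD.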
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
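A minimal sketch of how int8 quantization with per-row scaling and zlib compression could fit together. The symmetric scheme, the function names, and the sample weights are assumptions; the compression level is left at the library default because the PR metadata lists it as null.

```python
import zlib
from array import array

def quantize_rows(matrix):
    """Symmetric int8 quantization with one float scale per row."""
    scales, codes = [], array("b")
    for row in matrix:
        peak = max(abs(v) for v in row)
        scale = peak / 127.0 if peak > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in row)
    return scales, codes

def dequantize_rows(scales, codes, cols):
    """Rebuild float rows from int8 codes and their per-row scales."""
    return [[codes[r * cols + c] * s for c in range(cols)] for r, s in enumerate(scales)]

# Hypothetical weights: the second row is much smaller than the first,
# so a per-row scale preserves it better than one global scale would.
weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.03]]
scales, codes = quantize_rows(weights)
blob = zlib.compress(codes.tobytes())  # default compression level
restored = array("b")
restored.frombytes(zlib.decompress(blob))
approx = dequantize_rows(scales, restored, cols=3)
```

Each restored value differs from the original by at most half of its row's scale.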
Evaluation
sliding window eval
parameters: null
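One common way to set up sliding-window evaluation is sketched below: overlapping windows supply context, and each token is scored exactly once. The window and stride values are illustrative; the PR metadata does not list them.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (begin, end, score_from): the model sees tokens [begin, end)
    and only tokens [score_from, end) count toward the loss."""
    assert 0 < stride <= window
    scored_until = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, scored_until
        scored_until = end
        if end == n_tokens:
            break

# Illustrative sizes only.
spans = list(sliding_windows(n_tokens=10, window=4, stride=2))
```

A smaller stride gives each scored token more left context at the cost of more forward passes.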
Architecture
U-Net skip connections
U-Net style encoder-decoder transformer with skip connections.
parameters: null
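A rough sketch of the U-Net pattern, assuming the first half of the blocks saves its outputs and the second half adds them back in reverse order with a per-layer weight. The function name, the weighting, and the toy block are hypothetical; the PR may wire its skips differently.

```python
def unet_forward(x, blocks, skip_weights):
    """Run blocks in order; the first half saves its outputs, the second
    half adds them back in reverse (last saved, first reused)."""
    n_enc = len(blocks) // 2
    saved = []
    for block in blocks[:n_enc]:
        x = block(x)
        saved.append(x)
    for i, block in enumerate(blocks[n_enc:]):
        if saved:
            skip = saved.pop()
            x = [a + skip_weights[i] * s for a, s in zip(x, skip)]
        x = block(x)
    return x

# Toy stand-in for a transformer block: adds 1 to every activation.
add_one = lambda v: [e + 1.0 for e in v]
out = unet_forward([0.0], [add_one] * 4, skip_weights=[1.0, 1.0])
```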
weight tying
Tied embeddings.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":2}
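With 8 attention heads and 2 KV heads, each KV head serves 4 query heads. The sketch below assumes contiguous grouping, which is a common convention but is not confirmed by the PR metadata; the function name is hypothetical.

```python
def kv_head_for(query_head, heads=8, kv_heads=2):
    """Index of the KV head that a given query head attends with."""
    assert heads % kv_heads == 0
    group_size = heads // kv_heads
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(8)]
```

Under this layout the key and value projections hold a quarter of the parameters they would need with 8 KV heads.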
RoPE
Rotary embeddings with partial RoPE.
parameters: null
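Partial RoPE rotates only a leading slice of each head vector and passes the rest through unchanged. The sketch below assumes the half-split pairing convention and a base of 10000; the rotated fraction, the base, and the function name are assumptions rather than values taken from the PR.

```python
import math

def rope_partial(x, pos, rot_dim, base=10000.0):
    """Rotate the first rot_dim entries of x by position-dependent angles;
    entries from rot_dim onward are returned unchanged."""
    assert rot_dim % 2 == 0 and rot_dim <= len(x)
    half = rot_dim // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]  # pair entry i with entry i + half
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Hypothetical 6-dim head vector with 4 rotated dims.
q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = rope_partial(q, pos=7, rot_dim=4)
```

Because each pair undergoes a plain rotation, vector norms are preserved and query-key dot products depend only on the relative position.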
Weight Averaging
EMA
parameters: null
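EMA weight averaging keeps a shadow copy of the parameters that trails the live weights. A minimal sketch follows; the decay value and the dictionary layout are assumptions, since the PR metadata lists no parameters.

```python
def ema_update(shadow, params, decay=0.999):
    """Blend current parameters into the running average in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy run with an exaggerated decay so the effect is visible quickly.
shadow = {"w": 0.0}
for _ in range(3):
    ema_update(shadow, {"w": 1.0}, decay=0.5)
```

The shadow weights, not the live ones, would typically be the ones quantized and evaluated.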
LR Schedule
cosine decay
parameters: {"warmup":true,"warmdown":true}
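The metadata names cosine decay with both warmup and warmdown but gives no shape details. One possible reading is sketched below: linear warmup, cosine decay to a floor, then a linear warmdown to zero. Every number and the function name are assumptions.

```python
import math

def lr_at(step, total, peak, warmup, warmdown, floor_frac=0.1):
    """Learning rate at a given step under the assumed three-phase schedule."""
    floor = peak * floor_frac
    if step < warmup:
        return peak * (step + 1) / warmup  # linear ramp up to peak
    decay_end = total - warmdown
    if step < decay_end:
        progress = (step - warmup) / max(1, decay_end - warmup)
        return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
    remaining = (total - step) / max(1, warmdown)
    return floor * max(0.0, remaining)  # linear ramp from floor down to zero

schedule = [lr_at(s, total=100, peak=1.0, warmup=10, warmdown=20) for s in range(101)]
```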
Regularization
weight decay
parameters: {"type":"decoupled","style":"AdamW"}
gradient clipping
parameters: null
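The two regularizers above can be sketched as follows. Clipping rescales all gradients by a shared factor, and decoupled decay shrinks the weights directly instead of adding a penalty to the gradient. The function names, clip threshold, and step sizes are illustrative assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

def decoupled_decay_step(params, updates, lr, weight_decay):
    """Shrink weights directly, separately from the gradient-based update."""
    return [p * (1.0 - lr * weight_decay) - lr * u for p, u in zip(params, updates)]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
stepped = decoupled_decay_step([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.1)
```

With a zero update, the second call isolates the decay term: each weight shrinks by a factor of 1 - lr * weight_decay.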
Sequence Length
sequence_length
train_length: 256
eval_length: null
Other
other
Environment-variable-based hyperparameter configuration visible in logs.
parameters: null
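Environment-driven configuration might look like the sketch below. The variable names (TRAIN_SEQ_LEN and so on) and the class are hypothetical; only the defaults for sequence length and head counts come from the values listed in this summary.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    train_seq_len: int = 256   # train_length listed in this summary
    num_heads: int = 8         # GQA heads listed in this summary
    num_kv_heads: int = 2      # GQA kv_heads listed in this summary
    ema_decay: float = 0.999   # assumed default, not from the PR

def load_hyperparameters(env=None):
    """Build the config from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = Hyperparameters()
    return Hyperparameters(
        train_seq_len=int(env.get("TRAIN_SEQ_LEN", d.train_seq_len)),
        num_heads=int(env.get("NUM_HEADS", d.num_heads)),
        num_kv_heads=int(env.get("NUM_KV_HEADS", d.num_kv_heads)),
        ema_decay=float(env.get("EMA_DECAY", d.ema_decay)),
    )

# Passing a plain dict stands in for the process environment.
config = load_hyperparameters({"TRAIN_SEQ_LEN": "512"})
```

Printing the resolved config at startup is what would make the hyperparameters visible in the training logs.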
other
Flash attention for higher throughput and memory efficiency.
parameters: null
other
Distributed token streaming loader.
parameters: null
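A streaming loader for data-parallel training can be sketched as each rank reading a disjoint slice of every global chunk. The layout below (contiguous per-rank slices, one extra token for shifted targets) and the function name are assumptions about how such a loader might work, not the PR's implementation.

```python
def rank_batches(tokens, seq_len, rank, world_size):
    """Yield (inputs, targets) for one rank; targets are inputs shifted by one."""
    pos = 0
    step = world_size * seq_len
    # Each global chunk needs one extra token for the final target.
    while pos + step + 1 <= len(tokens):
        start = pos + rank * seq_len
        chunk = tokens[start:start + seq_len + 1]
        yield chunk[:-1], chunk[1:]
        pos += step

# Toy stream of 17 token ids split across 2 ranks.
stream = list(range(17))
rank1 = list(rank_batches(stream, seq_len=4, rank=1, world_size=2))
```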
Novel Contributions
- Muon optimizer with Newton-Schulz orthogonalization
- Int8 post-training quantization with per-row scaling
- Zlib-compressed artifact
- Tokenizer-agnostic BPB evaluation
- U-Net style encoder-decoder transformer with skip connections
- Tied embeddings with custom learning rates
- RMSNorm, rotary embeddings, GQA, and SwiGLU MLP
- Distributed token streaming loader
- EMA weight averaging
- Environment-variable-driven hyperparameter configuration
- Sliding-window validation
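The tokenizer-agnostic BPB metric listed above normalizes loss by raw bytes rather than by tokens, so scores stay comparable across tokenizers. A minimal sketch, assuming the loss is a summed negative log-likelihood in nats; the sample text and per-token losses are made up for illustration.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed token-level negative log-likelihood (nats)
    into bits per UTF-8 byte of the underlying text."""
    return total_nll_nats / (math.log(2) * total_bytes)

text = "héllo"
n_bytes = len(text.encode("utf-8"))   # "é" takes two bytes in UTF-8
token_nlls = [math.log(4)] * 3        # hypothetical: 3 tokens at 2 bits each
bpb = bits_per_byte(sum(token_nlls), n_bytes)
```

A tokenizer that splits the same text into more tokens changes the per-token loss but leaves the byte count, and therefore the denominator, unchanged.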