PR #1117 (closed): Add run_17 8xH100 submission (1.118685, <16MB)

by adityakm24
val_bpb: 1.1187
Architecture: Transformer
Optimizer: Parallel Muon
Artifact size: 15,985,833 bytes

Training Techniques

Architecture
  • GQA: grouped-query attention; parameters: {"num_heads":8,"num_kv_heads":4} (see the sketch after this list)
  • XSA: applied to the last 4 layers; parameters: {"layers":4}
  • Partial RoPE: rotary positional embeddings applied to only part of each head dimension; parameters: {"rope_dims":16}
  • SmearGate: gating mechanism; parameters: null
  • BigramHash: bigram hash features; parameters: {"size":1536}
  • TrigramHash: trigram hash features; parameters: {"size":1024}
  • ValueEmbedding: value embedding component; parameters: null
  • Value Residual: value residual pathway; parameters: null
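Since the PR only names these components, here is a minimal PyTorch sketch of the best-specified ones: grouped-query attention with Partial RoPE (num_heads=8, num_kv_heads=4, rope_dims=16) and the hashed bigram/trigram features (table sizes 1536 and 1024). All names are illustrative, not the submission's code; XSA, SmearGate, ValueEmbedding, and Value Residual are omitted because the PR gives no detail on them.

```python
# A minimal sketch, assuming standard GQA + partial-RoPE formulations;
# hyperparameters come from the PR metadata, everything else is assumed.
import torch
import torch.nn.functional as F
from torch import nn

def apply_partial_rope(x, rope_dims=16, base=10000.0):
    # Rotate only the first `rope_dims` channels of each head ("Partial RoPE");
    # the remaining channels pass through without positional mixing.
    T = x.shape[2]
    half = rope_dims // 2
    inv_freq = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    ang = torch.arange(T, device=x.device, dtype=x.dtype)[:, None] * inv_freq  # (T, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rot = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rot, x[..., rope_dims:]], dim=-1)

class GQAttention(nn.Module):
    # Grouped-query attention: 8 query heads share 4 key/value heads,
    # halving the KV projections (and KV cache) relative to full MHA.
    def __init__(self, dim, num_heads=8, num_kv_heads=4, rope_dims=16):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.h, self.kv, self.hd = num_heads, num_kv_heads, dim // num_heads
        self.rope_dims = rope_dims
        self.wq = nn.Linear(dim, num_heads * self.hd, bias=False)
        self.wk = nn.Linear(dim, num_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(dim, num_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(num_heads * self.hd, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.h, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.kv, self.hd).transpose(1, 2)
        q = apply_partial_rope(q, self.rope_dims)
        k = apply_partial_rope(k, self.rope_dims)
        # Repeat each KV head to serve num_heads // num_kv_heads query heads.
        k = k.repeat_interleave(self.h // self.kv, dim=1)
        v = v.repeat_interleave(self.h // self.kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, self.h * self.hd))

class NGramHashEmbedding(nn.Module):
    # BigramHash / TrigramHash: hash the trailing n token ids into a small
    # table (size 1536 for n=2, 1024 for n=3) and add the looked-up vector
    # to the embedding stream.
    def __init__(self, dim, table_size, n):
        super().__init__()
        self.n, self.size = n, table_size
        self.table = nn.Embedding(table_size, dim)

    def forward(self, ids):  # ids: (B, T) int64 token ids
        h = ids.clone()
        for i in range(1, self.n):
            prev = torch.roll(ids, shifts=i, dims=1)
            prev[:, :i] = 0                 # pad positions before sequence start
            h = h * 1000003 + prev          # illustrative polynomial hash
        return self.table(h % self.size)
```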
Optimizer
  • Parallel Muon, run alongside AdamW (other_params: {"adamw":true}); weight_decay: null; momentum: null (sketched below)
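Muon's core update (momentum followed by Newton-Schulz orthogonalization of each 2-D gradient matrix) is public, so a single-GPU sketch is below. The "parallel" part, which would shard this work across the 8 H100s, is not described in the PR; the {"adamw":true} flag presumably means AdamW handles the parameters Muon does not (embeddings, gains, 1-D tensors). The learning rate and momentum are placeholders, since the PR reports both as null.

```python
# A minimal single-device sketch of a Muon-style step; not the submission's
# "Parallel Muon", whose sharding strategy the PR does not describe.
import torch

@torch.no_grad()
def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G with the quintic Newton-Schulz iteration.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # buf is a persistent momentum buffer with the same shape as param;
    # lr and momentum are placeholders (reported as null in the PR).
    buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz(buf), alpha=-lr)
```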
Weight Averaging
  • EMA; parameters: null
  • SWA; parameters: null (both sketched below)
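Neither averaging scheme comes with parameters in the PR, so the decay and schedule in this sketch are placeholders; it only shows the standard form of each update.

```python
# A minimal sketch of both weight-averaging schemes named above.
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.999):
    # Exponential moving average: shadow <- decay*shadow + (1-decay)*online.
    for e, p in zip(ema_params, params):
        e.lerp_(p, 1.0 - decay)

@torch.no_grad()
def swa_update(swa_params, params, n_averaged):
    # Stochastic weight averaging: running mean of the checkpoints so far.
    for s, p in zip(swa_params, params):
        s.add_((p - s) / (n_averaged + 1))
```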
Quantization
  • late QAT; bits: null; scope: all
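A minimal sketch of what late QAT usually looks like: symmetric per-tensor fake quantization with a straight-through estimator, switched on only for the final stretch of training. bits is null in the metadata, but the contributions list says int6, so 6 bits is assumed.

```python
# A sketch of fake quantization for late QAT; 6 bits is assumed from the
# "int6+lzma" contribution bullet, not from the (null) metadata field.
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor symmetric scale
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    # Straight-through estimator: forward uses quantized weights,
    # backward treats the rounding as identity.
    return w + (q * scale - w).detach()
```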
Evaluation
  • sliding-window eval; parameters: {"stride":64}
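A sketch of stride-64 sliding-window evaluation: each forward pass advances 64 tokens but conditions on a full window of left context, so all but the first window of tokens are scored with near-maximal context. Only the stride comes from the PR; the window length here is an assumption.

```python
# A minimal sketch of sliding-window scoring with stride 64.
import torch

@torch.no_grad()
def sliding_window_nll(model, ids, window=1024, stride=64):
    # ids: (1, N) token ids. Returns (total nll in nats, tokens scored);
    # bpb = total_nll / (math.log(2) * total_bytes_of_eval_text).
    N = ids.size(1)
    total_nll, scored_to, begin = 0.0, 0, 0
    while True:
        end = min(begin + window, N)
        chunk = ids[:, begin:end]
        logits = model(chunk[:, :-1])                       # (1, L-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        nll = -logp.gather(-1, chunk[:, 1:].unsqueeze(-1)).squeeze(-1)
        first_new = max(scored_to, begin + 1)               # skip already-scored targets
        total_nll += nll[:, first_new - (begin + 1):].sum().item()
        scored_to = end
        if end == N:
            break
        begin += stride
    return total_nll, N - 1
```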
Test-Time Training
  • score-first TTT; parameters: {"learning_rate":0.0025,"epochs":6,"freeze_blocks":0}
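The PR does not define "score-first", but the natural leakage-free reading is: score each chunk of the eval stream before updating on it, so no token's loss ever benefits from having been trained on. That interpretation, the chunking, and the SGD optimizer below are assumptions; learning_rate=0.0025, epochs=6, and freeze_blocks=0 (nothing frozen, all parameters adapt) come from the PR.

```python
# A speculative sketch of score-first TTT under the reading above;
# only the three hyperparameters are taken from the PR metadata.
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.0025, epochs=6):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:                          # chunk: (1, T) token ids
        with torch.no_grad():                     # 1) score with current weights
            logits = model(chunk[:, :-1])
            nll = F.cross_entropy(logits.squeeze(0), chunk[0, 1:], reduction="sum")
            total_nll += nll.item()
            total_tokens += chunk.size(1) - 1
        for _ in range(epochs):                   # 2) only then adapt on the chunk
            logits = model(chunk[:, :-1])
            loss = F.cross_entropy(logits.squeeze(0), chunk[0, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll, total_tokens
```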
Compression
  • lzma; level: null
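A sketch of the packing step implied by "int6+lzma": quantized weights serialized and compressed with Python's lzma. The container format and preset are assumptions (level is null in the metadata).

```python
# A minimal sketch: int6 values stored in int8 containers, with lzma
# squeezing out the slack bits; shapes/metadata handling is omitted.
import lzma
import numpy as np

def pack_artifact(int_weights: dict, path: str,
                  preset: int = 9 | lzma.PRESET_EXTREME):
    blob = b"".join(w.astype(np.int8).tobytes() for w in int_weights.values())
    with lzma.open(path, "wb", preset=preset) as f:
        f.write(blob)
```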
Sequence Length
  • train_length: null; eval_length: null

Novel Contributions

  • Legal score-first test-time training for improved validation bpb
  • Compact int6+lzma artifact under the 16MB submission cap
  • Parameter-banking Transformer with GQA, XSA, Partial RoPE, SmearGate, BigramHash, TrigramHash, ValueEmbedding, and Value Residual
  • Parallel Muon + AdamW optimization with EMA and SWA
  • Sliding-window evaluation combined with late QAT