PR #698 (open)

Add MergedTop3_v3 clean 8xH100 record-track submission

by hesong0222-dev
val_bpb: 1.1642
Architecture: Transformer
Optimizer: Muon/AdamW
Artifact Size: 15,635,201 bytes

Training Techniques

Architecture
XSA
Applied XSA on the last 4 layers (layers: 4).
MLP3x
Used 3x-wide MLP blocks (multiplier: 3).
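The 3x block is a feed-forward layer whose hidden width is three times the model width (rather than the more common 4x). A minimal sketch; the activation and dimensions are illustrative assumptions, only the 3x multiplier comes from the submission:

```python
import numpy as np

d_model, mult = 256, 3                      # MLP3x: hidden width = 3 * d_model
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.02, (d_model, mult * d_model))
W_out = rng.normal(0, 0.02, (mult * d_model, d_model))

def mlp3x(x):
    # ReLU is an assumption; the submission only fixes the 3x width.
    return np.maximum(x @ W_in, 0.0) @ W_out
```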
SmearGate
Included SmearGate in the model.
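The submission does not spell out the SmearGate formulation; one plausible reading is a learned sigmoid gate that "smears" the previous token's representation into the current position:

```python
import numpy as np

def smear_gate(x, w_gate):
    # x: (seq, dim) token activations; w_gate: (dim,) learned gate projection.
    # This formulation is an assumption, not confirmed by the submission:
    # a per-position sigmoid gate controls how much of the previous token's
    # representation is blended into the current one.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # (seq,) values in (0, 1)
    prev = np.vstack([x[:1], x[:-1]])            # shift right by one position
    return x + gate[:, None] * prev
```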
BigramHash
Used a BigramHash feature with 2048 buckets.
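A bigram-hash feature hashes each (previous, current) token-id pair into a fixed number of buckets and looks up an extra embedding. A sketch; the hash constant and the 64-dim table are illustrative assumptions, only the 2048 bucket count is from the submission:

```python
import numpy as np

BUCKETS = 2048
emb = np.random.default_rng(0).normal(0, 0.02, (BUCKETS, 64))  # bucket embedding table

def bigram_feature(token_ids):
    # Hash each (previous, current) token-id pair into one of 2048 buckets.
    # The multiplicative hash constant is arbitrary; position 0 pairs with id 0.
    ids = np.asarray(token_ids)
    prev = np.concatenate([[0], ids[:-1]])
    buckets = (prev * 1000003 + ids) % BUCKETS
    return buckets, emb[buckets]                 # (seq,), (seq, 64)
```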
Partial RoPE
Applied partial RoPE with reduced rotary dimensions (dimensions: 16).
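Partial RoPE rotates only the first few channels of each query/key head and leaves the rest position-independent. A minimal sketch with the submission's 16 rotary dimensions; the frequency base is the conventional 10000 (an assumption):

```python
import numpy as np

def partial_rope(q, rope_dims=16, base=10000.0):
    # q: (seq, head_dim). Only the first rope_dims channels are rotated;
    # the remaining channels pass through unchanged.
    seq, head_dim = q.shape
    half = rope_dims // 2
    pos = np.arange(seq)[:, None]
    freqs = pos / base ** (np.arange(half) / half)   # (seq, half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=1)
```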
GPTQ-lite
Used GPTQ-lite clip search.
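The exact GPTQ-lite procedure is not given in the submission; one common reading of a "clip search" is a grid search over clipping fractions that minimizes quantization MSE, sketched here for int6:

```python
import numpy as np

def clip_search(w, n_bits=6, grid=20):
    # Search a clipping fraction of max|w| that minimizes round-trip MSE.
    # The grid size and search range are illustrative choices.
    qmax = 2 ** (n_bits - 1) - 1
    best_err, best_scale = np.inf, None
    wmax = np.abs(w).max()
    for frac in np.linspace(0.5, 1.0, grid):
        scale = frac * wmax / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        err = np.mean((q * scale - w) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale, best_err
```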
Weight Averaging
EMA (exponential moving average of the weights).
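EMA weight averaging keeps a shadow copy of the parameters that is updated as a decayed running average after each step. A minimal sketch; the decay value is illustrative, as the submission does not state it:

```python
import numpy as np

class EMA:
    # Shadow weights: shadow <- decay * shadow + (1 - decay) * params.
    # decay=0.999 is an illustrative default, not the value used in the run.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.astype(np.float64).copy() for p in params]

    def update(self, params):
        d = self.decay
        for s, p in zip(self.shadow, params):
            s *= d
            s += (1.0 - d) * p
```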
Quantization
Mixed int6 (bits: 6, scope: all).
Compression
zstd (level unspecified).
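The artifact pipeline pairs int6 quantization with zstd compression of the quantized tensors. A sketch of the quantization step with symmetric per-tensor scaling; zlib stands in for zstd here since zstd bindings are not in the Python standard library, and the per-tensor scheme is an assumption:

```python
import numpy as np
import zlib

def quantize_int6(w):
    # Symmetric per-tensor int6: values mapped to integers in [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def compress(q):
    # zlib stands in for zstd; the idea is the same (entropy-code the integers).
    return zlib.compress(q.tobytes(), 9)
```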
Sequence Length
2048 for both training and evaluation.
Optimizer
Muon/AdamW (weight_decay: 0.04; momentum and other parameters unspecified).
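Muon orthogonalizes each 2-D weight update via a quintic Newton-Schulz iteration applied to the momentum buffer, with AdamW typically handling the remaining parameters. A sketch of the core iteration based on the public Muon description (coefficients from its reference implementation); this run's hyperparameters and parameter split are not specified:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (map its singular values toward 1)
    # using the quintic Newton-Schulz iteration from the public Muon code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # keep X @ X.T the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```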
Evaluation
Sliding-window evaluation (stride: 64).
Regularization
Layerwise LN scale.
LR Schedule
warmdown3500 (warmdown_steps: 3500).
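The usual reading of "warmdown3500" is a constant learning rate that decays linearly to zero over the final 3500 steps. A sketch; `total_steps` and `base_lr` are placeholders, only the 3500-step warmdown is from the submission:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant LR until the warmdown begins, then linear decay to 0.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```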

Novel Contributions

  • Merged top-stack recipe built from public leaderboard lineage
  • 11-layer model with XSA on the last 4 layers
  • EMA-only training
  • 3x MLP blocks
  • SmearGate integration
  • BigramHash with 2048 buckets
  • Mixed int6 quantization with zstd compression
  • Sliding-window evaluation with stride 64
  • Partial RoPE with ROPE_DIMS=16
  • Layerwise LN scaling
  • GPTQ-lite clip search
  • Clean rerun package with strict runtime gates for uninterrupted 8x H100 execution