PR #209

open

Add non-record 11L int6 challenger 8xH100 attempt

by JWLBOYCE
val_bpb
1.1624
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB

Training Techniques

Quantization
int6
bits: 6
scope: 6-bit quantization applied to model weights; embeddings kept at 16 bits
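A minimal sketch of symmetric int6 weight quantization consistent with the settings above (the function names and per-tensor scaling scheme are illustrative assumptions, not the submission's actual code). Each tensor is mapped to signed integers in [-31, 31] with a single scale; embedding tables would simply be skipped and stored at 16 bits.

```python
def quantize_int6(weights):
    """Quantize a flat list of floats to signed 6-bit codes plus one scale.

    Symmetric range [-31, 31] is assumed here (dropping -32 keeps zero exact
    and the mapping symmetric); the submission's exact scheme is unspecified.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.31, 0.05, 0.27]
q, scale = quantize_int6(weights)
restored = dequantize_int6(q, scale)
```

Per-tensor scaling like this bounds the roundtrip error of every weight by half a quantization step, which is why only a few sensitive tensors need to stay in float.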
Architecture
tied embeddings
Uses tied embedding weights and keeps selected tensors in float for stability/size tradeoffs.
parameters: {"layers":11,"vocab":1024,"dim":512,"heads":8,"kv":4,"mlp_hidden":1536}
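The tied-embedding idea above can be sketched in a few lines (a toy illustration under assumed names, not the submission's model code): one vocab x dim table serves both token lookup and the output projection, so the table is stored only once in the 16MB artifact.

```python
class TiedEmbedding:
    """Toy tied input/output embedding: one table, two roles."""

    def __init__(self, vocab, dim):
        # Deterministic dummy weights; a real model would train these.
        self.table = [[0.01 * (i + j) for j in range(dim)] for i in range(vocab)]

    def embed(self, token_id):
        """Input role: look up the token's row."""
        return self.table[token_id]

    def logits(self, hidden):
        """Output role: project hidden state against the SAME table (h . W^T)."""
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]
```

With the listed shapes (vocab 1024, dim 512), tying saves one 1024x512 matrix relative to an untied output head, a meaningful fraction of a 16MB budget.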
Optimizer
Muon
weight_decay: 0.038
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.03}
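One plausible reading of the `other_params` entry is three optimizer parameter groups: 2-D "matrix" weights (which get Muon's momentum-orthogonalized update), scalars/vectors, and the tied embedding. The grouping rule below is an assumption sketched from the listed hyperparameters, not the run's actual code.

```python
def build_param_groups(named_shapes, weight_decay=0.038):
    """Partition parameters into the three LR groups implied by other_params.

    named_shapes: dict mapping parameter name -> shape tuple.
    The routing rule (by name / dimensionality) is an illustrative guess.
    """
    groups = {
        "matrix":     {"lr": 0.025, "weight_decay": weight_decay, "params": []},
        "scalar":     {"lr": 0.025, "weight_decay": weight_decay, "params": []},
        "tied_embed": {"lr": 0.030, "weight_decay": weight_decay, "params": []},
    }
    for name, shape in named_shapes.items():
        if name == "tok_emb.weight":        # assumed tied-embedding tensor name
            groups["tied_embed"]["params"].append(name)
        elif len(shape) >= 2:               # 2-D weights: Muon's matrix update
            groups["matrix"]["params"].append(name)
        else:                               # gains/biases: plain scalar rule
            groups["scalar"]["params"].append(name)
    return groups
```

Separate learning rates per group are standard practice with Muon, since its orthogonalized update only applies to matrix-shaped parameters and everything else needs a fallback rule.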
Compression
zstd
level: null
Evaluation
stride-based eval
parameters: {"stride":64,"eval_seq_len":2048}
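Stride-based evaluation with stride 64 and window 2048 typically means the context window advances 64 tokens at a time and only the newly exposed tokens are scored, so almost every token is evaluated with near-full left context. A sketch of the span schedule under that assumption (the submission's exact loop is not shown):

```python
def stride_eval_spans(n_tokens, seq_len=2048, stride=64):
    """Yield (window_start, score_from, score_to) spans for strided eval.

    Each window covers [window_start, score_to); losses are accumulated only
    over [score_from, score_to) so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

This is far more expensive than chunked evaluation (each forward pass scores only ~64 tokens after the first window), which is the usual tradeoff for a tighter bpb estimate.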
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Non-record submission capturing the exact code snapshot and remote train log from the strongest 8xH100 run, which was terminated during export before roundtrip scoring.
parameters: {"wallclock_cap_seconds":600,"batch_tokens":786432,"keep_float_tensors":["tok_emb.weight","blocks.9.attn.c_k.weight","blocks.10.attn.c_k.weight"],"context_features_enabled":{"bigram":0,"smeargate":0,"swa":0}}
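To fit the 16MB budget, the int6 codes presumably get bit-packed before zstd compression. The on-disk layout is not specified in the submission, so the scheme below (4 six-bit codes per 3 bytes, two's-complement) is only one plausible packing, shown for illustration.

```python
def pack_int6(values):
    """Pack signed 6-bit ints in [-32, 31] into bytes (4 values -> 3 bytes)."""
    bits, n, out = 0, 0, bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)   # two's-complement 6-bit code
        n += 6
        while n >= 8:
            n -= 8
            out.append((bits >> n) & 0xFF)
    if n:                                  # pad the final partial byte
        out.append((bits << (8 - n)) & 0xFF)
    return bytes(out)

def unpack_int6(data, count):
    """Recover `count` signed 6-bit ints from packed bytes."""
    vals, bits, n = [], 0, 0
    for byte in data:
        bits = (bits << 8) | byte
        n += 8
        while n >= 6 and len(vals) < count:
            n -= 6
            code = (bits >> n) & 0x3F
            vals.append(code - 64 if code >= 32 else code)  # sign-extend
    return vals
```

Packing brings the per-weight cost from 8 bits (one byte per code) down to 6, and the resulting byte stream is what a zstd pass would then compress further.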

Novel Contributions

  • Non-record 11-layer int6 challenger attempt for the 16MB track
  • Exact code snapshot and copied remote train.log from the strongest 8xH100 run
  • Reported strongest measured pre-roundtrip validation result of 1.1624 bpb
  • Kept selected tensors in float while quantizing the rest to int6
  • Used a Muon optimizer configuration with separate matrix, scalar, and tied-embedding learning rates