| val_bpb | Architecture | Optimizer | Artifact size |
| --- | --- | --- | --- |
| 1.1624 | Transformer | Muon | 16 MB |
Training Techniques: Quantization
- method: int6
- bits: 6
- scope: weight bits for model weights; embeddings kept at 16 bits
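The int6 scheme can be sketched as symmetric round-to-nearest quantization into the signed 6-bit range. This is an assumed minimal version; the submission does not specify details such as per-tensor versus per-channel scales:

```python
import numpy as np

def quantize_int6(w, qmax=31):
    """Symmetric per-tensor int6 quantization: map floats into the
    signed 6-bit range [-32, 31] with a single scale factor.
    (A sketch; the submission's exact scheme is not specified.)"""
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 1536)).astype(np.float32)  # e.g. an MLP weight
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
# round-to-nearest error is at most half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-7
```

Note that storing 6-bit codes still requires packing (e.g. four codes per three bytes) before the zstd stage to realize the size win; the sketch leaves codes in int8 containers for clarity.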
Architecture: tied embeddings
- Ties the input embedding and output head weights, and keeps selected tensors in float as a stability/size tradeoff.
- parameters: `{"layers":11,"vocab":1024,"dim":512,"heads":8,"kv":4,"mlp_hidden":1536}`
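Weight tying means the token-embedding matrix doubles as the output projection. A minimal numpy sketch, using the submission's `vocab=1024` and `dim=512`:

```python
import numpy as np

vocab, dim = 1024, 512  # from the parameters above
rng = np.random.default_rng(0)
E = (0.02 * rng.standard_normal((vocab, dim))).astype(np.float32)

def embed(token_ids):
    # input side: look up one row of E per token
    return E[token_ids]

def lm_head(hidden):
    # output side: reuse the same matrix, transposed, for the logits
    return hidden @ E.T

ids = np.array([3, 17, 42])
logits = lm_head(embed(ids))
assert logits.shape == (3, vocab)
```

Tying removes a separate vocab-by-dim output matrix, which matters under a 16 MB artifact budget.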
Optimizer: Muon
- weight_decay: 0.038
- momentum: null
- other_params: `{"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.03}`
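The three learning rates suggest parameters are routed into separate groups (Muon's orthogonalized update applies to 2-D matrices, while scalars/vectors and the tied embedding take their own rates). A hypothetical routing helper matching the values above; the real grouping logic is not part of the submission:

```python
def lr_for(name, shape,
           matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.03):
    """Hypothetical mapping of a named tensor to one of the three
    learning rates listed above (an assumption, not the actual code)."""
    if name == "tok_emb.weight":      # tied embedding gets its own lr
        return tied_embed_lr
    if len(shape) >= 2:               # 2-D weights: Muon's matrix path
        return matrix_lr
    return scalar_lr                  # biases, norms, other 1-D params

assert lr_for("tok_emb.weight", (1024, 512)) == 0.03
assert lr_for("blocks.0.attn.c_k.weight", (512, 256)) == 0.025
assert lr_for("blocks.0.norm.weight", (512,)) == 0.025
```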
Compression: zstd
- level: null
Evaluation: stride-based eval
- parameters: `{"stride":64,"eval_seq_len":2048}`
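With `stride=64` and `eval_seq_len=2048`, evaluation presumably slides a 2048-token window forward 64 tokens at a time, scoring only each window's new tokens. A sketch of the common scheme, assumed here since the submission's exact bookkeeping is not given:

```python
def stride_eval_spans(n_tokens, seq_len=2048, stride=64):
    """Sliding-window spans for strided LM evaluation. Each window
    covers [begin, end); only its last `target` tokens are scored, so
    every token is scored exactly once with up to seq_len - stride
    tokens of preceding context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = stride_eval_spans(5000)
# every token is scored exactly once across all windows
assert sum(target for _, _, target in spans) == 5000
```

Averaging the per-token losses over all scored spans (in bits) yields the bits-per-byte figure reported above.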
Sequence Length
- train_length: 2048
- eval_length: 2048
Other
- Non-record submission capturing the exact code snapshot and the remote train log from the strongest 8xH100 run; that run was terminated during export, before roundtrip scoring.
- parameters: `{"wallclock_cap_seconds":600,"batch_tokens":786432,"keep_float_tensors":["tok_emb.weight","blocks.9.attn.c_k.weight","blocks.10.attn.c_k.weight"],"context_features_enabled":{"bigram":0,"smeargate":0,"swa":0}}`
Novel Contributions
- Non-record 11-layer int6 challenger attempt for the 16MB track
- Exact code snapshot and copied remote train.log from the strongest 8xH100 run
- Reported strongest measured pre-roundtrip validation result of 1.1624 bpb
- Kept selected tensors in float while quantizing the rest to int6
- Used a Muon optimizer configuration with separate matrix, scalar, and tied-embedding learning rates