PR #209

open

Add non-record 11L int6 challenger 8xH100 attempt

by JWLBOYCE
val_bpb
1.1624
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB

Training Techniques

Quantization
int6
bits: 6
scope: 6-bit quantization applied to model weights; embeddings kept at 16 bits
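A minimal sketch of symmetric int6 weight quantization consistent with the settings above (the function names and per-tensor scaling scheme are illustrative assumptions, not the submission's actual code). Each tensor is mapped to signed integers in [-31, 31] with a single scale; embedding tables would simply be skipped and stored at 16 bits.

```python
def quantize_int6(weights):
    """Quantize a flat list of floats to signed 6-bit codes plus one scale.

    Symmetric range [-31, 31] is assumed here (dropping -32 keeps zero exact
    and the mapping symmetric); the submission's exact scheme is unspecified.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.31, 0.05, 0.27]
q, scale = quantize_int6(weights)
restored = dequantize_int6(q, scale)
```

Per-tensor scaling like this bounds the roundtrip error of every weight by half a quantization step, which is why only a few sensitive tensors need to stay in float.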
Architecture
tied embeddings
Uses tied embedding weights and keeps selected tensors in float for stability/size tradeoffs.
parameters: {"layers":11,"vocab":1024,"dim":512,"heads":8,"kv":4,"mlp_hidden":1536}
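The tied-embedding idea above can be sketched in a few lines (a toy illustration under assumed names, not the submission's model code): one vocab x dim table serves both token lookup and the output projection, so the table is stored only once in the 16MB artifact.

```python
class TiedEmbedding:
    """Toy tied input/output embedding: one table, two roles."""

    def __init__(self, vocab, dim):
        # Deterministic dummy weights; a real model would train these.
        self.table = [[0.01 * (i + j) for j in range(dim)] for i in range(vocab)]

    def embed(self, token_id):
        """Input role: look up the token's row."""
        return self.table[token_id]

    def logits(self, hidden):
        """Output role: project hidden state against the SAME table (h . W^T)."""
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]
```

With the listed shapes (vocab 1024, dim 512), tying saves one 1024x512 matrix relative to an untied output head, a meaningful fraction of a 16MB budget.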
Optimizer
Muon
weight_decay: 0.038
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.03}
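One plausible reading of the `other_params` entry is three optimizer parameter groups: 2-D "matrix" weights (which get Muon's momentum-orthogonalized update), scalars/vectors, and the tied embedding. The grouping rule below is an assumption sketched from the listed hyperparameters, not the run's actual code.

```python
def build_param_groups(named_shapes, weight_decay=0.038):
    """Partition parameters into the three LR groups implied by other_params.

    named_shapes: dict mapping parameter name -> shape tuple.
    The routing rule (by name / dimensionality) is an illustrative guess.
    """
    groups = {
        "matrix":     {"lr": 0.025, "weight_decay": weight_decay, "params": []},
        "scalar":     {"lr": 0.025, "weight_decay": weight_decay, "params": []},
        "tied_embed": {"lr": 0.030, "weight_decay": weight_decay, "params": []},
    }
    for name, shape in named_shapes.items():
        if name == "tok_emb.weight":        # assumed tied-embedding tensor name
            groups["tied_embed"]["params"].append(name)
        elif len(shape) >= 2:               # 2-D weights: Muon's matrix update
            groups["matrix"]["params"].append(name)
        else:                               # gains/biases: plain scalar rule
            groups["scalar"]["params"].append(name)
    return groups
```

Separate learning rates per group are standard practice with Muon, since its orthogonalized update only applies to matrix-shaped parameters and everything else needs a fallback rule.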
Compression
zstd
level: null
Evaluation
stride-based eval
parameters: {"stride":64,"eval_seq_len":2048}
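Stride-based evaluation with stride 64 and window 2048 typically means the context window advances 64 tokens at a time and only the newly exposed tokens are scored, so almost every token is evaluated with near-full left context. A sketch of the span schedule under that assumption (the submission's exact loop is not shown):

```python
def stride_eval_spans(n_tokens, seq_len=2048, stride=64):
    """Yield (window_start, score_from, score_to) spans for strided eval.

    Each window covers [window_start, score_to); losses are accumulated only
    over [score_from, score_to) so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

This is far more expensive than chunked evaluation (each forward pass scores only ~64 tokens after the first window), which is the usual tradeoff for a tighter bpb estimate.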
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Non-record submission capturing the exact code snapshot and remote train log from the strongest 8xH100 run, which was terminated during export before roundtrip scoring.
parameters: {"wallclock_cap_seconds":600,"batch_tokens":786432,"keep_float_tensors":["tok_emb.weight","blocks.9.attn.c_k.weight","blocks.10.attn.c_k.weight"],"context_features_enabled":{"bigram":0,"smeargate":0,"swa":0}}
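To fit the 16MB budget, the int6 codes presumably get bit-packed before zstd compression. The on-disk layout is not specified in the submission, so the scheme below (4 six-bit codes per 3 bytes, two's-complement) is only one plausible packing, shown for illustration.

```python
def pack_int6(values):
    """Pack signed 6-bit ints in [-32, 31] into bytes (4 values -> 3 bytes)."""
    bits, n, out = 0, 0, bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)   # two's-complement 6-bit code
        n += 6
        while n >= 8:
            n -= 8
            out.append((bits >> n) & 0xFF)
    if n:                                  # pad the final partial byte
        out.append((bits << (8 - n)) & 0xFF)
    return bytes(out)

def unpack_int6(data, count):
    """Recover `count` signed 6-bit ints from packed bytes."""
    vals, bits, n = [], 0, 0
    for byte in data:
        bits = (bits << 8) | byte
        n += 8
        while n >= 6 and len(vals) < count:
            n -= 6
            code = (bits >> n) & 0x3F
            vals.append(code - 64 if code >= 32 else code)  # sign-extend
    return vals
```

Packing brings the per-weight cost from 8 bits (one byte per code) down to 6, and the resulting byte stream is what a zstd pass would then compress further.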

Novel Contributions

  • Non-record 11-layer int6 challenger attempt for the 16MB track
  • Exact code snapshot and copied remote train.log from the strongest 8xH100 run
  • Reported strongest measured pre-roundtrip validation result of 1.1624 bpb
  • Kept selected tensors in float while quantizing the rest to int6
  • Used a Muon optimizer configuration with separate matrix, scalar, and tied-embedding learning rates