PR #356
Status: open
Non-record: PR #315 repro on 1xH100 PCIe, int6+zstd (val_bpb=1.8338)
by sjp611
val_bpb: 1.8338
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.0 MB
Training Techniques
Architecture
XSA
Applies XSA to the last layers of the model.
parameters: {"layers":4}
Partial RoPE
Applies rotary positional embeddings to only a subset of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
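A minimal sketch of partial RoPE with these parameters, assuming a (seq_len, n_heads, head_dim) tensor layout and the standard rotate-half formulation; only the dimension counts (16 of 64) come from the PR:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq_len, n_heads, head_dim) -- assumed layout, not the PR's exact code.
    # Rotate only the first `rot_dims` of the head dimension; pass the rest through.
    seq = x.shape[0]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq      # (seq, half)
    cos = np.cos(angles)[:, None, :]                 # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

The remaining 48 dimensions carry no positional signal, which is the point of partial RoPE: some head dimensions stay position-independent.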
LN Scale
Applies layer norm scaling.
parameters: null
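One common form of layer norm scaling multiplies the normalized output by 1/sqrt(layer depth) to damp variance growth in deep stacks; this specific form is an assumption, since the PR gives no formula or parameters:

```python
import numpy as np

def ln_scaled(x, depth, eps=1e-6):
    # Standard LayerNorm (no learned affine, for brevity), then a
    # 1/sqrt(depth) damping factor. The damping form is an assumption.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(depth)
```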
BigramHash
Uses a bigram hashing vocabulary mechanism.
parameters: {"vocab_size":2048}
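The bigram hashing idea can be sketched as mapping each (previous token, current token) pair to an id in a small auxiliary vocabulary; only vocab_size=2048 comes from the PR, while the multiplicative hash constant and mixing scheme below are assumptions:

```python
import numpy as np

def bigram_hash_ids(tokens, vocab_size=2048, mult=2654435761):
    # Hash each (prev, cur) token pair into [0, vocab_size) with a
    # Knuth-style multiplicative hash (constant chosen for illustration).
    tokens = np.asarray(tokens, dtype=np.uint64)
    prev = np.roll(tokens, 1)
    prev[0] = 0                       # no predecessor for the first token
    mixed = (prev * np.uint64(mult) + tokens) * np.uint64(mult)
    return (mixed % np.uint64(vocab_size)).astype(np.int64)
```

The resulting ids would index a small auxiliary embedding table alongside the main token embeddings.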
SmearGate
Uses SmearGate gating in the architecture.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
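EMA weight averaging with decay 0.997 amounts to a per-step exponential moving average of the parameters; a minimal sketch over a dict of parameter tensors (the dict layout is illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    # After each optimizer step: ema <- decay * ema + (1 - decay) * current.
    # Evaluation then uses ema_params instead of the raw weights.
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```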
Quantization
QAT
bits: 6
scope: all
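A hedged sketch of the 6-bit fake quantization used in a QAT forward pass; symmetric per-tensor scaling is an assumption, since the PR states only bits=6 and scope=all (gradients would bypass the rounding via a straight-through estimator):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    # Symmetric per-tensor fake quantization: int6 levels are [-32, 31].
    # The forward pass sees the rounded-and-rescaled weights.
    qmax = 2 ** (bits - 1) - 1            # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```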
Compression
zstd
level: null
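Fitting int6 weights into the artifact implies packing 6-bit values into bytes before compression; one possible packing sketch (4 values per 3 bytes), with the zstd step itself left to a library such as the `zstandard` package:

```python
import numpy as np

def pack_int6(q):
    # Pack int6 values in [-32, 31] into a dense bitstream, MSB-first.
    u = (np.asarray(q, dtype=np.int64) + 32).astype(np.uint8)   # to [0, 63]
    bits = np.unpackbits(u[:, None], axis=1)[:, 2:]             # low 6 bits
    flat = bits.reshape(-1)
    pad = (-len(flat)) % 8
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    return np.packbits(flat)

def unpack_int6(packed, n):
    # Inverse: recover n int6 values from the packed bytes.
    bits = np.unpackbits(packed)[: n * 6].reshape(n, 6)
    padded = np.concatenate([np.zeros((n, 2), dtype=np.uint8), bits], axis=1)
    return np.packbits(padded, axis=1)[:, 0].astype(np.int64) - 32
```

Packed this way, weights take 0.75 bytes each before zstd, and zstd then removes the remaining statistical redundancy.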
Evaluation
sliding window eval
parameters: {"stride":64}
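Sliding-window evaluation with stride 64 can be organized as below, scoring only the new tokens of each window so most tokens are predicted with near-full context; window=1024 is an assumed context length, only the stride comes from the PR:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    # Returns (context_start, context_end, n_scored) triples. Each window
    # advances by `stride`, and only the tokens not covered by the previous
    # window are scored, so every token is scored exactly once.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```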
LR Schedule
warmdown
parameters: {"warmdown_iters":400,"momentum_warmup_steps":200}
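The schedule can be sketched as a constant learning rate followed by a linear warmdown to zero over the final 400 iterations, plus a 200-step momentum warmup for the optimizer; the momentum endpoints below are assumptions, only the step counts come from the PR:

```python
def lr_scale(step, total_iters, warmdown_iters=400):
    # Multiplier on the base LR: 1.0 until the final warmdown_iters steps,
    # then a linear ramp down to 0.
    steps_left = total_iters - step
    if steps_left >= warmdown_iters:
        return 1.0
    return steps_left / warmdown_iters

def momentum(step, warmup_steps=200, lo=0.85, hi=0.95):
    # Momentum warmup: ramp linearly from lo to hi over warmup_steps.
    # lo/hi are illustrative values, not from the PR.
    frac = min(step / warmup_steps, 1.0)
    return lo + (hi - lo) * frac
```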
Regularization
weight decay
parameters: {"value":0.04}
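With value 0.04, weight decay in its decoupled (AdamW-style) form shrinks each weight toward zero every step, independently of the gradient update; the decoupled form is an assumption, the PR gives only the coefficient:

```python
def decoupled_weight_decay(param, lr, wd=0.04):
    # Per-step decay applied alongside (not inside) the gradient update.
    return param * (1.0 - lr * wd)
```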
Novel Contributions
- Reproduction of the PR #315 recipe on a single H100 PCIe GPU
- Adaptation of the training schedule to 1 GPU, with LR warmdown and momentum warmup
- Use of Flash Attention 2 instead of Flash Attention 3
- int6 quantization with zstd compression to fit within the 10.0 MB artifact size limit
- QAT enabled only near the end of training to stay within a constrained training budget