PR #356

open

Non-record: PR315 repro on 1xH100 PCIe, int6+zstd (val_bpb=1.8338)

val_bpb: 1.8338
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.0 MB

Training Techniques

Architecture
XSA
Applies XSA to the last layers of the model.
parameters: {"layers":4}
Partial RoPE
Applies rotary positional embeddings to only a subset of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
LN Scale
Applies layer norm scaling.
parameters: null
BigramHash
Uses a bigram hashing vocabulary mechanism.
parameters: {"vocab_size":2048}
SmearGate
Uses SmearGate gating in the architecture.
parameters: null
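The Partial RoPE entry above (rotating 16 of 64 head dimensions) can be sketched as follows. This is a minimal NumPy illustration of the idea, not the PR's implementation; the pairing of dimensions and the frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of the head dimension; the
    remaining dims pass through position-independent. Dimensions are
    paired as (i, i + rope_dims // 2), an illustrative choice."""
    half = rope_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    ang = np.outer(pos, inv_freq)                  # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The remaining 48 dimensions carry no positional signal, which is the point of the technique: only part of each head is position-aware.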
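The BigramHash entry is described only as a bigram hashing vocabulary mechanism with vocab_size 2048. A minimal sketch of hashing token bigrams into a fixed-size auxiliary vocabulary might look like this; the hash constant, the BOS convention, and how the bucket ids are consumed are all illustrative assumptions, not details from the PR.

```python
def bigram_hash(prev_tok, tok, vocab_size=2048):
    """Map a (previous, current) token-id pair to one of `vocab_size`
    buckets; the multiplier is an arbitrary illustrative prime."""
    return ((prev_tok * 1000003) ^ tok) % vocab_size

def bigram_bucket_ids(token_ids, vocab_size=2048):
    """Bucket id per position based on its preceding bigram
    (position 0 pairs with an assumed BOS id of 0)."""
    prev = [0] + list(token_ids[:-1])
    return [bigram_hash(p, t, vocab_size) for p, t in zip(prev, token_ids)]
```

The bucket ids would typically index a small auxiliary embedding table added to the token embeddings.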
Weight Averaging
EMA
parameters: {"decay":0.997}
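The EMA entry with decay 0.997 is the standard exponential moving average of weights, ema <- decay * ema + (1 - decay) * param. A minimal sketch over a dict of parameters:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over named parameters. With decay 0.997 the
    average has an effective horizon of roughly
    1 / (1 - 0.997) ~= 333 steps."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```

At evaluation time the EMA copy is typically used in place of the raw weights.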
Quantization
QAT
bits: 6
scope: all
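QAT at 6 bits typically means fake-quantizing weights in the forward pass while training, so the network adapts to the rounding. A minimal symmetric per-tensor sketch; the scaling scheme is an assumption and the PR may quantize differently:

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor fake quantization to 6 bits.
    Forward: snap weights to the 63-level grid {-31..31} * scale.
    In QAT the backward pass would route gradients straight through
    the rounding (straight-through estimator)."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, q.astype(np.int8), scale
```

The integer codes and the scale are what get written to the artifact; the dequantized copy is what the forward pass sees during training.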
Compression
zstd
level: null
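Storing int6 weights compactly requires bit-packing before zstd compression. A sketch of packing four 6-bit values into three bytes; the bit layout is an assumption, and the final zstd call is shown only as a comment since it needs the third-party zstandard package:

```python
import numpy as np

def pack_int6(q):
    """Pack int6 values (-31..31, two's complement in 6 bits) so that
    4 values occupy 3 bytes. `len(q)` is assumed padded to a multiple of 4."""
    u = (q.astype(np.int16) & 0x3F).astype(np.uint16)
    u = u.reshape(-1, 4)
    b0 = (u[:, 0] | (u[:, 1] << 6)) & 0xFF
    b1 = ((u[:, 1] >> 2) | (u[:, 2] << 4)) & 0xFF
    b2 = ((u[:, 2] >> 4) | (u[:, 3] << 2)) & 0xFF
    packed = np.stack([b0, b1, b2], axis=1).astype(np.uint8).ravel().tobytes()
    # Artifact step would then be e.g.:
    #   zstandard.ZstdCompressor().compress(packed)   # third-party package
    return packed

def unpack_int6(data, n):
    """Inverse of pack_int6: recover the first `n` int6 values."""
    b = np.frombuffer(data, dtype=np.uint8).astype(np.uint16).reshape(-1, 3)
    u = np.empty((b.shape[0], 4), dtype=np.uint16)
    u[:, 0] = b[:, 0] & 0x3F
    u[:, 1] = ((b[:, 0] >> 6) | (b[:, 1] << 2)) & 0x3F
    u[:, 2] = ((b[:, 1] >> 4) | (b[:, 2] << 4)) & 0x3F
    u[:, 3] = (b[:, 2] >> 2) & 0x3F
    signed = u.astype(np.int16)
    signed[signed >= 32] -= 64
    return signed.ravel()[:n].astype(np.int8)
```

Packing gets the raw payload to 6 bits per weight; zstd then squeezes out whatever statistical redundancy remains in the codes.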
Evaluation
sliding window eval
parameters: {"stride":64}
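Sliding-window evaluation with stride 64 usually means re-scoring the sequence in overlapping windows so that each token is evaluated with long left context but counted exactly once. A sketch of the span bookkeeping; the window size is an assumed example:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Spans for sliding-window eval: windows advance by `stride`, and
    each span scores only tokens not covered by a previous window, so
    every token is scored once with up to `window - stride` tokens of
    left context. Returns (context_start, end, n_scored) tuples."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give more context per scored token at the cost of proportionally more forward passes.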
LR Schedule
warmdown
parameters: {"warmdown_iters":400,"momentum_warmup_steps":200}
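The LR schedule with warmdown_iters=400 and a 200-step momentum warmup can be sketched as below. The constant-then-linear-decay LR shape and the 0.85 -> 0.95 momentum endpoints are assumptions, not values from the PR:

```python
def lr_scale(step, total_iters, warmdown_iters=400):
    """LR multiplier: constant 1.0, then linear warmdown to 0 over the
    final `warmdown_iters` steps (no warmup at the start)."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

def muon_momentum(step, warmup_steps=200, start=0.85, end=0.95):
    """Linear momentum warmup over the first `warmup_steps` steps;
    the endpoints are illustrative assumptions."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```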
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • Reproduction of PR #315 recipe on a single H100 PCIe GPU
  • Adaptation of the training schedule for 1 GPU with warmdown and momentum warmup
  • Use of Flash Attention 2 instead of Flash Attention 3
  • int6 quantization with zstd compression to fit within the artifact size limit
  • QAT enabled only late in training to fit within the constrained training budget