PR #356

open

Non-record: PR315 repro on 1xH100 PCIe, int6+zstd (val_bpb=1.8338)

val_bpb: 1.8338
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.0 MB

Training Techniques

Architecture
XSA
Applies XSA to the last layers of the model.
parameters: {"layers":4}
Partial RoPE
Applies rotary positional embeddings to only a subset of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
LN Scale
Applies layer norm scaling.
parameters: null
BigramHash
Uses a bigram hashing vocabulary mechanism.
parameters: {"vocab_size":2048}
SmearGate
Uses SmearGate gating in the architecture.
parameters: null
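The Partial RoPE entry above (rotating 16 of 64 head dimensions) can be sketched as follows. This is a minimal NumPy illustration of the idea, not the PR's implementation; the pairing of dimensions and the frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of the head dimension; the
    remaining dims pass through position-independent. Dimensions are
    paired as (i, i + rope_dims // 2), an illustrative choice."""
    half = rope_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    ang = np.outer(pos, inv_freq)                  # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The remaining 48 dimensions carry no positional signal, which is the point of the technique: only part of each head is position-aware.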
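The BigramHash entry is described only as a bigram hashing vocabulary mechanism with vocab_size 2048. A minimal sketch of hashing token bigrams into a fixed-size auxiliary vocabulary might look like this; the hash constant, the BOS convention, and how the bucket ids are consumed are all illustrative assumptions, not details from the PR.

```python
def bigram_hash(prev_tok, tok, vocab_size=2048):
    """Map a (previous, current) token-id pair to one of `vocab_size`
    buckets; the multiplier is an arbitrary illustrative prime."""
    return ((prev_tok * 1000003) ^ tok) % vocab_size

def bigram_bucket_ids(token_ids, vocab_size=2048):
    """Bucket id per position based on its preceding bigram
    (position 0 pairs with an assumed BOS id of 0)."""
    prev = [0] + list(token_ids[:-1])
    return [bigram_hash(p, t, vocab_size) for p, t in zip(prev, token_ids)]
```

The bucket ids would typically index a small auxiliary embedding table added to the token embeddings.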
Weight Averaging
EMA
parameters: {"decay":0.997}
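The EMA entry with decay 0.997 is the standard exponential moving average of weights, ema <- decay * ema + (1 - decay) * param. A minimal sketch over a dict of parameters:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over named parameters. With decay 0.997 the
    average has an effective horizon of roughly
    1 / (1 - 0.997) ~= 333 steps."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```

At evaluation time the EMA copy is typically used in place of the raw weights.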
Quantization
QAT
bits: 6
scope: all
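QAT at 6 bits typically means fake-quantizing weights in the forward pass while training, so the network adapts to the rounding. A minimal symmetric per-tensor sketch; the scaling scheme is an assumption and the PR may quantize differently:

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor fake quantization to 6 bits.
    Forward: snap weights to the 63-level grid {-31..31} * scale.
    In QAT the backward pass would route gradients straight through
    the rounding (straight-through estimator)."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, q.astype(np.int8), scale
```

The integer codes and the scale are what get written to the artifact; the dequantized copy is what the forward pass sees during training.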
Compression
zstd
level: null
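Storing int6 weights compactly requires bit-packing before zstd compression. A sketch of packing four 6-bit values into three bytes; the bit layout is an assumption, and the final zstd call is shown only as a comment since it needs the third-party zstandard package:

```python
import numpy as np

def pack_int6(q):
    """Pack int6 values (-31..31, two's complement in 6 bits) so that
    4 values occupy 3 bytes. `len(q)` is assumed padded to a multiple of 4."""
    u = (q.astype(np.int16) & 0x3F).astype(np.uint16)
    u = u.reshape(-1, 4)
    b0 = (u[:, 0] | (u[:, 1] << 6)) & 0xFF
    b1 = ((u[:, 1] >> 2) | (u[:, 2] << 4)) & 0xFF
    b2 = ((u[:, 2] >> 4) | (u[:, 3] << 2)) & 0xFF
    packed = np.stack([b0, b1, b2], axis=1).astype(np.uint8).ravel().tobytes()
    # Artifact step would then be e.g.:
    #   zstandard.ZstdCompressor().compress(packed)   # third-party package
    return packed

def unpack_int6(data, n):
    """Inverse of pack_int6: recover the first `n` int6 values."""
    b = np.frombuffer(data, dtype=np.uint8).astype(np.uint16).reshape(-1, 3)
    u = np.empty((b.shape[0], 4), dtype=np.uint16)
    u[:, 0] = b[:, 0] & 0x3F
    u[:, 1] = ((b[:, 0] >> 6) | (b[:, 1] << 2)) & 0x3F
    u[:, 2] = ((b[:, 1] >> 4) | (b[:, 2] << 4)) & 0x3F
    u[:, 3] = (b[:, 2] >> 2) & 0x3F
    signed = u.astype(np.int16)
    signed[signed >= 32] -= 64
    return signed.ravel()[:n].astype(np.int8)
```

Packing gets the raw payload to 6 bits per weight; zstd then squeezes out whatever statistical redundancy remains in the codes.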
Evaluation
sliding window eval
parameters: {"stride":64}
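Sliding-window evaluation with stride 64 usually means re-scoring the sequence in overlapping windows so that each token is evaluated with long left context but counted exactly once. A sketch of the span bookkeeping; the window size is an assumed example:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Spans for sliding-window eval: windows advance by `stride`, and
    each span scores only tokens not covered by a previous window, so
    every token is scored once with up to `window - stride` tokens of
    left context. Returns (context_start, end, n_scored) tuples."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give more context per scored token at the cost of proportionally more forward passes.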
LR Schedule
warmdown
parameters: {"warmdown_iters":400,"momentum_warmup_steps":200}
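The LR schedule with warmdown_iters=400 and a 200-step momentum warmup can be sketched as below. The constant-then-linear-decay LR shape and the 0.85 -> 0.95 momentum endpoints are assumptions, not values from the PR:

```python
def lr_scale(step, total_iters, warmdown_iters=400):
    """LR multiplier: constant 1.0, then linear warmdown to 0 over the
    final `warmdown_iters` steps (no warmup at the start)."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

def muon_momentum(step, warmup_steps=200, start=0.85, end=0.95):
    """Linear momentum warmup over the first `warmup_steps` steps;
    the endpoints are illustrative assumptions."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```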
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • Reproduction of PR #315 recipe on a single H100 PCIe GPU
  • Adaptation of the training schedule for 1 GPU with warmdown and momentum warmup
  • Use of Flash Attention 2 instead of Flash Attention 3
  • int6 quantization with zstd compression to fit within the artifact size limit
  • QAT enabled only late in training to fit within the constrained training budget