| Metric | Value |
| --- | --- |
| val_bpb | 1.3274 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 13.8 MB |
## Training Techniques

### Quantization
- int8 (bits: 8, scope: model weights)
- QAT (bits: 5, scope: MLP layers)
- QAT (bits: 6, scope: attention layers)
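The QAT entries use low-bit fake quantization: int5 for MLP weights and int6 for attention weights. A minimal sketch of symmetric per-tensor quantize-dequantize, where the function name and the peak-based scaling scheme are assumptions (the report does not specify them):

```python
def fake_quantize(weights, bits):
    """Symmetric per-tensor fake quantization: round weights onto a
    low-bit integer grid, then dequantize back to floats.
    Sketch only; the leaderboard code's actual scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 15 for int5, 31 for int6
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(w / scale) * scale for w in weights]

# int5 grid for MLP weights, int6 grid for attention weights
mlp_fq = fake_quantize([0.8, -0.31, 0.02], bits=5)
attn_fq = fake_quantize([0.8, -0.31, 0.02], bits=6)
```

The quantization error per weight is bounded by half a grid step (`scale / 2`), which is why the int6 grid tracks the original weights more closely than the int5 grid.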
### Compression
- zlib (level: not recorded)

### Evaluation
- int8+zlib roundtrip evaluation (parameters: not recorded)
- sliding window eval (parameters: not recorded)
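A sketch of what an int8+zlib roundtrip measures: quantize the float weights to int8 bytes, compress with zlib, then invert both steps and check the reconstruction error. The symmetric scaling and the default zlib level are assumptions, since the report records neither:

```python
import zlib

def int8_zlib_roundtrip(weights):
    """Quantize floats to int8, compress with zlib, invert both steps,
    and return (artifact_bytes, max_abs_error). Sketch only; the real
    evaluation pipeline's parameters are not recorded."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127 if peak > 0 else 1.0
    q = bytes(max(-127, min(127, round(w / scale))) & 0xFF for w in weights)
    blob = zlib.compress(q)  # level unspecified in the report; zlib default
    dq = [(b - 256 if b > 127 else b) * scale for b in zlib.decompress(blob)]
    err = max(abs(a, ) if False else abs(a - b) for a, b in zip(weights, dq))
    return len(blob), err

size, err = int8_zlib_roundtrip([0.5, -0.25, 0.125, 0.0])
```

The int8 roundtrip error is bounded by half a quantization step; only the compression stage is lossless.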
### Architecture
- BigramHash: existing hash-based architectural component used in the model; referenced as part of the baseline and extended with a trigram hash in later experiments. (parameters: not recorded)
- TrigramHash: adds a trigram hash table alongside the existing BigramHash. (parameters: `{"buckets": 4096, "dim": 32}`)
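The TrigramHash extension can be sketched as a hashed embedding lookup: each trigram of token ids is hashed into one of 4096 buckets, each holding a 32-dim vector. The bucket count and dimension come from the recorded parameters; the hash function and the summation over trigrams are illustrative assumptions:

```python
import random

BUCKETS, DIM = 4096, 32   # from the recorded parameters

random.seed(0)
# Hashed embedding table: one 32-dim vector per bucket. Randomly
# initialized here; learned in the actual model.
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def trigram_bucket(a, b, c):
    """Hash a trigram of token ids into a bucket index.
    The mixing constants are illustrative, not the model's actual hash."""
    return (a * 1000003 + b * 8191 + c) % BUCKETS

def trigram_embedding(tokens):
    """Sum the bucket vectors of every trigram in the token sequence."""
    out = [0.0] * DIM
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        vec = table[trigram_bucket(a, b, c)]
        out = [o + v for o, v in zip(out, vec)]
    return out

emb = trigram_embedding([5, 17, 42, 99])
```

Hashing trades exactness for memory: collisions share a vector, but the table stays fixed at 4096 × 32 parameters regardless of vocabulary size.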
### Other
- Straight-Through Estimator fake quantization applied in `CastedLinear.forward()` during QAT. (parameters: `{"formula": "w + (w_quantized - w).detach()"}`)
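The recorded formula `w + (w_quantized - w).detach()` makes the forward pass emit the quantized weight while gradients flow only through the identity term `w`. A framework-free sketch, treating "detach" as "constant under differentiation" (the real code uses PyTorch's `.detach()` inside `CastedLinear.forward()`; the unit-scale quantizer below is an assumption):

```python
def quantize(w, bits=5):
    """Round w onto a low-bit grid (unit scale, for simplicity)."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax, min(qmax, round(w)))

def ste_forward(w, bits=5):
    """Forward value of w + detach(quantize(w) - w).
    Numerically this equals quantize(w); since the detached term is a
    constant to autograd, d(output)/dw = 1, so gradients pass straight
    through the non-differentiable rounding."""
    detached = quantize(w, bits) - w    # no gradient flows through this
    return w + detached                  # == quantize(w, bits)

out = ste_forward(3.7)
```

This is the standard STE trick: the rounding has zero gradient almost everywhere, so without the identity passthrough the quantized weights could not be trained.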
### Test-Time Training
- LoRA TTT (parameters: `{"rank": 8, "targets": ["Q projections", "V projections", "LM head"], "layers": 10}`)
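LoRA adapts each targeted matrix W with a low-rank update W + B·A, training only A and B at test time. A sketch of the parameter-count arithmetic for rank 8 over the Q projections, V projections, and LM head across 10 layers; the hidden size 768 and vocab size 50257 are illustrative assumptions, not from the report:

```python
RANK = 8        # from the recorded parameters
N_LAYERS = 10   # adapted layers, from the recorded parameters
D = 768         # hidden size: illustrative assumption
VOCAB = 50257   # vocab size: illustrative assumption

def lora_params(d_in, d_out, rank):
    """Trainable params for one adapted matrix:
    A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank

per_layer = lora_params(D, D, RANK) * 2            # Q and V projections
lm_head = lora_params(D, VOCAB, RANK)
total = per_layer * N_LAYERS + lm_head             # LoRA trainables

full = D * D * 2 * N_LAYERS + D * VOCAB            # updating W directly
```

Under these assumptions the adapters train roughly 1–2% of the parameters that full test-time fine-tuning of the same matrices would touch.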
### Weight Averaging
- SWA (parameters: not recorded)
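SWA keeps a running average of weights sampled along the training trajectory. A minimal sketch of the running-mean update; the checkpoint cadence and the step at which averaging starts are not recorded, so they are left out:

```python
def swa_update(avg, weights, n_averaged):
    """Fold one more checkpoint into the running average:
    avg_new = (avg * n + w) / (n + 1), elementwise."""
    return [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg, weights)]

checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the elementwise mean of the three checkpoints
```

The incremental form avoids storing all checkpoints: only the current average and a counter are kept.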
### Regularization
- Magnitude pruning (parameters: not recorded)
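Magnitude pruning zeroes the weights with the smallest absolute values. A sketch with a 50% sparsity target; the actual sparsity level used is not recorded:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute magnitude (ties at the threshold are also zeroed)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
```

Zeroed weights also compress better, which interacts well with the zlib artifact-size constraint above.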
## Novel Contributions
- Baseline 1x H100 training report with 1.3274 BPB under a 600s wallclock cap
- End-to-end RunPod and runpodctl process guide for training and evaluation
- QAT experiments with int5/int6 fake quantization on top of the leaderboard architecture
- Trigram hash extension to the existing bigram hash mechanism
- Implemented but untested LoRA-based test-time training pipeline
- Documented next-step ideas including QAT, 3x MLP, SwiGLU gating, and bigram hash improvements