| Metric | Value |
| --- | --- |
| val_bpb | 1.3274 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 13.8 MB |
## Training Techniques

### Quantization
- int8 (bits: 8, scope: model weights)
- QAT (bits: 5, scope: MLP layers)
- QAT (bits: 6, scope: attention layers)
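The QAT entries use low-bit fake quantization: int5 for MLP weights and int6 for attention weights. A minimal sketch of symmetric per-tensor quantize-dequantize, where the function name and the peak-based scaling scheme are assumptions (the report does not specify them):

```python
def fake_quantize(weights, bits):
    """Symmetric per-tensor fake quantization: round weights onto a
    low-bit integer grid, then dequantize back to floats.
    Sketch only; the leaderboard code's actual scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 15 for int5, 31 for int6
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(w / scale) * scale for w in weights]

# int5 grid for MLP weights, int6 grid for attention weights
mlp_fq = fake_quantize([0.8, -0.31, 0.02], bits=5)
attn_fq = fake_quantize([0.8, -0.31, 0.02], bits=6)
```

The quantization error per weight is bounded by half a grid step (`scale / 2`), which is why the int6 grid tracks the original weights more closely than the int5 grid.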
### Compression
- zlib (level: not recorded)

### Evaluation
- int8+zlib roundtrip evaluation (parameters: not recorded)
- sliding window eval (parameters: not recorded)
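A sketch of what an int8+zlib roundtrip measures: quantize the float weights to int8 bytes, compress with zlib, then invert both steps and check the reconstruction error. The symmetric scaling and the default zlib level are assumptions, since the report records neither:

```python
import zlib

def int8_zlib_roundtrip(weights):
    """Quantize floats to int8, compress with zlib, invert both steps,
    and return (artifact_bytes, max_abs_error). Sketch only; the real
    evaluation pipeline's parameters are not recorded."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127 if peak > 0 else 1.0
    q = bytes(max(-127, min(127, round(w / scale))) & 0xFF for w in weights)
    blob = zlib.compress(q)  # level unspecified in the report; zlib default
    dq = [(b - 256 if b > 127 else b) * scale for b in zlib.decompress(blob)]
    err = max(abs(a, ) if False else abs(a - b) for a, b in zip(weights, dq))
    return len(blob), err

size, err = int8_zlib_roundtrip([0.5, -0.25, 0.125, 0.0])
```

The int8 roundtrip error is bounded by half a quantization step; only the compression stage is lossless.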
### Architecture
- BigramHash: existing hash-based architectural component used in the model; referenced as part of the baseline and extended with a trigram hash in later experiments. (parameters: not recorded)
- TrigramHash: adds a trigram hash table alongside the existing BigramHash. (parameters: `{"buckets": 4096, "dim": 32}`)
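The TrigramHash extension can be sketched as a hashed embedding lookup: each trigram of token ids is hashed into one of 4096 buckets, each holding a 32-dim vector. The bucket count and dimension come from the recorded parameters; the hash function and the summation over trigrams are illustrative assumptions:

```python
import random

BUCKETS, DIM = 4096, 32   # from the recorded parameters

random.seed(0)
# Hashed embedding table: one 32-dim vector per bucket. Randomly
# initialized here; learned in the actual model.
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def trigram_bucket(a, b, c):
    """Hash a trigram of token ids into a bucket index.
    The mixing constants are illustrative, not the model's actual hash."""
    return (a * 1000003 + b * 8191 + c) % BUCKETS

def trigram_embedding(tokens):
    """Sum the bucket vectors of every trigram in the token sequence."""
    out = [0.0] * DIM
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        vec = table[trigram_bucket(a, b, c)]
        out = [o + v for o, v in zip(out, vec)]
    return out

emb = trigram_embedding([5, 17, 42, 99])
```

Hashing trades exactness for memory: collisions share a vector, but the table stays fixed at 4096 × 32 parameters regardless of vocabulary size.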
### Other
- Straight-Through Estimator fake quantization applied in `CastedLinear.forward()` during QAT. (parameters: `{"formula": "w + (w_quantized - w).detach()"}`)
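The recorded formula `w + (w_quantized - w).detach()` makes the forward pass emit the quantized weight while gradients flow only through the identity term `w`. A framework-free sketch, treating "detach" as "constant under differentiation" (the real code uses PyTorch's `.detach()` inside `CastedLinear.forward()`; the unit-scale quantizer below is an assumption):

```python
def quantize(w, bits=5):
    """Round w onto a low-bit grid (unit scale, for simplicity)."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax, min(qmax, round(w)))

def ste_forward(w, bits=5):
    """Forward value of w + detach(quantize(w) - w).
    Numerically this equals quantize(w); since the detached term is a
    constant to autograd, d(output)/dw = 1, so gradients pass straight
    through the non-differentiable rounding."""
    detached = quantize(w, bits) - w    # no gradient flows through this
    return w + detached                  # == quantize(w, bits)

out = ste_forward(3.7)
```

This is the standard STE trick: the rounding has zero gradient almost everywhere, so without the identity passthrough the quantized weights could not be trained.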
### Test-Time Training
- LoRA TTT (parameters: `{"rank": 8, "targets": ["Q projections", "V projections", "LM head"], "layers": 10}`)
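LoRA adapts each targeted matrix W with a low-rank update W + B·A, training only A and B at test time. A sketch of the parameter-count arithmetic for rank 8 over the Q projections, V projections, and LM head across 10 layers; the hidden size 768 and vocab size 50257 are illustrative assumptions, not from the report:

```python
RANK = 8        # from the recorded parameters
N_LAYERS = 10   # adapted layers, from the recorded parameters
D = 768         # hidden size: illustrative assumption
VOCAB = 50257   # vocab size: illustrative assumption

def lora_params(d_in, d_out, rank):
    """Trainable params for one adapted matrix:
    A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank

per_layer = lora_params(D, D, RANK) * 2            # Q and V projections
lm_head = lora_params(D, VOCAB, RANK)
total = per_layer * N_LAYERS + lm_head             # LoRA trainables

full = D * D * 2 * N_LAYERS + D * VOCAB            # updating W directly
```

Under these assumptions the adapters train roughly 1–2% of the parameters that full test-time fine-tuning of the same matrices would touch.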
### Weight Averaging
- SWA (parameters: not recorded)
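SWA keeps a running average of weights sampled along the training trajectory. A minimal sketch of the running-mean update; the checkpoint cadence and the step at which averaging starts are not recorded, so they are left out:

```python
def swa_update(avg, weights, n_averaged):
    """Fold one more checkpoint into the running average:
    avg_new = (avg * n + w) / (n + 1), elementwise."""
    return [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg, weights)]

checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the elementwise mean of the three checkpoints
```

The incremental form avoids storing all checkpoints: only the current average and a counter are kept.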
### Regularization
- Magnitude pruning (parameters: not recorded)
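Magnitude pruning zeroes the weights with the smallest absolute values. A sketch with a 50% sparsity target; the actual sparsity level used is not recorded:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute magnitude (ties at the threshold are also zeroed)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
```

Zeroed weights also compress better, which interacts well with the zlib artifact-size constraint above.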
## Novel Contributions
- Baseline 1x H100 training report with 1.3274 BPB under a 600s wallclock cap
- End-to-end RunPod and runpodctl process guide for training and evaluation
- QAT experiments with int5/int6 fake quantization on top of the leaderboard architecture
- Trigram hash extension to the existing bigram hash mechanism
- Implemented but untested LoRA-based test-time training pipeline
- Documented next-step ideas including QAT, 3x MLP, SwiGLU gating, and bigram hash improvements