PR #316 (open)
Non-record: 12L Low-Rank Q + QAT (1xH100, pre-quant 1.2035)
by SkywardSyntaxView on GitHub
val_bpb: 1.2035
Architecture: 12-layer Transformer
Optimizer: Muon
Artifact Size: 15.2 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP expansion in the transformer blocks.
parameters: null
SmearGate
Inherited gating modification from prior SOTA records.
parameters: null
BigramHash
Inherited bigram-based hashing component from prior SOTA records.
parameters: null
Low-Rank Q
Factorizes Q as dim→128→dim to reduce parameters and speed up training.
parameters: {"rank":128}
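The parameter arithmetic behind the dim→128→dim factorization can be sketched as follows (a minimal illustration; the model width of 768 is an assumption, only rank=128 comes from the PR):

```python
import numpy as np

dim, rank = 768, 128   # dim is an assumed model width; rank=128 is from the PR

# Full-rank Q projection: one dim x dim matrix
full_params = dim * dim

# Low-rank Q: factor into dim -> rank and rank -> dim
W_down = np.random.randn(dim, rank) / np.sqrt(dim)
W_up = np.random.randn(rank, dim) / np.sqrt(rank)
lowrank_params = dim * rank * 2

x = np.random.randn(32, dim)      # a batch of token activations
q = (x @ W_down) @ W_up           # factored Q projection, same output shape
```

With these numbers the factored projection uses 196,608 parameters versus 589,824 for the full matrix, i.e. a 3x reduction per Q projection, which is where the savings funding the 12th layer come from.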
12 layers
Increases transformer depth from 10 to 12 layers using savings from Low-Rank Q.
parameters: {"layers":12}
Quantization
QAT
bits: 7
scope: all
int6
bits: 6
scope: all
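A minimal fake-quantization forward pass in the spirit of the int7 QAT above (the symmetric per-tensor scaling scheme is an assumption; during training the straight-through estimator passes gradients through the rounding step as if it were the identity):

```python
import numpy as np

def fake_quant(w, bits=7):
    """Symmetric per-tensor fake quantization for QAT.

    Forward pass: quantize then dequantize. In the backward pass (not shown),
    the straight-through estimator (STE) treats round() as identity so
    gradients flow through to the full-precision weights.
    """
    qmax = 2 ** (bits - 1) - 1                    # 63 for int7
    scale = np.abs(w).max() / qmax + 1e-12        # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

Training against the quantized forward pass is what shrinks the pre-quant/post-quant gap reported in the title.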
Evaluation
sliding window eval
parameters: {"stride":64}
stride-based eval
parameters: {"stride":1024}
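One way to read the two eval entries above: the model is scored over a long sequence in overlapping windows, with only the last `stride` tokens of each window contributing to the loss, so a smaller stride gives every token more left context at proportionally higher cost. A hypothetical sketch of the span bookkeeping (the window size of 1024 is an assumption):

```python
def eval_spans(seq_len, window=1024, stride=64):
    # Returns (ctx_start, target_start, target_end) triples. Each token is
    # scored exactly once, with up to `window` tokens of left context;
    # shrinking `stride` increases context per token but costs more passes.
    spans, prev_end = [], 0
    while prev_end < seq_len:
        end = min(prev_end + stride, seq_len)
        spans.append((max(0, end - window), prev_end, end))
        prev_end = end
    return spans
```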
Other
other
FTLE-guided per-row precision allocation was tested as a quantization strategy but yielded a negative result: uniform quantization performed better.
parameters: null
other
Stride-OGD evaluation-time vocabulary bias optimization was implemented but proved too slow in its current form.
parameters: null
Initialization
overtone spectral init
Spectral initialization inherited from prior SOTA records.
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
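The core of Muon is orthogonalizing the momentum-smoothed gradient of each weight matrix via a Newton-Schulz iteration. A minimal sketch of that step (coefficients from the public Muon reference implementation; illustrative only, not this record's code):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Odd quintic polynomial iteration that pushes the singular values of g
    # toward 1, approximating the U @ V.T factor of its SVD. The coefficients
    # are the tuned values from the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so singular values are < 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x
```

The orthogonalized matrix replaces the raw momentum buffer as the update direction; weight decay (0.04 here) is applied separately.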
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: null
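"Warmdown" here presumably means holding the learning rate constant and then decaying it linearly to zero over the final stretch of training; a hypothetical sketch (the decay fraction is an assumption, not stated in the PR):

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.3):
    # Constant LR, then a linear "warmdown" to zero over the final
    # warmdown_frac of training. base_lr and warmdown_frac are illustrative.
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```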
Novel Contributions
- Low-Rank Q factorization (r=128) to reduce Q parameters and speed up training
- Adding a 12th transformer layer using the compute savings from Low-Rank Q
- Quantization-aware training with STE for int7 to reduce the pre-quant/post-quant gap
- FTLE-guided per-row precision exploration, with a clear negative result: uniform quantization performed better
- Stride-OGD evaluation-time vocabulary bias optimization
- Cross-hardware research pipeline spanning Apple Silicon prototyping, A100 validation, and H100 refinement