PR #316 (open)
Non-record: 12L Low-Rank Q + QAT (1xH100, pre-quant 1.2035)
by SkywardSyntaxView on GitHub
val_bpb: 1.2035
Architecture: 12-layer Transformer
Optimizer: Muon
Artifact Size: 15.2 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP expansion in the transformer blocks.
parameters: null
SmearGate
Inherited gating modification from prior SOTA records.
parameters: null
BigramHash
Inherited bigram-based hashing component from prior SOTA records.
parameters: null
Low-Rank Q
Factorizes Q as dim→128→dim to reduce parameters and speed up training.
parameters: {"rank":128}
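The parameter arithmetic behind the dim→128→dim factorization can be sketched as follows (a minimal illustration; the model width of 768 is an assumption, only rank=128 comes from the PR):

```python
import numpy as np

dim, rank = 768, 128   # dim is an assumed model width; rank=128 is from the PR

# Full-rank Q projection: one dim x dim matrix
full_params = dim * dim

# Low-rank Q: factor into dim -> rank and rank -> dim
W_down = np.random.randn(dim, rank) / np.sqrt(dim)
W_up = np.random.randn(rank, dim) / np.sqrt(rank)
lowrank_params = dim * rank * 2

x = np.random.randn(32, dim)      # a batch of token activations
q = (x @ W_down) @ W_up           # factored Q projection, same output shape
```

With these numbers the factored projection uses 196,608 parameters versus 589,824 for the full matrix, i.e. a 3x reduction per Q projection, which is where the savings funding the 12th layer come from.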
12 layers
Increases transformer depth from 10 to 12 layers using savings from Low-Rank Q.
parameters: {"layers":12}
Quantization
QAT
bits: 7
scope: all
int6
bits: 6
scope: all
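A minimal fake-quantization forward pass in the spirit of the int7 QAT above (the symmetric per-tensor scaling scheme is an assumption; during training the straight-through estimator passes gradients through the rounding step as if it were the identity):

```python
import numpy as np

def fake_quant(w, bits=7):
    """Symmetric per-tensor fake quantization for QAT.

    Forward pass: quantize then dequantize. In the backward pass (not shown),
    the straight-through estimator (STE) treats round() as identity so
    gradients flow through to the full-precision weights.
    """
    qmax = 2 ** (bits - 1) - 1                    # 63 for int7
    scale = np.abs(w).max() / qmax + 1e-12        # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

Training against the quantized forward pass is what shrinks the pre-quant/post-quant gap reported in the title.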
Evaluation
sliding window eval
parameters: {"stride":64}
stride-based eval
parameters: {"stride":1024}
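One way to read the two eval entries above: the model is scored over a long sequence in overlapping windows, with only the last `stride` tokens of each window contributing to the loss, so a smaller stride gives every token more left context at proportionally higher cost. A hypothetical sketch of the span bookkeeping (the window size of 1024 is an assumption):

```python
def eval_spans(seq_len, window=1024, stride=64):
    # Returns (ctx_start, target_start, target_end) triples. Each token is
    # scored exactly once, with up to `window` tokens of left context;
    # shrinking `stride` increases context per token but costs more passes.
    spans, prev_end = [], 0
    while prev_end < seq_len:
        end = min(prev_end + stride, seq_len)
        spans.append((max(0, end - window), prev_end, end))
        prev_end = end
    return spans
```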
Other
other
FTLE-guided per-row precision allocation was tested as a quantization strategy but yielded a negative result: uniform quantization performed better.
parameters: null
other
Stride-OGD evaluation-time vocabulary bias optimization was implemented but proved too slow in its current form.
parameters: null
Initialization
overtone spectral init
Spectral initialization inherited from prior SOTA records.
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
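The core of Muon is orthogonalizing the momentum-smoothed gradient of each weight matrix via a Newton-Schulz iteration. A minimal sketch of that step (coefficients from the public Muon reference implementation; illustrative only, not this record's code):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Odd quintic polynomial iteration that pushes the singular values of g
    # toward 1, approximating the U @ V.T factor of its SVD. The coefficients
    # are the tuned values from the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so singular values are < 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x
```

The orthogonalized matrix replaces the raw momentum buffer as the update direction; weight decay (0.04 here) is applied separately.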
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: null
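"Warmdown" here presumably means holding the learning rate constant and then decaying it linearly to zero over the final stretch of training; a hypothetical sketch (the decay fraction is an assumption, not stated in the PR):

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.3):
    # Constant LR, then a linear "warmdown" to zero over the final
    # warmdown_frac of training. base_lr and warmdown_frac are illustrative.
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```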
Novel Contributions
- Low-Rank Q factorization (r=128) to reduce Q parameters and speed up training
- Adding a 12th transformer layer using the compute savings from Low-Rank Q
- Quantization-aware training with STE for int7 to reduce the pre-quant/post-quant gap
- FTLE-guided per-row precision exploration, with a clear negative result: uniform quantization performed better
- Stride-OGD evaluation-time vocabulary bias optimization
- Cross-hardware research pipeline spanning Apple Silicon prototyping, A100 validation, and H100 refinement