val_bpb: 1.1650
Architecture: Transformer
Optimizer: —
Artifact Size: 15.97 MB
Training Techniques
Architecture
BigramHash
Bigram hash table used in the model, quantized and trained with INT4 bigram QAT (bQAT).
parameters: {"buckets":10240}
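A minimal sketch of a hashed bigram embedding lookup, assuming the configured 10240 buckets. The hash mixing constant and the embedding width are hypothetical, not from the source.

```python
import numpy as np

BUCKETS = 10240   # from the listed parameters
EMB_DIM = 64      # hypothetical embedding width

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, EMB_DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix the two token ids into one bucket index (hypothetical hash).
    return ((prev_tok * 1000003) ^ cur_tok) % BUCKETS

def bigram_embed(tokens):
    # One hashed-bigram embedding per position; position 0 pairs with id 0.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]
```

Hashing collapses the full vocab-squared bigram space into a fixed table, which is what makes the table small enough to quantize aggressively.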
MLP3x
Three-layer MLP with squared LeakyReLU activation.
parameters: null
LeakyReLU
LeakyReLU(0.5) squared activation used in the MLP.
parameters: {"slope":0.5}
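One plausible reading of the squared LeakyReLU(0.5) activation in the MLP: apply LeakyReLU with slope 0.5, then square elementwise. The sign handling after squaring and the layer widths are assumptions.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with slope 0.5, then elementwise square (assumed reading).
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3(x, w1, w2, w3):
    # Three linear layers with the squared activation between them.
    h = leaky_relu_sq(x @ w1)
    h = leaky_relu_sq(h @ w2)
    return h @ w3
```

Squared ReLU-family activations are a common speedrun choice; the 0.5 slope keeps some gradient on the negative side before squaring.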
XSA
Cross-layer shared attention applied to the last 4 layers.
parameters: {"last_n_layers":4}
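A sketch of cross-layer shared attention read as parameter sharing: the last 4 layers reuse one attention module. The sharing granularity is an assumption; XSA could instead share attention maps rather than weights.

```python
def build_attn_layers(n_layers, make_attn, last_n_shared=4):
    # The final `last_n_shared` layers all point at one shared module;
    # earlier layers each get their own.
    shared = make_attn()
    return [shared if i >= n_layers - last_n_shared else make_attn()
            for i in range(n_layers)]
```

Sharing the late layers trades a little capacity for a meaningful cut in parameter count, which matters under a hard artifact-size budget.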
RoPE
Partial rotary positional embedding.
parameters: {"dimensions":16,"total_dimensions":64}
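A sketch of partial RoPE for one 64-dimensional head, rotating only the first 16 dimensions as configured. The frequency base of 10000 is the usual RoPE convention, assumed here.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims dimensions by position-dependent angles;
    # the remaining dimensions pass through unchanged.
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])
```

Leaving 48 of 64 dimensions unrotated gives the model position-free channels alongside the positional ones.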
U-Net skip connections
U-Net style skip connections in the residual stream.
parameters: null
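A sketch of U-Net style skips in the residual stream: states from the first half of the stack are saved and added back at mirrored layers in the second half. The exact pairing scheme is an assumption.

```python
def unet_forward(x, layers):
    # First half: push residual-stream states; second half: pop and add.
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            skips.append(x)
        elif skips:
            x = x + skips.pop()
        x = layer(x)
    return x
```

The popped state from layer i reaches layer n-1-i, giving deep layers a direct path back to shallow representations.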
resid_mix
Learnable blend of the current residual stream x with the initial embedding x0, always active.
parameters: null
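A sketch of the x/x0 blend, assuming x0 is the initial embedding stream and the mixing weight is a learned scalar.

```python
def resid_mix(x, x0, lam):
    # lam is a learnable blending weight: lam=1 keeps the current residual
    # stream, lam=0 resets it to the initial embedding x0.
    return lam * x + (1.0 - lam) * x0
```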
Regularization
LN scale
Depth-dependent scaling of the LayerNorm output.
parameters: {"formula":"1/sqrt(layer+1)"}
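A sketch of the LN scale rule: normalize, then multiply by 1/sqrt(layer+1) per the listed formula. The base normalization being standard LayerNorm is an assumption.

```python
import numpy as np

def scaled_layernorm(x, layer, eps=1e-5):
    # Standard LayerNorm followed by the depth-dependent 1/sqrt(layer+1) scale.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mu) / np.sqrt(var + eps)
    return y / np.sqrt(layer + 1)
```

Deeper layers contribute progressively smaller updates to the residual stream, which acts as an implicit regularizer.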
Weight Averaging
EMA
parameters: {"decay":0.997,"qat_activation_reset":true}
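A sketch of EMA weight averaging with the qat_activation_reset behavior: the shadow weights are re-seeded from the live weights when QAT switches on, so the pre-QAT average does not leak into the quantized phase.

```python
import numpy as np

class EMA:
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # Standard exponential moving average of the live weights.
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

    def reset(self, params):
        # qat_activation_reset: drop the pre-QAT average entirely and
        # restart from the current (soon-to-be-quantized) weights.
        self.shadow = {k: v.copy() for k, v in params.items()}
```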
Quantization
QAT
bits: 4
scope: MLP and bigram; INT6 attention
late QAT
bits: 4
scope: late phase of training
INT4
bits: 4
scope: BigramHash
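A sketch of the INT4 fake quantization underlying these QAT entries: weights are snapped to a 4-bit grid in the forward pass. Per-tensor symmetric scaling is an assumption; real QAT pairs this with a straight-through gradient estimator.

```python
import numpy as np

def fake_quant(w, bits=4):
    # Symmetric per-tensor fake quantization: round to a 2**bits-level grid,
    # then map back to float. qmax = 7 for INT4, 31 for INT6.
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

The same routine with bits=6 covers the INT6 attention path; only the grid resolution changes.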
Compression
zstd
level: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"legal_score_first":true}
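A minimal sketch of test-time training with the listed lr=0.002 and epochs=3: score the test data once before any adaptation (one reading of legal_score_first), then fine-tune on the test data itself and score again. The model here is a hypothetical linear predictor, not the actual Transformer.

```python
import numpy as np

def mse(w, xs, ys):
    return float(np.mean((xs @ w - ys) ** 2))

def ttt_score(w, xs, ys, lr=0.002, epochs=3):
    # Score before adaptation, then take a few full-batch gradient steps
    # on the test data and score again.
    before = mse(w, xs, ys)
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * xs.T @ (xs @ w - ys) / len(xs)
        w -= lr * grad
    return before, mse(w, xs, ys)
```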
LR Schedule
warmdown
parameters: {"late_qat_frac":0.65,"late_qat_threshold":0.9}
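A sketch tying the warmdown schedule to the late-QAT trigger: QAT turns on once training passes late_qat_frac of the run, where progress is measured as elapsed wallclock over the wallclock budget for the deterministic variant. The warmdown shape (constant, then linear decay to zero) and its fraction are assumptions; the role of late_qat_threshold is not specified here and is left out.

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.35):
    # Constant LR, then a linear warmdown to zero over the final fraction
    # of training (warmdown_frac is illustrative, not from the source).
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / max(total_steps - start, 1)

def qat_active(progress, late_qat_frac=0.65):
    # progress in [0, 1]: fraction of the run completed, by steps or, for
    # the deterministic wallclock trigger, elapsed time / time budget.
    return progress >= late_qat_frac
```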
Novel Contributions
- INT4 bigram QAT to quantize the bigram table below INT6 and fit 12 layers within the 16 MB budget
- EMA reset when QAT activates to avoid quantization degradation from pre-QAT EMA weights
- Deterministic wallclock-based QAT trigger to remove seed-to-seed timing variance on multi-GPU runs