val_bpb: 1.4352
Architecture: Transformer
Optimizer: —
Artifact Size: 9.90 MB
Training Techniques

Regularization
- Spectral floor (parameters: {"lambda": 0.01, "eps": 0.01})
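The card gives only the name and the two hyperparameters, so the exact formulation is not stated. A minimal sketch of one plausible reading, a VICReg-style hinge that penalizes per-dimension standard deviation of hidden-state deltas falling below a floor `eps`, scaled by `lambda`:

```python
import math

def spectral_floor_penalty(deltas, lam=0.01, eps=0.01):
    """Hinge penalty pushing the per-dimension std of hidden-state
    deltas above a floor `eps`. The hinge form is an assumption; the
    card only lists {"lambda": 0.01, "eps": 0.01}."""
    n, dim = len(deltas), len(deltas[0])
    penalty = 0.0
    for d in range(dim):
        col = [row[d] for row in deltas]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        # Only dimensions whose std has collapsed below the floor contribute.
        penalty += max(0.0, eps - math.sqrt(var))
    return lam * penalty / dim

# The second dimension is constant (collapsed), so it is penalized:
deltas = [[0.5, 1.0], [-0.5, 1.0], [0.2, 1.0], [-0.2, 1.0]]
```

Under this reading, a batch whose every dimension varies above `eps` incurs zero penalty, which matches the "floor" framing: the term only activates to prevent dimensional collapse.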
Other
- Cosine-MSE latent prediction auxiliary loss with stop-gradient target and L2-normalized hidden states (parameters: {"alpha": 0.1, "layers": "2-5"})
Architecture
- KV head count: reduced key/value heads from 4 to 1 in MQA variants (parameters: {"kv_heads": 1})
- Value Residual: added value embeddings to the first and last layers in MQA variants
- MLP3x: increased the MLP width multiplier to 3 to use the freed parameters (parameters: {"multiplier": 3})
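A rough sketch of the parameter accounting behind these variants. The dimensions (d_model = 256, 4 query heads) and the 2x baseline MLP multiplier are illustrative assumptions, not stated in the card:

```python
def attn_params(d_model, n_heads, kv_heads):
    # Per-layer attention weights (no biases): Q and output projections
    # are d_model x d_model; K and V shrink with the number of KV heads.
    head_dim = d_model // n_heads
    return 2 * d_model * d_model + 2 * d_model * (kv_heads * head_dim)

def mlp_params(d_model, mult):
    # Two-matrix MLP: up-projection d -> mult*d, down-projection mult*d -> d.
    return 2 * d_model * mult * d_model

d = 256  # illustrative, not from the card
freed = attn_params(d, 4, 4) - attn_params(d, 4, 1)  # MHA -> MQA savings
extra = mlp_params(d, 3) - mlp_params(d, 2)  # 3x MLP vs an assumed 2x baseline
```

Whether the freed attention parameters fully fund the wider MLP depends on the actual dimensions and the baseline multiplier, which the card does not state.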
Quantization
- int8 (bits: 8, scope: all)
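The card specifies 8-bit quantization over all tensors but not the scheme. A minimal sketch assuming symmetric per-tensor int8 (per-channel scales or asymmetric zero-points would work similarly):

```python
def quantize_int8(w):
    """Symmetric per-tensor int8: scale by the max-abs value so
    everything lands in [-127, 127]. The per-tensor symmetric scheme
    is an assumption; the card only says bits: 8, scope: all."""
    scale = max(abs(x) for x in w) / 127 or 1.0  # avoid 0 scale for all-zero tensors
    return [round(x / scale) for x in w], scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [0.1, -0.5, 0.25, 0.0]
q, scale = quantize_int8(w)  # ints in [-127, 127] plus one float scale
```

The round-trip error per element is bounded by half the scale, and storing one byte per weight plus a scale is what brings the artifact down toward the reported 9.90 MB before compression.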
Compression
- zlib (level: null)
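With level unspecified (null), zlib's default compression level presumably applies. A sketch of the final packaging step using Python's stdlib `zlib`; the payload here is a stand-in, not the actual serialized weights:

```python
import zlib

# Stand-in for the serialized int8 weight bytes (illustrative only).
payload = bytes(range(256)) * 64

# zlib.compress with no level argument uses the library default,
# matching the card's level: null.
blob = zlib.compress(payload)

assert zlib.decompress(blob) == payload  # compression is lossless
```

Because zlib is lossless, it shrinks the artifact without affecting val_bpb, unlike the int8 step, which is lossy.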
Novel Contributions
- JEPA-style auxiliary losses for next-token prediction in the parameter golf regime
- Spectral variance floor applied to hidden-state deltas to prevent dimensional collapse
- Cosine-MSE latent prediction auxiliary head with stop-gradient target
- Experimental analysis showing torch.compile confounded an apparent improvement
- MQA plus value embeddings and 3x MLP architectural variants evaluated as alternatives