val_bpb: 1.6768
Architecture: Transformer
Optimizer: Adam
Artifact Size: 5,909,270 bytes (about 5.6 MiB)
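The headline metric, val_bpb, is bits per byte on the validation stream: the model's summed negative log-likelihood converted from nats to bits and normalized by byte count. A minimal sketch of that conversion (the 1000-byte example stream is illustrative, not from the card):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a byte
    stream into bits per byte: bpb = NLL / (num_bytes * ln 2)."""
    return total_nll_nats / (num_bytes * math.log(2))

# A model that assigns every byte probability 2**-1.6768 scores
# exactly 1.6768 bpb.
p = 2.0 ** -1.6768
nll = -1000 * math.log(p)        # summed NLL over 1000 bytes, in nats
bpb = bits_per_byte(nll, 1000)   # → 1.6768 (up to float rounding)
```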
Training Techniques

Architecture
- TrigramHash: trigram hash embedding wired into the GPT forward and logits paths.
  parameters: {"vocab_size": 4096, "dimensions": 128}
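A minimal sketch of the trigram-hash lookup, assuming the stated parameters (a 4096-entry table of 128-dim vectors). The hash mixing constants and zero-padding at the sequence start are assumptions; the card does not specify the exact hash:

```python
import random

VOCAB_SIZE, DIM = 4096, 128  # parameters from the card

def trigram_hash(a: int, b: int, c: int) -> int:
    # Hypothetical mixing constants; the submission's exact hash is unspecified.
    return (a * 131071 + b * 131 + c) % VOCAB_SIZE

rng = random.Random(0)
trigram_table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)]
                 for _ in range(VOCAB_SIZE)]

def trigram_embeddings(tokens):
    """One DIM-dim vector per position, hashed from the trailing trigram.
    In the forward path this is added to the token embedding; an analogous
    hashed table can contribute a per-position bias on the logits path."""
    out = []
    for t, tok in enumerate(tokens):
        a = tokens[t - 2] if t >= 2 else 0  # zero-pad the first positions
        b = tokens[t - 1] if t >= 1 else 0
        out.append(trigram_table[trigram_hash(a, b, tok)])
    return out

embs = trigram_embeddings(list(b"Hello"))  # one DIM-vector per input byte
```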
- Value Residual: blends attention values with a carried-over v0 residual, mixed by vr_lambda.
  parameters: null
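The Value Residual blend reduces to a convex mix of the current layer's value tensor with v0, the first block's values carried through the network. Whether vr_lambda is fixed or learned per layer is not specified on the card; this sketch treats it as a plain scalar:

```python
def blend_values(v_layer, v0, vr_lambda):
    """Value Residual: out = vr_lambda * v_layer + (1 - vr_lambda) * v0,
    where v0 is the first block's value tensor carried through the model.
    vr_lambda's parameterization (fixed vs. learned) is an assumption here."""
    return [vr_lambda * vl + (1.0 - vr_lambda) * v0_i
            for vl, v0_i in zip(v_layer, v0)]

blended = blend_values([3.0, 4.0], [1.0, 2.0], 0.5)  # → [2.0, 3.0]
```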
Quantization
- QAT: bits: 6; scope: Q/K/V/O and MLP up/down bank slices
- mixed int6/int7/int5: bits: per-tensor (5, 6, or 7); scope: gradient-sensitive bank tensors
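Quantization-aware training typically runs the forward pass through "fake-quantized" weights so the loss sees the rounding error while gradients flow straight through. A minimal per-tensor symmetric sketch; the submission quantizes per bank slice and assigns int5/int6/int7 by gradient sensitivity, which this toy version does not model:

```python
def fake_quantize(w, bits: int):
    """Symmetric per-tensor fake quantization: snap each weight to a
    signed `bits`-bit grid and dequantize. In QAT the forward pass uses
    these values while the backward pass treats rounding as identity
    (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 31 for 6 bits
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

w = [0.5, -1.0, 0.26]
wq = fake_quantize(w, 6)  # each entry within half a 6-bit step of w
```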
Test-Time Training
- Adam TTT: parameters: {"enabled": false} (supported, but disabled for this run)
Evaluation
- sliding window eval: parameters: {"stride": 64}
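Sliding-window evaluation advances the context window by the stride and scores only the newest positions of each window, so every token is predicted with (near-)full left context. The window length below is an assumption; the card only gives stride=64:

```python
def sliding_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, score_from) spans for sliding-window eval:
    each length-`window` slice advances by `stride`, and only positions
    [score_from, end) are scored (the first window scores everything)."""
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(256, 128, 64)  # 3 windows covering all 256 tokens
```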
Compression
- lzma: level: 9
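The exported artifact is packed with LZMA at the highest standard preset; in Python's standard library that corresponds to `preset=9` (the `weights` payload below is a stand-in for the real artifact bytes):

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress the exported weights with LZMA, matching the card's
    `level: 9` (Python's highest standard preset)."""
    return lzma.compress(blob, preset=9)

payload = b"weights " * 1000          # placeholder artifact bytes
packed = compress_artifact(payload)   # losslessly recoverable
```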
LR Schedule
- warmdown: parameters: {"warmdown_iters": 4000}
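A warmdown schedule holds the learning rate flat and then decays it over the final iterations. The linear decay shape and total iteration count below are assumptions; the card only specifies warmdown_iters=4000:

```python
def warmdown_lr(it: int, total_iters: int, base_lr: float,
                warmdown_iters: int = 4000) -> float:
    """Hold base_lr, then decay linearly to 0 over the last
    `warmdown_iters` steps. Linear shape is an assumption."""
    start = total_iters - warmdown_iters
    if it < start:
        return base_lr
    return base_lr * (total_iters - it) / warmdown_iters

lr_mid = warmdown_lr(8000, 10000, 1.0)  # halfway through warmdown → 0.5
```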
Novel Contributions
- TrigramHash embedding integrated into the GPT forward and logits paths
- Value Residual attention modification with v0 carry-over
- Bank-level QAT on attention and MLP bank slices
- GradQuant tiered mixed quantization with a rebanking/unbanking export path
- Multi-token prediction heads trained but excluded from the exported artifact size
- Legal, score-first Adam-based TTT support
- Temperature calibration using training tokens only
- Extended warmdown schedule with lzma preset-9 compression
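The temperature-calibration contribution fits a single softmax temperature on training tokens only (never validation data). A hedged sketch using grid search; the submission's actual optimizer and grid are not specified:

```python
import math

def nll_at_temperature(T, logits_rows, targets):
    """Negative log-likelihood of `targets` under temperature-scaled
    softmax(logits / T), summed over positions (stable log-sum-exp)."""
    total = 0.0
    for row, y in zip(logits_rows, targets):
        z = [l / T for l in row]
        m = max(z)
        total -= z[y] - (m + math.log(sum(math.exp(v - m) for v in z)))
    return total

def calibrate_temperature(logits_rows, targets,
                          grid=(0.8, 0.9, 1.0, 1.1, 1.2)):
    """Pick the grid temperature minimizing NLL on held-in training
    tokens. Grid search is a sketch; any 1-D optimizer would do."""
    return min(grid, key=lambda T: nll_at_temperature(T, logits_rows, targets))

rows = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
tgts = [0, 1, 1]
T = calibrate_temperature(rows, tgts)  # never worse than T = 1.0 on this data
```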