| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.0543 | Transformer | — | ~28M params |
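
For context, val_bpb reads as bits per byte on the validation set. Below is a minimal sketch of that conversion; the summed-NLL accounting and the example numbers are illustrative assumptions, not the submission's actual evaluation code.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a byte-level
    validation set into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: ~1.462e6 nats of NLL over 2,000,000 validation bytes
# gives roughly 1.05 bits per byte, the same scale as the reported 1.0543.
print(bits_per_byte(1.462e6, 2_000_000))
```
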
Training Techniques

| Category | Technique | Parameters |
| --- | --- | --- |
| Architecture | attention (11-layer attention model) | {"layers": 11} |
| Sequence Length | sequence_length | train_length: null; eval_length: 2048 |
| Quantization | int8 | bits: 8; scope: all |
| Compression | lzma | level: null |
| Test-Time Training | TTT | {"rank": 8} |
| Other | Residual signs | null |
| Other | Outlier filtering | null |
| Other | Stochastic eval | null |
| Regularization | logit softcap | {"temperature": 1.02} |

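The Quantization and Compression rows above pair naturally: weights quantized to 8-bit integers, then LZMA applied to the packed bytes to produce the stored ~28M-parameter artifact. The sketch below assumes symmetric per-tensor scaling and is illustrative only, not the submission's actual packing code.

```python
import lzma
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0 or 1.0  # avoid a zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(tensors: dict[str, np.ndarray]) -> bytes:
    """Quantize every tensor to int8 and LZMA-compress the concatenated bytes.
    Scales and tensor shapes would also need to be stored; omitted here for brevity."""
    payload = b"".join(quantize_int8(w)[0].tobytes() for w in tensors.values())
    return lzma.compress(payload, preset=9)

# Toy example standing in for the full ~28M-parameter state dict.
weights = {"layer0.attn.w": np.random.randn(256, 256).astype(np.float32)}
print(len(pack_artifact(weights)), "compressed bytes")
```
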
Novel Contributions
- 11-layer attention architecture
- Evaluation at a sequence length of 2048
- int8 quantization with lzma compression
- Residual signs eval trick
- Test-time training (TTT) with rank 8 (see the sketch below)
- Outlier filtering
- Stochastic evaluation
- Logit softcap with temperature scaling at 1.02
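
The rank-8 test-time training entry is the least self-explanatory item, so here is one plausible reading: a frozen network with a small rank-8 (LoRA-style) adapter that takes a few gradient steps on the evaluation context itself before predicting the remainder. The module and loop below are a generic sketch under that assumption, not the submission's implementation.

```python
import torch
import torch.nn as nn

class Rank8Adapter(nn.Module):
    """A frozen linear layer plus a trainable rank-8 low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

def test_time_update(adapter: Rank8Adapter, context: torch.Tensor,
                     targets: torch.Tensor, steps: int = 1, lr: float = 1e-3) -> None:
    """A few gradient steps on the evaluation context before predicting the rest."""
    opt = torch.optim.SGD([adapter.A, adapter.B], lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(adapter(context), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
```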