val_bpb: 1.1957
Architecture: Transformer
Optimizer: —
Artifact Size: 15,880,385 bytes
Training Techniques
Architecture
KV head count
A baseline-sized Transformer (9 layers, model dim 512) with 8 attention heads and 4 KV heads, i.e. grouped-query attention with 2 query heads per KV head.
parameters: {"layers":9,"model_dim":512,"heads":8,"kv_heads":4}
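A minimal sketch of the grouped-query attention head layout implied by the parameters above (heads=8, kv_heads=4, model_dim=512): each KV head is shared by a group of 2 query heads. Names and shapes here are illustrative, not the submission's actual code.

```python
import numpy as np

model_dim, n_heads, n_kv_heads = 512, 8, 4
head_dim = model_dim // n_heads          # 64
group = n_heads // n_kv_heads            # 2 query heads share each KV head

def gqa_scores(q, k):
    """q: (n_heads, T, head_dim); k: (n_kv_heads, T, head_dim).
    Each KV head is repeated `group` times to serve its query-head group."""
    k_rep = np.repeat(k, group, axis=0)  # (n_heads, T, head_dim)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)

T = 4
q = np.random.randn(n_heads, T, head_dim)
k = np.random.randn(n_kv_heads, T, head_dim)
print(gqa_scores(q, k).shape)  # (8, 4, 4)
```

Halving the KV heads halves the KV cache while keeping the query-side capacity of 8 heads.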
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
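A sketch of the packing path described by the two sections above: per-row int8 quantization followed by zlib compression of the resulting bytes. The card does not specify symmetric vs. affine quantization or the zlib level (level: null), so symmetric scales and the default level are assumptions.

```python
import numpy as np
import zlib

def quantize_rows_int8(w):
    """Symmetric per-row int8 quantization (an assumed scheme).
    Returns int8 codes and a float32 scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def pack(w):
    """Quantize, then zlib-compress codes and scales into one artifact blob."""
    q, scale = quantize_rows_int8(w)
    return zlib.compress(q.tobytes() + scale.tobytes())

w = np.random.randn(512, 512).astype(np.float32)
blob = pack(w)
print(len(blob) < w.nbytes)  # True: int8 + zlib is far below the fp32 size
```

Dropping from 4 bytes to roughly 1 byte per weight before entropy coding is what lets the artifact land under the 16 MB cap (15,880,385 bytes here).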
Test-Time Training
LoRA TTT
parameters: {"rank":null}
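A minimal sketch of the LoRA test-time-training idea: the base weight stays frozen and only a low-rank adapter is updated at evaluation time. The rank is unspecified in the card ("rank": null), so r=8 below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                                        # r is an assumed rank
W = rng.standard_normal((d, d)).astype(np.float32)   # frozen base weight
A = np.zeros((d, r), dtype=np.float32)               # trainable factor, init 0
B = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # trainable factor

def forward(x):
    # Effective weight is W + A @ B; only A and B receive TTT updates.
    return x @ (W + A @ B)

x = rng.standard_normal((2, d)).astype(np.float32)
# With one factor initialized to zero, the adapter is a no-op before TTT:
print(np.allclose(forward(x), x @ W))  # True
```

Zero-initializing one factor means test-time training starts from exactly the base model and only then adapts.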
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
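A sketch of a warmdown schedule with warmdown_steps=3000: the learning rate holds at its base value, then decays to zero over the final 3,000 steps. The decay shape is not specified in the card; linear is an assumption, and base_lr/total step counts are placeholders.

```python
def lr_at(step, total_steps, base_lr=1e-3, warmdown_steps=3000):
    """Flat LR, then a linear warmdown to zero over the last warmdown_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 10_000
print(lr_at(0, total))       # 0.001 (flat phase)
print(lr_at(total, total))   # 0.0   (fully decayed)
```

Under a hard 10-minute training budget, a flat-then-warmdown schedule avoids spending early steps at a reduced learning rate.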
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- Test-time training with LoRA at evaluation achieves the best val_bpb
- Baseline-sized 512d Transformer with 8 heads and 4 KV heads
- Training capped at 10 minutes on 8x H100 SXM GPUs
- Per-row int8 quantization with zlib compression to fit under 16MB
- LR schedule with a 3,000-step warmdown