val_bpb: 1.1957
Architecture: Transformer
Optimizer: —
Artifact Size: 15,880,385 bytes
Training Techniques
Architecture: KV head count
Uses a baseline-sized Transformer with 8 attention heads and 4 KV heads.
Parameters: {"layers": 9, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4}
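With 8 query heads sharing 4 KV heads, this is a grouped-query attention (GQA) layout. A minimal sketch of the head grouping, assuming the standard GQA convention that consecutive query heads share one KV head (the function name is illustrative, not from the submission):

```python
# Hypothetical sketch of the grouped-query attention head mapping implied by
# num_heads=8, num_kv_heads=4: each KV head serves num_heads // num_kv_heads
# query heads. Only the parameter values come from the card.

params = {"layers": 9, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4}

def kv_head_for_query_head(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head that query head `q_head` attends with."""
    group_size = num_heads // num_kv_heads  # 8 // 4 = 2 query heads per KV head
    return q_head // group_size

mapping = [
    kv_head_for_query_head(h, params["num_heads"], params["num_kv_heads"])
    for h in range(params["num_heads"])
]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV head count halves the KV-cache size relative to full multi-head attention while keeping all 8 query projections.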
LR Schedule: warmdown
Parameters: {"warmdown_steps": 3000}
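A warmdown schedule holds the base learning rate and then decays it over the final steps of training. A hedged sketch, assuming a linear decay to zero; only warmdown_steps=3000 comes from the card, while the total step count and base LR below are placeholders:

```python
# Sketch of a "warmdown" LR schedule: constant base LR, then linear decay to
# zero over the last `warmdown_steps` steps. The linear shape and the numbers
# other than warmdown_steps=3000 are assumptions.

def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3000) -> float:
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr                      # constant phase
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps  # linear warmdown phase

total_steps, base_lr = 10_000, 3e-3        # illustrative values
assert lr_at(0, total_steps, base_lr) == base_lr
assert lr_at(total_steps, total_steps, base_lr) == 0.0
```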
Quantization: int8 (bits: 8, scope: all)
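Scope "all" reads as quantizing every weight tensor to int8. A minimal sketch of a symmetric per-tensor int8 roundtrip; the scaling scheme is an assumption, not necessarily the submission's exact recipe:

```python
# Hedged sketch of symmetric per-tensor int8 quantization: scale by the
# tensor's max absolute value so the range maps onto [-127, 127], then
# dequantize by multiplying back. The scheme is illustrative.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid 0 for all-zero tensors
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Roundtrip error is bounded by about half a quantization step (0.5 * scale).
```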
Compression: zlib (level: null)
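The card leaves the zlib level null, which suggests the library default. A sketch of the packaging step, with a synthetic stand-in for the quantized weight bytes:

```python
# Sketch of the artifact packaging step: int8 weights serialized as raw bytes
# and compressed losslessly with zlib at the default level (the level is
# unspecified in the card). The payload here is synthetic.
import zlib

int8_weights = bytes(range(256)) * 64   # stand-in for quantized weight bytes
blob = zlib.compress(int8_weights)      # default compression level
restored = zlib.decompress(blob)

assert restored == int8_weights         # lossless roundtrip
print(len(int8_weights), "->", len(blob))
```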
Test-Time Training: LoRA TTT (parameters: null)
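In LoRA-style test-time training, the frozen weight W is augmented with a low-rank product A @ B, and only the small factors are updated at evaluation time. A hedged sketch with tiny illustrative shapes; the card leaves the LoRA parameters null, so rank, init, and the update below are all assumptions:

```python
# Hedged sketch of LoRA test-time training: W stays frozen, the low-rank
# adapter A @ B is zero-initialized (a no-op at first) and only its factors
# would be trained at evaluation time. All shapes and values are illustrative.
import random

d, r = 4, 2                                   # model dim and LoRA rank (assumed)
random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]   # frozen
A = [[0.01 * random.uniform(-1, 1) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]             # zero init: adapter starts as a no-op

def effective_weight(W, A, B):
    """W + A @ B, the weight actually used in the forward pass."""
    return [[W[i][j] + sum(A[i][k] * B[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

W_eff_before = effective_weight(W, A, B)      # equals W before any TTT step
B[0][0] = 0.1                                 # stand-in for one test-time update
W_eff_after = effective_weight(W, A, B)       # now differs from W
```

Because only A and B (2 * d * r values per layer) change at test time, the update is cheap relative to retraining the full d * d weight.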
Novel Contributions
- Applies LoRA-based test-time training to improve compression performance.
- Shows that the LoRA TTT evaluation path outperforms the plain int8 roundtrip.
- Fits the int8 + zlib artifact within the 16 MB submission limit.
- Trains a 512-dimensional baseline Transformer with 8 attention heads and 4 KV heads under a 10-minute training budget.