PR #161

open

Record: Add TTT-LoRA 512d submission (val_bpb=1.1957)

by santosh5541
val_bpb: 1.1957
Architecture: Transformer
Optimizer:
Artifact Size: 15,880,385 bytes

Training Techniques

  • Architecture (KV head count): baseline-sized Transformer with 8 attention heads and 4 KV heads.
    parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4}
  • Quantization: int8 (bits: 8, scope: all)
  • Compression: zlib (level: null)
  • Test-Time Training: LoRA TTT
    parameters: {"rank":null}
  • LR Schedule: warmdown
    parameters: {"warmdown_steps":3000}
  • Other: training capped at a 10-minute wallclock budget on 8x H100 SXM GPUs.
    parameters: {"max_wallclock_seconds":600,"hardware":"8x H100 SXM"}
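The quantization and compression entries above can be sketched as a round-trip; this is an illustrative reconstruction, not the submission's actual code. Since the record lists the zlib level as null, the default compression level is assumed here.

```python
import zlib
import numpy as np

def quantize_rows_int8(w):
    """Per-row symmetric int8 quantization: one float32 scale per row."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                       # guard all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_rows_int8(q, scales):
    return q.astype(np.float32) * scales

# Example weight matrix at the record's 512d width.
w = np.random.default_rng(0).standard_normal((512, 512)).astype(np.float32)
q, scales = quantize_rows_int8(w)

# zlib packs the int8 payload into the on-disk artifact; the record
# gives level: null, so the default level is used here.
blob = zlib.compress(q.tobytes())

# Decompress and dequantize to recover an approximation of w.
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(q.shape)
w_hat = dequantize_rows_int8(restored, scales)
```

The per-row scale keeps the worst-case quantization error of each row bounded by half that row's scale, which is why the int8 round-trip stays close to the fp32 weights.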

Novel Contributions

  • TTT-LoRA evaluation path outperformed the standard int8 round-trip evaluation, giving the best val_bpb.
  • Baseline-sized 512d Transformer with 8 heads and 4 KV heads under the 10-minute/16MB constraint.
  • Per-row int8 quantization combined with zlib compression to fit within the artifact size limit.
  • Warmdown learning-rate schedule with 3000 warmdown steps.
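The LoRA TTT contribution can be sketched in miniature: a frozen base weight plus a trainable low-rank correction that takes a gradient step at evaluation time. The record leaves the rank null, so r=8 is an illustrative guess, and squared error stands in for the actual bits-per-byte objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512   # model width, from the record
r = 8     # LoRA rank; the record lists "rank": null, so 8 is a guess

# Frozen base projection (stand-in for one attention/MLP weight).
W = rng.standard_normal((d, d)) / np.sqrt(d)
# LoRA factors: only these update during test-time training.
A = np.zeros((d, r))
B = rng.standard_normal((r, d)) * 0.01

def forward(x, A, B):
    # Base output plus the low-rank correction (x @ A) @ B.
    return x @ W + (x @ A) @ B

def mse(y, t):
    return float(np.mean((y - t) ** 2))

# One test-time training step on a held-out batch, with squared error
# as a stand-in for the submission's bpb loss.
x = rng.standard_normal((32, d))
t = rng.standard_normal((32, d))
lr = 0.5

y = forward(x, A, B)
loss_before = mse(y, t)
g = 2.0 * (y - t) / y.size          # dL/dy
grad_A = x.T @ (g @ B.T)            # dL/dA
grad_B = (x @ A).T @ g              # dL/dB (zero at init, since A is zero)
A = A - lr * grad_A
B = B - lr * grad_B
loss_after = mse(forward(x, A, B), t)
```

Because W stays frozen and only the rank-r factors adapt, the test-time update touches d*r + r*d parameters per layer rather than d*d, which keeps TTT cheap within the 10-minute budget.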