PR #161

open

Record: Add TTT-LoRA 512d submission (val_bpb=1.1957)

by santosh5541
val_bpb: 1.1957
Architecture: Transformer
Optimizer:
Artifact Size: 15,880,385 bytes

Training Techniques

  • Architecture (KV head count): baseline-sized Transformer with 8 attention heads and 4 KV heads.
    parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4}
  • Quantization: int8 (bits: 8, scope: all)
  • Compression: zlib (level: null)
  • Test-Time Training: LoRA TTT
    parameters: {"rank":null}
  • LR Schedule: warmdown
    parameters: {"warmdown_steps":3000}
  • Other: training capped at a 10-minute wallclock budget on 8x H100 SXM GPUs.
    parameters: {"max_wallclock_seconds":600,"hardware":"8x H100 SXM"}
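The quantization and compression entries above can be sketched as a round-trip; this is an illustrative reconstruction, not the submission's actual code. Since the record lists the zlib level as null, the default compression level is assumed here.

```python
import zlib
import numpy as np

def quantize_rows_int8(w):
    """Per-row symmetric int8 quantization: one float32 scale per row."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                       # guard all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_rows_int8(q, scales):
    return q.astype(np.float32) * scales

# Example weight matrix at the record's 512d width.
w = np.random.default_rng(0).standard_normal((512, 512)).astype(np.float32)
q, scales = quantize_rows_int8(w)

# zlib packs the int8 payload into the on-disk artifact; the record
# gives level: null, so the default level is used here.
blob = zlib.compress(q.tobytes())

# Decompress and dequantize to recover an approximation of w.
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(q.shape)
w_hat = dequantize_rows_int8(restored, scales)
```

The per-row scale keeps the worst-case quantization error of each row bounded by half that row's scale, which is why the int8 round-trip stays close to the fp32 weights.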

Novel Contributions

  • TTT-LoRA evaluation path outperformed the standard int8 round-trip evaluation, giving the best val_bpb.
  • Baseline-sized 512d Transformer with 8 heads and 4 KV heads under the 10-minute/16MB constraint.
  • Per-row int8 quantization combined with zlib compression to fit within the artifact size limit.
  • Warmdown learning-rate schedule with 3000 warmdown steps.
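The LoRA TTT contribution can be sketched in miniature: a frozen base weight plus a trainable low-rank correction that takes a gradient step at evaluation time. The record leaves the rank null, so r=8 is an illustrative guess, and squared error stands in for the actual bits-per-byte objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512   # model width, from the record
r = 8     # LoRA rank; the record lists "rank": null, so 8 is a guess

# Frozen base projection (stand-in for one attention/MLP weight).
W = rng.standard_normal((d, d)) / np.sqrt(d)
# LoRA factors: only these update during test-time training.
A = np.zeros((d, r))
B = rng.standard_normal((r, d)) * 0.01

def forward(x, A, B):
    # Base output plus the low-rank correction (x @ A) @ B.
    return x @ W + (x @ A) @ B

def mse(y, t):
    return float(np.mean((y - t) ** 2))

# One test-time training step on a held-out batch, with squared error
# as a stand-in for the submission's bpb loss.
x = rng.standard_normal((32, d))
t = rng.standard_normal((32, d))
lr = 0.5

y = forward(x, A, B)
loss_before = mse(y, t)
g = 2.0 * (y - t) / y.size          # dL/dy
grad_A = x.T @ (g @ B.T)            # dL/dA
grad_B = (x @ A).T @ g              # dL/dB (zero at init, since A is zero)
A = A - lr * grad_A
B = B - lr * grad_B
loss_after = mse(forward(x, A, B), t)
```

Because W stays frozen and only the rank-r factors adapt, the test-time update touches d*r + r*d parameters per layer rather than d*d, which keeps TTT cheap within the 10-minute budget.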