PR #1250

open

Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)

by ibarrajo
val_bpb: 1.2094
Architecture: Transformer
Optimizer:
Artifact Size: 12.3 MB

Training Techniques

Architecture (KV head count): uses full attention with 8 KV heads instead of GQA.
parameters: {"kv_heads": 8}
BigramHash: uses a smaller BigramHash embedding table.
parameters: {"dimensions": "3072x112"}
Compression: lzma (level: null)
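What LZMA compression of the trained artifact might look like using Python's standard library. Serializing a torch state_dict and the file name are assumptions; passing no preset corresponds to the unspecified level above.

```python
import io
import lzma
import torch

def save_artifact(model, path="artifact.pt.xz"):
    """Serialize the weights, then LZMA-compress the byte stream.
    Omitting the preset uses lzma's default, matching 'level: null'."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb") as f:
        f.write(buf.getvalue())

def load_artifact(model, path="artifact.pt.xz"):
    with lzma.open(path, "rb") as f:
        model.load_state_dict(torch.load(io.BytesIO(f.read())))
```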
Test-Time Training (TTT)
parameters: null
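The PR records TTT with no parameters, so the following is only a generic sketch of test-time training: adapt a copy of the model on each evaluation document's prefix with a few gradient steps, then score the held-out suffix. The step count, learning rate, optimizer, and model call signature are all assumptions.

```python
import copy
import math
import torch
import torch.nn.functional as F

def ttt_score(model, tokens, n_steps=4, lr=1e-4):
    """Generic test-time training: fine-tune on the document's first half,
    then measure loss on the second half. Every hyperparameter here is
    illustrative; the PR lists parameters: null."""
    m = copy.deepcopy(model)                    # keep the trained artifact intact
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    half = tokens.size(1) // 2
    ctx, tail = tokens[:, :half], tokens[:, half:]

    def nll(seq):                               # next-token cross-entropy (assumed loss)
        logits = m(seq[:, :-1])                 # assumed signature: (B, T) -> (B, T, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               seq[:, 1:].reshape(-1))

    for _ in range(n_steps):                    # adapt on the prefix
        loss = nll(ctx)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # score the held-out suffix
        bits_per_token = nll(tail) / math.log(2)
    return bits_per_token.item()                # val_bpb further normalizes by bytes/token
```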

Novel Contributions

  • Full attention with 8 KV heads
  • LZMA artifact compression
  • Smaller BigramHash table (3072x112)
  • Negative result showing that reducing BigramHash size significantly hurts quality
  • Negative result showing that full attention trains too slowly to complete enough steps within the time budget