PR #1118
open
Submission: 11L XSA4 + TrigramHash + ValueResidual + Legal TTT (val_bpb=1.1187)
by adityakm24
val_bpb
1.1187
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,985,833 bytes
Training Techniques
Architecture
XSA
XSA applied to the last 4 layers
parameters: {"layers":4}
TrigramHash
Trigram hash embedding used in the model
parameters: {"dimensions":1024}
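A minimal sketch of what a trigram hash embedding does: each 3-token window is hashed into a fixed-size table of learned vectors, and the looked-up row augments the usual token embedding. The 1024-dim width comes from the submission's parameters; the table size, mixing constants, and padding convention are illustrative assumptions, not the author's values.

```python
# Trigram hash embedding sketch. DIM=1024 follows the submission's
# {"dimensions":1024}; TABLE_SIZE and the hash mixing are assumptions.
import numpy as np

TABLE_SIZE = 1 << 16   # assumed number of hash buckets
DIM = 1024             # from the submission parameters

rng = np.random.default_rng(0)
trigram_table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def trigram_bucket(t0: int, t1: int, t2: int) -> int:
    """Mix three token ids into one bucket index (illustrative hash)."""
    h = (t0 * 0x9E3779B1 ^ t1 * 0x85EBCA77 ^ t2 * 0xC2B2AE3D) & 0xFFFFFFFF
    return h % TABLE_SIZE

def trigram_embed(tokens: list[int]) -> np.ndarray:
    """One DIM-vector per position; the first two positions see pad id 0."""
    padded = [0, 0] + tokens
    return np.stack([
        trigram_table[trigram_bucket(padded[i], padded[i + 1], padded[i + 2])]
        for i in range(len(tokens))
    ])

emb = trigram_embed([5, 17, 42, 42, 7])
```

In a real model the table rows are trained parameters and the result is added to (or concatenated with) the token embedding; hash collisions are tolerated as in other hashed-feature schemes.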
BigramHash
Bigram hash embedding used in the model
parameters: {"dimensions":1536}
SmearGate
SmearGate used in the model
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
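Partial RoPE rotates only a slice of each head's dimensions and passes the rest through unchanged. A sketch with the submission's 16 rotated dims; the head size and base frequency here are standard-but-assumed values:

```python
# Partial RoPE sketch: rotate the first ROT dims of each head, leave the
# rest untouched. ROT=16 is from the submission; HEAD_DIM and BASE are
# illustrative assumptions.
import numpy as np

ROT = 16          # rotated dims per head, from the submission
HEAD_DIM = 64     # assumed head size
BASE = 10000.0    # conventional RoPE base, assumed

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """x: (seq, HEAD_DIM); apply pairwise rotation to x[:, :ROT] only."""
    half = ROT // 2
    freqs = BASE ** (-np.arange(half) / half)      # (half,)
    ang = pos[:, None] * freqs[None, :]            # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT:]], axis=-1)

seq = 8
x = np.random.default_rng(1).standard_normal((seq, HEAD_DIM))
y = partial_rope(x, np.arange(seq))
```

The unrotated dims carry position-independent content, which is the usual motivation for rotating only part of the head.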
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
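With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A sketch of that sharing (head dim and sequence length are illustrative; the causal mask is omitted for brevity):

```python
# GQA sketch with the submission's head counts: 8 query heads, 4 KV heads,
# so each KV head serves a group of 2 query heads. D is assumed.
import numpy as np

H, KV, D = 8, 4, 32          # heads / kv_heads from the submission; D assumed
GROUP = H // KV              # query heads per KV head -> 2

def gqa(q, k, v):
    """q: (H, T, D); k, v: (KV, T, D). Repeat each KV head for its group."""
    k = np.repeat(k, GROUP, axis=0)              # (H, T, D)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over keys
    return w @ v                                  # (H, T, D)

rng = np.random.default_rng(2)
T = 5
out = gqa(rng.standard_normal((H, T, D)),
          rng.standard_normal((KV, T, D)),
          rng.standard_normal((KV, T, D)))
```

The payoff is a KV cache (and KV projection parameter count) halved relative to full multi-head attention at these settings.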
ValueResidual
Value residual connection used in the model
parameters: null
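A value residual connection lets later layers blend their value projections with the first layer's values. The submission does not spell out its exact variant, so the fixed-gate blend below is an illustrative assumption (published variants use a learned, per-layer gate):

```python
# Value residual sketch: blend a later layer's values with layer 1's values.
# The 0.5 gate is an assumption; real implementations typically learn it.
import numpy as np

def value_residual(v_layer: np.ndarray, v_first: np.ndarray,
                   lam: float = 0.5) -> np.ndarray:
    """Mix this layer's value projection with the first layer's."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(3)
v1 = rng.standard_normal((4, 16))   # first layer's values
v5 = rng.standard_normal((4, 16))   # some later layer's values
v = value_residual(v5, v1)
```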
ValueEmbedding
Value embedding used in the model
parameters: null
Quantization
late QAT
bits: null
scope: artifact
Weight Averaging
EMA
parameters: null
SWA
parameters: null
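The two averaging schemes listed differ in weighting: EMA is an exponentially decayed running average updated every step, while SWA is a uniform mean over collected checkpoints. A sketch (the decay value is an assumption; the submission reports none):

```python
# Weight-averaging sketch: EMA (exponential decay, assumed 0.999) vs. SWA
# (uniform running mean over checkpoints).
import numpy as np

def ema_update(avg, w, decay=0.999):
    return decay * avg + (1 - decay) * w

def swa_update(avg, w, n):
    """Running uniform mean after n checkpoints have been averaged."""
    return (avg * n + w) / (n + 1)

w = np.ones(3)          # stand-in for current weights
ema = np.zeros(3)
for _ in range(5):
    ema = ema_update(ema, w)
swa = np.zeros(3)
for i in range(4):
    swa = swa_update(swa, w, i)
```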
Evaluation
sliding window eval
parameters: {"stride":64}
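Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time, scoring each token exactly once in the window where it has the most left context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
# Sliding-window eval sketch. stride=64 is from the submission; the
# window length (256 here) is an illustrative assumption.
def sliding_spans(n_tokens: int, window: int, stride: int):
    """Return (begin, end, score_from): tokens [score_from, end) are scored
    with context [begin, end); every token is scored exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_spans(300, 256, 64)
```

A smaller stride raises the average context per scored token (lowering bpb) at the cost of proportionally more forward passes.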
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0025,"epochs":6,"freeze_blocks":0}
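"Score-first" TTT is what makes the scheme legal: each evaluation chunk is scored with the current weights before the model takes any gradient steps on it, so no token's score benefits from training on that same token. The tiny linear model and squared loss below are illustrative; only the learning rate and epoch count follow the submission's parameters:

```python
# Score-first TTT sketch: score each chunk, then adapt on it.
# LR and EPOCHS are from the submission's parameters; the 1-d linear
# "model" y ~ w*x is an illustrative stand-in for the LM.
import numpy as np

LR, EPOCHS = 0.0025, 6

def ttt_eval(chunks, w):
    """chunks: list of (x, y) arrays. Returns per-chunk losses and final w."""
    losses = []
    for x, y in chunks:
        losses.append(float(np.mean((w * x - y) ** 2)))   # score first...
        for _ in range(EPOCHS):                           # ...then train
            grad = np.mean(2 * (w * x - y) * x)
            w -= LR * grad
    return losses, w

rng = np.random.default_rng(4)
chunks = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(3)]
losses, w_final = ttt_eval(chunks, w=0.0)
```

With `freeze_blocks: 0`, the submission adapts all blocks rather than freezing any prefix of the network.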
Compression
lzma
level: null
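The packaging step amounts to compressing the quantized weight bytes with stdlib `lzma` and checking the result against the 16 MB budget. A sketch with a stand-in payload (the submission's exact preset/filters are not reported):

```python
# Artifact packaging sketch: lzma-compress the (quantized) weight bytes
# and check the 16 MB limit. The payload and preset are illustrative.
import lzma
import numpy as np

LIMIT = 16 * 1024 * 1024  # 16 MB artifact budget

payload = np.zeros(1 << 20, dtype=np.uint8).tobytes()  # stand-in for weights
blob = lzma.compress(payload, preset=9)
ok = len(blob) <= LIMIT
```

Low-bit quantization (the int6 mentioned under Novel Contributions) helps here twice: fewer raw bytes, and more repetitive byte patterns for lzma to exploit.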
Sequence Length
sequence_length
train_length: 9000
eval_length: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
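Muon accumulates momentum per weight matrix and approximately orthogonalizes the update with a Newton-Schulz iteration before applying it, while AdamW handles non-matrix parameters (consistent with `{"adamw": true}` above). A sketch using the commonly published quintic coefficients; the learning rate, momentum, and the "Parallel" sharding aspect are assumptions or out of scope here:

```python
# Muon-style step sketch: momentum -> Newton-Schulz orthogonalization ->
# update. Coefficients are the commonly published Muon quintic; lr and
# momentum values are illustrative assumptions.
import numpy as np

def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(5)
G = rng.standard_normal((4, 4))
X = newton_schulz(G)
w_new, buf = muon_step(np.zeros((4, 4)), G, np.zeros((4, 4)))
```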
Novel Contributions
- 11-layer Transformer with GQA, XSA on the last 4 layers, Partial RoPE, SmearGate, BigramHash, TrigramHash, ValueEmbedding, and ValueResidual
- Parallel Muon + AdamW optimization with EMA and SWA
- Late QAT and int6+lzma artifact compression to fit under the 16MB limit
- Sliding-window evaluation combined with legal score-first TTT
- Achieved val_bpb=1.11868501 with total artifact size 15,985,833 bytes