val_bpb: 1.1187
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,985,833 bytes
Training Techniques
Architecture
- GQA: grouped-query attention (num_heads: 8, num_kv_heads: 4)
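The grouped-query attention above can be sketched as follows. This is a minimal illustration, not the submission's implementation: head dimension, sequence length, and the causal-mask details are assumptions; only num_heads=8 and num_kv_heads=4 come from the card.

```python
import numpy as np

NUM_HEADS = 8      # query heads (from the card)
NUM_KV_HEADS = 4   # key/value heads (from the card)
HEAD_DIM = 32      # assumed head dimension

def gqa(q, k, v):
    """q: (num_heads, T, d); k, v: (num_kv_heads, T, d).
    Each group of num_heads // num_kv_heads query heads shares one KV head."""
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head in a group sees the same K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Causal mask: position t may only attend to positions <= t.
    T = q.shape[1]
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T = 5
q = rng.standard_normal((NUM_HEADS, T, HEAD_DIM))
k = rng.standard_normal((NUM_KV_HEADS, T, HEAD_DIM))
v = rng.standard_normal((NUM_KV_HEADS, T, HEAD_DIM))
out = gqa(q, k, v)
```

Halving the KV heads shrinks the KV cache and the K/V projection parameters while keeping 8 query heads.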
- XSA: applied to the last 4 layers of the model (layers: 4)
- Partial RoPE: rotary positional embeddings applied to a subset of head dimensions (rope_dims: 16)
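Partial RoPE can be sketched as below: only the first rope_dims=16 dimensions of each head are rotated, and the rest pass through unchanged. The head dimension, frequency base, and pairing layout are assumptions; only rope_dims comes from the card.

```python
import numpy as np

ROPE_DIMS = 16  # from the card

def partial_rope(x, base=10000.0):
    """x: (T, head_dim). Rotary position encoding on x[:, :ROPE_DIMS] only."""
    T, _ = x.shape
    half = ROPE_DIMS // 2
    pos = np.arange(T)[:, None]
    freq = base ** (-np.arange(half) / half)
    angles = pos * freq                       # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair dimension i with dimension i + half and rotate each pair.
    x1, x2 = x[:, :half], x[:, half:ROPE_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    # Remaining dimensions carry no positional rotation.
    return np.concatenate([rotated, x[:, ROPE_DIMS:]], axis=1)

x = np.random.default_rng(1).standard_normal((4, 64))
y = partial_rope(x)
```

Rotation preserves the norm of the rotated slice, and position 0 is left unrotated, which the assertions below check.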
- SmearGate: SmearGate component included in the architecture
- BigramHash: bigram hash embeddings (table size: 1536)
- TrigramHash: trigram hash embeddings (table size: 1024)
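The bigram and trigram hash embeddings can be sketched as follows: each n-gram of token ids is hashed into a fixed-size table, and the corresponding table row is added to the token's embedding. Only the table sizes (1536 and 1024) come from the card; the hash function and lookup scheme here are illustrative assumptions.

```python
BIGRAM_SIZE, TRIGRAM_SIZE = 1536, 1024  # table sizes from the card

def ngram_hash(ids, size):
    # Simple multiplicative hash over the n-gram (illustrative, not the
    # submission's actual hash).
    h = 0
    for t in ids:
        h = (h * 1000003 + t) % (2**61 - 1)
    return h % size

def hash_bucket_ids(tokens):
    """For each position, the (bigram, trigram) table rows whose embedding
    vectors would be added to that token's embedding; None where the
    n-gram does not exist yet."""
    out = []
    for i in range(len(tokens)):
        bi = ngram_hash(tokens[i - 1:i + 1], BIGRAM_SIZE) if i >= 1 else None
        tri = ngram_hash(tokens[i - 2:i + 1], TRIGRAM_SIZE) if i >= 2 else None
        out.append((bi, tri))
    return out

buckets = hash_bucket_ids([5, 17, 17, 5, 17])
```

Identical n-grams always map to the same bucket, so repeated contexts share one embedding row; collisions between different n-grams are accepted in exchange for the small fixed table.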
- ValueEmbedding: value embeddings included in the architecture
- ValueResidual: value residual connections included in the architecture
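A common form of value residual connections mixes each later layer's attention values with the first layer's values through a mixing coefficient. The card does not specify the exact form used, so the blend below is a hedged assumption in that style.

```python
def mix_values(v_layer, v_first, lam):
    """Blend current-layer attention values with the first layer's values.
    lam is a (possibly learned) mixing coefficient; the exact scheme used
    by the submission is not specified in the card."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]

v_first = [1.0] * 8   # stand-in for layer-1 values
v_later = [0.0] * 8   # stand-in for a later layer's values
mixed = mix_values(v_later, v_first, lam=0.25)
```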
Optimizer
- Parallel Muon (adamw: true; weight_decay and momentum not specified)
Weight Averaging
- EMA
- SWA
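The EMA side of the weight averaging can be sketched as below: an exponential moving average of the training weights is maintained alongside the live weights and used for evaluation (SWA would instead keep a uniform running average over checkpoints). The decay value is illustrative; the card gives no parameters.

```python
def ema_update(avg, params, decay=0.99):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    decay=0.99 is an assumed value, not from the card."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0, 0.0]
params = [1.0, 2.0]            # stand-in for the post-step model weights
for _ in range(3):
    avg = ema_update(avg, params)
```

With constant weights, n EMA steps from zero give params * (1 - decay**n), which the assertions verify.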
Quantization
- late QAT (bits not specified; scope: all)
Evaluation
- sliding window eval (stride: 64)
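Sliding-window evaluation with stride 64 can be sketched as follows: overlapping windows slide across the eval sequence so every token is scored with ample left context, and only the final `stride` tokens of each window contribute to the loss. The window length of 256 is an assumption; only the stride comes from the card.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored) spans: score the last n_scored tokens
    of each window, so each token is scored exactly once."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, end))      # first window scores all tokens it covers
    pos = end
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window) # keep up to `window` tokens of context
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(400)
```

Each token after the first window is conditioned on up to 255 preceding tokens rather than being cut off at a hard chunk boundary, at the cost of re-running the model on overlapping context.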
Test-Time Training
- score-first TTT (learning_rate: 0.0025, epochs: 6, freeze_blocks: 0)
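The "score-first" structure can be sketched with a toy model: each eval chunk is scored with the current weights before the model takes gradient steps on it, so the reported loss never benefits from having trained on the tokens being scored. The 1-D least-squares model below is purely illustrative; learning_rate 0.0025 and 6 epochs are taken from the card (freeze_blocks: 0, i.e. no layers are frozen).

```python
def score_first_ttt(chunks, w=0.0, lr=0.0025, epochs=6):
    """chunks: list of (x, y) pairs standing in for eval segments."""
    losses = []
    for x, y in chunks:
        losses.append((w * x - y) ** 2)   # score FIRST, with current weights
        for _ in range(epochs):           # then adapt on the same chunk
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return losses, w

losses, w = score_first_ttt([(1.0, 1.0), (1.0, 1.0)])
```

The first chunk is scored by the untrained weights, and later chunks benefit from adaptation on earlier ones; that ordering is what makes the scheme evaluation-legal.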
Compression
- lzma (level not specified)
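The int6 + lzma packaging mentioned under Novel Contributions can be sketched as below: four 6-bit values are packed into three bytes, and the packed stream is lzma-compressed into the final artifact. The packing layout (big-endian, 4-per-3-bytes) is an assumption; the card confirms only that int6 quantization and lzma are used.

```python
import lzma

def pack_int6(values):
    """Pack ints in [0, 63] into bytes, 4 values per 3 bytes."""
    assert len(values) % 4 == 0
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        bits = (a << 18) | (b << 12) | (c << 6) | d
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals

vals = [0, 63, 17, 42] * 8           # stand-in for quantized weight codes
packed = pack_int6(vals)
blob = lzma.compress(packed)         # artifact bytes that count against the cap
```

Bit-packing removes the 2 wasted bits per byte that a naive one-byte-per-value layout would leave, and lzma then compresses the remaining redundancy in the code stream.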
Sequence Length
- train_length and eval_length not specified
Novel Contributions
- Score-first test-time training (legal under the evaluation rules) to improve validation bpb
- Int6 + lzma artifact packaging under the 16 MB submission cap
- Parameter-banking Transformer with GQA, XSA, Partial RoPE, SmearGate, BigramHash, TrigramHash, ValueEmbedding, and ValueResidual
- Parallel Muon + AdamW optimization with EMA and SWA
- Sliding-window evaluation with stride 64