PR #1747

open

Submission: SP8192 + Partial RoPE (16/64) + GPTQ SDClip + SGD TTT — val_bpb 1.0820 (3-seed mean)

by swapp1990
val_bpb
1.0820
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.86 MB

Training Techniques

Architecture
Partial RoPE
Rotate only the first 16 of 64 head dimensions with RoPE, leaving the remaining 48 dimensions unrotated.
parameters: {"dimensions":16,"head_dim":64}
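A minimal sketch of the scheme above, assuming the usual rotate-half RoPE convention (hypothetical helper name `partial_rope`; only shapes and the 16/64 split are taken from the card):

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply RoPE to the first `rotary_dims` of each head dimension,
    leaving the remaining dims untouched. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

The last 48 dimensions pass through unchanged, so positional information is carried only by the first 16.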
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
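With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A single-sequence sketch (hypothetical helper; batch and masking omitted):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads=4):
    """Grouped-query attention: q is (n_heads, seq, d); k and v are
    (n_kv_heads, seq, d) and are repeated across their query groups."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads            # queries per KV head = 2
    k = np.repeat(k, group, axis=0)          # broadcast KV heads to n_heads
    v = np.repeat(v, group, axis=0)
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over keys
    return w @ v
```

Sharing KV heads halves the KV cache relative to full multi-head attention at these settings.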
MLP4x
11-layer Transformer with 4x MLP expansion.
parameters: {"layers":11,"intermediate_dim":2048}
Quantization
GPTQ
bits: 6
scope: attention and MLP weights
int8
bits: 8
scope: embeddings
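GPTQ itself compensates rounding error column-by-column using second-order (Hessian) information, which is beyond a card-sized sketch; as a simplified stand-in, here is plain round-to-nearest symmetric quantization at the two bit-widths listed above:

```python
import numpy as np

def quantize_rtn(w, bits):
    """Round-to-nearest symmetric per-row quantization (simplified;
    GPTQ additionally applies Hessian-based error compensation)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits the integer range is [-32, 31]; at 8 bits, [-128, 127], so embeddings keep roughly four times the resolution of the weight matrices.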
Compression
Brotli
level: 11
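The contributions list pairs Brotli-11 with byte shuffling. A sketch of the shuffle step alone (hypothetical helpers; the grouped bytes would then be handed to Brotli at level 11):

```python
import numpy as np

def byte_shuffle(arr):
    """Group the i-th byte of every element together so bytes of equal
    significance are adjacent, which typically compresses better."""
    b = arr.view(np.uint8).reshape(arr.size, arr.dtype.itemsize)
    return b.T.copy().ravel()

def byte_unshuffle(buf, dtype, count):
    """Invert byte_shuffle given the original dtype and element count."""
    itemsize = np.dtype(dtype).itemsize
    return buf.reshape(itemsize, count).T.copy().ravel().view(dtype)
```

Quantized weights have highly repetitive high-order bytes, so grouping them tends to lengthen the runs Brotli can exploit.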
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"scope":"all-weights","chunk_size":2048}
Weight Averaging
EMA
parameters: {"decay":0.9965}
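The EMA update itself is one line per tensor; a sketch with the decay from the card (hypothetical helper name):

```python
import numpy as np

def ema_update(avg, params, decay=0.9965):
    """One EMA step over a list of weight arrays:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

At decay 0.9965 the average has an effective horizon of roughly 1/(1-0.9965) ≈ 286 steps.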
Evaluation
sliding window eval
parameters: {"prefix_only":true}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"chunk_size":2048}
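A toy illustration of the score-then-update loop on a scalar model (hypothetical; the submission applies momentum SGD over all weights per 2048-token chunk, with the card's lr and momentum):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.005, momentum=0.9):
    """Score-first TTT on a toy model y = w * x: each chunk is scored
    with the current weight *before* the model trains on it, so no
    chunk's own update leaks into its score."""
    vel, losses = 0.0, []
    for x, y in chunks:
        pred = w * x
        losses.append(float(np.mean((pred - y) ** 2)))  # score first
        grad = float(np.mean(2.0 * (pred - y) * x))      # then one SGD step
        vel = momentum * vel - lr * grad
        w += vel
    return w, losses
```

The ordering is the compliance point: the reported loss on each chunk reflects only adaptation to earlier chunks.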
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":20}
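The card only specifies warmup_steps; assuming the common shape of linear warmup to the base LR followed by linear decay ("warmdown") to zero:

```python
def warmdown_lr(step, total_steps, base_lr=0.005, warmup_steps=20):
    """Assumed schedule: linear warmup over the first 20 steps, then
    linear decay to zero at total_steps (shape not stated on the card)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```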
Other
other
QK-Gain scaling applied to attention logits.
parameters: {"gain":5.25}
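The card does not spell out the formulation; one common reading is QK-normalized (cosine-similarity) attention with a fixed gain on the logits, sketched here under that assumption:

```python
import numpy as np

def qk_gain_logits(q, k, gain=5.25):
    """Assumed QK-gain form: L2-normalize q and k, then scale the
    resulting cosine-similarity logits by a fixed gain."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return gain * (qn @ kn.T)
```

Under this reading the logits are bounded by ±gain, which keeps the softmax temperature fixed regardless of query/key magnitudes.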

Novel Contributions

  • Partial RoPE using only the first 16 of 64 head dimensions
  • GPTQ + SDClip quantization stack with int6 weights and int8 embeddings
  • SGD all-weights test-time training with score-first compliance
  • Brotli-11 compression with byte shuffling
  • Reported 3-seed mean validation BPB of 1.0820