PR #1747

open

Submission: SP8192 + Partial RoPE (16/64) + GPTQ SDClip + SGD TTT — val_bpb 1.0820 (3-seed mean)

by swapp1990
val_bpb
1.0820
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.86 MB

Training Techniques

Architecture
Partial RoPE
Rotate only the first 16 of 64 head dimensions with RoPE, leaving the remaining 48 dimensions unrotated.
parameters: {"dimensions":16,"head_dim":64}
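A minimal sketch of the scheme above, assuming the usual rotate-half RoPE convention (hypothetical helper name `partial_rope`; only shapes and the 16/64 split are taken from the card):

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply RoPE to the first `rotary_dims` of each head dimension,
    leaving the remaining dims untouched. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

The last 48 dimensions pass through unchanged, so positional information is carried only by the first 16.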
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
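With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A single-sequence sketch (hypothetical helper; batch and masking omitted):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads=4):
    """Grouped-query attention: q is (n_heads, seq, d); k and v are
    (n_kv_heads, seq, d) and are repeated across their query groups."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads            # queries per KV head = 2
    k = np.repeat(k, group, axis=0)          # broadcast KV heads to n_heads
    v = np.repeat(v, group, axis=0)
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over keys
    return w @ v
```

Sharing KV heads halves the KV cache relative to full multi-head attention at these settings.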
MLP4x
11-layer Transformer with 4x MLP expansion.
parameters: {"layers":11,"intermediate_dim":2048}
Quantization
GPTQ
bits: 6
scope: attention and MLP weights
int8
bits: 8
scope: embeddings
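GPTQ itself compensates rounding error column-by-column using second-order (Hessian) information, which is beyond a card-sized sketch; as a simplified stand-in, here is plain round-to-nearest symmetric quantization at the two bit-widths listed above:

```python
import numpy as np

def quantize_rtn(w, bits):
    """Round-to-nearest symmetric per-row quantization (simplified;
    GPTQ additionally applies Hessian-based error compensation)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits the integer range is [-32, 31]; at 8 bits, [-128, 127], so embeddings keep roughly four times the resolution of the weight matrices.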
Compression
Brotli
level: 11
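The contributions list pairs Brotli-11 with byte shuffling. A sketch of the shuffle step alone (hypothetical helpers; the grouped bytes would then be handed to Brotli at level 11):

```python
import numpy as np

def byte_shuffle(arr):
    """Group the i-th byte of every element together so bytes of equal
    significance are adjacent, which typically compresses better."""
    b = arr.view(np.uint8).reshape(arr.size, arr.dtype.itemsize)
    return b.T.copy().ravel()

def byte_unshuffle(buf, dtype, count):
    """Invert byte_shuffle given the original dtype and element count."""
    itemsize = np.dtype(dtype).itemsize
    return buf.reshape(itemsize, count).T.copy().ravel().view(dtype)
```

Quantized weights have highly repetitive high-order bytes, so grouping them tends to lengthen the runs Brotli can exploit.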
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"scope":"all-weights","chunk_size":2048}
Weight Averaging
EMA
parameters: {"decay":0.9965}
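The EMA update itself is one line per tensor; a sketch with the decay from the card (hypothetical helper name):

```python
import numpy as np

def ema_update(avg, params, decay=0.9965):
    """One EMA step over a list of weight arrays:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

At decay 0.9965 the average has an effective horizon of roughly 1/(1-0.9965) ≈ 286 steps.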
Evaluation
sliding window eval
parameters: {"prefix_only":true}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"chunk_size":2048}
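A toy illustration of the score-then-update loop on a scalar model (hypothetical; the submission applies momentum SGD over all weights per 2048-token chunk, with the card's lr and momentum):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.005, momentum=0.9):
    """Score-first TTT on a toy model y = w * x: each chunk is scored
    with the current weight *before* the model trains on it, so no
    chunk's own update leaks into its score."""
    vel, losses = 0.0, []
    for x, y in chunks:
        pred = w * x
        losses.append(float(np.mean((pred - y) ** 2)))  # score first
        grad = float(np.mean(2.0 * (pred - y) * x))      # then one SGD step
        vel = momentum * vel - lr * grad
        w += vel
    return w, losses
```

The ordering is the compliance point: the reported loss on each chunk reflects only adaptation to earlier chunks.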
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":20}
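The card only specifies warmup_steps; assuming the common shape of linear warmup to the base LR followed by linear decay ("warmdown") to zero:

```python
def warmdown_lr(step, total_steps, base_lr=0.005, warmup_steps=20):
    """Assumed schedule: linear warmup over the first 20 steps, then
    linear decay to zero at total_steps (shape not stated on the card)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```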
Other
other
QK-Gain scaling applied to attention logits.
parameters: {"gain":5.25}
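The card does not spell out the formulation; one common reading is QK-normalized (cosine-similarity) attention with a fixed gain on the logits, sketched here under that assumption:

```python
import numpy as np

def qk_gain_logits(q, k, gain=5.25):
    """Assumed QK-gain form: L2-normalize q and k, then scale the
    resulting cosine-similarity logits by a fixed gain."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return gain * (qn @ kn.T)
```

Under this reading the logits are bounded by ±gain, which keeps the softmax temperature fixed regardless of query/key magnitudes.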

Novel Contributions

  • Partial RoPE using only the first 16 of 64 head dimensions
  • GPTQ + SDClip quantization stack with int6 weights and int8 embeddings
  • SGD all-weights test-time training with score-first compliance
  • Brotli-11 compression with byte shuffling
  • Reported 3-seed mean validation BPB of 1.0820