PR #1714
openSP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.5 + SGD-TTT [LoRA-TTT Future Work]
by Anakintano
val_bpb
1.0857
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB
Training Techniques
Architecture
depth recurrence
Layers 3, 4, and 5 are each run two extra times (three passes per layer), yielding 17 effective layers from 11 physical layers.
parameters: {"layers":[3,4,5],"loops":2}
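A minimal sketch of the execution schedule this implies, assuming "loops: 2" means two extra passes over the looped block (which is what makes 11 physical layers come out to 17 effective ones); the function name is illustrative, not the PR's code:

```python
def build_schedule(n_layers=11, looped=(3, 4, 5), loops=2):
    """Return the order in which physical layer indices are executed."""
    schedule = []
    for i in range(n_layers):
        schedule.append(i)
        if i == looped[-1]:
            # after finishing the looped block, repeat it `loops` more times
            for _ in range(loops):
                schedule.extend(looped)
    return schedule
```

With the defaults this produces 17 entries, with layers 3, 4, and 5 each appearing three times.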
U-Net skip connections
Learned skip gates add earlier-layer activations back into mirrored later layers, U-Net style.
parameters: null
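A toy sketch of gated U-Net skips, assuming the common scheme where each "decoder" layer receives the activation of its mirrored "encoder" layer scaled by a learned gate; layer and gate values here are placeholders:

```python
def unet_forward(x, encoder_layers, decoder_layers, gates):
    """Run encoder layers, stash activations, feed them back gated into decoders."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    # deepest skip pairs with the first decoder layer (U-Net mirroring)
    for layer, gate in zip(decoder_layers, gates):
        x = layer(x + gate * skips.pop())
    return x
```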
GQA
Grouped-query attention: 8 query heads share 4 KV heads (2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
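An illustrative (bidirectional, unmasked) sketch of the head grouping, showing only how each query head is mapped onto its shared KV head:

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q is (heads, T, d); k and v are (kv_heads, T, d)."""
    group = heads // kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group  # map query head -> its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)  # row-wise softmax
        out[h] = w @ v[kv]
    return out
```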
Partial RoPE
Rotary position embeddings applied to only a 16-dimensional slice of each head; the remaining dimensions receive no positional rotation.
parameters: {"dimensions":16}
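A sketch of partial RoPE in the common half-split (rotate-half) convention; only the dimension count comes from the entry, and the base frequency and split convention are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate dims [0, rot_dims) of x (shape (T, d)); pass the rest through."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```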
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
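One plausible sign-preserving form of a LeakyReLU-squared activation with the entry's slope of 0.5; whether the PR squares with or without preserving the negative branch's sign is an assumption:

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU, then square while keeping the output's sign."""
    y = x if x >= 0.0 else slope * x
    return y * abs(y)  # sign-preserving square
```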
weight tying
Input and output embeddings are tied.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: weight matrices
GPTQ
bits: 8
scope: embeddings
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"scalars/embeddings"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
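The EMA update this implies, as a one-liner sketch over flat weight lists (the decay is from the entry; applying it after every optimizer step is an assumption):

```python
def ema_update(shadow, weights, decay=0.9965):
    """Move the shadow (EMA) copy a small step toward the live weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```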
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
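A sketch of how stride-64 sliding-window evaluation partitions a token stream: each 64-token chunk is scored once, with up to 2048 tokens of window behind it. The exact windowing convention in the PR is an assumption:

```python
def window_spans(n_tokens, context=2048, stride=64):
    """Return (window_start, scored_from, window_end) triples covering all tokens."""
    spans = []
    for scored_from in range(0, n_tokens, stride):
        window_end = min(scored_from + stride, n_tokens)
        window_start = max(0, window_end - context)  # clip window to context
        spans.append((window_start, scored_from, window_end))
    return spans
```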
Test-Time Training
SGD TTT
parameters: {"epochs_per_chunk":3,"momentum":0.9,"score_before_update":true}
LoRA TTT
parameters: {"rank_qv":4,"rank_mlp_gate":2,"epochs_per_chunk":12,"frozen_base":true}
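A minimal sketch of a frozen-base LoRA layer of the kind this future-work item describes: only the low-rank factors would be trained at test time, and resetting them between documents restores the base model exactly. Class and attribute names are illustrative:

```python
import numpy as np

class LoRALinear:
    def __init__(self, weight, rank=4, scale=1.0):
        self.weight = weight                   # frozen base, shape (out, in)
        out_dim, in_dim = weight.shape
        self.A = np.zeros((out_dim, rank))               # trainable, zero-init
        self.B = np.random.randn(rank, in_dim) * 0.01    # trainable
        self.scale = scale

    def __call__(self, x):
        # base path plus low-rank correction; only A and B would get gradients
        return x @ self.weight.T + self.scale * (x @ self.B.T @ self.A.T)
```

Zero-initializing `A` means the adapter starts as an exact no-op, a standard LoRA convention.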
LR Schedule
cosine decay
parameters: {"warmup_steps":20,"warmdown_fraction":0.72}
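A sketch of the schedule these parameters suggest: linear warmup for 20 steps, then a flat plateau, then cosine decay over the final 72% of training. The peak LR and the existence of a plateau between warmup and warmdown are assumptions:

```python
import math

def lr_at(step, total_steps, peak_lr=1.0, warmup_steps=20,
          warmdown_fraction=0.72):
    """Linear warmup -> constant plateau -> cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```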
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- LoRA-TTT with frozen base model and low-rank adapters updated during test-time training
- Recur-Alpha learned carry scalar for recurrent blocks
- QK-Gain 5.5 per-head query scaling
- 3-layer depth recurrence with parallel residuals
- SGD-TTT fallback with score-before-update compliance
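Of the contributions above, QK-Gain is the most self-contained to sketch: a per-head scalar gain (5.5 in the title) multiplies the queries before the attention dot product. Whether the gain is applied before or after the 1/sqrt(d) factor, and whether it is learned or fixed, is an assumption here:

```python
import numpy as np

def apply_qk_gain(q, gains):
    """Scale queries per head: q is (heads, T, d), gains is (heads,)."""
    return q * gains[:, None, None]
```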