val_bpb: 1.0576
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.98 MB

Training Techniques
Sequence Length
train_length: 3072
eval_length: 3072
LR Schedule: warmdown
parameters: {"warmdown_frac":0.85}
Test-Time Training: LoRA TTT
parameters: {"rank":80,"mask":"no_qv","local_lr_mult":0.75,"short_score_first":true}
Architecture
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":16}
depth recurrence
Layers are looped recurrently in the middle of the network.
parameters: {"layers":[3,4,5],"frac":0.35}
XSA
Applies XSA across all 11 layers.
parameters: {"layers":11}
SmearGate
BOS-fixed SmearGate gating is used.
parameters: null
Gated Attention
Gated attention implemented via SparseAttnGate.
parameters: {"gate_window":12,"scale":0.5}
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
Optimizer: Muon
weight_decay: null
momentum: null
other_params: {"adam_on_embedding_scalars":true,"beta2":0.99}
Weight Averaging: EMA
parameters: {"decay":0.9965}
Quantization
GPTQ
bits: 6
scope: matrices
mixed int7/int6
bits: 7 (embeddings) / 6 (matrices)
scope: embeddings and matrices
LQER
bits: 4
scope: correction
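
LQER-style correction keeps a low-rank reconstruction of the quantization error alongside the quantized weight; the card's 4-bit "correction" scope would apply to storing those factors. A sketch of the idea, with the rank chosen for illustration (the card does not state it):

```python
import torch

def lqer_correction(w: torch.Tensor, quantize, rank: int = 32):
    """Quantize W, then approximate the residual W - Q(W) with a rank-`rank` factor pair.

    Inference reconstructs the weight as Q(W) + a @ b; a and b are what a 4-bit
    "correction" quantization would be applied to.
    """
    w_q = quantize(w)                                # any weight quantizer (e.g. per-group int)
    u, s, vh = torch.linalg.svd((w - w_q).float(), full_matrices=False)
    a = u[:, :rank] * s[:rank]                       # (out_features, rank)
    b = vh[:rank, :]                                 # (rank, in_features)
    return w_q, a, b
```
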
Compression: pergroup
level: null
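
Per-group compression here presumably means one quantization scale per small group of weights rather than per tensor or per channel; the group size below is illustrative, since the level field is null.

```python
import torch

def quantize_pergroup(w: torch.Tensor, bits: int = 6, group_size: int = 64):
    """Symmetric per-group quantization: one scale per contiguous group of weights.

    Assumes group_size divides the input dimension. Dequantize with q * scales.
    """
    out_f, in_f = w.shape
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-12)
    q = torch.clamp(torch.round(groups / scales), min=-qmax - 1, max=qmax)
    return q.to(torch.int8), scales
```
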
Evaluation: stride-based eval
parameters: {"stride":1536,"context_length":3072}
Regularization: weight decay
parameters: {"value":0.5}
Novel Contributions
- Progressive training-context schedule from 1k to 3k context (sketched after this list)
- Short-document score-first TTT chunk schedule
- Long-context TTT mask removing Q/V adapters
- Combined recurrent-transformer RT-KV experiment on the CaseOps/SP8192 lineage
- Maintains full validation target coverage while staying under the artifact cap
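
A sketch of the progressive context schedule mentioned above, assuming a simple linear ramp from 1024 to 3072 tokens over training; the actual shape and granularity of the schedule are not stated.

```python
def train_context_length(step: int, total_steps: int,
                         start_len: int = 1024, end_len: int = 3072,
                         round_to: int = 128) -> int:
    """Linearly grow the training context from ~1k to 3k tokens, rounded to a block size."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    length = start_len + frac * (end_len - start_len)
    return int(round(length / round_to)) * round_to
```
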