PR #1766

open

SP8192 + CaseOps + Loop345 + Recur-Alpha + PhasedTTT

by tashapais
val_bpb
1.0655
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~16MB

Training Techniques

Architecture
depth recurrence
3-layer looped recurrence over layers 3, 4, and 5 with 2 loops, creating 17 virtual layers; activates at 35% of training.
parameters: {"layers":3,"loops":2,"virtual_layers":17,"activate_at":35}
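A minimal sketch of the looping pattern, assuming the listed parameters mean a contiguous 3-layer block is re-applied for a total of 2 passes. The layer functions, physical depth, and indices below are toy stand-ins, not the PR's actual modules, and the activation schedule (35% of training) is omitted:

```python
# Hypothetical depth-recurrence sketch: a 3-layer block is looped, so extra
# passes over it yield more "virtual" layers than physical layers.

def run_with_depth_recurrence(x, layers, loop_start=3, loop_len=3, loops=2):
    """Apply `layers` in order, repeating layers[loop_start:loop_start+loop_len]
    `loops` times in total."""
    for layer in layers[:loop_start]:
        x = layer(x)
    block = layers[loop_start:loop_start + loop_len]
    for _ in range(loops):                 # 2 passes over the 3-layer block
        for layer in block:
            x = layer(x)
    for layer in layers[loop_start + loop_len:]:
        x = layer(x)
    return x

# Toy layers: each adds 1, so the output counts layer applications.
layers = [lambda x: x + 1 for _ in range(12)]
out = run_with_depth_recurrence(0, layers)
# 12 physical applications + 3 extra from the second loop = 15 virtual layers
```

With the PR's real physical depth the same scheme yields the stated 17 virtual layers; the 12-layer toy above is only to make the counting visible.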
Gated Attention
Per-head sigmoid output gate for attention.
parameters: {"init_std":0.01}
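A sketch of one plausible form of the per-head gate: a learned per-head projection of the layer input, squashed through a sigmoid and multiplied onto that head's attention output. The small `init_std` keeps gate logits near 0, i.e. gates near 0.5, at init. The vector shapes and gate parameterization here are assumptions:

```python
import math

# Hypothetical per-head sigmoid output gate on attention head outputs.
def gated_heads(x, head_outputs, gate_weights):
    """x: layer input vector; head_outputs: one output vector per head;
    gate_weights: one learned weight vector per head (init std ~0.01)."""
    gated = []
    for h_out, w in zip(head_outputs, gate_weights):
        logit = sum(xi * wi for xi, wi in zip(x, w))   # scalar gate logit
        g = 1.0 / (1.0 + math.exp(-logit))             # sigmoid in (0, 1)
        gated.append([g * v for v in h_out])
    return gated

x = [1.0, -1.0]
heads = [[2.0, 2.0], [4.0, 4.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]   # near-zero init -> gates ~ 0.5
out = gated_heads(x, heads, weights)
# each head scaled by sigmoid(0) = 0.5 -> [[1.0, 1.0], [2.0, 2.0]]
```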
CaseOps
Bijective, lossless case preprocessing using operator tokens and a byte sidecar, so BPB is computed on the original UTF-8 bytes.
parameters: null
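A toy sketch of the bijective idea: lowercase the text and emit an explicit uppercase-marker token before each character that was uppercase, so the transform is exactly invertible. The marker byte and single-character scheme are illustrative, not the PR's actual operator tokens, and the sketch assumes the marker never occurs in the input (and ignores Unicode characters whose case round trip is lossy):

```python
# Hypothetical lossless case transform with an uppercase-marker token.
UP = "\x0e"  # marker: the next character was uppercase in the original

def case_encode(text):
    out = []
    for ch in text:
        if ch.isupper():
            out.append(UP)          # emit case-operator token
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def case_decode(encoded):
    out, upper_next = [], False
    for ch in encoded:
        if ch == UP:
            upper_next = True
        else:
            out.append(ch.upper() if upper_next else ch)
            upper_next = False
    return "".join(out)

s = "Hello World"
assert case_decode(case_encode(s)) == s   # exactly invertible round trip
```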
Parallel Residuals
GPT-J-style parallel residual connections from layer 8 onward.
parameters: {"start_layer":8}
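The GPT-J parallel form can be contrasted with the sequential form in a scalar toy: attention and MLP both read the same block input and their outputs sum onto one residual stream, instead of the MLP reading the attention-updated state. The stand-in sublayers below are arbitrary linear maps, and normalization is omitted:

```python
# Sequential (standard) vs. parallel (GPT-J-style) residual blocks.
def sequential_block(x, attn, mlp):
    x = x + attn(x)          # MLP sees the attention-updated state
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    return x + attn(x) + mlp(x)   # both branches see the same input

attn = lambda x: 2 * x
mlp = lambda x: 3 * x
seq = sequential_block(1, attn, mlp)   # (1 + 2) + 3*3 = 12
par = parallel_block(1, attn, mlp)     # 1 + 2 + 3 = 6
```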
QK-Gain
Learned per-head query scalar gain.
parameters: {"gain":5}
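A sketch of one plausible reading of QK-gain: each head owns a learned scalar that multiplies its queries before the Q·K dot product, letting heads learn their own attention-logit temperature. The initial value 5 mirrors the listed parameter; the per-dimension score layout is an assumption:

```python
# Hypothetical per-head query gain applied before the Q.K dot product.
def scaled_scores(q, k, gain):
    """Per-dimension logit contributions of one head, scaled by its gain."""
    return [gain * qi * ki for qi, ki in zip(q, k)]

q, k = [1.0, 2.0], [0.5, 0.5]
scores = scaled_scores(q, k, gain=5.0)
# gain scales every logit term: [2.5, 5.0]
```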
Recur-Alpha
Learned scalar carry per looped block that adds a weighted copy of the first-visit activation to later recurrence passes.
parameters: {"init":0,"scalars":3}
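The description above can be sketched directly: each looped layer keeps one learned scalar (init 0, so 3 scalars for the 3-layer block) and, on recurrence passes after the first, adds that scalar times the activation saved from the first visit. Layer functions are toy stand-ins:

```python
# Hypothetical Recur-Alpha carry inside a looped block.
def looped_block(x, block, alphas, loops=2):
    first_visit = [None] * len(block)
    for pass_idx in range(loops):
        for i, layer in enumerate(block):
            x = layer(x)
            if pass_idx == 0:
                first_visit[i] = x                   # save first-visit activation
            else:
                x = x + alphas[i] * first_visit[i]   # learned weighted carry
    return x

block = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 1]
base = looped_block(0, block, [0.0, 0.0, 0.0])    # zero init: plain looping
carried = looped_block(0, block, [1.0, 0.0, 0.0])  # nonzero carry changes output
```

With all alphas at their init of 0, the mechanism is an exact no-op, which makes it safe to enable mid-training.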
Quantization
GPTQ
bits: 6
scope: matrices; embeddings int8
int8
bits: 8
scope: attn_gate_w per-row
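The per-row int8 scope can be illustrated with plain round-to-nearest symmetric quantization: each row gets its own scale `max|w| / 127`. Note actual GPTQ additionally uses second-order error compensation when choosing the quantized values, which this sketch omits:

```python
# Per-row symmetric int8 quantization (round-to-nearest sketch, not GPTQ).
def quantize_rows_int8(rows):
    quantized = []
    for row in rows:
        scale = max(abs(v) for v in row) / 127 or 1.0  # avoid zero scale
        q = [round(v / scale) for v in row]            # ints in [-127, 127]
        quantized.append((q, scale))
    return quantized

def dequantize(quantized):
    return [[qi * scale for qi in q] for q, scale in quantized]

w = [[0.5, -1.0], [2.0, 0.25]]
w_hat = dequantize(quantize_rows_int8(w))
# w_hat matches w to within half a quantization step per row
```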
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_normalized":true,"combined_with":"AdamW"}
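A hedged sketch of the listed `row_normalized` behavior: the momentum-smoothed gradient is normalized per row before the weight step. Real Muon orthogonalizes the whole update matrix via Newton-Schulz iterations; plain row normalization below is a deliberate simplification, and the hyperparameters are illustrative:

```python
import math

# Simplified Muon-like step: momentum, then per-row normalization of the update.
def muon_like_step(weights, grads, momentum_buf, lr=0.02, beta=0.95):
    for r, grad_row in enumerate(grads):
        momentum_buf[r] = [beta * m + g for m, g in zip(momentum_buf[r], grad_row)]
        norm = math.sqrt(sum(m * m for m in momentum_buf[r])) or 1.0
        weights[r] = [w - lr * m / norm for w, m in zip(weights[r], momentum_buf[r])]
    return weights

w = [[1.0, 0.0]]
buf = [[0.0, 0.0]]
w = muon_like_step(w, [[3.0, 4.0]], buf, lr=0.1)
# update direction has unit row norm: step = 0.1 * [3/5, 4/5]
```

The `combined_with: AdamW` field suggests non-matrix parameters (embeddings, gains, scalars) take ordinary AdamW steps instead.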
Test-Time Training
LoRA TTT
parameters: {"mode":"score-first","reset_per_doc":true,"lr_decay":"cosine"}
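The three listed parameters describe a per-document loop, sketched below: each document is scored with fresh (zeroed) adapters first, then the LoRA parameters are fit on that document with a cosine-decayed learning rate, and the adapters are reset before the next document. The stand-in `score_fn`/`adapt_step_fn` closures and step counts are hypothetical, not the PR's model or LoRA implementation:

```python
import math

# Hypothetical phased LoRA TTT loop: score-first, cosine lr decay, per-doc reset.
def ttt_over_docs(docs, score_fn, adapt_step_fn, steps=4, lr_max=1e-3):
    scores = []
    for doc in docs:
        adapter = 0.0                             # reset_per_doc: fresh adapter
        scores.append(score_fn(doc, adapter))     # score BEFORE adapting
        for t in range(steps):
            lr = 0.5 * lr_max * (1 + math.cos(math.pi * t / steps))  # cosine decay
            adapter = adapt_step_fn(doc, adapter, lr)
    return scores

score_fn = lambda doc, adapter: len(doc) + adapter
adapt_step_fn = lambda doc, adapter, lr: adapter + lr
scores = ttt_over_docs(["ab", "c"], score_fn, adapt_step_fn)
# adapters never leak across documents, so each score reflects frozen weights
```

Scoring before adapting keeps the reported BPB honest: the model never sees a document's own bytes before being evaluated on them.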
Compression
brotli
level: null
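The byte-shuffle half of the pipeline (named under Novel Contributions) can be sketched as a lossless byte transpose: group the i-th bytes of fixed-width records together so similar-magnitude bytes sit adjacently, which entropy coders exploit. `zlib` stands in for Brotli below purely to keep the example stdlib-only; the PR compresses with Brotli, and the record width is an assumption:

```python
import zlib

# Byte-shuffle: reorder bytes of `width`-byte records as all byte-0s,
# then all byte-1s, etc. Exactly invertible.
def byte_shuffle(data, width):
    n = len(data) // width
    return bytes(data[j * width + i] for i in range(width) for j in range(n))

def byte_unshuffle(data, width):
    n = len(data) // width
    return bytes(data[i * n + j] for j in range(n) for i in range(width))

raw = bytes([1, 0, 2, 0, 3, 0, 4, 0])        # little-endian uint16 values
shuffled = byte_shuffle(raw, 2)              # low bytes first, then high bytes
assert byte_unshuffle(shuffled, 2) == raw    # lossless round trip
packed = zlib.compress(shuffled)             # stand-in for Brotli
```

After the shuffle, the run of identical high bytes compresses much better than the original interleaved layout.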
LR Schedule
cosine decay
parameters: {"phase":"TTT"}
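The cosine decay used for the TTT phase is standard; a minimal form, with `lr_min=0` and the step counts below chosen only for illustration:

```python
import math

# Cosine decay from lr_max to lr_min over `total` steps (half-cosine).
def cosine_lr(step, total, lr_max, lr_min=0.0):
    progress = step / total
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, 10, 1.0) for s in (0, 5, 10)]
# full lr at step 0, half at the midpoint, lr_min at the end
```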

Novel Contributions

  • Recur-Alpha: a learned scalar carry added to recurrent looped blocks
  • First composition of Recur-Alpha with the CaseOps + phased TTT stack
  • 3-scalar carry mechanism across the looped blocks in depth recurrence
  • Byte-shuffle + Brotli compression