PR #1771

open

Record: SP8192 CaseOps + V13 Curriculum + SmearGate + LoRA-TTT — val_bpb 1.06513 (3-seed mean)

by bigbag

val_bpb

1.0651

Architecture

Transformer

Optimizer

AdamW

Artifact Size

~15.98 MB

Training Techniques

Architecture

depth recurrence

Phased recurrence-depth curriculum with evaluation at depth 4.

parameters: {"layers":11,"depths":[1,3,4]}
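
A phased depth curriculum can be pictured as a mapping from training step to recurrence depth. The sketch below is illustrative only: it assumes the three listed depths occupy equal-length phases, and the function name and phase boundaries are hypothetical, since the PR summary gives only the depth values.

```python
def recurrence_depth(step: int, total_steps: int, depths=(1, 3, 4)) -> int:
    """Return the recurrence depth used at a given training step.

    Assumption: training is split into len(depths) equal-length phases.
    The summary lists only the depths [1, 3, 4]; the real phase
    boundaries are not stated there.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    phase = step * len(depths) // total_steps
    return depths[phase]


EVAL_DEPTH = 4  # evaluation runs at the deepest curriculum depth
```

Under this assumption, the last phase trains at the same depth that is used for evaluation.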

SmearGate

Per-layer learned smoothing gate blending adjacent token representations.

parameters: null
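
One plausible reading of this description is a learned convex blend of each token with its predecessor. The sketch below assumes one sigmoid gate per channel and leaves the first token unchanged; both choices, and the function name, are assumptions rather than details from the PR.

```python
import math


def smear_gate(tokens, gate_logits):
    """Blend each token vector with the previous token's vector.

    tokens: list of T vectors (lists of floats), one per position.
    gate_logits: one learned logit per channel (hypothetical layout).
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = []
    for t, x in enumerate(tokens):
        prev = tokens[t - 1] if t > 0 else x  # first token has no predecessor
        out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
    return out
```

Because each position reads only itself and the position before it, the blend stays causal.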

Gated Attention

Full-dimension attention output gating with QuantGate passthrough.

parameters: null

LeakyReLU

Uses a squared LeakyReLU activation.

parameters: {"slope":0.5}
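
A minimal sketch of the activation, assuming "LeakyReLU squared" means squaring the LeakyReLU output with the listed negative slope of 0.5. The function name is illustrative.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that under this reading the output is always non-negative: an input of -2.0 maps to -1.0 after the leaky step and to 1.0 after squaring.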

Partial RoPE

Applies rotary position embeddings to a subset of dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
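
The idea of rotating only 16 of the 64 head dimensions can be sketched as follows. The pairing of adjacent (even, odd) entries and the frequency base of 10000 are assumptions; the summary states only the dimension counts.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector.

    Assumptions: adjacent (even, odd) pairs are rotated together and
    the frequency base is 10000. The remaining entries pass through.
    """
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```

The 48 untouched dimensions carry no positional signal, while the rotated pairs keep their norm.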

weight tying

Tied input embeddings and output embeddings.

parameters: null

Quantization

GPTQ

bits: 6

scope: attention/MLP

GPTQ

bits: 7

scope: embeddings
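
To give a feel for what 6-bit and 7-bit grids mean, here is a sketch of plain symmetric round-to-nearest quantization. This is a simplified stand-in, not GPTQ: GPTQ additionally compensates rounding error using second-order statistics from calibration data, which is omitted here. The per-row scale and the function names are assumptions.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of one weight row (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, 63 for 7 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error per weight is at most half a quantization step, so the 7-bit grid used for embeddings is roughly twice as fine as the 6-bit grid.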

Test-Time Training

LoRA TTT

parameters: {"alpha":144,"rank":96,"weight_decay":1,"warm_start_A":true,"phased":true,"prefix_docs":2000,"num_phases":3}
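
In the standard LoRA formulation the low-rank update is scaled by alpha / rank, which for the listed values is 144 / 96 = 1.5. The sketch below shows only that scaling on tiny matrices; the warm-start and phasing options are not modeled, and the helper name is hypothetical.

```python
ALPHA, RANK = 144, 96
SCALE = ALPHA / RANK  # 1.5 for the listed values


def lora_delta(B, A, scale=SCALE):
    """Return scale * (B @ A) for B of shape (out, r) and A of shape (r, in)."""
    rows, inner, cols = len(B), len(A), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
```

With B initialized to zeros, as is conventional for LoRA, the update starts at zero regardless of how A is initialized.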

Optimizer

AdamW

weight_decay: 1

momentum: null

other_params: {"learning_rate":0.0001}
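
With decoupled weight decay, a learning rate of 0.0001 and weight decay of 1 shrink each weight by a factor of 1 - 0.0001 × 1 = 0.9999 per step, before the gradient term is applied. A scalar sketch of one AdamW update follows; the beta and epsilon values are common defaults assumed for illustration, not taken from the PR.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-4, wd=1.0, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter (decoupled weight decay).

    b1, b2 and eps are assumed defaults; the summary lists only lr and wd.
    t is the 1-based step count used for bias correction.
    """
    p = p * (1.0 - lr * wd)  # decay acts directly on the weight
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```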

Regularization

logit softcap

parameters: {"value":30}
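
A logit softcap is commonly implemented as a scaled tanh, which keeps logits inside (-cap, cap) while staying close to the identity for small values. The sketch assumes that form with the listed cap of 30; the PR summary does not spell out the exact function.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into the range [-cap, cap] (tanh form assumed)."""
    return cap * math.tanh(logit / cap)
```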

layerwise LN scale

parameters: null

Novel Contributions

SP8192 CaseOps reversible case normalization tokenizer
Recurrence depth curriculum with prewarmed depths 3 and 4
SmearGate combined with Gated Attention
QuantGate-enabled GPTQ passthrough for gates
LoRA TTT improvements including alpha/rank scaling and warm-start A
Phased score-first TTT without rescoring
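
The "score-first, no rescoring" idea in the last item can be illustrated with a toy loop: each item is scored with the state the model has before seeing it, and only then used for adaptation. In this sketch the "model" is a single running estimate and the "loss" is squared error, both stand-ins chosen for illustration; the actual method would update LoRA adapter weights instead.

```python
def score_first_ttt(stream, lr=0.5):
    """Score each item before adapting on it; earlier scores are never revised.

    Toy stand-in: the model is one running estimate, the loss is squared error.
    """
    estimate, losses = 0.0, []
    for target in stream:
        losses.append((target - estimate) ** 2)  # 1) score with the current state
        estimate += lr * (target - estimate)     # 2) then adapt on the same item
    return losses, estimate
```

Because every recorded loss comes from a state that has not yet seen the item, the reported score reflects only information from earlier data.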