val_bpb: 1.1573
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.02 MB
Training Techniques
Architecture
XSA
Cross-sequence attention applied to the last 4 layers to force cross-position context.
parameters: {"layers":4}
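The card defines XSA only by the one-line description above, so this is a minimal numpy sketch of one plausible reading: in the last layers, each query attends over the keys and values of every sequence in the batch rather than only its own. The function name and the batch-flattening scheme are assumptions, not the card's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_sequence_attention(q, k, v):
    # Each query attends over the keys/values of *all* sequences in the
    # batch, not just its own. Shapes: (batch, seq, dim).
    b, t, d = k.shape
    k_all = k.reshape(b * t, d)             # pool keys across the batch
    v_all = v.reshape(b * t, d)
    scores = q @ k_all.T / np.sqrt(d)       # (batch, seq, batch*seq)
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8))
k = rng.normal(size=(2, 4, 8))
v = rng.normal(size=(2, 4, 8))
out = cross_sequence_attention(q, k, v)     # (2, 4, 8)
```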
SwiGLU
MLP with a 3x hidden-width multiplier and SwiGLU gated activation.
parameters: {"mlp_multiplier":3}
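The SwiGLU block is standard; a minimal numpy sketch with the card's `mlp_multiplier` of 3 (weight shapes and names here are illustrative):

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: silu(x @ W_gate) * (x @ W_up), projected back down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, mult = 8, 3                              # mlp_multiplier = 3
rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d, mult * d))
w_up   = rng.normal(size=(d, mult * d))
w_down = rng.normal(size=(mult * d, d))
x = rng.normal(size=(4, d))
y = swiglu_mlp(x, w_gate, w_up, w_down)     # (4, 8)
```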
SmearGate
Blends each token embedding with the previous token embedding to add bigram context.
parameters: null
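The card lists no parameters for SmearGate, so the gate parameterisation below is an assumption; the sketch shows only the stated idea of blending each token embedding with its predecessor:

```python
import numpy as np

def smear_gate(x, w_gate):
    # Blend each token embedding with the previous token's embedding via
    # a sigmoid gate (gate parameterisation is an assumption; the card
    # gives none). x: (seq, dim).
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                           # position 0 has no predecessor
    g = sigmoid(x @ w_gate)                 # per-token, per-channel gate
    return x + g * prev

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
w_gate = rng.normal(size=(8, 8)) * 0.1
y = smear_gate(x, w_gate)
```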
U-Net skip connections
Encoder-decoder style skip connections across layers to improve gradient flow.
parameters: {"layers":11}
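A common way to realise U-Net skips in a transformer stack, sketched here under the assumption that the first half of the layers push activations onto a stack and the second half pop and add them (with 11 layers: 5 pushes, one middle layer, 5 pops):

```python
import numpy as np

def unet_forward(x, layers):
    # Encoder half pushes its input; decoder half pops it and adds it
    # back in before applying the layer.
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i < half:
            skips.append(x)
        elif i >= n - half:
            x = x + skips.pop()
        x = layer(x)
    return x

layers = [lambda x: x + 1.0] * 11           # stand-in "layers"
out = unet_forward(np.zeros(3), layers)
```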
Initialization
OrthoInit
Orthogonal initialization for all weight matrices.
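Orthogonal initialization is typically done via QR decomposition of a Gaussian matrix; a minimal sketch (2-D weights only):

```python
import numpy as np

def orthogonal_init(shape, rng):
    # QR of a Gaussian matrix gives orthonormal columns; sign-fixing by
    # diag(R) makes the draw uniform over orthogonal matrices.
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

w = orthogonal_init((8, 8), np.random.default_rng(0))
```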
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
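Muon momentum-averages the gradient and approximately orthogonalises the update with a Newton-Schulz iteration before applying it. A numpy sketch; the Newton-Schulz coefficients are the ones published with Muon, while `lr` and `momentum` values here are assumptions (the card lists momentum as null, weight_decay as 0.04):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    # Odd-polynomial Newton-Schulz iteration that pushes the singular
    # values of g toward 1 (coefficients from the Muon write-up).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    if g.shape[0] > g.shape[1]:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    # One Muon update: momentum buffer, orthogonalised update, decoupled
    # weight decay. lr and momentum here are illustrative assumptions.
    buf = momentum * buf + g
    update = newton_schulz_orth(buf)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
g = rng.normal(size=(8, 8))
w2, buf = muon_step(w, g, np.zeros_like(w))
```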
Weight Averaging
SWA
parameters: {"checkpoints":15}
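SWA here is an element-wise average over the last 15 checkpoints; the mechanics are simple enough to sketch directly:

```python
import numpy as np

def average_checkpoints(checkpoints):
    # Element-wise mean of parameter dicts from the last N checkpoints
    # (N = 15 in this entry).
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n
            for k in checkpoints[0]}

ckpts = [{"w": np.full((2, 2), float(i))} for i in range(3)]
avg = average_checkpoints(ckpts)
```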
Quantization
mixed int5/int6/int8
bits: null
scope: MLP, attention, embeddings
Compression
zstd
level: 22
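The quantize-then-compress pipeline can be sketched as symmetric per-tensor integer quantisation followed by entropy coding. The card mixes int5/int6/int8 per component; int8 is shown here, and `zlib` stands in for zstd level 22 since zstd bindings are not in the Python standard library:

```python
import numpy as np
import zlib

def quantize_symmetric(w, bits=8):
    # Symmetric per-tensor quantisation: scale so the largest magnitude
    # maps to the integer max, then round.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_symmetric(w)
blob = zlib.compress(q.tobytes(), 9)        # zlib as a stand-in for zstd -22
w_hat = q.astype(np.float32) * scale        # dequantised reconstruction
```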
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.05,"chunk_size":256,"targets":"Q+V","score_first":true}
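The score-first control flow from the parameters above (score each 256-token chunk before adapting on it) can be sketched as below. `score_fn` and `adapt_fn` are placeholders for the model-specific scoring and the rank-8 LoRA gradient step on the Q and V projections; `lora_delta` shows the low-rank update form:

```python
import numpy as np

def lora_delta(a, b, alpha=1.0):
    # Low-rank weight update W + alpha * (B @ A); rank = a.shape[0].
    return alpha * (b @ a)

def score_first_ttt(tokens, score_fn, adapt_fn, chunk_size=256):
    # Score each chunk first; only chunks the score accepts are used
    # for the test-time LoRA adaptation step.
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        if score_fn(chunk):                 # decide first ...
            adapt_fn(chunk)                 # ... then adapt

tokens = list(range(600))
adapted = []
score_fn = lambda chunk: len(chunk) == 256  # toy score: full chunks only
score_first_ttt(tokens, score_fn, adapted.append)
```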
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
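A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps; a minimal sketch with the card's 3000-iteration warmdown:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    # Constant LR, then linear decay to 0 over the last warmdown_iters.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```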
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Score-first LoRA TTT where each 256-token chunk is scored before being used for adaptation
- XSA applied to the last 4 layers
- SmearGate embedding blending for bigram context
- U-Net skip connections in an 11-layer transformer
- Mixed int5/int6/int8 quantization with zstd level-22 compression to fit the artifact under 16 MB