PR #492
openRecord: 11L XSA4 + EMA + Partial RoPE + Rank-8 TTT Hooks (1.1591 bpb)
by Divyesh-Thirukonda
val_bpb: 1.1591
Architecture: Transformer
Optimizer: —
Artifact Size: 15,528,215 bytes
Training Techniques
Architecture
- XSA: Cross Self-Attention on the last 4 layers (parameters: {"layers": 4})
- Partial RoPE: Rotary Positional Embeddings applied to a subset of head dimensions (parameters: {"head_dims": "16/64"})
- Layerwise LN scale: layer-normalization scaling applied per layer
- SmearGate + BigramHash embeddings: embedding modifications using SmearGate and BigramHash
- Tied embeddings: input and output embeddings share weights
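A minimal sketch of the partial-RoPE idea listed above, assuming the "16/64" parameter means the rotary embedding is applied to the first 16 of 64 head dimensions while the rest pass through unrotated (function name and shapes are illustrative, not taken from the submission's code):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of the
    head dimension only; the remaining dims are passed through unchanged.

    x: (seq_len, head_dim) array; rot_dims must be even.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # Frequencies for the rotated slice only.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    angles = pos * inv_freq[None, :]              # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]     # the pairs to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.ones((8, 64))
out = partial_rope(q, rot_dims=16)
# Position 0 gets zero rotation; dims beyond 16 are always untouched.
assert np.allclose(out[0], q[0])
assert np.allclose(out[:, 16:], q[:, 16:])
```

Rotating only a slice of the head dimension leaves the remaining channels position-independent, a common trade-off between positional sensitivity and content capacity.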
Weight Averaging
- EMA (parameters: {"decay": 0.997})
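As a hedged sketch of the EMA weight averaging above (the dict-of-arrays representation is illustrative; the submission's actual parameter handling is not shown here), each step blends the shadow weights toward the live weights with decay 0.997:

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * model_params[k]
            for k in ema_params}

# Toy example: EMA starting at 0 chasing a constant target of 1.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.997)
# After n steps toward a constant, the closed form is 1 - decay**n.
assert abs(ema["w"] - (1 - 0.997 ** 3)) < 1e-12
```

With decay 0.997 the effective averaging window is roughly 1/(1-0.997) ≈ 333 steps, so the EMA weights used for evaluation are a smoothed trail of recent training iterates.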
Quantization
- Mixed int6/int8 (bits and scope not specified)
Compression
- zstd (level not specified)
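A minimal sketch of mixed-bit-width symmetric quantization, the likely shape of the int6/int8 scheme above (per-tensor scaling is an assumption; the submission does not state its granularity, and the zstd pass would then compress the packed integer bytes):

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q8, s8 = quantize_symmetric(w, bits=8)   # int8 for error-sensitive tensors
q6, s6 = quantize_symmetric(w, bits=6)   # int6 values (held in int8 storage)
# Rounding error is bounded by half the quantization step.
err8 = np.max(np.abs(dequantize(q8, s8) - w))
err6 = np.max(np.abs(dequantize(q6, s6) - w))
assert err8 <= s8 / 2 + 1e-6
assert err6 <= s6 / 2 + 1e-6
```

Quantized tensors have far fewer distinct byte values than float weights, which is what makes the subsequent zstd compression effective on the artifact.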
Test-Time Training
- LoRA TTT (parameters: {"rank": 8})
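As a hedged sketch of the rank-8 LoRA TTT hooks above (class name, init scheme, and alpha are illustrative assumptions, not the submission's code): each hooked linear layer keeps its base weight frozen and adds a low-rank update B @ A whose factors are the only parameters adapted on the document at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen base weight plus a rank-r update: y = x W^T + (alpha/r) x A^T B^T.

    At test time only A and B are trained on the current document;
    the base weight W stays frozen.
    """
    def __init__(self, w, rank=8, alpha=8.0):
        d_out, d_in = w.shape
        self.w = w                                    # frozen base weight
        self.a = rng.normal(0, 0.01, (rank, d_in))    # trainable, small init
        self.b = np.zeros((d_out, rank))              # zero init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

w = rng.normal(size=(32, 64))
layer = LoRALinear(w, rank=8)
x = rng.normal(size=(4, 64))
# With B zero-initialized the layer reproduces the frozen base exactly,
# so the hooks cost nothing until TTT actually updates A and B.
assert np.allclose(layer(x), x @ w.T)
```

Rank 8 keeps the per-layer trainable state tiny (8 × (d_in + d_out) values), which is what makes per-document adaptation affordable inside an evaluation budget.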
Sequence Length
- train_length: 2048
- eval_length: 2048
Other
- Adaptive eval path: keeps the variable-length, short-document, no-TTT scoring path eager to avoid Torch Dynamo recompile-limit failures
Novel Contributions
- Integration of long-document LoRA TTT hooks with rank 8
- Use of partial RoPE applied to a subset of head dimensions (16/64)
- Layerwise layer normalization scaling
- Mixed int6/int8 quantization with zstd compression
- SmearGate and BigramHash embedding modifications
- EMA with decay 0.997 for weight averaging
- Non-SOTA leaderboard submission with an exact-roundtrip metric, trained within the 600 s budget