val_bpb: 1.1573
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.02 MB
Training Techniques
Architecture
XSA
Cross-sequence attention applied to the last 4 layers to force cross-position context.
parameters: {"layers":4}
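The card defines XSA only by the one-line description above, so this is a minimal numpy sketch of one plausible reading: in the last layers, each query attends over the keys and values of every sequence in the batch rather than only its own. The function name and the batch-flattening scheme are assumptions, not the card's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_sequence_attention(q, k, v):
    # Each query attends over the keys/values of *all* sequences in the
    # batch, not just its own. Shapes: (batch, seq, dim).
    b, t, d = k.shape
    k_all = k.reshape(b * t, d)             # pool keys across the batch
    v_all = v.reshape(b * t, d)
    scores = q @ k_all.T / np.sqrt(d)       # (batch, seq, batch*seq)
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8))
k = rng.normal(size=(2, 4, 8))
v = rng.normal(size=(2, 4, 8))
out = cross_sequence_attention(q, k, v)     # (2, 4, 8)
```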
SwiGLU
MLP with a 3x hidden-width multiplier and SwiGLU gated activation.
parameters: {"mlp_multiplier":3}
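The SwiGLU block is standard; a minimal numpy sketch with the card's `mlp_multiplier` of 3 (weight shapes and names here are illustrative):

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: silu(x @ W_gate) * (x @ W_up), projected back down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, mult = 8, 3                              # mlp_multiplier = 3
rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d, mult * d))
w_up   = rng.normal(size=(d, mult * d))
w_down = rng.normal(size=(mult * d, d))
x = rng.normal(size=(4, d))
y = swiglu_mlp(x, w_gate, w_up, w_down)     # (4, 8)
```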
SmearGate
Blends each token embedding with the previous token embedding to add bigram context.
parameters: null
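The card lists no parameters for SmearGate, so the gate parameterisation below is an assumption; the sketch shows only the stated idea of blending each token embedding with its predecessor:

```python
import numpy as np

def smear_gate(x, w_gate):
    # Blend each token embedding with the previous token's embedding via
    # a sigmoid gate (gate parameterisation is an assumption; the card
    # gives none). x: (seq, dim).
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                           # position 0 has no predecessor
    g = sigmoid(x @ w_gate)                 # per-token, per-channel gate
    return x + g * prev

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
w_gate = rng.normal(size=(8, 8)) * 0.1
y = smear_gate(x, w_gate)
```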
U-Net skip connections
Encoder-decoder style skip connections across layers to improve gradient flow.
parameters: {"layers":11}
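A common way to realise U-Net skips in a transformer stack, sketched here under the assumption that the first half of the layers push activations onto a stack and the second half pop and add them (with 11 layers: 5 pushes, one middle layer, 5 pops):

```python
import numpy as np

def unet_forward(x, layers):
    # Encoder half pushes its input; decoder half pops it and adds it
    # back in before applying the layer.
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i < half:
            skips.append(x)
        elif i >= n - half:
            x = x + skips.pop()
        x = layer(x)
    return x

layers = [lambda x: x + 1.0] * 11           # stand-in "layers"
out = unet_forward(np.zeros(3), layers)
```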
Initialization
OrthoInit
Orthogonal initialization for all weight matrices.
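Orthogonal initialization is typically done via QR decomposition of a Gaussian matrix; a minimal sketch (2-D weights only):

```python
import numpy as np

def orthogonal_init(shape, rng):
    # QR of a Gaussian matrix gives orthonormal columns; sign-fixing by
    # diag(R) makes the draw uniform over orthogonal matrices.
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

w = orthogonal_init((8, 8), np.random.default_rng(0))
```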
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
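Muon momentum-averages the gradient and approximately orthogonalises the update with a Newton-Schulz iteration before applying it. A numpy sketch; the Newton-Schulz coefficients are the ones published with Muon, while `lr` and `momentum` values here are assumptions (the card lists momentum as null, weight_decay as 0.04):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    # Odd-polynomial Newton-Schulz iteration that pushes the singular
    # values of g toward 1 (coefficients from the Muon write-up).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    if g.shape[0] > g.shape[1]:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    # One Muon update: momentum buffer, orthogonalised update, decoupled
    # weight decay. lr and momentum here are illustrative assumptions.
    buf = momentum * buf + g
    update = newton_schulz_orth(buf)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
g = rng.normal(size=(8, 8))
w2, buf = muon_step(w, g, np.zeros_like(w))
```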
Weight Averaging
SWA
parameters: {"checkpoints":15}
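SWA here is an element-wise average over the last 15 checkpoints; the mechanics are simple enough to sketch directly:

```python
import numpy as np

def average_checkpoints(checkpoints):
    # Element-wise mean of parameter dicts from the last N checkpoints
    # (N = 15 in this entry).
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n
            for k in checkpoints[0]}

ckpts = [{"w": np.full((2, 2), float(i))} for i in range(3)]
avg = average_checkpoints(ckpts)
```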
Quantization
mixed int5/int6/int8
bits: null
scope: MLP, attention, embeddings
Compression
zstd
level: 22
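The quantize-then-compress pipeline can be sketched as symmetric per-tensor integer quantisation followed by entropy coding. The card mixes int5/int6/int8 per component; int8 is shown here, and `zlib` stands in for zstd level 22 since zstd bindings are not in the Python standard library:

```python
import numpy as np
import zlib

def quantize_symmetric(w, bits=8):
    # Symmetric per-tensor quantisation: scale so the largest magnitude
    # maps to the integer max, then round.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_symmetric(w)
blob = zlib.compress(q.tobytes(), 9)        # zlib as a stand-in for zstd -22
w_hat = q.astype(np.float32) * scale        # dequantised reconstruction
```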
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.05,"chunk_size":256,"targets":"Q+V","score_first":true}
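The score-first control flow from the parameters above (score each 256-token chunk before adapting on it) can be sketched as below. `score_fn` and `adapt_fn` are placeholders for the model-specific scoring and the rank-8 LoRA gradient step on the Q and V projections; `lora_delta` shows the low-rank update form:

```python
import numpy as np

def lora_delta(a, b, alpha=1.0):
    # Low-rank weight update W + alpha * (B @ A); rank = a.shape[0].
    return alpha * (b @ a)

def score_first_ttt(tokens, score_fn, adapt_fn, chunk_size=256):
    # Score each chunk first; only chunks the score accepts are used
    # for the test-time LoRA adaptation step.
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        if score_fn(chunk):                 # decide first ...
            adapt_fn(chunk)                 # ... then adapt

tokens = list(range(600))
adapted = []
score_fn = lambda chunk: len(chunk) == 256  # toy score: full chunks only
score_first_ttt(tokens, score_fn, adapted.append)
```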
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
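A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps; a minimal sketch with the card's 3000-iteration warmdown:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    # Constant LR, then linear decay to 0 over the last warmdown_iters.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```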
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Score-first LoRA TTT where each 256-token chunk is scored before being used for adaptation
- XSA applied to the last 4 layers
- SmearGate embedding blending for bigram context
- U-Net skip connections in an 11-layer transformer
- Mixed int5/int6/int8 quantization with zstd level-22 compression to fit the artifact under 16 MB