PR #457
open11L + XSA + VRL + SWA + seq4096 + cross-doc TTT - val_bpb 1.1839
by carlesonielfa
val_bpb
1.1839
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.35 MB
Training Techniques
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Architecture
XSA
In the deepest layers, Exclusive Self-Attention subtracts the component of the attention output that is aligned with the token's own value vector.
parameters: {"layers":4}
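A minimal per-token sketch of the subtraction described above, assuming the "aligned component" is the orthogonal projection of the attention output onto the value vector (the PR's exact per-head formulation is not given here):

```python
def exclusive_attention_output(o, v):
    # Subtract from the attention output o its component along the
    # token's own value vector v. This is one reading of "exclusive"
    # self-attention; the per-head wiring in the PR is assumed.
    vv = sum(a * a for a in v)
    if vv == 0.0:
        return list(o)
    scale = sum(a * b for a, b in zip(o, v)) / vv
    return [a - scale * b for a, b in zip(o, v)]
```

After the subtraction, the result is orthogonal to `v`, so the residual stream keeps only the part of the attention output not already expressed by the token's own value.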
VRL
Value Residual Learning adds a learnable residual from layer-0 value vectors into each layer's value vectors.
parameters: {"layers":[1,10]}
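The residual described above can be sketched as follows, assuming a single learned scalar mixing coefficient (per-head or per-channel parameterizations are equally plausible):

```python
def value_residual(v_layer, v0, lam):
    # Add a learnable residual from the layer-0 value vector v0 into
    # the current layer's value vector v_layer. lam is a learned
    # scalar here; the PR's exact parameterization is assumed.
    return [vl + lam * v for vl, v in zip(v_layer, v0)]
```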
SmearGate
Learned token-blending gate at the embedding layer that mixes each token with the previous token.
parameters: null
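A sketch of the embedding-layer blending gate, assuming a single learned sigmoid gate applied as a convex mix with the previous token (the PR may instead use per-channel gates or an additive form):

```python
import math

def smear_gate(embeddings, gate_logit):
    # Blend each token embedding with its predecessor via a learned
    # sigmoid gate g in (0, 1). Scalar gate assumed for illustration.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(embeddings[0])]  # first token has no predecessor
    for prev, cur in zip(embeddings, embeddings[1:]):
        out.append([(1 - g) * c + g * p for c, p in zip(cur, prev)])
    return out
```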
weight tying
The input embedding matrix is shared with (tied to) the output projection.
parameters: null
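Weight tying reduces to one matrix serving two roles, which a small sketch makes concrete (plain lists stand in for tensors):

```python
class TiedHead:
    # One matrix W (vocab x dim) serves both as the input embedding
    # table and as the output projection: logit for token t is h . W[t].
    def __init__(self, W):
        self.W = W  # shared parameter, saved and trained once

    def embed(self, token_id):
        return self.W[token_id]

    def logits(self, h):
        return [sum(a * b for a, b in zip(h, row)) for row in self.W]
```

Because the matrix is stored once, tying also shrinks the artifact, which matters for the 15.35 MB size reported above.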
Weight Averaging
SWA (Stochastic Weight Averaging)
parameters: {"checkpoints":24,"fraction_last_warmdown":0.4}
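Per the parameters above, 24 checkpoints drawn from the last 40% of the warmdown phase are averaged. The averaging step itself is just a uniform mean over parameters:

```python
def average_checkpoints(checkpoints):
    # Uniform average of parameter vectors from the selected
    # checkpoints. Real checkpoints would be state dicts; flat
    # lists of parameters suffice for the sketch.
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]
```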
Test-Time Training
LoRA TTT
parameters: {"rank":8}
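A LoRA forward pass at rank r keeps the base weight frozen and trains only the low-rank factors. A minimal sketch, with the scaling factor and initialization details assumed:

```python
def lora_forward(x, W, A, B, alpha=1.0):
    # y = x @ (W + alpha * A @ B): frozen weight W (d_in x d_out)
    # plus a rank-r update A (d_in x r) @ B (r x d_out). Only A and B
    # are trained at test time. With B initialized to zeros, the
    # output starts identical to the base model's.
    d_out, r = len(W[0]), len(B)
    base = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]
    h = [sum(x[i] * A[i][k] for i in range(len(x))) for k in range(r)]
    delta = [sum(h[k] * B[k][j] for k in range(r)) for j in range(d_out)]
    return [b + alpha * d for b, d in zip(base, delta)]
```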
Quantization
QAT
bits: 8
scope: all
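The core of 8-bit QAT is fake quantization in the forward pass, so training sees the rounding error the deployed model will incur. A sketch assuming symmetric per-tensor quantization (the PR's exact scheme is not stated):

```python
def fake_quantize(x, bits=8):
    # Map values onto 2^(bits-1) - 1 signed levels and back, so the
    # forward pass sees quantization error; in training, gradients
    # would flow straight through this rounding.
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in x)
    if m == 0.0:
        return list(x)
    scale = m / qmax
    return [round(v / scale) * scale for v in x]
```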
Compression
zlib
level: null
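Since the level is left null above, the artifact presumably uses zlib's library default. A round-trip sketch:

```python
import zlib

def compress_artifact(raw_bytes):
    # Level is null in the PR metadata, so zlib's default
    # (Z_DEFAULT_COMPRESSION) is used here.
    return zlib.compress(raw_bytes)

def decompress_artifact(blob):
    return zlib.decompress(blob)
```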
Evaluation
sliding window eval
parameters: {"stride":64}
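With stride 64, evaluation windows overlap heavily and each window newly scores only the tokens not covered by the previous window, the rest serving as context. A sketch of the span bookkeeping (the window size is an assumption; it likely matches the 4096-token training length):

```python
def sliding_window_spans(n_tokens, window, stride):
    # Each span is (begin, end, score_from): the window covers
    # [begin, end), and only tokens in [score_from, end) are newly
    # scored; earlier tokens in the window are context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```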
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
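A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps. A sketch using the PR's `warmdown_iters` (the base LR and total step count are illustrative):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=1200):
    # Constant LR until the warmdown phase, then linear decay to
    # zero over the final warmdown_iters steps.
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```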
Initialization
OvertoneInit
Used with phase-transition resid_mix.
Other
other
Cross-document test-time training with per-document rank-8 LoRA adapters trained on already-evaluated tokens and reset between documents.
parameters: {"reset_between_documents":true}
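The evaluation loop above has two invariants worth making explicit: each chunk is scored before the adapter trains on it (so the model never sees a token before predicting it), and the adapter is reset at every document boundary. A sketch with the LoRA machinery abstracted behind placeholder callbacks:

```python
def evaluate_with_ttt(documents, score_chunk, train_adapter, reset_adapter):
    # Per-document test-time training: reset the adapter at each
    # document boundary, then for each chunk score first and train
    # second, so only already-evaluated tokens influence later
    # predictions. The callbacks stand in for the PR's rank-8 LoRA.
    losses = []
    for doc in documents:
        reset_adapter()
        for chunk in doc:
            losses.append(score_chunk(chunk))
            train_adapter(chunk)
    return losses
```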
Novel Contributions
- Long-context training with sequence length 4096
- Exclusive Self-Attention (XSA) on the deepest 4 layers
- Value Residual Learning (VRL) using layer-0 value vectors
- SmearGate token-blending gate at the embedding layer
- Stochastic Weight Averaging over 24 checkpoints
- Cross-document test-time training with rank-8 LoRA adapters
- Warmdown-phase QAT to minimize the 8-bit quantization penalty