PR #492
openRecord: 11L XSA4 + EMA + Partial RoPE + Rank-8 TTT Hooks (1.1591 bpb)
by Divyesh-Thirukonda
val_bpb: 1.1591
Architecture: Transformer
Optimizer: —
Artifact Size: 15,528,215 bytes
Training Techniques
Architecture
- XSA: Cross Self-Attention on the last 4 layers (parameters: {"layers": 4})
- Partial RoPE: Rotary Positional Embeddings applied to a subset of head dimensions (parameters: {"head_dims": "16/64"})
- Layerwise LN scale: layer-normalization scaling applied per layer
- SmearGate + BigramHash embeddings: embedding modifications using SmearGate and BigramHash
- Tied embeddings: input and output embeddings share weights
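A minimal sketch of the partial-RoPE idea listed above, assuming the "16/64" parameter means the rotary embedding is applied to the first 16 of 64 head dimensions while the rest pass through unrotated (function name and shapes are illustrative, not taken from the submission's code):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of the
    head dimension only; the remaining dims are passed through unchanged.

    x: (seq_len, head_dim) array; rot_dims must be even.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # Frequencies for the rotated slice only.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    angles = pos * inv_freq[None, :]              # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]     # the pairs to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.ones((8, 64))
out = partial_rope(q, rot_dims=16)
# Position 0 gets zero rotation; dims beyond 16 are always untouched.
assert np.allclose(out[0], q[0])
assert np.allclose(out[:, 16:], q[:, 16:])
```

Rotating only a slice of the head dimension leaves the remaining channels position-independent, a common trade-off between positional sensitivity and content capacity.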
Weight Averaging
- EMA (parameters: {"decay": 0.997})
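As a hedged sketch of the EMA weight averaging above (the dict-of-arrays representation is illustrative; the submission's actual parameter handling is not shown here), each step blends the shadow weights toward the live weights with decay 0.997:

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * model_params[k]
            for k in ema_params}

# Toy example: EMA starting at 0 chasing a constant target of 1.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.997)
# After n steps toward a constant, the closed form is 1 - decay**n.
assert abs(ema["w"] - (1 - 0.997 ** 3)) < 1e-12
```

With decay 0.997 the effective averaging window is roughly 1/(1-0.997) ≈ 333 steps, so the EMA weights used for evaluation are a smoothed trail of recent training iterates.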
Quantization
- Mixed int6/int8 (bits and scope not specified)
Compression
- zstd (level not specified)
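A minimal sketch of mixed-bit-width symmetric quantization, the likely shape of the int6/int8 scheme above (per-tensor scaling is an assumption; the submission does not state its granularity, and the zstd pass would then compress the packed integer bytes):

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q8, s8 = quantize_symmetric(w, bits=8)   # int8 for error-sensitive tensors
q6, s6 = quantize_symmetric(w, bits=6)   # int6 values (held in int8 storage)
# Rounding error is bounded by half the quantization step.
err8 = np.max(np.abs(dequantize(q8, s8) - w))
err6 = np.max(np.abs(dequantize(q6, s6) - w))
assert err8 <= s8 / 2 + 1e-6
assert err6 <= s6 / 2 + 1e-6
```

Quantized tensors have far fewer distinct byte values than float weights, which is what makes the subsequent zstd compression effective on the artifact.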
Test-Time Training
- LoRA TTT (parameters: {"rank": 8})
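As a hedged sketch of the rank-8 LoRA TTT hooks above (class name, init scheme, and alpha are illustrative assumptions, not the submission's code): each hooked linear layer keeps its base weight frozen and adds a low-rank update B @ A whose factors are the only parameters adapted on the document at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen base weight plus a rank-r update: y = x W^T + (alpha/r) x A^T B^T.

    At test time only A and B are trained on the current document;
    the base weight W stays frozen.
    """
    def __init__(self, w, rank=8, alpha=8.0):
        d_out, d_in = w.shape
        self.w = w                                    # frozen base weight
        self.a = rng.normal(0, 0.01, (rank, d_in))    # trainable, small init
        self.b = np.zeros((d_out, rank))              # zero init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

w = rng.normal(size=(32, 64))
layer = LoRALinear(w, rank=8)
x = rng.normal(size=(4, 64))
# With B zero-initialized the layer reproduces the frozen base exactly,
# so the hooks cost nothing until TTT actually updates A and B.
assert np.allclose(layer(x), x @ w.T)
```

Rank 8 keeps the per-layer trainable state tiny (8 × (d_in + d_out) values), which is what makes per-document adaptation affordable inside an evaluation budget.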
Sequence Length
- train_length: 2048
- eval_length: 2048
Other
- Adaptive eval path: keeps the variable-length, short-document, no-TTT scoring path eager to avoid Torch Dynamo recompile-limit failures
Novel Contributions
- Integration of long-document LoRA TTT hooks with rank 8
- Use of partial RoPE applied to a subset of head dimensions (16/64)
- Layerwise layer normalization scaling
- Mixed int6/int8 quantization with zstd compression
- SmearGate and BigramHash embedding modifications
- EMA with decay 0.997 for weight averaging
- Non-SOTA leaderboard submission with an exact-roundtrip metric, trained within the 600 s budget