PR #642
Status: closed
Record: 11L + Score-Every-Epoch LoRA TTT 5ep (3-seed mean val_bpb=0.8173)
by minh-stakc
val_bpb: 0.8173
Architecture: 11L Transformer
Optimizer: Muon
Artifact Size: 17.13 MB
Training Techniques
Quantization
GPTQ-lite
bits: 6
scope: all
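The 6-bit "GPTQ-lite" pass is not spelled out in the PR, but a minimal sketch of what 6-bit weight quantization looks like is below: symmetric per-row round-to-nearest with a per-row scale. The real method presumably adds GPTQ-style error compensation, which is omitted here; `quantize_6bit` and `dequantize` are illustrative names.

```python
import numpy as np

def quantize_6bit(w):
    """Symmetric per-row round-to-nearest quantization to 6 bits.

    Simplified stand-in for the PR's GPTQ-lite (bits=6, scope=all);
    GPTQ's Hessian-aware error compensation is omitted.
    """
    qmax = 2 ** (6 - 1) - 1                       # 31: signed 6-bit range
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_6bit(w)
w_hat = dequantize(q, s)                          # max error <= scale / 2
```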
Architecture
XSA
Cross/self-attention variant applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"dimensions":"16/64"}
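A sketch of the 16/64 partial-RoPE split: rotary embeddings are applied to the first 16 of 64 head dimensions and the rest pass through unrotated. The half/half pairing convention and the base of 10000 are assumptions; implementations differ on both.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of the
    head dimension, leaving the remaining dims untouched (16/64 split).

    x: (seq_len, head_dim). Pairs dim i with dim i + rot_dims // 2,
    one assumed convention among several in use.
    """
    seq, dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 64))
y = partial_rope(x)
```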
SmearGate
Additional gating mechanism in the architecture
parameters: null
BigramHash
Bigram hashing feature/module
parameters: {"size":2048}
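The bigram-hash module with size 2048 plausibly maps each consecutive token pair into one of 2048 buckets backing an extra embedding table. The PR does not give the hash; the multiplicative hash, byte-level vocabulary, and embedding width below are all illustrative assumptions.

```python
import numpy as np

N_BUCKETS, D = 2048, 64   # 2048 from the PR; embedding width D is assumed

def bigram_hash_ids(tokens, n_buckets=N_BUCKETS):
    """Map each (prev, cur) token pair to a bucket via a cheap
    multiplicative hash (illustrative choice, not the PR's exact hash)."""
    prev = np.concatenate([[0], tokens[:-1]])   # pad the first position
    return (prev * 1000003 + tokens) % n_buckets

# Hashed-bigram embeddings, added alongside the ordinary token embeddings.
rng = np.random.default_rng(2)
emb = rng.standard_normal((N_BUCKETS, D)) * 0.02
tokens = np.frombuffer(b"hello world", dtype=np.uint8).astype(np.int64)
bigram_features = emb[bigram_hash_ids(tokens)]  # (seq_len, D)
```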
MLP3x
Expanded MLP width to 3x
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
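EMA weight averaging with the listed decay of 0.997 amounts to a one-line update per training step, with evaluation run on the averaged weights rather than the raw ones. A minimal sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of parameters (decay=0.997 per the PR).
    Evaluation then uses ema_params instead of the live training weights."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}
```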
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"window":"pre-TTT sliding window"}
Test-Time Training
LoRA TTT
parameters: {"epochs":5,"lm_rank":16,"lora_rank":8,"learning_rate":0.01,"temperature":0.98,"score_every_epoch":true}
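The LoRA parameterization behind the TTT config can be sketched as follows: the base weight stays frozen while a low-rank `B @ A` adapter (rank 8 here, rank 16 on the LM head per `lm_rank`) is trained per document at test time. Zero-initializing `B` makes the adapter a no-op before adaptation begins; the class below is a hypothetical sketch, not the PR's code.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank adapter B @ A.

    During per-document TTT only A and B receive gradient updates
    (rank 8 for Q/V projections, rank 16 for the LM head per the PR).
    """
    def __init__(self, w, rank=8, seed=0):
        d_out, d_in = w.shape
        rng = np.random.default_rng(seed)
        self.w = w                                      # frozen base weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01
        self.b = np.zeros((d_out, rank))                # zero-init: no-op at start

    def __call__(self, x):
        return x @ (self.w + self.b @ self.a).T
```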
Initialization
OrthoInit
Orthogonal initialization
LR Schedule
cosine decay
parameters: {"across_total_ttt_steps":true}
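With `across_total_ttt_steps=true`, the cosine schedule spans the whole TTT run (all 5 epochs counted as one step budget) rather than restarting each epoch. A minimal sketch, using the TTT learning rate of 0.01 from the config:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01):
    """Cosine decay from base_lr to 0 over the full TTT run,
    i.e. total_steps covers all epochs combined."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```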
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmdown":3500}
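Muon's core step orthogonalizes the momentum buffer with an odd-polynomial Newton-Schulz iteration before applying it as the update. The quintic coefficients below are the commonly published ones; treat the whole function as an illustrative sketch rather than this PR's exact code.

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a 2D momentum buffer, as in Muon.

    Quintic Newton-Schulz iteration; coefficients follow the widely
    used reference implementation (assumed, not taken from the PR).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # Frobenius-normalize first
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x
```

After five iterations the singular values of the input are pushed toward 1, which is what makes the update direction roughly orthogonal.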
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- Score-every-epoch multi-scale LoRA TTT
- Per-document LoRA adaptation with epoch-wise rescoring of all chunks
- Only the final epoch's scores contribute to BPB
- Multi-scale LoRA configuration with different ranks and learning rates for LM head, Q/V projections, and per-block bias tuning
- Post-TTT temperature rescaling
- 3-seed validation showing mean val_bpb of 0.8173
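The score-every-epoch loop described above can be sketched as: adapt the LoRA parameters for one epoch, rescore every chunk of the document, repeat, and let only the final epoch's scores reach the BPB computation. `adapt_step` and `score_fn` are placeholders for the LoRA update and NLL scoring (where the post-TTT temperature rescale with T=0.98 would presumably be applied to the logits); the helpers below are hypothetical.

```python
import math

def ttt_score_every_epoch(chunks, adapt_step, score_fn, epochs=5):
    """Per-document TTT loop: rescore all chunks after every adaptation
    epoch, but keep only the final epoch's scores for BPB."""
    state, scores = None, None
    for epoch in range(epochs):
        state = adapt_step(state, chunks, epoch)        # LoRA params evolve
        scores = [score_fn(state, c) for c in chunks]   # rescore everything
    return scores                                       # last epoch only

def bpb_from_nll(total_nll_nats, n_bytes):
    """Bits per byte from a summed negative log-likelihood in nats."""
    return total_nll_nats / (n_bytes * math.log(2))
```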