PR #642

closed

Record: 11L + Score-Every-Epoch LoRA TTT 5ep (3-seed mean val_bpb=0.8173)

by minh-stakc
val_bpb: 0.8173
Architecture: 11L Transformer
Optimizer: Muon
Artifact Size: 17.13 MB

Training Techniques

Quantization
  • GPTQ-lite (bits: 6, scope: all)
Architecture
  • XSA: cross/self-attention variant applied to the last 4 layers (parameters: {"layers":4})
  • Partial RoPE: rotary positional embeddings applied partially (parameters: {"dimensions":"16/64"})
  • SmearGate: additional gating mechanism in the architecture
  • BigramHash: bigram hashing feature/module (parameters: {"size":2048})
  • MLP3x: expanded MLP width to 3x
Weight Averaging
  • EMA (parameters: {"decay":0.997})
Compression
  • zstd (level: 22)
Evaluation
  • sliding window eval (parameters: {"window":"pre-TTT sliding window"})
Test-Time Training
  • LoRA TTT (parameters: {"epochs":5,"lm_rank":16,"lora_rank":8,"learning_rate":0.01,"temperature":0.98,"score_every_epoch":true})
Initialization
  • OrthoInit: orthogonal initialization
LR Schedule
  • cosine decay (parameters: {"across_total_ttt_steps":true})
Optimizer
  • Muon (weight_decay: 0.04, momentum: null, other_params: {"warmdown":3500})
Regularization
  • layerwise LN scale
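The Test-Time Training entry combines several of the listed knobs: epochs=5, lora_rank=8, learning_rate=0.01, score_every_epoch=true, and the cosine decay applied across total TTT steps. A minimal NumPy sketch of such a loop on a frozen linear LM head for a single document might look like the following; all shapes, data, and variable names are hypothetical illustration, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, v, n = 32, 64, 128                 # hidden dim, byte vocab, tokens in one document
EPOCHS, RANK, BASE_LR = 5, 8, 0.01    # values taken from the record's LoRA TTT parameters

X = rng.normal(size=(n, d))           # frozen hidden states for the document
targets = rng.integers(0, v, size=n)  # next-byte targets
W = rng.normal(scale=0.1, size=(d, v))      # frozen base LM head

A = rng.normal(scale=0.1, size=(d, RANK))   # LoRA down-projection (trained)
B = np.zeros((RANK, v))                     # LoRA up-projection, zero-init (trained)

def bpb(logits, y):
    # bits-per-byte = mean NLL in bits, assuming byte-level tokens
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean() / np.log(2)

baseline = bpb(X @ W, targets)        # pre-TTT score (A @ B is zero at init)
scores = []
for step in range(EPOCHS):
    # cosine decay of the LoRA learning rate across all TTT steps
    lr = BASE_LR * 0.5 * (1 + np.cos(np.pi * step / EPOCHS))
    logits = X @ (W + A @ B)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    G = P.copy()
    G[np.arange(n), targets] -= 1.0
    G /= n                            # d(loss)/d(logits) for mean cross-entropy
    G_AB = X.T @ G                    # gradient w.r.t. the low-rank update A @ B
    gA, gB = G_AB @ B.T, A.T @ G_AB   # chain rule through A and B
    A -= lr * gA
    B -= lr * gB
    # rescore the whole document after every epoch ("score_every_epoch": true);
    # only the final epoch's score contributes to the reported BPB
    scores.append(bpb(X @ (W + A @ B), targets))

final_bpb = scores[-1]
```

On this toy data the final-epoch score lands below the pre-TTT baseline, mirroring the intent of the record's per-document adaptation.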
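The Weight Averaging entry (EMA, decay 0.997) can be sketched as a shadow copy of the weights that is updated after every optimizer step; only the decay value comes from the record, and the loop below is a stand-in, not the actual training code:

```python
import numpy as np

DECAY = 0.997                         # from the record's EMA parameters
rng = np.random.default_rng(2)
w = rng.normal(size=(4, 4))           # "training" weights (shape is illustrative)
ema = w.copy()                        # shadow copy used for evaluation

for step in range(100):
    w += 0.01 * rng.normal(size=w.shape)    # stand-in for an optimizer update
    ema = DECAY * ema + (1.0 - DECAY) * w   # exponential moving average of weights
```

Evaluation then uses `ema` in place of `w`, trading a little recency for much lower variance.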

Novel Contributions

  • Score-every-epoch multi-scale LoRA TTT
  • Per-document LoRA adaptation with epoch-wise rescoring of all chunks
  • Only the final epoch's scores contribute to BPB
  • Multi-scale LoRA configuration with different ranks and learning rates for LM head, Q/V projections, and per-block bias tuning
  • Post-TTT temperature rescaling
  • 3-seed validation showing mean val_bpb of 0.8173
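The "post-TTT temperature rescaling" contribution can be illustrated with a short NumPy sketch; the function name and data below are hypothetical, and only temperature=0.98 is taken from the record:

```python
import numpy as np

def bpb_with_temperature(logits, targets, T=0.98):
    """Bits-per-byte after dividing logits by temperature T (T < 1 sharpens)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean() / np.log(2)

# When the adapted model already ranks the correct byte first, a temperature
# slightly below 1 sharpens the distribution and lowers the scored BPB.
rng = np.random.default_rng(1)
logits = rng.normal(size=(16, 64))
targets = logits.argmax(axis=1)       # model's top choice is "correct" here
sharpened = bpb_with_temperature(logits, targets, T=0.98)
plain = bpb_with_temperature(logits, targets, T=1.0)
```

Whether T < 1 helps depends on how well-calibrated the post-TTT model is; in the well-ranked case above it strictly lowers the score.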