PR #642

closed

Record: 11L + Score-Every-Epoch LoRA TTT 5ep (3-seed mean val_bpb=0.8173)

by minh-stakc
val_bpb: 0.8173
Architecture: 11L Transformer
Optimizer: Muon
Artifact Size: 17.13 MB

Training Techniques

Quantization
  • GPTQ-lite (bits: 6, scope: all)
Architecture
  • XSA: cross/self-attention variant applied to the last 4 layers (parameters: {"layers":4})
  • Partial RoPE: rotary positional embeddings applied partially (parameters: {"dimensions":"16/64"})
  • SmearGate: additional gating mechanism in the architecture
  • BigramHash: bigram hashing feature/module (parameters: {"size":2048})
  • MLP3x: expanded MLP width to 3x
Weight Averaging
  • EMA (parameters: {"decay":0.997})
Compression
  • zstd (level: 22)
Evaluation
  • sliding window eval (parameters: {"window":"pre-TTT sliding window"})
Test-Time Training
  • LoRA TTT (parameters: {"epochs":5,"lm_rank":16,"lora_rank":8,"learning_rate":0.01,"temperature":0.98,"score_every_epoch":true})
Initialization
  • OrthoInit: orthogonal initialization
LR Schedule
  • cosine decay (parameters: {"across_total_ttt_steps":true})
Optimizer
  • Muon (weight_decay: 0.04, momentum: null, other_params: {"warmdown":3500})
Regularization
  • layerwise LN scale
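The Test-Time Training entry combines several of the listed knobs: epochs=5, lora_rank=8, learning_rate=0.01, score_every_epoch=true, and the cosine decay applied across total TTT steps. A minimal NumPy sketch of such a loop on a frozen linear LM head for a single document might look like the following; all shapes, data, and variable names are hypothetical illustration, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, v, n = 32, 64, 128                 # hidden dim, byte vocab, tokens in one document
EPOCHS, RANK, BASE_LR = 5, 8, 0.01    # values taken from the record's LoRA TTT parameters

X = rng.normal(size=(n, d))           # frozen hidden states for the document
targets = rng.integers(0, v, size=n)  # next-byte targets
W = rng.normal(scale=0.1, size=(d, v))      # frozen base LM head

A = rng.normal(scale=0.1, size=(d, RANK))   # LoRA down-projection (trained)
B = np.zeros((RANK, v))                     # LoRA up-projection, zero-init (trained)

def bpb(logits, y):
    # bits-per-byte = mean NLL in bits, assuming byte-level tokens
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean() / np.log(2)

baseline = bpb(X @ W, targets)        # pre-TTT score (A @ B is zero at init)
scores = []
for step in range(EPOCHS):
    # cosine decay of the LoRA learning rate across all TTT steps
    lr = BASE_LR * 0.5 * (1 + np.cos(np.pi * step / EPOCHS))
    logits = X @ (W + A @ B)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    G = P.copy()
    G[np.arange(n), targets] -= 1.0
    G /= n                            # d(loss)/d(logits) for mean cross-entropy
    G_AB = X.T @ G                    # gradient w.r.t. the low-rank update A @ B
    gA, gB = G_AB @ B.T, A.T @ G_AB   # chain rule through A and B
    A -= lr * gA
    B -= lr * gB
    # rescore the whole document after every epoch ("score_every_epoch": true);
    # only the final epoch's score contributes to the reported BPB
    scores.append(bpb(X @ (W + A @ B), targets))

final_bpb = scores[-1]
```

On this toy data the final-epoch score lands below the pre-TTT baseline, mirroring the intent of the record's per-document adaptation.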
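The Weight Averaging entry (EMA, decay 0.997) can be sketched as a shadow copy of the weights that is updated after every optimizer step; only the decay value comes from the record, and the loop below is a stand-in, not the actual training code:

```python
import numpy as np

DECAY = 0.997                         # from the record's EMA parameters
rng = np.random.default_rng(2)
w = rng.normal(size=(4, 4))           # "training" weights (shape is illustrative)
ema = w.copy()                        # shadow copy used for evaluation

for step in range(100):
    w += 0.01 * rng.normal(size=w.shape)    # stand-in for an optimizer update
    ema = DECAY * ema + (1.0 - DECAY) * w   # exponential moving average of weights
```

Evaluation then uses `ema` in place of `w`, trading a little recency for much lower variance.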

Novel Contributions

  • Score-every-epoch multi-scale LoRA TTT
  • Per-document LoRA adaptation with epoch-wise rescoring of all chunks
  • Only the final epoch's scores contribute to BPB
  • Multi-scale LoRA configuration with different ranks and learning rates for LM head, Q/V projections, and per-block bias tuning
  • Post-TTT temperature rescaling
  • 3-seed validation showing mean val_bpb of 0.8173
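The "post-TTT temperature rescaling" contribution can be illustrated with a short NumPy sketch; the function name and data below are hypothetical, and only temperature=0.98 is taken from the record:

```python
import numpy as np

def bpb_with_temperature(logits, targets, T=0.98):
    """Bits-per-byte after dividing logits by temperature T (T < 1 sharpens)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean() / np.log(2)

# When the adapted model already ranks the correct byte first, a temperature
# slightly below 1 sharpens the distribution and lowers the scored BPB.
rng = np.random.default_rng(1)
logits = rng.normal(size=(16, 64))
targets = logits.argmax(axis=1)       # model's top choice is "correct" here
sharpened = bpb_with_temperature(logits, targets, T=0.98)
plain = bpb_with_temperature(logits, targets, T=1.0)
```

Whether T < 1 helps depends on how well-calibrated the post-TTT model is; in the well-ranked case above it strictly lowers the score.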