PR #304 (open)

Non-record: QAT + Neural Cache + LoRA TTT

by Bortlesboat
val_bpb: 1.4245
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.77 MB

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP layers)
  • STE QAT (bits: 6, scope: attention layers)
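A minimal sketch of the fake-quantization forward pass behind STE QAT, assuming per-tensor symmetric scaling (the PR may scale per-channel instead). During training, the round() is wrapped in a straight-through estimator so the backward pass treats it as identity:

```python
def fake_quant(weights, bits):
    # Symmetric fake quantization: snap each weight to a signed integer grid
    # matched to the export format, then immediately dequantize. In QAT the
    # round() gets a straight-through (identity) gradient.
    qmax = 2 ** (bits - 1) - 1              # int5 -> 15, int6 -> 31
    qmin = -(2 ** (bits - 1))               # int5 -> -16, int6 -> -32
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [min(max(round(w / scale), qmin), qmax) for w in weights]
    return [qi * scale for qi in q], scale
```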
Architecture
  • BigramHash: added as part of the training recipe. parameters: {"size":10240,"dim":128}
  • SmearGate: included in the model architecture. parameters: null
  • MLP3x: 3x-width MLP block. parameters: {"hidden_size":1536}
  • KV head count: grouped-query attention with fewer KV heads than attention heads. parameters: {"layers":10,"dim":512,"heads":8,"kv_heads":4}
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02})
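Muon's distinguishing step is orthogonalizing the momentum matrix before applying the update. A dependency-free sketch using the classical cubic Newton-Schulz iteration (production Muon uses a tuned quintic polynomial; the momentum and weight-decay bookkeeping are omitted):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def orthogonalize(G, steps=25):
    # Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X drives a matrix
    # toward its nearest orthogonal factor. Normalizing by the Frobenius
    # norm first puts all singular values in (0, 1], where the cubic
    # iteration converges.
    norm = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / norm for g in row] for row in G]
    for _ in range(steps):
        XXt = matmul(X, [list(c) for c in zip(*X)])
        cube = matmul(XXt, X)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, cube)]
    return X
```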
Weight Averaging
  • SWA: parameters: {"start_fraction":0.6,"interval_steps":50,"checkpoints":24}
Evaluation
  • sliding window eval: parameters: {"stride":64}
  • neural cache: parameters: {"hidden_state_dim":512,"dtype":"bf16","interpolation":"logaddexp"}
Test-Time Training
  • LoRA TTT: parameters: {"rank":8}
Initialization
  • orthogonal init: used for model components.
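In PyTorch this is typically `torch.nn.init.orthogonal_`, which draws a Gaussian matrix and takes its QR factor. A dependency-free sketch of the same idea via modified Gram-Schmidt:

```python
import random

def orthogonal_init(n, seed=0):
    # Build an n x n orthogonal matrix: draw Gaussian rows, project each one
    # against the already-accepted rows (modified Gram-Schmidt), normalize.
    rng = random.Random(seed)
    Q = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in Q:
            d = sum(r * qi for r, qi in zip(row, q))
            row = [r - d * qi for r, qi in zip(row, q)]
        norm = sum(r * r for r in row) ** 0.5
        Q.append([r / norm for r in row])
    return Q
```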
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
LR Schedule
  • warmdown: parameters: null
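The PR lists no warmdown parameters; a common shape is a constant learning rate followed by a linear decay to zero over a final slice of training. The 300-step window below is hypothetical:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=300):
    # Hold base_lr, then decay linearly to zero over the final
    # warmdown_steps (window length is hypothetical; not from the PR).
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```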
Regularization
  • weight decay: parameters: {"weight_decay":0.04}
Compression
  • zstd (level: null)

Novel Contributions

  • Quantization-aware training with STE fake-quantization matched to int5/int6 export format
  • Neural cache during sliding-window evaluation using hidden-state similarity and logaddexp interpolation
  • Per-document rank-8 LoRA test-time training with entropy-gated updates
  • Stacking QAT, neural cache, and LoRA TTT on top of the Int5-MLP + BigramHash + SWA recipe