PR #304 (open)

Non-record: QAT + Neural Cache + LoRA TTT

by Bortlesboat
val_bpb: 1.4245
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.77 MB

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP layers)
  • STE QAT (bits: 6, scope: attention layers)
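A minimal sketch of the fake-quantization forward pass behind STE QAT, assuming per-tensor symmetric scaling (the PR may scale per-channel instead). During training, the round() is wrapped in a straight-through estimator so the backward pass treats it as identity:

```python
def fake_quant(weights, bits):
    # Symmetric fake quantization: snap each weight to a signed integer grid
    # matched to the export format, then immediately dequantize. In QAT the
    # round() gets a straight-through (identity) gradient.
    qmax = 2 ** (bits - 1) - 1              # int5 -> 15, int6 -> 31
    qmin = -(2 ** (bits - 1))               # int5 -> -16, int6 -> -32
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [min(max(round(w / scale), qmin), qmax) for w in weights]
    return [qi * scale for qi in q], scale
```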
Architecture
  • BigramHash: added as part of the training recipe. parameters: {"size":10240,"dim":128}
  • SmearGate: included in the model architecture. parameters: null
  • MLP3x: 3x-width MLP block. parameters: {"hidden_size":1536}
  • KV head count: grouped-query attention with fewer KV heads than attention heads. parameters: {"layers":10,"dim":512,"heads":8,"kv_heads":4}
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02})
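Muon's distinguishing step is orthogonalizing the momentum matrix before applying the update. A dependency-free sketch using the classical cubic Newton-Schulz iteration (production Muon uses a tuned quintic polynomial; the momentum and weight-decay bookkeeping are omitted):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def orthogonalize(G, steps=25):
    # Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X drives a matrix
    # toward its nearest orthogonal factor. Normalizing by the Frobenius
    # norm first puts all singular values in (0, 1], where the cubic
    # iteration converges.
    norm = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / norm for g in row] for row in G]
    for _ in range(steps):
        XXt = matmul(X, [list(c) for c in zip(*X)])
        cube = matmul(XXt, X)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, cube)]
    return X
```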
Weight Averaging
  • SWA: parameters: {"start_fraction":0.6,"interval_steps":50,"checkpoints":24}
Evaluation
  • sliding window eval: parameters: {"stride":64}
  • neural cache: parameters: {"hidden_state_dim":512,"dtype":"bf16","interpolation":"logaddexp"}
Test-Time Training
  • LoRA TTT: parameters: {"rank":8}
Initialization
  • orthogonal init: used for model components.
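In PyTorch this is typically `torch.nn.init.orthogonal_`, which draws a Gaussian matrix and takes its QR factor. A dependency-free sketch of the same idea via modified Gram-Schmidt:

```python
import random

def orthogonal_init(n, seed=0):
    # Build an n x n orthogonal matrix: draw Gaussian rows, project each one
    # against the already-accepted rows (modified Gram-Schmidt), normalize.
    rng = random.Random(seed)
    Q = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in Q:
            d = sum(r * qi for r, qi in zip(row, q))
            row = [r - d * qi for r, qi in zip(row, q)]
        norm = sum(r * r for r in row) ** 0.5
        Q.append([r / norm for r in row])
    return Q
```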
Sequence Length
  • sequence_length: train_length: 2048, eval_length: 2048
LR Schedule
  • warmdown: parameters: null
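The PR lists no warmdown parameters; a common shape is a constant learning rate followed by a linear decay to zero over a final slice of training. The 300-step window below is hypothetical:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=300):
    # Hold base_lr, then decay linearly to zero over the final
    # warmdown_steps (window length is hypothetical; not from the PR).
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```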
Regularization
  • weight decay: parameters: {"weight_decay":0.04}
Compression
  • zstd (level: null)

Novel Contributions

  • Quantization-aware training with STE fake-quantization matched to int5/int6 export format
  • Neural cache during sliding-window evaluation using hidden-state similarity and logaddexp interpolation
  • Per-document rank-8 LoRA test-time training with entropy-gated updates
  • Stacking QAT, neural cache, and LoRA TTT on top of the Int5-MLP + BigramHash + SWA recipe