PR #1457

open

Record: LoRA TTT on GPTQ — val_bpb 1.1454 (10min_16mb)

by DilpreetBansi
val_bpb: 1.1454
Architecture: Transformer
Optimizer: SGD
Artifact Size: 16.0 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: MLP+attn)
  • int8 (bits: 8, scope: embeddings)
Architecture
  • GQA: grouped query attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
  • LeakyReLU: LeakyReLU-squared activation in the MLP
  • XSA: exclusive self-attention on all layers; parameters: {"layers":11}
  • BigramHash: bigram hash embedding; parameters: {"buckets":3072,"dim":112}
  • Partial RoPE: partial rotary position embeddings; parameters: {"dimensions":16}
  • SmearGate: SmearGate mechanism
  • VE128: value embedding with 128 dimensions; parameters: {"dim":128,"layers":[9,10]}
Weight Averaging
  • EMA; parameters: {"decay":0.997}
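A minimal sketch of EMA weight averaging with the record's decay of 0.997; the parameter names and the dict-of-weights representation are illustrative, since the PR's training code is not shown here.

```python
# Exponential moving average of model weights, decay = 0.997 (as listed above).
EMA_DECAY = 0.997

def ema_update(ema_weights, model_weights, decay=EMA_DECAY):
    """Blend the current model weights into the running EMA copy."""
    return {name: decay * ema_weights[name] + (1.0 - decay) * w
            for name, w in model_weights.items()}

# Usage: call once after each optimizer step.
model = {"w": 1.0}
ema = dict(model)
model["w"] = 2.0              # pretend a training step moved the weight
ema = ema_update(ema, model)  # ema["w"] = 0.997*1.0 + 0.003*2.0 = 1.003
```

With decay this close to 1, the averaged weights track the training trajectory with a long memory, smoothing out late-training noise.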
Compression
  • lzma (level: 9)
Evaluation
  • sliding window eval; parameters: {"stride":64,"context_length":4096}
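One plausible reading of the sliding-window parameters above: each window supplies up to 4096 tokens of context, but only the final `stride` tokens are scored, so every token is scored exactly once with near-maximal context. A sketch of the index bookkeeping (the scoring call itself is omitted):

```python
# Sliding-window evaluation spans: stride=64, context_length=4096 as listed.
def sliding_window_spans(n_tokens, context_length=4096, stride=64):
    """Yield (window_start, window_end, first_scored) triples.

    Each window sees up to `context_length` tokens of history; only the
    final `stride` tokens are scored, so every token is scored once.
    """
    spans = []
    for scored_start in range(0, n_tokens, stride):
        scored_end = min(scored_start + stride, n_tokens)
        window_start = max(0, scored_end - context_length)
        spans.append((window_start, scored_end, scored_start))
    return spans

spans = sliding_window_spans(200, context_length=128, stride=64)
# Scored regions tile the sequence exactly once.
assert sum(end - first for _, end, first in spans) == 200
```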
Test-Time Training
  • LoRA TTT; parameters: {"rank":8,"learning_rate":0.01,"momentum":0.9,"weight_decay":0.01,"epochs":3,"chunk_tokens":32768}
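The LoRA constraint keeps the frozen (dequantized) GPTQ weight fixed and trains only a rank-8 delta, so the adapted weight stays near the quantized solution. A NumPy sketch of the parameterization; shapes and initialization are illustrative, not the record's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8              # rank matches the record's setting
W = rng.standard_normal((d_out, d_in))     # frozen dequantized base weight
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))                # zero-init: adapted weight starts at W

def adapted(W, A, B):
    """Effective weight during TTT; only A and B receive gradients."""
    return W + B @ A

x = rng.standard_normal(d_in)
y0 = adapted(W, A, B) @ x                  # before any TTT step: equals W @ x
assert np.allclose(y0, W @ x)
```

The zero-initialized `B` guarantees the model's outputs are unchanged before the first test-time gradient step.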
Optimizer
  • SGD (weight_decay: 0.01, momentum: 0.9, grad_clip: 1)
LR Schedule
  • cosine decay
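No schedule parameters are listed, so the base learning rate and step count below are placeholders; the cosine-decay shape itself is standard:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step / max(1, total_steps), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```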
Regularization
  • weight decay; parameters: {"value":0.01}
  • LN scale; parameters: {"scale":"1/sqrt(layer+1)"}
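The LN scale rule appears to fix each layer norm's gain by depth rather than learning it; assuming 0-indexed layers, the per-layer values are:

```python
import math

# Fixed LayerNorm gains per the 1/sqrt(layer+1) rule above (0-indexed layers
# assumed; the record does not state the indexing convention).
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(12)]
# layer 0 -> 1.0, layer 3 -> 0.5, deeper layers progressively damped
```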
Sequence Length
  • train_length: 2048
  • eval_length: 4096

Novel Contributions

  • First demonstration of test-time training on GPTQ-quantized models
  • LoRA-constrained TTT to preserve the GPTQ loss basin
  • Score-first TTT protocol with inference-mode scoring before training
  • Extended-context evaluation with 4096-token windows
  • Layer-wise learning-rate decay for LoRA TTT
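The score-first protocol above can be sketched as follows: each chunk is scored in inference mode before any gradient steps are taken on it, so the reported metric never reflects weights already adapted to the evaluated text. The model class and its methods are hypothetical stand-ins for the record's actual code:

```python
class StubModel:
    """Stand-in model that records how many training steps preceded scoring."""
    def __init__(self):
        self.steps_taken = 0

    def score(self, chunk):
        # Inference-mode scoring; returns the step counter as a proxy metric.
        return self.steps_taken

    def train_step(self, chunk):
        # One LoRA-only TTT gradient step (stubbed).
        self.steps_taken += 1

def score_first_ttt(chunks, model, n_epochs=3):
    """Score each chunk BEFORE adapting on it (score-first protocol)."""
    scores = []
    for chunk in chunks:
        scores.append(model.score(chunk))  # scored with pre-adaptation weights
        for _ in range(n_epochs):
            model.train_step(chunk)        # only now train on this chunk
    return scores

# Chunk k is scored after 3*k steps on *earlier* chunks, never on its own text.
assert score_first_ttt(["a", "b", "c"], StubModel()) == [0, 3, 6]
```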