PR #1457

open

Record: LoRA TTT on GPTQ — val_bpb 1.1454 (10min_16mb)

by DilpreetBansi
val_bpb: 1.1454
Architecture: Transformer
Optimizer: SGD
Artifact Size: 16.0 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: MLP+attn)
  • int8 (bits: 8, scope: embeddings)
Architecture
  • GQA: grouped query attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
  • LeakyReLU: LeakyReLU-squared activation in the MLP
  • XSA: exclusive self-attention on all layers; parameters: {"layers":11}
  • BigramHash: bigram hash embedding; parameters: {"buckets":3072,"dim":112}
  • Partial RoPE: partial rotary position embeddings; parameters: {"dimensions":16}
  • SmearGate: SmearGate mechanism
  • VE128: value embedding with 128 dimensions; parameters: {"dim":128,"layers":[9,10]}
Weight Averaging
  • EMA; parameters: {"decay":0.997}
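A minimal sketch of EMA weight averaging with the record's decay of 0.997; the parameter names and the dict-of-weights representation are illustrative, since the PR's training code is not shown here.

```python
# Exponential moving average of model weights, decay = 0.997 (as listed above).
EMA_DECAY = 0.997

def ema_update(ema_weights, model_weights, decay=EMA_DECAY):
    """Blend the current model weights into the running EMA copy."""
    return {name: decay * ema_weights[name] + (1.0 - decay) * w
            for name, w in model_weights.items()}

# Usage: call once after each optimizer step.
model = {"w": 1.0}
ema = dict(model)
model["w"] = 2.0              # pretend a training step moved the weight
ema = ema_update(ema, model)  # ema["w"] = 0.997*1.0 + 0.003*2.0 = 1.003
```

With decay this close to 1, the averaged weights track the training trajectory with a long memory, smoothing out late-training noise.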
Compression
  • lzma (level: 9)
Evaluation
  • sliding window eval; parameters: {"stride":64,"context_length":4096}
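One plausible reading of the sliding-window parameters above: each window supplies up to 4096 tokens of context, but only the final `stride` tokens are scored, so every token is scored exactly once with near-maximal context. A sketch of the index bookkeeping (the scoring call itself is omitted):

```python
# Sliding-window evaluation spans: stride=64, context_length=4096 as listed.
def sliding_window_spans(n_tokens, context_length=4096, stride=64):
    """Yield (window_start, window_end, first_scored) triples.

    Each window sees up to `context_length` tokens of history; only the
    final `stride` tokens are scored, so every token is scored once.
    """
    spans = []
    for scored_start in range(0, n_tokens, stride):
        scored_end = min(scored_start + stride, n_tokens)
        window_start = max(0, scored_end - context_length)
        spans.append((window_start, scored_end, scored_start))
    return spans

spans = sliding_window_spans(200, context_length=128, stride=64)
# Scored regions tile the sequence exactly once.
assert sum(end - first for _, end, first in spans) == 200
```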
Test-Time Training
  • LoRA TTT; parameters: {"rank":8,"learning_rate":0.01,"momentum":0.9,"weight_decay":0.01,"epochs":3,"chunk_tokens":32768}
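The LoRA constraint keeps the frozen (dequantized) GPTQ weight fixed and trains only a rank-8 delta, so the adapted weight stays near the quantized solution. A NumPy sketch of the parameterization; shapes and initialization are illustrative, not the record's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8              # rank matches the record's setting
W = rng.standard_normal((d_out, d_in))     # frozen dequantized base weight
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))                # zero-init: adapted weight starts at W

def adapted(W, A, B):
    """Effective weight during TTT; only A and B receive gradients."""
    return W + B @ A

x = rng.standard_normal(d_in)
y0 = adapted(W, A, B) @ x                  # before any TTT step: equals W @ x
assert np.allclose(y0, W @ x)
```

The zero-initialized `B` guarantees the model's outputs are unchanged before the first test-time gradient step.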
Optimizer
  • SGD (weight_decay: 0.01, momentum: 0.9, grad_clip: 1)
LR Schedule
  • cosine decay
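No schedule parameters are listed, so the base learning rate and step count below are placeholders; the cosine-decay shape itself is standard:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step / max(1, total_steps), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```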
Regularization
  • weight decay; parameters: {"value":0.01}
  • LN scale; parameters: {"scale":"1/sqrt(layer+1)"}
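The LN scale rule appears to fix each layer norm's gain by depth rather than learning it; assuming 0-indexed layers, the per-layer values are:

```python
import math

# Fixed LayerNorm gains per the 1/sqrt(layer+1) rule above (0-indexed layers
# assumed; the record does not state the indexing convention).
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(12)]
# layer 0 -> 1.0, layer 3 -> 0.5, deeper layers progressively damped
```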
Sequence Length
  • train_length: 2048
  • eval_length: 4096

Novel Contributions

  • First demonstration of test-time training on GPTQ-quantized models
  • LoRA-constrained TTT to preserve the GPTQ loss basin
  • Score-first TTT protocol with inference-mode scoring before training
  • Extended-context evaluation with 4096-token windows
  • Layer-wise learning-rate decay for LoRA TTT
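The score-first protocol above can be sketched as follows: each chunk is scored in inference mode before any gradient steps are taken on it, so the reported metric never reflects weights already adapted to the evaluated text. The model class and its methods are hypothetical stand-ins for the record's actual code:

```python
class StubModel:
    """Stand-in model that records how many training steps preceded scoring."""
    def __init__(self):
        self.steps_taken = 0

    def score(self, chunk):
        # Inference-mode scoring; returns the step counter as a proxy metric.
        return self.steps_taken

    def train_step(self, chunk):
        # One LoRA-only TTT gradient step (stubbed).
        self.steps_taken += 1

def score_first_ttt(chunks, model, n_epochs=3):
    """Score each chunk BEFORE adapting on it (score-first protocol)."""
    scores = []
    for chunk in chunks:
        scores.append(model.score(chunk))  # scored with pre-adaptation weights
        for _ in range(n_epochs):
            model.train_step(chunk)        # only now train on this chunk
    return scores

# Chunk k is scored after 3*k steps on *earlier* chunks, never on its own text.
assert score_first_ttt(["a", "b", "c"], StubModel()) == [0, 3, 6]
```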