PR #2128

open

Non-record: Post-Quantization LoRA Distillation (LCQ) on PR #1855 stack, val_bpb=1.06767

val_bpb: 1.0677
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,912,974 bytes

Training Techniques

Weight Averaging
EMA
parameters: {"decay":0.9965}
Quantization
GPTQ
bits: 6
scope: attn/MLP; embeddings int7
mixed int6/int7
bits: null
scope: attn/MLP int6, embeddings int7
Architecture
SmearGate
BOS-fixed smear gate used in the base stack
parameters: null
Gated Attention
Sparse/gated attention path in the base stack
parameters: {"gate_scale":0.5}
Depth Recurrence
Depth recurrence / TTT-style infrastructure in the base stack
parameters: null
Evaluation
Sliding Window Eval
parameters: {"stride":64}
Test-Time Training
LoRA TTT
parameters: {"rank":4,"learning_rate":0.001}
Other
Post-Quantization LoRA Distillation (LCQ)
LoRA trained after quantization with a KL-divergence loss against the pre-quantization BF16 teacher's logits, using TRAIN data only
parameters: {"rank":4,"time_s":60}
Variable-Length Attention Dispatch
cu_seqlens-aware variable-length attention dispatch for LoRA-augmented evaluation
parameters: null
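
The sliding-window evaluation (stride 64) and the cu_seqlens bookkeeping listed above can be illustrated with a small sketch. This is not the PR's code: the function names, the context length of 128, and the choice to pack one window per variable-length segment are assumptions made for illustration. It only shows how windows advance by the stride, score each token exactly once, and how cumulative sequence offsets (the usual meaning of cu_seqlens in variable-length attention kernels) are derived from segment lengths.

```python
from itertools import accumulate


def sliding_windows(n_tokens, context, stride):
    """Plan sliding-window evaluation over a token stream.

    Returns a list of (start, end, score_from) tuples. The model sees
    tokens [start, end) and only tokens [score_from, end) are scored,
    so every token is scored exactly once with as much left context
    as the window allows. `context` is a hypothetical window length.
    """
    windows = []
    start = 0
    scored_to = 0
    while scored_to < n_tokens:
        end = min(start + context, n_tokens)
        windows.append((start, end, scored_to))
        scored_to = end
        start += stride
    return windows


def cu_seqlens(lengths):
    """Cumulative offsets of variable-length segments packed back to back.

    Segment i occupies positions [offsets[i], offsets[i + 1]) in the
    flat token buffer, so attention can be restricted to each segment.
    """
    return [0] + list(accumulate(lengths))


# Example: 200 tokens, assumed context of 128, stride 64 as in the card.
plan = sliding_windows(200, context=128, stride=64)
print(plan)        # [(0, 128, 0), (64, 192, 128), (128, 200, 192)]
print(cu_seqlens([end - start for start, end, _ in plan]))  # [0, 128, 256, 328]
```

After the first window, each new window scores only the tokens that its predecessor did not reach (at most 64 here), which is why a smaller stride gives each scored token more context at the cost of more forward passes.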

Novel Contributions

  • Post-Quantization LoRA Distillation (LCQ) after GPTQ quantization
  • KL distillation of a rank-4 LoRA against pre-quantization BF16 teacher logits on TRAIN data only
  • cu_seqlens-aware variable-length attention support for LoRA-augmented sliding-window evaluation
  • Keeping the trained LoRA in memory across the train-to-eval boundary within the same process
  • Negative-result analysis: LCQ recovers little beyond plain sliding-window evaluation, and its distillation time comes out of the main training budget
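
The LCQ objective described in the list above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the helper names are hypothetical, the KL direction (teacher relative to student) is assumed because the card does not state it, and the LoRA is applied to a single toy weight matrix rather than to the real model. The frozen quantized weight is corrected by a low-rank product B·A, and the loss compares the resulting student logits with the BF16 teacher's logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) for one token position, in nats.

    The direction is an assumption; the source only says 'KL divergence'.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def lora_logits(w_quant, lora_a, lora_b, x):
    """Compute (W_q + B @ A) @ x without materializing B @ A.

    w_quant: frozen dequantized weight, shape (out, in)
    lora_a:  trainable down-projection, shape (rank, in)
    lora_b:  trainable up-projection, shape (out, rank)
    """
    hidden = [sum(a * xi for a, xi in zip(row, x)) for row in lora_a]
    out = []
    for w_row, b_row in zip(w_quant, lora_b):
        base = sum(w * xi for w, xi in zip(w_row, x))
        delta = sum(b * h for b, h in zip(b_row, hidden))
        out.append(base + delta)
    return out


# Toy example: with B initialized to zero the LoRA is a no-op, so the
# student reproduces the quantized model exactly and the loss measures
# only the quantization error relative to the teacher.
x = [2.0, 3.0]
teacher = [2.1, 2.9]                       # stand-in for BF16 teacher logits
student = lora_logits([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], [[0.0], [0.0]], x)
loss = kl_teacher_student(teacher, student)
```

In a real run the loss would be averaged over token positions drawn from the training set and minimized with respect to A and B only, leaving the quantized weights untouched.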