PR #2128

open

Non-record: Post-Quantization LoRA Distillation (LCQ) on PR #1855 stack, val_bpb=1.06767

val_bpb

1.0677

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,912,974 bytes

Training Techniques

Weight Averaging

EMA

parameters: {"decay":0.9965}
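
For reference, an EMA of the weights with the decay listed above can be sketched as follows. This is a minimal illustration on plain Python floats, not the PR's training code; `ema_update` and the single-entry `params` dict are hypothetical names used only here.

```python
def ema_update(ema, params, decay=0.9965):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema


# Toy usage: one scalar "weight" that moves by 1.0 per step.
params = {"w": 1.0}
ema = dict(params)  # start the average from the initial weights
for step in range(3):
    params["w"] += 1.0  # stand-in for an optimizer step
    ema_update(ema, params)
```

With a decay this close to 1, the averaged weight trails the raw weight and changes slowly, which is the smoothing effect weight averaging relies on.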

Quantization

GPTQ

bits: 6

scope: attn/MLP; embeddings int7

mixed int6/int7

bits: null

scope: attn/MLP int6, embeddings int7
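
To make the int6/int7 bit widths concrete, here is a sketch of symmetric round-to-nearest quantization of one weight row. Note that this is plain round-to-nearest, not GPTQ (which additionally compensates rounding error using second-order statistics); the per-row scale and the function names are assumptions for illustration.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row of floats."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 63 for int7
    scale = max(abs(x) for x in row) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integer codes back to floats."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q6, s6 = quantize_row(row, bits=6)  # attn/MLP-style width
q7, s7 = quantize_row(row, bits=7)  # embedding-style width
```

The extra bit halves the step size, so the int7 reconstruction error bound (`scale / 2`) is roughly half that of int6 for the same row.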

Architecture

SmearGate

BOS-fixed smear gate used in the base stack

parameters: null

Gated Attention

Sparse/gated attention path in the base stack

parameters: {"gate_scale":0.5}

depth recurrence

Depth recurrence / TTT-style infrastructure in the base stack

parameters: null

Evaluation

sliding window eval

parameters: {"stride":64}
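
The window bookkeeping behind a stride-64 sliding-window evaluation can be sketched as below. The context length `ctx_len` is not stated in this summary and is a hypothetical parameter; `sliding_windows` is an illustrative helper, not a function from the PR.

```python
def sliding_windows(n_tokens, ctx_len, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions score_from..end-1
    contribute to the loss, so every token is scored exactly once.
    """
    windows = []
    scored_to = 0
    end = min(ctx_len, n_tokens)
    while True:
        start = max(0, end - ctx_len)
        windows.append((start, end, scored_to))
        scored_to = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return windows


windows = sliding_windows(n_tokens=200, ctx_len=128, stride=64)
```

After the first window, each scored token has at least `ctx_len - stride` tokens of preceding context, at the cost of roughly `ctx_len / stride` forward passes per token.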

Test-Time Training

LoRA TTT

parameters: {"rank":4,"learning_rate":0.001}
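
As a reminder of what a rank-4 LoRA adds to a frozen linear layer, the forward pass can be sketched with plain lists. Which layers are adapted, how `A` and `B` are initialized, and the `scale` factor are not given in this summary, so those are assumptions here.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_linear(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x), with W frozen.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # frozen (quantized) weight
A = [[1.0, 1.0, 1.0] for _ in range(4)]          # rank 4, d_in = 3
B = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.25]]  # d_out = 2
y = lora_linear(W, A, B, [1.0, 2.0, 3.0])
```

Only `A` and `B` would be trained, so the number of updated parameters is `rank * (d_in + d_out)` per adapted layer.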

Other

other

Post-quantization LoRA distillation (LCQ) using KL divergence against pre-quantization BF16 teacher logits on TRAIN data only

parameters: {"rank":4,"time_s":60}
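
The distillation objective can be illustrated with a per-position KL divergence between teacher and student logits. The summary does not state the KL direction or any temperature, so the forward form KL(teacher || student) at temperature 1 is an assumption, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


teacher = [2.0, 0.5, -1.0]  # stand-in for pre-quantization BF16 logits
student = [1.5, 0.7, -0.8]  # stand-in for quantized model + LoRA logits
loss = kl_teacher_student(teacher, student)
```

In a setup like the one described, this loss would be averaged over token positions in TRAIN batches and backpropagated only into the LoRA parameters, leaving the quantized weights untouched.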

other

cu_seqlens-aware variable-length attention dispatch for LoRA-augmented evaluation

parameters: null
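
For context, `cu_seqlens` conventionally denotes cumulative sequence offsets with a leading 0, so that sequence `i` occupies positions `cu_seqlens[i]:cu_seqlens[i+1]` of a packed token buffer. The sketch below builds such offsets either from known lengths or from BOS positions; deriving boundaries from BOS tokens is an assumption about how packed documents are delimited, and both helpers are hypothetical.

```python
from itertools import accumulate


def cu_seqlens_from_lengths(lengths):
    """Cumulative offsets for sequences packed back to back."""
    return [0] + list(accumulate(lengths))


def cu_seqlens_from_bos(tokens, bos_id):
    """Offsets where each BOS token starts a new sequence.

    A leading partial sequence (no BOS at position 0) is kept as its own
    segment.
    """
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    return starts + [len(tokens)]


offsets = cu_seqlens_from_lengths([3, 5, 2])
bos_offsets = cu_seqlens_from_bos([1, 7, 8, 1, 9], bos_id=1)
```

A variable-length attention kernel that receives these offsets restricts attention to within each segment, which is what keeps tokens from attending across document boundaries inside one evaluation window.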

Post-Quantization LoRA Distillation (LCQ) after GPTQ quantization
KL distillation of a rank-4 LoRA against pre-quantization BF16 teacher logits on TRAIN data only
cu_seqlens-aware variable-length attention support for LoRA-augmented sliding-window evaluation
Keeping the trained LoRA in memory across the train-to-eval boundary within the same process
Negative-result analysis showing that LCQ recovers little beyond plain sliding-window evaluation and consumes part of the main training budget