PR #1797

open

Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157

by dexhunter
val_bpb
1.0616
Architecture
Transformer
Optimizer
SGD
Artifact Size
15.95MB

Training Techniques

Architecture
SmearGate
Causal content-conditioned gate over the last 12 tokens of the residual stream.
parameters: {"window":12}
Gated Attention
Learned scalar gating on attention outputs with quant-gate enabled.
parameters: null
depth recurrence
Loop4-5 recurrent depth structure with parallel residuals.
parameters: {"loop_start":3,"loop_end":5,"parallel_start_layer":8}
KV head count
Grouped-query attention: 8 query heads share 4 KV heads, so each KV head serves 2 query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
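To make the head sharing concrete, here is a minimal sketch of how query heads could map onto KV heads with the listed values. The contiguous grouping (neighbouring query heads share one KV head) and the helper name `kv_head_for` are assumptions for illustration; the PR's actual layout is not shown on this page.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Return the KV head index used by a given query head.

    Assumes contiguous grouping: query heads are split into equal-sized
    blocks, one block per KV head.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_heads:
        raise ValueError("query_head out of range")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return query_head // group_size


# Mapping for the configuration listed above (8 query heads, 4 KV heads).
mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With this ratio the K and V projections (and any KV cache) are half the size they would be with one KV head per query head.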
RoPE
Rotary positional embeddings with base 10000, applied over 16 dimensions.
parameters: {"base":10000,"dimensions":16}
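As an illustration of what `base` and `dimensions` control, the sketch below rotates only the first 16 entries of a single head vector and leaves the rest untouched. It assumes that `dimensions` is a per-head rotary width and that adjacent entries are paired (the interleaved convention); the PR may pair dimensions differently, and the function names are hypothetical.

```python
import math


def rope_frequencies(rot_dims=16, base=10000.0):
    """One rotation frequency per pair of dimensions: base^(-2i / rot_dims)."""
    return [base ** (-2.0 * i / rot_dims) for i in range(rot_dims // 2)]


def apply_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec by position-dependent angles.

    Assumption: pairs are (0, 1), (2, 3), ... ; entries past rot_dims are
    passed through unchanged (partial rotary embedding).
    """
    if len(vec) < rot_dims:
        raise ValueError("vector is shorter than the rotary width")
    out = list(vec)
    for i, freq in enumerate(rope_frequencies(rot_dims, base)):
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the distance between their positions.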
Quantization
GPTQ
bits: 6
scope: MLP
Test-Time Training
score-first TTT
parameters: {"phased":true,"prefix_docs":2000,"num_phases":3}
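The "score-first" ordering can be illustrated with a toy stand-in: each chunk is scored with the state as it was before that chunk was seen, and only afterwards is the state updated on it. The adaptive byte-frequency model below is purely an assumption for illustration and is not the PR's model; the phasing parameters (`prefix_docs`, `num_phases`) are not modelled here.

```python
import math
from collections import Counter


def score_first_adapt(chunks, alphabet_size=256):
    """Return mean bits per byte under a score-then-update protocol.

    Toy model: Laplace-smoothed byte frequencies. The point is the order
    of operations, not the model itself.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    total_bytes = 0
    for chunk in chunks:
        # 1) Score the chunk with the state frozen (no peeking at the chunk).
        for byte in chunk:
            prob = (counts[byte] + 1) / (seen + alphabet_size)
            total_bits += -math.log2(prob)
        total_bytes += len(chunk)
        # 2) Only now adapt the state on the chunk that was just scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_bits / total_bytes if total_bytes else 0.0


bpb = score_first_adapt([b"aaaa", b"aaaa"])
```

In this toy run the first chunk costs 8 bits per byte because nothing has been learned yet, while the second chunk is cheaper because the state adapted on the first one after scoring it.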
Evaluation
sliding window eval
parameters: {"stride":64}
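A sliding-window evaluation with these settings can be planned as follows. This is a sketch under assumptions: the window length is taken from the listed `eval_length` of 2048, the first window scores all of its tokens, and every later window scores only the tokens not yet scored. The function name and tuple layout are hypothetical.

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Plan evaluation windows as (ctx_start, end, score_start) tuples.

    The model would see tokens [ctx_start, end) and the loss would be
    accumulated only over [score_start, end), so each token is scored once.
    """
    plans = []
    scored_upto = 0
    end = min(window, num_tokens)
    while scored_upto < num_tokens:
        ctx_start = max(0, end - window)
        plans.append((ctx_start, end, scored_upto))
        scored_upto = end
        end = min(end + stride, num_tokens)
    return plans


plans = sliding_windows(5000)
```

With a stride of 64 and a 2048-token window, every token after the first window is predicted with at least 1984 tokens of preceding context, at the cost of roughly 32 times more forward passes than non-overlapping windows.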
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
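A logit softcap is commonly written as `cap * tanh(logit / cap)`. The sketch below assumes that form with the listed value of 30; the PR's exact expression is not shown on this page.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly squash each logit into the range [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]


capped = softcap([-500.0, -3.0, 0.0, 3.0, 500.0])
```

Small logits pass through almost unchanged, while extreme logits saturate near plus or minus 30, which bounds how confident any single prediction can become.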
Other
other
LQER asymmetric rank-4 correction for int6 MLP rows, with int4 factors and per-group-64 asymmetric scaling.
parameters: {"rank":4,"group_size":64}
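The "per-group-64 asymmetric scaling" part can be sketched in isolation: each group of 64 values gets its own scale and offset, and values are stored as unsigned 4-bit codes. This sketch covers only the group quantizer, not the rank-4 factorization of the quantization error, and the min/max calibration and function names are assumptions.

```python
def quantize_asym(values, bits=4, group_size=64):
    """Quantize a flat list of floats group by group.

    Returns one (scale, offset, codes) tuple per group, where
    value is approximated by offset + scale * code and 0 <= code < 2**bits.
    """
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # Asymmetric: the grid spans [lo, hi] rather than [-max, +max].
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [min(levels, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_asym(groups):
    """Reconstruct approximate floats from (scale, offset, codes) groups."""
    out = []
    for scale, lo, codes in groups:
        out.extend(lo + scale * c for c in codes)
    return out
```

Storing an offset per group lets the 16 available levels cover exactly the range the group occupies, which helps when the values are not centred on zero.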

Novel Contributions

  • SmearGate over the last 12 residual tokens
  • LQER asymmetric rank-4 correction for int6 MLP quantization
  • Stacking SmearGate and LQER on top of the PR #1787 base stack
  • 3-seed mean val_bpb of 1.06157