PR #1797

open

Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157

by dexhunter

val_bpb

1.0616

Architecture

Transformer

Optimizer

SGD

Artifact Size

15.95MB

Training Techniques

Architecture

SmearGate

Causal content-conditioned gate over the last 12 tokens of the residual stream.

parameters: {"window":12}

Gated Attention

Learned scalar gating on attention outputs with quant-gate enabled.

parameters: null

depth recurrence

Loop4-5 recurrent depth structure with parallel residuals.

parameters: {"loop_start":3,"loop_end":5,"parallel_start_layer":8}
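The card does not say how the loop is unrolled, so the following is only a rough sketch of one common reading: layers `loop_start` through `loop_end` are executed more than once, and blocks from `parallel_start_layer` onward feed attention and MLP from the same input. The inclusive, 0-indexed range, the layer count of 11, and `num_loops=2` are all assumptions made for illustration, as are the function names.

```python
def layer_schedule(num_layers, loop_start, loop_end, num_loops):
    """Return the order in which layer indices are executed.

    Assumption: loop_start and loop_end are 0-indexed and inclusive, and the
    looped segment is simply run num_loops times in a row.
    """
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1)) * num_loops
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


def block(x, attn, mlp, parallel):
    """Apply one residual block to x.

    parallel=True:  attention and MLP both read the same input x.
    parallel=False: the MLP reads the output of the attention residual.
    """
    if parallel:
        return x + attn(x) + mlp(x)
    h = x + attn(x)
    return h + mlp(h)


# Hypothetical 11-layer model with the card's loop parameters.
schedule = layer_schedule(num_layers=11, loop_start=3, loop_end=5, num_loops=2)
uses_parallel = [layer >= 8 for layer in schedule]  # parallel_start_layer = 8
```

Under these assumptions the looped layers add depth at run time without adding parameters, which matters when the artifact size is capped.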

KV head count

Grouped-query style attention with fewer KV heads than attention heads.

parameters: {"num_heads":8,"num_kv_heads":4}
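With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of that mapping follows; it assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on), which is a common convention but is not confirmed by the card. The function name is hypothetical.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it reads from.

    Assumption: query heads are grouped contiguously.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_heads:
        raise ValueError("query_head out of range")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```

Compared with one KV head per query head, this configuration halves the number of key and value projections that must be stored.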

RoPE

Rotary positional embedding configuration.

parameters: {"base":10000,"dimensions":16}
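One plausible reading of these parameters is partial rotary embedding: only the first 16 dimensions of each head vector are rotated, with frequencies derived from base 10000, and the remaining dimensions pass through unchanged. The sketch below assumes that reading and an interleaved pairing of dimensions (0 with 1, 2 with 3, and so on); the submission may pair dimensions differently.

```python
import math


def rope(vec, pos, base=10000.0, rot_dims=16):
    """Rotate the first rot_dims entries of vec by position-dependent angles.

    Assumptions: interleaved pairs (2i, 2i+1), len(vec) >= rot_dims, and
    entries beyond rot_dims are left untouched.
    """
    if len(vec) < rot_dims or rot_dims % 2 != 0:
        raise ValueError("vec too short or rot_dims not even")
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on their relative offset.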

Quantization

GPTQ

bits: 6

scope: MLP

Test-Time Training

score-first TTT

parameters: {"phased":true,"prefix_docs":2000,"num_phases":3}

Evaluation

sliding window eval

parameters: {"stride":64}
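A sliding-window evaluation with stride 64 can be sketched as follows, assuming the window length equals the eval_length of 2048 listed under Sequence Length. Each window advances by 64 tokens, and only the tokens not scored by an earlier window are counted, so every token is scored exactly once with as much left context as the window allows. The function name and the tuple layout are illustrative.

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples.

    The model sees tokens [ctx_start, ctx_end) and only the predictions for
    tokens [score_start, ctx_end) contribute to the metric.
    """
    if num_tokens <= 0:
        return []
    spans = []
    scored_upto = 0
    end = min(window, num_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= num_tokens:
            break
        end = min(end + stride, num_tokens)
    return spans
```

A smaller stride gives each scored token more context on average, at the cost of more forward passes per document.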

Sequence Length

sequence_length

train_length: null

eval_length: 2048

Regularization

logit softcap

parameters: {"value":30}
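Logit softcapping is usually implemented as a scaled tanh, which keeps small logits nearly unchanged while bounding large ones. A minimal sketch with the cap of 30 from the card, assuming that tanh form is the one used here:

```python
import math


def softcap(logits, cap=30.0):
    """Squash each logit smoothly into the interval [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]
```

For logits much smaller than the cap the function is close to the identity, so it mainly limits extreme values.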

Other

other

LQER asymmetric rank-4 correction for int6 MLP rows, with int4 factors and per-group-64 asymmetric scaling.

parameters: {"rank":4,"group_size":64}
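In the LQER approach, the error left by quantizing a weight matrix is approximated by a low-rank product, so the effective weight is the dequantized matrix plus A·B, here with rank 4. The sketch below illustrates two pieces under stated assumptions: asymmetric (min/max) quantization of a flat list of factor values to 4 bits in groups of 64, and applying a low-rank correction to a dequantized matrix. How the factors are fitted is omitted, and all function names are hypothetical.

```python
def quantize_asym(values, bits=4, group_size=64):
    """Quantize values per group with a scale and a minimum (zero offset).

    Returns a list of (codes, scale, lo) tuples, one per group.
    """
    levels = (1 << bits) - 1
    groups = []
    for g in range(0, len(values), group_size):
        chunk = values[g:g + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [min(levels, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_asym(groups):
    """Invert quantize_asym up to rounding error."""
    out = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


def apply_correction(w_deq, a, b):
    """Return w_deq + a @ b, with a of shape (rows, r) and b of shape (r, cols)."""
    r = len(b)
    return [[w_deq[i][j] + sum(a[i][k] * b[k][j] for k in range(r))
             for j in range(len(w_deq[0]))]
            for i in range(len(w_deq))]
```

Asymmetric scaling spends the 16 available levels on the actual range of each group rather than on a range centered at zero, which helps when a group's values are skewed.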

Novel Contributions

SmearGate over the last 12 residual tokens
LQER asymmetric rank-4 correction for int6 MLP quantization
Stacking SmearGate and LQER on top of the PR #1787 base stack
3-seed mean val_bpb of 1.06157