PR #1945
Record: PR #1908 base + AWQ-lite + Asymmetric Logit Rescale - val_bpb 1.05932 (3-seed mean)
by alertcat
val_bpb
1.0593
Architecture
Transformer
Optimizer
—
Artifact Size
15,986,941 bytes
Training Techniques
Quantization
AWQ-lite
bits: 6
scope: all
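The record does not include the quantization code, but the listed settings (6-bit, scope: all) suggest a per-channel fake-quantization pass over every linear weight. A minimal sketch, assuming symmetric per-output-channel scales and omitting the calibration step an AWQ/GPTQ-style method would add:

```python
import torch

def quantize_dequantize_6bit(weight: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-output-channel fake quantization: round to signed
    # `bits`-bit integers, then map back to floats. Calibration-aware scale
    # selection (the "AWQ/GPTQ" part) is omitted in this sketch.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(weight / scale).clamp(-qmax - 1, qmax)
    return q * scale

def fake_quantize_model(model: torch.nn.Module, bits: int = 6) -> None:
    # Round-trip every linear weight in a model ("scope: all").
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                module.weight.copy_(quantize_dequantize_6bit(module.weight, bits))
```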
Architecture
SmearGate
Causal residual gate that mixes in a 1-token lookback via a content-conditioned sigmoid gate applied to the first feature dimensions.
parameters: {"window":12}
SparseAttnGate
Sparse per-head multiplicative gate inside attention.
parameters: null
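No parameters are listed, so the following is only a guess at the mechanism: a learnable per-head scalar, passed through a ReLU so individual heads can be gated exactly to zero, multiplied onto the per-head attention output.

```python
import torch
import torch.nn as nn

class SparseAttnGateSketch(nn.Module):
    # Hypothetical: one learnable scalar per attention head; ReLU lets a
    # head's gate reach exactly zero, giving the "sparse" behavior.
    def __init__(self, n_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_heads))

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: (batch, n_heads, time, head_dim)
        return head_out * torch.relu(self.gate).view(1, -1, 1, 1)
```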
Regularization
logit softcap
parameters: {"asymmetric":true,"pos":"softcap_pos","neg":"softcap_neg"}
Test-Time Training
LoRA TTT
parameters: {"phases":3,"score_first":true}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Compression
brotli
level: null
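A sketch of how the artifact could be brotli-compressed after serialization. The compression level is left unset in the record, so `quality=11` (maximum) below is an assumption.

```python
import io
import brotli  # pip install brotli
import torch

def save_compressed(model: torch.nn.Module, path: str, quality: int = 11) -> int:
    # Serialize the state_dict to an in-memory buffer, brotli-compress it,
    # and write the payload to disk.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    payload = brotli.compress(buf.getvalue(), quality=quality)
    with open(path, "wb") as f:
        f.write(payload)
    return len(payload)  # compressed artifact size in bytes
```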
Novel Contributions
- Combines the AWQ-lite quantization from PR #1908 with asymmetric logit rescaling at eval time.
- Shows that asymmetric logit rescaling improves TTT recovery when paired with AWQ-lite quantization.
- Uses eval-only surgical edits to train_gpt.py while preserving the training path.
- Achieves a 3-seed mean val_bpb of 1.05932 under the competition constraints.