PR #1667
openRECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139
by MarioPaerle
val_bpb
1.0714
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.927 MB
Training Techniques
Architecture
SmearGate
Input-dependent SmearGate reintroduced in modded-nanoGPT style.
parameters: {"width":12}
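The record does not spell out the SmearGate equation; a common form in the modded-nanoGPT lineage blends each token with its predecessor under a per-token sigmoid gate. A minimal sketch under that assumption (the gate projection `w` is a hypothetical learned parameter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w):
    """Input-dependent smear: blend each token with the previous token.

    x: (T, D) token activations; w: (D,) gate projection (assumed shape).
    Returns x_t + sigmoid(x_t . w) * x_{t-1}, with x_{-1} taken as zero.
    """
    g = sigmoid(x @ w)                                  # (T,) per-token gate
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + g[:, None] * prev
```

Because the gate depends on `x`, the model can decide per token how much of the previous token to mix in, unlike a fixed smear coefficient.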
Gated Attention
Per-head multiplicative gate applied to the attention outputs.
parameters: {"width":12,"layers":11}
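A per-head output gate multiplies each head's attention output by a sigmoid computed from the layer input. A sketch, assuming the gate is a learned linear projection `Wg` from the residual stream to one logit per head (the record fixes only that the gate is per-head and multiplicative):

```python
import numpy as np

def gated_attention_output(attn_out, x, Wg):
    """Per-head multiplicative gate on attention outputs.

    attn_out: (T, H, Dh) per-head attention outputs
    x:        (T, D) layer input used to compute the gates
    Wg:       (D, H) hypothetical learned gate projection
    """
    g = 1.0 / (1.0 + np.exp(-(x @ Wg)))   # (T, H) sigmoid gate per head
    return attn_out * g[:, :, None]
```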
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
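With 16 of 64 head dimensions rotated, only the first quarter of each head carries positional information and the rest passes through unchanged. A sketch of partial RoPE for a single head (base frequency 10000 is a conventional assumption, not stated in the record):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims of each head dim.

    x: (T, Dh) activations for one head; dims beyond rot_dims are untouched.
    """
    T, Dh = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]       # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Each rotated pair is a 2-D rotation, so norms of the rotated slice are preserved and position 0 is left unchanged.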
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
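With 8 query heads over 4 KV heads, each pair of query heads shares one KV head. The sharing can be implemented by repeating the KV heads along the head axis, as in this score-computation sketch:

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv=4):
    """Grouped-query attention scores.

    q: (H, T, Dh) queries; k: (Hkv, T, Dh) keys. Each group of
    n_heads // n_kv query heads attends to one shared KV head.
    """
    group = n_heads // n_kv
    k_rep = np.repeat(k, group, axis=0)                   # (H, T, Dh)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```

Halving the KV heads halves the KV cache while leaving query capacity intact, which is the usual GQA trade-off.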
LeakyReLU
MLP uses a squared LeakyReLU activation, LeakyReLU(x; 0.5)^2.
parameters: {"negative_slope":0.5}
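The activation squares a LeakyReLU with negative slope 0.5, so negative inputs are attenuated before squaring rather than zeroed as in the plain squared-ReLU used by many modded-nanoGPT records:

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """Squared LeakyReLU: (LeakyReLU(x; slope))^2.

    Positive inputs give x^2; negative inputs give (slope * x)^2,
    so the output is nonnegative but still input-dependent for x < 0.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```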
U-Net skip connections
Sigmoid-gated U-Net-style skip connections.
parameters: null
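A U-Net-style skip carries an early layer's activation to a matching late layer; here the mix is controlled by a sigmoid gate. A sketch, assuming a learned scalar gate logit per connection (the record does not state the gate's granularity):

```python
import numpy as np

def unet_skip(x_deep, x_early, gate_logit):
    """Sigmoid-gated U-Net-style skip connection.

    x_deep: activation at the late layer; x_early: saved activation from
    the paired early layer; gate_logit: hypothetical learned parameter.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return x_deep + g * x_early
```

Initializing `gate_logit` near a large negative value would start the skip nearly closed, letting training open it gradually.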
depth recurrence
Three-layer depth recurrence applied to layers 3-5.
parameters: {"layers":[3,4,5],"activated_frac":0.35}
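Depth recurrence reuses a block of layers more than once per forward pass. A sketch under the assumption that layers 3-5 are run as a repeated block (the loop count of 2 is illustrative, and how `activated_frac: 0.35` selects tokens or steps is not specified in the record):

```python
def depth_recurrent_forward(x, layers, recur_idx=(3, 4, 5), n_loops=2):
    """Run a layer stack, applying the recurrent block n_loops times.

    layers: list of callables; layers at recur_idx form the shared block
    that is iterated instead of being applied once.
    """
    i = 0
    while i < len(layers):
        if i == recur_idx[0]:
            for _ in range(n_loops):          # repeat the shared block
                for j in recur_idx:
                    x = layers[j](x)
            i = recur_idx[-1] + 1             # skip past the block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Weight sharing across the repeated passes adds effective depth at zero parameter cost, which matters under the artifact-size constraint.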
parallel residuals
Parallel residual pathway used in later layers.
parameters: {"start_layer":7}
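From layer 7 onward the record uses parallel residuals: attention and MLP read the same input and their outputs are summed, instead of the MLP consuming the attention output. A minimal sketch:

```python
def parallel_block(x, attn, mlp):
    """Parallel residual block: both sublayers branch off the same input.

    Sequential form would be x + mlp(x + attn(x)); the parallel form
    computes both branches independently and sums them.
    """
    return x + attn(x) + mlp(x)
```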
Regularization
logit softcap
parameters: {"value":30}
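Logit softcapping with value 30 bounds the logits smoothly via tanh while staying near-identity for small values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```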
layerwise LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
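The layerwise scale multiplies each layer's normalization output by 1/sqrt(layer_idx + 1), shrinking contributions from deeper layers. A sketch assuming it wraps a standard LayerNorm (whether the base norm is LayerNorm or RMSNorm is not stated):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-6):
    """LayerNorm output scaled by 1/sqrt(layer_idx + 1).

    Layer 0 gets scale 1; layer 3 gets scale 1/2, and so on, damping
    deeper layers' residual contributions.
    """
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```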
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.026,"ema":0.9965}
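Muon's core step orthogonalizes the (momentum-averaged) gradient of each 2-D weight matrix with a Newton-Schulz iteration before applying it. A sketch of that iteration, using the quintic coefficients from the published Muon implementation (the surrounding momentum/EMA bookkeeping implied by `mlr` and `ema` above is omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a gradient matrix (Muon's core step).

    Normalizes G by its Frobenius norm, then runs a quintic
    Newton-Schulz iteration driving all singular values toward 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The orthogonalized update equalizes the scale of the gradient's singular directions, which is the intuition behind Muon's fast convergence on matrix parameters.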
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
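GPTQ quantizes weights column by column with Hessian-based error compensation; that machinery is beyond a sketch, but the 6-bit grid it targets can be illustrated with plain per-channel round-to-nearest quantization (this shows only the bit budget, not GPTQ's update rule):

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Per-channel symmetric round-to-nearest quantization.

    For bits=6 the integer grid is [-32, 31]; real GPTQ additionally
    redistributes rounding error across remaining columns.
    """
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q * scale
```

Giving embeddings an extra bit (int7) is a common concession, since embedding rows are used in isolation and tolerate rounding error less gracefully than matmul weights.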
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.005,"epochs_per_chunk":3}
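"Score-first" TTT evaluates each chunk under the current weights before adapting on it, so the reported loss never sees the weights that were trained on that chunk. A toy sketch on a linear least-squares model (the model and loss are stand-ins; the record only fixes SGD, lr=0.005, and 3 epochs per chunk):

```python
import numpy as np

def score_first_ttt(chunks, w, lr=0.005, epochs=3):
    """Score-first test-time training loop (toy linear model).

    For each (X, y) chunk: record the MSE under the current weights
    FIRST, then run `epochs` full-batch SGD passes on that chunk.
    """
    losses = []
    for X, y in chunks:
        pred = X @ w
        losses.append(float(np.mean((pred - y) ** 2)))   # score first
        for _ in range(epochs):                          # then adapt
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
    return losses, w
```

The ordering is what makes the TTT "legal": each chunk contributes to the score before its contents influence the weights.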
Novel Contributions
- Reintroduced SmearGate with input dependence
- Added per-head attention output gating
- Combined SmearGate, attention output gating, and score-first TTT in a single record
- Used GPTQ quantization with int6 matrices and int7 embeddings
- Applied Brotli-11 compression with byte-shuffle