PR #1929

Status: open

Record: SP8192 + SLOT scored-position + cross-batch EMA warmup: val_bpb=0.94569

by davie2009kh
val_bpb: 0.9457
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.87 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: model weights)
Architecture
  • Depth recurrence: 3-layer depth recurrence stack ({"layers": 3})
  • Parallel residuals: parallel residual connections in upper layers
  • QK-Gain: scaling applied to attention ({"gain": 5.25})
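The PR does not show how QK-Gain is wired in. One plausible reading, sketched below with assumed shapes and placement: unit-normalize queries and keys per position, then scale their dot products by the fixed gain of 5.25 so attention-logit magnitude is set by the gain rather than by q/k norms. This is an illustration, not the submission's code.

```python
import numpy as np

def qk_gain_attention(q, k, v, gain=5.25):
    """Single-head causal attention with QK-Gain scaling (assumed form):
    queries and keys are unit-normalized, and their dot products are
    multiplied by a fixed scalar gain before the softmax.
    q, k, v: (T, d) arrays for one head."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = gain * (qn @ kn.T)
    # Causal mask: position t may attend only to positions <= t.
    T = q.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    logits = np.where(mask, -np.inf, logits)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because cosine similarity is bounded in [-1, 1], the gain directly caps the logit range at ±5.25 regardless of hidden-state scale.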
Weight Averaging
  • EMA ({"decay": 0.9965})
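With decay 0.9965, the averaged weights track the live weights slowly (roughly a 1/(1-0.9965) ≈ 286-step horizon). A minimal sketch of the per-step update, using lists of floats as a stand-in for parameter tensors:

```python
def ema_update(shadow, params, decay=0.9965):
    """One EMA step over parameter values (elementwise):
    shadow <- decay * shadow + (1 - decay) * params.
    The shadow (averaged) weights are what gets evaluated."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```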
Optimizer
  • Muon ({"Huber weight decay": true})
  • AdamW ({"steps": 24, "learning_rate_start": 0.008, "learning_rate_end": 0.0008})
Test-Time Training
  • score-first TTT ({"per_sample_delta": true, "per_sample_logit_bias": true, "cross_batch_ema_warmup": true, "warmup_decay": 0.5, "scored_positions_only": true})
Evaluation
  • sliding window eval ({"stride": 64})
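With a window of 8192 and stride 64, each evaluation window advances 64 tokens and only the newly covered tokens are scored, so every token is scored exactly once with near-maximal context. A sketch of the span bookkeeping (function name and tuple layout are illustrative):

```python
def sliding_window_spans(n_tokens, window=8192, stride=64):
    """Spans for sliding-window eval. Each tuple is
    (begin, end, score_from): the model sees tokens [begin, end),
    and only tokens in [score_from, end) contribute to val_bpb."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```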
LR Schedule
  • cosine decay ({"start": 0.008, "end": 0.0008})
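The cosine decay from 0.008 to 0.0008 over the 24 SLOT steps can be sketched as below; whether the schedule normalizes progress by `steps` or `steps - 1` is an assumption of this sketch.

```python
import math

def cosine_lr(step, total_steps=24, lr_start=0.008, lr_end=0.0008):
    """Cosine decay from lr_start (step 0) to lr_end (last step)."""
    t = step / max(1, total_steps - 1)  # progress in [0, 1]
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * t))
```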
Sequence Length
  • train_length: 8192, eval_length: 8192
Compression
  • brotli

Novel Contributions

  • Scored-position SLOT applied only to scored past tokens during evaluation
  • Per-sample delta and logit_bias optimization in fp32
  • Cross-batch EMA warmup that carries converged delta/logit_bias means to the next batch
  • AdamW-based SLOT optimization with a 24-step cosine schedule
  • SLOT restricted to eval_val_sliding on the quantized model without changing training