PR #1572
Status: Open
Record: SP8192 + Depth Recurrence x2 + GPTQ + Score-First TTT + fused-softcap-ce -- val_bpb 1.07974 (3-seed mean)
by anthony-maio
val_bpb
1.07974
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are looped twice per forward pass (treated as virtual encoder/decoder stages), adding effective depth without adding parameters.
parameters: {"layers":[3,4,5],"num_loops":2}
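A minimal sketch of the looping, assuming the model exposes its blocks as an ordered list of callables (a hypothetical interface; the record does not publish code):

```python
def forward_with_recurrence(layers, x, looped=(3, 4, 5), num_loops=2):
    # `layers` is an ordered list of callables standing in for transformer
    # blocks.  The contiguous span in `looped` is run `num_loops` times
    # as one unit; all other layers run once, in order.
    span = sorted(looped)
    i = 0
    while i < len(layers):
        if i == span[0]:
            for _ in range(num_loops):
                for ix in span:
                    x = layers[ix](x)
            i = span[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With seven layers this visits blocks in the order 0, 1, 2, 3, 4, 5, 3, 4, 5, 6.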
Quantization
GPTQ
bits: 6
scope: weights at int6; embeddings at int8
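To illustrate the int6 storage format, here is a simplified round-to-nearest symmetric quantizer over one weight row. This is only the bit-packing side: GPTQ itself additionally applies Hessian-weighted error correction while rounding, which is not shown here.

```python
def quantize_rtn(row, bits=6):
    # Symmetric round-to-nearest quantization of one weight row -- a
    # simplified stand-in for the int6 storage that GPTQ produces.
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid div-by-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from int codes.
    return [v * scale for v in q]
```

Round-tripping a row bounds the per-weight error by half a quantization step.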
Weight Averaging
EMA
parameters: {"decay":0.997}
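The EMA update itself is one line per parameter; a sketch with the record's decay of 0.997, over plain float lists standing in for parameter tensors:

```python
def ema_update(avg, new, decay=0.997):
    # One EMA step over parameter values: avg <- decay*avg + (1-decay)*new.
    # `avg` and `new` are flat lists of floats standing in for tensors.
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```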
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"learning_rate":0.005}
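With weight decay and momentum both null, the update reduces to plain SGD; a sketch at the record's learning rate of 0.005:

```python
def sgd_step(params, grads, lr=0.005):
    # Plain SGD with no momentum and no weight decay, matching the
    # record's optimizer config: p <- p - lr * g.
    return [p - lr * g for p, g in zip(params, grads)]
```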
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunks":1238}
Evaluation
sliding window eval
parameters: null
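A sketch of how sliding-window evaluation is typically planned: each window reuses some left context from the previous one, but only not-yet-scored tokens contribute to the loss, so every token is scored exactly once. Window and stride sizes here are illustrative; the record does not state them.

```python
def sliding_window_spans(n_tokens, window=8, stride=4):
    # Plan (ctx_start, ctx_end, score_from) spans: tokens in
    # [ctx_start, score_from) are context only; [score_from, ctx_end)
    # are scored.  Sizes are hypothetical examples.
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans
```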
Compression
lzma
level: null
Regularization
logit softcap
parameters: {"qk_gain":5.25}
Novel Contributions
- SP8192 tokenizer integration
- depth recurrence x2 with looped layers 3-5
- GPTQ quantization with mixed int6 weights and int8 embeddings
- score-first TTT pipeline
- fused-softcap-ce CUDA kernel for faster scoring
- lzma+base85+exec compressed train script shim
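The compressed-script shim in the last bullet can be sketched with the standard library alone: the training script is lzma-compressed, base85-encoded into a string literal, and wrapped in a stub that decodes and `exec`s it at load time. The helper name is hypothetical.

```python
import base64
import lzma

def make_shim(source: str) -> str:
    # Pack a script into a small self-extracting stub.  base85's alphabet
    # contains no quotes or backslashes, so the payload embeds safely in
    # a Python string literal via repr().
    payload = base64.b85encode(lzma.compress(source.encode())).decode()
    return ("import base64,lzma\n"
            f"exec(lzma.decompress(base64.b85decode({payload!r})).decode())")
```

Running the returned stub reproduces the original script's effects, at a fraction of the artifact size for large scripts.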