PR #1001

open

Non-record: Three Approaches + Lessons Learned (best: 1.1188 BPB)

by ibarrajo
val_bpb: 1.1188
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15.3 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: all)
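For orientation, here is a minimal sketch of what a 5-bit integer grid looks like. This is plain symmetric per-row round-to-nearest quantization, not GPTQ itself: GPTQ additionally compensates rounding error column by column using second-order (Hessian) information, which is omitted here. All names are hypothetical stand-ins.

```python
import numpy as np

def quantize_int5_rtn(w):
    """Symmetric per-row round-to-nearest quantization to 5 bits.

    The signed int5 grid is [-16, 15]; we scale so the largest
    magnitude in each row maps to 15. GPTQ's Hessian-based error
    compensation is intentionally omitted in this sketch.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int5_rtn(w)
w_hat = dequantize(q, s)
```

With round-to-nearest, the per-element reconstruction error is bounded by half a quantization step (scale / 2), which is the baseline GPTQ improves on.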
Test-Time Training
  • score-first TTT (parameters: null)
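A minimal sketch of the score-first ordering, assuming "score-first TTT" means the untouched s_0 checkpoint is evaluated before any test-time updates, and only that s_0 score is reported. The `evaluate` and `test_time_update` callables and the toy dict model are hypothetical stand-ins, not the PR's code.

```python
import copy

def score_first_ttt(model, data, evaluate, test_time_update):
    """Record the legal s_0 score before any test-time training,
    then adapt a deep copy of the model. Only s0_score is reported;
    re-scoring the adapted model on the same data would be illegal."""
    s0_score = evaluate(model, data)  # weights untouched so far
    adapted = test_time_update(copy.deepcopy(model), data)
    return s0_score, adapted

# toy stand-ins (hypothetical): squared-error "score", mean-fitting "TTT"
model = {"w": 1.0}
data = [1.0, 2.0, 3.0]
score, adapted = score_first_ttt(
    model, data,
    evaluate=lambda m, d: sum((x - m["w"]) ** 2 for x in d),
    test_time_update=lambda m, d: {**m, "w": sum(d) / len(d)},
)
```

The key property is ordering: the reported score is computed while `model` still holds the s_0 weights, and the adaptation happens on a copy.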
Architecture
  • LeakyReLU: used in the #569-based approach (parameters: null)
  • ReLU²: ReLU squared, used in the #569-based approach (parameters: null)
  • Value Residual: Value Residual-based architecture used in approach A (parameters: null)
  • Gated Attention: used in the #569-based approach (parameters: null)
  • Weight tying: tied embeddings / weight tying in the referenced base models, if applicable (parameters: null)
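To make the architecture entries concrete, here is a minimal NumPy sketch combining several of them in one toy attention layer: LeakyReLU and ReLU² as activation definitions, a value residual mixing this layer's values with the first layer's values, and a sigmoid gate on the attention output. The exact placements, mixing coefficients, and gate formula in the #569-based approach may differ; this illustrates the mechanisms only.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def relu_squared(x):
    return np.maximum(x, 0.0) ** 2

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_with_value_residual(x, v1, Wq, Wk, Wv, Wg):
    """One single-head attention layer (toy, unbatched).

    - value residual: mix this layer's values with the first
      layer's values v1 (fixed 0.5/0.5 mix in this sketch)
    - gated attention: elementwise sigmoid gate on the output
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    v = 0.5 * v + 0.5 * v1
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))  # sigmoid
    return gate * (att @ v)

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))
v1 = rng.standard_normal((T, d))   # values from the first layer
Wq, Wk, Wv, Wg = (rng.standard_normal((d, d)) for _ in range(4))
y = gated_attention_with_value_residual(x, v1, Wq, Wk, Wv, Wg)
```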
Sequence Length
  • sequence_length (train_length: null, eval_length: null)
Compression
  • lzma (level: null)
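Since the artifact size budget is a constraint here, a quick sketch of measuring the lzma-compressed size of a serialized artifact with the Python standard library. The payload is a hypothetical stand-in for packed weight bytes; the card lists level as null, and `lzma.compress` defaults to preset 6.

```python
import lzma

# Hypothetical stand-in for serialized model weights; a real artifact
# would be the packed quantized tensors plus metadata.
payload = bytes(range(256)) * 4096  # 1 MiB of repeating bytes

compressed = lzma.compress(payload)  # default preset (6) when unspecified
size_mb = len(compressed) / 1e6      # compare against the 15.3 MB budget
```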

Novel Contributions

  • Reports a legal s_0-only TTT score to avoid illegal re-scoring after training
  • Compares three approaches and identifies 1.1188 BPB as the best legal result
  • Shows that GPTQ calibration can be completed within the 600s training budget
  • Documents an int5 penalty on the d=512 model variant
  • Highlights that artifact size constraints can exclude a stronger GEPA-based approach
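For readers unfamiliar with the metric, bits per byte converts a summed cross-entropy loss into bits and normalizes by the byte count. A minimal sketch assuming a byte-level model whose loss is in nats; the 0.7755 nats/byte figure below is back-derived from the headline 1.1188 BPB for illustration, not taken from the PR.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy (nats) over a byte stream to BPB:
    nats -> bits via division by ln 2, then normalize per byte."""
    return total_nll_nats / math.log(2) / n_bytes

# 0.7755 nats/byte corresponds to roughly the headline number,
# since 0.7755 / ln 2 is approximately 1.1188.
bpb = bits_per_byte(0.7755 * 1000, 1000)
```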