PR #1001

open

Non-record: Three Approaches + Lessons Learned (best: 1.1188 BPB)

by ibarrajo
val_bpb: 1.1188
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15.3 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: all)
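For orientation, here is a minimal sketch of what a 5-bit integer grid looks like. This is plain symmetric per-row round-to-nearest quantization, not GPTQ itself: GPTQ additionally compensates rounding error column by column using second-order (Hessian) information, which is omitted here. All names are hypothetical stand-ins.

```python
import numpy as np

def quantize_int5_rtn(w):
    """Symmetric per-row round-to-nearest quantization to 5 bits.

    The signed int5 grid is [-16, 15]; we scale so the largest
    magnitude in each row maps to 15. GPTQ's Hessian-based error
    compensation is intentionally omitted in this sketch.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int5_rtn(w)
w_hat = dequantize(q, s)
```

With round-to-nearest, the per-element reconstruction error is bounded by half a quantization step (scale / 2), which is the baseline GPTQ improves on.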
Test-Time Training
  • score-first TTT (parameters: null)
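A minimal sketch of the score-first ordering, assuming "score-first TTT" means the untouched s_0 checkpoint is evaluated before any test-time updates, and only that s_0 score is reported. The `evaluate` and `test_time_update` callables and the toy dict model are hypothetical stand-ins, not the PR's code.

```python
import copy

def score_first_ttt(model, data, evaluate, test_time_update):
    """Record the legal s_0 score before any test-time training,
    then adapt a deep copy of the model. Only s0_score is reported;
    re-scoring the adapted model on the same data would be illegal."""
    s0_score = evaluate(model, data)  # weights untouched so far
    adapted = test_time_update(copy.deepcopy(model), data)
    return s0_score, adapted

# toy stand-ins (hypothetical): squared-error "score", mean-fitting "TTT"
model = {"w": 1.0}
data = [1.0, 2.0, 3.0]
score, adapted = score_first_ttt(
    model, data,
    evaluate=lambda m, d: sum((x - m["w"]) ** 2 for x in d),
    test_time_update=lambda m, d: {**m, "w": sum(d) / len(d)},
)
```

The key property is ordering: the reported score is computed while `model` still holds the s_0 weights, and the adaptation happens on a copy.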
Architecture
  • LeakyReLU: used in the #569-based approach (parameters: null)
  • ReLU²: ReLU squared, used in the #569-based approach (parameters: null)
  • Value Residual: Value Residual-based architecture used in approach A (parameters: null)
  • Gated Attention: used in the #569-based approach (parameters: null)
  • Weight tying: tied embeddings / weight tying in the referenced base models, if applicable (parameters: null)
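To make the architecture entries concrete, here is a minimal NumPy sketch combining several of them in one toy attention layer: LeakyReLU and ReLU² as activation definitions, a value residual mixing this layer's values with the first layer's values, and a sigmoid gate on the attention output. The exact placements, mixing coefficients, and gate formula in the #569-based approach may differ; this illustrates the mechanisms only.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def relu_squared(x):
    return np.maximum(x, 0.0) ** 2

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_with_value_residual(x, v1, Wq, Wk, Wv, Wg):
    """One single-head attention layer (toy, unbatched).

    - value residual: mix this layer's values with the first
      layer's values v1 (fixed 0.5/0.5 mix in this sketch)
    - gated attention: elementwise sigmoid gate on the output
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    v = 0.5 * v + 0.5 * v1
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))  # sigmoid
    return gate * (att @ v)

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))
v1 = rng.standard_normal((T, d))   # values from the first layer
Wq, Wk, Wv, Wg = (rng.standard_normal((d, d)) for _ in range(4))
y = gated_attention_with_value_residual(x, v1, Wq, Wk, Wv, Wg)
```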
Sequence Length
  • sequence_length (train_length: null, eval_length: null)
Compression
  • lzma (level: null)
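Since the artifact size budget is a constraint here, a quick sketch of measuring the lzma-compressed size of a serialized artifact with the Python standard library. The payload is a hypothetical stand-in for packed weight bytes; the card lists level as null, and `lzma.compress` defaults to preset 6.

```python
import lzma

# Hypothetical stand-in for serialized model weights; a real artifact
# would be the packed quantized tensors plus metadata.
payload = bytes(range(256)) * 4096  # 1 MiB of repeating bytes

compressed = lzma.compress(payload)  # default preset (6) when unspecified
size_mb = len(compressed) / 1e6      # compare against the 15.3 MB budget
```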

Novel Contributions

  • Reports a legal s_0-only TTT score to avoid illegal re-scoring after training
  • Compares three approaches and identifies 1.1188 BPB as the best legal result
  • Shows that GPTQ calibration can be completed within the 600s training budget
  • Documents an int5 penalty on the d=512 model variant
  • Highlights that artifact size constraints can exclude a stronger GEPA-based approach
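For readers unfamiliar with the metric, bits per byte converts a summed cross-entropy loss into bits and normalizes by the byte count. A minimal sketch assuming a byte-level model whose loss is in nats; the 0.7755 nats/byte figure below is back-derived from the headline 1.1188 BPB for illustration, not taken from the PR.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy (nats) over a byte stream to BPB:
    nats -> bits via division by ln 2, then normalize per byte."""
    return total_nll_nats / math.log(2) / n_bytes

# 0.7755 nats/byte corresponds to roughly the headline number,
# since 0.7755 / ln 2 is approximately 1.1188.
bpb = bits_per_byte(0.7755 * 1000, 1000)
```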