PR #1001: Non-record: Three Approaches + Lessons Learned (best: 1.1188 BPB)
Status: open · by ibarrajo
val_bpb: 1.1188
Architecture: Transformer
Optimizer: —
Artifact Size: 15.3 MB
Training Techniques
Quantization: GPTQ (bits: 5, scope: all)
Test-Time Training: score-first TTT (parameters: null)
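For context on the "bits: 5" setting, the sketch below shows what a 5-bit weight grid looks like. This is a hedged simplification: real GPTQ quantizes columns sequentially and compensates quantization error using second-order (Hessian) statistics from calibration data, whereas this example only does per-row round-to-nearest onto the int5 range; the scale choice is a hypothetical one, not taken from the PR.

```python
import numpy as np

def quantize_int5(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric round-to-nearest quantization to 5 bits.

    5 bits give integer levels in [-16, 15]. Real GPTQ additionally
    propagates quantization error column-by-column using calibration
    data; this sketch only illustrates the 5-bit grid itself.
    """
    # Hypothetical scale: map the largest magnitude in each row to 15.
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate float weights from int5 codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int5(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

Because only the rounding step depends on data, a round-to-nearest pass like this is cheap; the GPTQ calibration the PR reports fitting into the 600 s budget adds the error-compensation loop on top.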
Architecture
LeakyReLU: uses LeakyReLU in the #569-based approach (parameters: null)
ReLU²: uses ReLU squared in the #569-based approach (parameters: null)
Value Residual: uses a Value Residual-based architecture in approach A (parameters: null)
Gated Attention: uses gated attention in the #569-based approach (parameters: null)
Weight Tying: uses tied embeddings / weight tying in the referenced base models, if applicable (parameters: null)
Sequence Length (sequence_length): train_length: null, eval_length: null
Compression: lzma (level: null)
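The unspecified `level` maps to the `preset` argument of Python's standard-library `lzma` module (0 to 9, higher is smaller but slower). A minimal usage sketch, with placeholder bytes standing in for the real artifact:

```python
import lzma

# Hypothetical stand-in for the serialized model artifact.
data = b"example model weights " * 1000

# preset controls the compression level; the PR leaves it unspecified.
compressed = lzma.compress(data, preset=9)
restored = lzma.decompress(compressed)
```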
Novel Contributions
- Reports a legal s_0-only TTT score to avoid illegal re-scoring after training
- Compares three approaches and identifies 1.1188 BPB as the best legal result
- Shows that GPTQ calibration can be completed within the 600s training budget
- Documents an int5 penalty on the d=512 model variant
- Highlights that artifact size constraints can exclude a stronger GEPA-based approach