PR #1972 (open)

Record: SP10240 SimCTG + PreQuantTTT — 1.03983 sliding-window (3-seed)

by BharathSShankar
val_bpb: 1.0398
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.948 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":21,"learning_rate_start":0.0005,"learning_rate_end":0.00005,"frozen_layers":2,"frozen_parameters":["tok_emb.weight"]}
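
A minimal PyTorch sketch of what this TTT configuration could look like; the decay shape (linear) and the module names (`model.blocks`, a model that returns a scalar loss) are assumptions, not the record's actual code:

```python
import torch

def pre_quant_ttt(model, data_loader, epochs=21, lr_start=5e-4, lr_end=5e-5,
                  frozen_layers=2, frozen_parameters=("tok_emb.weight",)):
    """Full test-time fine-tuning with a few parts frozen (sketch)."""
    # Freeze the named parameters and the first `frozen_layers` blocks.
    for name, p in model.named_parameters():
        if name in frozen_parameters:
            p.requires_grad_(False)
    for block in model.blocks[:frozen_layers]:        # `model.blocks` is hypothetical naming
        for p in block.parameters():
            p.requires_grad_(False)

    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr_start)
    # Linear decay from lr_start to lr_end over the whole run (decay shape is an assumption).
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=lr_end / lr_start,
        total_iters=epochs * len(data_loader))

    model.train()
    for _ in range(epochs):
        for x, y in data_loader:
            loss = model(x, targets=y)                # assumes the model returns a scalar loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
            sched.step()
    return model
```
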
Architecture
weight tying
Tied embeddings are used.
parameters: null
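
Weight tying is typically a single assignment in PyTorch; a minimal sketch with hypothetical module names:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the input embedding matrix.
        self.lm_head.weight = self.tok_emb.weight
```
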
depth recurrence
Encoder loops layers 3-5 for recurrence.
parameters: {"layers":[3,5]}
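
A sketch of looping layers 3–5 (indices assumed 0-based); the number of passes over the recurrent span is not stated in the record, so the repeat count here is an assumption:

```python
def forward_with_depth_recurrence(blocks, x, recur_layers=(3, 5), n_loops=2):
    """Run blocks in order, repeating the recur_layers span n_loops times."""
    lo, hi = recur_layers
    # Layers before the recurrent span run once.
    for block in blocks[:lo]:
        x = block(x)
    # Layers lo..hi are applied n_loops times with shared weights (depth recurrence).
    for _ in range(n_loops):
        for block in blocks[lo:hi + 1]:
            x = block(x)
    # Layers after the recurrent span run once.
    for block in blocks[hi + 1:]:
        x = block(x)
    return x
```
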
Parallel Residuals
Parallel residual connections are enabled from layer 7.
parameters: {"start_layer":7}
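
A GPT-J/PaLM-style parallel residual block as a sketch of what is enabled from layer 7 onward; whether a single shared norm feeds both branches is an assumption:

```python
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """x + attn(ln(x)) + mlp(ln(x)): attention and MLP read the same
    normalized input and their outputs are summed into the residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h) + self.mlp(h)
```
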
LeakyReLU
SwiGLU variant using squared LeakyReLU (negative slope 0.5) as the activation.
parameters: {"negative_slope":0.5}
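
A sketch of a gated MLP with the usual SiLU swapped for squared LeakyReLU(0.5); which branch carries the activation is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquaredGLU(nn.Module):
    """SwiGLU-style gated MLP with LeakyReLU(0.5)**2 as the gate activation."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        gate = F.leaky_relu(self.w_gate(x), negative_slope=0.5) ** 2
        return self.w_down(gate * self.w_up(x))
```
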
Partial RoPE
Partial rotary positional embeddings are used.
parameters: {"ratio":"16/64"}
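
A sketch of partial RoPE at the 16/64 ratio, rotating only the first 16 of 64 head dimensions and passing the rest through unchanged; the layout of `cos`/`sin` (8 frequencies along the last axis) is an assumption:

```python
import torch

def apply_partial_rope(q, k, cos, sin, rot_dims=16):
    """Apply rotary embeddings to the first `rot_dims` head dimensions only.
    cos/sin: (seq_len, rot_dims // 2), broadcast over batch and heads."""
    def rotate(x):
        x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
        x1, x2 = x_rot.chunk(2, dim=-1)
        # Standard RoPE rotation on the rotated slice.
        rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return torch.cat((rotated, x_pass), dim=-1)
    return rotate(q), rotate(k)
```
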
XSA
XSA attention is used on all layers.
parameters: {"layers":11}
SP10240 tokenizer
SP10240 tokenizer is used.
parameters: null
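
Assuming SP10240 denotes a SentencePiece model with a 10,240-entry vocabulary, a hypothetical training/loading sketch (model type and training corpus are not specified in the record):

```python
import sentencepiece as spm

# Hypothetical training invocation for a 10,240-entry SentencePiece vocabulary.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="sp10240",
    vocab_size=10240,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="sp10240.model")
ids = sp.encode("hello world", out_type=int)
```
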
Quantization
GPTQ
bits: 6
scope: model
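
For illustration only, a per-output-channel round-to-nearest 6-bit weight quantizer; GPTQ itself additionally redistributes rounding error across the remaining weight columns using second-order statistics, which this sketch omits:

```python
import torch

def quantize_rtn_6bit(weight: torch.Tensor):
    """Round-to-nearest symmetric 6-bit quantization with per-row scales."""
    qmax = 2 ** (6 - 1) - 1                               # 31 for signed 6-bit
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                        # stored in int8, 6 bits used

def dequantize(q, scale):
    return q.float() * scale
```
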
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
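
A sketch of strided sliding-window evaluation with stride 64: each token is scored with long left context, but only the last 64 predictions of each window are counted. The 1024 context length and the model's output convention (raw logits over a 1-D token tensor) are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids, ctx_len=1024, stride=64):
    """ids: 1-D LongTensor of token ids. Returns mean NLL per token;
    convert to bits-per-byte with log2(e) and the bytes-per-token ratio."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0) - 1, stride):
        end = min(start + stride, ids.size(0) - 1)
        ctx_start = max(0, end - ctx_len)
        window = ids[ctx_start:end + 1].unsqueeze(0)
        logits = model(window[:, :-1])
        nll = F.cross_entropy(logits[0], window[0, 1:], reduction="none")
        new = end - start                      # only these predictions are new
        total_nll += nll[-new:].sum().item()
        total_tokens += new
    return total_nll / total_tokens
```
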
Weight Averaging
EMA
parameters: null
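
A minimal EMA-of-weights sketch; the decay value is not reported in the record, so 0.999 is an assumption:

```python
import torch

class EMA:
    """Keeps an exponential moving average of floating-point model weights."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)

    def copy_to(self, model):
        # Swap the averaged weights into the model (e.g. before evaluation).
        model.load_state_dict({**model.state_dict(), **self.shadow})
```
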
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
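
A sketch of the SimCTG contrastive term with the record's margin 0.4, weighted by lambda 0.3 on top of the language-modeling loss; the per-pair hinge uses s(h_i, h_i) = 1 for normalized states:

```python
import torch
import torch.nn.functional as F

def simctg_loss(hidden, lm_loss, margin=0.4, lam=0.3):
    """hidden: (batch, seq, d) token representations from the final layer.
    Penalizes pairwise cosine similarity between distinct tokens above
    (1 - margin), encouraging a more isotropic representation space."""
    h = F.normalize(hidden, dim=-1)
    sim = torch.bmm(h, h.transpose(1, 2))                 # (batch, seq, seq) cosine sims
    seq = sim.size(1)
    off_diag = ~torch.eye(seq, dtype=torch.bool, device=sim.device)
    # s(h_i, h_i) = 1, so each off-diagonal pair contributes max(0, margin - 1 + s(h_i, h_j)).
    cl = torch.clamp(margin - 1.0 + sim, min=0.0)
    cl = (cl * off_diag).sum() / (sim.size(0) * seq * (seq - 1))
    return lm_loss + lam * cl
```
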

Novel Contributions

  • Pre-quantization Test-Time Training (PreQuantTTT) applied after the legal pre-quantization stage and before serialization.
  • SimCTG paired with PreQuantTTT, showing the contrastive regularizer survives the test-time fine-tuning stage.
  • 3-seed validation of the PreQuantTTT recipe on the SP10240 N9 base architecture.
  • Self-extracting train_gpt.py packaging using lzma+base85+exec to fit the artifact size cap.
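
A minimal sketch of lzma+base85+exec self-extracting packaging; the file names are hypothetical and the record's actual packer is not shown:

```python
import base64
import lzma

# Pack: compress train_gpt.py and embed it as an ASCII literal inside a stub script.
src = open("train_gpt.py", "rb").read()
blob = base64.b85encode(lzma.compress(src, preset=9)).decode()
stub = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
)
open("train_gpt_packed.py", "w").write(stub)
# Running train_gpt_packed.py decompresses the original source in memory and executes it.
```
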