PR #1972 (open)

Record: SP10240 SimCTG + PreQuantTTT — 1.03983 sliding-window (3-seed)

by BharathSShankar
val_bpb: 1.0398
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.948 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":21,"learning_rate_start":0.0005,"learning_rate_end":0.00005,"frozen_layers":2,"frozen_parameters":["tok_emb.weight"]}
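
A minimal PyTorch sketch of what this TTT configuration could look like; the decay shape (linear) and the module names (`model.blocks`, a model that returns a scalar loss) are assumptions, not the record's actual code:

```python
import torch

def pre_quant_ttt(model, data_loader, epochs=21, lr_start=5e-4, lr_end=5e-5,
                  frozen_layers=2, frozen_parameters=("tok_emb.weight",)):
    """Full test-time fine-tuning with a few parts frozen (sketch)."""
    # Freeze the named parameters and the first `frozen_layers` blocks.
    for name, p in model.named_parameters():
        if name in frozen_parameters:
            p.requires_grad_(False)
    for block in model.blocks[:frozen_layers]:        # `model.blocks` is hypothetical naming
        for p in block.parameters():
            p.requires_grad_(False)

    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr_start)
    # Linear decay from lr_start to lr_end over the whole run (decay shape is an assumption).
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=lr_end / lr_start,
        total_iters=epochs * len(data_loader))

    model.train()
    for _ in range(epochs):
        for x, y in data_loader:
            loss = model(x, targets=y)                # assumes the model returns a scalar loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
            sched.step()
    return model
```
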
Architecture
weight tying
Tied embeddings are used.
parameters: null
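
Weight tying is typically a single assignment in PyTorch; a minimal sketch with hypothetical module names:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the input embedding matrix.
        self.lm_head.weight = self.tok_emb.weight
```
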
depth recurrence
Encoder loops layers 3-5 for recurrence.
parameters: {"layers":[3,5]}
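
A sketch of looping layers 3–5 (indices assumed 0-based); the number of passes over the recurrent span is not stated in the record, so the repeat count here is an assumption:

```python
def forward_with_depth_recurrence(blocks, x, recur_layers=(3, 5), n_loops=2):
    """Run blocks in order, repeating the recur_layers span n_loops times."""
    lo, hi = recur_layers
    # Layers before the recurrent span run once.
    for block in blocks[:lo]:
        x = block(x)
    # Layers lo..hi are applied n_loops times with shared weights (depth recurrence).
    for _ in range(n_loops):
        for block in blocks[lo:hi + 1]:
            x = block(x)
    # Layers after the recurrent span run once.
    for block in blocks[hi + 1:]:
        x = block(x)
    return x
```
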
Parallel Residuals
Parallel residual connections are enabled from layer 7.
parameters: {"start_layer":7}
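
A GPT-J/PaLM-style parallel residual block as a sketch of what is enabled from layer 7 onward; whether a single shared norm feeds both branches is an assumption:

```python
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """x + attn(ln(x)) + mlp(ln(x)): attention and MLP read the same
    normalized input and their outputs are summed into the residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h) + self.mlp(h)
```
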
LeakyReLU
SwiGLU variant using squared LeakyReLU (negative slope 0.5) as the activation.
parameters: {"negative_slope":0.5}
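
A sketch of a gated MLP with the usual SiLU swapped for squared LeakyReLU(0.5); which branch carries the activation is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquaredGLU(nn.Module):
    """SwiGLU-style gated MLP with LeakyReLU(0.5)**2 as the gate activation."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        gate = F.leaky_relu(self.w_gate(x), negative_slope=0.5) ** 2
        return self.w_down(gate * self.w_up(x))
```
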
Partial RoPE
Partial rotary positional embeddings are used.
parameters: {"ratio":"16/64"}
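
A sketch of partial RoPE at the 16/64 ratio, rotating only the first 16 of 64 head dimensions and passing the rest through unchanged; the layout of `cos`/`sin` (8 frequencies along the last axis) is an assumption:

```python
import torch

def apply_partial_rope(q, k, cos, sin, rot_dims=16):
    """Apply rotary embeddings to the first `rot_dims` head dimensions only.
    cos/sin: (seq_len, rot_dims // 2), broadcast over batch and heads."""
    def rotate(x):
        x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
        x1, x2 = x_rot.chunk(2, dim=-1)
        # Standard RoPE rotation on the rotated slice.
        rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return torch.cat((rotated, x_pass), dim=-1)
    return rotate(q), rotate(k)
```
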
XSA
XSA attention is used on all layers.
parameters: {"layers":11}
SP10240 tokenizer
SP10240 tokenizer is used.
parameters: null
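
Assuming SP10240 denotes a SentencePiece model with a 10,240-entry vocabulary, a hypothetical training/loading sketch (model type and training corpus are not specified in the record):

```python
import sentencepiece as spm

# Hypothetical training invocation for a 10,240-entry SentencePiece vocabulary.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="sp10240",
    vocab_size=10240,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="sp10240.model")
ids = sp.encode("hello world", out_type=int)
```
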
Quantization
GPTQ
bits: 6
scope: model
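
For illustration only, a per-output-channel round-to-nearest 6-bit weight quantizer; GPTQ itself additionally redistributes rounding error across the remaining weight columns using second-order statistics, which this sketch omits:

```python
import torch

def quantize_rtn_6bit(weight: torch.Tensor):
    """Round-to-nearest symmetric 6-bit quantization with per-row scales."""
    qmax = 2 ** (6 - 1) - 1                               # 31 for signed 6-bit
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                        # stored in int8, 6 bits used

def dequantize(q, scale):
    return q.float() * scale
```
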
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
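
A sketch of strided sliding-window evaluation with stride 64: each token is scored with long left context, but only the last 64 predictions of each window are counted. The 1024 context length and the model's output convention (raw logits over a 1-D token tensor) are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids, ctx_len=1024, stride=64):
    """ids: 1-D LongTensor of token ids. Returns mean NLL per token;
    convert to bits-per-byte with log2(e) and the bytes-per-token ratio."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0) - 1, stride):
        end = min(start + stride, ids.size(0) - 1)
        ctx_start = max(0, end - ctx_len)
        window = ids[ctx_start:end + 1].unsqueeze(0)
        logits = model(window[:, :-1])
        nll = F.cross_entropy(logits[0], window[0, 1:], reduction="none")
        new = end - start                      # only these predictions are new
        total_nll += nll[-new:].sum().item()
        total_tokens += new
    return total_nll / total_tokens
```
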
Weight Averaging
EMA
parameters: null
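
A minimal EMA-of-weights sketch; the decay value is not reported in the record, so 0.999 is an assumption:

```python
import torch

class EMA:
    """Keeps an exponential moving average of floating-point model weights."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)

    def copy_to(self, model):
        # Swap the averaged weights into the model (e.g. before evaluation).
        model.load_state_dict({**model.state_dict(), **self.shadow})
```
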
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
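
A sketch of the SimCTG contrastive term with the record's margin 0.4, weighted by lambda 0.3 on top of the language-modeling loss; the per-pair hinge uses s(h_i, h_i) = 1 for normalized states:

```python
import torch
import torch.nn.functional as F

def simctg_loss(hidden, lm_loss, margin=0.4, lam=0.3):
    """hidden: (batch, seq, d) token representations from the final layer.
    Penalizes pairwise cosine similarity between distinct tokens above
    (1 - margin), encouraging a more isotropic representation space."""
    h = F.normalize(hidden, dim=-1)
    sim = torch.bmm(h, h.transpose(1, 2))                 # (batch, seq, seq) cosine sims
    seq = sim.size(1)
    off_diag = ~torch.eye(seq, dtype=torch.bool, device=sim.device)
    # s(h_i, h_i) = 1, so each off-diagonal pair contributes max(0, margin - 1 + s(h_i, h_j)).
    cl = torch.clamp(margin - 1.0 + sim, min=0.0)
    cl = (cl * off_diag).sum() / (sim.size(0) * seq * (seq - 1))
    return lm_loss + lam * cl
```
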

Novel Contributions

  • Pre-quantization Test-Time Training (PreQuantTTT) applied after the legal pre-quantization stage and before serialization.
  • SimCTG paired with PreQuantTTT, showing the contrastive regularizer survives the test-time fine-tuning stage.
  • 3-seed validation of the PreQuantTTT recipe on the SP10240 N9 base architecture.
  • Self-extracting train_gpt.py packaging using lzma+base85+exec to fit the artifact size cap.
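
A minimal sketch of lzma+base85+exec self-extracting packaging; the file names are hypothetical and the record's actual packer is not shown:

```python
import base64
import lzma

# Pack: compress train_gpt.py and embed it as an ASCII literal inside a stub script.
src = open("train_gpt.py", "rb").read()
blob = base64.b85encode(lzma.compress(src, preset=9)).decode()
stub = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
)
open("train_gpt_packed.py", "w").write(stub)
# Running train_gpt_packed.py decompresses the original source in memory and executes it.
```
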