PR #1572
Status: Open
Record: SP8192 + Depth Recurrence x2 + GPTQ + Score-First TTT + fused-softcap-ce -- val_bpb 1.07974 (3-seed mean)
by anthony-maio
val_bpb
1.07974
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are looped twice per forward pass (treated as virtual encoder/decoder stages), adding effective depth without adding parameters.
parameters: {"layers":[3,4,5],"num_loops":2}
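A minimal sketch of the looping, assuming the model exposes its blocks as an ordered list of callables (a hypothetical interface; the record does not publish code):

```python
def forward_with_recurrence(layers, x, looped=(3, 4, 5), num_loops=2):
    # `layers` is an ordered list of callables standing in for transformer
    # blocks.  The contiguous span in `looped` is run `num_loops` times
    # as one unit; all other layers run once, in order.
    span = sorted(looped)
    i = 0
    while i < len(layers):
        if i == span[0]:
            for _ in range(num_loops):
                for ix in span:
                    x = layers[ix](x)
            i = span[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With seven layers this visits blocks in the order 0, 1, 2, 3, 4, 5, 3, 4, 5, 6.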
Quantization
GPTQ
bits: 6
scope: weights at int6; embeddings at int8
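To illustrate the int6 storage format, here is a simplified round-to-nearest symmetric quantizer over one weight row. This is only the bit-packing side: GPTQ itself additionally applies Hessian-weighted error correction while rounding, which is not shown here.

```python
def quantize_rtn(row, bits=6):
    # Symmetric round-to-nearest quantization of one weight row -- a
    # simplified stand-in for the int6 storage that GPTQ produces.
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid div-by-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from int codes.
    return [v * scale for v in q]
```

Round-tripping a row bounds the per-weight error by half a quantization step.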
Weight Averaging
EMA
parameters: {"decay":0.997}
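The EMA update itself is one line per parameter; a sketch with the record's decay of 0.997, over plain float lists standing in for parameter tensors:

```python
def ema_update(avg, new, decay=0.997):
    # One EMA step over parameter values: avg <- decay*avg + (1-decay)*new.
    # `avg` and `new` are flat lists of floats standing in for tensors.
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```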
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"learning_rate":0.005}
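With weight decay and momentum both null, the update reduces to plain SGD; a sketch at the record's learning rate of 0.005:

```python
def sgd_step(params, grads, lr=0.005):
    # Plain SGD with no momentum and no weight decay, matching the
    # record's optimizer config: p <- p - lr * g.
    return [p - lr * g for p, g in zip(params, grads)]
```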
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunks":1238}
Evaluation
sliding window eval
parameters: null
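A sketch of how sliding-window evaluation is typically planned: each window reuses some left context from the previous one, but only not-yet-scored tokens contribute to the loss, so every token is scored exactly once. Window and stride sizes here are illustrative; the record does not state them.

```python
def sliding_window_spans(n_tokens, window=8, stride=4):
    # Plan (ctx_start, ctx_end, score_from) spans: tokens in
    # [ctx_start, score_from) are context only; [score_from, ctx_end)
    # are scored.  Sizes are hypothetical examples.
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans
```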
Compression
lzma
level: null
Regularization
logit softcap
parameters: {"qk_gain":5.25}
Novel Contributions
- SP8192 tokenizer integration
- depth recurrence x2 with looped layers 3-5
- GPTQ quantization with mixed int6 weights and int8 embeddings
- score-first TTT pipeline
- fused-softcap-ce CUDA kernel for faster scoring
- lzma+base85+exec compressed train script shim
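The compressed-script shim in the last bullet can be sketched with the standard library alone: the training script is lzma-compressed, base85-encoded into a string literal, and wrapped in a stub that decodes and `exec`s it at load time. The helper name is hypothetical.

```python
import base64
import lzma

def make_shim(source: str) -> str:
    # Pack a script into a small self-extracting stub.  base85's alphabet
    # contains no quotes or backslashes, so the payload embeds safely in
    # a Python string literal via repr().
    payload = base64.b85encode(lzma.compress(source.encode())).decode()
    return ("import base64,lzma\n"
            f"exec(lzma.decompress(base64.b85decode({payload!r})).decode())")
```

Running the returned stub reproduces the original script's effects, at a fraction of the artifact size for large scripts.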