PR #1400

open

Record: Hadamard-Rotated GPTQ + dTTT + Recur2 (1.1035 BPB)

by tmancino
val_bpb: 1.1035
Architecture: Transformer
Optimizer:
Artifact Size: ~15.88 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
depth recurrence
Re-runs the last transformer layers to create more effective layers from fewer stored layers.
parameters: {"layers":2}
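A minimal sketch of the depth-recurrence idea, with each layer simplified to a matmul + ReLU (the PR's actual layer code is not shown):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_with_recurrence(x, weights, recur=2):
    # Run every stored layer once, then re-run the last `recur` layers
    # with the same weights: len(weights) + recur effective layers from
    # len(weights) stored ones.
    for w in weights:
        x = relu(x @ w)
    for w in weights[-recur:]:
        x = relu(x @ w)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
out = forward_with_recurrence(rng.standard_normal((2, 8)), weights, recur=2)
```

With recur=2 (matching the listed parameters), 4 stored layers act as 6 effective layers at zero extra artifact cost.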
weight tying
Tied input and output embeddings.
parameters: null
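Weight tying in one picture: the same matrix serves as the input embedding table and, transposed, as the output projection, so only one copy is stored in the artifact.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16
emb = rng.standard_normal((vocab, dim)) * 0.02  # the only stored matrix

def embed(token_ids):
    return emb[token_ids]      # input side: row lookup

def logits(hidden):
    return hidden @ emb.T      # output side: reuse the same matrix transposed
```

This saves a full vocab x dim output matrix from the compressed artifact.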
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
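A sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves a group of 2 query heads); shapes and layout are illustrative:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d).
    group = n_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=1)  # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', probs, v)

rng = np.random.default_rng(0)
out = gqa_attention(rng.standard_normal((5, 8, 16)),
                    rng.standard_normal((5, 4, 16)),
                    rng.standard_normal((5, 4, 16)))
```

Halving the KV heads halves the stored KV projection weights relative to full multi-head attention.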
BigramHash
Adds bigram hash embeddings to the architecture.
parameters: {"dimension":128,"size":2048}
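A sketch of hashed bigram embeddings with the listed table size (2048) and dimension (128); the hash function here is illustrative, not the PR's:

```python
import numpy as np

def bigram_hash_embed(token_ids, table, size=2048):
    # Hash each (previous, current) token pair into a fixed-size table
    # and look up an extra embedding to add to the usual token embedding.
    prev = np.concatenate(([0], token_ids[:-1]))  # pad the first position
    idx = (prev * 1000003 + token_ids) % size     # illustrative hash
    return table[idx]

rng = np.random.default_rng(0)
table = rng.standard_normal((2048, 128)) * 0.02
emb = bigram_hash_embed(np.array([5, 17, 9]), table)
```

The fixed table size keeps the parameter cost bounded regardless of vocabulary size, at the price of hash collisions between rare bigrams.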
U-Net skip connections
Uses U-Net style skip connections in the model.
parameters: null
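A sketch of U-Net style skips in a layer stack: activations from the first half are added to the mirrored layers of the second half (layer i pairs with layer n-1-i; the PR's exact pairing is not shown):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def unet_forward(x, weights):
    n = len(weights)
    saved = []
    for i, w in enumerate(weights):
        if i < n // 2:
            saved.append(x)       # stash first-half activations
        elif saved:
            x = x + saved.pop()   # mirror-wise skip into the second half
        x = relu(x @ w)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
out = unet_forward(rng.standard_normal((2, 8)), weights)
```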
SmearGate
Includes SmearGate in the architecture.
parameters: null
XSA
Applies XSA across all layers.
parameters: {"layers":11}
LeakyReLU
Uses LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
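One plausible reading of "LeakyReLU squared" with slope 0.5 (a guess; the PR does not spell out the exact form): squared-ReLU on the positive side, a linear leak on the negative side.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # Positive inputs are squared (ReLU^2-style); negative inputs pass
    # through scaled by `slope`. Exact formulation is an assumption.
    return np.where(x > 0, x * x, slope * x)
```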
RoPE
Uses rotary positional embeddings.
parameters: {"dimensions":16}
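A sketch of partial RoPE: the "dimensions: 16" parameter suggests only the first 16 dims per head are rotated, with the rest passing through unrotated (a common partial-RoPE convention; the PR's convention is assumed):

```python
import numpy as np

def rope(x, n_rot=16, base=10000.0):
    # x: (seq_len, dim). Rotate the first n_rot dims by position-dependent
    # angles; leave the remaining dims untouched.
    seq, dim = x.shape
    half = n_rot // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:n_rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, n_rot:]], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
y = rope(x)
```

Position 0 gets a zero rotation angle, so its vector is unchanged.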
Test-Time Training
full TTT
parameters: {"epochs":10,"adaptive_lr":true,"per_block_lr":true}
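A sketch of the TTT loop shape: a few epochs of gradient descent on the evaluation text itself before scoring, with a separate adaptive learning rate per parameter block (loosely matching epochs=10, adaptive_lr, per_block_lr; the gradient function and the Adagrad-style LR rule here are illustrative, not the PR's recipe):

```python
import numpy as np

def ttt_sgd(param_blocks, grads_fn, epochs=10, base_lr=1e-2):
    # grads_fn returns one gradient per block for the test-time objective.
    for _ in range(epochs):
        grads = grads_fn(param_blocks)
        for i, g in enumerate(grads):
            # Per-block adaptive LR: shrink where gradients are large
            # (an illustrative rule, not the PR's).
            lr_i = base_lr / (1.0 + np.sqrt((g * g).mean()))
            param_blocks[i] = param_blocks[i] - lr_i * g
    return param_blocks

# Toy objective ||p||^2 per block, so the gradient is 2p.
blocks = [np.ones(4), 2.0 * np.ones(3)]
tuned = ttt_sgd(blocks, lambda ps: [2.0 * p for p in ps], epochs=10, base_lr=0.1)
```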
LR Schedule
cosine decay
parameters: {"epochs":10}
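The cosine decay schedule over the 10 TTT epochs is the standard form:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Decay from lr_max at step 0 to lr_min at total_steps along a
    # half-cosine curve.
    t = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```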
Regularization
weight decay
parameters: {"value":0.03}
Weight Averaging
EMA
parameters: {"tau":0.997}
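The EMA update with the listed tau=0.997 is one line per parameter:

```python
def ema_update(avg, params, tau=0.997):
    # Exponential moving average of weights: avg <- tau*avg + (1-tau)*current.
    # Higher tau = slower-moving, smoother average.
    return [tau * a + (1.0 - tau) * p for a, p in zip(avg, params)]
```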
Compression
lzma
level: 9
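A sketch of the budget check implied by this setup: LZMA-compress the packed weights at preset 9 and compare against the byte budget (the packing format and function name here are illustrative):

```python
import lzma
import numpy as np

def fits_budget(weights, budget_bytes, level=9):
    # Pack all weight tensors into one float32 blob, compress with LZMA,
    # and check the compressed size against the budget.
    blob = np.concatenate([w.ravel() for w in weights]).astype(np.float32).tobytes()
    return len(lzma.compress(blob, preset=level)) <= budget_bytes

ok = fits_budget([np.zeros((256, 256))], budget_bytes=16 * 2**20)
```

Highly redundant tensors (many zeros, repeated values) compress far better than dense random ones, which is what makes pruning interact with the size budget.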
Evaluation
sliding window eval
parameters: {"stride":64}
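A sketch of sliding-window evaluation with stride 64: overlapping windows advance by the stride, and within each window only not-yet-scored tokens count toward BPB, so most tokens are scored with long left context:

```python
def sliding_windows(n_tokens, window, stride=64):
    # Returns (context_start, score_from, score_to) spans that tile
    # [0, n_tokens) so every token is scored exactly once.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(10, window=4, stride=2)
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes.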

Novel Contributions

  • Hadamard rotation before GPTQ quantization to reduce reconstruction error
  • Discriminative test-time training with per-block adaptive learning rates
  • 2-layer depth recurrence to increase effective depth without storing more layers
  • Selective ±2 pruning with LZMA-based binary search to fit the 16MB budget
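A minimal sketch of why the Hadamard rotation helps, using a plain per-tensor uniform quantizer in place of GPTQ (GPTQ's Hessian-based rounding is omitted; matrix sizes and the outlier are illustrative). Since the Hadamard matrix H is orthonormal, W = (W @ H) @ H.T exactly, so rotating before quantization and un-rotating afterward costs nothing beyond the quantization error itself, while the rotation spreads outlier weights across channels and shrinks the quantization scale:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix
    # (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(w, bits=6):
    # Per-tensor uniform quantizer standing in for GPTQ at the listed
    # 6 bits; a single outlier inflates the scale for every weight.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[0, 0] = 50.0                 # outlier that dominates the plain scale
H = hadamard(64)

err_plain = np.abs(quantize(W) - W).mean()
# Quantize in the rotated basis, then rotate back.
err_rot = np.abs(quantize(W @ H) @ H.T - W).mean()
```

On this toy example the rotated path has markedly lower mean reconstruction error, which is the effect the first contribution bullet targets.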