PR #1400

open

Record: Hadamard-Rotated GPTQ + dTTT + Recur2 (1.1035 BPB)

by tmancino
val_bpb: 1.1035
Architecture: Transformer
Optimizer:
Artifact Size: ~15.88 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
depth recurrence
Re-runs the last transformer layers to create more effective layers from fewer stored layers.
parameters: {"layers":2}
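A minimal sketch of the depth-recurrence idea, with each layer simplified to a matmul + ReLU (the PR's actual layer code is not shown):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_with_recurrence(x, weights, recur=2):
    # Run every stored layer once, then re-run the last `recur` layers
    # with the same weights: len(weights) + recur effective layers from
    # len(weights) stored ones.
    for w in weights:
        x = relu(x @ w)
    for w in weights[-recur:]:
        x = relu(x @ w)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
out = forward_with_recurrence(rng.standard_normal((2, 8)), weights, recur=2)
```

With recur=2 (matching the listed parameters), 4 stored layers act as 6 effective layers at zero extra artifact cost.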
weight tying
Tied input and output embeddings.
parameters: null
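Weight tying in one picture: the same matrix serves as the input embedding table and, transposed, as the output projection, so only one copy is stored in the artifact.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16
emb = rng.standard_normal((vocab, dim)) * 0.02  # the only stored matrix

def embed(token_ids):
    return emb[token_ids]      # input side: row lookup

def logits(hidden):
    return hidden @ emb.T      # output side: reuse the same matrix transposed
```

This saves a full vocab x dim output matrix from the compressed artifact.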
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
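A sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves a group of 2 query heads); shapes and layout are illustrative:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d).
    group = n_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=1)  # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', probs, v)

rng = np.random.default_rng(0)
out = gqa_attention(rng.standard_normal((5, 8, 16)),
                    rng.standard_normal((5, 4, 16)),
                    rng.standard_normal((5, 4, 16)))
```

Halving the KV heads halves the stored KV projection weights relative to full multi-head attention.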
BigramHash
Adds bigram hash embeddings to the architecture.
parameters: {"dimension":128,"size":2048}
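A sketch of hashed bigram embeddings with the listed table size (2048) and dimension (128); the hash function here is illustrative, not the PR's:

```python
import numpy as np

def bigram_hash_embed(token_ids, table, size=2048):
    # Hash each (previous, current) token pair into a fixed-size table
    # and look up an extra embedding to add to the usual token embedding.
    prev = np.concatenate(([0], token_ids[:-1]))  # pad the first position
    idx = (prev * 1000003 + token_ids) % size     # illustrative hash
    return table[idx]

rng = np.random.default_rng(0)
table = rng.standard_normal((2048, 128)) * 0.02
emb = bigram_hash_embed(np.array([5, 17, 9]), table)
```

The fixed table size keeps the parameter cost bounded regardless of vocabulary size, at the price of hash collisions between rare bigrams.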
U-Net skip connections
Uses U-Net style skip connections in the model.
parameters: null
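A sketch of U-Net style skips in a layer stack: activations from the first half are added to the mirrored layers of the second half (layer i pairs with layer n-1-i; the PR's exact pairing is not shown):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def unet_forward(x, weights):
    n = len(weights)
    saved = []
    for i, w in enumerate(weights):
        if i < n // 2:
            saved.append(x)       # stash first-half activations
        elif saved:
            x = x + saved.pop()   # mirror-wise skip into the second half
        x = relu(x @ w)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
out = unet_forward(rng.standard_normal((2, 8)), weights)
```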
SmearGate
Includes SmearGate in the architecture.
parameters: null
XSA
Applies XSA across all layers.
parameters: {"layers":11}
LeakyReLU
Uses LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
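One plausible reading of "LeakyReLU squared" with slope 0.5 (a guess; the PR does not spell out the exact form): squared-ReLU on the positive side, a linear leak on the negative side.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # Positive inputs are squared (ReLU^2-style); negative inputs pass
    # through scaled by `slope`. Exact formulation is an assumption.
    return np.where(x > 0, x * x, slope * x)
```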
RoPE
Uses rotary positional embeddings.
parameters: {"dimensions":16}
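A sketch of partial RoPE: the "dimensions: 16" parameter suggests only the first 16 dims per head are rotated, with the rest passing through unrotated (a common partial-RoPE convention; the PR's convention is assumed):

```python
import numpy as np

def rope(x, n_rot=16, base=10000.0):
    # x: (seq_len, dim). Rotate the first n_rot dims by position-dependent
    # angles; leave the remaining dims untouched.
    seq, dim = x.shape
    half = n_rot // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:n_rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, n_rot:]], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
y = rope(x)
```

Position 0 gets a zero rotation angle, so its vector is unchanged.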
Test-Time Training
full TTT
parameters: {"epochs":10,"adaptive_lr":true,"per_block_lr":true}
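A sketch of the TTT loop shape: a few epochs of gradient descent on the evaluation text itself before scoring, with a separate adaptive learning rate per parameter block (loosely matching epochs=10, adaptive_lr, per_block_lr; the gradient function and the Adagrad-style LR rule here are illustrative, not the PR's recipe):

```python
import numpy as np

def ttt_sgd(param_blocks, grads_fn, epochs=10, base_lr=1e-2):
    # grads_fn returns one gradient per block for the test-time objective.
    for _ in range(epochs):
        grads = grads_fn(param_blocks)
        for i, g in enumerate(grads):
            # Per-block adaptive LR: shrink where gradients are large
            # (an illustrative rule, not the PR's).
            lr_i = base_lr / (1.0 + np.sqrt((g * g).mean()))
            param_blocks[i] = param_blocks[i] - lr_i * g
    return param_blocks

# Toy objective ||p||^2 per block, so the gradient is 2p.
blocks = [np.ones(4), 2.0 * np.ones(3)]
tuned = ttt_sgd(blocks, lambda ps: [2.0 * p for p in ps], epochs=10, base_lr=0.1)
```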
LR Schedule
cosine decay
parameters: {"epochs":10}
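The cosine decay schedule over the 10 TTT epochs is the standard form:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Decay from lr_max at step 0 to lr_min at total_steps along a
    # half-cosine curve.
    t = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```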
Regularization
weight decay
parameters: {"value":0.03}
Weight Averaging
EMA
parameters: {"tau":0.997}
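The EMA update with the listed tau=0.997 is one line per parameter:

```python
def ema_update(avg, params, tau=0.997):
    # Exponential moving average of weights: avg <- tau*avg + (1-tau)*current.
    # Higher tau = slower-moving, smoother average.
    return [tau * a + (1.0 - tau) * p for a, p in zip(avg, params)]
```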
Compression
lzma
level: 9
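A sketch of the budget check implied by this setup: LZMA-compress the packed weights at preset 9 and compare against the byte budget (the packing format and function name here are illustrative):

```python
import lzma
import numpy as np

def fits_budget(weights, budget_bytes, level=9):
    # Pack all weight tensors into one float32 blob, compress with LZMA,
    # and check the compressed size against the budget.
    blob = np.concatenate([w.ravel() for w in weights]).astype(np.float32).tobytes()
    return len(lzma.compress(blob, preset=level)) <= budget_bytes

ok = fits_budget([np.zeros((256, 256))], budget_bytes=16 * 2**20)
```

Highly redundant tensors (many zeros, repeated values) compress far better than dense random ones, which is what makes pruning interact with the size budget.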
Evaluation
sliding window eval
parameters: {"stride":64}
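A sketch of sliding-window evaluation with stride 64: overlapping windows advance by the stride, and within each window only not-yet-scored tokens count toward BPB, so most tokens are scored with long left context:

```python
def sliding_windows(n_tokens, window, stride=64):
    # Returns (context_start, score_from, score_to) spans that tile
    # [0, n_tokens) so every token is scored exactly once.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(10, window=4, stride=2)
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes.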

Novel Contributions

  • Hadamard rotation before GPTQ quantization to reduce reconstruction error
  • Discriminative test-time training with per-block adaptive learning rates
  • 2-layer depth recurrence to increase effective depth without storing more layers
  • Selective ±2 pruning with LZMA-based binary search to fit the 16MB budget
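A minimal sketch of why the Hadamard rotation helps, using a plain per-tensor uniform quantizer in place of GPTQ (GPTQ's Hessian-based rounding is omitted; matrix sizes and the outlier are illustrative). Since the Hadamard matrix H is orthonormal, W = (W @ H) @ H.T exactly, so rotating before quantization and un-rotating afterward costs nothing beyond the quantization error itself, while the rotation spreads outlier weights across channels and shrinks the quantization scale:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix
    # (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(w, bits=6):
    # Per-tensor uniform quantizer standing in for GPTQ at the listed
    # 6 bits; a single outlier inflates the scale for every weight.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[0, 0] = 50.0                 # outlier that dominates the plain scale
H = hadamard(64)

err_plain = np.abs(quantize(W) - W).mean()
# Quantize in the rotated basis, then rotate back.
err_rot = np.abs(quantize(W @ H) @ H.T - W).mean()
```

On this toy example the rotated path has markedly lower mean reconstruction error, which is the effect the first contribution bullet targets.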