val_bpb: 1.1290
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15751324 bytes (~15.0 MiB)
Training Techniques
Architecture
U-Net skip connections
Learnable skip connections in the model backbone.
parameters: null
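Since the table gives only the technique name, here is a minimal sketch of what a learnable U-Net skip might look like: the first half of the layer stack stores activations, and the second half mixes them back in through per-layer scalar weights. Function names and the scalar-gate form are assumptions, not the submission's actual code.

```python
def unet_backbone(x, enc_layers, dec_layers, skip_weights):
    # First half stores activations; second half adds them back,
    # each through its own learnable scalar skip weight (assumed form).
    skips = []
    for f in enc_layers:
        x = f(x)
        skips.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * skips.pop())  # skip from the matching encoder-side layer
    return x
```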
ReLU²
Uses squared ReLU (relu(x)²) as the MLP activation.
parameters: null
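The activation itself is one line; a NumPy sketch for reference:

```python
import numpy as np

def relu2(x):
    # squared ReLU: relu(x) ** 2
    return np.square(np.maximum(x, 0.0))
```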
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
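With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A NumPy sketch of causal GQA under those parameters (head dim and masking details are assumptions; the submission uses SDPA rather than explicit softmax):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    T, D = q.shape                      # q: (T, n_heads * head_dim)
    hd = D // n_heads
    g = n_heads // n_kv_heads           # query heads sharing each KV head
    qh = q.reshape(T, n_heads, hd)
    kh = np.repeat(k.reshape(T, n_kv_heads, hd), g, axis=1)  # share KV heads
    vh = np.repeat(v.reshape(T, n_kv_heads, hd), g, axis=1)
    scores = np.einsum('qhd,khd->hqk', qh, kh) / np.sqrt(hd)
    scores += np.triu(np.full((T, T), -np.inf), k=1)         # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, vh).reshape(T, D)
```

Note that K/V projections only need half the parameters of Q, which is the point of GQA.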
SmearGate
Sigmoid gate interpolates current and previous token representations.
parameters: null
BigramHash
XOR-hash bigram embedding with 2048 buckets.
parameters: {"buckets":2048,"embed_dim":128}
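A sketch of a hashed bigram embedding with 2048 buckets: each (previous, current) token pair is mixed into a bucket index via a multiply-then-XOR hash, and that index looks up a learned row. The multiplier constant and the BOS convention (previous token = 0 at position 0) are assumptions.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    # Bucket each (prev, cur) pair: multiply-then-XOR mix, modulo table size.
    # The odd multiplier is an arbitrary mixing constant (assumption).
    prev = np.concatenate([[0], tokens[:-1]])
    h = (prev * 2654435761) ^ tokens
    return table[h % len(table)]              # (T, embed_dim) rows
```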
XSA
XSA applied to the last 4 layers, with the attention layout adapted for SDPA.
parameters: {"layers":4}
Partial RoPE
Rotates only the first 16 dimensions while passing the rest through.
parameters: {"rotated_dims":16,"total_dims":64}
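A sketch of partial RoPE for a 64-dim head where only the first 16 dims rotate: the rotated half-pairs get standard rotary frequencies, the remaining 48 dims pass through unchanged. The pairing scheme and base 10000 are assumptions; the table only fixes rotated_dims and total_dims.

```python
import numpy as np

def partial_rope(x, pos, rotated_dims=16, base=10000.0):
    # Rotate the first `rotated_dims` dims of one head vector; pass the rest through.
    half = rotated_dims // 2
    theta = pos * base ** (-np.arange(half) / half)   # assumed rotary frequencies
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2, rest = x[:half], x[half:rotated_dims], x[rotated_dims:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos, rest])
```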
Sequence Length
train_length: 2048
eval_length: 2048
Weight Averaging
EMA
parameters: {"decay":0.997}
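EMA weight averaging with decay 0.997 is a one-line update per parameter; a minimal sketch (the class and names are illustrative, not the submission's code):

```python
class WeightEMA:
    """Exponential moving average of model weights (decay 0.997 per the table)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)          # copy of the current weights
    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * value
```

The averaged (`shadow`) weights are what gets evaluated and exported, not the raw training weights.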
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"decoupled_update":true}
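For reference, Muon orthogonalizes each 2-D momentum buffer with a quintic Newton-Schulz iteration before applying it, and "decoupled" weight decay multiplies the weights directly rather than being folded into the gradient. This is a sketch; the iteration coefficients come from the public Muon implementation and the step signature is an assumption.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic iteration driving the singular values of G toward 1
    # (coefficients from the public Muon implementation; an assumption here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, buf, grad, lr=0.02, momentum=0.95, weight_decay=0.04):
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay)        # decoupled weight decay
    return w - lr * newton_schulz(buf), buf
```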
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"tok/scalar/head optimizers"}
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights in int6; embeddings and remaining tensors in int8
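A sketch of the per-tensor quantization step (symmetric scaling assumed; packing 6-bit values into bytes and the zstd stage are omitted):

```python
import numpy as np

def quantize(w, bits):
    # Symmetric per-tensor quantization; bit-packing to 6-bit words omitted.
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At bits=6 the reconstruction error is bounded by half a quantization step, which is the trade accepted to fit under the artifact limit.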
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
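With a stride-64 sliding window, each evaluation window scores only its last 64 positions (the first window scores everything), so every token is scored exactly once with near-full context. A sketch of the span bookkeeping (the exact implementation is an assumption):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (context_start, first_scored, one_past_last_scored) per window.
    spans, done, start = [], 0, 0
    while done < n_tokens:
        end = min(start + window, n_tokens)
        lo = 0 if start == 0 else max(done, end - stride)
        spans.append((start, lo, end))
        done = end
        start += stride
    return spans
```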
Initialization
OrthoInit
Orthogonal initialization with 1/sqrt(2*num_layers) scaling for projection layers.
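A sketch of the initializer: draw a Gaussian matrix, orthogonalize via QR, then apply the 1/sqrt(2*num_layers) scale. The sign fix and the fan_out >= fan_in assumption are illustrative details, not confirmed by the table.

```python
import numpy as np

def ortho_init(fan_out, fan_in, num_layers, rng=None):
    # Orthogonal columns via QR, scaled by 1 / sqrt(2 * num_layers).
    # Assumes fan_out >= fan_in; the sign fix makes the factorization unique.
    if rng is None:
        rng = np.random.default_rng(0)
    q, r = np.linalg.qr(rng.standard_normal((fan_out, fan_in)))
    q *= np.sign(np.diag(r))
    return q / np.sqrt(2 * num_layers)
```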
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx + 1)"}
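The scale rule damps LayerNorm outputs by 1/sqrt(layer_idx + 1), so deeper layers contribute progressively less to the residual stream. A minimal sketch (the placement of the scale after normalization is an assumption):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-6):
    # Standard LayerNorm followed by a fixed 1 / sqrt(layer_idx + 1) scale.
    y = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)
    return y / np.sqrt(layer_idx + 1)
```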
Novel Contributions
- Pre-TTT anchor submission adapted from the repo-root train_gpt.py skeleton
- SDPA-based adaptation of donor features originally designed for flash_attn_3
- Selective transplant of donor techniques including SmearGate, BigramHash, XSA, and Partial RoPE
- Mixed int6/int8 export with zstd compression to fit the 16MB artifact limit
- Stride-64 sliding evaluation for validation
- EMA-based weight averaging and Muon optimization with decoupled weight decay