PR #1619 (open)

Submission/sp8192 depthrecur adamwttt

by AVINASH0052
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,832,508 bytes

Training Techniques

Architecture
XSA
Applied XSA across all 11 layers, dropping the self-value projection.
parameters: {"layers":11}
BigramHash
Used bigram hash embedding for token representation.
parameters: {"dimensions":3072,"embedding_dim":112}
Partial RoPE
Applied rotary position encoding to a subset of head dimensions.
parameters: {"head_dims":16,"total_head_dims":64}
U-Net skip connections
Added U-Net style skip connections between mirrored layers.
parameters: {"pairs":[[0,10],[1,9],[2,8]]}
VE128
Re-injected value embeddings at later layers.
parameters: {"layers":[9,10]}
SmearGate
Used a learned position mixing gate on the embedding.
parameters: null
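The SmearGate description suggests blending each position's embedding with its predecessor through a learned gate; a minimal sketch under that reading (the per-channel sigmoid gate and its initialisation are assumptions).

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Gated mixing of each token's embedding with the previous position's embedding."""
    def __init__(self, dim):
        super().__init__()
        # Negative init keeps sigmoid(gate) near 0, i.e. close to the identity at start.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x):  # x: (batch, seq, dim)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev
```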
GQA
Used grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
LeakyReLU
Used LeakyReLU squared in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
weight tying
Tied token embeddings with the LM head.
parameters: null
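Weight tying here is the usual sharing of the input embedding matrix with the LM head, so the matrix is stored (and counted toward the artifact size) only once; module names below are placeholders.

```python
import torch.nn as nn

def tie_weights(tok_emb: nn.Embedding, lm_head: nn.Linear) -> None:
    """Share a single parameter between the token embedding and the output projection."""
    assert tok_emb.weight.shape == lm_head.weight.shape
    lm_head.weight = tok_emb.weight
```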
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"condition":"lr_scale < 0.2"}
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all quantizable layers
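Late QAT typically trains through a fake-quantisation step; a sketch assuming per-channel symmetric quantisation with a straight-through estimator. The QAT bit width is not listed, so 6 bits is used below to match the GPTQ stage, whose Hessian-based rounding is a separate offline pass not shown here.

```python
import torch

def fake_quant_symmetric(w, bits=6):
    """Per-output-channel symmetric fake quantisation with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward: quantised weights, backward: identity
```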
Evaluation
sliding window eval
parameters: {"stride":64}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: {"scale":"1/sqrt(L+1)"}
LR Schedule
warmdown
parameters: {"iters":4000}
Compression
lzma
level: 9
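The artifact size is measured after LZMA compression at preset 9; a sketch of packing the checkpoint that way (the serialization format is an assumption).

```python
import io
import lzma
import torch

def save_compressed(state_dict, path):
    """Serialize the weights and compress them with LZMA at preset 9."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with lzma.open(path, "wb", preset=9) as f:
        f.write(buf.getvalue())
```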
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • 11-layer Transformer with XSA across all layers
  • BigramHash 3072×112 embedding
  • U-Net style skip connections
  • VE128 value embedding reinjection
  • Late QAT followed by full Hessian GPTQ int6 compression
  • Sliding-window exact evaluation with stride 64
  • EMA plus tight SWA during training
  • Parallel Muon optimizer with AdamW for embeddings