PR #1619 (open)

Submission/sp8192 depthrecur adamwttt

by AVINASH0052
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,832,508 bytes

Training Techniques

Architecture
XSA
Applied XSA across all 11 layers, dropping the self-value projection.
parameters: {"layers":11}
BigramHash
Used bigram hash embedding for token representation.
parameters: {"dimensions":3072,"embedding_dim":112}
Partial RoPE
Applied rotary position encoding to a subset of head dimensions.
parameters: {"head_dims":16,"total_head_dims":64}
U-Net skip connections
Added U-Net style skip connections between mirrored layers.
parameters: {"pairs":[[0,10],[1,9],[2,8]]}
VE128
Re-injected value embeddings at later layers.
parameters: {"layers":[9,10]}
SmearGate
Used a learned position mixing gate on the embedding.
parameters: null
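The SmearGate description suggests blending each position's embedding with its predecessor through a learned gate; a minimal sketch under that reading (the per-channel sigmoid gate and its initialisation are assumptions).

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Gated mixing of each token's embedding with the previous position's embedding."""
    def __init__(self, dim):
        super().__init__()
        # Negative init keeps sigmoid(gate) near 0, i.e. close to the identity at start.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x):  # x: (batch, seq, dim)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev
```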
GQA
Used grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
LeakyReLU
Used LeakyReLU squared in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
weight tying
Tied token embeddings with the LM head.
parameters: null
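Weight tying here is the usual sharing of the input embedding matrix with the LM head, so the matrix is stored (and counted toward the artifact size) only once; module names below are placeholders.

```python
import torch.nn as nn

def tie_weights(tok_emb: nn.Embedding, lm_head: nn.Linear) -> None:
    """Share a single parameter between the token embedding and the output projection."""
    assert tok_emb.weight.shape == lm_head.weight.shape
    lm_head.weight = tok_emb.weight
```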
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"condition":"lr_scale < 0.2"}
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all quantizable layers
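Late QAT typically trains through a fake-quantisation step; a sketch assuming per-channel symmetric quantisation with a straight-through estimator. The QAT bit width is not listed, so 6 bits is used below to match the GPTQ stage, whose Hessian-based rounding is a separate offline pass not shown here.

```python
import torch

def fake_quant_symmetric(w, bits=6):
    """Per-output-channel symmetric fake quantisation with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward: quantised weights, backward: identity
```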
Evaluation
sliding window eval
parameters: {"stride":64}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: {"scale":"1/sqrt(L+1)"}
LR Schedule
warmdown
parameters: {"iters":4000}
Compression
lzma
level: 9
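The artifact size is measured after LZMA compression at preset 9; a sketch of packing the checkpoint that way (the serialization format is an assumption).

```python
import io
import lzma
import torch

def save_compressed(state_dict, path):
    """Serialize the weights and compress them with LZMA at preset 9."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with lzma.open(path, "wb", preset=9) as f:
        f.write(buf.getvalue())
```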
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • 11-layer Transformer with XSA across all layers
  • BigramHash 3072×112 embedding
  • U-Net style skip connections
  • VE128 value embedding reinjection
  • Late QAT followed by full Hessian GPTQ int6 compression
  • Sliding-window exact evaluation with stride 64
  • EMA plus tight SWA during training
  • Parallel Muon optimizer with AdamW for embeddings