PR #1568

open

Non-record submission: Weight-Tied 6Lx2 d=672 (1.1639 BPB)

by yuitokyouni
val_bpb: 1.1639
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.2 MB

Training Techniques

Architecture
weight tying
Reuses 6 unique transformer blocks across 2 passes to create 12 effective layers with shared weights.
parameters: {"unique_blocks":6,"passes":2,"effective_layers":12,"d_model":672}
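The tied-depth scheme above can be sketched in a few lines: 6 unique blocks are applied twice in sequence, giving 12 effective layers while storing parameters for only 6. This is a minimal illustration, not the submission's actual module code.

```python
def tied_forward(x, blocks, passes=2):
    # len(blocks) * passes effective layers from len(blocks) parameter sets
    for _ in range(passes):
        for block in blocks:
            x = block(x)
    return x

# toy "blocks": block i adds i, so we can count applications
blocks = [lambda x, i=i: x + i for i in range(6)]
out = tied_forward(0, blocks)  # two passes, each adding 0+1+...+5 = 15
```

The parameter savings from tying (6 blocks instead of 12) are what fund the larger d_model of 672.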
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
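With 8 query heads and 4 KV heads, each KV head serves 2 query heads. A minimal numpy sketch of that sharing (shapes and names are illustrative, not the submission's code):

```python
import numpy as np

def gqa(q, k, v):
    # q: (heads, T, d); k, v: (kv_heads, T, d)
    heads, kv_heads = q.shape[0], k.shape[0]
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)  # each KV head reused by `group` query heads
    v = np.repeat(v, group, axis=0)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)  # softmax over keys
    return att @ v

q = np.random.randn(8, 5, 16)
k = np.random.randn(4, 5, 16)
v = np.random.randn(4, 5, 16)
out = gqa(q, k, v)  # (8, 5, 16): full head count out, half the KV cache
```

Halving the KV heads halves the KV projection parameters, which matters for the artifact-size budget.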
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"slope":0.5}
MLP3x
Uses a 3x MLP expansion ratio.
parameters: {"expansion":3}
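The two MLP entries above combine into one small sketch: a 3x expansion (672 to 2016) with squared LeakyReLU (slope 0.5) as the nonlinearity. Weight names here are placeholders.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring, per the listed slope of 0.5
    return np.where(x > 0, x, slope * x) ** 2

def mlp(x, w_in, w_out):
    return leaky_relu_sq(x @ w_in) @ w_out

d = 672
w_in = np.zeros((d, 3 * d))   # 3x expansion: 672 -> 2016
w_out = np.zeros((3 * d, d))
```

Note that squaring makes the output non-negative even for negative pre-activations: an input of -2 maps to (0.5 * -2)^2 = 1.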
Partial RoPE
Applies rotary position embeddings to only part of the representation.
parameters: {"dimensions":16}
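A sketch of partial RoPE with the listed 16 rotated dimensions: only the first 16 dims of each head get the rotary transform; the rest pass through unchanged. The split convention below is one common choice, not necessarily the submission's.

```python
import numpy as np

def partial_rope(x, pos, rot=16, base=10000.0):
    # x: (T, d); rotate dims [0, rot), pass through dims [rot, d)
    half = rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot:]], axis=-1)
```

At position 0 the rotation is the identity, and the unrotated tail carries position-independent content.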
U-Net skip connections
Links encoder outputs back to decoder layers across the two-pass tied-depth structure.
parameters: null
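One plausible reading of the skip scheme, sketched under the assumption that pass 1 acts as the encoder and pass 2 as the decoder: pass-1 outputs are cached and added back, in reverse order, to pass-2 inputs, so the two passes through the same tied blocks see different residual streams.

```python
def unet_tied_forward(x, blocks):
    skips = []
    for block in blocks:            # pass 1 ("encoder"): cache activations
        x = block(x)
        skips.append(x)
    for block in blocks:            # pass 2 ("decoder"): same tied blocks
        x = block(x + skips.pop())  # deepest skip feeds the first decoder block
    return x
```

This is one way the two passes of a weight-tied stack can be differentiated without extra block parameters.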
XSA
Exclusive Self Attention applied to all unique blocks.
parameters: {"blocks":6}
SmearGate
Injects bigram context through a learned gate.
parameters: null
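A guess at the SmearGate mechanism based only on the one-line description: each token's embedding is blended with the previous token's embedding (the "bigram context") through a learned sigmoid gate. The gating form and all names below are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate):
    # x: (T, d); gate in (0, 1) decides how much of token t-1 to mix in
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                 # the first token has no predecessor
    g = sigmoid(x @ w_gate)       # (T, 1) learned per-token gate
    return x + g * prev
```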
BigramHash
Adds token-pair embeddings via hashed bigram buckets.
parameters: {"buckets":2048,"dim":128}
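A sketch of the hashed bigram lookup with the listed parameters (2048 buckets, dim 128): the (previous, current) token pair is hashed into a bucket index, which selects a learned 128-dim feature. The specific hash mix below is an assumption.

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=2048):
    # table: (buckets, 128) learned embeddings; tokens: (T,) int ids
    prev = np.roll(tokens, 1)
    prev[0] = 0                                   # pad the first bigram
    idx = (prev * 1000003 + tokens) % buckets     # cheap pair hash
    return table[idx]                             # (T, 128) bigram features

table = np.random.randn(2048, 128)
feats = bigram_hash_embed(np.array([5, 17, 17, 9]), table)  # (4, 128)
```

Hashing keeps the table at 2048 x 128 regardless of vocabulary size, at the cost of bucket collisions.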
Weight Averaging
EMA
Maintains an exponential moving average of the weights for evaluation.
parameters: {"decay":0.997}
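The EMA update with the listed decay of 0.997, as a plain-Python sketch:

```python
def ema_update(ema, params, decay=0.997):
    # after each optimizer step: ema <- decay * ema + (1 - decay) * params
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.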
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
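Both quantization stages target 6 bits over all weights. A minimal sketch of the symmetric 6-bit rounding a late-QAT forward pass simulates; GPTQ's error-compensating rounding is more involved, and this shows only the quantization grid itself.

```python
import numpy as np

def fake_quant(w, bits=6):
    # symmetric per-tensor grid: 2^(bits-1) - 1 = 31 levels each side for int6
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

w = np.array([0.31, -0.7, 1.0])
wq = fake_quant(w)  # every entry snapped to a multiple of max|w| / 31
```

Running QAT late in training lets the model adapt to this grid before GPTQ commits the weights to int6 for the 13.2 MB artifact.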
Compression
lzma
level: null
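The final artifact is lzma-compressed. A round-trip sketch using Python's standard library (the dict payload is a stand-in for the packed quantized weights, not the submission's serialization format):

```python
import lzma
import pickle

payload = pickle.dumps({"w": list(range(100))})  # stand-in for packed weights
blob = lzma.compress(payload)                    # level unspecified -> default preset
restored = pickle.loads(lzma.decompress(blob))
```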
Evaluation
sliding window eval
parameters: {"stride":64}
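Sliding-window evaluation with stride 64, assuming the usual scheme: windows advance by the stride, and each token is scored once, so most tokens are predicted with long left context rather than from a window boundary. This helper only enumerates the window spans.

```python
def eval_windows(n_tokens, window, stride=64):
    # each window scores the tokens not already scored by the previous one
    starts = range(0, n_tokens - window + 1, stride)
    return [(s, s + window) for s in starts]

spans = eval_windows(256, 128)  # [(0, 128), (64, 192), (128, 256)]
```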
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"batched_newton_schulz":true,"reduce_scatter_overlap":true}
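Muon's core step orthogonalizes the momentum update with a Newton-Schulz iteration; "batched" refers to running that iteration for many weight matrices at once, and the overlap flag to hiding the reduce-scatter communication. A single-matrix numpy sketch using the quintic coefficients commonly used with Muon (the distributed machinery is omitted):

```python
import numpy as np

def newton_schulz(g, steps=5):
    # quintic iteration pushing singular values of g toward 1
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x
```

The iteration approximates the orthogonal factor of the update without an explicit (and GPU-unfriendly) SVD.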

Novel Contributions

  • Weight-tied 6-block transformer reused across two passes to achieve 12 effective layers
  • Reinvesting parameter savings into a larger model dimension (d=672)
  • Combining SmearGate, BigramHash, XSA, EMA, late QAT, GPTQ int6, and lzma compression
  • Using U-Net skip connections and cached LayerNorm scaling to differentiate tied-depth passes
  • Sliding window evaluation with stride 64