PR #1568

open

Non-record submission: Weight-Tied 6Lx2 d=672 (1.1639 BPB)

by yuitokyouni
val_bpb: 1.1639
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 13.2 MB

Training Techniques

Architecture
weight tying
Reuses 6 unique transformer blocks across 2 passes to create 12 effective layers with shared weights.
parameters: {"unique_blocks":6,"passes":2,"effective_layers":12,"d_model":672}
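The tied-depth scheme above can be sketched in a few lines: 6 unique blocks are applied twice in sequence, giving 12 effective layers while storing parameters for only 6. This is a minimal illustration, not the submission's actual module code.

```python
def tied_forward(x, blocks, passes=2):
    # len(blocks) * passes effective layers from len(blocks) parameter sets
    for _ in range(passes):
        for block in blocks:
            x = block(x)
    return x

# toy "blocks": block i adds i, so we can count applications
blocks = [lambda x, i=i: x + i for i in range(6)]
out = tied_forward(0, blocks)  # two passes, each adding 0+1+...+5 = 15
```

The parameter savings from tying (6 blocks instead of 12) are what fund the larger d_model of 672.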
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
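With 8 query heads and 4 KV heads, each KV head serves 2 query heads. A minimal numpy sketch of that sharing (shapes and names are illustrative, not the submission's code):

```python
import numpy as np

def gqa(q, k, v):
    # q: (heads, T, d); k, v: (kv_heads, T, d)
    heads, kv_heads = q.shape[0], k.shape[0]
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)  # each KV head reused by `group` query heads
    v = np.repeat(v, group, axis=0)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)  # softmax over keys
    return att @ v

q = np.random.randn(8, 5, 16)
k = np.random.randn(4, 5, 16)
v = np.random.randn(4, 5, 16)
out = gqa(q, k, v)  # (8, 5, 16): full head count out, half the KV cache
```

Halving the KV heads halves the KV projection parameters, which matters for the artifact-size budget.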
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"slope":0.5}
MLP3x
Uses a 3x MLP expansion ratio.
parameters: {"expansion":3}
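The two MLP entries above combine into one small sketch: a 3x expansion (672 to 2016) with squared LeakyReLU (slope 0.5) as the nonlinearity. Weight names here are placeholders.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring, per the listed slope of 0.5
    return np.where(x > 0, x, slope * x) ** 2

def mlp(x, w_in, w_out):
    return leaky_relu_sq(x @ w_in) @ w_out

d = 672
w_in = np.zeros((d, 3 * d))   # 3x expansion: 672 -> 2016
w_out = np.zeros((3 * d, d))
```

Note that squaring makes the output non-negative even for negative pre-activations: an input of -2 maps to (0.5 * -2)^2 = 1.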
Partial RoPE
Applies rotary position embeddings to only part of the representation.
parameters: {"dimensions":16}
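A sketch of partial RoPE with the listed 16 rotated dimensions: only the first 16 dims of each head get the rotary transform; the rest pass through unchanged. The split convention below is one common choice, not necessarily the submission's.

```python
import numpy as np

def partial_rope(x, pos, rot=16, base=10000.0):
    # x: (T, d); rotate dims [0, rot), pass through dims [rot, d)
    half = rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot:]], axis=-1)
```

At position 0 the rotation is the identity, and the unrotated tail carries position-independent content.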
U-Net skip connections
Links encoder outputs back to decoder layers across the two-pass tied-depth structure.
parameters: null
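One plausible reading of the skip scheme, sketched under the assumption that pass 1 acts as the encoder and pass 2 as the decoder: pass-1 outputs are cached and added back, in reverse order, to pass-2 inputs, so the two passes through the same tied blocks see different residual streams.

```python
def unet_tied_forward(x, blocks):
    skips = []
    for block in blocks:            # pass 1 ("encoder"): cache activations
        x = block(x)
        skips.append(x)
    for block in blocks:            # pass 2 ("decoder"): same tied blocks
        x = block(x + skips.pop())  # deepest skip feeds the first decoder block
    return x
```

This is one way the two passes of a weight-tied stack can be differentiated without extra block parameters.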
XSA
Exclusive Self Attention applied to all unique blocks.
parameters: {"blocks":6}
SmearGate
Injects bigram context through a learned gate.
parameters: null
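A guess at the SmearGate mechanism based only on the one-line description: each token's embedding is blended with the previous token's embedding (the "bigram context") through a learned sigmoid gate. The gating form and all names below are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate):
    # x: (T, d); gate in (0, 1) decides how much of token t-1 to mix in
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                 # the first token has no predecessor
    g = sigmoid(x @ w_gate)       # (T, 1) learned per-token gate
    return x + g * prev
```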
BigramHash
Adds token-pair embeddings via hashed bigram buckets.
parameters: {"buckets":2048,"dim":128}
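A sketch of the hashed bigram lookup with the listed parameters (2048 buckets, dim 128): the (previous, current) token pair is hashed into a bucket index, which selects a learned 128-dim feature. The specific hash mix below is an assumption.

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=2048):
    # table: (buckets, 128) learned embeddings; tokens: (T,) int ids
    prev = np.roll(tokens, 1)
    prev[0] = 0                                   # pad the first bigram
    idx = (prev * 1000003 + tokens) % buckets     # cheap pair hash
    return table[idx]                             # (T, 128) bigram features

table = np.random.randn(2048, 128)
feats = bigram_hash_embed(np.array([5, 17, 17, 9]), table)  # (4, 128)
```

Hashing keeps the table at 2048 x 128 regardless of vocabulary size, at the cost of bucket collisions.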
Weight Averaging
EMA
Maintains an exponential moving average of the weights for evaluation.
parameters: {"decay":0.997}
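The EMA update with the listed decay of 0.997, as a plain-Python sketch:

```python
def ema_update(ema, params, decay=0.997):
    # after each optimizer step: ema <- decay * ema + (1 - decay) * params
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.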
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
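Both quantization stages target 6 bits over all weights. A minimal sketch of the symmetric 6-bit rounding a late-QAT forward pass simulates; GPTQ's error-compensating rounding is more involved, and this shows only the quantization grid itself.

```python
import numpy as np

def fake_quant(w, bits=6):
    # symmetric per-tensor grid: 2^(bits-1) - 1 = 31 levels each side for int6
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

w = np.array([0.31, -0.7, 1.0])
wq = fake_quant(w)  # every entry snapped to a multiple of max|w| / 31
```

Running QAT late in training lets the model adapt to this grid before GPTQ commits the weights to int6 for the 13.2 MB artifact.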
Compression
lzma
level: null
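The final artifact is lzma-compressed. A round-trip sketch using Python's standard library (the dict payload is a stand-in for the packed quantized weights, not the submission's serialization format):

```python
import lzma
import pickle

payload = pickle.dumps({"w": list(range(100))})  # stand-in for packed weights
blob = lzma.compress(payload)                    # level unspecified -> default preset
restored = pickle.loads(lzma.decompress(blob))
```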
Evaluation
sliding window eval
parameters: {"stride":64}
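Sliding-window evaluation with stride 64, assuming the usual scheme: windows advance by the stride, and each token is scored once, so most tokens are predicted with long left context rather than from a window boundary. This helper only enumerates the window spans.

```python
def eval_windows(n_tokens, window, stride=64):
    # each window scores the tokens not already scored by the previous one
    starts = range(0, n_tokens - window + 1, stride)
    return [(s, s + window) for s in starts]

spans = eval_windows(256, 128)  # [(0, 128), (64, 192), (128, 256)]
```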
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"batched_newton_schulz":true,"reduce_scatter_overlap":true}
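Muon's core step orthogonalizes the momentum update with a Newton-Schulz iteration; "batched" refers to running that iteration for many weight matrices at once, and the overlap flag to hiding the reduce-scatter communication. A single-matrix numpy sketch using the quintic coefficients commonly used with Muon (the distributed machinery is omitted):

```python
import numpy as np

def newton_schulz(g, steps=5):
    # quintic iteration pushing singular values of g toward 1
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x
```

The iteration approximates the orthogonal factor of the update without an explicit (and GPU-unfriendly) SVD.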

Novel Contributions

  • Weight-tied 6-block transformer reused across two passes to achieve 12 effective layers
  • Reinvesting parameter savings into a larger model dimension (d=672)
  • Combining SmearGate, BigramHash, XSA, EMA, late QAT, GPTQ int6, and lzma compression
  • Using U-Net skip connections and cached LayerNorm scaling to differentiate tied-depth passes
  • Sliding window evaluation with stride 64