PR #1411

open

Non-record: Blueprint Stack + ProgSeq + Multi-scale RoPE + ByteEmbed — val_bpb 1.5568 (1xRTX 3080)

by Blakethefn
val_bpb: 1.5568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
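A minimal numpy sketch of the GQA configuration above (8 query heads sharing 4 KV heads, so 2 query heads per KV head). Weight shapes, the causal mask, and the softmax details are illustrative assumptions, not taken from the PR's code:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads share n_kv_heads KV heads."""
    T, D = x.shape
    hd = D // n_heads                    # per-head dim
    group = n_heads // n_kv_heads        # query heads per KV head (2 here)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)   # wk/wv project to n_kv_heads*hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # repeat each KV head so its query group can attend to it
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal[None], -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v).reshape(T, D)
```

The memory saving comes from storing only 4 KV heads in the cache while keeping 8 heads of query expressiveness.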
MLP3x
Three-layer MLP stack with ReLU² activation.
parameters: {"mlp_layers":3,"activation":"ReLU²"}
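A sketch of the three-layer MLP stack with ReLU² activation, as described above; the layer widths and the placement of the activation (between layers, none after the last) are assumptions:

```python
import numpy as np

def relu2(x):
    """ReLU-squared activation: max(x, 0)^2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w1, w2, w3):
    """Three stacked linear layers with ReLU^2 between them."""
    return relu2(relu2(x @ w1) @ w2) @ w3
```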
weight tying
Tied input and output embeddings.
parameters: null
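Weight tying means the token embedding matrix and the output (unembedding) projection are one tensor. A minimal sketch, with the init scale as an assumption:

```python
import numpy as np

class TiedEmbedding:
    """Input embedding and output head share one (vocab, dim) matrix."""
    def __init__(self, vocab, dim, rng):
        self.W = rng.standard_normal((vocab, dim)) * 0.02

    def embed(self, ids):
        return self.W[ids]          # (T, dim) lookup

    def logits(self, h):
        return h @ self.W.T         # (T, vocab): same weights, transposed
```

Tying halves the parameter count spent on the vocabulary, which matters at this artifact size (15.9 MB).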
U-Net skip connections
Skip connections inspired by U-Net added to the transformer stack.
parameters: null
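One common way to add U-Net-style skips to a transformer stack: the first half of the layers push their inputs onto a stack, the second half pop and add the mirrored activation. Whether the PR adds, concatenates, or gates the skips is not stated, so additive skips here are an assumption:

```python
def unet_stack(x, layers):
    """Apply layers in order; the second half receives additive skip
    connections from the mirrored first-half activations."""
    half = len(layers) // 2
    skips = []
    for i, layer in enumerate(layers):
        if i < half:
            skips.append(x)          # save encoder-side activation
        elif skips:
            x = x + skips.pop()      # add mirrored skip before the layer
        x = layer(x)
    return x
```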
RoPE
Multi-scale RoPE applied by KV group with different context scales.
parameters: {"bases":[1000,10000,100000,1000000]}
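A numpy sketch of multi-scale RoPE: each KV group gets its own rotary base from the PR's list, so different groups see position information at different context scales. The even/odd pairing convention is an assumption:

```python
import numpy as np

BASES = [1000, 10000, 100000, 1000000]   # one base per KV group (from the PR)

def rope_multiscale(q, bases=BASES):
    """q: (T, n_groups, head_dim). Rotate each group with its own base."""
    T, G, hd = q.shape
    out = np.empty_like(q)
    pos = np.arange(T)[:, None]
    for g, base in enumerate(bases[:G]):
        inv = float(base) ** (-np.arange(0, hd, 2) / hd)   # (hd/2,) frequencies
        ang = pos * inv                                    # (T, hd/2) angles
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = q[:, g, 0::2], q[:, g, 1::2]
        out[:, g, 0::2] = x1 * cos - x2 * sin              # 2-D rotation per pair
        out[:, g, 1::2] = x1 * sin + x2 * cos
    return out
```

Since RoPE is a pure rotation, it preserves vector norms, which the test below checks.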
other
Byte-level token embeddings using a 64-dim UTF-8 byte side channel.
parameters: {"dimension":64}
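The byte side channel embeds each token's raw UTF-8 bytes with a 64-dim table. A sketch assuming mean pooling over the token's bytes (the PR does not state the pooling or how the result is merged with the main embedding):

```python
import numpy as np

BYTE_DIM = 64   # side-channel width from the PR

def byte_embed(token_str, byte_table):
    """Look up a BYTE_DIM embedding for each UTF-8 byte of the token
    and mean-pool them. byte_table: (256, BYTE_DIM) learned lookup."""
    byte_ids = list(token_str.encode('utf-8'))
    return byte_table[byte_ids].mean(axis=0)
```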
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
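The PR follows the common Muon recipe: Muon updates matrix parameters, Adam handles embeddings and scalars. A sketch of the routing logic; the name-based embedding check is an assumption about how parameters are identified:

```python
import numpy as np

def split_param_groups(named_params):
    """Route 2-D (matrix) weights to Muon and everything else
    (embeddings, scalar/vector params) to Adam."""
    muon, adam = [], []
    for name, p in named_params:
        if p.ndim == 2 and 'embed' not in name:
            muon.append(name)   # matrix weight -> Muon
        else:
            adam.append(name)   # embedding / scalar -> Adam
    return muon, adam
```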
Weight Averaging
SWA
parameters: null
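SWA keeps a running mean of parameter snapshots taken during training and evaluates with the averaged weights. A minimal sketch (snapshot cadence and any learning-rate schedule interaction are unspecified in the PR):

```python
import numpy as np

class SWA:
    """Stochastic weight averaging: incremental mean of snapshots."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=np.float64) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n   # running-mean update
```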
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Quantization
mixed int5/int6
bits: 5 and 6 (mixed by weight group)
scope: MLP/attention/bigram/byte weights
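A sketch of symmetric n-bit quantization for the export step; which weight groups get 5 bits versus 6 is not specified beyond the scope line above, so per-tensor scaling here is an assumption:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```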
Compression
zstd
level: null
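The export pipeline compresses the quantized weight bytes and verifies a lossless roundtrip. The PR uses zstd; the sketch below uses stdlib `zlib` as a stand-in (the `zstandard` package is a third-party dependency), since both expose the same compress/decompress shape:

```python
import zlib
import numpy as np

def export_roundtrip(q, codec=zlib):
    """Compress quantized weights and verify they decode bit-exactly.
    Swap `codec` for a zstd wrapper to match the PR's zstd export."""
    raw = q.tobytes()
    blob = codec.compress(raw)
    restored = np.frombuffer(codec.decompress(blob), dtype=q.dtype)
    return blob, restored.reshape(q.shape)
```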

Novel Contributions

  • Progressive sequence length schedule
  • Multi-scale RoPE by KV group
  • Byte-level token embeddings from UTF-8 bytes
  • Mixed-bit quantization export with zstd roundtrip
  • Combined blueprint stack on a single RTX 3080
  • Systematic ablation of leaderboard techniques under a 10-minute wallclock cap
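The progressive sequence length schedule from the first bullet can be sketched as a staged ramp up to the train_length of 2048. The stage list and the even split of steps across stages are assumptions; the PR only names the schedule:

```python
def seq_len_schedule(step, total_steps, stages=(256, 512, 1024, 2048)):
    """Return the training sequence length for a given step: walk
    through power-of-two stages, ending at the PR's train_length."""
    idx = min(int(len(stages) * step / total_steps), len(stages) - 1)
    return stages[idx]
```

Short sequences early make each of the scarce 10-minute-budget steps cheap; full-length sequences late teach the long-range behavior the multi-scale RoPE bases target.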