PR #1411

open

Non-record: Blueprint Stack + ProgSeq + Multi-scale RoPE + ByteEmbed — val_bpb 1.5568 (1xRTX 3080)

by Blakethefn
val_bpb: 1.5568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
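A minimal numpy sketch of the GQA configuration above (8 query heads sharing 4 KV heads, so 2 query heads per KV head). Weight shapes, the causal mask, and the softmax details are illustrative assumptions, not taken from the PR's code:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads share n_kv_heads KV heads."""
    T, D = x.shape
    hd = D // n_heads                    # per-head dim
    group = n_heads // n_kv_heads        # query heads per KV head (2 here)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)   # wk/wv project to n_kv_heads*hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # repeat each KV head so its query group can attend to it
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal[None], -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v).reshape(T, D)
```

The memory saving comes from storing only 4 KV heads in the cache while keeping 8 heads of query expressiveness.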
MLP3x
Three-layer MLP stack with ReLU² activation.
parameters: {"mlp_layers":3,"activation":"ReLU²"}
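A sketch of the three-layer MLP stack with ReLU² activation, as described above; the layer widths and the placement of the activation (between layers, none after the last) are assumptions:

```python
import numpy as np

def relu2(x):
    """ReLU-squared activation: max(x, 0)^2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w1, w2, w3):
    """Three stacked linear layers with ReLU^2 between them."""
    return relu2(relu2(x @ w1) @ w2) @ w3
```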
weight tying
Tied input and output embeddings.
parameters: null
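Weight tying means the token embedding matrix and the output (unembedding) projection are one tensor. A minimal sketch, with the init scale as an assumption:

```python
import numpy as np

class TiedEmbedding:
    """Input embedding and output head share one (vocab, dim) matrix."""
    def __init__(self, vocab, dim, rng):
        self.W = rng.standard_normal((vocab, dim)) * 0.02

    def embed(self, ids):
        return self.W[ids]          # (T, dim) lookup

    def logits(self, h):
        return h @ self.W.T         # (T, vocab): same weights, transposed
```

Tying halves the parameter count spent on the vocabulary, which matters at this artifact size (15.9 MB).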
U-Net skip connections
Skip connections inspired by U-Net added to the transformer stack.
parameters: null
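One common way to add U-Net-style skips to a transformer stack: the first half of the layers push their inputs onto a stack, the second half pop and add the mirrored activation. Whether the PR adds, concatenates, or gates the skips is not stated, so additive skips here are an assumption:

```python
def unet_stack(x, layers):
    """Apply layers in order; the second half receives additive skip
    connections from the mirrored first-half activations."""
    half = len(layers) // 2
    skips = []
    for i, layer in enumerate(layers):
        if i < half:
            skips.append(x)          # save encoder-side activation
        elif skips:
            x = x + skips.pop()      # add mirrored skip before the layer
        x = layer(x)
    return x
```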
RoPE
Multi-scale RoPE applied by KV group with different context scales.
parameters: {"bases":[1000,10000,100000,1000000]}
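A numpy sketch of multi-scale RoPE: each KV group gets its own rotary base from the PR's list, so different groups see position information at different context scales. The even/odd pairing convention is an assumption:

```python
import numpy as np

BASES = [1000, 10000, 100000, 1000000]   # one base per KV group (from the PR)

def rope_multiscale(q, bases=BASES):
    """q: (T, n_groups, head_dim). Rotate each group with its own base."""
    T, G, hd = q.shape
    out = np.empty_like(q)
    pos = np.arange(T)[:, None]
    for g, base in enumerate(bases[:G]):
        inv = float(base) ** (-np.arange(0, hd, 2) / hd)   # (hd/2,) frequencies
        ang = pos * inv                                    # (T, hd/2) angles
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = q[:, g, 0::2], q[:, g, 1::2]
        out[:, g, 0::2] = x1 * cos - x2 * sin              # 2-D rotation per pair
        out[:, g, 1::2] = x1 * sin + x2 * cos
    return out
```

Since RoPE is a pure rotation, it preserves vector norms, which the test below checks.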
other
Byte-level token embeddings using a 64-dim UTF-8 byte side channel.
parameters: {"dimension":64}
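The byte side channel embeds each token's raw UTF-8 bytes with a 64-dim table. A sketch assuming mean pooling over the token's bytes (the PR does not state the pooling or how the result is merged with the main embedding):

```python
import numpy as np

BYTE_DIM = 64   # side-channel width from the PR

def byte_embed(token_str, byte_table):
    """Look up a BYTE_DIM embedding for each UTF-8 byte of the token
    and mean-pool them. byte_table: (256, BYTE_DIM) learned lookup."""
    byte_ids = list(token_str.encode('utf-8'))
    return byte_table[byte_ids].mean(axis=0)
```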
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
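The PR follows the common Muon recipe: Muon updates matrix parameters, Adam handles embeddings and scalars. A sketch of the routing logic; the name-based embedding check is an assumption about how parameters are identified:

```python
import numpy as np

def split_param_groups(named_params):
    """Route 2-D (matrix) weights to Muon and everything else
    (embeddings, scalar/vector params) to Adam."""
    muon, adam = [], []
    for name, p in named_params:
        if p.ndim == 2 and 'embed' not in name:
            muon.append(name)   # matrix weight -> Muon
        else:
            adam.append(name)   # embedding / scalar -> Adam
    return muon, adam
```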
Weight Averaging
SWA
parameters: null
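SWA keeps a running mean of parameter snapshots taken during training and evaluates with the averaged weights. A minimal sketch (snapshot cadence and any learning-rate schedule interaction are unspecified in the PR):

```python
import numpy as np

class SWA:
    """Stochastic weight averaging: incremental mean of snapshots."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=np.float64) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n   # running-mean update
```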
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Quantization
mixed int5/int6
bits: 5 and 6 (mixed by weight group)
scope: MLP/attention/bigram/byte weights
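A sketch of symmetric n-bit quantization for the export step; which weight groups get 5 bits versus 6 is not specified beyond the scope line above, so per-tensor scaling here is an assumption:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```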
Compression
zstd
level: null
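The export pipeline compresses the quantized weight bytes and verifies a lossless roundtrip. The PR uses zstd; the sketch below uses stdlib `zlib` as a stand-in (the `zstandard` package is a third-party dependency), since both expose the same compress/decompress shape:

```python
import zlib
import numpy as np

def export_roundtrip(q, codec=zlib):
    """Compress quantized weights and verify they decode bit-exactly.
    Swap `codec` for a zstd wrapper to match the PR's zstd export."""
    raw = q.tobytes()
    blob = codec.compress(raw)
    restored = np.frombuffer(codec.decompress(blob), dtype=q.dtype)
    return blob, restored.reshape(q.shape)
```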

Novel Contributions

  • Progressive sequence length schedule
  • Multi-scale RoPE by KV group
  • Byte-level token embeddings from UTF-8 bytes
  • Mixed-bit quantization export with zstd roundtrip
  • Combined blueprint stack on a single RTX 3080
  • Systematic ablation of leaderboard techniques under a 10-minute wallclock cap
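The progressive sequence length schedule from the first bullet can be sketched as a staged ramp up to the train_length of 2048. The stage list and the even split of steps across stages are assumptions; the PR only names the schedule:

```python
def seq_len_schedule(step, total_steps, stages=(256, 512, 1024, 2048)):
    """Return the training sequence length for a given step: walk
    through power-of-two stages, ending at the PR's train_length."""
    idx = min(int(len(stages) * step / total_steps), len(stages) - 1)
    return stages[idx]
```

Short sequences early make each of the scarce 10-minute-budget steps cheap; full-length sequences late teach the long-range behavior the multi-scale RoPE bases target.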