PR #1411
Non-record: Blueprint Stack + ProgSeq + Multi-scale RoPE + ByteEmbed — val_bpb 1.5568 (1xRTX 3080)
by Blakethefn
val_bpb
1.5568
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.9 MB
Training Techniques
Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
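A minimal sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads, so 2 query heads per KV head); model and head dims are assumptions, and the causal mask is omitted for brevity.

```python
import torch

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads."""
    B, T, _ = q.shape
    hd = q.shape[-1] // n_heads            # per-head dim (assumed 32 here)
    group = n_heads // n_kv_heads          # query heads per KV head (= 2)
    q = q.view(B, T, n_heads, hd).transpose(1, 2)      # (B, 8, T, hd)
    k = k.view(B, T, n_kv_heads, hd).transpose(1, 2)   # (B, 4, T, hd)
    v = v.view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Replicate each KV head across its query-head group.
    k = k.repeat_interleave(group, dim=1)              # (B, 8, T, hd)
    v = v.repeat_interleave(group, dim=1)
    att = ((q @ k.transpose(-2, -1)) / hd**0.5).softmax(dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, -1)

x_q = torch.randn(2, 16, 256)   # model dim 256 -> 8 heads of dim 32
x_kv = torch.randn(2, 16, 128)  # KV projection: 4 heads of dim 32
out = gqa_attention(x_q, x_kv, x_kv.clone())
print(out.shape)  # torch.Size([2, 16, 256])
```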
MLP3x
Three-layer MLP stack with ReLU² activation.
parameters: {"mlp_layers":3,"activation":"ReLU²"}
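A possible shape of the three-layer MLP with ReLU² activation; the widths are assumptions, not the submission's actual values.

```python
import torch
import torch.nn as nn

class ReLU2(nn.Module):
    """ReLU squared: relu(x) ** 2."""
    def forward(self, x):
        return torch.relu(x).square()

def mlp3x(d_model=256, d_hidden=1024):
    """Three linear layers with ReLU² between them (widths assumed)."""
    return nn.Sequential(
        nn.Linear(d_model, d_hidden), ReLU2(),
        nn.Linear(d_hidden, d_hidden), ReLU2(),
        nn.Linear(d_hidden, d_model),
    )

y = mlp3x()(torch.randn(2, 16, 256))
print(y.shape)  # torch.Size([2, 16, 256])
```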
weight tying
Tied input and output embeddings.
parameters: null
U-Net skip connections
Skip connections inspired by U-Net added to the transformer stack.
parameters: null
RoPE
Multi-scale RoPE: each KV-head group uses a different rotary base, giving the groups different effective context scales.
parameters: {"bases":[1000,10000,100000,1000000]}
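The listed bases suggest one rotary base per KV group. A sketch under that reading, with an assumed head dim and the common even/odd pairing convention:

```python
import torch

def rope_freqs(head_dim, base, T):
    """Rotary angles for one base: shape (T, head_dim // 2)."""
    inv = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(torch.arange(T).float(), inv)

def apply_rope(x, angles):
    """x: (B, T, head_dim). Rotate (even, odd) channel pairs by `angles`."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

# Each of the 4 KV groups gets its own base, so the groups resolve
# position at different context scales.
bases = [1_000, 10_000, 100_000, 1_000_000]
B, T, hd = 2, 16, 32
q_groups = [torch.randn(B, T, hd) for _ in bases]
q_rot = [apply_rope(q, rope_freqs(hd, b, T)) for q, b in zip(q_groups, bases)]
print(q_rot[0].shape)  # torch.Size([2, 16, 32])
```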
other
Byte-level token embeddings using a 64-dim UTF-8 byte side channel.
parameters: {"dimension":64}
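One way a 64-dim UTF-8 byte side channel could work: embed each token's bytes (padded to a fixed length) and pool into a 64-dim vector that is added to the token embedding. The pooling, max byte count, and pad handling below are all assumptions.

```python
import torch
import torch.nn as nn

MAX_BYTES = 8  # assumed cap on bytes per token

class ByteEmbed(nn.Module):
    """Mean-pooled embedding of a token's UTF-8 bytes (hypothetical design)."""
    def __init__(self, dim=64):
        super().__init__()
        self.emb = nn.Embedding(257, dim, padding_idx=256)  # 256 values + pad

    def forward(self, token_strings):
        ids = torch.full((len(token_strings), MAX_BYTES), 256, dtype=torch.long)
        for i, s in enumerate(token_strings):
            bs = s.encode("utf-8")[:MAX_BYTES]
            ids[i, :len(bs)] = torch.tensor(list(bs))
        e = self.emb(ids)                                 # (N, MAX_BYTES, 64)
        mask = (ids != 256).unsqueeze(-1).float()
        return (e * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean over real bytes

vecs = ByteEmbed()(["hello", "世界"])
print(vecs.shape)  # torch.Size([2, 64])
```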
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: null
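SWA keeps a running average of the weights over late-training checkpoints and evaluates the averaged model. A minimal incremental-mean sketch (the submission's averaging window and cadence are not stated):

```python
import torch

def swa_update(avg_state, model_state, n_averaged):
    """Incremental mean: avg += (w - avg) / (n + 1)."""
    for k in avg_state:
        avg_state[k] += (model_state[k] - avg_state[k]) / (n_averaged + 1)

avg = {"w": torch.zeros(3)}
checkpoints = [torch.ones(3), 3 * torch.ones(3)]  # two snapshots to average
for n, w in enumerate(checkpoints):
    swa_update(avg, {"w": w}, n)
print(avg["w"])  # tensor([2., 2., 2.]) — the mean of the two snapshots
```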
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Quantization
mixed int5/int6
bits: 5
scope: MLP/attention/bigram/byte weights
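A sketch of symmetric per-tensor quantization at 5 and 6 bits, the two widths named above; per-channel scales or other refinements the submission may use are not shown.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((128, 128)).astype(np.float32)
q5, s5 = quantize_symmetric(w, 5)   # e.g. some weight groups at int5
q6, s6 = quantize_symmetric(w, 6)   # e.g. others at int6
err5 = np.abs(w - dequantize(q5, s5)).max()
err6 = np.abs(w - dequantize(q6, s6)).max()
print(err6 < err5)  # True: more bits give a finer scale, so lower error
```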
Compression
zstd
level: null
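The export roundtrip (serialize quantized weights, compress, decompress, verify bit-exact recovery) can be sketched as below. The submission uses zstd; stdlib zlib stands in here so the sketch runs without the third-party `zstandard` package, and the shape is illustrative only.

```python
import io
import zlib
import numpy as np

# Quantized int8 weights standing in for the exported artifact.
q = np.random.default_rng(0).integers(-16, 16, size=(128, 128), dtype=np.int8)

buf = io.BytesIO()
np.save(buf, q)
blob = zlib.compress(buf.getvalue(), level=9)  # the submission compresses with zstd

q2 = np.load(io.BytesIO(zlib.decompress(blob)))
print(np.array_equal(q, q2))  # True: compression is lossless after quantization
```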
Novel Contributions
- Progressive sequence length schedule
- Multi-scale RoPE by KV group
- Byte-level token embeddings from UTF-8 bytes
- Mixed-bit quantization export with zstd roundtrip
- Combined blueprint stack on a single RTX 3080
- Systematic ablation of leaderboard techniques under a 10-minute wallclock cap
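The progressive sequence length schedule listed above could look like the following sketch: a linear ramp from a short warmup length up to the train length of 2048, rounded to a hardware-friendly multiple. The start length, ramp shape, and rounding are assumptions.

```python
def progseq_schedule(step, total_steps, min_len=256, max_len=2048):
    """Hypothetical progressive sequence length: linear growth, rounded to 64."""
    frac = step / max(total_steps - 1, 1)
    length = min_len + frac * (max_len - min_len)
    return int(round(length / 64) * 64)

print([progseq_schedule(s, 1000) for s in (0, 500, 999)])  # [256, 1152, 2048]
```

Shorter early sequences let more optimizer steps fit under a tight wallclock cap, with full-length batches reserved for the end of training.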