val_bpb: 1.2037
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12,499,612 bytes
Training Techniques
Quantization
- mixed int5/int6 (scope: INT5 for MLP weights, INT6 for attention weights)
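The report does not describe the quantizer itself; below is a minimal sketch of symmetric per-tensor quantization at the listed bit widths (5 bits for MLP weights, 6 for attention), assuming a plain scale-and-round scheme.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to `bits` signed levels (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for INT5, 31 for INT6
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Hypothetical usage mirroring the stated scope.
q_mlp, s_mlp = quantize_symmetric(torch.randn(512, 1536), bits=5)    # MLP weight
q_attn, s_attn = quantize_symmetric(torch.randn(512, 512), bits=6)   # attention weight
```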
Architecture
- SmearGate: custom gating mechanism used in the model.
- BigramHash: bigram hashing component with 2048 dimensions.
- OrthoInit: orthogonal initialization.
- MLP3x: transformer MLP expanded to 3x the hidden size (10 layers, dim 512, mlp_multiplier 3).
- tied embeddings: FP16 tied embedding weights.
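A minimal sketch of weight tying between the token embedding and the output head; the vocabulary size and the idea that the FP16 cast happens at export time are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50257, 512                     # vocab size is illustrative
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight                    # tie: one shared parameter for input and output

# Illustrative export step: the single shared matrix is stored in FP16.
state = {"tok_embeddings.weight": embed.weight.detach().half()}
```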
Optimizer
- Muon: weight_decay 0.04, decoupled weight decay.
- AdamW: weight_decay 0.04 (scope: embeddings and scalars).
Weight Averaging
- EMA: decay 0.999, updated every 10 steps.
Compression
- zstd, level 22.
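A minimal sketch of compressing the exported artifact with zstd at level 22 using the Python zstandard bindings; the file names are placeholders.

```python
import zstandard as zstd

compressor = zstd.ZstdCompressor(level=22)       # level 22 is zstd's maximum standard level

# Placeholder paths: compress the exported artifact to a .zst file.
with open("model.bin", "rb") as src, open("model.bin.zst", "wb") as dst:
    dst.write(compressor.compress(src.read()))
```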
Evaluation
- sliding window eval: stride 64, sequence length 2048.
Initialization
- OrthoInit: orthogonal initialization used for model weights.
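A minimal sketch applying PyTorch's orthogonal initializer to 2-D weight matrices; which parameters it was actually applied to is not stated in the report.

```python
import torch.nn as nn

def init_orthogonal_(model: nn.Module):
    """Apply orthogonal initialization to every weight tensor with 2+ dimensions."""
    for p in model.parameters():
        if p.ndim >= 2:
            nn.init.orthogonal_(p)

layer = nn.Linear(512, 512)
init_orthogonal_(layer)
```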
Sequence Length
- train length: not specified
- eval length: 2048
LR Schedule
- warmdown: 3000 warmdown steps.
Regularization
- weight decay: 0.04
- pruning: 3% sparsity
Novel Contributions
- 10-layer transformer with mixed INT5/INT6 quantization
- SmearGate + BigramHash + OrthoInit integration
- Muon optimizer with decoupled weight decay
- EMA weight averaging
- 3% magnitude pruning before export
- Sliding window evaluation with stride 64
- RoPE base 50K (see the sketch after this list)
- Late-K passthrough for the last 2 layers
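The contributions list mentions RoPE with base 50K but gives no further detail; below is a minimal sketch of the standard rotary-embedding angle table with base theta = 50,000 and an assumed head dimension of 64.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 50_000.0):
    """Standard RoPE angle table with base theta = 50,000 (illustrative)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)    # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_frequencies(head_dim=64, seq_len=2048)   # head_dim is an assumption
```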