val_bpb: 1.1807
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,461,499 bytes
Training Techniques
Architecture
MLP3x
Uses a 3x-expanded MLP with 1536 hidden units and relu-squared activation.
parameters: {"hidden_units":1536}
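A minimal sketch of the 3x-expanded MLP block with relu-squared activation. The model dimension of 512 is an assumption inferred from 1536 hidden units at 3x expansion; the summary records only the hidden width.

```python
# Sketch of MLP3x: project up 3x, apply relu(x)^2, project back down.
# Dimensions here are toy-sized; d_model = 512 is an inferred assumption.

def relu_squared(x):
    """relu(x)^2: zero for negatives, x^2 otherwise."""
    return [max(v, 0.0) ** 2 for v in x]

def mlp3x(x, w_in, w_out):
    """x: [d_model]; w_in: [hidden][d_model]; w_out: [d_model][hidden]."""
    pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_in]
    hidden = relu_squared(pre)
    return [sum(wo * h for wo, h in zip(row, hidden)) for row in w_out]
```

Relu-squared keeps the activation zero for negative inputs but grows quadratically for positive ones, a smoother alternative to plain ReLU.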
BigramHash
Hashes each consecutive token pair into a 10240-bucket embedding table with a learnable scale.
parameters: {"buckets":10240,"dimension":128,"scale":0.05}
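A minimal sketch of the BigramHash augmentation, under the recorded parameters (10240 buckets, dimension 128, scale init 0.05). The hash mixing constants and the pad id for the first position are assumptions; the summary does not specify the hash function.

```python
# Sketch of BigramHash: hash each (prev_token, token) pair into a fixed
# bucket table of embeddings, returned scaled by a learnable scalar
# (init 0.05). Hash constants and pad id 0 are assumptions.

BUCKETS, DIM, SCALE = 10240, 128, 0.05

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Simple multiplicative hash of the ordered pair; the real mixing
    # function is not specified in the record.
    h = (prev_tok * 1000003 + tok) * 2654435761
    return h % buckets

def bigram_features(tokens, table, scale=SCALE):
    """table: [BUCKETS][DIM]; returns one scaled bucket row per position."""
    feats = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else 0  # assumed pad id
        row = table[bigram_bucket(prev_tok, tok)]
        feats.append([scale * v for v in row])
    return feats
```

The scaled bucket embedding would typically be added to the per-token embedding, giving the model a cheap n-gram signal without a full bigram vocabulary.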
SmearGate
Per-dimension learned gate blending each token with the previous token embedding.
parameters: null
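A minimal sketch of SmearGate as described: a per-dimension gate that blends each token's embedding with the previous position's. The sigmoid parameterisation and the zero state before the first token are assumptions.

```python
# Sketch of SmearGate: a learned per-dimension gate g in (0, 1) blends
# each token embedding with the previous position's embedding:
#   out[t] = (1 - g) * x[t] + g * x[t-1]
# Sigmoid gating and the zero initial state are assumptions.
import math

def smear_gate(embeddings, gate_logits):
    gate = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = []
    prev = [0.0] * len(gate_logits)  # assumed state before first token
    for x in embeddings:
        out.append([(1 - g) * xi + g * pi
                    for g, xi, pi in zip(gate, x, prev)])
        prev = x
    return out
```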
weight tying
Input and output embeddings are tied; the shared table is kept as an FP16 passthrough (left unquantized) during compression.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500}
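The Muon hyperparameters above imply a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A minimal sketch of that schedule follows; linear interpolation between the endpoints is an assumption, since only the endpoints and step count are recorded.

```python
# Momentum warmup implied by the Muon record: ramp from
# warmup_momentum_start=0.92 to momentum=0.99 over warmup_steps=1500.
# Linear interpolation is an assumption.

START, END, WARMUP_STEPS = 0.92, 0.99, 1500

def muon_momentum(step):
    if step >= WARMUP_STEPS:
        return END
    frac = step / WARMUP_STEPS
    return START + frac * (END - START)
```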
AdamW
weight_decay: 0.01
momentum: null
other_params: {"tied_embed_lr":0.03,"scalar_lr":0.02}
Weight Averaging
SWA
parameters: {"checkpoints_averaged":24,"during":"warmdown"}
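A minimal sketch of the SWA pass: a plain running average of the 24 checkpoints saved during warmdown. Checkpoints are shown as flat parameter vectors for brevity; applying this per tensor is the obvious generalisation.

```python
# Sketch of stochastic weight averaging over saved checkpoints:
# the averaged model is the elementwise mean of the parameters.

def average_checkpoints(checkpoints):
    """checkpoints: list of same-length parameter vectors."""
    n = len(checkpoints)
    avg = [0.0] * len(checkpoints[0])
    for ckpt in checkpoints:
        for i, p in enumerate(ckpt):
            avg[i] += p / n
    return avg
```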
Quantization
int5
bits: 5
scope: MLP weights
int6
bits: 6
scope: attention weights
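A minimal sketch of symmetric per-row quantization at b bits, matching the "per-row scaling" noted under Novel Contributions: int5 (b=5) for MLP weights, int6 (b=6) for attention weights. The symmetric, round-to-nearest scheme is an assumption.

```python
# Sketch of symmetric b-bit per-row quantization: each row stores
# small integers plus one floating-point scale.

def quantize_row(row, bits):
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]
```

Per-row scales keep the quantization error proportional to each row's own magnitude instead of the whole tensor's, which matters when row norms vary widely.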
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"context_length":2048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
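A minimal sketch of the warmdown schedule: hold the base learning rate, then decay to zero over the final 3000 steps. Linear decay is an assumption; only warmdown_steps is recorded.

```python
# Sketch of the warmdown LR schedule: constant base LR, then an
# assumed linear decay to zero over the last warmdown_steps steps.

def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * remaining / warmdown_steps
```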
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
gradient clipping
parameters: {"norm":0.3}
pruning
parameters: {"magnitude_pruning":"3%"}
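A minimal sketch of the 3% magnitude pruning applied before quantization: zero the smallest-magnitude 3% of weights in a tensor, shown here on a flat weight list.

```python
# Sketch of magnitude pruning: zero out the fraction `frac` of weights
# with the smallest absolute value (3% per the record).

def magnitude_prune(weights, frac=0.03):
    k = int(len(weights) * frac)          # number of weights to zero
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    pruned, zeroed = [], 0
    for w in weights:
        if abs(w) <= cutoff and zeroed < k:
            pruned.append(0.0)
            zeroed += 1
        else:
            pruned.append(w)
    return pruned
```

Pruning before quantization concentrates the quantizer's dynamic range on the surviving weights, and the zeros compress well under zstd.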
Novel Contributions
- 10-layer transformer with U-Net skip connections
- MLP 3x expansion with relu-squared activation
- BigramHash token-pair embedding augmentation
- SmearGate token blending mechanism
- Mixed int5/int6 quantization with per-row scaling
- 3% magnitude pruning before quantization
- SWA over 24 checkpoints during warmdown
- Audited run (seed=42) with a genuine training log and submission artifacts that match it