PR #1247

open

Proposal: Validate ASQU on the March 22 10min/16MB control line

by fahmitechView on GitHub

val_bpb

1.2208

Architecture

Transformer

Optimizer

Muon

Artifact Size

11,762,459 bytes

Training Techniques

Architecture

GQA

Grouped query attention with 4 KV heads in the 11-layer base stack

parameters: {"kv_heads":4}

XSA

Efficient partial XSA applied on the last 4 layers

parameters: {"layers":4}

Partial RoPE

Rotary position embeddings applied to a partial subset of dimensions

parameters: {"dimensions":16}

VE128

Shared Value Embedding with dimension 128 and per-layer learned scales

parameters: {"dimension":128,"layers":[9,10]}

ASQU

MLP activation x^2 for positive inputs and learned per-channel beta_i * x^2 for negative inputs

parameters: {"beta_per_channel":true}

Regularization

LN scale

parameters: {"scale":"1/sqrt(layer_idx+1)"}

logit softcap

parameters: {"value":30}

Weight Averaging

EMA

parameters: {"decay":0.997}

Tight SWA

parameters: {"interval":50}

Quantization

GPTQ-lite

bits: 6

scope: MLP + attention weights

late QAT

bits: 6

scope: all

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"warmup_momentum_start":0.92}

AdamW

weight_decay: 0.04

momentum: null

other_params: {"used_for":["embeddings","scalars"]}

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}

Compression

zstd

level: 22

Novel Contributions

Transfers ASQU onto the March 22 control line as a differentiated candidate for the 10-minute/16MB regime
Adds learned per-channel beta parameters for ASQU with a dedicated low learning rate
Introduces a dedicated optimizer group for ASQU beta parameters
Adds export handling intended to protect ASQU control tensors from quantization damage
Provides a stage-gated validation plan with explicit GO/KILL criteria for real 8xH100 runs