PR #1247
openProposal: Validate ASQU on the March 22 10min/16MB control line
by fahmitechView on GitHub
val_bpb
1.2208
Architecture
Transformer
Optimizer
Muon
Artifact Size
11,762,459 bytes
Training Techniques
Architecture
GQA
Grouped query attention with 4 KV heads in the 11-layer base stack
parameters: {"kv_heads":4}
XSA
Efficient partial XSA applied on the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to a partial subset of dimensions
parameters: {"dimensions":16}
VE128
Shared Value Embedding with dimension 128 and per-layer learned scales
parameters: {"dimension":128,"layers":[9,10]}
ASQU
MLP activation x^2 for positive inputs and learned per-channel beta_i * x^2 for negative inputs
parameters: {"beta_per_channel":true}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: {"interval":50}
Quantization
GPTQ-lite
bits: 6
scope: MLP + attention weights
late QAT
bits: 6
scope: all
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":["embeddings","scalars"]}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Compression
zstd
level: 22
Novel Contributions
- Transfers ASQU onto the March 22 control line as a differentiated candidate for the 10-minute/16MB regime
- Adds learned per-channel beta parameters for ASQU with a dedicated low learning rate
- Introduces a dedicated optimizer group for ASQU beta parameters
- Adds export handling intended to protect ASQU control tensors from quantization damage
- Provides a stage-gated validation plan with explicit GO/KILL criteria for real 8xH100 runs