PR #928

open

Non-record: XSA-all + mHC + Full QAT (val_bpb=1.1211)

by autocode-rayes
val_bpb: 1.1211
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95 MB

Training Techniques

Architecture
XSA
Cross-sequence attention applied to all 11 layers instead of only the last 4 layers.
parameters: {"layers":11}
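
The layer-coverage change above amounts to a different layer mask. A minimal sketch (the helper name and `"last4"` baseline mode are hypothetical, inferred from "only the last 4 layers"):

```python
def xsa_layer_mask(n_layers: int, mode: str) -> list:
    """Which transformer blocks get cross-sequence attention.
    "last4" is the assumed baseline (XSA in the final 4 blocks only);
    "all" enables it in every block, as this PR does for all 11 layers."""
    if mode == "all":
        return [True] * n_layers
    return [i >= n_layers - 4 for i in range(n_layers)]
```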
Other
Manifold-constrained hyper-connections with learnable alpha/beta residual mixing per block under a norm constraint.
parameters: {"extra_params":22}
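
A minimal sketch of what the alpha/beta mixing could look like, assuming scalar alpha/beta per block renormalized to unit norm (the class name and exact constraint are assumptions, but 2 scalars per block x 11 blocks matches the 22 extra parameters listed):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Hypothetical per-block residual mixer: y = a*x + b*f(x),
    with (a, b) projected onto the unit circle at each forward pass
    so the mixing weights stay on a fixed-norm manifold."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
        w = torch.stack([self.alpha, self.beta])
        w = w / w.norm()  # norm constraint on the mixing coefficients
        return w[0] * x + w[1] * fx
```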
Quantization
QAT
bits: 6
scope: all
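
Six-bit QAT over all weights typically means fake quantization during training with a straight-through estimator; a sketch under that assumption (per-tensor symmetric scaling is a guess, the PR does not specify the scheme):

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through
    estimator: the forward pass sees 6-bit values, the backward pass
    sees the identity, so gradients update the full-precision weights."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return w + (q * scale - w).detach()
```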
Evaluation
sliding window eval
parameters: {"stride":64}
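
With stride 64, evaluation windows overlap so each token is scored once with long left context; a sketch of the scheduling (the window size and helper are assumptions, only the stride comes from the listing):

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Yield (ctx_start, ctx_end, n_new) triples: each window scores
    only its n_new trailing tokens, so every token is counted exactly
    once while earlier tokens serve as context."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield (start, end, end - prev_end)
        prev_end = end
        if end == n_tokens:
            break
```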
Test-Time Training
full TTT
parameters: {"enabled":1}
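
Full TTT generally means adapting the model on already-seen evaluation text before scoring what comes next; a generic sketch under that assumption (the loop, learning rate, and the convention that the model's forward returns its training loss are all hypothetical):

```python
import copy
import torch

def ttt_eval(model, chunks, lr=1e-4):
    """Adapt a throwaway copy of the model on each chunk with one SGD
    step, then score the following chunk with the adapted weights.
    The original model is left untouched."""
    m = copy.deepcopy(model)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    losses = []
    for prev, nxt in zip(chunks, chunks[1:]):
        loss = m(prev)              # assumed: forward returns the loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            losses.append(m(nxt).item())
    return losses
```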

Novel Contributions

  • XSA applied to all 11 layers
  • Manifold-constrained hyper-connections with 22 extra parameters
  • Full-training QAT from step 1
  • Parallel Muon optimizer stack
  • Sliding window evaluation and legal TTT improvement