PR #2068
Record: PR1797Base + HadamardRotation + ValueResidualLearning - val_bpb 1.06172
by jayaram1125 · View on GitHub
val_bpb
1.0617
Architecture
Transformer
Optimizer
—
Artifact Size
~15.97 MB
Training Techniques
Architecture
Value Residual
Blends the value tensor from layer 0 into later attention layers during training using a learned per-layer mixing scalar.
parameters: {"layers":11}
SmearGate
Smear gate mechanism inherited from the PR#1797 base stack.
parameters: {"window":12}
Gated Attention
Sparse/gated attention components are enabled in the run configuration.
parameters: null
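The record does not specify which gated-attention variant is enabled, so the following is a generic, hedged sketch of one common formulation: a per-channel sigmoid gate applied to the attention output before the residual add.

```python
import torch

class GatedAttentionOutput(torch.nn.Module):
    """Generic sketch of output gating; not necessarily the PR's variant."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # gate computed from the block input, applied to the attention output
        return torch.sigmoid(self.gate_proj(x)) * attn_out
```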
Quantization
GPTQ
bits: 6
scope: weight matrices
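For orientation, the snippet below shows only the symmetric 6-bit integer grid that GPTQ quantizes onto, using plain round-to-nearest with one scale per output row; real GPTQ additionally applies Hessian-aware error compensation column by column, which is omitted here.

```python
import torch

def quantize_6bit_per_channel(W: torch.Tensor):
    # Round-to-nearest onto a signed 6-bit grid, one scale per output row.
    # Full GPTQ uses the same grid but corrects rounding error using
    # second-order (Hessian) information; that step is not shown here.
    qmax = 2 ** (6 - 1) - 1                                # 31
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    Q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return Q.to(torch.int8), scale                         # dequant: Q * scale
```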
Other
Hadamard Rotation
Hadamard rotation applied as a post-training quantization pre-processing step to reduce outlier columns and improve GPTQ compression.
parameters: {"seed":12648430}
Compression
brotli
level: null
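Since the record leaves the brotli level unspecified (level: null), the sketch below defaults to quality 11, brotli's maximum and a common choice for one-shot artifact packing; the file paths are placeholders.

```python
import brotli  # pip install brotli

def compress_artifact(src: str, dst: str, quality: int = 11) -> None:
    # One-shot brotli compression of a serialized model artifact.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(brotli.compress(data, quality=quality))
```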
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Hadamard rotation before GPTQ quantization to reduce outlier columns and improve int6 compression
- Value Residual Learning with learned blending of layer-0 values into later attention layers
- Combination of PR#1797 base stack with SmearGate and LQER asym components