## Non-Record: Int6 QAT + Sliding Window Eval — 1.310 BPB (DGX Spark)

PR #1845 (open) by AlirezaAlampour

| Metric | Value |
| --- | --- |
| val_bpb | 1.3095 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 13.73 MB |

### Training Techniques

#### Architecture

**U-Net skip connections**
Shared-transformer U-Net style topology with encoder/decoder halves and learned skip weights.
parameters: `{"layers": 7, "dimensions": 512, "heads_q": 8, "heads_kv": 4}`

**Weight tying**
Tied embeddings.
parameters: null
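
A minimal sketch of tied embeddings, assuming PyTorch; the vocabulary size is illustrative (the card does not state it).

```python
import torch.nn as nn

vocab_size, dim = 256, 512
embed = nn.Embedding(vocab_size, dim)              # input embedding
lm_head = nn.Linear(dim, vocab_size, bias=False)   # output projection
lm_head.weight = embed.weight                      # both share one parameter tensor
```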

**GQA**
Grouped query attention with 8 query heads and 4 KV heads.
parameters: `{"query_heads": 8, "kv_heads": 4, "head_dim": 64}`

**LeakyReLU**
MLP activation uses leaky ReLU squared.
parameters: `{"negative_slope": 0.5}`

**RoPE**
Rotary positional encoding.
parameters: `{"base": 10000}`

#### Quantization

**int6**
bits: 6
scope: MLP and attention projections
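
The contributions list describes this as per-row int6 QAT. A minimal sketch, assuming PyTorch, a symmetric per-row scale, and a straight-through estimator; all three details are assumptions about the PR's scheme.

```python
import torch

def quantize_int6_per_row(w):
    # symmetric per-row scale onto [-31, 31] (signed 6-bit is [-32, 31];
    # the symmetric choice here is an assumption)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = (w / scale).round().clamp(-31, 31)
    return q * scale

def fake_quant_ste(w):
    # forward pass sees the quantized weight; gradients flow to w unchanged
    return w + (quantize_int6_per_row(w) - w).detach()
```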

#### Compression

**lzma**
level: null
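
A sketch of LZMA-compressed checkpoint serialization using the standard-library lzma module, assuming PyTorch; the card leaves the compression level unset, so the defaults are used here.

```python
import io
import lzma

import torch

def save_compressed(model, path):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)   # serialize to memory first
    with lzma.open(path, "wb") as f:      # default LZMA preset
        f.write(buf.getvalue())

def load_compressed(path):
    with lzma.open(path, "rb") as f:
        return torch.load(io.BytesIO(f.read()))
```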

#### Evaluation

**Sliding window eval**
parameters: `{"stride": 64}`

#### Sequence Length

train_length: 1024
eval_length: 1024

#### Regularization

**Logit softcap**
parameters: `{"softcap": 30}`

#### Weight Averaging

**EMA**
parameters: `{"decay": 0.997}`

### Novel Contributions

- Int6 per-row QAT for MLP and attention projections
- LZMA serialization to reduce artifact size
- Stride-64 sliding-window evaluation
- 7-layer shared-transformer U-Net topology with learned skip connections
- 10-hour training budget on DGX Spark