## Non-Record: Int6 QAT + Sliding Window Eval — 1.310 BPB (DGX Spark)

PR #1845 (open) by AlirezaAlampour

| Metric | Value |
| --- | --- |
| val_bpb | 1.3095 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 13.73 MB |

### Training Techniques

#### Architecture

**U-Net skip connections**
Shared-transformer U-Net style topology with encoder/decoder halves and learned skip weights.
parameters: `{"layers": 7, "dimensions": 512, "heads_q": 8, "heads_kv": 4}`

**Weight tying**
Tied embeddings.
parameters: null
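
A minimal sketch of tied embeddings, assuming PyTorch; the vocabulary size is illustrative (the card does not state it).

```python
import torch.nn as nn

vocab_size, dim = 256, 512
embed = nn.Embedding(vocab_size, dim)              # input embedding
lm_head = nn.Linear(dim, vocab_size, bias=False)   # output projection
lm_head.weight = embed.weight                      # both share one parameter tensor
```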

**GQA**
Grouped query attention with 8 query heads and 4 KV heads.
parameters: `{"query_heads": 8, "kv_heads": 4, "head_dim": 64}`

**LeakyReLU**
MLP activation uses leaky ReLU squared.
parameters: `{"negative_slope": 0.5}`

**RoPE**
Rotary positional encoding.
parameters: `{"base": 10000}`

#### Quantization

**int6**
bits: 6
scope: MLP and attention projections
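
The contributions list describes this as per-row int6 QAT. A minimal sketch, assuming PyTorch, a symmetric per-row scale, and a straight-through estimator; all three details are assumptions about the PR's scheme.

```python
import torch

def quantize_int6_per_row(w):
    # symmetric per-row scale onto [-31, 31] (signed 6-bit is [-32, 31];
    # the symmetric choice here is an assumption)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = (w / scale).round().clamp(-31, 31)
    return q * scale

def fake_quant_ste(w):
    # forward pass sees the quantized weight; gradients flow to w unchanged
    return w + (quantize_int6_per_row(w) - w).detach()
```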

#### Compression

**lzma**
level: null
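
A sketch of LZMA-compressed checkpoint serialization using the standard-library lzma module, assuming PyTorch; the card leaves the compression level unset, so the defaults are used here.

```python
import io
import lzma

import torch

def save_compressed(model, path):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)   # serialize to memory first
    with lzma.open(path, "wb") as f:      # default LZMA preset
        f.write(buf.getvalue())

def load_compressed(path):
    with lzma.open(path, "rb") as f:
        return torch.load(io.BytesIO(f.read()))
```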

#### Evaluation

**Sliding window eval**
parameters: `{"stride": 64}`

#### Sequence Length

train_length: 1024
eval_length: 1024

#### Regularization

**Logit softcap**
parameters: `{"softcap": 30}`

#### Weight Averaging

**EMA**
parameters: `{"decay": 0.997}`

### Novel Contributions

- Int6 per-row QAT for MLP and attention projections
- LZMA serialization to reduce artifact size
- Stride-64 sliding-window evaluation
- 7-layer shared-transformer U-Net topology with learned skip connections
- 10-hour training budget on DGX Spark