PR #993

closed

Record: 11L XSA + Mixed INT6 + Adaptive N-gram Cache (2->7 backoff) - val_bpb=0.9631, 3-seed

by aerosta
val_bpb
0.9631
Architecture
Transformer
Optimizer
Artifact Size
15,882,569 bytes

Training Techniques

Architecture
XSA
XSA applied to all 11 layers of an 11-layer Transformer with a 512-dim hidden size, 8 query heads, and 4 key/value heads.
parameters: {"layers":11,"hidden_dim":512,"q_heads":8,"kv_heads":4}
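XSA itself is not defined in this record, but independent of its details, the 8-query / 4-key-value head split means each KV head serves two query heads, grouped-query style. A minimal shape-level sketch of that head sharing (all names here are assumptions, not the PR's code):

```python
import numpy as np

d, q_heads, kv_heads = 512, 8, 4
head_dim = d // q_heads              # 64
group = q_heads // kv_heads          # 2 query heads per KV head

rng = np.random.default_rng(0)
T = 10                               # sequence length for the sketch
q = rng.standard_normal((q_heads, T, head_dim))
k = rng.standard_normal((kv_heads, T, head_dim))
v = rng.standard_normal((kv_heads, T, head_dim))

# Expand each KV head across its group of query heads, then attend.
k_full = np.repeat(k, group, axis=0)         # (8, T, 64)
v_full = np.repeat(v, group, axis=0)
att = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
att = np.exp(att - att.max(-1, keepdims=True))
att /= att.sum(-1, keepdims=True)            # softmax over keys
out = att @ v_full                           # (8, T, 64)
print(out.shape)
```

Halving the KV heads shrinks the KV cache and attention parameters without reducing the number of query projections.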
MLP3x
MLP widened to three times the hidden size, with squared-ReLU (ReLU²) activation.
parameters: {"multiplier":3,"activation":"ReLU²"}
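The MLP3x block is simple to sketch: project 512 -> 1536, apply squared ReLU, project back. A minimal version with assumed weight init (not the PR's code):

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # 3x-wide MLP: d -> 3d, relu^2, 3d -> d.
    return relu2(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d = 512
w_in = rng.standard_normal((d, 3 * d)) * 0.02
w_out = rng.standard_normal((3 * d, d)) * 0.02
x = rng.standard_normal((4, d))
y = mlp3x(x, w_in, w_out)
print(y.shape)  # (4, 512)
```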
weight tying
Tied embeddings.
parameters: null
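Weight tying reuses one matrix for both the input embedding and the output projection, saving `vocab * d` parameters. A minimal sketch (sizes are illustrative, not from the record):

```python
import numpy as np

vocab, d = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d)) * 0.02   # the single shared matrix

def embed(token_ids):
    return E[token_ids]                      # token id -> vector

def logits(hidden):
    return hidden @ E.T                      # vector -> vocab logits, no separate head

h = embed(np.array([3, 7]))                  # (2, 16)
print(logits(h).shape)                       # (2, 100)
```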
Weight Averaging
EMA + SWA
parameters: null
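The two averaging schemes listed differ in weighting: EMA decays old parameters geometrically every step, while SWA takes an equal-weight mean over checkpoints. A minimal sketch of both updates (the decay value and schedule are assumptions; the record does not give them):

```python
def ema_update(ema, params, decay=0.999):
    # Exponential moving average, updated every optimizer step.
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

def swa_update(swa, params, n_averaged):
    # Equal-weight running mean over checkpoints collected late in training.
    return [(s * n_averaged + p) / (n_averaged + 1) for s, p in zip(swa, params)]

params = [1.0, 2.0]
ema = list(params)
for _ in range(3):
    params = [p + 0.1 for p in params]       # stand-in for an optimizer step
    ema = ema_update(ema, params, decay=0.9)

swa = swa_update([1.0, 2.0], [1.2, 2.2], 1)  # mean of two checkpoints
```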
Quantization
mixed int6
bits: 6
scope: post-training mixed
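"Mixed" here implies some tensors stay at higher precision while the rest get 6 bits; the record does not say which. The 6-bit path itself can be sketched as symmetric per-tensor quantization into the signed int6 range [-32, 31] (scaling scheme assumed, not taken from the PR):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor quantization into the int6 range [-32, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()     # bounded by scale / 2
```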
Compression
lzma
level: null
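The 15.9 MB artifact size reflects LZMA applied to the serialized int6 weight stream. The compression step itself is just Python's `lzma` module (the buffer below is stand-in bytes, not the real artifact):

```python
import lzma

# LZMA-compress a serialized weight buffer; repetitive stand-in data here.
raw = bytes(range(64)) * 256            # 16 KiB
packed = lzma.compress(raw, preset=9)
restored = lzma.decompress(packed)
print(len(raw), "->", len(packed))
```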
Evaluation
sliding window eval
parameters: {"stride":64}
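Sliding-window evaluation with stride 64 means each window after the first only counts loss on its final 64 positions, so every token is scored exactly once with up to a full window of left context. A sketch of the window/score bookkeeping (window size 512 is an assumption; only the stride is given):

```python
def sliding_windows(n_tokens, window=512, stride=64):
    # Each tuple is (start, end, score_from): loss is counted only on
    # positions [score_from, end), so every token is scored once.
    end = min(window, n_tokens)
    wins = [(0, end, 0)]                 # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        wins.append((max(0, new_end - window), new_end, end))
        end = new_end
    return wins

wins = sliding_windows(1000)
print(wins[:2])                          # [(0, 512, 0), (64, 576, 512)]
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes.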
Other
other
Adaptive score-first n-gram cache with backoff orders 2->7, applied only to later positions/windows after scoring earlier windows.
parameters: {"orders":"2->7","adaptive_mode":"sigmoid_raw_entropy","alpha_range":[0.05,0.6],"hash_buckets":4194304,"min_count":2}
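Putting the parameters together: hashed n-gram counts over orders 2 through 7, backing off from the longest context with at least `min_count` evidence, with the blend weight alpha driven by a sigmoid of the model's raw entropy into the [0.05, 0.6] range. The sketch below mirrors those parameters but all class/function names and the exact backoff rule are assumptions, not the PR's implementation:

```python
import math
from collections import defaultdict

ORDERS = range(2, 8)          # backoff orders 2..7
MIN_COUNT = 2                 # min_count from the record

class NgramCache:
    # Hashed n-gram counts ("score-first": populated from windows that
    # have already been scored, then blended into later windows only).
    def __init__(self, buckets=1 << 22):       # 4,194,304 hash buckets
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def update(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[self._key(ctx)][tokens[i + n - 1]] += 1

    def predict(self, context):
        # Back off 7 -> 2: use the longest context with enough evidence.
        for n in reversed(ORDERS):
            ctx = tuple(context[-(n - 1):])
            dist = self.counts.get(self._key(ctx))
            if dist and sum(dist.values()) >= MIN_COUNT:
                total = sum(dist.values())
                return {t: c / total for t, c in dist.items()}
        return None

def blend_alpha(entropy, lo=0.05, hi=0.6):
    # Adaptive mixing weight: sigmoid of raw model entropy, rescaled
    # into [lo, hi] (matching alpha_range in the record).
    s = 1.0 / (1.0 + math.exp(-entropy))
    return lo + (hi - lo) * s

cache = NgramCache()
cache.update(list("ababab"))
print(cache.predict(["a"]))
```

The final prediction would mix the cache distribution with the model's, weighted by `blend_alpha`, so the cache contributes most exactly where the model is uncertain.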

Novel Contributions

  • 11-layer XSA Transformer with tied embeddings and 3x MLP using ReLU²
  • Post-training mixed INT6 quantization with LZMA compression
  • Sliding-window evaluation with stride 64
  • Adaptive score-first n-gram cache with 2->7 backoff
  • EMA plus late SWA weight averaging