PR #1142

open

SpotlightLFB + Aux-Int6 Compression — 1.1493 BPB (3-seed mean)

by ymrohit
val_bpb
1.1493
Architecture
Transformer
Optimizer
Artifact Size
~15.92 MB

Training Techniques

Architecture
BigramHash
Hashes token bigrams into a small embedding table that feeds the lexical side path.
parameters: {"vocab_size":1536,"dimension":96}
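A minimal sketch of how such a hashed bigram path might look. Only `vocab_size=1536` and `dimension=96` come from the PR; the hash function and its mixing constant are assumptions:

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM = 1536, 96   # from the PR parameters

def bigram_hash_ids(token_ids, mult=2654435761):
    # Hash each (previous, current) token pair into the small bigram vocab.
    # The multiplicative constant is a Knuth-style guess, not from the PR.
    prev = np.concatenate([[0], token_ids[:-1]])      # pad the first position
    mixed = (prev * mult + token_ids) & 0xFFFFFFFF
    return mixed % BIGRAM_VOCAB

# Hypothetical embedding table feeding the lexical side path.
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BIGRAM_VOCAB, BIGRAM_DIM)).astype(np.float32)
ids = np.array([5, 17, 17, 902])
side_feats = bigram_table[bigram_hash_ids(ids)]       # (4, 96) per-position features
```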
weight tying
Tied embeddings are enabled in the main trunk.
parameters: null
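Weight tying just reuses one table for both the input embedding and the output projection. A toy sketch (the sizes are illustrative; the PR only says tying is enabled):

```python
import numpy as np

vocab, dim = 2048, 96   # illustrative sizes, not from the PR
W = np.random.default_rng(1).standard_normal((vocab, dim)).astype(np.float32)

def embed(ids):
    return W[ids]          # input path: look up rows of the shared table

def logits(h):
    return h @ W.T         # output path: project with the same table, no second matrix
```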
XSA
Uses late attention, with attention active only in the last 4 layers.
parameters: {"last_n":4}
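One reading of this (assumed here; the PR gives only `last_n=4`): attention sublayers run only in the final four layers, and earlier blocks are MLP-only.

```python
n_layers, last_n = 12, 4   # n_layers is illustrative; last_n=4 comes from the PR

# Per-layer flag: attention runs only in the final `last_n` layers.
attn_active = [layer >= n_layers - last_n for layer in range(n_layers)]
```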
VE128
Value-residual (VE) refinement path in late layers (8 and 9).
parameters: {"dim":96,"layers":[8,9]}
MLP3x
Uses a widened MLP with 3x multiplier.
parameters: {"multiplier":3}
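A 3x MLP simply widens the hidden layer to three times the model dimension. A sketch with an illustrative `dim` (only the multiplier is from the PR; the activation is a stand-in):

```python
import numpy as np

dim, mult = 96, 3   # multiplier=3 from the PR; dim is illustrative
rng = np.random.default_rng(2)
W_in = rng.standard_normal((dim, mult * dim)).astype(np.float32)   # 96 -> 288
W_out = rng.standard_normal((mult * dim, dim)).astype(np.float32)  # 288 -> 96

def mlp(x):
    h = np.maximum(x @ W_in, 0.0)   # ReLU stand-in; the PR does not name the activation
    return h @ W_out
```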
KV head count
Uses grouped KV heads in the trunk attention.
parameters: {"num_heads":8,"num_kv_heads":4}
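With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A grouped-KV attention sketch using the PR's head counts (sequence length and head dim are illustrative; causal masking is omitted for brevity):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, T = 8, 4, 12, 5   # head counts from the PR
group = num_heads // num_kv_heads                    # 2 query heads per KV head
rng = np.random.default_rng(3)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head to its group of query heads (halves the KV cache here).
k_full = np.repeat(k, group, axis=0)                 # (8, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
out = probs @ v_full                                 # (8, T, head_dim)
```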
Hybrid
Adds a one-site late feature bank with previous-token, hashed bigram, and boundary signals at a single late insertion layer.
parameters: {"lfb_layers":6,"lfb_dim":80,"lfb_bigram_vocab_size":2048}
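A sketch of what a one-site late feature bank could look like. `lfb_dim=80` and the bigram vocab of 2048 are from the PR; the byte-level vocab, the hash, the boundary definition, and the summation of signals are all assumptions:

```python
import numpy as np

LFB_DIM, BI_VOCAB, D_MODEL = 80, 2048, 96   # lfb_dim and bigram vocab from the PR
rng = np.random.default_rng(4)
prev_table = rng.standard_normal((256, LFB_DIM)).astype(np.float32)  # byte vocab assumed
bi_table = rng.standard_normal((BI_VOCAB, LFB_DIM)).astype(np.float32)
proj = rng.standard_normal((LFB_DIM, D_MODEL)).astype(np.float32)

def lfb(ids):
    prev = np.concatenate([[0], ids[:-1]])              # previous-token signal
    bi = (prev * 1000003 + ids) % BI_VOCAB              # hypothetical bigram hash
    boundary = (ids == 32).astype(np.float32)[:, None]  # space byte as boundary signal
    bank = prev_table[prev] + bi_table[bi] + boundary   # the three pooled signals
    return bank @ proj   # projected, then added to the residual at the one insertion layer
```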
Weight Averaging
SWA
parameters: {"enabled":1,"every":50}
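SWA with `every=50` keeps a running mean of weight snapshots taken every 50 optimizer steps. A self-contained sketch with a stand-in "optimizer" (only `every=50` is from the PR):

```python
import numpy as np

def swa_update(avg, params, n_avg):
    # Running mean of weight snapshots: avg <- avg + (params - avg) / (n + 1)
    return avg + (params - avg) / (n_avg + 1)

every = 50                         # from the PR parameters
params = np.zeros(4)
avg, n_avg = None, 0
for step in range(1, 201):
    params = params + 1.0          # stand-in for one optimizer step
    if step % every == 0:          # snapshot every 50 steps
        avg = params.copy() if avg is None else swa_update(avg, params, n_avg)
        n_avg += 1
# avg is now the mean of the snapshots taken at steps 50, 100, 150, 200
```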
Quantization
mixed int6/int8
bits: 6
scope: auxiliary embeddings at int6; main trunk at int8
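A sketch of 6-bit quantization for an auxiliary embedding table. Symmetric per-row scaling with integer levels in [-31, 31] is an assumption; the PR only states that the auxiliary embeddings are exported at 6 bits:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-row int6: integer levels in [-31, 31]. The per-row grouping
    # and symmetry are assumptions, not from the PR.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(5)
aux = rng.standard_normal((2048, 80)).astype(np.float32)   # e.g. an LFB-sized table
q, s = quantize_int6(aux)
max_err = np.abs(dequantize_int6(q, s) - aux).max()        # bounded by half a step
```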
Compression
lzma
level: null
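Exporting through `lzma` is a round-trip over the quantized bytes; a sketch with a stand-in int6-range payload (the `preset` is a guess, since the PR lists `level: null`):

```python
import lzma
import numpy as np

rng = np.random.default_rng(6)
q = rng.integers(-31, 32, size=(2048, 80)).astype(np.int8)  # stand-in int6 payload
raw = q.tobytes()
blob = lzma.compress(raw, preset=9)   # preset is a guess; the PR lists level: null
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(q.shape)
```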
Evaluation
sliding window eval
parameters: null
Sequence Length
sequence_length
train_length: 448
eval_length: 448
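Exact sliding-window scoring conditions every token on the longest left context that fits in the window, at stride 1. A sketch using the 448-token eval length; the `logprob_fn(context, next_token)` interface is an assumption:

```python
import numpy as np

def sliding_window_nll(logprob_fn, ids, window=448):   # window from eval_length
    # Exact evaluation: each token is scored with the maximal left context
    # that fits in the window (stride 1), rather than chunked block eval.
    total = 0.0
    for t in range(1, len(ids)):
        ctx = ids[max(0, t - window + 1):t]
        total += -logprob_fn(ctx, ids[t])
    return total / (len(ids) - 1)

# Toy model: uniform over a 256-symbol vocabulary, so the mean NLL is
# ln(256), i.e. exactly 8 bits per symbol after dividing by ln(2).
uniform_lp = lambda ctx, tok: -np.log(256.0)
mean_nll = sliding_window_nll(uniform_lp, np.arange(1000) % 256)
```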

Novel Contributions

  • One-site late feature bank (SpotlightLFB) concentrated at a single late insertion layer
  • Auxiliary embedding tables exported to int6 while keeping the main trunk on int8+lzma
  • Compression-aware architecture/export co-design to stay under the 16MB artifact cap
  • Exact sliding-window evaluation used as the main submission metric