PR #1142

open

SpotlightLFB + Aux-Int6 Compression — 1.1493 BPB (3-seed mean)

by ymrohit
val_bpb
1.1493
Architecture
Transformer
Optimizer
Artifact Size
~15.92 MB

Training Techniques

Architecture
BigramHash
Hashes token bigrams into a small embedding table that feeds the lexical side path.
parameters: {"vocab_size":1536,"dimension":96}
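A minimal sketch of how such a hashed bigram path might look. Only `vocab_size=1536` and `dimension=96` come from the PR; the hash function and its mixing constant are assumptions:

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM = 1536, 96   # from the PR parameters

def bigram_hash_ids(token_ids, mult=2654435761):
    # Hash each (previous, current) token pair into the small bigram vocab.
    # The multiplicative constant is a Knuth-style guess, not from the PR.
    prev = np.concatenate([[0], token_ids[:-1]])      # pad the first position
    mixed = (prev * mult + token_ids) & 0xFFFFFFFF
    return mixed % BIGRAM_VOCAB

# Hypothetical embedding table feeding the lexical side path.
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BIGRAM_VOCAB, BIGRAM_DIM)).astype(np.float32)
ids = np.array([5, 17, 17, 902])
side_feats = bigram_table[bigram_hash_ids(ids)]       # (4, 96) per-position features
```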
weight tying
Tied embeddings are enabled in the main trunk.
parameters: null
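Weight tying just reuses one table for both the input embedding and the output projection. A toy sketch (the sizes are illustrative; the PR only says tying is enabled):

```python
import numpy as np

vocab, dim = 2048, 96   # illustrative sizes, not from the PR
W = np.random.default_rng(1).standard_normal((vocab, dim)).astype(np.float32)

def embed(ids):
    return W[ids]          # input path: look up rows of the shared table

def logits(h):
    return h @ W.T         # output path: project with the same table, no second matrix
```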
XSA
Uses late attention, with attention active only in the last 4 layers.
parameters: {"last_n":4}
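One reading of this (assumed here; the PR gives only `last_n=4`): attention sublayers run only in the final four layers, and earlier blocks are MLP-only.

```python
n_layers, last_n = 12, 4   # n_layers is illustrative; last_n=4 comes from the PR

# Per-layer flag: attention runs only in the final `last_n` layers.
attn_active = [layer >= n_layers - last_n for layer in range(n_layers)]
```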
VE128
Value-residual (VE) refinement path in late layers (8 and 9).
parameters: {"dim":96,"layers":[8,9]}
MLP3x
Uses a widened MLP with 3x multiplier.
parameters: {"multiplier":3}
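A 3x MLP simply widens the hidden layer to three times the model dimension. A sketch with an illustrative `dim` (only the multiplier is from the PR; the activation is a stand-in):

```python
import numpy as np

dim, mult = 96, 3   # multiplier=3 from the PR; dim is illustrative
rng = np.random.default_rng(2)
W_in = rng.standard_normal((dim, mult * dim)).astype(np.float32)   # 96 -> 288
W_out = rng.standard_normal((mult * dim, dim)).astype(np.float32)  # 288 -> 96

def mlp(x):
    h = np.maximum(x @ W_in, 0.0)   # ReLU stand-in; the PR does not name the activation
    return h @ W_out
```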
KV head count
Uses grouped KV heads in the trunk attention.
parameters: {"num_heads":8,"num_kv_heads":4}
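With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A grouped-KV attention sketch using the PR's head counts (sequence length and head dim are illustrative; causal masking is omitted for brevity):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, T = 8, 4, 12, 5   # head counts from the PR
group = num_heads // num_kv_heads                    # 2 query heads per KV head
rng = np.random.default_rng(3)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head to its group of query heads (halves the KV cache here).
k_full = np.repeat(k, group, axis=0)                 # (8, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
out = probs @ v_full                                 # (8, T, head_dim)
```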
Hybrid
Adds a one-site late feature bank with previous-token, hashed bigram, and boundary signals at a single late insertion layer.
parameters: {"lfb_layers":6,"lfb_dim":80,"lfb_bigram_vocab_size":2048}
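A sketch of what a one-site late feature bank could look like. `lfb_dim=80` and the bigram vocab of 2048 are from the PR; the byte-level vocab, the hash, the boundary definition, and the summation of signals are all assumptions:

```python
import numpy as np

LFB_DIM, BI_VOCAB, D_MODEL = 80, 2048, 96   # lfb_dim and bigram vocab from the PR
rng = np.random.default_rng(4)
prev_table = rng.standard_normal((256, LFB_DIM)).astype(np.float32)  # byte vocab assumed
bi_table = rng.standard_normal((BI_VOCAB, LFB_DIM)).astype(np.float32)
proj = rng.standard_normal((LFB_DIM, D_MODEL)).astype(np.float32)

def lfb(ids):
    prev = np.concatenate([[0], ids[:-1]])              # previous-token signal
    bi = (prev * 1000003 + ids) % BI_VOCAB              # hypothetical bigram hash
    boundary = (ids == 32).astype(np.float32)[:, None]  # space byte as boundary signal
    bank = prev_table[prev] + bi_table[bi] + boundary   # the three pooled signals
    return bank @ proj   # projected, then added to the residual at the one insertion layer
```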
Weight Averaging
SWA
parameters: {"enabled":1,"every":50}
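SWA with `every=50` keeps a running mean of weight snapshots taken every 50 optimizer steps. A self-contained sketch with a stand-in "optimizer" (only `every=50` is from the PR):

```python
import numpy as np

def swa_update(avg, params, n_avg):
    # Running mean of weight snapshots: avg <- avg + (params - avg) / (n + 1)
    return avg + (params - avg) / (n_avg + 1)

every = 50                         # from the PR parameters
params = np.zeros(4)
avg, n_avg = None, 0
for step in range(1, 201):
    params = params + 1.0          # stand-in for one optimizer step
    if step % every == 0:          # snapshot every 50 steps
        avg = params.copy() if avg is None else swa_update(avg, params, n_avg)
        n_avg += 1
# avg is now the mean of the snapshots taken at steps 50, 100, 150, 200
```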
Quantization
mixed int6/int8
bits: 6
scope: auxiliary embeddings at int6; main trunk at int8
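A sketch of 6-bit quantization for an auxiliary embedding table. Symmetric per-row scaling with integer levels in [-31, 31] is an assumption; the PR only states that the auxiliary embeddings are exported at 6 bits:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-row int6: integer levels in [-31, 31]. The per-row grouping
    # and symmetry are assumptions, not from the PR.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(5)
aux = rng.standard_normal((2048, 80)).astype(np.float32)   # e.g. an LFB-sized table
q, s = quantize_int6(aux)
max_err = np.abs(dequantize_int6(q, s) - aux).max()        # bounded by half a step
```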
Compression
lzma
level: null
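Exporting through `lzma` is a round-trip over the quantized bytes; a sketch with a stand-in int6-range payload (the `preset` is a guess, since the PR lists `level: null`):

```python
import lzma
import numpy as np

rng = np.random.default_rng(6)
q = rng.integers(-31, 32, size=(2048, 80)).astype(np.int8)  # stand-in int6 payload
raw = q.tobytes()
blob = lzma.compress(raw, preset=9)   # preset is a guess; the PR lists level: null
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(q.shape)
```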
Evaluation
sliding window eval
parameters: null
Sequence Length
sequence_length
train_length: 448
eval_length: 448
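Exact sliding-window scoring conditions every token on the longest left context that fits in the window, at stride 1. A sketch using the 448-token eval length; the `logprob_fn(context, next_token)` interface is an assumption:

```python
import numpy as np

def sliding_window_nll(logprob_fn, ids, window=448):   # window from eval_length
    # Exact evaluation: each token is scored with the maximal left context
    # that fits in the window (stride 1), rather than chunked block eval.
    total = 0.0
    for t in range(1, len(ids)):
        ctx = ids[max(0, t - window + 1):t]
        total += -logprob_fn(ctx, ids[t])
    return total / (len(ids) - 1)

# Toy model: uniform over a 256-symbol vocabulary, so the mean NLL is
# ln(256), i.e. exactly 8 bits per symbol after dividing by ln(2).
uniform_lp = lambda ctx, tok: -np.log(256.0)
mean_nll = sliding_window_nll(uniform_lp, np.arange(1000) % 256)
```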

Novel Contributions

  • One-site late feature bank (SpotlightLFB) concentrated at a single late insertion layer
  • Auxiliary embedding tables exported to int6 while keeping the main trunk on int8+lzma
  • Compression-aware architecture/export co-design to stay under the 16MB artifact cap
  • Exact sliding-window evaluation used as the main submission metric