PR #1046

open

Record: 11L Adaptive Markov + Int6 Mixed Quant (1.2174 bpb)

by Jayteare
val_bpb
1.2174
Architecture
Hybrid
Optimizer
Artifact Size
15,107,918 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4,"layers":11,"dim":512}
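With 8 query heads sharing 4 KV heads, each pair of query heads reads the same key/value head, roughly halving the KV cache. A minimal sketch of that head mapping (the helper name is hypothetical; the shapes come from the record's parameters):

```python
# Grouped-query attention head mapping, using the record's config:
# heads=8, kv_heads=4, dim=512 (so head_dim=64).
HEADS, KV_HEADS, DIM = 8, 4, 512
HEAD_DIM = DIM // HEADS  # 64

def kv_head_for(query_head: int) -> int:
    """Each group of HEADS // KV_HEADS query heads shares one KV head."""
    return query_head // (HEADS // KV_HEADS)

# Query heads 0-1 attend with KV head 0, heads 2-3 with KV head 1, etc.,
# so the KV cache stores 4 heads instead of 8.
```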
ReLU²
Used relu squared MLP activation.
parameters: null
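ReLU² applies a standard ReLU and then squares the result, which keeps the activation non-negative and makes it smooth at zero. An elementwise sketch:

```python
def relu_squared(x: float) -> float:
    """ReLU-squared activation: max(x, 0) ** 2.

    Zero for negative inputs, quadratic for positive ones; applied
    elementwise inside the MLP blocks.
    """
    return max(x, 0.0) ** 2
```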
SmearGate
Adaptive per-position gate mixing transformer logits with Markov logits, including confidence-based thresholding.
parameters: {"threshold":0.2,"temp":0.03}
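One plausible reading of this entry, combined with the top-2 logit-gap idea listed under novel contributions: a learned per-position gate blends the two logit streams, and the Markov share is softly suppressed when the gap between its top-2 logits (a confidence proxy) falls below the threshold. A sketch under that assumption (function and argument names are hypothetical; threshold and temp are the record's parameters):

```python
import math

THRESHOLD, TEMP = 0.2, 0.03  # gate parameters from the record

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def smear_gate_mix(tf_logits, mk_logits, gate_logit):
    """Blend transformer and Markov logits for one position.

    gate_logit is the learned per-position gate (pre-sigmoid). The
    Markov contribution is additionally scaled by a confidence gate:
    a sigmoid over (top-2 Markov logit gap - THRESHOLD) / TEMP, so a
    low-confidence Markov prediction is mostly ignored.
    """
    top2 = sorted(mk_logits, reverse=True)[:2]
    confidence = top2[0] - top2[1]          # top-2 logit gap
    conf_gate = sigmoid((confidence - THRESHOLD) / TEMP)
    g = sigmoid(gate_logit) * conf_gate     # learned gate x confidence gate
    return [(1.0 - g) * t + g * m for t, m in zip(tf_logits, mk_logits)]
```

With a small temp (0.03), the confidence gate behaves almost like a hard cutoff at a logit gap of 0.2.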
Other
Added an explicit unigram Markov transition table combined with transformer logits as a short-range prior.
parameters: {"table_size":"1024x1024"}
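A 1024x1024 transition table maps each previous token to log-probabilities over the next token. A small-vocabulary sketch of how such a table could be built from the training stream (the helper and the add-alpha smoothing are illustrative assumptions, not the record's exact recipe):

```python
import math

VOCAB = 16  # illustration only; the record uses a 1024x1024 table

def build_markov_logits(tokens, vocab=VOCAB, alpha=1.0):
    """Count token-to-token transitions and return log-prob logits.

    Add-alpha smoothing keeps unseen transitions finite; row `prev`
    serves as the short-range prior mixed with the transformer's
    logits wherever the previous token is `prev`.
    """
    counts = [[alpha] * vocab for _ in range(vocab)]
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1.0
    logits = []
    for row in counts:
        total = sum(row)
        logits.append([math.log(c / total) for c in row])
    return logits
```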
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights int6; embeddings and Markov table int8; control tensors fp16
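A symmetric per-tensor scheme is the simplest way to realize this split: int6 stores values in [-32, 31], int8 in [-128, 127], each with one floating-point scale. A sketch under that assumption (the record does not state its exact quantizer):

```python
def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization sketch.

    One fp scale maps the tensor's max magnitude onto [-32, 31];
    embeddings and the Markov table would use the same scheme with
    the int8 range [-128, 127].
    """
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # avoid scale 0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp weights from int codes."""
    return [v * scale for v in q]
```

Packing the 6-bit codes (e.g. four codes per three bytes) plus one scale per tensor is what keeps the artifact under the 16MB limit.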
Compression
zstd
level: 22
Sequence Length
train_length: 1024
eval_length: null

Novel Contributions

  • Adaptive Markov mixing with a learned per-position gate
  • Confidence-based suppression of Markov contribution using the top-2 Markov logit gap
  • Mixed int6/int8 quantization to fit under the 16MB artifact limit
  • Large 786K-token batch training for improved throughput within the 10-minute budget
  • Explicit short-range Markov prior combined with a causal transformer