PR #1046

open

Record: 11L Adaptive Markov + Int6 Mixed Quant (1.2174 bpb)

by Jayteare
val_bpb
1.2174
Architecture
Hybrid
Optimizer
Artifact Size
15,107,918 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4,"layers":11,"dim":512}
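With 8 query heads sharing 4 KV heads, each pair of query heads reads the same key/value head, roughly halving the KV cache. A minimal sketch of that head mapping (the helper name is hypothetical; the shapes come from the record's parameters):

```python
# Grouped-query attention head mapping, using the record's config:
# heads=8, kv_heads=4, dim=512 (so head_dim=64).
HEADS, KV_HEADS, DIM = 8, 4, 512
HEAD_DIM = DIM // HEADS  # 64

def kv_head_for(query_head: int) -> int:
    """Each group of HEADS // KV_HEADS query heads shares one KV head."""
    return query_head // (HEADS // KV_HEADS)

# Query heads 0-1 attend with KV head 0, heads 2-3 with KV head 1, etc.,
# so the KV cache stores 4 heads instead of 8.
```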
ReLU²
Used relu squared MLP activation.
parameters: null
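ReLU² applies a standard ReLU and then squares the result, which keeps the activation non-negative and makes it smooth at zero. An elementwise sketch:

```python
def relu_squared(x: float) -> float:
    """ReLU-squared activation: max(x, 0) ** 2.

    Zero for negative inputs, quadratic for positive ones; applied
    elementwise inside the MLP blocks.
    """
    return max(x, 0.0) ** 2
```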
SmearGate
Adaptive per-position gate mixing transformer logits with Markov logits, including confidence-based thresholding.
parameters: {"threshold":0.2,"temp":0.03}
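One plausible reading of this entry, combined with the top-2 logit-gap idea listed under novel contributions: a learned per-position gate blends the two logit streams, and the Markov share is softly suppressed when the gap between its top-2 logits (a confidence proxy) falls below the threshold. A sketch under that assumption (function and argument names are hypothetical; threshold and temp are the record's parameters):

```python
import math

THRESHOLD, TEMP = 0.2, 0.03  # gate parameters from the record

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def smear_gate_mix(tf_logits, mk_logits, gate_logit):
    """Blend transformer and Markov logits for one position.

    gate_logit is the learned per-position gate (pre-sigmoid). The
    Markov contribution is additionally scaled by a confidence gate:
    a sigmoid over (top-2 Markov logit gap - THRESHOLD) / TEMP, so a
    low-confidence Markov prediction is mostly ignored.
    """
    top2 = sorted(mk_logits, reverse=True)[:2]
    confidence = top2[0] - top2[1]          # top-2 logit gap
    conf_gate = sigmoid((confidence - THRESHOLD) / TEMP)
    g = sigmoid(gate_logit) * conf_gate     # learned gate x confidence gate
    return [(1.0 - g) * t + g * m for t, m in zip(tf_logits, mk_logits)]
```

With a small temp (0.03), the confidence gate behaves almost like a hard cutoff at a logit gap of 0.2.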
Other
Added an explicit unigram Markov transition table combined with transformer logits as a short-range prior.
parameters: {"table_size":"1024x1024"}
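A 1024x1024 transition table maps each previous token to log-probabilities over the next token. A small-vocabulary sketch of how such a table could be built from the training stream (the helper and the add-alpha smoothing are illustrative assumptions, not the record's exact recipe):

```python
import math

VOCAB = 16  # illustration only; the record uses a 1024x1024 table

def build_markov_logits(tokens, vocab=VOCAB, alpha=1.0):
    """Count token-to-token transitions and return log-prob logits.

    Add-alpha smoothing keeps unseen transitions finite; row `prev`
    serves as the short-range prior mixed with the transformer's
    logits wherever the previous token is `prev`.
    """
    counts = [[alpha] * vocab for _ in range(vocab)]
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1.0
    logits = []
    for row in counts:
        total = sum(row)
        logits.append([math.log(c / total) for c in row])
    return logits
```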
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights int6; embeddings and Markov table int8; control tensors fp16
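A symmetric per-tensor scheme is the simplest way to realize this split: int6 stores values in [-32, 31], int8 in [-128, 127], each with one floating-point scale. A sketch under that assumption (the record does not state its exact quantizer):

```python
def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization sketch.

    One fp scale maps the tensor's max magnitude onto [-32, 31];
    embeddings and the Markov table would use the same scheme with
    the int8 range [-128, 127].
    """
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # avoid scale 0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp weights from int codes."""
    return [v * scale for v in q]
```

Packing the 6-bit codes (e.g. four codes per three bytes) plus one scale per tensor is what keeps the artifact under the 16MB limit.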
Compression
zstd
level: 22
Sequence Length
train_length: 1024
eval_length: null

Novel Contributions

  • Adaptive Markov mixing with a learned per-position gate
  • Confidence-based suppression of Markov contribution using the top-2 Markov logit gap
  • Mixed int6/int8 quantization to fit under the 16MB artifact limit
  • Large 786K-token batch training for improved throughput within the 10-minute budget
  • Explicit short-range Markov prior combined with a causal transformer