PR #2108

closed

Closed: superseded non-record draft

by himanshudongre

val_bpb

1.0483

Architecture

Transformer

Optimizer

—

Artifact Size

15,972,854 bytes

Training Techniques

Architecture

SmearGate

Widened the sparse attention gate reader from the baseline gate window to 32 in the #2018 stack.

parameters: {"gate_window":32,"smear_gate_window":12}

BigramHash

Added a tiny causal BigramHash input feature branch with small-tensor routing.

parameters: {"vocab_size":512,"dimensions":4,"bits":6}
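As a rough illustration of what a causal hashed-bigram feature branch can look like, the sketch below maps each (previous token, current token) pair to one of 512 buckets and looks up a 4-wide vector. The hash function, the `bos_token` placeholder, the random table initialisation, and the helper names are assumptions made for this example, not the PR's code. The `bits` parameter is left out because the summary does not say what it controls.

```python
import random

VOCAB_SIZE = 512   # number of hash buckets, taken from "vocab_size": 512
DIMENSIONS = 4     # width of each feature vector, taken from "dimensions": 4

def bigram_bucket(prev_token: int, token: int, buckets: int = VOCAB_SIZE) -> int:
    """Hash an ordered (previous, current) token pair into a bucket index.

    The multiplier is an arbitrary odd constant chosen for illustration.
    """
    return ((prev_token * 1000003) ^ token) % buckets

def bigram_features(tokens, table, bos_token=0):
    """Return one feature vector per position.

    Position i only uses tokens[i - 1] and tokens[i], so the feature is causal.
    """
    feats = []
    prev = bos_token  # hypothetical placeholder for "no previous token"
    for tok in tokens:
        feats.append(table[bigram_bucket(prev, tok)])
        prev = tok
    return feats

# Toy lookup table; in a real model these would be learned parameters.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIMENSIONS)] for _ in range(VOCAB_SIZE)]

feats = bigram_features([5, 17, 17, 42], table)
```

Because the table has only 512 x 4 entries, a branch like this adds very little to the artifact size, which is presumably why the summary calls it tiny.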

Path-A-v3

Applied a small Path-A-v3-style routing variant alongside the BigramHash branch.

parameters: {"small":true}

Other

other

A q-aware, token-only n-gram tilt path used in the evaluation/training pipeline.

parameters: {"token_only":true,"q_aware":true}

Test-Time Training

score-first TTT

parameters: {"phased":true}
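The ordering that "score-first" implies can be sketched with a toy model: each chunk is scored with the parameters as they stood before the chunk was seen, and only then is the model adapted on it. The count-based unigram model below is a stand-in chosen to keep the example self-contained. The PR's model is a Transformer, and neither its update rule nor the meaning of `phased` is described in this summary.

```python
import math

def score_first_ttt(chunks, vocab_size):
    """Return average bits per token when every chunk is scored before adapting on it."""
    counts = [1] * vocab_size   # Laplace-smoothed counts (toy stand-in for model weights)
    total = vocab_size
    bits = 0.0
    n_scored = 0
    for chunk in chunks:
        # 1) Score the chunk using only what was learned from earlier chunks.
        for tok in chunk:
            bits += -math.log2(counts[tok] / total)
            n_scored += 1
        # 2) Only afterwards adapt on the chunk that was just scored.
        for tok in chunk:
            counts[tok] += 1
            total += 1
    return bits / n_scored
```

Scoring before updating means no token contributes to its own prediction, so the reported loss stays a fair measure of held-out performance.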

Sequence Length

sequence_length

train_length: null

eval_length: 2048

Evaluation

sliding window eval

parameters: {"stride":12}
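One common way to lay out a sliding-window evaluation with the listed `eval_length` of 2048 and `stride` of 12 is sketched below. The first window scores everything it covers. Each later window advances by the stride and scores only the newly covered tokens, so nearly every token is predicted with close to a full window of context. The function name and the exact scoring convention are assumptions, since the summary does not spell them out.

```python
def sliding_windows(n_tokens, eval_length=2048, stride=12):
    """Return a list of (ctx_start, ctx_end, score_start) tuples.

    The model reads tokens in [ctx_start, ctx_end) and only the predictions
    for [score_start, ctx_end) count toward the metric.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        if scored_upto == 0:
            ctx_end = min(eval_length, n_tokens)          # first window: score it all
        else:
            ctx_end = min(scored_upto + stride, n_tokens)  # later windows: advance by stride
        ctx_start = max(0, ctx_end - eval_length)
        windows.append((ctx_start, ctx_end, scored_upto))
        scored_upto = ctx_end
    return windows

# Small example with a window of 4 and a stride of 2 over 10 tokens.
small = sliding_windows(10, eval_length=4, stride=2)
```

A small stride buys more context per scored token at the cost of many more forward passes: with a stride of 12, each window after the first contributes only 12 newly scored tokens.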

Compression

brotli/lrzip

level: null

Documented a failed transfer of Gate32 to the #2018 frontier stack.
Documented a failed transfer of a tiny BigramHash branch to the #2018 stack.
Validated that the q-aware token-only n-gram patch was not the main cause of regression.
Reported a corrected no-go result for Memento/copy memory.
Recorded a promising but unfinished CrossWS tokenizer direction.
Established a stop rule: stop before quantization and TTT if pre-quant BPB is not competitive.
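The stop rule in the last finding can be written as a simple gate in an experiment script. This is a minimal sketch: the function name, the `tolerance` argument, and the numbers in the usage lines are hypothetical, and the summary does not say what threshold counts as competitive.

```python
def should_stop_before_quant(pre_quant_bpb, reference_bpb, tolerance=0.0):
    """Return True when the pre-quantization BPB is not competitive.

    Lower BPB is better, so a run is dropped when it is worse than the
    reference by more than `tolerance`.
    """
    return pre_quant_bpb > reference_bpb + tolerance

# Hypothetical values, for illustration only.
skip_run = should_stop_before_quant(pre_quant_bpb=1.060, reference_bpb=1.050, tolerance=0.002)
keep_run = should_stop_before_quant(pre_quant_bpb=1.051, reference_bpb=1.050, tolerance=0.002)
```

Gating on the pre-quantization number avoids spending quantization and TTT compute on a run that already trails the reference.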