PR #2110

open

Non-record: final frontier autopsy

by himanshudongre

val_bpb

1.0639

Architecture

Transformer

Optimizer

—

Artifact Size

15,972,854 bytes

Training Techniques

Architecture

Gate32

Widened gate window used in the frontier experiments.

parameters: {"gate_window":32,"smear_gate_window":12}

BigramHash

Small causal input feature branch tested for transfer to the #2018 frontier.

parameters: {"vocab_size":512,"dimensions":4,"bits":6}
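For readers unfamiliar with the idea, a hashed bigram input feature can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: the hash function, the `bos_id` padding token, and the random table initialization are all hypothetical, `vocab_size` is read as the number of hash buckets, and `dimensions` as the width of each feature vector. The `bits` parameter is not interpreted here.

```python
import random

NUM_BUCKETS = 512   # "vocab_size" from the parameters, read as bucket count (assumption)
FEATURE_DIM = 4     # "dimensions" from the parameters


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an ordered (previous, current) token pair to a bucket id.

    The multiplier is an arbitrary illustrative choice, not the PR's hash.
    """
    return (prev_token * 1_000_003 + token) % num_buckets


def bigram_hash_features(tokens, table, bos_id=0):
    """Return one small feature vector per position.

    Position t only looks at tokens t-1 and t, so the branch is causal:
    changing a later token cannot change an earlier feature.
    """
    features = []
    prev = bos_id  # hypothetical padding token for the first position
    for tok in tokens:
        features.append(table[bigram_bucket(prev, tok, len(table))])
        prev = tok
    return features


# Hypothetical small random table standing in for learned embeddings.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(FEATURE_DIM)] for _ in range(NUM_BUCKETS)]

features = bigram_hash_features([5, 17, 17, 5], table)
```

In a real model the table would be a learned parameter and the feature vector would be projected or added into the token embedding; how this PR wires the branch into the #2018 stack is not stated here.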

Path-A-v3

Small Path-A-v3 branch combined with the BigramHash experiment.

parameters: {"small":true}

Other

other

q-aware token-only n-gram tilt applied during training/evaluation.

parameters: {"token_only":true,"dynamic":true}

Test-Time Training

score-first TTT

parameters: null

Sequence Length

sequence_length

train_length: 32

eval_length: 12

Final autopsy of the PR #2018 frontier with three failed transfer attempts.
Identified a stop rule: stop a branch if its pre-quantization BPB is about 0.01 worse than the reference run on the same seed, unless the branch adds a proven legal eval mechanism.
Showed that Gate32 did not transfer to the #2018 stack.
Showed that the q-aware n-gram patch was not the root cause of the regression.
Tested a tiny BigramHash + Path-A-v3-small branch and found that it did not recover training quality.
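The stop rule in the summary above can be written as a small check. The sketch below is one possible reading, not code from this PR: the `BranchResult` type, its field names, and the exact `0.01` cutoff are assumptions (the summary says "about +0.01"), and the BPB values in the usage lines are made-up placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class BranchResult:
    """Hypothetical record of one training run."""
    seed: int
    pre_quant_bpb: float  # validation BPB measured before quantization
    adds_proven_legal_eval_mechanism: bool = False


def should_stop(reference: BranchResult, branch: BranchResult,
                threshold: float = 0.01) -> bool:
    """Return True if the branch should be abandoned under the stop rule."""
    if reference.seed != branch.seed:
        # The rule is only stated for same-seed comparisons.
        raise ValueError("stop rule requires runs on the same seed")
    if branch.adds_proven_legal_eval_mechanism:
        # Exception in the rule: such a branch may still pay off at eval time.
        return False
    regression = branch.pre_quant_bpb - reference.pre_quant_bpb
    return regression >= threshold


# Placeholder numbers for illustration only.
reference = BranchResult(seed=0, pre_quant_bpb=1.060)
clearly_worse = BranchResult(seed=0, pre_quant_bpb=1.075)
slightly_worse = BranchResult(seed=0, pre_quant_bpb=1.062)
worse_but_exempt = BranchResult(seed=0, pre_quant_bpb=1.075,
                                adds_proven_legal_eval_mechanism=True)

stop_clearly_worse = should_stop(reference, clearly_worse)
stop_slightly_worse = should_stop(reference, slightly_worse)
stop_exempt = should_stop(reference, worse_but_exempt)
```

Because the threshold is approximate, runs that land very close to the cutoff still call for judgment rather than a strict comparison.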