val_bpb: 1.0665
Architecture: Transformer
Optimizer: —
Artifact Size: 15,950,966 bytes
Training Techniques
Architecture
SmearGate
Causal 1-token residual lookback gate with BOS masking to prevent cross-document residual carry.
parameters: {"window":12,"bos_masked":true}
SparseAttnGate
Sparse attention gating used in the base PR #1787/#1797 lineage.
parameters: null
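No parameters are recorded for SparseAttnGate, so the following is only a generic sketch of one common form of attention gating: a learned per-head sigmoid gate on head outputs that can drive unused heads toward zero. It is not the PR #1787/#1797 implementation.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Scale each attention head's output by a learned sigmoid gate."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One logit per head, initialized open (sigmoid(4.0) is about 0.98).
        self.head_logits = nn.Parameter(torch.full((num_heads,), 4.0))

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: (B, H, T, Dh) per-head attention outputs.
        g = torch.sigmoid(self.head_logits)      # (H,) gate per head
        return head_out * g.view(1, -1, 1, 1)    # near-zero gates silence a head
```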
Quantization
GPTQ
bits: null
scope: model weights
GPTQ-lite
bits: null
scope: model weights
mixed int4/int8
bits: null
scope: model weights
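Bit widths are not recorded above. As a rough illustration of mixed int4/int8 weight quantization only, the sketch below does symmetric per-output-channel round-to-nearest quantization at a chosen bit width; real GPTQ additionally compensates rounding error with second-order (Hessian) statistics, which are omitted here.

```python
import torch

def quantize_weight(w: torch.Tensor, bits: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a (out_features, in_features) weight matrix to signed `bits`-bit
    integers with one scale per output channel; returns (int weights, scales)."""
    qmax = 2 ** (bits - 1) - 1                            # 7 for int4, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                        # int4 values stored in int8

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Mixed precision: keep error-sensitive matrices at 8 bits, squeeze the rest to 4.
```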
Test-Time Training
LoRA TTT
parameters: {"learning_rate":0.00005,"num_phases":4,"score_before_update":true}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"min_lr":0.1}
Other
other
CaseOps lossless tokenizer transform with original UTF-8 byte sidecar for scoring.
parameters: null
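The card does not spell out the CaseOps transform itself; below is a minimal sketch, assuming it case-folds text before tokenization while keeping the exact original UTF-8 bytes in a sidecar so bits-per-byte is still normalized by the original byte count. Function names are illustrative.

```python
import math

def caseops_encode(text: str) -> tuple[str, bytes]:
    """Return (transformed text for the tokenizer, original-byte sidecar)."""
    sidecar = text.encode("utf-8")    # exact original bytes, kept for scoring
    return text.lower(), sidecar      # the sidecar makes the transform lossless

def bits_per_byte(total_nll_nats: float, sidecar: bytes) -> float:
    # Normalize by the ORIGINAL byte length, not the transformed text, so
    # val_bpb stays comparable across tokenizer transforms.
    return total_nll_nats / (math.log(2) * len(sidecar))
```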
other
TRAIN_SHARD_LIMIT=240 to pin the canonical training subset used for the measured run.
parameters: {"train_shard_limit":240}
Novel Contributions
- BOS-masked SmearGate to prevent cross-document residual carry in both normal and TTT forward paths
- Reproducible 240-shard training limit applied before rank sharding
- Clean 3-seed record candidate with four-phase, score-before-update TTT
- CaseOps + SparseAttnGate + SmearGate + asymmetric LQER lineage packaged as a compliant submission