val_bpb: 1.0665
Architecture: Transformer
Optimizer: —
Artifact Size: 15,950,966 bytes
Training Techniques
Architecture
SmearGate
Causal 1-token residual lookback gate with BOS masking to prevent cross-document residual carry.
parameters: {"window":12,"bos_masked":true}
SparseAttnGate
Sparse attention gating used in the base PR #1787/#1797 lineage.
parameters: null
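No parameters are recorded for SparseAttnGate, so the following is only a generic sketch of one common form of attention gating: a learned per-head sigmoid gate on head outputs that can drive unused heads toward zero. It is not the PR #1787/#1797 implementation.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Scale each attention head's output by a learned sigmoid gate."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One logit per head, initialized open (sigmoid(4.0) is about 0.98).
        self.head_logits = nn.Parameter(torch.full((num_heads,), 4.0))

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: (B, H, T, Dh) per-head attention outputs.
        g = torch.sigmoid(self.head_logits)      # (H,) gate per head
        return head_out * g.view(1, -1, 1, 1)    # near-zero gates silence a head
```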
Quantization
GPTQ
bits: null
scope: model weights
GPTQ-lite
bits: null
scope: model weights
mixed int4/int8
bits: null
scope: model weights
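Bit widths are not recorded above. As a rough illustration of mixed int4/int8 weight quantization only, the sketch below does symmetric per-output-channel round-to-nearest quantization at a chosen bit width; real GPTQ additionally compensates rounding error with second-order (Hessian) statistics, which are omitted here.

```python
import torch

def quantize_weight(w: torch.Tensor, bits: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a (out_features, in_features) weight matrix to signed `bits`-bit
    integers with one scale per output channel; returns (int weights, scales)."""
    qmax = 2 ** (bits - 1) - 1                            # 7 for int4, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                        # int4 values stored in int8

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Mixed precision: keep error-sensitive matrices at 8 bits, squeeze the rest to 4.
```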
Test-Time Training
LoRA TTT
parameters: {"learning_rate":0.00005,"num_phases":4,"score_before_update":true}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"min_lr":0.1}
Other
other
CaseOps lossless tokenizer transform with original UTF-8 byte sidecar for scoring.
parameters: null
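The card does not spell out the CaseOps transform itself; below is a minimal sketch, assuming it case-folds text before tokenization while keeping the exact original UTF-8 bytes in a sidecar so bits-per-byte is still normalized by the original byte count. Function names are illustrative.

```python
import math

def caseops_encode(text: str) -> tuple[str, bytes]:
    """Return (transformed text for the tokenizer, original-byte sidecar)."""
    sidecar = text.encode("utf-8")    # exact original bytes, kept for scoring
    return text.lower(), sidecar      # the sidecar makes the transform lossless

def bits_per_byte(total_nll_nats: float, sidecar: bytes) -> float:
    # Normalize by the ORIGINAL byte length, not the transformed text, so
    # val_bpb stays comparable across tokenizer transforms.
    return total_nll_nats / (math.log(2) * len(sidecar))
```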
other
TRAIN_SHARD_LIMIT=240 to pin the canonical training subset used for the measured run.
parameters: {"train_shard_limit":240}
Novel Contributions
- BOS-masked SmearGate to prevent cross-document residual carry in both normal and TTT forward paths
- Reproducible 240-shard training limit applied before rank sharding
- Clean 3-seed record candidate with four-phase, score-before-update TTT
- CaseOps + SparseAttnGate + SmearGate + asymmetric LQER lineage packaged as a compliant submission