PR #1534

open

SP4096 + Depth Recurrence + Parallel Residuals + Legal N-Gram

by someone114514

val_bpb

1.0846

Architecture

Transformer

Optimizer

—

Artifact Size

15,967,527 bytes

Training Techniques

Architecture

depth recurrence

Recurrent / parallel-residual SP4096 stack with depth recurrence.

parameters: null
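
The card gives no parameters for the recurrence, so the following is only a minimal sketch of the general idea, assuming depth recurrence means running the same blocks (and therefore the same weights) more than once. The names `apply_with_depth_recurrence`, `toy_block`, and `num_loops` are hypothetical and do not come from the PR.

```python
from typing import Callable, List

Vector = List[float]


def apply_with_depth_recurrence(
    x: Vector,
    blocks: List[Callable[[Vector], Vector]],
    num_loops: int,
) -> Vector:
    """Run the same list of blocks num_loops times, reusing their weights."""
    for _ in range(num_loops):
        for block in blocks:
            x = block(x)
    return x


def toy_block(x: Vector) -> Vector:
    """Stand-in for a transformer layer: a residual update with a fixed gain."""
    return [v + 0.5 * v for v in x]


# Two passes through one shared block give effective depth 2 with one block's weights.
out = apply_with_depth_recurrence([1.0, 2.0], [toy_block], num_loops=2)
```

Looping adds effective depth without adding parameters, which matters when the artifact size is capped.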

parallel residuals

Uses a parallel-residual stack in the base model.

parameters: null
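
For orientation, a toy comparison of a sequential residual block and a parallel one, assuming the usual formulation in which attention and MLP both read the same normalized input and their outputs are summed into the residual stream. The sublayers here are hypothetical stand-ins, not the PR's modules.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sequential_block(x: Vector, norm: Sublayer, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Standard ordering: the MLP sees the output of the attention residual."""
    x = add(x, attn(norm(x)))
    return add(x, mlp(norm(x)))


def parallel_block(x: Vector, norm: Sublayer, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Parallel residual: attention and MLP share one normalized input."""
    h = norm(x)
    return add(add(x, attn(h)), mlp(h))


# Toy sublayers (assumptions for illustration only).
identity = lambda h: list(h)
double = lambda h: [2.0 * v for v in h]
triple = lambda h: [3.0 * v for v in h]

seq_out = sequential_block([1.0], identity, double, triple)
par_out = parallel_block([1.0], identity, double, triple)
```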

Weight Averaging

EMA

parameters: null
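
The decay and update schedule are not listed, so this is a generic sketch of an exponential moving average over weights. `ema_update` and the `decay` value are illustrative assumptions.

```python
from typing import Dict


def ema_update(ema: Dict[str, float], params: Dict[str, float], decay: float) -> None:
    """In-place update: ema <- decay * ema + (1 - decay) * params."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


# Toy run: the averaged weight moves toward the live weight a little each step.
ema_weights = {"w": 0.0}
live_weights = {"w": 1.0}
ema_update(ema_weights, live_weights, decay=0.5)
first = ema_weights["w"]
ema_update(ema_weights, live_weights, decay=0.5)
second = ema_weights["w"]
```

The averaged copy, rather than the live weights, would then be the one evaluated or exported.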

Quantization

int6

bits: 6

scope: all
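
The card states only the bit width and scope. As a rough sketch, the code below assumes symmetric quantization with one scale per group of values and integer codes clamped to [-31, 31], which fits in 6 signed bits. The actual scheme in the PR may differ (for example in grouping, zero points, or use of the -32 code).

```python
from typing import List, Tuple


def quantize_int6(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-31, 31] with a single shared scale."""
    qmax = 31
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [-1.0, 0.0, 0.25, 1.0]
codes, scale = quantize_int6(weights)
restored = dequantize(codes, scale)
```

With rounding to the nearest code, each restored value differs from the original by at most half a quantization step.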

Evaluation

sliding window eval

parameters: null
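
Window and stride values are not given. The sketch below shows one common way to lay out a sliding-window evaluation so that every token is scored exactly once while later windows keep earlier tokens as context. `sliding_windows` and its arguments are hypothetical names.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, score_from) triples.

    The model reads tokens[start:end]; loss is counted only on
    tokens[score_from:end], so overlapping context is never scored twice.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(num_tokens=10, window=4, stride=2)
```

A smaller stride gives each scored token more context at the cost of more forward passes.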

Other

other

Legal prefix-only n-gram overlay with token / within-word / word-start experts, one-token logit tilt, and full-vocab renormalization during evaluation.

parameters: null

Novel Contributions

Adds a separate prefix-only legal n-gram evaluation path to the SP4096 recurrent / parallel-residual base
Uses token, within-word continuation, and word-start experts built from already-seen tokens
Applies a one-token bias and renormalizes over the full vocabulary in a single left-to-right pass
Keeps evaluation legal with no target-conditioned gating, no two-pass rescoring, and no weight updates during inference
Reports a best result of 1.08457715 val_bpb
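
The points above can be illustrated with a small sketch. It is not the PR's implementation: it assumes a single token-level expert that remembers the most recent continuation of each context, omits the within-word and word-start experts, and uses hypothetical names (`tilt_and_renormalize`, `score_sequence`, `order`, `bias`). It does show the stated constraints: the hint is chosen from already-seen tokens only, exactly one token's logit is tilted, probabilities are renormalized over the full vocabulary, and everything happens in one left-to-right pass.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def tilt_and_renormalize(
    logits: Sequence[float], hinted_token: Optional[int], bias: float
) -> List[float]:
    """Add bias to one token's logit, then log-softmax over the full vocabulary."""
    tilted = list(logits)
    if hinted_token is not None:
        tilted[hinted_token] += bias
    m = max(tilted)
    log_z = m + math.log(sum(math.exp(v - m) for v in tilted))
    return [v - log_z for v in tilted]


def score_sequence(
    tokens: Sequence[int],
    logits_per_position: Sequence[Sequence[float]],
    order: int = 2,
    bias: float = 1.0,
) -> float:
    """Total negative log-likelihood (nats) in a single left-to-right pass."""
    table: Dict[Tuple[int, ...], int] = {}  # context -> last seen continuation
    total_nll = 0.0
    for t, target in enumerate(tokens):
        context = tuple(tokens[max(0, t - order):t])
        full_context = len(context) == order
        # The hint depends only on the prefix, never on the target.
        hint = table.get(context) if full_context else None
        log_probs = tilt_and_renormalize(logits_per_position[t], hint, bias)
        total_nll -= log_probs[target]
        # The table is updated only after the token has been scored.
        if full_context:
            table[context] = target
    return total_nll


# Toy run: uniform base logits over a 2-token vocabulary.
toy_tokens = [0, 1, 0, 1]
toy_logits = [[0.0, 0.0]] * 4
nll = score_sequence(toy_tokens, toy_logits, order=1, bias=math.log(3.0))
```

Because the tilted distribution is renormalized over every token, the hinted token's gain is paid for by all other tokens, so the scores remain a valid probability distribution.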