PR #1728

open

Non-record: Neural Base Model, No TTT — Parcae + Gates + Layered Windows (val_bpb 1.07706)

by mikeapedia
val_bpb: 1.0771
Architecture: Transformer
Optimizer:
Artifact Size: 15,962,729 B

Training Techniques

Evaluation
sliding window eval
parameters: {"window_size":null}
Quantization
GPTQ
bits: 6
scope: matrix weights
GPTQ
bits: 7
scope: embeddings
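A hedged sketch of the mixed-bit routing implied here: plain round-to-nearest symmetric quantization stands in for full GPTQ (which additionally does Hessian-based error compensation), and the `"embed" in name` routing test is an assumption:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Round-to-nearest symmetric quantization with a per-row scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def quantize_state_dict(state_dict):
    out = {}
    for name, tensor in state_dict.items():
        if tensor.ndim < 2:                  # biases/norms: keep full precision
            out[name] = tensor
            continue
        bits = 7 if "embed" in name else 6   # routing by name is an assumption
        q, scale = quantize_symmetric(tensor.float(), bits)
        out[name] = (q, scale, bits)
    return out
```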
Compression
brotli
level: null
lzma
level: null
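Since both codecs are listed with `level: null`, a sketch that compresses the serialized state dict with both and keeps the smaller output; the `quality`/`preset` values below are assumptions:

```python
import io
import lzma

import brotli   # pip install brotli
import torch

def compress_state_dict(state_dict):
    buf = io.BytesIO()
    torch.save(state_dict, buf)              # serialize before compressing
    raw = buf.getvalue()
    candidates = {
        "brotli": brotli.compress(raw, quality=11),
        "lzma": lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME),
    }
    codec = min(candidates, key=lambda k: len(candidates[k]))
    return candidates[codec], codec          # smaller artifact wins
```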
Architecture
SmearGate
Causal residual mixer blending the current token with the previous token.
parameters: null
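A minimal sketch of the idea as described, assuming a per-token sigmoid gate; the PR's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each position's hidden state with the previous position's,
    through an input-dependent sigmoid gate (assumed parameterization)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                  # nothing precedes the first token
        g = torch.sigmoid(self.gate(x))   # (B, T, 1) per-token blend weight
        return (1 - g) * x + g * prev     # causal: looks one step back only
```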
Gated Attention
Per-head input-dependent sigmoid-scaled gate on attention output.
parameters: null
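Sketch of the per-head gate as described: a sigmoid computed from the layer input scales each head's attention output before the output projection (dims illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, n_heads)   # one gate logit per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        g = torch.sigmoid(self.gate(x))                    # (B, T, H)
        out = out.transpose(1, 2) * g.unsqueeze(-1)        # scale each head
        return self.proj(out.reshape(B, T, D))
```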
Partial RoPE
Global attention layers use partial rotary position embeddings on a subset of dimensions.
parameters: {"layers":[4,9,10],"rope_dims":16,"head_dim":64}
sliding window attention
Local layers use sliding-window causal attention with layered window sizes.
parameters: {"local_window_512_layers":[0,1,2,3,5],"local_window_1024_layers":[6,7,8]}
depth recurrence
Loop-based recurrence with constrained loop injection at re-entry.
parameters: {"num_loops":2}
weight tying
KV-tying on global attention layers was disabled in this submission.
parameters: null
ReLU²
xIELU-based fused MLP activation, with the squaring performed inside the custom kernel.
parameters: null
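Eager reference for the activation named above (the submission fuses the squaring into a custom kernel); shown here as a plain squared-ReLU MLP, which may simplify the actual xIELU form:

```python
import torch
import torch.nn as nn

class SquaredActMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)) ** 2)    # ReLU², unfused
```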
Test-Time Training
LoRA TTT
parameters: {"enabled":false}
Weight Averaging
EMA
parameters: null
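Since the parameters are null, a standard EMA update sketch with an assumed decay:

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):   # decay is assumed
    """Call once per optimizer step; evaluate with the EMA copy."""
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```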

Novel Contributions

  • No test-time training; submission isolates architectural gains.
  • Layered local sliding windows with different window sizes across layers.
  • Parcae constrained loop injection for bounded recurrent-style updates.
  • Attention output gating and SmearGate residual mixing.
  • Mixed-precision GPTQ quantization with brotli-compressed state dict.