PR #857
openRecord: 15L Depth Recurrence + LeakyReLU² + Cosine TTT (3-seed mean val_bpb=1.1093)
by aruniyer
val_bpb: 1.1093
Architecture: Transformer
Optimizer: —
Artifact Size: 15.75 MB
Training Techniques
Architecture
depth recurrence
Ties layers 9-13 to share one physical block, creating 15 effective layers from 11 unique blocks.
parameters: {"layers":5,"effective_layers":15,"unique_blocks":11}
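The layer-tying arithmetic can be sketched as a schedule mapping effective layer positions to unique blocks. This is an illustrative reconstruction, not the submission's code; the function name and 0-indexed span are assumptions (positions 8–12 below correspond to the record's 1-indexed layers 9–13).

```python
# Minimal sketch of depth recurrence via layer tying (names illustrative).
# Five effective positions share one physical block, so a 15-layer forward
# pass needs only 11 unique blocks.

def build_layer_schedule(effective_layers=15, tied_span=range(8, 13)):
    """Map each effective layer index to the unique block serving it."""
    schedule, shared_id, next_id = [], None, 0
    for i in range(effective_layers):
        if i in tied_span:
            if shared_id is None:
                shared_id = next_id
                next_id += 1
            schedule.append(shared_id)  # reuse the one shared block
        else:
            schedule.append(next_id)
            next_id += 1
    return schedule

schedule = build_layer_schedule()
assert len(schedule) == 15       # 15 effective layers
assert len(set(schedule)) == 11  # 11 unique blocks
```

At forward time the schedule would index into the list of physical blocks, running the shared block five times in a row.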
LeakyReLU(0.5)^2
Uses squared LeakyReLU activation to preserve negative gradient flow through the MLP.
parameters: {"negative_slope":0.5}
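One plausible reading of this activation is the plain square of LeakyReLU with slope 0.5, which (unlike squared ReLU) has a nonzero derivative on the negative branch. The function below is a sketch under that assumption:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU (assumed form: LeakyReLU(x)**2).

    For x < 0 the branch is (0.5*x)**2, whose derivative 0.5*x is
    nonzero, so negative inputs still pass gradient, unlike ReLU**2.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y

assert leaky_relu_sq(2.0) == 4.0
assert leaky_relu_sq(-2.0) == 1.0  # (0.5 * -2)**2
```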
XSA
Uses XSA attention variant in the base architecture.
parameters: {"last_layers":4}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"train":16,"total":64}
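Reading the parameters as "rotate 16 of 64 head dimensions", a partial-RoPE step might look like the sketch below. The frequency convention (`base` and exponent) is an assumption; only the split between rotated and pass-through dimensions reflects the record.

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` of a head vector.

    q: list of head_dim floats (64 here).  Dimensions [0, rot_dims) are
    rotated in pairs; the remaining 48 pass through unchanged.
    """
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # convention assumed
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 every angle is zero, so the vector is returned unchanged; at later positions only the first 16 dimensions move.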
SmearGate
Additional gating mechanism in the architecture.
parameters: null
BigramHash
Bigram hashing component for token/feature processing.
parameters: {"size":2048}
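A bigram-hash feature of this kind typically buckets each (previous, current) token pair into a fixed table, here of size 2048, whose entries feed an auxiliary embedding. The hash function and BOS handling below are assumptions for illustration:

```python
def bigram_bucket(prev_tok, cur_tok, size=2048):
    """Hash a (prev, cur) token bigram into one of `size` buckets."""
    # Simple multiplicative mix; the submission's actual hash is unspecified.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % size

def bigram_ids(tokens, size=2048):
    """Bucket id per position; position 0 pairs with an assumed BOS id 0."""
    ids, prev = [], 0
    for t in tokens:
        ids.append(bigram_bucket(prev, t, size))
        prev = t
    return ids
```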
Quantization
int6
bits: 6
scope: all
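A symmetric per-tensor int6 scheme (range [-32, 31], one float scale per tensor) is a minimal sketch of what "int6, scope: all" could mean; the submission's exact grouping and rounding are not specified here.

```python
def quantize_int6(weights):
    """Symmetric int6 quantization sketch: 6-bit ints plus one scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate floats from the 6-bit codes."""
    return [qi * scale for qi in q]
```

Round-trip error is bounded by roughly half the scale per weight, which is what makes the 15.75 MB artifact budget reachable.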
GPTQ-lite
bits: null
scope: all
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate":0.0005}
LR Schedule
cosine decay
parameters: {"phase":"test-time training","epochs":20}
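Putting the two entries above together, the test-time-training phase runs 20 epochs at a base learning rate of 5e-4 under cosine decay. The schedule below is a sketch; the endpoint convention and the loop body are assumptions, as the record only gives the epoch count and base rate.

```python
import math

def cosine_lr(epoch, total_epochs=20, base_lr=5e-4, min_lr=0.0):
    """Cosine-decayed learning rate for the TTT phase (sketch)."""
    t = epoch / max(1, total_epochs - 1)  # endpoint convention assumed
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# Full TTT: 20 epochs updating all weights on the evaluation stream.
# The optimizer step and loss are placeholders, not from the record.
for epoch in range(20):
    lr = cosine_lr(epoch)
    # for batch in test_stream: compute loss, backprop, step with `lr` ...
```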
Evaluation
sliding window eval
parameters: {"stride":64}
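Sliding-window evaluation with stride 64 typically advances the context window in 64-token steps and scores each token exactly once, using the rest of the window as context. The helper below sketches that span bookkeeping; the window length of 512 is an assumed context size, not from the record.

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window covers [start, end); only tokens in [score_from, end)
    are scored, so every token is scored exactly once while earlier
    window positions serve purely as context.
    """
    spans, prev_end, start = [], 0, 0
    while prev_end < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        start += stride
    return spans
```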
Weight Averaging
EMA
parameters: {"decay":0.997}
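The EMA update with decay 0.997 keeps a shadow copy of the weights that is blended toward the current weights after each step. A minimal per-step update, with weights represented as flat lists for illustration:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

After training, evaluation would use the EMA copy rather than the raw weights.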
SWA
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
LN Scale
parameters: null
Novel Contributions
- BI-guided depth recurrence using Block Influence scores to identify redundant layers
- Layer tying of positions 9-13 to share one physical block while preserving per-layer scalars
- Deduplication-aware quantization/export that stores tied weights once with a reconstruction map
- Combination of LeakyReLU(0.5)^2 with cosine test-time training
- 15 effective layers from 11 unique blocks within the artifact budget