PR #857
openRecord: 15L Depth Recurrence + LeakyReLU² + Cosine TTT (3-seed mean val_bpb=1.1093)
by aruniyer
val_bpb: 1.1093
Architecture: Transformer
Optimizer: —
Artifact Size: 15.75 MB
Training Techniques
Architecture
depth recurrence
Ties layers 9-13 to share one physical block, creating 15 effective layers from 11 unique blocks.
parameters: {"layers":5,"effective_layers":15,"unique_blocks":11}
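The layer-tying arithmetic can be sketched as a schedule mapping effective layer positions to unique blocks. This is an illustrative reconstruction, not the submission's code; the function name and 0-indexed span are assumptions (positions 8–12 below correspond to the record's 1-indexed layers 9–13).

```python
# Minimal sketch of depth recurrence via layer tying (names illustrative).
# Five effective positions share one physical block, so a 15-layer forward
# pass needs only 11 unique blocks.

def build_layer_schedule(effective_layers=15, tied_span=range(8, 13)):
    """Map each effective layer index to the unique block serving it."""
    schedule, shared_id, next_id = [], None, 0
    for i in range(effective_layers):
        if i in tied_span:
            if shared_id is None:
                shared_id = next_id
                next_id += 1
            schedule.append(shared_id)  # reuse the one shared block
        else:
            schedule.append(next_id)
            next_id += 1
    return schedule

schedule = build_layer_schedule()
assert len(schedule) == 15       # 15 effective layers
assert len(set(schedule)) == 11  # 11 unique blocks
```

At forward time the schedule would index into the list of physical blocks, running the shared block five times in a row.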
LeakyReLU(0.5)^2
Uses squared LeakyReLU activation to preserve negative gradient flow through the MLP.
parameters: {"negative_slope":0.5}
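One plausible reading of this activation is the plain square of LeakyReLU with slope 0.5, which (unlike squared ReLU) has a nonzero derivative on the negative branch. The function below is a sketch under that assumption:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU (assumed form: LeakyReLU(x)**2).

    For x < 0 the branch is (0.5*x)**2, whose derivative 0.5*x is
    nonzero, so negative inputs still pass gradient, unlike ReLU**2.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y

assert leaky_relu_sq(2.0) == 4.0
assert leaky_relu_sq(-2.0) == 1.0  # (0.5 * -2)**2
```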
XSA
Uses XSA attention variant in the base architecture.
parameters: {"last_layers":4}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"train":16,"total":64}
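Reading the parameters as "rotate 16 of 64 head dimensions", a partial-RoPE step might look like the sketch below. The frequency convention (`base` and exponent) is an assumption; only the split between rotated and pass-through dimensions reflects the record.

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` of a head vector.

    q: list of head_dim floats (64 here).  Dimensions [0, rot_dims) are
    rotated in pairs; the remaining 48 pass through unchanged.
    """
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # convention assumed
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 every angle is zero, so the vector is returned unchanged; at later positions only the first 16 dimensions move.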
SmearGate
Additional gating mechanism in the architecture.
parameters: null
BigramHash
Bigram hashing component for token/feature processing.
parameters: {"size":2048}
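A bigram-hash feature of this kind typically buckets each (previous, current) token pair into a fixed table, here of size 2048, whose entries feed an auxiliary embedding. The hash function and BOS handling below are assumptions for illustration:

```python
def bigram_bucket(prev_tok, cur_tok, size=2048):
    """Hash a (prev, cur) token bigram into one of `size` buckets."""
    # Simple multiplicative mix; the submission's actual hash is unspecified.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % size

def bigram_ids(tokens, size=2048):
    """Bucket id per position; position 0 pairs with an assumed BOS id 0."""
    ids, prev = [], 0
    for t in tokens:
        ids.append(bigram_bucket(prev, t, size))
        prev = t
    return ids
```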
Quantization
int6
bits: 6
scope: all
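A symmetric per-tensor int6 scheme (range [-32, 31], one float scale per tensor) is a minimal sketch of what "int6, scope: all" could mean; the submission's exact grouping and rounding are not specified here.

```python
def quantize_int6(weights):
    """Symmetric int6 quantization sketch: 6-bit ints plus one scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate floats from the 6-bit codes."""
    return [qi * scale for qi in q]
```

Round-trip error is bounded by roughly half the scale per weight, which is what makes the 15.75 MB artifact budget reachable.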
GPTQ-lite
bits: null
scope: all
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate":0.0005}
LR Schedule
cosine decay
parameters: {"phase":"test-time training","epochs":20}
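Putting the two entries above together, the test-time-training phase runs 20 epochs at a base learning rate of 5e-4 under cosine decay. The schedule below is a sketch; the endpoint convention and the loop body are assumptions, as the record only gives the epoch count and base rate.

```python
import math

def cosine_lr(epoch, total_epochs=20, base_lr=5e-4, min_lr=0.0):
    """Cosine-decayed learning rate for the TTT phase (sketch)."""
    t = epoch / max(1, total_epochs - 1)  # endpoint convention assumed
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# Full TTT: 20 epochs updating all weights on the evaluation stream.
# The optimizer step and loss are placeholders, not from the record.
for epoch in range(20):
    lr = cosine_lr(epoch)
    # for batch in test_stream: compute loss, backprop, step with `lr` ...
```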
Evaluation
sliding window eval
parameters: {"stride":64}
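Sliding-window evaluation with stride 64 typically advances the context window in 64-token steps and scores each token exactly once, using the rest of the window as context. The helper below sketches that span bookkeeping; the window length of 512 is an assumed context size, not from the record.

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window covers [start, end); only tokens in [score_from, end)
    are scored, so every token is scored exactly once while earlier
    window positions serve purely as context.
    """
    spans, prev_end, start = [], 0, 0
    while prev_end < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        start += stride
    return spans
```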
Weight Averaging
EMA
parameters: {"decay":0.997}
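The EMA update with decay 0.997 keeps a shadow copy of the weights that is blended toward the current weights after each step. A minimal per-step update, with weights represented as flat lists for illustration:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

After training, evaluation would use the EMA copy rather than the raw weights.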
SWA
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
LN Scale
parameters: null
Novel Contributions
- BI-guided depth recurrence using Block Influence scores to identify redundant layers
- Layer tying of positions 9-13 to share one physical block while preserving per-layer scalars
- Deduplication-aware quantization/export that stores tied weights once with a reconstruction map
- Combination of LeakyReLU(0.5)^2 with cosine test-time training
- 15 effective layers from 11 unique blocks within the artifact budget