PR #1349

open

Add shared-block recurrent 10-minute non-record 16MB submission

by LocalX991
val_bpb
1.3693
Architecture
Transformer
Optimizer
Artifact Size
3,591,303 bytes

Training Techniques

Architecture
depth recurrence
Shared universal transformer block reused across multiple passes instead of separate blocks per layer.
parameters: {"num_passes":8}
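A minimal sketch of the depth-recurrence idea: one shared set of block weights applied for all 8 passes, with a pass-dependent depth embedding telling the block which pass it is on. The residual MLP stands in for the full block (attention omitted) and all sizes are illustrative, not the PR's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_passes, seq = 32, 8, 4

# One shared block's weights, reused for every pass
# (versus a separate set of weights per layer).
W1 = rng.normal(0, 0.02, (d_model, 4 * d_model))
W2 = rng.normal(0, 0.02, (4 * d_model, d_model))
# Pass-dependent depth embeddings modulate the shared block per pass.
depth_emb = rng.normal(0, 0.02, (num_passes, d_model))

def shared_block(h, p):
    h = h + depth_emb[p]                   # mark which pass this is
    return h + np.maximum(h @ W1, 0) @ W2  # residual MLP (attention omitted)

h = rng.normal(size=(seq, d_model))
for p in range(num_passes):
    h = shared_block(h, p)
```

Parameter cost is that of a single block plus the small depth embeddings, which is what makes the 16MB artifact budget workable.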
RoPE
Partial RoPE: rotary positional embeddings applied to only a reduced subset of each head's dimensions, with the remaining channels passed through unrotated.
parameters: {"dimensions":16}
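A sketch of partial RoPE for a single head: only the first 16 channels are rotated, the rest pass through unchanged. The frequency base and the (first-half, second-half) pairing of rotary channels are assumptions:

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply RoPE to only the first `rotary_dims` channels of each position.

    x: (seq_len, head_dim); channels beyond `rotary_dims` are untouched.
    """
    seq, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    theta = np.arange(seq)[:, None] * inv_freq     # (seq, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rotary_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and the non-rotary channels are identical to the input at every position.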
BigramHash
Bigram hash feature used in the model.
parameters: {"vocab_size":2048}
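The PR does not specify the hash function. A hypothetical hashed-bigram index into a 2048-entry embedding table, mapping each (previous, current) token pair to a bucket, might look like:

```python
import numpy as np

def bigram_hash_ids(tokens, hash_vocab=2048, prime=1000003):
    """Map each (prev, cur) token pair to a bucket in a small hash vocab.

    Hypothetical hashing scheme; the prime multiplier and the zero-pad
    for the first position are assumptions, not the PR's actual choice.
    """
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])  # pad the first position
    return (prev * prime + tokens) % hash_vocab
```

The resulting ids would index an extra (2048, d_model) embedding table whose rows are added to the ordinary token embeddings.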
SmearGate
SmearGate feature used in the model.
parameters: null
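SmearGate is not described in the PR. One common "smear" formulation blends each position's activation with the previous position's through a learned sigmoid gate; the sketch below is written under that assumption and is purely hypothetical:

```python
import numpy as np

def smear(x, gate_w):
    """Hypothetical smear gate (the PR does not specify SmearGate's form):
    blend each position with the previous one via a learned sigmoid gate.

    x: (seq, d_model); gate_w: (d_model, 1) learned gate projection.
    """
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))       # per-position gate in (0, 1)
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # shift right by one
    return x + g * prev
```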
Regularization
LN scale
parameters: null
Weight Averaging
EMA
parameters: null
SWA
parameters: null
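A sketch of the two weight-averaging schemes over parameter dicts; the EMA decay value and the update cadence are assumptions, since the PR gives no parameters for either:

```python
import numpy as np

def ema_update(avg, params, decay=0.999):
    """Exponential moving average of parameters (decay is an assumption)."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in params}

def swa_update(avg, params, n_averaged):
    """Running equal-weight (stochastic weight) average over checkpoints."""
    return {k: avg[k] + (params[k] - avg[k]) / (n_averaged + 1) for k in params}
```

EMA tracks recent weights with exponentially decaying memory, while SWA gives every averaged checkpoint equal weight.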
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: null
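A sketch of int6 quantization with four 6-bit codes packed into three bytes; the packed byte stream would then be zstd-compressed to produce the artifact. Symmetric per-tensor scaling is an assumption, as the PR only states the bit width and scope:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to the 6-bit range [-32, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def pack_int6(q):
    """Pack four 6-bit codes into three bytes (len(q) must be a multiple of 4).
    The packed bytes are what gets zstd-compressed downstream."""
    u = (q.astype(np.int16) + 32).astype(np.uint32).reshape(-1, 4)
    word = (u[:, 0] << 18) | (u[:, 1] << 12) | (u[:, 2] << 6) | u[:, 3]
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = word >> 16
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.reshape(-1)

def unpack_int6(b, n):
    """Inverse of pack_int6: recover the first n signed 6-bit codes."""
    word = b.reshape(-1, 3).astype(np.uint32)
    w = (word[:, 0] << 16) | (word[:, 1] << 8) | word[:, 2]
    u = np.stack([(w >> 18) & 63, (w >> 12) & 63, (w >> 6) & 63, w & 63], axis=1)
    return (u.reshape(-1)[:n].astype(np.int16) - 32).astype(np.int8)
```

Packing brings the raw weight storage to 0.75 bytes per parameter before zstd touches it, which is how the artifact fits under the 16MB cap.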
Evaluation
stride-based eval
parameters: {"stride":64}
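A sketch of stride-64 evaluation windowing: each window advances by the stride and is scored only on its final 64 tokens, so every token is scored exactly once while keeping left context. The context length here is an assumption:

```python
def stride_eval_windows(n_tokens, context=1024, stride=64):
    """Return (window_start, window_end, score_start) triples for strided eval.

    Only tokens in [score_start, window_end) are scored in each window,
    so each token is scored once with up to `context` tokens of left
    context. The context length is an assumption; stride=64 is the PR's.
    """
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows
```

Summing the scored spans over all windows covers the token stream exactly once, which is what makes the reported val_bpb well-defined.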

Novel Contributions

  • Shared universal transformer block reused across 8 passes
  • Pass-dependent rotations, depth embeddings, and modulation
  • Partial RoPE with 16 rotary dimensions
  • BigramHash and SmearGate features
  • EMA/SWA averaging
  • Int6 quantization with zstd compression
  • 10-minute 8xH100 non-record 16MB submission