PR #1349
Add shared-block recurrent 10-minute non-record 16MB submission
by LocalX991
val_bpb
1.3693
Architecture
Transformer
Optimizer
—
Artifact Size
3,591,303 bytes
Training Techniques
Architecture
depth recurrence
Shared universal transformer block reused across multiple passes instead of separate blocks per layer.
parameters: {"num_passes":8}
RoPE
Partial rotary positional embeddings: rotations are applied to a reduced subset of head dimensions, with the remaining dimensions passed through unrotated.
parameters: {"dimensions":16}
BigramHash
Hashed bigram feature: consecutive token pairs are hashed into a small embedding table.
parameters: {"vocab_size":2048}
SmearGate
SmearGate feature used in the model.
parameters: null
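The PR does not describe SmearGate, so the following is only a guess at its shape: a learned per-channel gate that mixes ("smears") part of the previous token's representation into the current one.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Assumed form: sigmoid-gated blend of each token's representation
    with the preceding token's."""
    def __init__(self, d_model: int):
        super().__init__()
        # strongly negative init -> sigmoid near 0 -> almost no smearing at start
        self.gate = nn.Parameter(torch.full((d_model,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, seq, d_model)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev
```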
Regularization
LN scale
parameters: null
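The PR gives no details for the LN-scale regularizer; one plausible reading is a penalty that keeps LayerNorm gain (scale) parameters near 1, sketched below with an illustrative coefficient.

```python
import torch
import torch.nn as nn

def ln_scale_penalty(model: nn.Module, coeff: float = 1e-4) -> torch.Tensor:
    """Assumed regularizer: quadratic penalty pulling LayerNorm gains
    toward 1. Add the returned term to the training loss."""
    terms = [
        (m.weight - 1.0).pow(2).sum()
        for m in model.modules()
        if isinstance(m, nn.LayerNorm) and m.weight is not None
    ]
    return coeff * torch.stack(terms).sum() if terms else torch.zeros(())
```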
Weight Averaging
EMA
parameters: null
SWA
parameters: null
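Both averaging schemes in minimal form: an exponential moving average maintained during training, and a plain SWA-style mean over checkpoints. The 0.999 decay is illustrative; the PR reports no hyperparameters for either, and the SWA sketch assumes floating-point tensors throughout.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):
    """EMA of weights: ema <- decay * ema + (1 - decay) * w."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

@torch.no_grad()
def swa_average(state_dicts):
    """Equal-weight mean of several checkpoint state_dicts."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k, v in sd.items():
            avg[k] += v.float()
    for k in avg:
        avg[k] /= len(state_dicts)
    return avg
```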
Quantization
int6
bits: 6
scope: all
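A sketch of symmetric per-tensor int6 quantization (integer range [-31, 31] with one scale per tensor). Whether the PR uses per-tensor or per-channel scales, and how the 6-bit values are bit-packed into the artifact, is not stated; the sketch stores them in int8 for simplicity.

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Map floats to integers in [-31, 31] with a single scale."""
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed 6-bit
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                               # 6 useful bits per value + scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```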
Compression
zstd
level: null
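The final artifact step, sketched with the `zstandard` Python bindings: serialize the quantized weights, then zstd-compress the blob. The PR leaves the level unreported, so 19 below is only an illustrative choice.

```python
import io
import torch
import zstandard as zstd

def save_artifact(state_dict, path: str, level: int = 19) -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)          # serialize (quantized) weights
    blob = zstd.ZstdCompressor(level=level).compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)                     # compressed artifact size in bytes
```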
Evaluation
stride-based eval
parameters: {"stride":64}
Novel Contributions
- Shared universal transformer block reused across 8 passes
- Pass-dependent rotations, depth embeddings, and modulation
- Partial RoPE with 16 rotary dimensions
- BigramHash and SmearGate features
- EMA/SWA averaging
- Int6 quantization with zstd compression
- 10-minute 8xH100 non-record 16MB submission