val_bpb: 1.1763
Architecture: Transformer
Optimizer: —
Artifact Size: 15,501,266 bytes
Training Techniques
Architecture
depth recurrence
Fixed middle-block recurrence added to the full80 mixed-lowbit family.
parameters: {"layers":10,"model_dim":576,"num_heads":8,"num_kv_heads":4,"recurrent_mode":"fixed","recurrent_core_start":3,"recurrent_core_len":2,"recurrent_steps":2}
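As a minimal sketch, the fixed recurrence above can be read as a flat layer-execution schedule: the two-layer core starting at layer 3 is applied twice (sharing weights across steps), while every other layer runs once. The function name and the weight-sharing assumption are illustrative, not taken from the submission's code.

```python
def recurrent_layer_schedule(layers, core_start, core_len, steps):
    """Expand a fixed middle-block recurrence into a flat execution
    order: the core block [core_start, core_start + core_len) is
    applied `steps` times (assumed to reuse the same weights each
    pass); all other layers run exactly once."""
    prefix = list(range(core_start))
    core = list(range(core_start, core_start + core_len))
    suffix = list(range(core_start + core_len, layers))
    return prefix + core * steps + suffix

# Parameters from the card: layers=10, recurrent_core_start=3,
# recurrent_core_len=2, recurrent_steps=2.
order = recurrent_layer_schedule(layers=10, core_start=3, core_len=2, steps=2)
print(order)  # [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 9]
```

With steps=1 the schedule reduces to the plain 10-layer stack, so the recurrence adds depth (12 layer applications) without adding parameters.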
step embedding
Learned step-aware recurrence embeddings were added to the recurrent blocks.
parameters: {"enabled":true,"init_std":0.01}
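A sketch of what a step-aware recurrence embedding could look like: one learned vector per recurrence step, initialized with the listed init_std of 0.01, added to the hidden state so the shared core can tell which pass it is on. The function names, and the assumption that the embedding is added to the hidden state (rather than, say, concatenated), are hypothetical.

```python
import random

def init_step_embeddings(num_steps, model_dim, init_std=0.01, seed=0):
    """One learnable vector per recurrence step, drawn from
    N(0, init_std^2) as in the card's {"init_std": 0.01}."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, init_std) for _ in range(model_dim)]
            for _ in range(num_steps)]

def add_step_embedding(hidden, step_embeddings, step):
    """Inject step identity: add the step's embedding to a token's
    hidden state before the recurrent core runs (assumed injection
    point)."""
    return [h + e for h, e in zip(hidden, step_embeddings[step])]

# Card values: recurrent_steps=2, model_dim=576.
embs = init_step_embeddings(num_steps=2, model_dim=576)
```

Because the core block's weights are shared across steps, these per-step vectors are the only parameters that distinguish pass 1 from pass 2.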
Sequence Length
sequence_length
train_length: 1536
eval_length: null
Regularization
magnitude pruning
parameters: {"pct":0.033}
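Magnitude pruning with pct 0.033 zeroes the smallest ~3.3% of weights by absolute value. A minimal sketch, assuming global unstructured pruning over a flat weight list (whether the submission prunes globally or per-tensor is not stated):

```python
def magnitude_prune(weights, pct):
    """Zero out the smallest `pct` fraction of weights by absolute
    value. Ties at the threshold are all pruned, so slightly more
    than pct may be removed."""
    k = int(len(weights) * pct)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# pct=0.4 on a toy list for illustration; the card uses pct=0.033.
pruned = magnitude_prune([0.5, -0.01, 0.2, 0.03, -0.4], pct=0.4)
print(pruned)  # [0.5, 0.0, 0.2, 0.0, -0.4]
```

At pct=0.033 this removes the least-salient weights, which compresses well under the int4/int8 export below and helps the artifact fit the 16 MB cap.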
Quantization
mixed int4/int8
bits: 4
scope: export rescue
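A sketch of the kind of symmetric quantizer a mixed int4/int8 export could use: one scale per tensor, signed integers at either bit width. Which tensors get 4 bits versus 8, and the quantization scheme itself, are assumptions; the card only states bits: 4 for the export rescue (BIGRAM_EXPORT_BITS=4).

```python
def quantize_symmetric(values, bits):
    """Symmetric per-tensor quantization: map floats to signed
    integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1] using a
    single max-abs scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [x * scale for x in q]

q, s = quantize_symmetric([0.6, -1.0, 0.2, 0.0], bits=4)
print(q)  # [4, -7, 1, 0]
```

Int4 storage halves the bytes per weight relative to int8, which is the lever that pulled the artifact under the 16,000,000-byte cap.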
Novel Contributions
- Non-record unlimited-compute submission for the 16MB track
- Recurrence-based continuation of the full80 mixed-lowbit family
- Fixed middle-block recurrence with learned step-aware recurrence embeddings
- Training at sequence length 1536
- Export rescue using BIGRAM_EXPORT_BITS=4 and MAG_PRUNE_PCT=0.033
- Achieved val_bpb 1.17631839 while staying under the 16,000,000 byte cap