val_bpb: 1.6110
Architecture: Transformer
Optimizer: —
Artifact Size: 2,773,498 bytes
Training Techniques
Architecture
BigramHash
Builds conditional memory from tokenizer-normalized, compressed lookup identities stored in multi-hash n-gram memory tables.
parameters: {"ngrams":[2,3],"hash_heads":4,"vocab_sizes":[2048,2048]}
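A minimal sketch of what a multi-hash n-gram memory lookup could look like, using the card's parameters (ngrams=[2,3], hash_heads=4, vocab_sizes=[2048,2048]). The hash scheme, embedding width, and table contents here are illustrative assumptions, not the submission's actual implementation:

```python
import hashlib

NGRAMS = [2, 3]
HASH_HEADS = 4
VOCAB_SIZES = [2048, 2048]
EMB_DIM = 8  # assumed embedding width, not from the card

# One embedding table per (n-gram order, hash head): vocab_size x EMB_DIM.
tables = {
    (n, h): [[0.0] * EMB_DIM for _ in range(size)]
    for n, size in zip(NGRAMS, VOCAB_SIZES)
    for h in range(HASH_HEADS)
}

def ngram_slot(tokens, n, head, vocab_size):
    """Hash the last n token ids (salted per head) into a table slot."""
    key = f"{head}:" + ",".join(map(str, tokens[-n:]))
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % vocab_size

def memory_features(tokens):
    """Sum the retrieved rows across n-gram orders and hash heads."""
    out = [0.0] * EMB_DIM
    for n, size in zip(NGRAMS, VOCAB_SIZES):
        if len(tokens) < n:
            continue
        for head in range(HASH_HEADS):
            row = tables[(n, head)][ngram_slot(tokens, n, head, size)]
            out = [a + b for a, b in zip(out, row)]
    return out

feats = memory_features([17, 4, 99])
```

Using several salted hash heads per n-gram order spreads collisions across independent tables, so a clash in one head can be compensated by the others.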
Gated Attention
Projects memory embeddings into contextual keys and values and gates them back into the transformer.
parameters: {"layers":[1,3],"kv_heads":2}
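The projection-and-gate step can be sketched as follows; the matrix shapes and the scalar sigmoid gate are assumptions for illustration (the card only states layers=[1,3] and kv_heads=2), not the submission's actual parameterization:

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_memory_kv(mem, W_k, W_v, w_gate):
    """Project a memory embedding to a contextual key/value pair and a
    scalar gate; the gated K/V are what get folded back into attention."""
    k = matvec(W_k, mem)
    v = matvec(W_v, mem)
    g = sigmoid(sum(w * m for w, m in zip(w_gate, mem)))
    return [g * x for x in k], [g * x for x in v], g
```

A learned gate lets the model attenuate the memory pathway when the hashed lookup is uninformative, instead of forcing the retrieved features into every attention step.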
KV head count
Uses reduced key/value head count for the transformer.
parameters: {"num_heads":4,"num_kv_heads":2}
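With the card's num_heads=4 and num_kv_heads=2, each key/value head is shared by num_heads // num_kv_heads query heads, shrinking K/V storage by that factor. A sketch of the head mapping (the grouping rule is the standard contiguous one, assumed here):

```python
NUM_HEADS = 4
NUM_KV_HEADS = 2
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per K/V head

def kv_head_for(query_head):
    """Map a query head index to the K/V head it attends with."""
    return query_head // GROUP

mapping = {q: kv_head_for(q) for q in range(NUM_HEADS)}
# mapping == {0: 0, 1: 0, 2: 1, 3: 1}
```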
Other
Injects learned memory features into selected internal transformer layers.
parameters: {"layers":[1,3]}
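A minimal sketch of injecting memory features at the card's selected layers (layers=[1,3]); the add-into-residual form is an assumption, since the card does not specify how the features are combined:

```python
INJECT_LAYERS = {1, 3}  # from the card's parameters

def forward(x, layers, mem):
    """Run a stack of layer functions, adding the memory vector to the
    residual stream before each selected layer."""
    for i, layer in enumerate(layers):
        if i in INJECT_LAYERS:
            x = [a + b for a, b in zip(x, mem)]
        x = layer(x)
    return x
```

Injecting at internal layers (rather than only at the embedding) lets later blocks condition on memory features that earlier blocks have already had a chance to process.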
Regularization
weight decay
parameters: {"value":0.01}
Sequence Length
train_length: 256
eval_length: null
Novel Contributions
- Faithful standalone conditional-memory architecture
- Tokenizer-normalized compressed lookup identities
- Multi-hash n-gram memory tables
- Contextual key/value projections
- Internal-layer memory injection
- Non-record 16 MB submission showing strong local signs of life but weaker transfer to the cloud evaluation