PR #1490

open

Non-record: Faithful Conditional Memory

by wisebreadloaf
val_bpb
1.6110
Architecture
Transformer
Optimizer
Artifact Size
2,773,498 bytes

Training Techniques

Architecture
BigramHash
Uses tokenizer-normalized compressed lookup identities with multi-hash n-gram memory tables for conditional memory.
parameters: {"ngrams":[2,3],"hash_heads":4,"vocab_sizes":[2048,2048]}
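A minimal numpy sketch of how a multi-hash n-gram memory table of this shape could work. The hash function, embedding dimension, and random weights here are placeholder assumptions for illustration, not the PR's actual code; only the `ngrams`, `hash_heads`, and table-size values come from the parameters above.

```python
import numpy as np

def ngram_memory_lookup(token_ids, ngrams=(2, 3), hash_heads=4, vocab_size=2048, dim=16):
    """Look up a memory feature for the trailing context.

    For each n in `ngrams` and each of `hash_heads` hash functions, the
    trailing n tokens are hashed into a table of `vocab_size` rows; the
    selected rows are summed into a single memory feature vector.
    """
    rng = np.random.default_rng(0)
    # One table per (ngram, head) pair; random placeholders standing in
    # for learned embeddings.
    tables = {
        (n, h): rng.standard_normal((vocab_size, dim))
        for n in ngrams for h in range(hash_heads)
    }
    feature = np.zeros(dim)
    for n in ngrams:
        context = tuple(token_ids[-n:])  # trailing n-gram
        for h in range(hash_heads):
            # Salted polynomial hash; a stand-in for the PR's
            # tokenizer-normalized compressed lookup identity.
            acc = h + 1
            for t in context:
                acc = (acc * 1000003 + t) % (2**31)
            feature += tables[(n, h)][acc % vocab_size]
    return feature

vec = ngram_memory_lookup([17, 4, 93, 8])
print(vec.shape)  # (16,)
```

Using several independent hash heads reduces the impact of collisions: two n-grams that collide under one hash are unlikely to collide under all of them.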
Gated Attention
Projects memory embeddings into contextual keys and values and gates them back into the transformer.
parameters: {"layers":[1,3],"kv_heads":2}
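A numpy sketch of the gating idea described above, under assumed mechanics: memory embeddings are projected into keys and values, the hidden states attend over them, and a sigmoid gate controls how much of the result is added back. All projections are random placeholders; in the real model they would be learned.

```python
import numpy as np

def gate_memory_into_layer(hidden, mem, seed=1):
    """Attend from hidden states over memory-derived K/V and gate the result in.

    hidden: (T, d) transformer hidden states at an injection layer
    mem:    (M, d_mem) memory features for the sequence
    """
    rng = np.random.default_rng(seed)
    T, d = hidden.shape
    M, d_mem = mem.shape
    # Placeholder projections (learned in the real model).
    Wk = rng.standard_normal((d_mem, d)) / np.sqrt(d_mem)
    Wv = rng.standard_normal((d_mem, d)) / np.sqrt(d_mem)
    Wg = rng.standard_normal((d, 1)) / np.sqrt(d)

    k, v = mem @ Wk, mem @ Wv                 # contextual keys/values
    scores = hidden @ k.T / np.sqrt(d)        # (T, M) attention logits
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)       # softmax over memory slots
    mem_out = attn @ v                        # (T, d) memory readout
    gate = 1 / (1 + np.exp(-(hidden @ Wg)))   # sigmoid gate, (T, 1)
    return hidden + gate * mem_out
```

The gate lets the model learn to ignore the memory pathway where it is unhelpful, which keeps the residual stream close to the plain transformer when the lookup misses.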
KV head count
Uses reduced key/value head count for the transformer.
parameters: {"num_heads":4,"num_kv_heads":2}
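A sketch of reduced K/V head count (grouped-query attention) with the head counts from the parameters above; `head_dim` and the random weights are illustrative assumptions. Each K/V head is shared by `num_heads // num_kv_heads` query heads, shrinking the K/V projections and cache without reducing the number of query heads.

```python
import numpy as np

def grouped_query_attention(x, num_heads=4, num_kv_heads=2, head_dim=8, seed=2):
    """Single-layer attention with fewer K/V heads than query heads."""
    rng = np.random.default_rng(seed)
    T, d = x.shape
    group = num_heads // num_kv_heads
    # Placeholder projections; note Wk/Wv are half the width of Wq here.
    Wq = rng.standard_normal((d, num_heads * head_dim)) / np.sqrt(d)
    Wk = rng.standard_normal((d, num_kv_heads * head_dim)) / np.sqrt(d)
    Wv = rng.standard_normal((d, num_kv_heads * head_dim)) / np.sqrt(d)
    q = (x @ Wq).reshape(T, num_heads, head_dim)
    k = (x @ Wk).reshape(T, num_kv_heads, head_dim)
    v = (x @ Wv).reshape(T, num_kv_heads, head_dim)
    # Repeat each K/V head for its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty((T, num_heads, head_dim))
    for h in range(num_heads):
        s = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        a = np.exp(s - s.max(-1, keepdims=True))
        a /= a.sum(-1, keepdims=True)
        out[:, h] = a @ v[:, h]
    return out.reshape(T, num_heads * head_dim)
```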
other
Injects learned memory features into selected internal transformer layers.
parameters: {"layers":[1,3]}
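A toy sketch of injecting a memory feature at selected layer indices; the layer indices come from the parameters above, while the additive injection and the toy layers themselves are illustrative assumptions (the PR may combine this with the gating mechanism described earlier).

```python
import numpy as np

def forward_with_memory_injection(x, layers, mem_feature, inject_at=(1, 3)):
    """Run a stack of layers, adding a memory feature before selected ones."""
    for i, layer in enumerate(layers):
        if i in inject_at:
            x = x + mem_feature  # broadcast (d,) over (T, d)
        x = layer(x)
    return x

rng = np.random.default_rng(3)
# Toy stand-ins for transformer layers: random linear maps with tanh.
layers = [lambda h, W=rng.standard_normal((8, 8)) / 8: np.tanh(h @ W)
          for _ in range(4)]
x = rng.standard_normal((5, 8))
y = forward_with_memory_injection(x, layers, mem_feature=rng.standard_normal(8))
print(y.shape)  # (5, 8)
```

Injecting at internal layers (rather than only at the embedding) lets later layers condition on the memory after some contextual processing has already happened.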
Regularization
weight decay
parameters: {"value":0.01}
Sequence Length
sequence_length
train_length: 256
eval_length: null

Novel Contributions

  • Faithful standalone conditional-memory architecture
  • Tokenizer-normalized compressed lookup identities
  • Multi-hash n-gram memory tables
  • Contextual key/value projections
  • Internal-layer memory injection
  • Non-record 16MB submission with strong local signs of life but weaker transfer to cloud runs