val_bpb: 1.6110
Architecture: Transformer
Optimizer: —
Artifact Size: 2,773,498 bytes
Training Techniques
Architecture
BigramHash
Builds conditional memory from tokenizer-normalized, compressed lookup identities stored in multi-hash n-gram memory tables.
parameters: {"ngrams":[2,3],"hash_heads":4,"vocab_sizes":[2048,2048]}
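A minimal sketch of what a multi-hash n-gram memory lookup could look like, using the card's parameters (ngrams=[2,3], hash_heads=4, vocab_sizes=[2048,2048]). The hash scheme, embedding width, and table contents here are illustrative assumptions, not the submission's actual implementation:

```python
import hashlib

NGRAMS = [2, 3]
HASH_HEADS = 4
VOCAB_SIZES = [2048, 2048]
EMB_DIM = 8  # assumed embedding width, not from the card

# One embedding table per (n-gram order, hash head): vocab_size x EMB_DIM.
tables = {
    (n, h): [[0.0] * EMB_DIM for _ in range(size)]
    for n, size in zip(NGRAMS, VOCAB_SIZES)
    for h in range(HASH_HEADS)
}

def ngram_slot(tokens, n, head, vocab_size):
    """Hash the last n token ids (salted per head) into a table slot."""
    key = f"{head}:" + ",".join(map(str, tokens[-n:]))
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % vocab_size

def memory_features(tokens):
    """Sum the retrieved rows across n-gram orders and hash heads."""
    out = [0.0] * EMB_DIM
    for n, size in zip(NGRAMS, VOCAB_SIZES):
        if len(tokens) < n:
            continue
        for head in range(HASH_HEADS):
            row = tables[(n, head)][ngram_slot(tokens, n, head, size)]
            out = [a + b for a, b in zip(out, row)]
    return out

feats = memory_features([17, 4, 99])
```

Using several salted hash heads per n-gram order spreads collisions across independent tables, so a clash in one head can be compensated by the others.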
Gated Attention
Projects memory embeddings into contextual keys and values and gates them back into the transformer.
parameters: {"layers":[1,3],"kv_heads":2}
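The projection-and-gate step can be sketched as follows; the matrix shapes and the scalar sigmoid gate are assumptions for illustration (the card only states layers=[1,3] and kv_heads=2), not the submission's actual parameterization:

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_memory_kv(mem, W_k, W_v, w_gate):
    """Project a memory embedding to a contextual key/value pair and a
    scalar gate; the gated K/V are what get folded back into attention."""
    k = matvec(W_k, mem)
    v = matvec(W_v, mem)
    g = sigmoid(sum(w * m for w, m in zip(w_gate, mem)))
    return [g * x for x in k], [g * x for x in v], g
```

A learned gate lets the model attenuate the memory pathway when the hashed lookup is uninformative, instead of forcing the retrieved features into every attention step.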
KV head count
Uses reduced key/value head count for the transformer.
parameters: {"num_heads":4,"num_kv_heads":2}
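With the card's num_heads=4 and num_kv_heads=2, each key/value head is shared by num_heads // num_kv_heads query heads, shrinking K/V storage by that factor. A sketch of the head mapping (the grouping rule is the standard contiguous one, assumed here):

```python
NUM_HEADS = 4
NUM_KV_HEADS = 2
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per K/V head

def kv_head_for(query_head):
    """Map a query head index to the K/V head it attends with."""
    return query_head // GROUP

mapping = {q: kv_head_for(q) for q in range(NUM_HEADS)}
# mapping == {0: 0, 1: 0, 2: 1, 3: 1}
```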
Other
Injects learned memory features into selected internal transformer layers.
parameters: {"layers":[1,3]}
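A minimal sketch of injecting memory features at the card's selected layers (layers=[1,3]); the add-into-residual form is an assumption, since the card does not specify how the features are combined:

```python
INJECT_LAYERS = {1, 3}  # from the card's parameters

def forward(x, layers, mem):
    """Run a stack of layer functions, adding the memory vector to the
    residual stream before each selected layer."""
    for i, layer in enumerate(layers):
        if i in INJECT_LAYERS:
            x = [a + b for a, b in zip(x, mem)]
        x = layer(x)
    return x
```

Injecting at internal layers (rather than only at the embedding) lets later blocks condition on memory features that earlier blocks have already had a chance to process.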
Regularization
weight decay
parameters: {"value":0.01}
Sequence Length
train_length: 256
eval_length: null
Novel Contributions
- Faithful standalone conditional-memory architecture
- Tokenizer-normalized compressed lookup identities
- Multi-hash n-gram memory tables
- Contextual key/value projections
- Internal-layer memory injection
- Non-record 16 MB submission showing strong local signs of life but weaker transfer to the cloud evaluation