PR #684 (closed)
Record: 11L Sidecar48 + Enhanced Attention + Async Data Pipeline + AdamW TTT (20 epochs, cosine LR, 3-seed mean val_bpb=1.0573)
by DeepReinforceView on GitHub
val_bpb
1.0574
Architecture
11-layer Transformer
Optimizer
AdamW
Artifact Size
< 16 MB
Training Techniques
Architecture
Attention shift mixing
Learned k_shift_mix and v_shift_mix blend each position's keys and values with those of the previous position.
parameters: null
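A minimal NumPy sketch of the shift-mixing idea, assuming `k_shift_mix` and `v_shift_mix` are per-head scalars squashed into [0, 1] (the card does not give their exact shape or parameterization):

```python
import numpy as np

def shift_mix(x, mix):
    """Blend each position's vectors with the previous position's.

    x:   (seq, heads, dim) keys or values
    mix: (heads,) learned blend weights, e.g. sigmoid of a raw parameter
    """
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # position 0 blends with itself
    return (1.0 - mix)[None, :, None] * x + mix[None, :, None] * prev

rng = np.random.default_rng(0)
k = rng.standard_normal((4, 2, 3))       # 4 positions, 2 heads, dim 3
k_shift_mix = np.array([0.25, 0.5])      # assumed per-head blend weights
k_mixed = shift_mix(k, k_shift_mix)
```

The same function would be applied to values with `v_shift_mix`, giving each position a cheap one-token lookback before attention runs.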
K gain
Learned per-KV-head k_gain scales key norms independently of queries.
parameters: null
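The gain itself is a one-liner; this sketch assumes `k_gain` is a plain learned scalar per KV head applied multiplicatively to keys:

```python
import numpy as np

def apply_k_gain(k, k_gain):
    """Scale each KV head's keys by a learned per-head gain.

    k:      (seq, kv_heads, head_dim)
    k_gain: (kv_heads,) learned scalars; queries are left unscaled,
            so each KV head's attention sharpness is tuned independently
    """
    return k * k_gain[None, :, None]

rng = np.random.default_rng(1)
k = rng.standard_normal((5, 4, 8))        # 4 KV heads, matching the card
k_gain = np.array([1.0, 0.9, 1.1, 1.2])
k_scaled = apply_k_gain(k, k_gain)
```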
Local value residual
Learned per-head local_v_mix adds a direct value shortcut to the attention output.
parameters: null
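A single-head sketch of the value shortcut, assuming `local_v_mix` is a learned scalar per head that adds each position's own value vector directly to the attention output:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_with_local_v(q, k, v, local_v_mix):
    """Single-head causal attention plus a learned local value shortcut.

    q, k, v:     (seq, dim)
    local_v_mix: learned scalar (one per head in the real model) that
                 routes each position's own value straight to the output
    """
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    scores[np.triu(np.ones((seq, seq), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v + local_v_mix * v  # direct value residual

rng = np.random.default_rng(2)
q, k, v = rng.standard_normal((3, 6, 8))
out = attn_with_local_v(q, k, v, local_v_mix=0.3)
```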
RoPE
Adaptive rotary embedding dimension selection, rotating only 3/4 of head_dim when head_dim > 32.
parameters: {"dimensions":null}
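A sketch of partial rotation under that rule. The 3/4-when-head_dim>32 threshold is from the card; rounding the rotated count down to an even number (for cos/sin pairs) is an assumption:

```python
import numpy as np

def rope_dims(head_dim):
    """Rotate 3/4 of the dims for larger heads, everything otherwise."""
    if head_dim > 32:
        d = head_dim * 3 // 4
        return d - d % 2              # keep an even count for rotation pairs
    return head_dim

def apply_partial_rope(x, base=10000.0):
    """Apply rotary embeddings to the first rope_dims(head_dim) dims;
    the remaining dims pass through unrotated."""
    seq, head_dim = x.shape
    d = rope_dims(head_dim)
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:d]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           x[:, d:]], axis=1)

x = np.random.default_rng(3).standard_normal((5, 64))
y = apply_partial_rope(x)
```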
BigramHash
BigramHash embedding used for token representation.
parameters: {"vocab":2048,"dimensions":96}
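The 2048-bucket, 96-dim numbers are from the card; the hash function itself is not specified, so the multiplicative hash and the padding id for the first position below are illustrative:

```python
import numpy as np

NUM_BUCKETS, DIM = 2048, 96              # from the card's parameters

def bigram_hash_ids(tokens, num_buckets=NUM_BUCKETS):
    """Hash each (previous, current) token pair into a bucket id."""
    prev = np.concatenate([[0], tokens[:-1]])   # assumed padding id for position 0
    return (prev * 1000003 + tokens) % num_buckets

rng = np.random.default_rng(4)
bigram_table = rng.standard_normal((NUM_BUCKETS, DIM)) * 0.02
tokens = np.array([5, 17, 17, 99])
bigram_emb = bigram_table[bigram_hash_ids(tokens)]  # (4, 96); added to token embeddings
```

The looked-up bigram embedding would typically be summed into (or projected onto) the ordinary token embedding, giving the model cheap access to local two-token context.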
SmearGate
SmearGate combined with U-Net-style skip connections between layers.
parameters: null
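The card does not define SmearGate. One common form of a "smear" gate blends a gated amount of the previous token's activation into the current position; that reading is sketched below as an assumption, and the U-Net skip connections (standard early-to-late residual links) are not sketched:

```python
import numpy as np

def smear_gate(x, w_gate):
    """Gated 'smear' of each token's activation into the next position.

    x:      (seq, dim) activations
    w_gate: (dim,) parameters of a per-position sigmoid gate
    This gated forward blend is one common smear-gate form; the PR's
    exact formulation is not given in the card.
    """
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))          # (seq,) gate values
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return x + g[:, None] * prev

rng = np.random.default_rng(5)
x = rng.standard_normal((6, 16))
w_gate = rng.standard_normal(16) * 0.1
out = smear_gate(x, w_gate)
```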
MLP3x
Transformer uses a 3x MLP expansion.
parameters: null
KV head count
Model uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
SharedSparseSidecar
SharedSparseSidecar module with 48 hidden units at layers 8-10.
parameters: {"hidden":48,"layers":[8,9,10]}
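A sketch of the sidecar's shape. The 48-unit hidden layer and the shared placement at layers 8-10 are from the card; ReLU as the sparsifying nonlinearity and the residual add are assumptions:

```python
import numpy as np

class SharedSparseSidecar:
    """Tiny auxiliary MLP whose weights are shared across layers 8-10."""

    def __init__(self, dim, hidden=48, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.standard_normal((dim, hidden)) * dim ** -0.5
        self.w_out = rng.standard_normal((hidden, dim)) * hidden ** -0.5

    def __call__(self, x):
        h = np.maximum(x @ self.w_in, 0.0)   # sparse hidden activations
        return x + h @ self.w_out            # added to the residual stream

sidecar = SharedSparseSidecar(dim=256)
x = np.random.default_rng(6).standard_normal((10, 256))
for layer in range(11):
    # ... the main transformer block for `layer` would run here ...
    if layer in (8, 9, 10):                  # one module, reused at each layer
        x = sidecar(x)
```

Because the three layers reuse one 48-unit module, the extra capacity costs almost nothing in artifact size, which matters under the < 16 MB budget.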
Quantization
mixed int6
bits: 6
scope: model weights
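A sketch of symmetric 6-bit weight quantization (integer range [-31, 31]). Per-row scaling, and which tensors are kept at higher precision (the "mixed" part), are assumptions; the card states only 6-bit weights:

```python
import numpy as np

def quantize_int6(w, axis=1):
    """Symmetric per-row 6-bit quantization: ints in [-31, 31] plus a scale."""
    scale = np.abs(w).max(axis=axis, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale                          # int6 values stored in int8

def dequantize_int6(q, scale):
    return q.astype(np.float64) * scale

w = np.random.default_rng(7).standard_normal((64, 32))
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

The quantization error per element is bounded by half a quantization step, and the narrow integer range is also what makes the weights compress well afterwards.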
Weight Averaging
EMA
parameters: {"decay":0.997}
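The EMA update with the card's decay of 0.997, as a minimal sketch (evaluation would typically use the averaged weights rather than the raw ones):

```python
def ema_update(ema, params, decay=0.997):
    """EMA of weights: ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}

# after each optimizer step, fold the fresh weights into the average
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0})        # weights held at 1.0 for the toy
```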
Initialization
orthogonal init
Orthogonal weight initialization.
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"epochs":20}
LR Schedule
cosine decay with linear warmup
parameters: {"start_lr":0.0005,"end_lr":0.00002,"warmup":"1 epoch"}
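The schedule from the card's parameters as a sketch; the steps-per-epoch count is a placeholder, and one warmup epoch is taken literally:

```python
import math

def lr_at(step, total_steps, warmup_steps, start_lr=5e-4, end_lr=2e-5):
    """Linear warmup to start_lr, then cosine decay to end_lr."""
    if step < warmup_steps:
        return start_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))

# 20 epochs at (say) 100 steps each; warmup is the first epoch per the card
total_steps, warmup_steps = 2000, 100
```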
Evaluation
sliding window eval
parameters: {"stride":32}
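A sketch of how strided evaluation spans are planned: each window scores only its newest tokens, so almost every token is predicted with long preceding context, at the cost of more forward passes. The stride of 32 is from the card; the window length of 256 is illustrative:

```python
def sliding_windows(num_tokens, window=256, stride=32):
    """Plan evaluation spans as (start, end, score_from) triples;
    tokens in [score_from, end) are scored by that window."""
    spans, start, covered = [], 0, 0
    while covered < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, covered))
        covered = end
        start += stride
    return spans

spans = sliding_windows(300)
```

Every token is scored exactly once, and after the first window each scored token sees at least `window - stride` tokens of context.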
Test-Time Training
full TTT
parameters: {"epochs":20}
Sequence Length
sequence_length
train_length: null
eval_length: null
Compression
zstd
level: 22
Other
other
Asynchronous memory-mapped data pipeline with coprime-stride shard sampling, background thread batching, CUDA stream prefetch, and adaptive shard mixing width.
parameters: {"mix_width_start":8,"mix_width_end":32}
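The coprime-stride sampling part of the pipeline can be sketched as below. Stepping through shards with a stride coprime to the shard count visits every shard exactly once per pass in a scrambled order; the round-robin interleave across the active `mix_width` group is an assumption about the unstated mixing rule, and the real pipeline does this from a background thread over memory-mapped shards with CUDA-stream prefetch:

```python
import math

def coprime_stride_order(num_shards, stride):
    """Full-cycle shard visit order via a stride coprime to the count."""
    assert math.gcd(num_shards, stride) == 1
    return [(i * stride) % num_shards for i in range(num_shards)]

def mixed_shard_stream(num_shards, stride, mix_width):
    """Interleave reads across `mix_width` shards at a time, walking the
    coprime-stride order; the card widens mix_width from 8 to 32."""
    order = coprime_stride_order(num_shards, stride)
    while True:
        for pos in range(0, num_shards, mix_width):
            for shard in order[pos:pos + mix_width]:
                yield shard          # a real loader would read a batch here

order = coprime_stride_order(16, 5)
stream = mixed_shard_stream(16, 5, mix_width=8)
first_pass = [next(stream) for _ in range(16)]
```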
Novel Contributions
- Learned attention shift mixing for keys and values
- Learned per-KV-head key gain
- Learned per-head local value residual
- Adaptive rotary embedding dimension selection
- Fully asynchronous memory-mapped data pipeline
- Background thread and CUDA stream prefetching
- Adaptive shard mixing schedule from 8 to 32 shards
- Denser sliding-window evaluation with stride 32
- Int6 mixed quantization and zstd-22 compression