PR #684 (closed)
Record: 11L Sidecar48 + Enhanced Attention + Async Data Pipeline + AdamW TTT (20 epochs, cosine LR, 3-seed mean val_bpb=1.0573)
by DeepReinforceView on GitHub
val_bpb
1.0574
Architecture
11-layer Transformer
Optimizer
AdamW
Artifact Size
< 16 MB
Training Techniques
Architecture
Attention shift mixing
Learned k_shift_mix and v_shift_mix blend each position's keys and values with those of the previous position.
parameters: null
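A minimal NumPy sketch of the shift-mixing idea, assuming `k_shift_mix` and `v_shift_mix` are per-head scalars squashed into [0, 1] (the card does not give their exact shape or parameterization):

```python
import numpy as np

def shift_mix(x, mix):
    """Blend each position's vectors with the previous position's.

    x:   (seq, heads, dim) keys or values
    mix: (heads,) learned blend weights, e.g. sigmoid of a raw parameter
    """
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # position 0 blends with itself
    return (1.0 - mix)[None, :, None] * x + mix[None, :, None] * prev

rng = np.random.default_rng(0)
k = rng.standard_normal((4, 2, 3))       # 4 positions, 2 heads, dim 3
k_shift_mix = np.array([0.25, 0.5])      # assumed per-head blend weights
k_mixed = shift_mix(k, k_shift_mix)
```

The same function would be applied to values with `v_shift_mix`, giving each position a cheap one-token lookback before attention runs.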
K gain
Learned per-KV-head k_gain scales key norms independently of queries.
parameters: null
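The gain itself is a one-liner; this sketch assumes `k_gain` is a plain learned scalar per KV head applied multiplicatively to keys:

```python
import numpy as np

def apply_k_gain(k, k_gain):
    """Scale each KV head's keys by a learned per-head gain.

    k:      (seq, kv_heads, head_dim)
    k_gain: (kv_heads,) learned scalars; queries are left unscaled,
            so each KV head's attention sharpness is tuned independently
    """
    return k * k_gain[None, :, None]

rng = np.random.default_rng(1)
k = rng.standard_normal((5, 4, 8))        # 4 KV heads, matching the card
k_gain = np.array([1.0, 0.9, 1.1, 1.2])
k_scaled = apply_k_gain(k, k_gain)
```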
Local value residual
Learned per-head local_v_mix adds a direct value shortcut to the attention output.
parameters: null
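A single-head sketch of the value shortcut, assuming `local_v_mix` is a learned scalar per head that adds each position's own value vector directly to the attention output:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_with_local_v(q, k, v, local_v_mix):
    """Single-head causal attention plus a learned local value shortcut.

    q, k, v:     (seq, dim)
    local_v_mix: learned scalar (one per head in the real model) that
                 routes each position's own value straight to the output
    """
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    scores[np.triu(np.ones((seq, seq), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v + local_v_mix * v  # direct value residual

rng = np.random.default_rng(2)
q, k, v = rng.standard_normal((3, 6, 8))
out = attn_with_local_v(q, k, v, local_v_mix=0.3)
```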
RoPE
Adaptive rotary embedding dimension selection, rotating only 3/4 of head_dim when head_dim > 32.
parameters: {"dimensions":null}
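A sketch of partial rotation under that rule. The 3/4-when-head_dim>32 threshold is from the card; rounding the rotated count down to an even number (for cos/sin pairs) is an assumption:

```python
import numpy as np

def rope_dims(head_dim):
    """Rotate 3/4 of the dims for larger heads, everything otherwise."""
    if head_dim > 32:
        d = head_dim * 3 // 4
        return d - d % 2              # keep an even count for rotation pairs
    return head_dim

def apply_partial_rope(x, base=10000.0):
    """Apply rotary embeddings to the first rope_dims(head_dim) dims;
    the remaining dims pass through unrotated."""
    seq, head_dim = x.shape
    d = rope_dims(head_dim)
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:d]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           x[:, d:]], axis=1)

x = np.random.default_rng(3).standard_normal((5, 64))
y = apply_partial_rope(x)
```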
BigramHash
BigramHash embedding used for token representation.
parameters: {"vocab":2048,"dimensions":96}
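The 2048-bucket, 96-dim numbers are from the card; the hash function itself is not specified, so the multiplicative hash and the padding id for the first position below are illustrative:

```python
import numpy as np

NUM_BUCKETS, DIM = 2048, 96              # from the card's parameters

def bigram_hash_ids(tokens, num_buckets=NUM_BUCKETS):
    """Hash each (previous, current) token pair into a bucket id."""
    prev = np.concatenate([[0], tokens[:-1]])   # assumed padding id for position 0
    return (prev * 1000003 + tokens) % num_buckets

rng = np.random.default_rng(4)
bigram_table = rng.standard_normal((NUM_BUCKETS, DIM)) * 0.02
tokens = np.array([5, 17, 17, 99])
bigram_emb = bigram_table[bigram_hash_ids(tokens)]  # (4, 96); added to token embeddings
```

The looked-up bigram embedding would typically be summed into (or projected onto) the ordinary token embedding, giving the model cheap access to local two-token context.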
SmearGate
SmearGate combined with U-Net-style skip connections between layers.
parameters: null
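The card does not define SmearGate. One common form of a "smear" gate blends a gated amount of the previous token's activation into the current position; that reading is sketched below as an assumption, and the U-Net skip connections (standard early-to-late residual links) are not sketched:

```python
import numpy as np

def smear_gate(x, w_gate):
    """Gated 'smear' of each token's activation into the next position.

    x:      (seq, dim) activations
    w_gate: (dim,) parameters of a per-position sigmoid gate
    This gated forward blend is one common smear-gate form; the PR's
    exact formulation is not given in the card.
    """
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))          # (seq,) gate values
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return x + g[:, None] * prev

rng = np.random.default_rng(5)
x = rng.standard_normal((6, 16))
w_gate = rng.standard_normal(16) * 0.1
out = smear_gate(x, w_gate)
```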
MLP3x
Transformer uses a 3x MLP expansion.
parameters: null
KV head count
Model uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
SharedSparseSidecar
SharedSparseSidecar module with 48 hidden units at layers 8-10.
parameters: {"hidden":48,"layers":[8,9,10]}
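A sketch of the sidecar's shape. The 48-unit hidden layer and the shared placement at layers 8-10 are from the card; ReLU as the sparsifying nonlinearity and the residual add are assumptions:

```python
import numpy as np

class SharedSparseSidecar:
    """Tiny auxiliary MLP whose weights are shared across layers 8-10."""

    def __init__(self, dim, hidden=48, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.standard_normal((dim, hidden)) * dim ** -0.5
        self.w_out = rng.standard_normal((hidden, dim)) * hidden ** -0.5

    def __call__(self, x):
        h = np.maximum(x @ self.w_in, 0.0)   # sparse hidden activations
        return x + h @ self.w_out            # added to the residual stream

sidecar = SharedSparseSidecar(dim=256)
x = np.random.default_rng(6).standard_normal((10, 256))
for layer in range(11):
    # ... the main transformer block for `layer` would run here ...
    if layer in (8, 9, 10):                  # one module, reused at each layer
        x = sidecar(x)
```

Because the three layers reuse one 48-unit module, the extra capacity costs almost nothing in artifact size, which matters under the < 16 MB budget.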
Quantization
mixed int6
bits: 6
scope: model weights
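A sketch of symmetric 6-bit weight quantization (integer range [-31, 31]). Per-row scaling, and which tensors are kept at higher precision (the "mixed" part), are assumptions; the card states only 6-bit weights:

```python
import numpy as np

def quantize_int6(w, axis=1):
    """Symmetric per-row 6-bit quantization: ints in [-31, 31] plus a scale."""
    scale = np.abs(w).max(axis=axis, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale                          # int6 values stored in int8

def dequantize_int6(q, scale):
    return q.astype(np.float64) * scale

w = np.random.default_rng(7).standard_normal((64, 32))
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

The quantization error per element is bounded by half a quantization step, and the narrow integer range is also what makes the weights compress well afterwards.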
Weight Averaging
EMA
parameters: {"decay":0.997}
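The EMA update with the card's decay of 0.997, as a minimal sketch (evaluation would typically use the averaged weights rather than the raw ones):

```python
def ema_update(ema, params, decay=0.997):
    """EMA of weights: ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}

# after each optimizer step, fold the fresh weights into the average
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0})        # weights held at 1.0 for the toy
```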
Initialization
orthogonal init
Orthogonal weight initialization.
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"epochs":20}
LR Schedule
cosine decay with linear warmup
parameters: {"start_lr":0.0005,"end_lr":0.00002,"warmup":"1 epoch"}
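The schedule from the card's parameters as a sketch; the steps-per-epoch count is a placeholder, and one warmup epoch is taken literally:

```python
import math

def lr_at(step, total_steps, warmup_steps, start_lr=5e-4, end_lr=2e-5):
    """Linear warmup to start_lr, then cosine decay to end_lr."""
    if step < warmup_steps:
        return start_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))

# 20 epochs at (say) 100 steps each; warmup is the first epoch per the card
total_steps, warmup_steps = 2000, 100
```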
Evaluation
sliding window eval
parameters: {"stride":32}
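A sketch of how strided evaluation spans are planned: each window scores only its newest tokens, so almost every token is predicted with long preceding context, at the cost of more forward passes. The stride of 32 is from the card; the window length of 256 is illustrative:

```python
def sliding_windows(num_tokens, window=256, stride=32):
    """Plan evaluation spans as (start, end, score_from) triples;
    tokens in [score_from, end) are scored by that window."""
    spans, start, covered = [], 0, 0
    while covered < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, covered))
        covered = end
        start += stride
    return spans

spans = sliding_windows(300)
```

Every token is scored exactly once, and after the first window each scored token sees at least `window - stride` tokens of context.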
Test-Time Training
full TTT
parameters: {"epochs":20}
Sequence Length
sequence_length
train_length: null
eval_length: null
Compression
zstd
level: 22
Other
other
Asynchronous memory-mapped data pipeline with coprime-stride shard sampling, background thread batching, CUDA stream prefetch, and adaptive shard mixing width.
parameters: {"mix_width_start":8,"mix_width_end":32}
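The coprime-stride sampling part of the pipeline can be sketched as below. Stepping through shards with a stride coprime to the shard count visits every shard exactly once per pass in a scrambled order; the round-robin interleave across the active `mix_width` group is an assumption about the unstated mixing rule, and the real pipeline does this from a background thread over memory-mapped shards with CUDA-stream prefetch:

```python
import math

def coprime_stride_order(num_shards, stride):
    """Full-cycle shard visit order via a stride coprime to the count."""
    assert math.gcd(num_shards, stride) == 1
    return [(i * stride) % num_shards for i in range(num_shards)]

def mixed_shard_stream(num_shards, stride, mix_width):
    """Interleave reads across `mix_width` shards at a time, walking the
    coprime-stride order; the card widens mix_width from 8 to 32."""
    order = coprime_stride_order(num_shards, stride)
    while True:
        for pos in range(0, num_shards, mix_width):
            for shard in order[pos:pos + mix_width]:
                yield shard          # a real loader would read a batch here

order = coprime_stride_order(16, 5)
stream = mixed_shard_stream(16, 5, mix_width=8)
first_pass = [next(stream) for _ in range(16)]
```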
Novel Contributions
- Learned attention shift mixing for keys and values
- Learned per-KV-head key gain
- Learned per-head local value residual
- Adaptive rotary embedding dimension selection
- Fully asynchronous memory-mapped data pipeline
- Background thread and CUDA stream prefetching
- Adaptive shard mixing schedule from 8 to 32 shards
- Denser sliding-window evaluation with stride 32
- Int6 mixed quantization and zstd-22 compression