PR #66

open

ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632

by arjun-krishna1
val_bpb
1.1632
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,265,243 bytes

Training Techniques

Architecture
MLP3x
Wider MLP with 3x expansion (hidden size 1536 instead of 1024).
parameters: {"hidden":1536,"multiplier":3}
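The PR does not include source here, so the widened feed-forward block can only be sketched. A minimal PyTorch version, assuming d_model=512 (inferred from hidden=1536 at a 3x multiplier; the class and attribute names are hypothetical):

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    # Sketch of the widened MLP: 3x expansion gives hidden size 1536
    # (versus 1024 at the baseline 2x), per the PR's parameters.
    def __init__(self, d_model: int = 512, multiplier: int = 3):
        super().__init__()
        hidden = d_model * multiplier  # 1536
        self.fc_in = nn.Linear(d_model, hidden, bias=False)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(self.act(self.fc_in(x)))
```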
tied embeddings
Uses the tied embedding matrix as the output head and keeps it in fp16 to avoid the quantization penalty.
parameters: null
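Weight tying as described above can be sketched as follows (a hypothetical module, not the PR's code; vocab and d_model values are placeholders). The fp16 passthrough would apply at serialization time, where this matrix skips the int6 path:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedHead(nn.Module):
    # Sketch: one matrix serves as both the input embedding and the
    # output projection. At artifact-save time this weight would stay
    # fp16 while MLP/attention weights go through int6 quantization.
    def __init__(self, vocab: int = 50304, d_model: int = 512):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)

    def embed(self, idx: torch.Tensor) -> torch.Tensor:
        return self.emb(idx)

    def logits(self, h: torch.Tensor) -> torch.Tensor:
        # reuse the embedding weight as the output head (no extra params)
        return F.linear(h, self.emb.weight)
```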
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"batch_tokens":393216,"warmdown_iters":3000,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
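The momentum_warmup_start/momentum_warmup_steps parameters imply a ramp from 0.92 up to the final momentum of 0.99 over the first 1500 steps. A minimal sketch, assuming the ramp is linear (the function name is hypothetical):

```python
def muon_momentum(step: int,
                  warmup_steps: int = 1500,
                  start: float = 0.92,
                  end: float = 0.99) -> float:
    # Linearly ramp Muon's momentum from `start` to `end` over the
    # first `warmup_steps` iterations, then hold it constant.
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```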
Quantization
STE QAT int6
bits: 6
scope: CastedLinear weights / MLP and attention weights
mixed int6/fp16
bits: 6
scope: MLP and attention weights int6, tied embedding fp16 passthrough
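Straight-through-estimator fake quantization at 6 bits can be sketched as below: the forward pass sees weights rounded onto an int6 grid ([-32, 31]), while the backward pass treats the rounding as identity so gradients reach the fp weights. The symmetric per-tensor scale is an assumption; the PR may scale differently:

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor scale mapping max |w| to the int6 limit 31.
    scale = w.abs().max().clamp(min=1e-8) / 31.0
    # Quantize to the int6 range [-32, 31], then dequantize.
    q = (w / scale).round().clamp(-32, 31) * scale
    # Straight-through estimator: forward returns q, backward is identity.
    return w + (q - w).detach()
```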
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
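A stride-64 sliding window over a 4096-token context means each evaluated token (after the first window) gets up to 4096 tokens of left context, at the cost of many overlapping forward passes. A sketch of the window bookkeeping, assuming the standard scheme of scoring only the tokens not covered by the previous window (function name hypothetical):

```python
def sliding_windows(n_tokens: int, context: int = 4096, stride: int = 64):
    # Yield (begin, end, n_scored) triples: each window spans up to
    # `context` tokens, advances by `stride`, and scores only the
    # `n_scored` tokens that the previous window did not score.
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```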
Compression
zstd
level: 22
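Level 22 is above zstd's default ceiling, so the CLI requires the --ultra flag. A plausible compression/decompression round trip for the artifact (the filename model.bin is a placeholder):

```shell
# Compress the serialized model at zstd's maximum level (needs --ultra),
# then decompress it at load time.
zstd --ultra -22 -f -q model.bin -o model.bin.zst
zstd -d -f -q model.bin.zst -o model_restored.bin
```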
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
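A "warmdown" schedule typically holds the learning rate constant and then decays it linearly to zero over the final iterations. A sketch consistent with warmdown_steps=3000 (the linear shape and function name are assumptions):

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    # Multiplier on the base LR: 1.0 until the last `warmdown_steps`
    # iterations, then a linear decay to 0.0 at `total_steps`.
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```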
Other
other
An AutoResearch agent harness that used the GitHub CLI to inspect open PRs, bucket them by expected impact, and automatically compose the high-impact techniques.
parameters: null

Novel Contributions

  • Built an AutoResearch agent harness to autonomously inspect and compose techniques from open PRs
  • Combined wider MLP, long-context training, optimizer tuning, STE int6 QAT, mixed int6 quantization, fp16 tied embedding passthrough, and sliding-window evaluation
  • Used int6 quantization savings to enable a 3x wider MLP within the artifact size limit
  • Applied sliding-window evaluation with stride 64 over 4096-token context to improve validation score
  • Reported multi-seed results with statistical significance
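The third bullet's budget argument checks out with back-of-envelope arithmetic, assuming d_model=512 (inferred from hidden=1536 at a 3x multiplier): a 3x MLP stored at 6 bits per weight is smaller than a baseline 2x MLP at fp16.

```python
def linear_bytes(d_in: int, d_out: int, bits: int) -> int:
    # Bytes to store a bias-free linear layer at `bits` bits per weight.
    return d_in * d_out * bits // 8

# Baseline 2x MLP at fp16 vs. the PR's 3x MLP at int6 (hypothetical dims).
fp16_2x = linear_bytes(512, 1024, 16) + linear_bytes(1024, 512, 16)
int6_3x = linear_bytes(512, 1536, 6) + linear_bytes(1536, 512, 6)
assert int6_3x < fp16_2x  # the wider block still fits in less space
```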