PR #1140

open

Crawler — 8.8MB, 1.1874 BPB (3-seed mean, 8xH100, 600s)

by newjordan
val_bpb
1.1874
Architecture
Transformer
Optimizer
Artifact Size
9.36MB

Training Techniques

Architecture
XSA
4 flat XSA layers with a shared crawler block repeated for 3 loops
parameters: {"layers":4,"loops":3}
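The layers/loops split above amounts to weight sharing: the shared crawler block is stored once but executed several times, so compute depth exceeds parameter depth. A minimal accounting sketch (the block granularity is an assumption; the submission's actual block contents are not shown here):

```python
# With 4 flat (unshared) layers plus one shared block run for 3 loops,
# 5 blocks of weights are stored but 7 blocks of compute are executed.
def stored_blocks(layers):
    return layers + 1          # the shared block is stored once

def executed_blocks(layers, loops):
    return layers + loops      # the shared block runs `loops` times

print(stored_blocks(4), executed_blocks(4, 3))
```

This is how the artifact stays small while keeping effective depth, which matters for a size-constrained submission.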
KV head count
Uses 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
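With 8 query heads and 4 KV heads, this is grouped-query attention: each KV head serves 2 query heads, halving the KV cache. A sketch of the head mapping (the mapping convention is the usual contiguous grouping, assumed here):

```python
# Grouped-query attention: heads // kv_heads query heads share each KV head.
def kv_head_for(query_head, heads=8, kv_heads=4):
    group = heads // kv_heads        # 2 query heads per KV head
    return query_head // group

print([kv_head_for(h) for h in range(8)])
```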
RoPE
RoPE configured with a 16 setting
parameters: {"value":16}
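A minimal rotary position embedding sketch for reference. It assumes the "16" setting is the per-head rotary dimension (i.e. 16 values are rotated); the submission may mean something else by it, and the base of 10000 is the conventional default, not taken from the PR:

```python
import math

def rope(x, pos, base=10000.0):
    # Rotate consecutive pairs (x0,x1),(x2,x3),... by position-dependent
    # angles; len(x) is assumed to be the 16-dim rotary slice of a head.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Rotation leaves vector norms unchanged; only relative angles between positions carry information.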
BigramHash
Bigram hashing used in the architecture with a 2048-entry table
parameters: {"bigram":2048}
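The bigram=2048 parameter suggests hashing adjacent token pairs into a 2048-entry embedding table, giving the model cheap access to local bigram statistics. A sketch with an illustrative multiplicative hash (the submission's actual hash function is not shown):

```python
# Hash a (prev_token, token) pair into one of 2048 buckets; collisions are
# accepted as the price of a small table. The constant 1000003 is a common
# multiplicative mixer, assumed here for illustration.
def bigram_bucket(prev_tok, tok, table_size=2048):
    return (prev_tok * 1000003 + tok) % table_size

print(bigram_bucket(17, 42))
```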
Quantization
QAT
bits: 8
scope: model
int6
bits: 6
scope: final artifact
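Quantization-aware training simulates the rounding of low-bit weights in the forward pass ("fake quantization") so the model adapts to it before the final artifact is emitted. A symmetric per-tensor sketch — the submission's exact scheme (per-channel vs per-tensor, symmetric vs affine) is not stated:

```python
# Fake-quantize a weight list: scale to the signed integer grid, round,
# and rescale. bits=8 matches the QAT setting; bits=6 matches the artifact.
def fake_quant(w, bits=8):
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8, 31 for int6
    scale = max(abs(v) for v in w) / qmax or 1.0
    return [round(v / scale) * scale for v in w]

q = fake_quant([0.5, -1.0, 0.25])
```

Training against int8 rounding while shipping int6 means the artifact sees coarser rounding than training did, which the "naive int6" bullet below acknowledges.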
Compression
zstd
level: null
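The final artifact pipeline is plain bit-packing of int6 weights followed by general-purpose compression. A sketch of the packing step; zlib stands in for zstd below because zstd bindings (`zstandard`) are third-party, and the bit layout shown is an assumption:

```python
import zlib  # stand-in for zstd, which the submission actually uses

def pack_int6(values):
    # Pack 6-bit two's-complement values (-32..31) tightly: 4 values -> 3 bytes.
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        assert -32 <= v <= 31
        bits = (bits << 6) | (v & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                                  # left-align any trailing bits
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

packed = pack_int6([31, -32, 0, 5])            # 24 bits -> 3 bytes
artifact = zlib.compress(packed, 9)
```

Since no GPTQ or similar calibration is applied, the size reduction comes entirely from the 6-bit grid plus whatever redundancy the entropy coder finds.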

Novel Contributions

  • Micro Crawler submission with 3-seed mean evaluation
  • 4 flat XSA layers plus a shared crawler block repeated for 3 loops
  • Quantization-aware training with int8 settings and final int6 artifact
  • Naive int6 plus zstd compression without GPTQ
  • Use of 8 heads, 4 KV heads, bigram 2048, and RoPE 16