PR #1038

open

Add 1.20 BPB submission with Legal TTT and Calibration (9L/448D)

by Vibes-me
val_bpb: 1.2058
Architecture: Transformer
Optimizer:
Artifact Size: 15.87 MB

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
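The tied-embedding setup can be sketched as follows. The vocabulary size here is hypothetical; only the model dimension (512) comes from the submission's config.

```python
import numpy as np

# Weight tying sketch: a single matrix serves as both the input
# embedding table and the output (unembedding) projection, so those
# parameters are stored once in the artifact.
VOCAB, DIM = 1000, 512  # VOCAB is an illustrative placeholder

rng = np.random.default_rng(0)
E = rng.standard_normal((VOCAB, DIM)).astype(np.float32)  # shared table

def embed(token_ids: np.ndarray) -> np.ndarray:
    # Input side: row lookup into the shared table.
    return E[token_ids]            # (seq, DIM)

def logits(hidden: np.ndarray) -> np.ndarray:
    # Output side: project back with the transpose of the same table,
    # so no separate output matrix exists.
    return hidden @ E.T            # (seq, VOCAB)

h = embed(np.array([1, 2, 3]))
out = logits(h)
```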
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Model configuration uses 9 layers, 512 model dimension, 8 attention heads, 4 KV heads, and MLP multiplier 2.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
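The attention shapes implied by these numbers can be checked with a little arithmetic. The sketch below assumes separate Q/K/V/O projections and an ungated two-matrix MLP, which is an assumption about this submission, not something it states.

```python
# Shapes implied by the stated config: grouped-query attention with
# 8 query heads sharing 4 KV heads, MLP hidden size = 2 x d_model.
layers, d_model, n_heads, n_kv, mlp_mult = 9, 512, 8, 4, 2
head_dim = d_model // n_heads                    # 64

q_params = d_model * (n_heads * head_dim)        # full-width queries
kv_params = 2 * d_model * (n_kv * head_dim)      # K and V are half-width
o_params = (n_heads * head_dim) * d_model
mlp_params = 2 * d_model * (mlp_mult * d_model)  # up + down projections

per_layer = q_params + kv_params + o_params + mlp_params
total_core = layers * per_layer                  # excludes embeddings, norms
```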
Quantization
int8
bits: 8
scope: model weights
Compression
zlib
level: null
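A minimal sketch of the quantize, compress, and restore roundtrip. Per-tensor symmetric quantization is an assumption; the submission states only int8 over model weights with zlib compression.

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization (assumed scheme): map the
    # largest magnitude to 127 and round the rest to int8.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

q, scale = quantize_int8(w)
artifact = zlib.compress(q.tobytes(), level=9)   # bytes written to disk

# Restore: decompress, reinterpret as int8, rescale to float.
q2 = np.frombuffer(zlib.decompress(artifact), dtype=np.int8)
w2 = dequantize(q2, scale)

max_err = float(np.abs(w - w2).max())            # bounded by ~scale/2
```

The roundtrip is lossless from the int8 artifact onward; all error comes from the initial rounding step.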

Novel Contributions

  • Legal test-time training (TTT)
  • Calibration
  • Weight tying
  • Long-context training with sequence length 2048
  • Int8 quantized roundtrip submission
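The "legal" in legal test-time training means the model may update on test tokens it has already scored, since each prediction depends only on the preceding context. The submission's actual procedure (presumably gradient updates to the transformer) is not specified; the idea can be illustrated with a tiny adaptive byte-level model:

```python
import math
from collections import Counter

# Toy illustration of legal TTT: an add-alpha unigram model over bytes
# that "trains" on each test byte only AFTER scoring it, so no future
# information leaks into any prediction.
def bpb_with_ttt(data: bytes, alpha: float = 1.0) -> float:
    counts = Counter()
    total_bits, seen = 0.0, 0
    for b in data:
        # Score byte b using counts over bytes 0..i-1 only.
        p = (counts[b] + alpha) / (seen + 256 * alpha)
        total_bits += -math.log2(p)
        # Update on the byte after it has been scored.
        counts[b] += 1
        seen += 1
    return total_bits / len(data)

uniform = bpb_with_ttt(bytes(range(256)))  # no repetition to exploit
skewed = bpb_with_ttt(b"a" * 1000)         # adapts quickly, bpb well below 8
```

On the repetitive stream the model adapts within a few bytes and the cost per byte drops far below the 8 bits of a uniform prior, which is the effect TTT aims for on real test text.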