val_bpb: 1.8389
Architecture: Transformer
Optimizer: —
Artifact Size: 13361078 bytes (≈13.4 MB)
Training Techniques
Architecture
tied embeddings
Ties the input embedding and output projection weights, reducing the parameter count (see the sketch after this subsection).
parameters: null
KV head count
Uses fewer key/value heads than query heads (grouped-query attention); see the sketch after this subsection.
parameters: {"num_heads":8,"num_kv_heads":4}
Quantization
QAT
Applies QAT-style fake quantization to the weights during training (see the sketch below).
bits: null
scope: all
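A minimal sketch of QAT-style fake quantization with a straight-through estimator. The 8-bit default and per-tensor symmetric scaling are assumptions (the card records bits: null, scope: all).

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize w in the forward pass; gradients flow through unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Straight-through estimator: forward sees w_q, backward sees identity.
    return w + (w_q - w).detach()
```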
Evaluation
sliding window eval
Evaluates with an overlapping sliding window so each scored token sees long left context (see the sketch below).
parameters: {"stride":64,"context_length":4096}
Test-Time Training
LoRA TTT
Adapts the model at test time with rank-8 LoRA adapters (see the sketch below).
parameters: {"rank":8}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
Learning-rate warmdown: the LR is decayed over the final warmdown_iters iterations of training (see the sketch below).
parameters: {"warmdown_iters":20000}
Other
other
Selective FP16 passthrough for a few sensitive tensors during training (see the sketch below).
parameters: null
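The card does not say which tensors are passed through or how; this sketch assumes a name-based allowlist (embeddings and norms are a common guess) that skips fake quantization and keeps those tensors in FP16, reusing the fake_quantize helper from the QAT sketch above.

```python
import torch

FP16_PASSTHROUGH = ("tok_emb", "norm")                  # assumed name fragments

def maybe_quantize(name: str, w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    if any(tag in name for tag in FP16_PASSTHROUGH):
        return w.to(torch.float16)                      # sensitive tensor: FP16 passthrough
    return fake_quantize(w, bits)                       # everything else: QAT fake quant
```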
Novel Contributions
- Negative-result submission for the 10-minute, 16MB track
- 10-layer, 4K-context training run
- Overlapping sliding-window evaluation
- Rank-8 LoRA test-time training
- QAT-style fake quantization during training
- Selective FP16 passthrough for sensitive tensors
- Documentation of coverage collapse under the 10-minute budget