Emerging Techniques

Methods not yet mapped to a deep-dive concept. These are new or uncommon techniques worth watching.

1022 unmapped methods
Other | unknown | 910 PRs | 0
Architecture | XSA | 478 PRs | 0.0180
Architecture | MLP3x | 382 PRs | 0.0274
Architecture | Partial RoPE | 335 PRs | 0.0180
Architecture | LeakyReLU | 293 PRs | 0.0180
Architecture | weight tying | 288 PRs | 0.0180
Sequence Length | 2048 | 279 PRs | 0.0905
Test-Time Training | score-first TTT | 264 PRs | 0.0274
Architecture | KV head count | 245 PRs | 0.0180
Architecture | GQA | 245 PRs | 0.0235
Compression | lzma | 243 PRs | 0.0638
Architecture | depth recurrence | 240 PRs | 0.1156
Architecture | tied embeddings | 211 PRs | 0.2952
Compression | zlib | 210 PRs | 0.0280
Quantization | int6 | 164 PRs | 0.0180
Architecture | RoPE | 150 PRs | 0.0180
Regularization | logit softcap | 137 PRs | 0.0180
Regularization | LN scale | 127 PRs | 0.0280
Regularization | layerwise LN scale | 124 PRs | 0.0281
Quantization | QAT | 122 PRs | 0.1653
Test-Time Training | full TTT | 116 PRs | 0.3964
LR Schedule | cosine decay | 116 PRs | 0.1003
Architecture | VE128 | 113 PRs | 0.0180
Optimizer | Parallel Muon | 109 PRs | 0.0830
Quantization | STE QAT | 106 PRs | 0.1653
Quantization | GPTQ-lite | 106 PRs | 0.0280
Quantization | late QAT | 94 PRs | 0.0939
Sequence Length | unknown | 86 PRs | 0.0109
Architecture | Gated Attention | 73 PRs | 0.0281
Architecture | Value Residual | 69 PRs | 0.0235
Weight Averaging | EMA + SWA | 67 PRs | 0.1154
Regularization | magnitude pruning | 64 PRs | 0.0235
Compression | brotli | 61 PRs | 1.0050
Quantization | mixed int6/int8 | 60 PRs | 0.4416
Optimizer | SGD | 57 PRs | 0
Evaluation | stride-based eval | 54 PRs | 0.0214
Architecture | ReLU² | 50 PRs | 0.0972
Regularization | gradient clipping | 45 PRs | 0.7227
Architecture | parallel residuals | 41 PRs | 1.0616
Sequence Length | 4096 | 38 PRs | 1.0217
Compression | Brotli | 34 PRs | 0.9300
Architecture | TrigramHash | 31 PRs | 0.9850
Architecture | LN Scale | 28 PRs | 0.4027
Weight Averaging | EMA + Tight SWA | 28 PRs | 0.0280
Quantization | fp16 | 26 PRs | 1.1361
Test-Time Training | TTT | 23 PRs | 0.0214
Quantization | int5 | 22 PRs | 0.0274
Sequence Length | 8192 | 22 PRs | 1.0722
Quantization | mixed int6 | 19 PRs | 0.0972
Architecture | MLP | 19 PRs | 0.9650
Architecture | other | 18 PRs | 0.4961
Initialization | spectral init | 17 PRs | 1.1550
Architecture | SwiGLU | 16 PRs | 1.1175
Weight Averaging | Tight SWA | 16 PRs | 0.0830
Quantization | int6 QAT | 15 PRs | 0.9393
Architecture | MLP4x | 15 PRs | 1.0764
Sequence Length | 32768 | 15 PRs | 0.6580
Regularization | LN Scale | 14 PRs | 0.9123
LR Schedule | warmup | 13 PRs | 1.2073
Initialization | resid mix | 13 PRs | 1.0756
Optimizer | NorMuon | 13 PRs | 1.0824
Compression | custom | 12 PRs | 0
LR Schedule | warmup + warmdown | 12 PRs | 0.6364
Architecture | EMA | 12 PRs | 1.0597
Architecture | Hybrid | 12 PRs | 0.5755
Architecture | Value Embeddings | 11 PRs | 0.2951
Initialization | Orthogonal init | 11 PRs | 1.1204
Architecture | SWA | 11 PRs | 0.0180
Architecture | ValueEmbedding | 11 PRs | 0.6364
Architecture | GatedDeltaNet | 11 PRs | 1.0098
Architecture | QK-Gain | 11 PRs | 0.9354
Sequence Length | 256 | 10 PRs | 0.5755
Architecture | XSA4 | 10 PRs | 0.2841
Architecture | LeakyReLU² | 10 PRs | 0.9443
Architecture | Mamba | 10 PRs | 1.1470
Architecture | Eigenweight | 10 PRs | 1.3223
Architecture | depth | 9 PRs | 1.1431
Architecture | VRL | 9 PRs | 0.4416
Architecture | OrthoInit | 9 PRs | 0.2841
LR Schedule | linear warmup | 8 PRs | 1.1896
Architecture | Transformer | 8 PRs | 1.0717
Regularization | dropout | 8 PRs | 1.0824
Architecture | VE | 8 PRs | 1.1100
Test-Time Training | AdamW TTT | 8 PRs | 0.8265
Architecture | attention | 8 PRs | 1.0741
Architecture | Gated DeltaNet | 8 PRs | 1.0030
Architecture | Transformer depth | 7 PRs | 1.1550
Evaluation | long context eval | 7 PRs | 1.1147
Optimizer | Muon + Adam | 7 PRs | 1.1387
LR Schedule | cosine warmdown | 7 PRs | 1.1399
Architecture | MLP3.5x | 7 PRs | 0.1582
Regularization | pruning | 7 PRs | 0.2071
Regularization | label smoothing | 7 PRs | 0.8503
Weight Averaging | Polyak averaging | 7 PRs | 0.2071
Architecture | MLP activation | 6 PRs | 1.2302
Architecture | LoRA | 6 PRs | 0.9623
Architecture | residual mixing | 6 PRs | 1.0577
Architecture | logit softcap | 6 PRs | 0.9076
Quantization | mixed int5/int6/int8 | 6 PRs | 1.1172
Initialization | orthogonal init | 6 PRs | 0.9123
Quantization | int4 | 6 PRs | 1.0785
Architecture | Shared Value Embedding | 6 PRs | 1.1175
Architecture | LeakyReLU(0.5)^2 | 6 PRs | 0.4416
Sequence Length | 512 | 6 PRs | 1.2734
Quantization | STE QAT int6 | 5 PRs | 1.1502
Architecture | SwiGLU MLP | 5 PRs | 1.1558
Quantization | mixed int5/int6 QAT | 5 PRs | 1.1466
Architecture | ValueResidual | 5 PRs | 1.0896
Quantization | mixed int6/int5 | 5 PRs | 1.0924
Test-Time Training | none | 5 PRs | 0.1315
Optimizer | Muon + AdamW | 5 PRs | 1.1160
Sequence Length | 131072 | 5 PRs | 0.0214
Architecture | MLP expansion | 5 PRs | 1.1355
Architecture | sliding window eval | 5 PRs | 1.0167
Architecture | Parallel Residuals | 5 PRs | 1.0587
Architecture | sliding window attention | 4 PRs | 1.0205
Sequence Length | 16384 | 4 PRs | 1.1875
Architecture | skip connections | 4 PRs | 1.1702
Quantization | ternary | 4 PRs | 0.8100
Evaluation | stride-based sliding window eval | 4 PRs | 1.0541
Initialization | Orthogonal | 4 PRs | 1.1381
Architecture | Value Embedding | 4 PRs | 0.9370
LR Schedule | warmdown3500 | 4 PRs | 1.1228
Architecture | U-Net | 4 PRs | 0.9362
Architecture | XSA-all | 4 PRs | 0.5440
Compression | LZMA | 4 PRs | 0.2834
Regularization | CROWN-Q penalty | 4 PRs | 0.2071
Evaluation | multi-order n-gram backoff | 4 PRs | 0.6364
Architecture | GELU pre-enrichment | 4 PRs | 0.3922
Regularization | z-loss | 4 PRs | 1.1194
Sequence Length | 1536 | 4 PRs | 1.1763
Architecture | attention modifications | 4 PRs | 1.3288
Architecture | DeltaNet | 4 PRs | 0.7614
Sequence Length | 32000 | 4 PRs | 1.0362
Architecture | attention modification | 4 PRs | 1.0756
Architecture | MLP hidden size | 3 PRs | 1.1804
Architecture | iteration embeddings | 3 PRs | 1.2249
LR Schedule | linear warmup + warmdown | 3 PRs | 1.0321
Architecture | RMSNorm | 3 PRs | 1.2156
LR Schedule | cosine decay with linear warmup | 3 PRs | 1.0574
Architecture | Cross-Repeat Skip | 3 PRs | 1.1980
Quantization | mixed int8/int6 | 3 PRs | 1.0788
Initialization | overtone embedding init | 3 PRs | 1.1855
Initialization | overtone init | 3 PRs | 0.9393
Initialization | Overtone init | 3 PRs | 1.1551
Initialization | OvertoneInit | 3 PRs | 1.1739
Quantization | mixed int5/int6/int7 | 3 PRs | 1.1091
Architecture | MLA | 3 PRs | 1.2838
Quantization | mixed int4/int5 | 3 PRs | 1.1257
Test-Time Training | two-phase TTT | 3 PRs | 1.1216
LR Schedule | linear warmdown | 3 PRs | 1.1431
Architecture | BigramLogitHead | 3 PRs | 1.1522
Architecture | MoE | 3 PRs | 1.1180
Architecture | layerwise LN scale | 3 PRs | 1.0752
Quantization | int5/int6 | 3 PRs | 1.1417
Architecture | MLP width | 3 PRs | 1.1804
Architecture | FlashAttention-3 | 3 PRs | 0.6864
Test-Time Training | Legal TTT | 3 PRs | 1.0813
Regularization | SIGReg | 3 PRs | 1.2622
Architecture | XSA-4 | 3 PRs | 0.2873
Architecture | Value Residual Learning | 3 PRs | 0.3212
Quantization | FP8 | 3 PRs | 1.2064
Architecture | LeakyReLU^2 | 3 PRs | 0.9123
Architecture | EBLS | 3 PRs | 0.1156
Architecture | LeakyReLU(0.9)^2 | 3 PRs | 0.1434
Evaluation | kNN-LM | 3 PRs | 0.2532
Regularization | ratio loss | 3 PRs | 1.3595
Sequence Length | 6144 | 3 PRs | 1.1084
Test-Time Training | SLOT | 3 PRs | 1.0713
Quantization | int7 | 3 PRs | 1.0719
Optimizer | MuonEq-R | 3 PRs | 1.0679
Initialization | linear scale init | 3 PRs | 1.1077
Architecture | Parallel residuals | 3 PRs | 1.0785
Initialization | QK gain init | 2 PRs | 1.1131
Evaluation | error correction table | 2 PRs | 1.4370
Architecture | QK-norm | 2 PRs | 1.1873
Weight Averaging | LAWA | 2 PRs | 1.2302
Architecture | BitLinear | 2 PRs | 1.1770
Architecture | Loop Embedding | 2 PRs | 1.2065
LR Schedule | warmup and warmdown | 2 PRs | 1.1472
Architecture | phase-transition residual mixing | 2 PRs | 1.1565
Test-Time Training | online logit bias | 2 PRs | 1.1248
Optimizer | MuonAdamW | 2 PRs | 1.5918
Architecture | Mixture of Softmax | 2 PRs | 1.3921
Architecture | Backout | 2 PRs | 1.1364
Architecture | loop embeddings | 2 PRs | 1.0788
Architecture | resid_mix | 2 PRs | 1.1650
Architecture | memory tokens | 2 PRs | 1.1466
Architecture | Memory Tokens | 2 PRs | 1.1659
Test-Time Training | causal TTT | 2 PRs | 1.1257
Regularization | L1 regularization | 2 PRs | 1.1257
Architecture | Parameter Banking | 2 PRs | 1.1091
Architecture | MLP4 | 2 PRs | 1.3092
Architecture | activation | 2 PRs | 1.1370
Test-Time Training | unknown | 2 PRs | 1.1163
Quantization | Late QAT | 2 PRs | 1.0672
Architecture | SwiGLU FFN | 2 PRs | 1.0672
LR Schedule | warmdown + cosine decay | 2 PRs | 0.7227
Architecture | depth recurrence / weight sharing | 2 PRs | 1.1454
Architecture | GEPA | 2 PRs | 1.0920
Architecture | layers | 2 PRs | 1.1309
Initialization | zero-init | 2 PRs | 1.4707
Architecture | U-Net Skip Gates | 2 PRs | 1.1181
Architecture | SmearGate + BigramHash | 2 PRs | 1.1511
Architecture | Value Residual (ResFormer) | 2 PRs | 1.1330
Test-Time Training | SGD TTT | 2 PRs | 1.0857
Quantization | Full GPTQ | 2 PRs | 1.1175
Quantization | QAT-export alignment | 2 PRs | 1.1175
Architecture | Tied embeddings | 2 PRs | 1.1175
Quantization | int5 GPTQ | 2 PRs | 1.1162
Test-Time Training | score-first AdamW TTT | 2 PRs | 1.1172
Quantization | INT6 QAT | 2 PRs | 1.1412
Architecture | LeakyReLU(0.5)^2 activation | 2 PRs | 1.1194
Architecture | depth-scaled residual | 2 PRs | 0.7853
Architecture | Value Residual Learning (VRL) | 2 PRs | 1.1175
Quantization | int8 QAT | 2 PRs | 1.2791
Quantization | int6 QAT + GPTQ | 2 PRs | 0.9633
Architecture | SharedSparseSidecar | 2 PRs | 1.0574
Quantization | FP16 | 2 PRs | 0.0804
Architecture | Block Attention Residuals | 2 PRs | 1.1224
Architecture | Activation | 2 PRs | 1.1190
Evaluation | min-NLL epoch selection | 2 PRs | 0.5601
Quantization | mixed int6 GPTQ | 2 PRs | 1.1169
Quantization | int6 per-row with GPTQ-lite clip search | 2 PRs | 1.0944
Architecture | Tied Embeddings | 2 PRs | 1.0944
Regularization | weight pruning | 2 PRs | 1.1326
Quantization | int6 per-row | 2 PRs | 0.9581
Architecture | U-Net encoder/decoder | 2 PRs | 1.1239
Architecture | Factored tied embedding | 2 PRs | 1.1239
Optimizer | NeoMuon | 2 PRs | 1.1239
Evaluation | online n-gram cache eval | 2 PRs | 1.0920
Test-Time Training | TTT disabled | 2 PRs | 1.0920
Quantization | int5 QAT | 2 PRs | 1.1326
Quantization | mixed int5/int8 | 2 PRs | 1.0801
Architecture | ASQU | 2 PRs | 1.2164
Quantization | QAT + GPTQ | 2 PRs | 1.1186
Weight Averaging | SWA + EMA | 2 PRs | 1.1186
Test-Time Training | disabled | 2 PRs | 0.8960
LR Schedule | LR scheduling tuned for single-device run | 2 PRs | 1.4078
Sequence Length | 131000 | 2 PRs | 1.5252
Architecture | bigram embedding guard | 2 PRs | 1.5252
LR Schedule | adaptive cosine decay | 2 PRs | 0.2071
LR Schedule | Warmup-Stable-Decay cosine schedule | 2 PRs | 1.2824
Regularization | weight tying | 2 PRs | 1.3223
Architecture | BackoffNgramMixer | 2 PRs | 0.0308
Architecture | bidirectional transformer | 2 PRs | 1.1465
Architecture | value embeddings | 2 PRs | 1.1454
Architecture | loop embedding | 2 PRs | 1.1454
Evaluation | n-gram backoff cache | 2 PRs | 0.2834
Evaluation | two-pass n-gram rescoring | 2 PRs | 0.1315
Regularization | CROWN-Q | 2 PRs | 1.1105
Architecture | DenseFormer | 2 PRs | 1.3036
Evaluation | full-rescore | 2 PRs | 0.0165
Architecture | predictor MLP | 2 PRs | 1.1896
Evaluation | temperature sharpening | 2 PRs | 0.0804
Evaluation | temperature scaling | 2 PRs | 1.1303
Architecture | short convolution | 2 PRs | 1.2162
Architecture | JEPA | 2 PRs | 1.1085
Quantization | INT6 | 2 PRs | 1.1531
Architecture | SSM | 2 PRs | 1.1682
Regularization | structured pruning | 2 PRs | 1.1147
Architecture | FiLM | 2 PRs | 1.1646
Architecture | shared attention | 2 PRs | 1.3527
Architecture | bidirectional attention | 2 PRs | 1.3485
Evaluation | n-gram cache | 2 PRs | 0.2532
Architecture | EngramLite | 2 PRs | 1.1091
Quantization | mixed int6/int7/int5 | 2 PRs | 1.6371
Test-Time Training | Adam TTT | 2 PRs | 1.6371
Architecture | Unified Attention | 2 PRs | 1.1088
Quantization | mixed int6/int4 | 2 PRs | 1.1574
Architecture | error feedback | 2 PRs | 1.1163
Regularization | Jacobian proxy loss | 2 PRs | 1.1163
Architecture | attention projections | 2 PRs | 1.2207
Sequence Length | 524288 | 2 PRs | 1.1921
Architecture | SLOT | 2 PRs | 0.7406
Compression | brotli + lzma | 2 PRs | 1.0896
Architecture | SP2048 vocabulary | 2 PRs | 1.0955
Optimizer | L-BFGS | 2 PRs | 0.2282
Quantization | int6_awq | 2 PRs | 1.1834
Regularization | complementary training | 2 PRs | 0.3509
Architecture | GDN | 2 PRs | 1.0167
Architecture | pause tokens | 2 PRs | 1.3223
Architecture | Basis Sharing | 2 PRs | 1.3223
Architecture | Universal Transformer + ACT | 2 PRs | 1.3223
Evaluation | stateful-overlap eval | 2 PRs | 1.1473
Regularization | dispersion loss | 1 PR | 1.2244
Architecture | depth / layer count | 1 PR | 1.2139
Optimizer | Muon/Adam | 1 PR | 1.2139
Architecture | reduced depth | 1 PR | 1.1888
Regularization | residual scaling | 1 PR | 1.5283
Evaluation | test-time compute scaling | 1 PR | 1.5283
Evaluation | logit chunking | 1 PR | 1.8440
Evaluation | NTK-aware RoPE scaling | 1 PR | 1.2160
Quantization | mixed int6/int8 STE QAT | 1 PR | 1.1556
Quantization | mixed int6/fp16 | 1 PR | 1.1632
Architecture | depth/narrow transformer | 1 PR | 1.3509
Architecture | layer recurrence | 1 PR | 1.3281
Quantization | mixed int8/fp16 | 1 PR | 1.1884
Architecture | vocab size | 1 PR | 1.1858
Architecture | MLP2x | 1 PR | 1.2156
Architecture | MTP auxiliary head | 1 PR | 1.1605
Architecture | depth/width tradeoff | 1 PR | 1.3693
Quantization | mixed int6 quantization | 1 PR | 1.1648
Architecture | depth reduction | 1 PR | 1.3797
Architecture | logit softcapping | 1 PR | 1.7510
Initialization | proj zero-init | 1 PR | 1.7510
Initialization | resid_mix | 1 PR | 1.7510
LR Schedule | linear warmup + constant + cosine cooldown | 1 PR | 1.7510
LR Schedule | warmdown with LR floor and cooldown fraction schedule | 1 PR | 1.6372
LR Schedule | linear warmup + wallclock-aware linear warmdown | 1 PR | 1.2029
Architecture | layer looping | 1 PR | 1.2987
Architecture | NTK-RoPE | 1 PR | 1.1478
Test-Time Training | doc-isolated eval | 1 PR | 1.2045
Architecture | Transformer layers | 1 PR | 1.1876
Initialization | overtone spectral embedding initialization | 1 PR | 1.1876
Initialization | phase-transition residual-mix initialization | 1 PR | 1.1876
Architecture | GQA / KV head count | 1 PR | 1.1472
Architecture | linearized neural memory | 1 PR | 1.1844
Initialization | overtone spectral embedding init | 1 PR | 1.1844
Architecture | pre-enrichment block | 1 PR | 1.1855
Architecture | wider-shallower Transformer | 1 PR | 1.3043
Architecture | MLP2.75x | 1 PR | 1.1565
Regularization | grad clip | 1 PR | 1.1565
Architecture | pre-enrichment | 1 PR | 1.1629
LR Schedule | warmup/warmdown | 1 PR | 1.1598
Architecture | CTM workspace bridge | 1 PR | 1.2917
Initialization | q_gain init | 1 PR | 1.3825
Architecture | FlashAttention 3 | 1 PR | 1.1318
Evaluation | doc-isolated sliding window eval | 1 PR | 1.1929
Evaluation | partial-window fix | 1 PR | 1.1551
Quantization | int6 STE QAT | 1 PR | 1.1507
Architecture | encoder-decoder skip connections | 1 PR | 1.1719
Quantization | ternary VQ | 1 PR | 1.1719
Architecture | low-rank Q | 1 PR | 1.1548
Architecture | QK-Norm | 1 PR | 0.8100
Architecture | FlashAttention-2 | 1 PR | 0.8100
Architecture | LRU / state space model | 1 PR | 1.8480
Architecture | parallel scan | 1 PR | 1.8480
Architecture | gated projection | 1 PR | 1.8480
Architecture | ReLU^2 MLP | 1 PR | 1.8480
Architecture | num_layers | 1 PR | 1.1899
Architecture | persistent memory | 1 PR | 1.3446
Architecture | low-rank factorization | 1 PR | 1.3446
Architecture | 10-layer 4xMLP | 1 PR | 1.4444
Architecture | phase-transition resid_mix | 1 PR | 1.2036
LR Schedule | extended warmup | 1 PR | 1.2036
Quantization | mixed-bit lowbit export | 1 PR | 1.2036
Architecture | Transformer depth/width | 1 PR | 1.6660
Evaluation | validity-safe eval path | 1 PR | 1.2064
Evaluation | non-overlapping final eval | 1 PR | 1.2064
Sequence Length | 768 | 1 PR | 1.6114
Architecture | Transformer size | 1 PR | 1.6231
Quantization | mixed selective precision | 1 PR | 1.1554
Initialization | QK Gain Init | 1 PR | 1.5879
Initialization | QK_GAIN_INIT | 1 PR | 1.3003
Test-Time Training | tiny eval-time SGD | 1 PR | 1.2427
Architecture | depth sharing / shared-depth | 1 PR | 1.6577
Architecture | RMSNorm interface | 1 PR | 1.6577
Architecture | phase-conditioned scales | 1 PR | 1.6577
Evaluation | eval-time probability blending / context mixing | 1 PR | 1.2244
Evaluation | standard eval | 1 PR | 1.2334
Evaluation | int8+zlib roundtrip evaluation | 1 PR | 1.3274
Architecture | tokenizer/vocabulary size | 1 PR | 1.2827
Initialization | SVD spectral init | 1 PR | 1.1477
Regularization | compression-aware auxiliary loss | 1 PR | 1.2271
Architecture | low-rank K projection | 1 PR | 1.2271
Architecture | low-rank TD projection | 1 PR | 1.2271
Architecture | low-rank GRU state carry | 1 PR | 1.2271
Evaluation | neural cache | 1 PR | 1.4245
Regularization | 3% magnitude pruning | 1 PR | 1.1448
LR Schedule | warmdown cosine schedule | 1 PR | 1.1914
Architecture | Transformer depth / tied embeddings / KV head count | 1 PR | 1.1787
Initialization | spectral init / residual mixing | 1 PR | 1.1787
Architecture | Canon ACD | 1 PR | 1.1668
Architecture | Low-Rank Q | 1 PR | 1.2035
Architecture | 12 layers | 1 PR | 1.2035
Initialization | overtone spectral init | 1 PR | 1.2035
Quantization | int6 mixed | 1 PR | 1.1442
LR Schedule | fixed learning rates | 1 PR | 1.1442
Evaluation | cross-window KV caching | 1 PR | 1.1284
Architecture | loop gates | 1 PR | 1.2716
Initialization | zero initialization for loop embeddings | 1 PR | 1.2716
Initialization | uniform gate initialization | 1 PR | 1.2716
Architecture | depth recurrence / looped transformer | 1 PR | 1.1462
Architecture | Bigram features | 1 PR | 1.1462
Architecture | Per-head temperature scaling | 1 PR | 1.1450
Compression | zstandard | 1 PR | 1.1320
Architecture | Late-K FP16 | 1 PR | 1.1565
Initialization | Overtone SVD init | 1 PR | 1.1565
Architecture | per-layer scalars | 1 PR | 1.3323
LR Schedule | warmup schedule | 1 PR | 1.2459
Architecture | Differential Attention V2 | 1 PR | 1.8522
Architecture | low-rank Q delta | 1 PR | 1.8522
Architecture | loop position embeddings | 1 PR | 1.8522
Architecture | MLP width reduction | 1 PR | 1.1929
Architecture | 10L Transformer | 1 PR | 1.1400
Quantization | mixed int6 QAT | 1 PR | 1.1400
Quantization | ternary QAT | 1 PR | 1.1770
Architecture | MLP3.25x | 1 PR | 1.1770
Weight Averaging | EMA/SWA | 1 PR | 1.1770
Quantization | mixed int6/int8 with STE | 1 PR | 1.2421
Architecture | 11-layer U-Net Transformer | 1 PR | 1.1361
Test-Time Training | Reptile meta-learning TTT | 1 PR | 1.1257
Architecture | GQA + RoPE | 1 PR | 1.4072
Architecture | INL BetaMu attention | 1 PR | 1.4072
Architecture | Sort-Split MoE | 1 PR | 1.4072
Architecture | ALiBi | 1 PR | 1.4072
Architecture | Token-routed MoE | 1 PR | 1.4072
Architecture | PID Dynamics / INL Ultra-Lite | 1 PR | 1.4072
LR Schedule | cosine warm restarts (SGDR) | 1 PR | 1.4072
Test-Time Training | self-distillation TTT | 1 PR | 1.1257
Architecture | x0 residual mix | 1 PR | 1.4061
Architecture | Shared Value Embeddings | 1 PR | 1.1231
Architecture | encoder-decoder depth split | 1 PR | 1.2374
Architecture | learned per-dimension control knobs | 1 PR | 1.2374
Architecture | MLP3x/MLP4x | 1 PR | 1.2417
Initialization | phase-transition residual mixing | 1 PR | 1.2417
Architecture | depth reduction / encoder-decoder split | 1 PR | 1.2374
Architecture | per-dimension control parameters | 1 PR | 1.2374
Architecture | LayerNorm scale | 1 PR | 1.1247
Architecture | CANON | 1 PR | 1.1296
Initialization | CANON delta gate near-identity init | 1 PR | 1.1296
Test-Time Training | Self-Distillation TTT | 1 PR | 1.1287
Quantization | mixed int6/int5 QAT | 1 PR | 1.1227
Architecture | DiffTransformer V2 | 1 PR | 1.1715
Architecture | weight sharing / depth recurrence | 1 PR | 1.1454
Architecture | MLP×5 | 1 PR | 1.1454
Architecture | backout connection | 1 PR | 1.1466
Architecture | per-head temperature | 1 PR | 1.1466
Initialization | ortho+muP init | 1 PR | 1.1466
Quantization | mixed int5/int6 with fp16 embeddings | 1 PR | 1.2026
Architecture | depth and MLP width increase | 1 PR | 1.2026
Architecture | q_proj | 1 PR | 1.5295
Sequence Length | 960 | 1 PR | 1.5295
LR Schedule | short-to-full context warmup | 1 PR | 1.5295
Architecture | Radial Token Branch | 1 PR | 1.6130
Architecture | BitNet-style ternary projections | 1 PR | 1.6130
Architecture | MLP3x/4x MLP | 1 PR | 1.2392
Architecture | AuxNet | 1 PR | 1.2257
Architecture | smear transformation | 1 PR | 1.2257
Weight Averaging | EMA-SWA | 1 PR | 1.2219
Architecture | depth increase | 1 PR | 1.2219
Architecture | extra RMSNorm | 1 PR | 1.1933
Evaluation | manual logits-only exact evaluation | 1 PR | 1.2006
Architecture | Catalytic Residual Connections | 1 PR | 1.1466
Test-Time Training | SGD post-quantization | 1 PR | 1.1366
Initialization | Orthogonal + muP-scaled init | 1 PR | 1.1248
Regularization | grad_clip | 1 PR | 1.2055
Test-Time Training | score-first full-model TTT | 1 PR | 1.1532
LR Schedule | warmup + warmdown + cosine decay | 1 PR | 1.1532
Regularization | weight entropy regularization | 1 PR | 1.1490
Architecture | Kronecker attention | 1 PR | 1.1490
Architecture | skip-gram hash | 1 PR | 1.1490
Regularization | entropy token masking | 1 PR | 1.1490
Architecture | grouped-query attention | 1 PR | 1.2928
Quantization | int6 + zstd | 1 PR | 1.1446
LR Schedule | warmup + warmdown cosine decay | 1 PR | 1.1508
Quantization | mixed int5 | 1 PR | 1.1354
Evaluation | score-first per chunk evaluation | 1 PR | 1.1428
Architecture | per-layer scaling | 1 PR | 1.1454
Architecture | Catalytic Residuals | 1 PR | 1.1690
Quantization | mixed int6/int8 QAT | 1 PR | 1.1690
Quantization | int16 | 1 PR | 0.5000
Architecture | HECR quantum state vectors | 1 PR | 0.5000
Architecture | multi-kernel readout heads | 1 PR | 1.4574
Architecture | ComplexSSM | 1 PR | 1.4574
Architecture | causal self-attention | 1 PR | 1.4574
Architecture | U-Net skip connection | 1 PR | 1.4574
Quantization | mixed int6/int5/int4 | 1 PR | 1.1456
Architecture | Late Soft-Round QAT | 1 PR | 1.1185
Test-Time Training | score-first TTT with EB-adaptive per-layer scaling | 1 PR | 1.1185
Quantization | mixed int5/int6/int7 QAT | 1 PR | 1.1101
Regularization | LN scale depth damping | 1 PR | 1.1327
Architecture | SmearGate + BigramHash embeddings | 1 PR | 1.1591
Quantization | int8 with FP16 token embedding | 1 PR | 1.3162
Test-Time Training | skipped | 1 PR | 1.3162
Initialization | Orthogonal loop positions | 1 PR | 1.1478
Architecture | depth recurrence / recursive weight sharing | 1 PR | 1.1478
Initialization | QR-initialized orthogonal loop position embeddings | 1 PR | 1.1478
Architecture | Block AttnRes | 1 PR | 1.1925
Architecture | PhiAlpha Simple | 1 PR | 1.1925
Architecture | MLP width multiplier | 1 PR | 1.5248
Architecture | TrigramHashEmbedding | 1 PR | 1.5275
Optimizer | AdamW with Muon | 1 PR | 1.5275
Quantization | int6 + GPTQ-lite + QAT | 1 PR | 1.1181
Architecture | Value Embeddings (VE128) | 1 PR | 1.1181
Architecture | gated U-Net skip connections | 1 PR | 1.1558
Architecture | Catalytic residuals | 1 PR | 1.1558
Architecture | Value residual (ResFormer) | 1 PR | 1.1558
Regularization | embedding freeze | 1 PR | 1.1215
Architecture | U-Net gated skips | 1 PR | 1.1175
Quantization | Int6 QAT | 1 PR | 1.1175
Optimizer | MUD | 1 PR | 1.1989
Test-Time Training | delayed outside-context-only PPM | 1 PR | 1.1417
Regularization | EMA weights, LN Scale | 1 PR | 1.1428
Quantization | zstd | 1 PR | 1.1425
Architecture | Layer-Norm Scale | 1 PR | 1.1425
Regularization | freeze early layers | 1 PR | 1.1425
LR Schedule | custom tuning from multi-device to single-device scale | 1 PR | 1.4078
Architecture | Layer count | 1 PR | 1.1324
Architecture | Embeddings | 1 PR | 1.1324
Architecture | Attention | 1 PR | 1.1324
Quantization | mixed int8/fp16 with custom codebook quantization | 1 PR | 1.0487
Architecture | LeakyReLU(0.5)² activation | 1 PR | 1.1204
Architecture | LeakyReLU(0.5)² MLP | 1 PR | 1.1387
Quantization | fp8 | 1 PR | 1.1511
Architecture | DG Attention | 1 PR | 1.1898
Architecture | Flash Attention | 1 PR | 1.1898
Quantization | mixed int6/int8 with GPTQ-lite | 1 PR | 1.1804
Architecture | Shared VE128 | 1 PR | 1.1804
LR Schedule | Late QAT | 1 PR | 1.1804
Quantization | Early QAT | 1 PR | 1.1179
LR Schedule | warmup + warmdown cosine schedule | 1 PR | 1.0865
LR Schedule | auto warmdown | 1 PR | 1.1890
Architecture | FiLM conditioning | 1 PR | 1.1634
Architecture | BigramHash + TrigramHash | 1 PR | 1.1634
Architecture | KV heads | 1 PR | 1.1634
Compression | custom packed_zstd | 1 PR | 1.4612
Architecture | shared sparse sidecar | 1 PR | 1.0916
Architecture | value residual | 1 PR | 1.1160
Architecture | gated attention | 1 PR | 1.1160
Architecture | MLP3x with LeakyReLU(0.5)^2 | 1 PR | 1.1160
Initialization | loop gates initialized at 1.0 | 1 PR | 1.5348
Quantization | mixed int5 (MLP) / int6 (attention) + GPTQ-lite per-row clip search + 3% magnitude pruning + FP16 passthrough for embeddings + zstd-22 compression | 1 PR | 1.1354
Evaluation | sliding window eval + Test-Time Training (TTT) | 1 PR | 1.1354
Evaluation | score every epoch | 1 PR | 0.7853
Architecture | Partial RoPE + NTK-aware scaling | 1 PR | 1.1175
Quantization | 2% magnitude pruning post-quantization | 1 PR | 1.1175
Architecture | TrigramHash Embedding | 1 PR | 1.3434
Architecture | BigramHash Embedding | 1 PR | 1.3434
Architecture | Star-ReLU | 1 PR | 1.3434
Test-Time Training | score-first multi-pass legal TTT | 1 PR | 1.0523
Architecture | depth recurrence, weight tying, tied embeddings, RoPE, ReLU² MLP 3×, GQA | 1 PR | 1.1750
Quantization | int5 QAT + GPTQ | 1 PR | 1.1164
Regularization | 2% pruning | 1 PR | 1.1164
Test-Time Training | full TTT with SGD | 1 PR | 1.1207
Quantization | GPTQ with early QAT | 1 PR | 1.1215
Test-Time Training | Legal Score-First TTT | 1 PR | 1.1215
Architecture | Partial RoPE, XSA, BigramHash, VE128, SmearGate, logit softcap, tied embeddings | 1 PR | 1.1215
Evaluation | sliding window eval with stride 32 | 1 PR | 1.1215
Quantization | int6 per-row with GPTQ Hessian-aware quantization | 1 PR | 1.1355
Architecture | recursive weight sharing | 1 PR | 1.1355
Architecture | asymmetric weight sharing (Micro Crawler) | 1 PR | 1.1355
Architecture | bidirectional persistent deliberation gate | 1 PR | 1.1355
Architecture | input conditioning | 1 PR | 1.1355
Architecture | position embeddings | 1 PR | 1.1355
Optimizer | Muon (matrices) and AdamW (embeddings and scalars) | 1 PR | 1.1355
Weight Averaging | SWA and EMA | 1 PR | 1.1355
Quantization | full-run Int6 QAT with STE | 1 PR | 1.1489
Quantization | int5 quantization | 1 PR | 1.1489
Architecture | register tokens | 1 PR | 1.1233
Architecture | gated V-norm | 1 PR | 1.1233
Architecture | mixture of softmax | 1 PR | 1.1233
Quantization | int6 per-row with Hadamard rotation | 1 PR | 1.1365
Architecture | Shared Value Embeddings (VE128) | 1 PR | 1.1365
Architecture | Layer Norm Scale | 1 PR | 1.1365
Architecture | cuDNN SDPA | 1 PR | 1.1365
Quantization | mixed Int5/Int6 QAT | 1 PR | 1.1476
Architecture | Value Embed | 1 PR | 1.1476
Architecture | Embedding | 1 PR | 1.1476
Architecture | LN depth scaling | 1 PR | 1.1334
Architecture | Value embeddings | 1 PR | 1.1334
Architecture | Late QAT | 1 PR | 1.1334
Architecture | Hybrid Attention + Mamba SSM | 1 PR | 1.1828
Optimizer | Muon (matrix), Adam (scalar/embed) | 1 PR | 1.1828
Architecture | Orthogonal initialisation | 1 PR | 1.2364
Architecture | Bigram hash embeddings | 1 PR | 1.2364
Architecture | GQA (Grouped-Query Attention) | 1 PR | 1.2364
Quantization | QAT int6 | 1 PR | 1.2364
Quantization | STE QAT (late QAT) + Full GPTQ + Int5 MLP re-quantization + GPTQ-lite | 1 PR | 1.1418
Architecture | Value Residual (VR) | 1 PR | 1.1418
Architecture | Gated Attention (GA) | 1 PR | 1.1418
Architecture | BigramHash embeddings | 1 PR | 1.1418
Test-Time Training | SGD TTT (legal, cosine, per-layer) | 1 PR | 1.1418
Architecture | Per-head gated attention | 1 PR | 1.4750
Architecture | Looped middle blocks | 1 PR | 1.4750
Architecture | Selective ±1 magnitude pruning | 1 PR | 1.1154
Architecture | LeakyReLU(0.5)² MLP 3x | 1 PR | 1.1154
Quantization | Full Hessian GPTQ | 1 PR | 1.1154
Architecture | PartialRoPE | 1 PR | 1.1190
Architecture | LNScale | 1 PR | 1.1190
Architecture | ValueEmbed | 1 PR | 1.1190
Architecture | LateQAT | 1 PR | 1.1190
Architecture | K projection LoRA | 1 PR | 0.5601
Quantization | int6 per-row with GPTQ-lite | 1 PR | 1.1079
Architecture | K-Projection LoRA | 1 PR | 0.6864
Architecture | Residual Input Mixing | 1 PR | 1.1169
Quantization | mixed int5/int6 with QAT | 1 PR | 1.4222
Architecture | depth-scaled residuals | 1 PR | 0.9443
Architecture | 11L Shared | 1 PR | 1.1507
Architecture | skip_connections | 1 PR | 1.1507
Architecture | 1+7+1 layer stack | 1 PR | 1.1194
Architecture | SolarShield gating | 1 PR | 1.1194
Test-Time Training | No TTT | 1 PR | 1.1180
Regularization | freeze early layers during TTT | 1 PR | 1.0983
Architecture | Attention-Residuals | 1 PR | 1.2767
Quantization | INT6 GPTQ-lite | 1 PR | 1.1526
Quantization | Full Hessian GPTQ with amax-aligned QAT | 1 PR | 1.1171
Quantization | int6 uniform + GPTQ-lite | 1 PR | 1.1330
Architecture | MLP 3.5x with LeakyReLU(0.5)^2 | 1 PR | 1.1330
Architecture | XSA all 11 layers | 1 PR | 1.1330
Architecture | Tied FP16 embeddings | 1 PR | 1.1330
Other | unknown | 1 PR | 1.1330
Optimizer | Adam-style groups | 1 PR | 1.1234
Weight Averaging | SWA+EMA blend | 1 PR | 1.1158
Quantization | BitNet b1.58 ternary quantisation with FP8 QAT | 1 PR | 1.1570
Architecture | Fused QKV projection | 1 PR | 1.1570
Compression | Base-3 + LZMA | 1 PR | 1.1570
Regularization | Z-loss regularisation | 1 PR | 1.1570
Quantization | 1-bit binary quantisation | 1 PR | 1.1239
Architecture | YaRN positional encoding | 1 PR | 1.1239
Compression | bit-packing + LZMA | 1 PR | 1.1239
Regularization | Polynomial softcap with Z-loss regularisation | 1 PR | 1.1239
Optimizer | Muon and Adam for training; SGD with momentum for TTT | 1 PR | 1.0944
LR Schedule | cosine warmdown with linear warmup | 1 PR | 1.0944
Regularization | weight decay and layerwise LN scale | 1 PR | 1.0944
Initialization | muP scaling | 1 PR | 1.8990
Architecture | OLR-FW | 1 PR | 1.1349
LR Schedule | beta2 decay | 1 PR | 1.1349
LR Schedule | learning rate scaling | 1 PR | 1.1428
Architecture | layerwise residual mixing | 1 PR | 1.2073
Architecture | LN scaling | 1 PR | 1.2073
Architecture | Hybrid GDN/Transformer | 1 PR | 1.2093
Quantization | planned but not implemented | 1 PR | 1.2093
Architecture | Mish² Activation | 1 PR | 1.1552
Architecture | LayerNorm Scale | 1 PR | 1.1552
Optimizer | Parameter Banking + Parallel Muon | 1 PR | 1.1552
Architecture | Bigram Vocab | 1 PR | 1.1190
Architecture | MLP 3× | 1 PR | 1.1234
Architecture | Soft MoE | 1 PR | 1.1826
Test-Time Training | streaming legal TTT | 1 PR | 1.1208
Architecture | manifold-guided token interaction graph | 1 PR | 0.4380
Architecture | sparsemax routing | 1 PR | 0.4380
Architecture | spectrally-modulated gated hop cells | 1 PR | 0.4380
Architecture | manifold-guided attention | 1 PR | 0.4380
Architecture | parallel transport across token manifold | 1 PR | 0.4380
LR Schedule | cosine decay + hold + linear warmdown | 1 PR | 0.4380
Initialization | deterministic physics simulation initialization | 1 PR | 0.4380
Architecture | spiking MLP | 1 PR | 1.2982
Regularization | spike-rate regularization | 1 PR | 1.2982
Architecture | TRN hybrid | 1 PR | 1.4942
Quantization | SpinQuant/Hadamard | 1 PR | 1.1171
Quantization | Soft-Round QAT | 1 PR | 1.1171
Quantization | selective pruning | 1 PR | 1.1171
Architecture | QKV fusion | 1 PR | 1.1171
Architecture | LeakyReLU² stack | 1 PR | 1.0781
Architecture | KV GQA | 1 PR | 1.0781
Test-Time Training | validation set training | 1 PR | 1.2302
Initialization | SVD-based attention warm-start | 1 PR | 1.3525
Architecture | shared last layer | 1 PR | 1.3525
Architecture | Short Conv | 1 PR | 1.2164
Architecture | MoC | 1 PR | 1.2164
Architecture | BankedLinear | 1 PR | 1.2164
Architecture | MLP expansion adjustment | 1 PR | 1.2164
Initialization | depth-aware initialization | 1 PR | 1.2164
Architecture | wavelet-lite mixer | 1 PR | 1.1483
Architecture | TTT disabled | 1 PR | 1.1483
Initialization | wavelet init | 1 PR | 1.1483
Quantization | GPTQ-lite int6 | 1 PR | 1.4775
Architecture | Attention shift mixing | 1 PR | 1.0574
Architecture | K gain | 1 PR | 1.0574
Architecture | Local value residual | 1 PR | 1.0574
Architecture | JEPA encoder-decoder | 1 PR | 1.2622
Sequence Length | 2047 | 1 PR | 1.2622
LR Schedule | cosine recovery | 1 PR | 1.1194
Optimizer | Muon/AdamW | 1 PR | 1.1642
Architecture | GPTQ-lite | 1 PR | 1.1642
Architecture | Cache+Backout | 1 PR | 1.1176
Architecture | U-Net style skip connections | 1 PR | 1.2151
Evaluation | 5-gram eval interpolation | 1 PR | 1.0461
Architecture | LeakyReLU2 | 1 PR | 1.4239
Architecture | GQA attention | 1 PR | 1.3515
LR Schedule | standard LR scheduling tuned for single-device run | 1 PR | 1.5252
Architecture | Middle-Out Autoregressive Compressor (MOAC) | 1 PR | 0
Evaluation | sliding window eval with backward-looking 7-gram cache | 1 PR | 1.0717
Test-Time Training | score-first TTT-like cache update | 1 PR | 1.0717
Architecture | CastedLinear clip factor estimator | 1 PR | 1.5252
Architecture | 11L Transformer | 1 PR | 0.9674
Evaluation | n-gram eval cache | 1 PR | 0.9674
Initialization | ones-init | 1 PR | 1.1570
Architecture | Hedge Mixer | 1 PR | 1.0278
Quantization | mixed FP4/Int6 QAT | 1 PR | 1.5000
Initialization | DeepNorm init | 1 PR | 1.5000
Regularization | Z-Loss | 1 PR | 1.5000
Regularization | QK-Clip | 1 PR | 1.5000
Evaluation | online n-gram cache | 1 PR | 1.0909
Evaluation | multi-order n-gram cache interpolation | 1 PR | 0.9850
Architecture | CROWN-Q | 1 PR | 1.0222
Architecture | PairHash | 1 PR | 1.3684
Evaluation | full validation on fineweb_val_* split | 1 PR | 1.3684
Evaluation | multi-order backoff n-gram eval | 1 PR | 0.9625
Evaluation | adaptive alpha evaluation | 1 PR | 0.9625
Test-Time Training | MLP-down-only TTT | 1 PR | 1.1142
Test-Time Training | MLP-all TTT | 1 PR | 1.1142
Quantization | Int6 STE QAT | 1 PR | 1.1124
Test-Time Training | score-first full TTT | 1 PR | 1.1124
Architecture | LeakyReLU(0.5)^2 MLP | 1 PR | 1.0465
Regularization | LN scaling | 1 PR | 1.0465
Evaluation | backward-looking eval cache | 1 PR | 1.0465
Architecture | 8-layer architecture | 1 PR | 1.3092
Architecture | LeakyReLU^2 MLP | 1 PR | 0.9917
Evaluation | multi-order backoff n-gram eval cache | 1 PR | 0.9917
Evaluation | 7-gram backoff | 1 PR | 0.9633
Evaluation | adaptive n-gram backoff eval | 1 PR | 0.9209
Architecture | LeakyReLU(0.9)² | 1 PR | 0.8508
Evaluation | multi-order n-gram eval | 1 PR | 0.9370
Architecture | N-gram cache | 1 PR | 0.9258
Test-Time Training | Cosine TTT | 1 PR | 0.9258
Evaluation | entropy-adaptive cache blending | 1 PR | 0.9623
LR Schedule | none | 1 PR | 0.6683
Evaluation | legal score-first 7-gram backoff | 1 PR | 0.9362
Architecture | LeakyReLU² MLP | 1 PR | 1.5364
Evaluation | order-adaptive entropy-gated n-gram backoff cache | 1 PR | 0.9059
Evaluation | n-gram backoff | 1 PR | 1.0340
Architecture | U-Net-style skip structure | 1 PR | 1.2500
Evaluation | order-adaptive n-gram backoff cache | 1 PR | 0.8881
Regularization | layerwise LN scaling | 1 PR | 0.8881
Evaluation | 7-gram causal cache with entropy-adaptive blending | 1 PR | 0.6567
Evaluation | 7-gram n-gram cache | 1 PR | 0.8960
LR Schedule | dynamic wallclock cosine warmdown | 1 PR | 1.2005
Quantization | STE QAT / post-quant 6-bit | 1 PR | 1.2005
Evaluation | entropy-adaptive alpha | 1 PR | 0.9123
LR Schedule | late QAT activation based on LR scale threshold | 1 PR | 1.1807
Evaluation | multi-order backoff n-gram cache | 1 PR | 0.6678
Evaluation | distributed cache pre-fill | 1 PR | 0.6678
Initialization | asymmetric LoRA initialization | 1 PR | 1.0116
Evaluation | full evaluation | 1 PR | 1.0116
Evaluation | multi-GPU n-gram prefill | 1 PR | 0.6364
Evaluation | chunk-based sequential evaluation | 1 PR | 0.2952
Test-Time Training | score-first TTT-like n-gram cache | 1 PR | 0.9393
Architecture | BankLinear | 1 PR | 1.2236
Initialization | depth-aware mixing coefficient initialization | 1 PR | 1.2236
Architecture | larger MLP | 1 PR | 1.2236
Architecture | LeakyReLU MLP | 1 PR | 0.6671
Initialization | warm-start cubric initialization | 1 PR | 0.4820
Architecture | HWNODE | 1 PR | 0.5527
Architecture | spectral normalization | 1 PR | 0.5527
Architecture | adaLN timestep conditioning | 1 PR | 1.6252
Evaluation | variational bound evaluation with discrete absorbing-mask process | 1 PR | 1.6252
Architecture | learned level signals | 1 PR | 1.2604
Architecture | GatedAttn | 1 PR | 1.0896
Architecture | XSA6 | 1 PR | 1.0896
Architecture | BigramHash4K | 1 PR | 1.0896
Test-Time Training | legal TTT | 1 PR | 1.0896
Evaluation | order-adaptive entropy-gated BackoffNgramMixer | 1 PR | 0.5440
Evaluation | int8+zlib roundtrip eval | 1 PR | 1.4096
Architecture | Byte-level transformer | 1 PR | 1.1903
Architecture | JEPA auxiliary loss | 1 PR | 1.1903
Architecture | Linear gate head | 1 PR | 0.1663
Architecture | LeakyReLU_LegalTTT_ParallelMuon | 1 PR | 1.1219
Evaluation | fine-grained n-gram cache chunked evaluation | 1 PR | 0.2873
Test-Time Training | score-first legal TTT | 1 PR | 1.1157
Architecture | GPT depth increase | 1 PR | 1.1407
Architecture | MLP_MULT reduction | 1 PR | 1.1407
Architecture | Bigram embedding modification | 1 PR | 1.1407
Architecture | Token embedding / VE dimension reduction | 1 PR | 1.1407
Test-Time Training | LegalTTT | 1 PR | 1.1407
Architecture | GatedAttention | 1 PR | 1.1105
Architecture | U-Net encoder-decoder | 1 PR | 1.1105
Regularization | Late QAT soft-round STE | 1 PR | 1.1105
Evaluation | order-9 n-gram backoff cache | 1 PR | 0.3212
LR Schedule | WSD | 1 PR | 0.3212
Regularization | EMA | 1 PR | 0.3212
Architecture | Selective Scan (Mamba) | 1 PR | 1.1189
Evaluation | n-gram backoff with extended order | 1 PR | 0.1315
Evaluation | larger chunked cache refresh | 1 PR | 0.1315
Architecture | Output-LN | 1 PR | 1.2659
Architecture | Birkhoff mixing | 1 PR | 1.2659
Architecture | timestep scaling | 1 PR | 1.2659
Architecture | cross-repeat skip | 1 PR | 1.1454
Architecture | learned mixer head | 1 PR | 0.1582
Architecture | frozen n-gram oracle | 1 PR | 0.1582
Architecture | MHA 8/8 | 1 PR | 0.1582
Evaluation | score-first backward-looking n-gram cache | 1 PR | 0.1582
LR Schedule | matrix learning rate tuning | 1 PR | 0.1582
Evaluation | score-first eval | 1 PR | 0.1181
Evaluation | score-first n-gram backoff | 1 PR | 0.8004
Evaluation | vectorized 7-gram backoff + kNN-LM
1 PRs1.0467
ArchitectureRandomLinearWithAdapter
1 PRs1.6070
Regularizationentropy-reg QAT
1 PRs0.9958
ArchitectureParallel Muon
1 PRs0.1310
Regularizationsemantic tube regularization
1 PRs1.1821
ArchitectureQK RMSNorm
1 PRs1.1896
RegularizationVICReg
1 PRs1.1896
Architectureprojection heads
1 PRs1.1896
Architecturedoc_copy_ctx2
1 PRs1.8111
Sequence Length16300000
1 PRs1.8111
ArchitectureVE64
1 PRs0.8609
Evaluationfull-rescore n-gram cache
1 PRs0.1653
Architecturedifferential attention
1 PRs1.1580
Evaluationtwo-pass full-rescore
1 PRs0.0804
Evaluationfull validation eval
1 PRs1.4457
ArchitecturePacked causal memory
1 PRs0.0165
Evaluationscore-first causal evaluation
1 PRs1.1606
Evaluationtokenizer-agnostic val_bpb evaluation
1 PRs1.3178
Evaluationfull-rescore two-pass N-gram
1 PRs0
Evaluationorder-12 n-gram cache
1 PRs0.0881
Evaluationlong phrase cache
1 PRs0.0881
Evaluation65K chunking
1 PRs0.0881
Evaluation11-gram eval cache
1 PRs0.8609
ArchitectureKGIIR
1 PRs1.1184
Evaluationscore-first evaluation
1 PRs0.1154
Evaluationsingle-pass eval
1 PRs0.0638
Regularizationtemperature sharpening
1 PRs0.0638
Evaluationtwo-pass full rescore
1 PRs0.0830
Evaluationexact post-quant eval
1 PRs1.0857
ArchitectureLogisticContextMixer
1 PRs1.0362
QuantizationINT4
1 PRs1.1650
Architectureresid mix
1 PRs1.1682
Regularizationlogit bias
1 PRs1.6200
Initializationphase-mix init
1 PRs1.2115
EvaluationMC Dropout ensembling
1 PRs1.3250
ArchitectureMTP
1 PRs1.1185
Architectureanti-layer removal
1 PRs1.3631
ArchitectureLegal TTT
1 PRs1.1261
ArchitectureMHA
1 PRs1.1978
LR Schedulelinear warmup + cosine decay
1 PRs1.3379
ArchitectureBoxIntersectionMixer
1 PRs1.4242
ArchitectureGPT
1 PRs0.0109
Architecturevocab_bias
1 PRs1.1349
Architecturebias to pre-norms
1 PRs1.2542
LR Schedulelayer/depth schedule
1 PRs1.2542
QuantizationGPTQ mixed int6/int7
1 PRs1.1086
ArchitectureMTP heads
1 PRs0.4027
Evaluationcausal sequential chunk eval
1 PRs0.4027
ArchitectureJEPA bottleneck
1 PRs1.3355
ArchitectureadaLN
1 PRs1.1465
Architecturefrozen visible-token logits
1 PRs1.1465
Evaluationdiscrete absorbing-mask ELBO
1 PRs1.1465
Regularizationloss truncation
1 PRs1.1147
Evaluationdiscrete ELBO eval
1 PRs1.1465
Architectureiteration scales
1 PRs1.2249
ArchitectureHybridNorm
1 PRs0.2532
ArchitectureDifferential Attention
1 PRs0.2532
ArchitectureWaveletGPT
1 PRs0.2532
ArchitectureVGA
1 PRs0.2532
ArchitectureMulti-Token Prediction
1 PRs0.2532
EvaluationTurboQuant KV cache compression
1 PRs0.2532
Architecturetoken-shift mixing
1 PRs1.2252
Architectureattention window
1 PRs1.2252
ArchitectureFrozenRandomLinearWithLoRA
1 PRs1.3705
Sequence Length9000
1 PRs1.1187
Architecturehierarchical token processing
1 PRs0.6846
Architectureqk_gain
1 PRs1.1946
ArchitectureTurbo-Muon
1 PRs1.1091
ArchitectureMimetic V-O initialization
1 PRs1.1091
ArchitectureResidual lambdas
1 PRs1.1140
ArchitectureVE196
1 PRs1.1140
ArchitectureCache + backout
1 PRs1.1140
ArchitectureSSSL
1 PRs1.1801
Sequence Length448
1 PRs1.1493
Evaluationonline n-gram agreement eval
1 PRs1.1109
Evaluationautoregressive KV-cache eval
1 PRs1.6507
Evaluationautoregressive eval
1 PRs1.6507
Quantizationpolar
1 PRs1.7757
Test-Time Trainingrandom-map TTT
1 PRs1.1917
Test-Time TrainingTTT-Linear
1 PRs1.1347
ArchitectureFlowRefiner
1 PRs1.1347
Architecturehierarchical chunking
1 PRs1.3639
Architecturemulti-resolution processing
1 PRs1.3639
ArchitectureNativeFlowMatcher
1 PRs1.1199
Compressionbyte-shuffle
1 PRs1.1105
Architecturedepthwise Conv1D
1 PRs1.0577
ArchitectureDynamicChunker
1 PRs1.3587
EvaluationTriton eval kernels
1 PRs1.3560
Architecturecrawler bottleneck
1 PRs1.1761
Architectureshared TAP encoder connections
1 PRs1.1761
CompressionrANS
1 PRs1.1601
ArchitectureCausal n-gram fix
1 PRs1.1084
ArchitectureTRN
1 PRs1.1915
Evaluationentropy-adaptive mixing
1 PRs1.4841
ArchitectureMonarch Matrices
1 PRs1.4841
Architecturemini-MoE
1 PRs1.1527
Architecturelogit bias
1 PRs0.9300
ArchitectureResidualScale
1 PRs1.1163
Regularizationfocal loss
1 PRs1.1460
ArchitecturePRP
1 PRs1.3527
Evaluationtwo-pass eval
1 PRs1.0903
EvaluationSLOT
1 PRs1.1240
ArchitectureWARP-Len
1 PRs1.0713
ArchitectureWARP-Pos
1 PRs1.0713
ArchitectureWARP-Type
1 PRs1.0713
RegularizationLeakyReLU
1 PRs1.0855
Quantizationmixed int5/int6 GPTQ
1 PRs1.0929
ArchitectureFA3
1 PRs1.1088
ArchitectureTTT
1 PRs1.1289
ArchitectureMLP adapters
1 PRs1.1100
LR Schedulehold-cosine
1 PRs1.2196
Test-Time TrainingContext-Only SLOT
1 PRs1.0963
QuantizationQ-LoRA
1 PRs1.8184
Architecturecoordinate embeddings
1 PRs1.8184
Evaluationonline n-gram agreement
1 PRs1.1078
LR Schedulesplit-LR
1 PRs1.1078
ArchitectureH-Net
1 PRs1.2070
Evaluationexact sequence matching
1 PRs1.1177
ArchitectureLatentPredictor
1 PRs1.3299
Initializationdepth-aware init
1 PRs1.2207
Regularizationadaptive focal loss
1 PRs1.3868
QuantizationVQ
1 PRs1.1948
Quantizationmixed int8/int7
1 PRs1.2079
LR Schedulelinear decay
1 PRs1.2079
Architecturestep embedding
1 PRs1.1763
Quantizationmixed int4/int8
1 PRs1.1763
Architecturemulti-model single representation
1 PRs1.2450
Architectureconv kernel
1 PRs1.2450
QuantizationProxQuant
1 PRs1.2200
ArchitectureSpiking-MLP
1 PRs1.3319
ArchitectureRBF
1 PRs1.3319
ArchitectureArcTan surrogate gradients
1 PRs1.3319
Architecturehomeostatic threshold adaptation
1 PRs1.3319
Architectureiter_embed
1 PRs0.8503
Architectureiter_gate
1 PRs0.8503
Regularizationrepeat penalty
1 PRs1.1196
Test-Time TrainingL-BFGS Causal SLOT
1 PRs1.0050
Test-Time TrainingCascaded 2-Phase L-BFGS
1 PRs1.0050
Test-Time TrainingDiscriminative per-block pre-quant TTT
1 PRs1.0050
Sequence Length128
1 PRs1.0050
Architectureencoder-decoder split
1 PRs1.1567
Test-Time TrainingFiLM-only TTT
1 PRs1.3151
Regularizationcompression-aware regularization
1 PRs1.4465
Architecturesignsq
1 PRs1.5390
QuantizationBF16 scales
1 PRs1.5390
Quantizationactivation binarization
1 PRs1.5390
ArchitectureSentencePiece 4096
1 PRs1.1020
ArchitectureSP4096
1 PRs1.0924
Evaluationquadrature over mask ratios
1 PRs1.3485
RegularizationSDClip
1 PRs1.0835
LR Schedulehigher LR compensation
1 PRs1.0913
ArchitectureGELU
1 PRs1.4192
Initializationinit_std
1 PRs1.4192
ArchitectureHadamard rotation
1 PRs1.4192
Evaluationn-gram tilt
1 PRs1.0801
Quantizationcodebook quant
1 PRs1.2067
RegularizationL2 loss
1 PRs1.2067
Architectureparallel blocks
1 PRs1.5207
Architecturedecoder depth
1 PRs1.5207
Architecturemodel width
1 PRs1.5207
OptimizerMousse
1 PRs1.1026
Quantizationmixed Q4/Q5/Q6
1 PRs1.1854
Architecturebyte-level input
1 PRs1.3496
Initializationlinear-by-depth scale init
1 PRs1.1834
Compressionint8
1 PRs1.3680
Quantizationmixed int7/int5
1 PRs1.1324
Evaluationhedge mixer
1 PRs1.1324
Quantizationmixed int6/int5/int4/fp16
1 PRs1.1465
ArchitectureDirectionalSemanticVec
1 PRs0.4118
ArchitectureHadamard Matrix
1 PRs0.4118
Evaluationmulti-seed evaluation
1 PRs0.4118
ArchitectureParallel Residual
1 PRs1.1056
Architecturedepth embeddings
1 PRs1.2066
Architectureweight scaling
1 PRs1.1156
ArchitectureSP8192
1 PRs1.0842
EvaluationBOS-reset non-overlap eval
1 PRs1.1995
Initializationlinear phase initialization
1 PRs1.1920
Initializationdepth-aware constant scale init
1 PRs1.1920
Evaluationfull validation comparison
1 PRs1.2206
Test-Time Trainingscore-first SLOT
1 PRs0.2282
ArchitectureJEPA-style regression transformer
1 PRs1.8658
QuantizationLeanICQ int3
1 PRs1.0872
QuantizationICQuant
1 PRs1.0872
RegularizationHessian clipping
1 PRs1.0788
EvaluationTap-In V6 cross-window
1 PRs1.0788
Regularizationloss weighting
1 PRs1.1146
ArchitectureV22
1 PRs1.4537
Compressionpyminify
1 PRs1.0742
ArchitectureTAP
1 PRs1.0742
ArchitectureANCHOR
1 PRs1.0742
Architectureparallel residual routing
1 PRs1.0850
LR Schedulebudget annealing
1 PRs1.2199
Regularizationspectral floor
1 PRs1.4352
RegularizationHessian-Aware SDClip
1 PRs1.0773
ArchitectureLoRA TTT
1 PRs1.0741
Architectureweight banking
1 PRs1.0783
Architecturehash embedding
1 PRs1.0783
ArchitectureSP1024 tokenizer
1 PRs1.0205
ArchitectureGated DeltaNet hybrid
1 PRs1.0171
ArchitectureLN scale
1 PRs1.1464
LR Schedulewarmup + stable + cosine decay
1 PRs1.3587
Regularizationgradient checkpointing
1 PRs1.3587
ArchitectureGDN-Hybrid
1 PRs1.0167
ArchitectureD-TPA
1 PRs1.2781
Sequence Length65536
1 PRs1.6644
ArchitectureVarLen Attention
1 PRs1.0719
Regularizationadaptive clip
1 PRs1.0719
RegularizationTWEO
1 PRs1.2299
Quantizationpost-training quantization
1 PRs1.0832
Regularizationskip gates
1 PRs1.0909
ArchitectureFMN
1 PRs1.4233
ArchitectureSparseBraidRegister
1 PRs1.4233
OptimizerFMNRiemannianAdam
1 PRs1.4233
Test-Time Trainingscore-first SGD
1 PRs1.0810
Architecturetrajectory-state readout
1 PRs1.0788
Test-Time TrainingqTTT
1 PRs1.1280
CompressionANS + brotli
1 PRs1.1280
Architecturerandom basis MLP
1 PRs1.2554
Architectureparallel residual lanes
1 PRs1.0809
ArchitectureQK depth ramp
1 PRs1.0809
ArchitectureMTP head
1 PRs1.2244
LR Schedulelinear cooldown
1 PRs1.2244
ArchitectureAttention Output Gate
1 PRs1.0573
Test-Time TrainingMP-SGD-TTT
1 PRs1.0759
ArchitectureK_KVShare_Wider
1 PRs1.0099
Quantizationfp8 e4m3
1 PRs1.4831
Evaluationchain-rule eval
1 PRs1.4831
LR Schedulelate loop onset
1 PRs1.1016
Test-Time Trainingreadout_only
1 PRs1.0976
ArchitectureGatedDeltaNet / Flash Linear Attention
1 PRs1.0339
Quantizationmixed int6/int7
1 PRs1.0740
Architecturebreadcrumb gating
1 PRs1.1803
Regularizationstochastic depth
1 PRs1.1803
Sequence Length786432
1 PRs1.0845
Quantizationmixed int4/int6/int8
1 PRs1.0785
ArchitectureQK gain
1 PRs1.0785
Quantizationmixed int8/int6/int4
1 PRs1.0785
QuantizationAWQ
1 PRs1.0785
Evaluationeval-only quantized path
1 PRs1.0722
Evaluationquantized-eval-only
1 PRs1.0722
InitializationQK gain
1 PRs1.3565
Architecturefactorized late layers
1 PRs1.2050
LR Schedulehealing phase
1 PRs1.2050
Initializationcustom random init
1 PRs1.2917
Evaluationfixed-depth eval
1 PRs1.0651