Experiments I'm running to test ideas and assumptions in the open.
Experiment #1
Can open-source models with structured orchestration match frontier model performance on real software engineering tasks?
This experiment pairs open-weight models with a multi-phase pipeline (planning, critique, execution, and review) to tackle SWE-bench Verified, a benchmark of 500 human-validated GitHub issues from real repositories, each requiring a code change.
My goal is to see whether open-weight models, given robust orchestration, can beat closed models. The orchestration here is a general-purpose planning harness that manages multi-phase workflows with explicit critique-and-gate patterns.
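To make the critique-and-gate idea concrete, here is a minimal sketch of how a gated multi-phase pipeline could be wired together. It is not the harness used in this experiment: the names Phase, RunLog, run_pipeline, and demo_phases are hypothetical, and the lambdas stand in for what would presumably be model calls and real checks (for example, running a test suite).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Phase:
    """One pipeline step plus the gate that decides whether its output may proceed."""
    name: str
    run: Callable[[str], str]    # hypothetical: would wrap a model call
    gate: Callable[[str], bool]  # hypothetical: would wrap a critique or check


@dataclass
class RunLog:
    completed: List[str] = field(default_factory=list)
    stopped_at: Optional[str] = None


def run_pipeline(task: str, phases: List[Phase], max_retries: int = 1) -> Tuple[str, RunLog]:
    """Run phases in order; a phase that never passes its gate halts the run."""
    artifact = task
    log = RunLog()
    for phase in phases:
        for _ in range(max_retries + 1):
            candidate = phase.run(artifact)
            if phase.gate(candidate):
                artifact = candidate
                log.completed.append(phase.name)
                break
        else:
            # Every attempt failed the gate: stop and report where.
            log.stopped_at = phase.name
            return artifact, log
    return artifact, log


def demo_phases() -> List[Phase]:
    """Stub phases mirroring planning, critique, execution, and review."""
    return [
        Phase("planning", lambda t: t + " | plan", lambda out: "plan" in out),
        Phase("critique", lambda t: t + " | critique", lambda out: True),
        Phase("execution", lambda t: t + " | patch", lambda out: "patch" in out),
        Phase("review", lambda t: t + " | approved", lambda out: out.endswith("approved")),
    ]


if __name__ == "__main__":
    result, log = run_pipeline("issue", demo_phases())
    print(result)
    print(log)
```

The design choice illustrated is that each phase's output only becomes the input to the next phase after its gate accepts it, so a rejected plan or patch cannot silently flow downstream.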
View methodology + results →