POM


Peter O'Malley


Experiments I'm running to test ideas and assumptions in the open.

Experiment #1

Can open-source models with structured orchestration match frontier model performance on real software engineering tasks?

This experiment pairs open-weight models with a multi-phase pipeline — planning, critique, execution, and review — to tackle SWE-bench Verified, a benchmark of 500 authentic GitHub issues requiring code changes.

My goal is to see whether open models, given robust orchestration, can beat closed models. By orchestration I mean a general-purpose planning harness that manages multi-phase workflows with explicit critique-and-gate patterns.
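To make the critique-and-gate idea concrete, here is a minimal sketch of a four-phase loop. It is not the harness used in the experiment: the names `run_pipeline`, `Attempt`, and the stub phase functions are hypothetical, and each phase is a plain function standing in for what would be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Attempt:
    """Hypothetical record of one pass through the pipeline."""
    issue: str
    plan: str = ""
    patch: str = ""
    log: list[str] = field(default_factory=list)


def run_pipeline(
    issue: str,
    plan: Callable[[str, str], str],             # (issue, feedback) -> plan
    critique: Callable[[str], tuple[bool, str]],  # plan -> (approved, feedback)
    execute: Callable[[str], str],                # plan -> patch
    review: Callable[[str], tuple[bool, str]],    # patch -> (approved, feedback)
    max_rounds: int = 3,
) -> tuple[Attempt, bool]:
    attempt = Attempt(issue)
    feedback = ""
    # Gate 1: the plan must pass critique before any code is written.
    for round_no in range(1, max_rounds + 1):
        attempt.plan = plan(issue, feedback)
        approved, feedback = critique(attempt.plan)
        attempt.log.append(f"plan round {round_no}: {'pass' if approved else 'fail'}")
        if approved:
            break
    else:
        # The loop ended without a break, so the gate never opened.
        return attempt, False
    attempt.patch = execute(attempt.plan)
    # Gate 2: the patch must pass review before it counts as a result.
    approved, _ = review(attempt.patch)
    attempt.log.append(f"review: {'pass' if approved else 'fail'}")
    return attempt, approved


# Stub phases for illustration only; real ones would prompt a model.
def stub_plan(issue: str, feedback: str) -> str:
    return f"fix {issue}" + (" and add tests" if feedback else "")


def stub_critique(plan_text: str) -> tuple[bool, str]:
    ok = "tests" in plan_text
    return ok, "" if ok else "plan has no tests"


def stub_execute(plan_text: str) -> str:
    return f"diff implementing: {plan_text}"


def stub_review(patch: str) -> tuple[bool, str]:
    return patch.startswith("diff"), ""
```

Calling `run_pipeline("issue-123", stub_plan, stub_critique, stub_execute, stub_review)` sends the first plan back for revision, accepts the second, then executes and reviews it. The point of the gates is that a weak plan is rejected cheaply, before any execution budget is spent on it.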

View methodology + results →