Appreciate you testing it out. Would you share what numbers you got?
Looking back at my own data, the first 7 gemini-3.1-flash-lite runs were also remarkably consistent: 61, 61, 61, 61, 63, 61, 60. It's not until run 8 that I get my first 48.
For gemma3:4b it's a similar story. That model makes up open source contributions, but for projects it starts with 25, 23, 28, 28, 28, 28, 28, and then suddenly 18.
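If you want to check your own runs the same way, here's a rough sketch of how one could spot the first run that breaks from the pattern. The `first_outlier` helper and the tolerance of 5 points are assumptions made up for this illustration, not something from the article; the score lists are the ones quoted above.

```python
from statistics import median

def first_outlier(scores, tolerance=5):
    """Return the 1-based run number of the first score that differs from the
    median of all earlier runs by more than `tolerance`, or None if no run does.

    Hypothetical helper; the tolerance of 5 is an arbitrary assumption.
    """
    for i in range(1, len(scores)):
        if abs(scores[i] - median(scores[:i])) > tolerance:
            return i + 1
    return None

# Scores quoted in this comment, in run order.
gemini_runs = [61, 61, 61, 61, 63, 61, 60, 48]
gemma_runs = [25, 23, 28, 28, 28, 28, 28, 18]

print(first_outlier(gemini_runs))
print(first_outlier(gemma_runs))
```

Comparing against the median of earlier runs rather than the mean keeps one odd early score from shifting the baseline much.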
I've seen a few people now mention that a frontier model doesn't have this effect, so I ended up trying out Opus 4.8, and I've gotta say, the data doesn't look that different. I can't embed images into a comment, but I've added a little update section to the article with that data.
In my experience, using GLM 5.1 (ollama cloud) almost always gave the same results on the same CV.
Great work! This is wild stuff. It looks like there was no evaluation done on these prompts at all 🤯