magnio 2 hours ago [-]
I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that fail as many LLMs as possible.
What if we create a benchmark that works like this and assigns Elo scores? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
vincnetas 1 hours ago [-]
We could call this "generative adversarial network" (GAN) :)
This kind of approach would generally still need human guidance, otherwise these models might get into weird niche corners of the problem space that would not be relevant to any real world project.
olmo23 26 minutes ago [-]
How do you prevent degenerate strategies? I could trivially give a model a SHA256 hash and ask it to provide the source input.
In class you'd probably want a rule saying at least one LLM should be able to figure out the answer, but in a head-to-head I'm not sure how to solve it.
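For illustration, here is a minimal sketch of how a head-to-head round could feed an Elo rating while guarding against unsolvable questions. The standard Elo expectation formula is real; everything else is an assumption made for this example: the function names, the K-factor of 32, and especially the `panel_solved` rule, which adapts the "at least one LLM should be able to figure out the answer" idea by voiding any round whose question no independent reference model could solve.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A scores against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    setter: float,
    solver: float,
    solver_solved: bool,
    panel_solved: int,
    k: float = 32.0,  # assumed K-factor, not taken from the thread
) -> tuple[float, float]:
    """Return new (setter, solver) ratings after one round.

    panel_solved is the (hypothetical) number of independent reference
    models that solved the question. If it is zero, the question is treated
    as degenerate (e.g. "invert this SHA256 hash") and the round is voided.
    """
    if panel_solved < 1:
        return setter, solver

    actual = 1.0 if solver_solved else 0.0
    delta = k * (actual - expected_score(solver, setter))
    # The game is zero-sum: whatever the solver gains, the setter loses.
    return setter - delta, solver + delta
```

With equal ratings of 1500, a solved and panel-verified question moves the solver up by 16 points and the setter down by 16; a question that no panel model solves leaves both ratings unchanged. This only shifts the problem to choosing the panel, so it is a starting point rather than a fix.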
_345 2 hours ago [-]
This makes so much sense as to why I've always felt that Opus 4.8 was leagues ahead of GPT 5.5. It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project
nsingh2 1 hours ago [-]
Why supply underspecified requirements in the first place? Both models are good at challenging assumptions/edge cases and asking questions to clarify, but seemingly only when explicitly asked (i.e. something like a "brainstorm" skill).
I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.
I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work), but seemingly gets in the way of "aesthetic" code, e.g. overly defensive code to cover all edge cases, even if unlikely.
There is seemingly also a tradeoff between flexibility vs instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, vs GPT-5.5 follows instructions better but perhaps at the cost of rigidity.
fooker 54 minutes ago [-]
> Why supply underspecified requirements in the first place?
Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)
antonvs 1 hours ago [-]
> Why supply underspecified requirements in the first place?
Minimizes effort, is the obvious answer.
cyberpunk 17 minutes ago [-]
Poor trade off, the model is then designing a massive chunk of your solution instead of you. With a good spec, bits of typo’d pseudocode, and slightly more effort than a couple of sentences they can actually produce passable software.
I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.
For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.
CSMastermind 32 minutes ago [-]
Man I don't know if I'm living in a crazy bubble or something but GPT 5.5 is lightyears better than Opus 4.8 for me to the point where I'm honestly wondering how you're evaluating them or what kind of work you're doing.
There's specific tasks that Opus does better on like Frontend Dev and Design but for anything else 5.5 just laps it.
dools 22 minutes ago [-]
Yeah I’ve been consistently underwhelmed by anthropic models, but then I don’t use their harness so maybe that’s it
hypfer 15 minutes ago [-]
Similarly, it explains to me why people found Claude so amazing, while I just thought "eh."
Tool expectations
zuzululu 55 minutes ago [-]
Same observation here. Opus 4.8 was significantly more mature (and I don't understand the people constantly defending GPT 5.5); it would even push back against anything off-putting, whereas GPT 5.5 will happily agree and do what is asked, though I would note that it takes several tries.
4.8 also requires more than one prompt but its output is significantly higher quality and offers more insight
Fable 5 is a different beast however.
re-thc 1 hours ago [-]
> It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.
At a high level. It misses low level or other non-functional requirements differently so I wouldn't say Opus is just strictly better.
It's also possible that it's just a harness problem more than model.
e9 1 hours ago [-]
I agree with you on the harness. I find that Claude can be good in any harness but GPT is only superior inside Codex.
facorreia 2 hours ago [-]
It's nice to see a new public benchmark from Snorkel. They're doing some pretty sophisticated stuff over there.
monster_truck 2 hours ago [-]
Once again I am asking: who are these people and what makes them more qualified than any of you to assess anyone or anything "as a senior engineer" (with the subtext being that none of you are, either)
re-thc 1 hours ago [-]
> who are these people and what makes them more qualified than any of you
Anyone can run something and make a web page. These people just do it instead of questioning. Main difference. If everyone asks "how could you" "are you qualified" then we have nothing but gatekeeping.
jonathanleane 4 hours ago [-]
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
jascha_eng 17 minutes ago [-]
I mean, these were all solved before, I assume, so 100%. Not by the same human, of course, but models are expected to be good across a variety of codebases while a human can specialize in one and learn. I think it's fair to compare to an individual who is used to working on a product.
I'm more interested in how fable would do
lacunary 3 hours ago [-]
presumably whatever the top model uses and then some, since the human can use the model.
I wonder if a model could score higher if it had a human at its disposal?
olmo23 22 minutes ago [-]
With a human at its disposal, it could probably count the number of R's in strawberry!
In all seriousness though, adding capabilities should not normally reduce the effectiveness of a model (within reason: don't pollute the context window with millions of useless tools).
pishpash 2 hours ago [-]
Maybe models should ask for human-in-the-loop input, as a matter of convention.
sinuhe69 34 minutes ago [-]
A model that can ask questions or ask for help when in doubt is indeed a major feat. None of the current frontier models can do that.
LiamPowell 3 hours ago [-]
> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, however this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
rhdunn 7 minutes ago [-]
This approach is effectively seeding the context with how you want the LLM to behave/operate ("senior reviewer", i.e. the style of the responses you want) and the context/domain in which the LLM is operating ("SWE-Bench").
This is common in system prompts and frames the responses.
For example, you'd get different responses saying:
1. you are a pirate writing sea shanties about programming;
2. you are a news reporter writing an article on physics;
3. you are a senior software engineer with complete knowledge of PostgreSQL.
For 1 you could get responses along the lines of the Wellerman sea shanty -- "There once was a program that was set to C ...".
The "make no mistakes" bit does look dubious. It would be interesting comparing the results with and without that bit and trying alternative ways of getting the same desired behavior.
FeepingCreature 56 minutes ago [-]
More importantly, I suspect this actually hinders the work. If the LLM does make a mistake, it's now incentivized to downplay it instead of acknowledging and correcting.
antonvs 56 minutes ago [-]
The “make no mistakes” admonition does seem pretty silly (it’s been skewered to death on yt), but… it’s easy to imagine how it might work. E.g. it could be interpreted as simply as “check your work”.
Of course, no-one seems to be (publicly) doing the comparative measurements that might allow us to reach rational conclusions here.
rhdunn 43 seconds ago [-]
I'm not sure if they've fixed this, but older models have a tendency to ignore negation, as `no`, `not`, etc. all occur frequently in the training data and so are weighted less strongly than the verbs and nouns.
The advice I've heard is to emphasize the traits you want, not discourage the traits you don't. So rather than saying "make no mistakes" you can do something like you suggested with writing it as "check your work" or "ensure you answer correctly and concisely".
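The with/without comparison suggested upthread could be set up roughly like this. This is only a sketch under stated assumptions: `model` is a hypothetical callable you would wire to whatever API you use, the `VARIANTS` wording is illustrative apart from the quoted "make no mistakes" prompt, and exact-match grading is a stand-in for a real grader.

```python
from typing import Callable, Dict, List, Tuple

ROLE = "You are a senior SWE-Bench reviewer."

# Three system prompts that differ only in the trailing instruction.
VARIANTS: Dict[str, str] = {
    "baseline": ROLE,
    "negated": ROLE + " Make no mistakes.",
    "positive": ROLE + " Check your work before answering.",
}


def compare_variants(
    model: Callable[[str, str], str],  # hypothetical: (system_prompt, question) -> answer
    cases: List[Tuple[str, str]],      # (question, expected_answer) pairs
) -> Dict[str, float]:
    """Return the fraction of cases answered correctly per prompt variant."""
    if not cases:
        raise ValueError("need at least one test case")

    scores: Dict[str, float] = {}
    for name, system_prompt in VARIANTS.items():
        correct = sum(
            1
            for question, expected in cases
            if model(system_prompt, question).strip() == expected
        )
        scores[name] = correct / len(cases)
    return scores
```

Running the same cases through all three variants, ideally many times per case since sampling is noisy, would at least show whether the "make no mistakes" suffix changes anything measurable.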
guilhermecgs 2 hours ago [-]
fable 5?
Madmallard 3 hours ago [-]
next round of trust me bro benchmarks
dozerly 2 hours ago [-]
Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.
funnywish 1 hours ago [-]
[flagged]
danpalmer 3 hours ago [-]
Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s
But seriously, as an industry we're terrible at assessing engineering levels, I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.
glaslong 2 hours ago [-]
Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
amrrs 3 hours ago [-]
As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real life like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring it?
allan_s 3 hours ago [-]
I think the source of your issue is in your statement itself: why do you want a task that evaluates things this broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And very certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
jocelyner 3 hours ago [-]
[flagged]
purple-leafy 3 hours ago [-]
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.
What you really need is an objective benchmark
eli 3 hours ago [-]
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
charcircuit 2 hours ago [-]
The issue is that you can't do unsupervised learning if you require humans.
echelon 3 hours ago [-]
> What you really need is an objective benchmark
"When are all the software engineers unemployed?"
purple-leafy 3 hours ago [-]
Not sure I follow haha
0xbadcafebee 2 hours ago [-]
The "tasteful solves" is codified cargo culting. The software industry has a tendency to anthropomorphize software while playing to the ego of the programmer. The programmer imagines they are creating a "beautiful" artistic expression. Good code becomes "tasteful", as a software artist must have "good taste" to tell the good software from the bad software. Good quality lacks "bad smells", because a good artist has fine senses (and everybody must like the same smells). "Fine craftsmanship", in code as in woodworking, means your finely-crafted work is "technically superior", so you can charge more money for something that could've been made cheaper and faster and done the same thing.
But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. The comparison between "making working software" with "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.
Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.
FeepingCreature 58 minutes ago [-]
Taste is just quality by instinct. At sufficient (and not all that long) timescales, a tasteless product will be more and more difficult to make work at all.
phreeza 29 minutes ago [-]
I think this is a complete misunderstanding of what people mean by taste in software engineering. Taste is more like the System 1 response one builds to code over time, which (ideally) captures the quality of the software beyond surface level: things like maintainability, composability, readability, and likelihood of hidden bugs. This is completely different from the question of whether the code fulfills the immediate task at hand, but also not the same as pure aesthetics.
9dev 1 hours ago [-]
I may be paid to build a machine, but I am a human and take pleasure in arbitrary acts of vanity. I value elegance, and will always favour elegant solutions in engineering and the design of machines, virtual or physical.
That’s the reason why I buy Apple products in private, because I value the design over the exorbitant prices they charge; and it’s the reason why I mull over code that’s already functional until it’s pleasing my ideas of elegance.
I can come up with all kinds of justifications and explanations why the code I’ve written a certain way is objectively better too - understandability matters to the next guy after all - but I won’t be ashamed for taking a certain pride in my work, even if nobody other than me ever values it. That’s fine.
When the LLMs finally take over coding altogether, you’ll have your raw, functional code. Won’t be long anymore. But for now, I’m a human, and I will do human things.
Dban1 1 hours ago [-]
As time passes we will have fewer and fewer literati
Eridrus 1 hours ago [-]
Most engineers are wrong (I obviously am the true arbiter of taste), but that doesn't mean there isn't better and worse code.
"Does it work" glosses over a bunch of things: is it fast, cheap, secure, reliable, easy to understand, easy to modify? And that's just for server software where you've nailed down all the functional requirements. Determining what the functional requirements are is its own question.
And all these other non-happy path requirements are somewhat in tension with each other, so what is ideal in one environment is not necessarily ideal in another.
And in particular, "easy to understand/modify" is truly subjective. Different people have different ideas of what easy to understand means. Even if we get to a world where AI is writing all our code, "easy to understand/modify for the AI" is still an important question. We've probably all seen prototypes that collapse under their own weight of slop by now.
sally_glance 46 minutes ago [-]
Well actually there is a reasonably objective standard defining software quality criteria on the source code level (ISO 5055). They also define 29 criteria for maintainability: https://www.it-cisq.org/coding-rules/