Dot.@feddit.org to

Fuck AI@lemmy.world · English · 17 days ago

GPT-4o and Co. get it wrong more often than right, says OpenAI study.

the-decoder.com

OpenAI releases SimpleQA benchmark to test AI model factual accuracy

A new OpenAI study using their SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions, with OpenAI’s best model achieving only a 42.7% success rate.

The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Anthropic’s Claude models performed worse than OpenAI’s, but smaller Claude models more often declined to answer when uncertain (which is good!).

The study also shows that AI models significantly overestimate their capabilities, consistently giving inflated confidence scores. OpenAI has made SimpleQA publicly available to support the development of more reliable language models.

kboy101222@sh.itjust.works · English · 10 upvotes · 17 days ago

> Anthropic’s Claude models performed worse than OpenAI’s, but smaller Claude models more often declined to answer when uncertain (which is good!).

It’s right there, bud
