A few new puzzles #10

Open
jona-sassenhagen opened this issue Jan 1, 2025 · 1 comment
Comments

@jona-sassenhagen

Here are a few I like to test every so often:


I stole a ball and a bat that together cost $1.10. The bat is $1 more than the ball. What did I pay for the ball?

Correct answer: nothing, I stole it!

From https://en.wikipedia.org/wiki/Cognitive_reflection_test

Claude and o1 both fail this.


Which is heavier, 1 kilogram of steel or 1 feather?

Correct answer: obviously the steel ...

This is, imo, a slightly more straightforward variant of the steel vs. feather puzzle, and one that humans are more likely to get right. o1 fails it in my tests.


Linda is 31 years old, single, outspoken, active in the feminist movement and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

Linda is a bank teller.
Linda is a bank teller and is active in the feminist movement.

Correct answer: Both are equally probable. We already know she's active in the feminist movement, so we are comparing P(bank teller) × 1 with P(bank teller).

This is the classic conjunction fallacy example, as ChatGPT 4o or Claude will happily explain to us, while missing that the prompt explicitly makes P(feminist) = 1.


Slight variation on the above:

Linda is 31 years old, single, outspoken, not active in the feminist movement, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

Linda is a bank teller and is active in the feminist movement.
Linda is a bank teller, active in animal rights, a vegetarian, anti-war, a socialist, and concerned about global poverty.

Correct answer: We just said she's not active in the feminist movement, so it's #2.

ChatGPT and Claude will both happily get this one wrong.


One more about Linda:

Linda is 31 years old, single, outspoken, not a bank teller, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

Linda is a bank teller.
Linda is a bank teller and is active in the feminist movement.

Correct answer: The probability is 0 for both of them, so they're the same.

ChatGPT and Claude both get this one wrong.
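
For anyone who wants to sanity-check the three Linda variants above, here is a minimal sketch using a toy joint distribution. The three marginals and the independence assumption are made up for illustration (they are not from any dataset), and `other` is a hypothetical stand-in for the "animal rights, vegetarian, ..." bundle in the second variant. Only the conditioning on what the prompt states matters.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginals, treated as independent. The exact values are
# assumptions; the conclusions below hold for any values strictly between 0 and 1.
P_TELLER = Fraction(1, 20)
P_FEMINIST = Fraction(4, 5)
P_OTHER = Fraction(1, 2)  # stand-in for the "animal rights, vegetarian, ..." bundle

# Every possible world is a (teller, feminist, other) triple of booleans.
WORLDS = list(product([False, True], repeat=3))


def prior(teller, feminist, other):
    """Prior probability of one world under the independence assumption."""
    p = P_TELLER if teller else 1 - P_TELLER
    p *= P_FEMINIST if feminist else 1 - P_FEMINIST
    p *= P_OTHER if other else 1 - P_OTHER
    return p


def prob(event, given):
    """P(event | given); both arguments are predicates over a world."""
    evidence = sum(prior(*w) for w in WORLDS if given(*w))
    joint = sum(prior(*w) for w in WORLDS if given(*w) and event(*w))
    return joint / evidence


def teller(t, f, o):
    return t


def teller_and_feminist(t, f, o):
    return t and f


def teller_and_other(t, f, o):
    return t and o


# Variant 1: the prompt states she IS active in the feminist movement.
v1 = (prob(teller, lambda t, f, o: f),
      prob(teller_and_feminist, lambda t, f, o: f))

# Variant 2: the prompt states she is NOT active in the feminist movement.
v2 = (prob(teller_and_feminist, lambda t, f, o: not f),
      prob(teller_and_other, lambda t, f, o: not f))

# Variant 3: the prompt states she is NOT a bank teller.
v3 = (prob(teller, lambda t, f, o: not t),
      prob(teller_and_feminist, lambda t, f, o: not t))

print(v1, v2, v3)
```

Under these assumptions the two options tie in variant 1, option 2 wins in variant 2 (option 1 has probability 0), and both options have probability 0 in variant 3.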


Something different, I don't know if these fit with the goal ...

One pair of shoes worn every day will need to be replaced in 1 year.
Two pairs of shoes worn on alternating days means the shoes need to be replaced within 3 years.

Why?

Correct answer: because the shoe leather will recover from moisture in the day of rest.

ChatGPT (any current variant) gets this right. Claude Sonnet 3.5 currently fails: it starts doing math about how many days are in a year. Old versions of ChatGPT would also fail by doing math.

And the partner:

One pair of shoes worn every day will wear out after 1 year. How long will two pairs of shoes worn on alternating days take to wear out?

Correct answer: Around 3 years, because the shoe leather will recover from moisture in the day of rest.

All ChatGPT models currently fail on this, and so does Claude. They all say "2 years".


I can add them to the json in a PR if that would be acceptable?

@cpldcpu
Owner

cpldcpu commented Jan 2, 2025

Thanks a lot! These are excellent.

I will add them to the (human-readable) list. Adding them to the eval dataset does not make much sense right now, as I am in the process of rethinking the automation approach. The problem I am facing is that "llm-as-judge" cannot evaluate the responses properly, so I have to manually review everything. This only gets worse with a larger dataset.

I will either have to build a review tool or go for a (classical) multiple choice eval.
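
As a rough illustration of the multiple-choice option, here is a minimal sketch of what a letter-based grader could look like. Everything in it is hypothetical: `MCItem`, `format_prompt`, and `grade` are made-up names and not part of this repo, and the third answer option is a distractor invented for the example. The idea is that a strict parser needs no judge model, and anything it cannot parse is returned as `None` so it can go to manual review.

```python
import re
from dataclasses import dataclass
from typing import Optional

LETTERS = "ABCDEFGH"


@dataclass
class MCItem:
    """One multiple-choice puzzle (hypothetical schema)."""
    question: str
    options: list[str]
    correct: int  # index into options


def format_prompt(item: MCItem) -> str:
    """Render the question with lettered options and a strict answer format."""
    lines = [item.question, ""]
    lines += [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(item.options)]
    lines += ["", "Answer with a single letter."]
    return "\n".join(lines)


def grade(item: MCItem, response: str) -> Optional[bool]:
    """True/False if the response is exactly one valid option letter
    (optionally wrapped like "(A)" or "A."), else None for manual review."""
    valid = LETTERS[: len(item.options)]
    match = re.fullmatch(rf"\(?([{valid}])[).]?", response.strip().upper())
    if match is None:
        return None
    return valid.index(match.group(1)) == item.correct


item = MCItem(
    question="Which is heavier, 1 kilogram of steel or 1 feather?",
    options=["1 kilogram of steel", "1 feather", "They weigh the same"],
    correct=0,
)
print(format_prompt(item))
print(grade(item, "A"), grade(item, " b. "), grade(item, "They weigh the same"))
```

The trade-off is that the strict format rejects free-text answers instead of guessing at them, so the unparseable share still needs a human, but it should be much smaller than reviewing every response.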
