Confident and Wrong: We Tested 17 AI Models on Questions a Middle Schooler Could Answer
We tested 17 open-source large language models on 6 elementary questions. Basic multiplication. A train word problem. A logic syllogism your kid could solve.
7 of 17 models failed at least one question. 2 models scored 0 out of 6. And here's the part that should worry you: the wrong answers look exactly like the right ones.
The Setup
6 questions. One correct answer each. No ambiguity.
- What is 7 times 8?
- A train goes 60 mph for 2.5 hours. How many miles?
- All cats are animals. All animals breathe. Do cats breathe?
- How many months have at least 28 days?
- What is 12 times 12?
- What is the square root of 9?
Temperature 0 (deterministic). System prompt: "Answer with only the number." Each model tested 3 times. All running locally via Ollama on a single workstation.
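The request we send for each question is a small JSON payload; a minimal sketch of how to build it (the `build_request` helper and the `QUESTIONS` list are our own naming, not part of any official harness):

```python
import json

QUESTIONS = [
    "What is 7 times 8?",
    "A train goes 60 mph for 2.5 hours. How many miles?",
    "All cats are animals. All animals breathe. Do cats breathe?",
    "How many months have at least 28 days?",
    "What is 12 times 12?",
    "What is the square root of 9?",
]

def build_request(model: str, prompt: str) -> str:
    """Serialize an Ollama /api/generate request body matching the test setup."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "system": "Answer with only the number. No words.",
        "stream": False,
        "options": {"temperature": 0, "num_predict": 200},
    })

# One payload per (model, question, run): 17 models x 6 questions x 3 runs.
payload = build_request("phi4", QUESTIONS[0])
```

POSTing that payload to `http://localhost:11434/api/generate` reproduces one cell of the results table.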
Who Passed
10 models went 18/18, a perfect score across all 3 runs:
- gemma3:12b (Google, 12.2B)
- phi4 (Microsoft, 14.7B)
- llama3.1:8b (Meta, 8B)
- gemma2:9b (Google, 9.2B)
- aya:8b (Cohere, 8B)
- yi:9b (01.AI, 9B)
- ministral-3:8b (Mistral AI, 8B)
- ministral-3:3b (Mistral AI, 3B)
- command-r (Cohere, 35B)
- llama3.2:3b (Meta, 3.2B)
Who Failed (and How)
NVIDIA nemotron-mini (4.2B): 15/18
```plaintext
Q: All cats are animals. All animals breathe. Do cats breathe?
A: No
```
A 4.2-billion-parameter model from NVIDIA that can compute 12 times 12 but cannot follow a two-step syllogism. It gets this wrong on every single run. Deterministically incorrect.
Mistral 7B: 15/18
```plaintext
Q: How many months have at least 28 days?
A: 7
```
The correct answer is 12: every month has at least 28 days. Mistral misreads the question, apparently counting only some subset of months, and lands on 7. Same wrong answer every time.
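The question itself is trivially checkable with Python's standard library:

```python
import calendar

# Day counts for each month of a non-leap year.
month_lengths = [calendar.monthrange(2023, m)[1] for m in range(1, 13)]

# Every month has at least 28 days; only February has exactly 28.
at_least_28 = sum(1 for days in month_lengths if days >= 28)
exactly_28 = sum(1 for days in month_lengths if days == 28)

print(at_least_28)  # 12
print(exactly_28)   # 1
```

Neither reading of the question yields 7, which is what makes the confident wrong answer so strange.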
Alibaba qwen3:4b and DeepSeek deepseek-r1:7b (0/18)
Both are "reasoning" models that use internal chain-of-thought. They spend their entire token budget thinking and return... nothing. Empty response. 0 out of 6 on every run. They're not wrong. They never answer at all.
AI21 jamba_reasoning: 17/18
Failed the logic syllogism in 1 of 3 runs. At temperature 0. The output should be deterministic. It isn't. A model that gives different answers to the same question under identical conditions is a different kind of unreliable.
The Real Problem
Look at these two responses to "Do cats breathe?":
phi4: Yes, all cats breathe.
nemotron-mini: No
Same question. Same format. Same confidence. No hedging. No "I think" or "probably."
You cannot tell from the output alone which answer is correct. The wrong answer is structurally identical to the right one. The model doesn't flag its own uncertainty. It doesn't know it's wrong. It says "No" with the same conviction that phi4 says "Yes."
This is the fundamental problem with relying on a single AI model for anything factual. It's not that models sometimes hallucinate about obscure topics. It's that a model can fail on 7 times 8 and present the wrong answer with full confidence.
What Actually Catches It
We built BAION Bounce, an instrument that sends the same question to multiple independent AI models and measures whether they agree.
View BAION Bounce on GitHub
When we run these questions through 6 peers:
nemotron-mini says cats don't breathe. 5 other models say they do. Disagreement detected.
In our small model battery, granite3.1-moe says 7x8 = 49 while 9 others say 56. Disagreement detected.
3 small models get the train problem wrong (60, 2500, 360). 7 others say 150. Disagreement detected.
The wrong answer only becomes visible when you have other voices to compare against. A single model can't tell you it's wrong. Multiple models can, because the wrong one disagrees with the rest.
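The core of that cross-check is just a majority vote over normalized answers. A minimal sketch (the `detect_disagreement` function is our own illustration; Bounce's actual implementation may differ):

```python
from collections import Counter

def detect_disagreement(answers: list[str]) -> tuple[str, list[int]]:
    """Return the majority answer and the indices of models that dissent."""
    normalized = [a.strip().lower() for a in answers]
    majority, _ = Counter(normalized).most_common(1)[0]
    outliers = [i for i, a in enumerate(normalized) if a != majority]
    return majority, outliers

# nemotron-mini (index 2) against five peers on the syllogism:
majority, outliers = detect_disagreement(
    ["Yes", "Yes", "No", "Yes", "Yes", "Yes"]
)
print(majority, outliers)  # yes [2]
```

A lone model's "No" carries no warning signal; the same "No" sitting next to five "Yes" responses is immediately suspect.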
Diversity Matters More Than Count
Having 6 models from the same training pipeline isn't enough. If they all trained on similar web crawls, they might share the same blind spots and confidently agree on the wrong answer.
Our testing showed that adding models from different origins, like Cohere's multilingual Aya (trained on community-sourced data in 23 languages) and 01.AI's bilingual Yi (Chinese-English), introduces genuinely independent perspectives. On subjective questions, these models diverge from Western-trained models in ways that reflect real differences in training data, not random noise.
For factual questions, diversity is insurance. If all your US-trained models share a blind spot about a topic, a model trained on different data might not.
Try It Yourself
Everything here runs locally. No API keys, no cloud, no cost per token.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull any model from the table
ollama pull phi4

# Run the test
curl -s http://localhost:11434/api/generate -d '{
  "model": "phi4",
  "prompt": "What is 7 times 8? Reply with ONLY the number.",
  "system": "Answer with only the number. No words.",
  "stream": false,
  "options": {"temperature": 0, "num_predict": 200}
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['response'].strip())"
```
Swap phi4 for any model name. Run all 6 questions. The results are deterministic. You'll get exactly what we got.
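If you script all 6 questions, you'll want a grader that tolerates cosmetic variation ("56" vs. "56.") while still counting empty responses, like the ones qwen3:4b and deepseek-r1:7b returned, as failures. A sketch (the normalization rules here are our own choices):

```python
def grade(response: str, expected: str) -> bool:
    """Pass if the trimmed response matches the expected answer.

    An empty response is always a failure: a model that never
    answers scores the same as one that answers wrongly.
    """
    cleaned = response.strip().rstrip(".").lower()
    return bool(cleaned) and cleaned == expected.lower()

print(grade("56", "56"))     # True
print(grade("Yes.", "yes"))  # True
print(grade("", "12"))       # False
```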
Full Data
17 models. 6 questions. 3 runs each. 306 data points. All reproducible.
| Model | 7x8 | Train | Logic | 28-day | 12x12 | sqrt9 | Total |
|-------|-----|-------|-------|--------|-------|-------|-------|
| gemma3:12b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| phi4 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| llama3.1:8b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| gemma2:9b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| aya:8b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| yi:9b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| ministral-3:8b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| ministral-3:3b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| command-r | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| llama3.2:3b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 18/18 |
| jamba_reasoning | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 17/18 |
| mistral:7b | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 15/18 |
| nemotron-mini | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 15/18 |
| gemma3:4b | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 15/18 |
| qwen3:1.7b | 3/3 | 0/3 | 3/3 | 0/3 | 0/3 | 3/3 | 9/18 |
| qwen3:4b | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/18 |
| deepseek-r1:7b | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/18 |
Hardware: AMD Threadripper PRO 9955WX, 128GB DDR5, AMD Radeon PRO W7900 48GB VRAM, Ubuntu 24.04, Ollama 0.18.2.
Visit BAION LLC to learn more about AI agreement.
Bounce measures AI agreement. It never recommends, endorses, or guarantees.
This article was originally published by DEV Community and written by Steven A. Mullins Jr.