Results from running standard industry benchmarks — HumanEval and MMLU — directly against Mindhawk's live API. All scores are reproducible using the open-source test scripts below.
164-problem benchmark of Python function completion. Each problem is graded by executing the model's output against official unit tests. Industry standard for measuring coding ability across LLMs.
Note: Agent mode scores slightly lower on clean benchmark problems — the reasoning overhead adds noise where a direct one-shot answer is already optimal. In real-world multi-step tasks, agent mode is the superior path.
Simple mode ran 100 of the 164 problems. Agent mode ran the first 20. ✓ = unit tests passed · ✗ = unit tests failed · — = not run
| Problem | Simple Mode | Agent Mode | Problem | Simple Mode | Agent Mode |
|---|---|---|---|---|---|
Massive Multitask Language Understanding. 57 academic subjects covering STEM, law, medicine, philosophy, and social science. Tests inherent knowledge — no retrieval or tool use. Standard benchmark for frontier model comparisons.
| Subject | Mindhawk | DeepSeek (pub.) | Status |
|---|---|---|---|
| Abstract Algebra | 80% | 86% | ✓ run |
| High School Biology | 90% | 95% | ✓ run |
| High School Physics | 80% | 82% | ✓ run |
| College Computer Science | 90% | 85% | ✓ run |
| World Religions | 80% | 93% | ✓ run |
| Professional Law | 70% | 82% | ✓ run |
| Medical Genetics | 90% | 92% | ✓ run |
| Philosophy | 90% | 86% | ✓ run |
| Global Facts | — | 70% | rate-limited |
| Marketing | — | 96% | rate-limited |
| Overall (8 subjects) | 83% | 88% | 8/10 subjects |
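Two subjects (Global Facts and Marketing) were rate-limited by the HuggingFace Datasets API during the test run and have no Mindhawk score. MMLU scores are estimates from a 10-question sample per subject, so each per-subject score moves in 10-point steps; a full 57-subject evaluation would smooth the variance.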
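To give a rough sense of how noisy a 10-question sample is, the sketch below computes the binomial standard error of an accuracy estimate. It is an illustration only, not part of the published test scripts; the function name `standard_error` is hypothetical, and it assumes questions are independent draws.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimated from n_questions.

    Assumes each question is an independent trial (a simplification).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# One subject scored at 80% on 10 questions: the standard error is about
# 0.13, i.e. roughly 13 percentage points.
per_subject = standard_error(0.80, 10)

# Pooling the 8 completed subjects gives 80 questions and a tighter estimate.
pooled = standard_error(0.80, 80)

print(f"per-subject SE: {per_subject:.3f}, pooled SE: {pooled:.3f}")
```

Under these assumptions, a difference of one or two questions in a single subject falls well inside the sampling noise, which is why the per-subject comparisons are best read as indicative.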
Mindhawk uses the DeepSeek deepseek-chat API as its reasoning backbone. So why does it score higher than DeepSeek's own published HumanEval number?
MINDHAWK_SYSTEM prompt includes explicit reasoning frameworks for coding tasks: understand the goal → plan → write clean code → verify → explain. This guides the model toward structured problem-solving rather than ad-hoc generation.

recallForContext() scans the memory table and injects relevant prior knowledge into the system prompt. On a benchmark this adds minimal signal; in production it adds substantial reasoning context.

All benchmarks run against Mindhawk's live API, not a custom evaluation harness. Reproducible via open-source scripts.
Dataset: openai/openai_humaneval via HuggingFace Datasets API. 100 problems for simple mode, 20 for agent mode. Grading: extracted Python code was written to a temp file, combined with the official check(candidate) test function, and executed with python3. Pass = exit code 0.
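The grading step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the actual benchmark script: the function name `grade`, the 10-second timeout, and the use of `sys.executable` in place of a literal `python3` are all assumptions. The `test` and `entry_point` arguments correspond to the fields of the same name in the HumanEval dataset.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def grade(candidate_code: str, test_code: str, entry_point: str,
          timeout_s: float = 10.0) -> bool:
    """Return True if the candidate passes the unit tests (exit code 0).

    candidate_code: Python source extracted from the model's reply.
    test_code:      source that defines check(candidate).
    entry_point:    name of the function under test.
    timeout_s:      assumed safety limit; not stated in the report.
    """
    program = f"{candidate_code}\n\n{test_code}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung solution as a failure.
            return False
    return result.returncode == 0
```

A failing assertion inside `check` raises `AssertionError`, which makes the interpreter exit with a non-zero code, so the pass criterion reduces to checking `returncode == 0`. Running untrusted model output this way is only reasonable in a sandboxed or disposable environment.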
Dataset: cais/mmlu via HuggingFace Datasets API. 10 questions per subject. Prompt: "Reply with ONLY the single letter of the correct answer (A, B, C, or D)." Response parsed with regex — first letter A–D extracted. 2 subjects failed due to HuggingFace rate limiting during the run.
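A possible form of the answer-parsing step is sketched below. The report only says that the first letter A–D is extracted with a regex; the exact pattern is not given, so the word-boundary pattern and the name `parse_answer` here are assumptions chosen to avoid matching letters inside words such as "Answer".

```python
import re
from typing import Optional

# Assumed pattern: a standalone capital letter A-D.
ANSWER_RE = re.compile(r"\b([A-D])\b")


def parse_answer(response: str) -> Optional[str]:
    """Return the first standalone letter A-D in the reply, or None."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


def is_correct(response: str, expected_letter: str) -> bool:
    """Score one question; an unparseable reply counts as wrong."""
    return parse_answer(response) == expected_letter
```

With this pattern, a reply that opens with the article "A" (for example "A likely answer is C") would be read as option A, so the instruction to reply with only a single letter matters for accurate scoring.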
Simple mode: POST /api/chat with a fresh conversationId per problem. Agent mode: POST /api/agent with maxSteps: 6 and 180s timeout. All requests made sequentially (no parallelism) to avoid rate limits. Server: Mindhawk v18, localhost:3000.
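The two request shapes might look something like the sketch below. The endpoints, `conversationId`, `maxSteps: 6`, the 180s agent timeout, and `localhost:3000` come from the description above; the `message` field name, the JSON body layout, and the helper names `build_request` and `send` are assumptions, since the API schema is not documented here.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:3000"


def build_request(mode: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for simple or agent mode.

    The "message" key is a hypothetical field name.
    """
    if mode == "simple":
        path = "/api/chat"
        # A fresh conversationId per problem keeps problems isolated.
        payload = {"message": prompt, "conversationId": str(uuid.uuid4())}
    elif mode == "agent":
        path = "/api/agent"
        payload = {"message": prompt, "maxSteps": 6}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 180.0) -> str:
    """Send one request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

Calling `send` in a plain loop, one problem at a time, matches the sequential, no-parallelism setup described above.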
GPT-4o, Claude 3.5 Sonnet, and DeepSeek scores are taken from each lab's published technical reports. Mindhawk's underlying model is the deepseek-chat API endpoint (DeepSeek's current chat model). The qwen3:8b score is the one published by the Qwen team.
/tmp/humaneval_bench.py and /tmp/mmlu_bench.py require only Python 3 and a running Mindhawk server at localhost:3000.
Mindhawk Benchmark Report · May 2026 · Built by Solomon Mwamba Wa Ngoy · solomon.mwamba@proton.me