AI Detection Tools Compared
Benchmarked on 2,400 samples (1,200 human-written, 1,200 AI-generated) across academic essays, news, marketing copy, technical documentation, and creative writing.
Text Detection
Sorted by overall accuracy across all 5 content categories.
| Tool | Accuracy | False Positive | False Negative | Latency | Type |
|---|---|---|---|---|---|
| Originality.ai (originality.ai) | | 7% | | 420ms | text |
| GPTZero (gptzero.me) | | 10% | | 380ms | text |
| Writer.com (writer.com) | | 8% | | 290ms | text |
| Copyleaks (copyleaks.com) | | 12% | | 510ms | text |
| Sapling AI (sapling.ai) | | 17% | | 610ms | text |
| Hive Moderation (thehive.ai) | | 9% | | 340ms | voice |
How We Test
Every tool is tested against the same corpus of 1,200 human-written samples (sourced from published journalism, academic papers, and creative writing) and 1,200 AI-generated samples (Claude, GPT-4o, Gemini 1.5 Pro, and Llama 3) across five categories.
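A minimal sketch of what running one tool over a shared, labeled corpus might look like. The `Sample` type, the `tally` helper, and the `toy_detector` function are all hypothetical names introduced for illustration; the actual harness behind this benchmark is not described in the source.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    text: str
    category: str  # e.g. "news" or "academic"
    is_ai: bool    # ground-truth label


def tally(samples: Iterable[Sample],
          detect: Callable[[str], bool]) -> Dict[str, Counter]:
    """Run one detector over the corpus and count outcomes per category."""
    counts: Dict[str, Counter] = {}
    for s in samples:
        flagged = detect(s.text)
        if s.is_ai:
            outcome = "tp" if flagged else "fn"  # AI text caught or missed
        else:
            outcome = "fp" if flagged else "tn"  # human text flagged or cleared
        counts.setdefault(s.category, Counter())[outcome] += 1
    return counts


# Hypothetical stand-in detector: flags any text containing "delve".
def toy_detector(text: str) -> bool:
    return "delve" in text.lower()


# Four made-up samples, not drawn from the benchmark corpus.
corpus = [
    Sample("Let us delve into the results.", "academic", True),
    Sample("The council voted 5-2 on Tuesday.", "news", False),
    Sample("We delve deeper every quarter.", "marketing", False),
    Sample("The function returns a list.", "technical", True),
]
results = tally(corpus, toy_detector)
```

Because every detector is passed the same `corpus`, the per-category counts are directly comparable across tools.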
Accuracy
The percentage of samples correctly classified (human as human, AI as AI). The headline number, but not the only one that matters.
False Positive Rate (FPR)
How often human-written text is incorrectly flagged as AI. This is the most consequential metric for academic deployment: a 17% FPR means roughly 1 in 6 students who wrote their own work could be falsely accused.
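To put that rate in concrete terms, the short calculation below applies a 17% FPR to the 1,200 human-written samples in the corpus. The helper name `expected_false_flags` is an assumption made for this sketch.

```python
def expected_false_flags(fpr: float, n_human: int) -> int:
    """Expected number of human-written samples flagged as AI."""
    return round(fpr * n_human)


flags = expected_false_flags(0.17, 1200)  # 204 of 1,200 human samples
one_in_n = 1 / 0.17                       # roughly 5.9, i.e. about 1 in 6
```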
False Negative Rate (FNR)
How often AI-generated text passes as human. Critical for journalism integrity and content moderation use cases.
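The three metrics above follow directly from confusion-matrix counts. The sketch below treats "AI" as the positive class. The function name and the counts passed to it are illustrative assumptions, not figures from this benchmark.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, FPR and FNR with 'AI-generated' as the positive class.

    Assumes both classes are non-empty (no zero-division guard).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # correct calls over all samples
        "fpr": fp / (fp + tn),          # human text flagged as AI
        "fnr": fn / (fn + tp),          # AI text passed as human
    }


# Illustrative counts only: 1,200 AI samples (tp + fn), 1,200 human (tn + fp).
m = detection_metrics(tp=1020, fp=60, tn=1140, fn=180)
```

On a balanced corpus like this one, accuracy equals 1 minus the mean of FPR and FNR, so two tools with the same accuracy can still split their errors very differently.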
Important Caveats
- All tools show degraded performance on post-humanized text (text processed through an AI humanizer)
- STEM writing has naturally low perplexity — accuracy on technical content is materially lower for all text detectors
- Results vary significantly by model — older GPT-3.5 output is easier to detect than Claude 3.5 or GPT-4o output
- Benchmarks are updated quarterly. Join the forum to discuss methodology
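The perplexity caveat above can be made concrete with the standard definition: the exponential of the negative mean log-probability per token. The token probabilities below are made up, and how any listed tool actually uses perplexity is not stated in the source, so this is only a sketch of the quantity itself.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities from some language model.
predictable = [math.log(0.5)] * 8   # formulaic text: each token p = 0.5
surprising = [math.log(0.05)] * 8   # idiosyncratic text: each token p = 0.05

low = perplexity(predictable)   # about 2
high = perplexity(surprising)   # about 20
```

If a detector treats low perplexity as a sign of machine text, formulaic human writing such as STEM prose would score closer to the `predictable` case, which is consistent with the lower accuracy noted above.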