Independent resource on AI content authenticity — detection, standards & policy
Benchmark · March 2026

AI Detection Tools Compared

Benchmarked on 2,400 samples (1,200 human-written, 1,200 AI-generated) spanning academic essays, news, marketing copy, technical documentation, and creative writing.

Text Detection

Text detectors are sorted by overall accuracy across all five content categories; Hive Moderation is typed as voice and listed last.

| Tool | Site | Accuracy | False Positive | False Negative | Latency | Type |
|---|---|---|---|---|---|---|
| Originality.ai | originality.ai | 91% | 7% | not listed | 420ms | text |
| GPTZero | gptzero.me | 87% | 10% | not listed | 380ms | text |
| Writer.com | writer.com | 84% | 8% | not listed | 290ms | text |
| Copyleaks | copyleaks.com | 79% | 12% | not listed | 510ms | text |
| Sapling AI | sapling.ai | 76% | 17% | not listed | 610ms | text |
| Hive Moderation | thehive.ai | 88% | 9% | not listed | 340ms | voice |

How We Test

Every tool is tested against the same corpus of 1,200 human-written samples (sourced from published journalism, academic papers, and creative writing) and 1,200 AI-generated samples (Claude, GPT-4o, Gemini 1.5 Pro, and Llama 3) across five categories.

Accuracy

The percentage of all samples classified correctly (human as human, AI as AI). It is the headline number, but not the only one that matters.
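As a minimal sketch of the definition above, accuracy can be computed by comparing each predicted label with the true one. The function name, the label strings, and the toy data are illustrative assumptions, not part of the benchmark harness.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data, not benchmark results.
truth = ["human", "human", "ai", "ai", "ai"]
preds = ["human", "ai", "ai", "ai", "human"]
print(accuracy(truth, preds))  # 3 of 5 labels match -> 0.6
```

Note that the single number hides which kind of mistake was made: the toy data contains one human sample flagged as AI and one AI sample passed as human, and both count equally against accuracy.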

False Positive Rate (FPR)

How often human-written text is incorrectly flagged as AI. This is the most consequential metric for academic deployment: a 17% FPR means roughly 1 in 6 students who wrote their own work could be falsely accused.
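To make the 17% figure concrete, the sketch below estimates how many honest submissions would be flagged in a single class. The class size of 30 and the assumption that each submission is flagged independently are illustrative choices, not measurements from the benchmark.

```python
def expected_false_flags(n_human: int, fpr: float) -> float:
    """Expected number of human-written submissions flagged as AI."""
    return n_human * fpr


def prob_at_least_one_false_flag(n_human: int, fpr: float) -> float:
    """Chance that at least one human-written submission is flagged,
    assuming each submission is flagged independently (an assumption)."""
    return 1.0 - (1.0 - fpr) ** n_human


class_size = 30  # hypothetical class in which every student wrote their own essay
fpr = 0.17       # highest false positive rate in the results above

print(f"expected false flags: {expected_false_flags(class_size, fpr):.1f}")
print(f"P(at least one): {prob_at_least_one_false_flag(class_size, fpr):.3f}")
```

Under these assumptions a detector with a 17% FPR would be expected to flag about five honest students in a class of 30, and at least one false accusation is close to certain.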

False Negative Rate (FNR)

How often AI-generated text passes as human. Critical for journalism integrity and content moderation use cases.
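Because the corpus is balanced, a false negative rate can be backed out of a reported accuracy and false positive rate. The sketch below does that arithmetic for the five text detectors; it assumes accuracy and FPR were both measured over the same 1,200 human and 1,200 AI samples. The function name is hypothetical, and the outputs are derived estimates from rounded percentages, not figures published by the benchmark.

```python
def implied_fnr(accuracy: float, fpr: float,
                n_human: int = 1200, n_ai: int = 1200) -> float:
    """Back out the false negative rate from accuracy and FPR.

    accuracy * total = correct humans + correct AI, so the number of
    correctly caught AI samples is what remains after subtracting the
    correctly passed human samples.
    """
    total = n_human + n_ai
    correct = accuracy * total
    human_correct = (1.0 - fpr) * n_human  # human text passed as human
    ai_correct = correct - human_correct   # AI text caught as AI
    return 1.0 - ai_correct / n_ai


# (accuracy, FPR) as reported for the text detectors in the results above.
reported = {
    "Originality.ai": (0.91, 0.07),
    "GPTZero": (0.87, 0.10),
    "Writer.com": (0.84, 0.08),
    "Copyleaks": (0.79, 0.12),
    "Sapling AI": (0.76, 0.17),
}

for tool, (acc, fpr) in reported.items():
    print(f"{tool}: implied FNR of about {implied_fnr(acc, fpr):.0%}")
```

With equal class sizes this reduces to FNR = 2 × (1 − accuracy) − FPR, which is why a tool with a low false positive rate and middling accuracy must be missing a comparatively large share of AI text.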

Important Caveats

  • All tools show degraded performance on post-humanized text (text processed through an AI humanizer)
  • STEM writing has naturally low perplexity, a signal many detectors associate with AI output, so accuracy on technical content is materially lower for all text detectors
  • Results vary significantly by model — older GPT-3.5 output is easier to detect than Claude 3.5 or GPT-4o output
  • Benchmarks are updated quarterly. Join the forum to discuss methodology.

Discuss detection methodology

The forum has active threads on bypass techniques, tool-specific quirks, academic deployment policy, and new research.
