Glossary — 25 Terms

AI Content Authenticity
Terminology

The vocabulary of AI content detection, provenance standards, and authenticity policy — from statistical signals to regulatory definitions.

All Terms

Perplexity PP

A measure of how "surprised" a language model is by a text. Formally, it is the exponential of the average negative log-probability per token, which equals the geometric mean of the inverse token probabilities. AI-generated text tends to have low perplexity because models produce statistically likely token sequences.
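
As a rough illustration of the formula, the sketch below computes perplexity from a list of per-token probabilities. The probabilities are made-up inputs; in practice they would come from whatever scoring model a detector uses, which this glossary does not specify.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities p(token_i | preceding tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Illustrative (invented) probabilities, not real model output.
predictable = [0.5, 0.5, 0.5, 0.5]
surprising = [0.5, 0.01, 0.5, 0.02]

print(perplexity(predictable))  # roughly 2
print(perplexity(surprising))   # noticeably higher
```

A text whose tokens each had probability 0.5 scores about 2, meaning the model was on average choosing between two equally likely options.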

Burstiness

The variance in per-sentence perplexity across a document. Human writing tends to be bursty, alternating between fluent and labored passages, whereas AI text often shows unusually uniform sentence-level perplexity.
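
A minimal sketch of this definition, assuming the per-sentence perplexities have already been computed by some scoring model. The input numbers are invented for illustration, and the choice of population variance is an assumption.

```python
from statistics import pvariance

def burstiness(sentence_perplexities):
    """Population variance of per-sentence perplexity scores."""
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences")
    return pvariance(sentence_perplexities)

# Invented scores: one document swings widely, the other stays uniform.
varied = [5, 40, 12, 60]
uniform = [20, 22, 21, 19]

print(burstiness(varied) > burstiness(uniform))
```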

False Positive Rate FPR

The rate at which a detector incorrectly classifies human-written text as AI-generated. It is the critical metric for academic and publishing deployments, where wrongly accusing a human author has serious consequences.

False Negative Rate FNR

The rate at which a detector incorrectly classifies AI-generated text as human-written. It is the critical metric for content moderation and journalism integrity use cases.
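
The two error rates can be computed side by side from labeled test data. The sketch below assumes boolean labels where True means "AI-generated"; the function name and sample data are illustrative only.

```python
def error_rates(labels, predictions):
    """Return (FPR, FNR). In both lists, True means 'AI-generated'."""
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if not y and p)  # human flagged as AI
    false_neg = sum(1 for y, p in pairs if y and not p)  # AI passed as human
    humans = sum(1 for y in labels if not y)
    ais = sum(1 for y in labels if y)
    fpr = false_pos / humans if humans else float("nan")
    fnr = false_neg / ais if ais else float("nan")
    return fpr, fnr

# Four human texts (one wrongly flagged), two AI texts (one missed).
labels      = [False, False, False, False, True, True]
predictions = [True,  False, False, False, False, True]
print(error_rates(labels, predictions))  # (0.25, 0.5)
```

Note that FPR is divided by the number of human-written texts and FNR by the number of AI-generated texts, so the two rates do not need to sum to anything in particular.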

Type-Token Ratio TTR

The ratio of unique words to total words in a text. AI-generated text at scale tends toward lower TTR due to repetitive vocabulary patterns.
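
A minimal sketch of the ratio, assuming a simple lowercase regex tokenizer (real tools may tokenize differently). Because TTR falls as a text gets longer, comparisons are only meaningful between texts of similar length.

```python
import re

def type_token_ratio(text):
    """Unique words divided by total words, using a naive tokenizer."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 6 tokens, 5 distinct ("the" repeats), so the ratio is 5/6.
print(type_token_ratio("The cat sat on the mat"))
```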

Hapax Legomenon

A word appearing exactly once in a text. Human writing has a higher hapax rate — more contextually specific, rare word choices. AI text under-represents hapax legomena.
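
The sketch below counts hapax legomena with a naive tokenizer. Normalizing by total token count is an assumption made here for illustration; some analyses divide by the number of distinct words instead.

```python
import re
from collections import Counter

def hapax_rate(text):
    """Share of tokens that are words occurring exactly once in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return len(hapaxes) / len(tokens)

# "cat", "sat", "on", "mat" each occur once; "the" occurs twice.
print(hapax_rate("The cat sat on the mat"))  # 4 hapaxes out of 6 tokens
```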

C2PA Standard

The Coalition for Content Provenance and Authenticity, and the open technical standard it publishes. The standard defines how cryptographically signed manifests are embedded in media files to record origin, edit history, and AI involvement. It is the leading content provenance standard.

Content Credentials C2PA

The human-readable label for C2PA manifests. A "Content Credentials" badge on an image or document indicates verifiable provenance metadata is attached.

Watermarking

Embedding imperceptible signals in AI output that are designed to survive editing and compression, enabling post-hoc detection of AI-generated content. There are two classes: statistical (token-level) and perceptual (frequency-domain).
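
To make the statistical (token-level) class concrete, here is a toy sketch of one published family of schemes: a hash of the previous token splits the vocabulary into "green" and "red" halves, generation favors green tokens, and detection counts how many tokens landed on green. Everything here is an assumption for illustration (the hash, the `GAMMA` fraction, the toy generator); it is not the method of any specific product named in this glossary.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of candidate tokens treated as "green"

def is_green(prev_token, token):
    # Hypothetical partition: hash the (previous, current) token pair.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GAMMA

def generate_watermarked(vocab, length, start="<s>"):
    # Toy "generator": always emits the first green candidate it finds.
    out = [start]
    while len(out) <= length:
        out.append(next(t for t in vocab if is_green(out[-1], t)))
    return out

def z_score(tokens):
    # How far the green count sits above what unwatermarked text would give.
    n = len(tokens) - 1
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

vocab = [f"w{i}" for i in range(64)]
print(z_score(generate_watermarked(vocab, 16)))
```

A large positive z-score suggests the text was generated with the green-list bias; ordinary text should hover near zero. Paraphrasing replaces tokens and pulls the score back toward zero, which is why robustness to editing is a design goal rather than a guarantee.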

SynthID Google

Google DeepMind's watermarking technology. Uses statistical biasing during text generation to embed imperceptible signals. Deployed in Gemini. Works on text, images, and audio.

Deepfake

Synthetic media in which a person's likeness or voice is generated or substituted using AI. Detection uses spectral analysis, liveness signals, and temporal consistency checks.

Equal Error Rate EER

The point at which a biometric/deepfake detector has equal false acceptance and false rejection rates. A single-number summary of detector performance; lower is better.
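
One simple way to estimate EER from detector scores is to sweep candidate thresholds and take the point where the two error rates are closest. The sketch assumes higher scores mean "more likely genuine"; the scores are invented, and production evaluations usually interpolate on a ROC or DET curve instead.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold). Higher score = more likely genuine."""
    best = None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False acceptance: spoofed samples scoring at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: genuine samples scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

genuine = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine, spoof))  # (0.25, 0.6)
```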

GPAI Model EU AI Act

General-Purpose AI Model. Under the EU AI Act, GPAI models (like large language models) have specific transparency and documentation obligations regardless of their downstream use.

Manifest C2PA

A C2PA data structure attached to a media file. Contains signed assertions about the file's origin, creation tools, edits made, and AI involvement. Cryptographically tamper-evident.

Provenance

The documented history of where content came from, who created it, and how it was modified. Content provenance is the broader concept that C2PA and watermarking aim to establish.

N-gram Entropy

The unpredictability of word sequences at the bigram and trigram level. AI-generated text shows reduced n-gram entropy — more predictable phrase combinations than human writing.
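
A minimal sketch, assuming "entropy" here means the Shannon entropy (in bits) of the observed n-gram frequency distribution. That interpretation and the whitespace tokenization are assumptions; the glossary does not pin down the exact estimator.

```python
import math
from collections import Counter

def ngram_entropy(tokens, n=2):
    """Shannon entropy in bits of the n-gram distribution in a token list."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied = "the quick brown fox jumps".split()    # four distinct bigrams
repetitive = "very very very very very".split() # one bigram, repeated

print(ngram_entropy(varied))      # 2 bits: four equally likely bigrams
print(ngram_entropy(repetitive))  # zero: the next bigram is never a surprise
```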

Humanizer

A tool that rewrites AI-generated text to reduce its detectability by AI detectors. Typically works by replacing predictable token sequences with higher-perplexity alternatives. Effectiveness varies widely by tool pairing.

Bypass Rate

The percentage of AI-generated text that successfully passes a given detector after humanization. Ranges from 23% to 91% depending on the humanizer-detector pair tested.

Liveness Detection

In voice deepfake detection, liveness signals distinguish real human vocal production (breath noise, micro-variation) from synthetic speech. A key signal for audio authenticity.

NIST AI 100-4 Standard

NIST report, titled "Reducing Risks Posed by Synthetic Content", surveying technical approaches to digital content transparency, including watermarking, provenance metadata, and synthetic content detection. It was produced in response to the 2023 US Executive Order on AI Safety.

Article 50 EU AI Act

The transparency obligation provision of the EU AI Act. Requires disclosure of AI-generated text and deepfakes. Enforceable from August 2026, with fines of up to EUR 15 million or 3% of global annual turnover, whichever is higher.

Spectral Analysis Audio

Examining the frequency-domain representation of audio to detect anomalies characteristic of synthetic speech. AI-generated voice often shows unnatural spectral flatness or absence of breath noise.
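
Spectral flatness is commonly quantified as the ratio of the geometric mean to the arithmetic mean of the power spectrum: close to 1 for noise-like signals, close to 0 for tonal ones. The sketch below uses a naive pure-Python DFT on tiny synthetic signals so it stays dependency-free; real pipelines would use an FFT library on framed audio, and what counts as "unnatural" flatness is detector-specific and not shown here.

```python
import cmath
import math

def power_spectrum(samples):
    """Naive DFT power for bins 1 .. n/2 - 1 (DC bin skipped)."""
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def spectral_flatness(samples, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = [p + eps for p in power_spectrum(samples)]  # eps avoids log(0)
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))

n = 64
tone = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]  # single frequency
impulse = [1.0] + [0.0] * (n - 1)                             # flat spectrum

print(spectral_flatness(tone))     # near 0: energy in one bin
print(spectral_flatness(impulse))  # near 1: energy spread evenly
```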

Perceptual Hash

A fingerprint of a media file based on perceptual content rather than exact bytes. Used to identify near-duplicate or manipulated media even after format conversion or light editing.
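
One of the simplest perceptual hashes is the average hash: downscale to a small grayscale grid, then record whether each pixel is brighter than the mean. The sketch assumes the image has already been reduced to a small grid of 0 to 255 values (a 4x4 grid here for brevity); it is one illustrative technique, not the specific algorithm any tool in this glossary uses.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image mean, else 0."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]

def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest near-duplicates."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Toy 4x4 grayscale gradient and two variants of it.
image = [[16 * (4 * r + c) for c in range(4)] for r in range(4)]
brighter = [[v + 10 for v in row] for row in image]   # light edit
inverted = [[255 - v for v in row] for row in image]  # different content

print(hamming_distance(average_hash(image), average_hash(brighter)))  # 0
print(hamming_distance(average_hash(image), average_hash(inverted)))  # 16
```

A uniform brightness change leaves every pixel on the same side of the mean, so the hash is unchanged, while a byte-level checksum of the same two files would differ completely.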

Training Data Disclosure EU AI Act

Under Article 53 of the EU AI Act, GPAI model providers must publish summaries of the training data used. This is distinct from the Article 50 output-labeling requirement.

Semantic Location

In backlink analysis, the position of a link within a page (e.g., header, body, footer). Editorial body links carry more SEO signal than footer or nav links.

Missing a term?

Suggest additions or corrections in the community forum. This glossary is maintained collaboratively.
