Corpus Statistics Dashboard

Corpora analyzed

With concept identifiers

6 / 9

Ann/doc range

7.3 – 1027.0

Ambiguity range

1.00 – 1.35

Annotations per thousand tokens

Annotations per document

Log scale. NLM-Chem annotates full-text articles; BioID uses figure captions.
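The two density measures charted above can be reproduced from the summary table. The sketch below is illustrative only: the function name `annotation_density` is hypothetical, and it assumes both measures are simple ratios of the total annotation count, which matches the Ann/doc column of the table.

```python
def annotation_density(total_annotations: int, tokens: int, docs: int) -> tuple[float, float]:
    """Return (annotations per thousand tokens, annotations per document).

    Assumes both measures are plain ratios of the total annotation count.
    """
    per_thousand_tokens = 1000.0 * total_annotations / tokens
    per_document = total_annotations / docs
    return per_thousand_tokens, per_document


# AnatEM row of the summary table: 13,701 annotations, 259,510 tokens, 1,212 documents.
per_k, per_doc = annotation_density(13_701, 259_510, 1_212)
print(f"{per_k:.1f} per thousand tokens, {per_doc:.1f} per document")
```

For the AnatEM row, the per-document ratio rounds to the 11.3 reported in the table.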

AnatEM: none
BC5CDR: MESH
BioID: BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue
CHEMDNER: none
CRAFT: CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON
CellLink: CL (partial)
JNLPBA: none
NCBI-Disease: MESH, OMIM
NLM-Chem: MESH

Faded bars — zero or negligible identifier coverage. These corpora can only benchmark span detection, not entity normalization.

Ambiguity — identifiers per mention

Variation — surface forms per concept

Ambiguity near 1.0 indicates low polysemy. Variation shown only for corpora with concept-level identifiers.

Distinct entity type labels

Label entropy (bits)

Entropy = 0 for single-entity corpora. Higher entropy indicates more balanced coverage across entity types.

Corpus	Docs	Tokens	Types	Total ann.	Ann/doc	Men/doc	IDs/doc	ID vocabulary	Ambiguity^a	Variation^b	Entropy^c
AnatEM	1,212	259,510	12	13,701	11.3	7.2	—	none	1.028	346.92	2.84
BC5CDR	1,500	297,019	2	29,271	19.5	9.4	6.88	MESH	1.018	2.47	0.99
BioID	13,697	771,248	8	102,742	7.5	4.9	5.16	BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue	1.354	1.48	2.06
CHEMDNER	10,000	2,092,491	1	84,331	8.4	4.6	—	none	1.000	19803.00	0.00
CRAFT	97	652,168	11	99,623	1027.0	241.9	149.35	CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON	1.021	2.33	2.99
CellLink	2,003	227,490	3	14,731	7.3	6.0	4.09	CL (partial)	1.065	3.74	0.83
JNLPBA	2,404	564,660	5	59,963	24.9	16.5	—	none	1.020	4450.20	1.67
NCBI-Disease	793	169,561	1	6,892	8.7	5.1	3.17	MESH, OMIM	1.014	2.75	0.00
NLM-Chem	150	789,532	1	38,339	255.6	54.3	34.07	MESH	1.024	2.42	0.00

^a Mean concept identifiers per unique mention string. ^b Mean surface forms per concept identifier; only for corpora with IDs. ^c Shannon entropy of label distribution in bits; 0 = single entity type.
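The three footnoted measures can be sketched as follows. This is a minimal illustration of the definitions above, not the dashboard's actual implementation: the function name `corpus_stats`, the `(mention, concept_id, label)` tuple layout, and the identifiers `C1` to `C3` are all hypothetical. How the dashboard treats corpora without identifiers is not stated here, so the sketch returns `None` for ambiguity and variation in that case.

```python
import math
from collections import Counter, defaultdict


def corpus_stats(annotations):
    """annotations: iterable of (mention_string, concept_id_or_None, label)."""
    ids_by_mention = defaultdict(set)   # mention string -> concept identifiers
    forms_by_id = defaultdict(set)      # concept identifier -> surface forms
    label_counts = Counter()

    for mention, concept_id, label in annotations:
        label_counts[label] += 1
        if concept_id is not None:
            ids_by_mention[mention].add(concept_id)
            forms_by_id[concept_id].add(mention)

    # Ambiguity: mean concept identifiers per unique mention string.
    ambiguity = (
        sum(len(ids) for ids in ids_by_mention.values()) / len(ids_by_mention)
        if ids_by_mention else None
    )
    # Variation: mean surface forms per concept identifier.
    variation = (
        sum(len(forms) for forms in forms_by_id.values()) / len(forms_by_id)
        if forms_by_id else None
    )
    # Shannon entropy of the label distribution, in bits.
    total = sum(label_counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in label_counts.values())
    entropy += 0.0  # adding 0.0 turns a floating-point negative zero into 0.0

    return ambiguity, variation, entropy


example = [
    ("aspirin", "C1", "Chemical"),
    ("acetylsalicylic acid", "C1", "Chemical"),
    ("cold", "C2", "Disease"),
    ("cold", "C3", "Disease"),
]
ambiguity, variation, entropy = corpus_stats(example)
```

In this toy input, "cold" maps to two identifiers and `C1` has two surface forms, so both ambiguity and variation exceed 1, while the two equally frequent labels give an entropy of exactly 1 bit.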

Token vocabulary · Mention tokens · Mention strings · Identifiers

Corpus	Split (train → test tokens)	Token vocab Jaccard	Mention tokens Jaccard	Mention strings Jaccard	Identifiers Jaccard	ID vocab
BC5CDR	8,721 → 8,850	41.9%	31.5%	19.9%	35.9%	MESH
CHEMDNER	34,451 → 31,892	38.4%	26.1%	15.0%	0.0%	none
AnatEM	11,900 → 9,527	36.0%	32.5%	15.2%	0.0%	none
JNLPBA	13,814 → 7,041	35.9%	27.5%	6.6%	0.0%	none
CRAFT	17,536 → 11,740	34.2%	29.3%	15.5%	24.6%	CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON
CellLink	14,685 → 7,907	33.7%	27.9%	9.1%	40.6%	CL (partial)
NLM-Chem	19,315 → 14,467	33.2%	20.5%	11.0%	23.8%	MESH
NCBI-Disease	8,363 → 3,308	28.2%	21.8%	9.0%	19.6%	MESH, OMIM
BioID	20,940 → 9,112	25.2%	19.9%	13.3%	14.2%	BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue

All values are Jaccard similarity (intersection / union) between splits.
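As a minimal sketch of the overlap measure, the function below computes Jaccard similarity between two collections of items, such as the token vocabularies of a train and a test split. The function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions; the latter is merely consistent with the 0.0% shown for corpora without identifiers.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the distinct items of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets have similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


train_vocab = {"protein", "kinase", "binding"}
test_vocab = {"kinase", "binding", "receptor"}
similarity = jaccard(train_vocab, test_vocab)  # 2 shared items out of 4 distinct
```

The same function applies at each abstraction level: only the items being compared change (tokens, mention tokens, mention strings, or identifiers).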

Overlap cascade

Each line traces one corpus across four abstraction levels. Lines that terminate before the identifier level indicate corpora without concept normalization.

Unique journal count

Journal concentration

Top-1 journal Top-3 journals

All 9 corpora have journal metadata. Unique journal count serves as a rough proxy for the diversity of language a corpus covers. Concentration shows whether a corpus is dominated by a small number of source journals. Faded bars indicate corpora with no metadata.
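The dashboard does not define concentration precisely. One plausible reading, sketched below, is the share of documents that come from the k most frequent journals. The function name `journal_concentration` and the journal names are hypothetical.

```python
from collections import Counter


def journal_concentration(journals, k: int) -> float:
    """Share of documents published in the k most frequent journals.

    journals: one journal name per document (an assumed input layout).
    """
    counts = Counter(journals)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_k = sum(count for _, count in counts.most_common(k))
    return top_k / total


docs = ["Journal A"] * 5 + ["Journal B"] * 3 + ["Journal C"] * 2
unique_journals = len(set(docs))
top1 = journal_concentration(docs, 1)
top3 = journal_concentration(docs, 3)
```

Under this reading, a top-1 share close to 1.0 would mean a single journal supplies almost the whole corpus.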

Publication year range

Decade share per corpus

Year-by-year: oldest vs most recent

Hover range bars for the mode year. Corpora anchored in pre-2000 literature risk reduced performance on contemporary terminology.

Article topic distribution per corpus (%)

Topic	AnatEM	BC5CDR	BioID	CHEMDNER	CRAFT	CellLink	JNLPBA	NCBI-Disease	NLM-Chem

Multidisciplinary: —
Cell & developmental biology: 10%, 17%, 12%
Molecular biology / biochemistry: 16%, 62%, 15%, 21%, 12%, 36%, 17%, 18%
Genetics/genomics: —, 18%, 10%, 28%
Neuroscience & neurology: —
Microbiology/pathogenesis: —
Pharmacology: —
Toxicology: —
Oncology: —
Public health / health services: —
Chemistry / Materials Science: 15%, —, 29%, 11%, 29%
Immunology: —
Psychiatry & psychology: —
Health disciplines: —
General biology / anatomy / physiology: 14%, 18%, 23%, 22%, 11%, 14%
General natural sciences: —
General / internal medicine: —
Nutrition, metabolism, and food science: —
Surgery / anesthesia / perioperative: —
Diagnostics / pathology / radiology: —
Pediatrics / reproductive / developmental medicine: —
Clinical specialties by organ system: 11%, —
Demographic characteristics: 18%, —, 16%
Total shown: 100%

Topics are high-level article categories derived from the MeSH terms in article metadata. Where an article's terms cannot be resolved, the remaining fraction is filled from journal-level MeSH topics and, failing that, from configured journal-name fallback topics. Only topics with ≥ 1% share in at least one corpus are shown. Dominant value per row is bold. Percentages may not sum to exactly 100 due to rounding.

Journal topic distribution per corpus (%)