Corpus Statistics Dashboard

Biomedical named entity annotation corpora — comparative analysis

Corpora analyzed

9

With concept identifiers

6 / 9

Ann/doc range

7.3 – 1027.0

Ambiguity range

1.00 – 1.35

Corpora: AnatEM, BC5CDR, BioID, CHEMDNER, CRAFT, CellLink, JNLPBA, NCBI-Disease, NLM-Chem
Entity scope

Every annotation label reported by the corpus.

Annotations per thousand tokens

Annotation density per thousand tokens varies widely across corpora.

Annotations per document

Annotations per document range from 7.3 (CellLink) to 1027.0 (CRAFT).

Log scale. NLM-Chem annotates full-text articles; BioID uses figure captions.
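As a rough illustration of how the two density measures relate to raw counts, the sketch below divides total annotations by tokens (scaled to 1,000) and by documents. The formulas are assumed from the chart titles rather than confirmed by the dashboard, the function names are hypothetical, and the counts are the AnatEM and CRAFT figures from the summary table on this page.

```python
def annotations_per_thousand_tokens(total_annotations: int, total_tokens: int) -> float:
    """Annotations per 1,000 tokens (assumed definition)."""
    return 1000.0 * total_annotations / total_tokens


def annotations_per_document(total_annotations: int, total_docs: int) -> float:
    """Mean number of annotations per document (assumed definition)."""
    return total_annotations / total_docs


# Counts copied from the summary table on this page.
corpora = {
    "AnatEM": {"docs": 1212, "tokens": 259510, "annotations": 13701},
    "CRAFT": {"docs": 97, "tokens": 652168, "annotations": 99623},
}

for name, c in corpora.items():
    per_k = annotations_per_thousand_tokens(c["annotations"], c["tokens"])
    per_doc = annotations_per_document(c["annotations"], c["docs"])
    print(f"{name}: {per_k:.1f} ann/1k tokens, {per_doc:.1f} ann/doc")
```

The per-document results reproduce the Ann/doc values reported for these two corpora (11.3 and 1027.0), which suggests the per-document measure is a plain ratio of the totals.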

Identifier vocabularies per corpus:
AnatEM: none
BC5CDR: MESH
BioID: BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue
CHEMDNER: none
CRAFT: CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON
CellLink: CL (partial)
JNLPBA: none
NCBI-Disease: MESH, OMIM
NLM-Chem: MESH
Three corpora have no concept identifiers.

Faded bars — zero or negligible identifier coverage. These corpora can only benchmark span detection, not entity normalization.

Ambiguity — identifiers per mention

Ambiguity is low across all corpora (1.00 to 1.35); BioID is highest at 1.35.

Variation — surface forms per concept

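Among corpora with concept identifiers, CellLink is highest (3.74) and BioID lowest (1.48); CRAFT, NLM-Chem and BC5CDR cluster between 2.3 and 2.5.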

Ambiguity near 1.0 indicates low polysemy. Variation shown only for corpora with concept-level identifiers.
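A minimal sketch of the two measures, following the definitions in the table footnotes (mean identifiers per unique mention string, mean surface forms per identifier). It assumes exact string matching with no case normalization, which the dashboard does not specify; the function name and the identifiers C1 to C3 are placeholders.

```python
from collections import defaultdict


def ambiguity_and_variation(pairs):
    """Return (ambiguity, variation) for an iterable of (mention, concept_id) pairs."""
    ids_by_mention = defaultdict(set)
    mentions_by_id = defaultdict(set)
    for mention, concept_id in pairs:
        ids_by_mention[mention].add(concept_id)
        mentions_by_id[concept_id].add(mention)
    if not ids_by_mention:
        return 0.0, 0.0  # assumption: empty input yields zeros
    # Ambiguity: mean number of distinct identifiers per unique mention string.
    ambiguity = sum(len(ids) for ids in ids_by_mention.values()) / len(ids_by_mention)
    # Variation: mean number of distinct surface forms per identifier.
    variation = sum(len(m) for m in mentions_by_id.values()) / len(mentions_by_id)
    return ambiguity, variation


# Toy data: three surface forms share C1; "cold" maps to two identifiers.
pairs = [
    ("aspirin", "C1"),
    ("ASA", "C1"),
    ("acetylsalicylic acid", "C1"),
    ("cold", "C2"),
    ("cold", "C3"),
]
ambiguity, variation = ambiguity_and_variation(pairs)
print(f"ambiguity={ambiguity:.2f} variation={variation:.2f}")
```

In the toy data, four unique mentions carry five mention-identifier links (ambiguity 1.25) and three identifiers carry five links (variation about 1.67).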

Distinct entity type labels

AnatEM has 12 types; three corpora (CHEMDNER, NCBI-Disease, NLM-Chem) annotate a single entity type.

Label entropy (bits)

Single-entity corpora have 0 bits; CRAFT is highest at 2.99 bits, followed by AnatEM at 2.84 bits.

Entropy = 0 for single-entity corpora. Higher entropy indicates more balanced coverage across entity types.
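The entropy described here can be sketched as Shannon entropy over the label counts, in bits. This is an illustrative implementation, not necessarily the dashboard's; the function name is hypothetical. A naive computation returns negative zero for a single-label corpus, which formats as -0.00, so the sketch normalizes that case.

```python
import math
from collections import Counter


def label_entropy_bits(labels):
    """Shannon entropy (base 2) of the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: an empty corpus has zero entropy
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Adding 0.0 turns IEEE negative zero into positive zero.
    return entropy + 0.0


print(label_entropy_bits(["Chemical"] * 10))                  # single entity type
print(label_entropy_bits(["Chemical", "Disease"] * 5))        # two balanced types
print(label_entropy_bits(["A", "B", "C", "D"]))               # four balanced types
```

A corpus with k equally frequent labels reaches the maximum of log2(k) bits, so 12 types allow at most about 3.58 bits and 11 types at most about 3.46 bits.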

| Corpus | Docs | Tokens | Types | Total ann. | Ann/doc | Men/doc | IDs/doc | ID vocabulary | Ambiguity (a) | Variation (b) | Entropy (c) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AnatEM | 1,212 | 259,510 | 12 | 13,701 | 11.3 | 7.2 | | none | 1.028 | 346.92 | 2.84 |
| BC5CDR | 1,500 | 297,019 | 2 | 29,271 | 19.5 | 9.4 | 6.88 | MESH | 1.018 | 2.47 | 0.99 |
| BioID | 13,697 | 771,248 | 8 | 102,742 | 7.5 | 4.9 | 5.16 | BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue | 1.354 | 1.48 | 2.06 |
| CHEMDNER | 10,000 | 2,092,491 | 1 | 84,331 | 8.4 | 4.6 | | none | 1.000 | 19803.00 | 0.00 |
| CRAFT | 97 | 652,168 | 11 | 99,623 | 1027.0 | 241.9 | 149.35 | CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON | 1.021 | 2.33 | 2.99 |
| CellLink | 2,003 | 227,490 | 3 | 14,731 | 7.3 | 6.0 | 4.09 | CL (partial) | 1.065 | 3.74 | 0.83 |
| JNLPBA | 2,404 | 564,660 | 5 | 59,963 | 24.9 | 16.5 | | none | 1.020 | 4450.20 | 1.67 |
| NCBI-Disease | 793 | 169,561 | 1 | 6,892 | 8.7 | 5.1 | 3.17 | MESH, OMIM | 1.014 | 2.75 | 0.00 |
| NLM-Chem | 150 | 789,532 | 1 | 38,339 | 255.6 | 54.3 | 34.07 | MESH | 1.024 | 2.42 | 0.00 |

(a) Mean concept identifiers per unique mention string.
(b) Mean surface forms per concept identifier; only for corpora with IDs.
(c) Shannon entropy of label distribution in bits; 0 = single entity type.
| Corpus | Split: train → test tokens | Token vocab Jaccard | Mention tokens Jaccard | Mention strings Jaccard | Identifiers Jaccard | ID vocab |
|---|---|---|---|---|---|---|
| BC5CDR | 8,721 → 8,850 | 41.9% | 31.5% | 19.9% | 35.9% | MESH |
| CHEMDNER | 34,451 → 31,892 | 38.4% | 26.1% | 15.0% | 0.0% | none |
| AnatEM | 11,900 → 9,527 | 36.0% | 32.5% | 15.2% | 0.0% | none |
| JNLPBA | 13,814 → 7,041 | 35.9% | 27.5% | 6.6% | 0.0% | none |
| CRAFT | 17,536 → 11,740 | 34.2% | 29.3% | 15.5% | 24.6% | CHEBI, CL, GO_BP, GO_CC, GO_MF, MONDO, MOP, NCBITaxon, PR, SO, UBERON |
| CellLink | 14,685 → 7,907 | 33.7% | 27.9% | 9.1% | 40.6% | CL (partial) |
| NLM-Chem | 19,315 → 14,467 | 33.2% | 20.5% | 11.0% | 23.8% | MESH |
| NCBI-Disease | 8,363 → 3,308 | 28.2% | 21.8% | 9.0% | 19.6% | MESH, OMIM |
| BioID | 20,940 → 9,112 | 25.2% | 19.9% | 13.3% | 14.2% | BAO, CHEBI, CL, CVCL, Corum, GO, NCBI gene, NCBI taxon, PubChem, Rfam, Uberon, Uniprot, cell, gene, molecule, organism, protein, subcellular, tissue |
All values are Jaccard similarity (intersection / union) between splits.
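The Jaccard similarity named above can be sketched in a few lines. The function name is hypothetical, and returning 0.0 when both sets are empty is an assumption chosen to mirror the 0.0% shown for corpora without identifiers; the dashboard does not say how it handles that case.

```python
def jaccard(train_items, test_items) -> float:
    """Jaccard similarity: |intersection| / |union| of two collections."""
    train, test = set(train_items), set(test_items)
    union = train | test
    if not union:
        return 0.0  # assumption: two empty sets count as no overlap
    return len(train & test) / len(union)


# Toy token vocabularies for a train and a test split.
train_vocab = {"the", "patient", "received", "aspirin"}
test_vocab = {"the", "patient", "received", "ibuprofen", "daily"}

print(f"{jaccard(train_vocab, test_vocab):.1%}")  # 3 shared of 6 total
```

The same function applies at each level of the table by changing what goes into the sets: token types, tokens inside mentions, full mention strings, or concept identifiers.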

Overlap cascade

Overlap cascade from token vocabulary to identifier level.

Each line traces one corpus across four abstraction levels. Lines that terminate before the identifier level indicate corpora without concept normalization.

Unique journal count

Journal diversity ranges from 25 to over 300 unique journals.

Journal concentration

Series: top-1 journal, top-3 journals.
CRAFT most concentrated; BC5CDR most distributed.

All 9 corpora have journal metadata. Unique journal count serves as a proxy for language diversity. Concentration shows whether a corpus is dominated by a small number of journals. Faded bars indicate corpora with no metadata.
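One plausible reading of the two journal measures is sketched below: diversity as the number of distinct journals, and concentration as the share of documents from the k most frequent journals. The function name and journal labels are hypothetical, and the dashboard may compute concentration differently.

```python
from collections import Counter


def top_k_share(journals, k: int) -> float:
    """Fraction of documents published in the k most frequent journals."""
    counts = Counter(journals)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Toy corpus: one journal name per document.
journals = ["Journal A"] * 5 + ["Journal B"] * 3 + ["Journal C"] + ["Journal D"]

print("unique journals:", len(set(journals)))
print("top-1 share:", top_k_share(journals, 1))
print("top-3 share:", top_k_share(journals, 3))
```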

Publication year range

Year ranges span from 1968 to 2025.

Decade share per corpus

Decade distribution across corpora.

Year-by-year: oldest vs most recent

Article distribution per year.

Corpora anchored in pre-2000 literature risk reduced performance on contemporary terminology.

Article topic distribution per corpus (%)

Columns: AnatEM, BC5CDR, BioID, CHEMDNER, CRAFT, CellLink, JNLPBA, NCBI-Disease, NLM-Chem.
Values are listed in column order with blank cells skipped; only rows with nine values map one-to-one onto the corpora.

Multidisciplinary: 2%, 2%
Cell & developmental biology: 7%, 1%, 10%, 5%, 9%, 17%, 12%, 4%, 6%
Molecular biology / biochemistry: 16%, 6%, 62%, 15%, 21%, 12%, 36%, 17%, 18%
Genetics/genomics: 5%, 8%, 3%, 18%, 8%, 10%, 28%, 4%
Neuroscience & neurology: 2%, 4%, 1%, 1%, 2%, 4%, 2%
Microbiology/pathogenesis: 2%, 2%, 1%, 2%, 2%, 1%
Pharmacology: 2%, 6%, 6%, 1%, 2%
Toxicology: 1%
Oncology: 4%, 2%, 1%, 1%
Public health / health services: 3%, 2%, 2%, 1%, 1%, 2%
Chemistry / Materials Science: 8%, 15%, 29%, 4%, 4%, 11%, 5%, 29%
Immunology: 1%, 2%, 4%, 6%, 1%
Psychiatry & psychology: 1%, 3%, 1%, 1%
Health disciplines: 3%, 2%, 2%, 1%, 1%, 2%
General biology / anatomy / physiology: 14%, 14%, 6%, 18%, 23%, 22%, 11%, 11%, 14%
General natural sciences: 2%, 3%, 1%, 2%, 3%, 3%, 4%
General / internal medicine: 6%, 6%, 7%, 2%, 1%, 3%, 1%, 2%
Nutrition, metabolism, and food science: 1%, 1%
Surgery / anesthesia / perioperative: 2%, 2%
Diagnostics / pathology / radiology: 2%, 4%, 2%, 1%, 4%, 1%, 2%, 2%
Pediatrics / reproductive / developmental medicine: 2%
Clinical specialties by organ system: 8%, 11%, 3%, 7%, 9%, 2%, 5%, 4%
Demographic characteristics: 7%, 18%, 4%, 4%, 5%, 3%, 16%, 6%
Total shown: 100%, 100%, 100%, 100%, 100%, 100%, 100%, 100%, 100%
Topics are high-level MeSH-derived article categories resolved from article metadata MeSH terms, with unresolved article-term fractions filled from journal MeSH topics and configured journal-name fallback topics. Only topics with ≥ 1% share in at least one corpus are shown. Dominant value per row is bold. Percentages may not sum to exactly 100 due to rounding.
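The display rule in this note (show a topic only if it reaches 1% in at least one corpus, then round to whole percentages) can be sketched as follows. This is an assumed reconstruction with hypothetical function names and made-up shares, not the dashboard's code; it also shows why rounded columns can miss 100.

```python
def visible_topics(shares_by_corpus, threshold=0.01):
    """Topics whose share reaches the threshold in at least one corpus."""
    topics = set()
    for shares in shares_by_corpus.values():
        topics.update(shares)
    return sorted(
        t for t in topics
        if any(s.get(t, 0.0) >= threshold for s in shares_by_corpus.values())
    )


def rounded_percentages(shares, topics):
    """Whole-number percentages for the given topics in one corpus."""
    return {t: round(100 * shares.get(t, 0.0)) for t in topics}


# Made-up topic shares (fractions of articles) for two hypothetical corpora.
shares_by_corpus = {
    "CorpusX": {"Oncology": 0.334, "Immunology": 0.334,
                "Pharmacology": 0.328, "Toxicology": 0.004},
    "CorpusY": {"Oncology": 0.996, "Toxicology": 0.004},
}

topics = visible_topics(shares_by_corpus)
row = rounded_percentages(shares_by_corpus["CorpusX"], topics)
print(topics)
print(row, "sum:", sum(row.values()))
```

Toxicology stays below 1% in both corpora and is dropped, and the three remaining CorpusX values each round to 33, giving a column sum of 99.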

Journal topic distribution per corpus (%)

Columns: AnatEM, BC5CDR, BioID, CHEMDNER, CRAFT, CellLink, JNLPBA, NCBI-Disease, NLM-Chem.
Values are listed in column order with blank cells skipped; only rows with nine values map one-to-one onto the corpora.

Multidisciplinary: 4%, 2%, 5%, 6%, 12%, 8%
Cell & developmental biology: 5%, 10%, 1%, 13%, 10%, 6%, 2%, 2%
Molecular biology / biochemistry: 7%, 2%, 62%, 12%, 8%, 6%, 16%, 3%, 20%
Genetics/genomics: 4%, 8%, 2%, 30%, 5%, 4%, 56%, 3%
Neuroscience & neurology: 2%, 10%, 1%, 7%, 4%, 3%, 3%
Microbiology/pathogenesis: 3%, 2%, 2%, 6%, 1%
Pharmacology: 3%, 9%, 24%, 2%
Toxicology: 3%, 11%
Oncology: 13%, 4%, 3%, 2%, 4%, 2%, 3%
Public health / health services: 3%, 1%
Chemistry / Materials Science: 4%, 20%, 8%, 4%, 26%
Immunology: 1%, 2%, 12%, 22%, 1%
Psychiatry & psychology: 2%, 7%, 1%, 3%
Health disciplines: 6%, 6%, 2%, 2%, 3%, 1%, 3%
General biology / anatomy / physiology: 5%, 2%, 6%, 7%, 14%, 18%, 4%, 3%, 5%
General natural sciences: 4%, 2%, 6%, 12%, 11%, 4%, 3%, 8%
General / internal medicine: 14%, 21%, 7%, 3%, 5%, 11%, 11%, 8%, 11%
Nutrition, metabolism, and food science: 5%
Surgery / anesthesia / perioperative: 3%, 9%
Diagnostics / pathology / radiology: 4%, 3%
Pediatrics / reproductive / developmental medicine: 1%, 3%, 2%
Clinical specialties by organ system: 10%, 16%, 4%, 6%, 4%, 5%, 2%, 1%
Total shown: 100%, 100%, 100%, 100%, 100%, 100%, 100%, 100%, 100%
Topics are high-level MeSH-derived journal categories resolved from the journal record's NLM Catalog MeSH topics, with configured journal-name fallback topics for journals that do not have MeSH topics. Only topics with ≥ 1% share in at least one corpus are shown. Dominant value per row is bold. Percentages may not sum to exactly 100 due to rounding.

Deprecated terms summary

Corpus | Terminology | Total concepts | Deprecated concepts | Resolvable identifier rate

Resolvable identifier rate

Coverage rates for all corpora.
Coverage counts only identifiers whose resource is associated with the selected terminology.

Annotation depth distribution

Terminology coverage

Terminology coverage = unique corpus concept count in branch ÷ total terminology concepts in that branch. Only branches with signal in the selected scope are shown.
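The ratio above can be sketched directly with sets. The function name and the identifiers are placeholders, and treating an empty branch as zero coverage is an assumption.

```python
def terminology_coverage(corpus_concepts, branch_concepts) -> float:
    """Unique corpus concepts found in a branch / all concepts in that branch."""
    branch = set(branch_concepts)
    if not branch:
        return 0.0  # assumption: an empty branch has no coverage
    return len(set(corpus_concepts) & branch) / len(branch)


# Placeholder identifiers: the branch holds four concepts, the corpus uses two of them.
branch = {"T1", "T2", "T3", "T4"}
corpus = ["T1", "T1", "T3", "X9"]  # repeats count once; X9 lies outside the branch

print(terminology_coverage(corpus, branch))
```

Because the numerator counts unique concepts, a concept annotated a thousand times contributes the same as one annotated once.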

Annotation topic coverage

Annotation topic coverage = annotation-weighted branch count ÷ all identifiers for that corpus and entity scope, including deprecated identifiers in the denominator.
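A minimal sketch of this ratio, assuming "annotation-weighted" means that every annotated identifier occurrence counts once and that deprecated identifiers remain in the denominator even though they fall in no branch. The function name and identifiers are placeholders.

```python
def annotation_topic_coverage(annotation_ids, branch_concepts) -> float:
    """Identifier occurrences inside a branch / all identifier occurrences."""
    ids = list(annotation_ids)
    if not ids:
        return 0.0
    branch = set(branch_concepts)
    in_branch = sum(1 for concept_id in ids if concept_id in branch)
    return in_branch / len(ids)


# One identifier per annotation; OLD7 stands in for a deprecated identifier.
branch = {"T1", "T2", "T3", "T4"}
annotation_ids = ["T1", "T1", "T3", "OLD7", "X9"]

print(annotation_topic_coverage(annotation_ids, branch))
```

Here the repeated T1 counts twice, so frequently annotated concepts weigh more than rare ones.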