Utilora

DOI, arXiv, PMID, ISBN: A Field Guide to Persistent Citation Identifiers

Every academic identifier has its own format, its own resolver, and its own quirks. This is the practical guide to recognizing them in the wild and pulling them out of messy text.

When you're reading a survey paper and you want to follow up on a reference, you don't look at the author and title. You look at the identifier. If there's a DOI you click it; if there's an arXiv ID you paste it into arxiv.org; if there's a PMID you punch it into PubMed. The identifier is the address. The author and title are the label on the door.

This post walks through the four identifiers researchers actually use — DOI, arXiv ID, PMID, and ISBN — plus the unstructured URL fallback. For each: what the identifier means, what format to look for, where to resolve it, and the parsing gotchas that bite when you try to extract them from messy text.

Why Persistent Identifiers Exist

URLs change. Journal websites get redesigned, publishers merge, preprint servers move. A reference like http://arxiv.org/pdf/cs/0601001.pdf worked in 2008 and still works today, but only because arXiv maintained the redirect. Many URL-based references rot within a decade.

Persistent identifiers solve this by separating the name of a document from its location. The DOI registry, arXiv, and PubMed all commit to keeping identifiers resolvable for the long term: one issued decades ago should still resolve decades from now. The actual URL behind the identifier can change; the identifier doesn't.

This is the same insight as DNS. The IP address of a server can change without your bookmarks breaking, because the bookmark points at the domain name, not the IP. Persistent identifiers are domain names for documents.

DOI: The Universal Identifier

A DOI (Digital Object Identifier) is the closest thing the scholarly world has to a universal identifier. Issued by the International DOI Foundation through registration agencies (mainly Crossref for journal articles, DataCite for datasets), a DOI is a string of the form:

10.<registrant>/<suffix>

The 10. prefix identifies the scheme. The registrant is a numeric code assigned to the publisher or registration agency (e.g., 10.1145 is the ACM, 10.1109 is IEEE, 10.1038 is Nature). The suffix is whatever the registrant chose — sometimes structured, sometimes opaque.

Examples:

  • 10.1145/3580305.3599362 — ACM, KDD 2023 proceedings
  • 10.1038/nature14539 — Nature, the LeCun/Bengio/Hinton deep learning review

To resolve a DOI, prepend https://doi.org/. The DOI resolver redirects to the publisher's landing page for the document.

Parsing gotchas. DOIs are case-insensitive, so lowercase them before comparing or deduplicating. The suffix can contain any printable ASCII character, including /, :, ;, and parentheses. The most common parsing mistake is stopping at the first slash or punctuation mark, which truncates any DOI with a structured suffix.

A robust regex anchors on 10. followed by digits, then captures everything until whitespace or the end of the line. This will sometimes over-capture trailing punctuation (a period or comma at the end of a sentence) but it's safer to over-capture and trim than to under-capture and lose the suffix.
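The over-capture-then-trim approach described above can be sketched in Python. This is a minimal illustration, not the Utilora implementation: the 4-to-9-digit registrant length, the set of trailing characters, and the function names are assumptions, and the third DOI in the usage note is a made-up placeholder.

```python
import re

# Anchor on "10.", a numeric registrant, a slash, then everything up to whitespace.
# Assumption: registrant codes are 4 to 9 digits long.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")

# Characters that are far more likely to end a sentence than a DOI (assumption).
TRAILING = ".,;:!?\"'"
PAIRS = {")": "(", "]": "["}


def trim_doi(candidate: str) -> str:
    """Strip sentence punctuation and unbalanced closing brackets from the end."""
    doi = candidate.rstrip(TRAILING)
    # Drop a closing bracket only when it has no opening partner inside the DOI.
    while doi and doi[-1] in PAIRS and doi.count(doi[-1]) > doi.count(PAIRS[doi[-1]]):
        doi = doi[:-1].rstrip(TRAILING)
    return doi


def extract_dois(text: str) -> list[str]:
    """Return every DOI-shaped string in text, trimmed and lowercased."""
    return [trim_doi(m.group(0)).lower() for m in DOI_RE.finditer(text)]
```

Calling extract_dois on "See (doi:10.1145/3580305.3599362). Also 10.1234/(abc)2024:1<5>x;2." keeps the parentheses and punctuation inside the second suffix while dropping the closing bracket and the sentence-ending periods.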

arXiv ID: Preprint Server Identifier

arXiv has used two ID formats over its history.

The old format (pre-April 2007) was <subject-class>/<YYMMnnn>, e.g. math.GT/0601001 for a January 2006 preprint in geometric topology. The subject class is the section of arXiv; the number is the two-digit year and month followed by a three-digit sequence number.

The modern format (April 2007 onward) is <YYMM>.<NNNNN> with optional v<version>, e.g. 2305.12345v2. Five-digit numbers became standard in 2015; older modern-format IDs are four digits.

Both formats still resolve. Prepend https://arxiv.org/abs/ to either to land on the preprint page. The version suffix points at a specific version; omitting it returns the latest.

Parsing gotchas. The old subject-class IDs are easy to confuse with paths in URLs. The modern format is easy to confuse with dates if the surrounding text is messy. A regex that requires the structure exactly (four digits, dot, four or five digits, optional v<digits>) handles the large majority of cases.

The Utilora Citation Extractor matches the modern format. The old format is rarer in current writing but worth a manual pass if you're working with literature pre-2010.
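A strict modern-format pattern might look like the following Python sketch. It is an illustration rather than the extractor's actual regex; requiring the month digits to fall in 01 to 12 and rejecting matches embedded in longer numbers are extra assumptions added here to cut down on date-like false positives.

```python
import re

# YYMM (month restricted to 01-12), a dot, 4 or 5 digits, optional version suffix.
# The lookbehind/lookahead keep the pattern from matching inside longer numbers.
ARXIV_RE = re.compile(
    r"(?<![\d.])(\d{2}(?:0[1-9]|1[0-2])\.\d{4,5})(v\d+)?(?!\d)"
)


def extract_arxiv_ids(text: str) -> list[str]:
    """Return modern-format arXiv IDs, keeping any version suffix."""
    return [m.group(0) for m in ARXIV_RE.finditer(text)]


def arxiv_url(arxiv_id: str) -> str:
    """Build the abstract-page URL for an ID."""
    return "https://arxiv.org/abs/" + arxiv_id
```

A date such as 2023.05.12 is skipped because "23" is not a valid month and the remaining fragments are too short to fit the pattern.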

PMID: PubMed Identifier

PMIDs are integers assigned by PubMed to each indexed biomedical publication. They look like 12345678 — typically 7 or 8 digits, no structure beyond being sequential.

The catch with PMIDs is that they're indistinguishable from random integers. A regex that catches "any 8-digit number" will hit hundreds of false positives in a typical biomedical paper — page numbers, sample sizes, gene IDs, dates. The convention is therefore to only match PMIDs when they're explicitly labeled:

PMID: 12345678
PMID:12345678
PubMed ID 12345678

Without the label, you're guessing. The Citation Extractor requires the PMID: prefix for this reason.

To resolve a PMID, use https://pubmed.ncbi.nlm.nih.gov/<pmid>/. The PubMed landing page links to the publisher's full text and includes a DOI when one exists.
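The label-required convention is simple to express in code. The Python sketch below is one way to do it, not the extractor's own pattern: accepting "PubMed ID" as a label, allowing 1 to 8 digits, and the function names are assumptions.

```python
import re

# Match a number only when it follows an explicit label.
# Assumption: 1 to 8 digits, with an optional colon and optional spaces after the label.
PMID_RE = re.compile(r"\b(?:PMID|PubMed\s+ID)\s*:?\s*(\d{1,8})\b", re.IGNORECASE)


def extract_pmids(text: str) -> list[str]:
    """Return the numeric part of every labeled PMID."""
    return [m.group(1) for m in PMID_RE.finditer(text)]


def pubmed_url(pmid: str) -> str:
    """Build the PubMed landing-page URL for a PMID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
```

An unlabeled number such as a sample size is ignored, which is the whole point of requiring the label.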

ISBN: The Book Identifier

ISBNs identify books. They come in two lengths: ISBN-10 (used through 2006) and ISBN-13 (the current standard, identical to the EAN-13 barcode on the back of physical books).

ISBN-13 format: 978- or 979- prefix, then a 9-digit body, then a single check digit. The hyphens are optional in the canonical form but ubiquitous in print. Examples:

  • 978-0-13-110362-7: The C Programming Language, 2nd edition
  • 9780131103627: the same ISBN, no hyphens

ISBN-10 format: 9 digits plus a check digit (which can be X for the value 10). Hyphens still optional.

Parsing gotchas. Both formats are easy to extract with a regex if you allow optional hyphens. The check digit calculation is deterministic — a strict extractor can validate the check digit to reject false matches. The Utilora extractor accepts the format pattern without validating the check digit, on the theory that it's better to over-extract and let the user decide than to silently drop valid identifiers that happen to have typos.

To resolve an ISBN, use the WorldCat catalog: https://www.worldcat.org/isbn/<isbn>. WorldCat aggregates library holdings worldwide and links to publishers, Google Books, and library copies.
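For anyone who does want the strict behavior, the check digit calculation mentioned above is short. The sketch below uses the standard ISBN-13 and ISBN-10 checksum rules; the function names are illustrative, and this is not what the Utilora extractor does, since it skips validation.

```python
import re


def isbn13_is_valid(isbn: str) -> bool:
    """Check an ISBN-13: weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10."""
    digits = isbn.replace("-", "").replace(" ", "")
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def isbn10_is_valid(isbn: str) -> bool:
    """Check an ISBN-10: weights run 10 down to 1; the sum must be divisible by 11."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        return False
    # "X" is only allowed in the last position and stands for the value 10.
    values = [10 if c == "X" else int(c) for c in chars]
    total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
    return total % 11 == 0
```

Running isbn13_is_valid on 978-0-13-110362-7 returns True; changing the final digit to 8 makes it return False, which is how a strict extractor would reject a typo.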

URLs as the Last-Resort Identifier

For documents without DOIs, arXiv IDs, PMIDs, or ISBNs — blog posts, technical reports, software repositories, government documents — a URL is the only persistent identifier you have. URLs are weaker than the other identifiers because they rot, but they're better than nothing.

When you cite a URL, capture the access date alongside it. If the page changes or disappears, the access date tells your reader when you saw the version you're citing. For high-stakes citations, archive the page on the Wayback Machine and cite both the original URL and the archive snapshot.

Extracting URLs from text is straightforward — any http:// or https:// prefix followed by non-whitespace characters. The gotchas are trailing punctuation (a period at the end of a sentence becomes part of the URL match unless you trim it) and Markdown link syntax (you want the URL inside (...), not the link text inside [...]).
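Both gotchas can be handled with the same trim step. The following Python sketch is a rough illustration under stated assumptions: the excluded characters, the trailing-punctuation set, and the balanced-parenthesis heuristic are choices made here, and back-to-back Markdown links with no whitespace between them would still be captured as one match.

```python
import re

# Any http:// or https:// prefix followed by non-whitespace characters.
# Assumption: angle brackets and double quotes never belong to the URL.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

# Characters treated as sentence punctuation when they end a match (assumption).
TRAILING = ".,;:!?'"


def extract_urls(text: str) -> list[str]:
    """Return URLs with trailing punctuation and unbalanced ')' removed."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING)
        # Markdown "[text](url)" leaves a stray ")" on the match; drop it unless
        # the URL itself contains a matching "(".
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1].rstrip(TRAILING)
        urls.append(url)
    return urls
```

Counting parentheses rather than always stripping a closing one keeps URLs such as https://en.wikipedia.org/wiki/ISBN_(identifier) intact.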

The Extraction Use Case

Researchers do citation extraction in three contexts:

Literature review. You're reading a survey paper and want every identifier from its references section to build a reading list. Manual extraction is tedious; an automated tool gives you a deduped list in seconds.

Reference audit. You're reviewing your own draft and want to confirm every citation has a resolvable identifier. Run the extractor on the draft, compare against your .bib, fill in gaps.

Batch lookup. You have a list of identifiers and want to fetch metadata for all of them via Crossref or Unpaywall. The extractor produces the input; the API does the lookup.

For all three, the friction point is the messy input — references sections with line breaks in the middle of identifiers, PDFs whose text layer mangles characters, email threads with identifiers buried in URLs. A good extractor handles the noise so you don't have to.
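For the batch-lookup case, the hand-off from extractor to API can be as small as turning each DOI into a request URL. The Python sketch below only builds the URLs and does not fetch anything; it assumes the Crossref REST API's works endpoint at https://api.crossref.org/works/ followed by the DOI, and the function name is illustrative.

```python
from urllib.parse import quote

# Assumed base URL for Crossref's per-work metadata endpoint.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_urls(dois: list[str]) -> list[str]:
    """Build one metadata-lookup URL per DOI, percent-encoding unsafe characters."""
    # quote() leaves "/" alone by default, so the DOI's slash survives,
    # while characters such as "<" are percent-encoded.
    return [CROSSREF_WORKS + quote(doi) for doi in dois]
```

Fetching each URL, handling rate limits, and parsing the JSON response are left out on purpose; they belong to the lookup step, not the extraction step.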

Deduplication

Identifiers are unique per document, but a single document often appears in text under multiple identifiers. A paper might be cited as a DOI in one place, an arXiv ID in another, and a publisher URL in a third — all pointing at the same work.

Deduplication by identifier value (case-insensitive, ignoring URL prefixes) collapses obvious duplicates. Cross-format deduplication (matching a DOI to its arXiv equivalent) requires looking up metadata, which means external API calls. Most client-side tools, including ours, dedupe within type only — the manual cross-format pass is a worthwhile finishing step for short reading lists.
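Within-type deduplication comes down to normalizing each value before comparing. The Python sketch below is a minimal version, assuming the extractor hands back (kind, value) pairs; the prefix list, the kind labels, and the hyphen-stripping for ISBNs are assumptions made for illustration.

```python
import re

# Resolver URLs and scheme labels to ignore when comparing values (assumed list).
PREFIX_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|https?://arxiv\.org/abs/|doi:|arxiv:)",
    re.IGNORECASE,
)


def normalize(kind: str, value: str) -> str:
    """Reduce a value to a comparison key: no prefix, lowercase, no ISBN hyphens."""
    key = PREFIX_RE.sub("", value.strip()).lower()
    if kind == "isbn":
        key = key.replace("-", "").replace(" ", "")
    return key


def dedupe(found: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first spelling of each identifier, comparing within type only."""
    seen = set()
    unique = []
    for kind, value in found:
        key = (kind, normalize(kind, value))
        if key not in seen:
            seen.add(key)
            unique.append((kind, value))
    return unique
```

Because the key includes the kind, a DOI and an arXiv ID for the same paper both survive, which matches the within-type behavior described above.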

What Citation Extraction Won't Do

A few things to keep in mind about client-side extractors:

  • No metadata. The extractor returns identifiers, not titles or authors. To get the title, you need a Crossref or PubMed API call.
  • No validation. The extractor matches format, not existence. A typo-ridden DOI that happens to match the pattern will be extracted; only a resolver call will tell you it points at nothing.
  • No formatting cleanup. If the source text has a DOI split across two lines, the extractor's regex won't reassemble it. Pre-process the text to remove line breaks inside identifiers.

These limits are inherent in any extractor that respects the "client-side only" promise. If you need metadata, run the extracted list through a Crossref / Unpaywall / Semantic Scholar batch lookup as a separate step.

Best Practices for Your Own Citations

A short list of habits that make your own references easier for others to extract:

  1. Always include a DOI when one exists. Crossref provides them for the majority of journal articles. They're the strongest identifier.
  2. For preprints, cite the arXiv ID with the version. A versioned ID (2305.12345v2) is reproducible; an unversioned one drifts as the author revises.
  3. Prefer the resolver URL to the publisher URL. https://doi.org/10.1145/3580305.3599362 is more useful than https://dl.acm.org/doi/10.1145/3580305.3599362 because the resolver URL is designed to stay stable, while the publisher URL can change.
  4. Avoid using URL shorteners for identifiers. A bit.ly link is a single point of failure. The full DOI URL is the persistent form.
  5. In Markdown, link the identifier. [Doe et al. 2024](https://doi.org/10.1234/abc) reads naturally and lets the reader click straight through to the source.

Conclusion

Persistent identifiers are the backbone of academic citation. DOIs cover most journal articles. arXiv IDs cover preprints. PMIDs cover biomedical literature. ISBNs cover books. URLs fill the gaps. Each identifier has a format you can recognize on sight and a resolver you can paste it into.

For bulk extraction — building reading lists, auditing drafts, prepping batch lookups — the Utilora Citation Extractor finds all five types in a single pass, dedupes the results, and gives you clickable links to each canonical resolver. Everything runs in your browser; pasted text never leaves the tab.

Pair it with BibTeX Formatter when you need to clean up the .bib file the extracted identifiers eventually end up in, and Markdown Table Generator when you want to share a reading list as a properly aligned table with collaborators.
