BibTeX Hygiene: Why Your .bib File Breaks at the Worst Possible Moment
Bibliographies rot. A clean .bib file at submission becomes a tangled mess of duplicates, missing fields, and inconsistent formatting six months later. Here's how the rot happens and how to keep your references honest.
Every academic project starts with a clean .bib file. Three co-authors and two revisions later, the file looks like a thrift store: entries with no year, two different keys for the same paper, indentation drifting between two spaces and tabs, and a stray @misc that someone pasted from Google Scholar and never finished editing. The LaTeX compile still works until it doesn't, and you find out at submission that half your citations are rendering as [?].
This post is about the failure modes of long-lived BibTeX files, what to check before you ship a draft, and how to keep a bibliography healthy as it grows.
Why .bib Files Decay
A bibliography is a side effect of collaboration. Each new reference enters the file from a different source: Google Scholar's "Cite" button, Zotero's export, a Mendeley plugin, a hand-typed entry copied from a PDF, a colleague's .bib snippet pasted into Slack. Each source has its own conventions for field order, indentation, citation key format, and even which fields it bothers to fill in.
After a year of accretion, four problems show up:
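- Duplicate citation keys. Two co-authors each added an entry under the same key, sometimes for different papers. Classic BibTeX reports a "Repeated entry" error and keeps only the first entry it reads, and that message is easy to miss in a long compile log.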
- Missing required fields. An @inproceedings entry without booktitle will compile but render badly. An @article without journal can leave a gap or stray punctuation where the venue belongs.
- Inconsistent formatting. The first half of the file uses two-space indentation; the second half uses tabs. Field order varies. Some entries quote values, others use braces.
- Stale duplicates. The same paper appears under three different keys — one for a 2020 arXiv preprint, one for the 2021 conference version, one for the 2023 journal extension. Citing different versions inadvertently is embarrassing.
None of these problems necessarily break the compile. They produce quietly wrong output, which is much harder to notice than a hard error.
What BibTeX Actually Requires
Each entry type has a list of required and optional fields. BibTeX itself enforces almost nothing — most BibTeX styles (.bst files) emit warnings rather than errors, and the warnings often scroll past in the compile log.
The required fields per common type are roughly:
- article: author, title, journal, year (volume/number/pages are strongly recommended)
- book: author or editor, title, publisher, year
- inproceedings: author, title, booktitle, year
- incollection: author, title, booktitle, publisher, year
- techreport: author, title, institution, year
- phdthesis / mastersthesis: author, title, school, year
- misc: nothing strictly required, which is why @misc becomes a dumping ground
When a required field is missing, the rendered bibliography looks fine to the author (who knows the missing data and mentally fills it in) and broken to the reader (who sees a citation without a venue, year, or pages). The author rarely catches this because they have selective blindness for their own bibliography.
The Citation Key Problem
Citation keys are the identifiers you use in \cite{...}. They are also the only notion of identity BibTeX has: it never compares titles or DOIs. Two entries with the same key collide; classic BibTeX reports a "Repeated entry" error and keeps the first. Two entries with different keys for the same paper both render if both are cited, producing duplicate citations in your bibliography.
Most reference managers generate keys algorithmically — AuthorYearTitle or lastnameYY patterns. The patterns differ between tools, so when collaborators import the same paper through different tools they get different keys. The cleanup step is manual: scan for duplicate titles, pick one canonical key, and rewrite \cite{} calls everywhere the duplicate is used.
A defensive habit: when you add a paper, search the .bib file for the title first. If it's already there under a different key, use the existing one. If you're merging .bib files from collaborators, run a dedup pass before you start rewriting \cite{} calls in the manuscript.
Sort Order Matters
A sorted .bib file is easier to maintain. Sort order matters in three ways:
- Diffs. Sorted files produce minimal diffs when collaborators add entries. Unsorted files produce noisy diffs where the same entry moves around between commits.
- Conflict resolution. When two collaborators add entries to the same region of an unsorted file, git produces merge conflicts that are tedious to resolve. Sorted files have predictable insertion points.
- Manual scanning. When you're looking for an entry by author or by year, a sorted file lets you stop at the right region instead of grepping the whole thing.
Sorting by citation key is the safest default: it's stable and unambiguous. Sorting by year is useful for chronological reviews. Sorting by first author groups each author's entries together. The "best" sort depends on the use case, but pick one and stick with it across the project.
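Sorting by key is simple enough to sketch. The Python below assumes every entry header starts a line as @type{key, and treats everything before the first entry as a preamble to keep on top; sort_by_key is a name made up for this example. It does not special-case @string or @comment blocks, which would travel with the entry above them.

```python
import re

# Assumption: every entry header starts a line as "@type{key,".
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def sort_by_key(bib_text):
    """Return the file with entries reordered by lowercase citation key."""
    headers = list(ENTRY_START.finditer(bib_text))
    if not headers:
        return bib_text
    preamble = bib_text[:headers[0].start()]
    chunks = []
    for i, head in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bib_text)
        chunks.append((head.group(2).lower(), bib_text[head.start():end].strip()))
    chunks.sort(key=lambda chunk: chunk[0])  # stable: ties keep file order
    return preamble + "\n\n".join(text for _, text in chunks) + "\n"
```

Running the function on its own output returns the same text, which is what keeps diffs small. One caveat: a pure key sort can move a cross-referenced parent entry ahead of the entries that reference it, so files that use crossref need those parents moved back to the end afterwards.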
Formatting Conventions
Within an entry, three formatting choices recur:
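Braces vs. quotes. author = {Knuth, Donald E.} vs. author = "Knuth, Donald E.". Both are legal. Braces are safer as delimiters because the value can contain double quotes without any extra protection. Neither delimiter preserves capitalization on its own: many styles lowercase title words after the first, and only an inner pair of braces, as in title = {A Study of {B}ayesian Methods}, keeps a word's case.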
Field order. Most reference managers emit fields in their own preferred order. The styles don't care, but inconsistent order makes scanning harder. A common convention: author/editor first, then title, then venue (journal/booktitle/publisher), then year, then volume/number/pages, then identifiers (doi, url, isbn), then notes.
Indentation. Two-space indentation for fields is a common convention. The actual width matters less than consistency across the file.
A formatter that re-emits each entry with consistent indentation, field order, and quote style erases all three of these inconsistencies in one pass.
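As a rough illustration of such a formatter, the Python sketch below re-emits a single entry with two-space indentation, the field order suggested above, and braces as delimiters. It is deliberately small: it does not handle # string concatenation or quotes protected by braces, and FIELD_ORDER, read_value, and format_entry are names invented for this example.

```python
import re

# Field order suggested above; unlisted fields go last, alphabetically.
FIELD_ORDER = ["author", "editor", "title", "journal", "booktitle", "publisher",
               "year", "volume", "number", "pages", "doi", "url", "isbn", "note"]
HEAD_RE = re.compile(r"\s*@(\w+)\s*\{\s*([^,\s]+)\s*,")
NAME_RE = re.compile(r"[\s,]*(\w+)\s*=\s*")
BARE_RE = re.compile(r"[^\s,}]+")


def read_value(text, i):
    """Return (value, was_delimited, next_index) for the value at text[i]."""
    if i >= len(text):
        raise ValueError("missing field value")
    if text[i] == "{":
        depth = 0
        for j in range(i, len(text)):
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
                if depth == 0:
                    return text[i + 1:j], True, j + 1
        raise ValueError("unbalanced braces")
    if text[i] == '"':
        j = text.index('"', i + 1)  # sketch: ignores quotes protected by braces
        return text[i + 1:j], True, j + 1
    bare = BARE_RE.match(text, i)  # a bare number or an @string macro name
    if bare is None:
        raise ValueError("missing field value")
    return bare.group(0), False, bare.end()


def format_entry(entry_text, indent="  "):
    """Re-emit one entry with consistent indentation, order, and braces."""
    head = HEAD_RE.match(entry_text)
    if head is None:
        raise ValueError("not a BibTeX entry")
    fields, i = {}, head.end()
    while True:
        name = NAME_RE.match(entry_text, i)
        if name is None:
            break
        value, delimited, i = read_value(entry_text, name.end())
        fields[name.group(1).lower()] = (" ".join(value.split()), delimited)
    rank = {field: n for n, field in enumerate(FIELD_ORDER)}
    ordered = sorted(fields, key=lambda f: (rank.get(f, len(FIELD_ORDER)), f))
    lines = [f"@{head.group(1).lower()}{{{head.group(2)},"]
    for field in ordered:
        value, delimited = fields[field]
        # Bare values (numbers, macros) stay bare so macros are not broken.
        rendered = f"{{{value}}}" if delimited else value
        lines.append(f"{indent}{field} = {rendered},")
    lines.append("}")
    return "\n".join(lines)
```

Bare values are left undelimited on purpose: wrapping an @string macro name in braces would turn it into literal text.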
The Submission-Day Audit
A pre-submission checklist for any paper-length bibliography:
- Dedupe by key. Look for collisions. Run a script (or use a tool) that flags multiple entries with the same key.
- Dedupe by title. Look for the same paper under different keys. This is harder to automate because titles vary slightly between sources — a manual pass is worth it.
- Required-field check. For every entry, confirm the required fields for its type are present. An automated check is fast; a manual pass catches the cases where a field is present but obviously wrong (an @inproceedings whose booktitle is "Proceedings of the").
- Citation existence check. Run bibtex and look at the log for "I didn't find a database entry for X" warnings. Each one is a \cite{} in your manuscript that resolves to nothing.
- Unused entries. Run a reverse check: which entries in your .bib are never cited? Some authors keep them as a reading list; others prefer a tight, fully-cited .bib.
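The key dedupe and the required-field check are easy to automate; the title dedupe still needs a human eye. The Utilora BibTeX Formatter handles the key dedupe, the required-field check, and the formatting pass: paste the file, pick sort and dedupe options, and copy the cleaned result back.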
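The citation existence check and the unused-entry check are two sides of one comparison: the set of keys cited in the manuscript against the set of keys defined in the .bib. Here is a hedged Python sketch; the regular expression covers \cite, \nocite, and natbib-style variants such as \citep and \citet, matches keys exactly (including case), and does not follow \input or \include. The names cited_keys and audit are hypothetical.

```python
import re

# \cite, \citep, \citet, \nocite, ... with optional [..] arguments.
CITE_RE = re.compile(r"\\(?:no)?cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Entry headers at the start of a line, skipping @string/@comment/@preamble.
KEY_RE = re.compile(r"^@(?!string|comment|preamble)\w+\s*\{\s*([^,\s]+)\s*,",
                    re.MULTILINE | re.IGNORECASE)


def cited_keys(tex_text):
    """Collect every key that appears inside a \\cite-like command."""
    keys = set()
    for group in CITE_RE.findall(tex_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    keys.discard("*")  # \nocite{*} means "everything", not a key
    return keys


def audit(tex_text, bib_text):
    """Return (cited but not defined, defined but never cited)."""
    cited = cited_keys(tex_text)
    defined = set(KEY_RE.findall(bib_text))
    return sorted(cited - defined), sorted(defined - cited)
```

The first list should be empty before submission. The second is a judgment call, as noted in the checklist.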
Author Name Normalization
The same author can be written several ways: "Donald E. Knuth", "D. E. Knuth", "Donald Ervin Knuth", "Knuth, Donald E.", or "Knuth, D.". BibTeX parses all of these, but they don't always produce identical output through every style. The "Last, First" form is the safest because it tells BibTeX explicitly which tokens make up the last name.
For multi-author entries, separate names with the word and (lowercase is the convention). Do not separate authors with commas alone: BibTeX reads Smith, J. and Doe, J. as two authors, but reads Smith, J., Doe, J. as one malformed name and warns about too many commas.
For long author lists, the others keyword expands to "et al." in most styles: author = {Smith, J. and Doe, J. and others}. This is more robust than typing "et al." literally, which some styles will treat as another author name.
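The separator rules above can be checked mechanically. The Python sketch below splits an author value on the word and outside braces, then flags any single name containing more than two commas, the point at which the "von Last, Jr, First" form runs out of slots. It approximates BibTeX's name parsing rather than reimplementing it, and both function names are invented for this example.

```python
def split_authors(author_field):
    """Split an author value on the word 'and' at brace depth zero."""
    names, current, depth = [], [], 0
    for word in author_field.split():
        if depth == 0 and word.lower() == "and":
            names.append(" ".join(current))
            current = []
        else:
            current.append(word)
        # Track braces so "{Barnes and Noble}" stays one name.
        depth += word.count("{") - word.count("}")
    names.append(" ".join(current))
    return [name for name in names if name]


def suspicious_names(author_field):
    """Names with more than two commas, likely a comma-separated list."""
    return [n for n in split_authors(author_field) if n.count(",") > 2]
```

A name with exactly two commas is legal and is not flagged here, so an accidental Smith, J., Doe would pass; treat the check as a first filter.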
Cross-References and Strings
Two BibTeX features that reduce duplication in large bibliographies:
@string definitions let you abbreviate venue names: @string{JACM = "Journal of the ACM"}, then use journal = JACM in entries. Useful if you cite the same venue many times.
crossref lets one entry inherit fields from another. The classic use is conference proceedings: define a @proceedings entry for the conference, then have each @inproceedings cross-reference it for booktitle, year, editor, etc.
Both features are powerful and both are common sources of subtle bugs. If you use crossref, the referenced entry must appear after the entries that reference it in the .bib file (yes, this is the opposite of intuition). If you use @string, make sure your reference manager preserves the macros on export — many tools expand them silently.
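Because the crossref ordering rule is easy to violate after a sort or a merge, it is worth a mechanical check. The following Python sketch reports every entry whose crossref target is missing or appears earlier in the file. It assumes headers start a line, that crossref values are written with braces or quotes, and that keys compare case-insensitively; crossref_order_problems is a hypothetical name.

```python
import re

ENTRY_RE = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)
CROSSREF_RE = re.compile(r"^\s*crossref\s*=\s*[{\"]([^}\"]+)[}\"]",
                         re.IGNORECASE | re.MULTILINE)


def crossref_order_problems(bib_text):
    """List (child key, parent key, reason) for each broken crossref."""
    headers = list(ENTRY_RE.finditer(bib_text))
    position = {h.group(2).lower(): i for i, h in enumerate(headers)}
    problems = []
    for i, head in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bib_text)
        ref = CROSSREF_RE.search(bib_text, head.end(), end)
        if ref is None:
            continue
        parent = ref.group(1).strip().lower()
        if parent not in position:
            problems.append((head.group(2), parent, "parent not found"))
        elif position[parent] < i:
            problems.append((head.group(2), parent, "parent appears before child"))
    return problems
```

An empty list means every cross-referenced parent sits after the entries that point at it.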
When to Skip BibTeX Entirely
For papers, BibTeX-via-LaTeX remains the most reliable pipeline despite its quirks. For other writing — blog posts, books, technical documentation — formats like Markdown plus CSL (Citation Style Language) via Pandoc are often more convenient. The choice usually comes down to the publisher: if the venue requires LaTeX submission, BibTeX is the path of least resistance.
For longer-running projects, consider a hybrid: store references in Zotero (which handles deduplication and metadata better than a raw .bib file), then export to BibTeX for the LaTeX manuscript. The Zotero library is the source of truth; the .bib is generated.
Identifying Identifiers
The hardest references to clean up are the ones missing identifiers — no DOI, no arXiv ID, no ISBN. They're hard to deduplicate (titles vary across sources) and they make readers hunt to find the actual paper. When you add a reference, take the extra ten seconds to find the DOI or arXiv ID and include it. Your future self and your readers both benefit.
If you've inherited a .bib file from collaborators and the entries are missing identifiers, the Utilora Citation Extractor is the complementary tool: paste the manuscript text, get every DOI, arXiv ID, PMID, and ISBN as a clean list, and then track down the rest manually.
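To find which entries still need that manual tracking down, a short scan of the .bib is enough. This Python sketch lists entries that carry none of a chosen set of identifier fields. The set used here (doi, eprint, isbn, pmid) is an assumption for illustration, with eprint standing in for arXiv IDs as in arXiv's own BibTeX export; adjust it to your project's conventions. The parsing assumptions are the usual ones for a sketch: headers and field names each start a line.

```python
import re

ENTRY_RE = re.compile(r"^@(?!string|comment|preamble)(\w+)\s*\{\s*([^,\s]+)\s*,",
                      re.MULTILINE | re.IGNORECASE)
FIELD_RE = re.compile(r"^\s*(\w+)\s*=", re.MULTILINE)

# Assumed identifier fields; not a BibTeX standard.
ID_FIELDS = {"doi", "eprint", "isbn", "pmid"}


def entries_without_identifier(bib_text):
    """Return the keys of entries that have none of the ID_FIELDS."""
    headers = list(ENTRY_RE.finditer(bib_text))
    missing = []
    for i, head in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bib_text)
        present = {f.lower() for f in FIELD_RE.findall(bib_text[head.end():end])}
        if not present & ID_FIELDS:
            missing.append(head.group(2))
    return missing
```

A field that exists but is empty still counts as present, so an entry with doi = {} will not be reported.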
Conclusion
BibTeX hygiene is one of those background tasks that looks like procrastination right up until the moment it isn't. A messy .bib file produces papers that technically compile but quietly carry duplicate citations, missing fields, and references the reader can't follow. The fix is not heroic; it's a routine: sort, dedupe, validate required fields, scan for missing identifiers. Do it before each submission and the file stays healthy across years of project work.
For ad-hoc cleanup, the Utilora BibTeX Formatter handles sort, dedupe, and validation in a single browser tab — paste, configure, copy back. Pair it with Citation Extractor for identifier audits and Markdown Table Generator when you need to summarize a reading list as a table for collaborators.