Concordance and KWIC: Corpus Linguistics in a Browser Tab
Concordance views were the original tool of corpus linguistics. They're also the right tool when you want to know how a word is actually used, not just how often it appears. Here's the technique and when to reach for it.
When a Bible scholar in the 13th century wanted to find every appearance of a word in scripture, they compiled a concordance — a list of every occurrence with surrounding context. The work took decades. Hugh of Saint-Cher's concordance of the Vulgate, finished around 1230, is the earliest serious attempt. Five hundred years later, concordance-making became a profession; Alexander Cruden's 1737 concordance of the King James Bible took fifteen years of solo work and ran to 740,000 entries.
Today the same operation runs in milliseconds. The technique remains useful — not for theological scholarship, but for any task where you need to see how a word is used, not just how often. This post is about concordances, the KWIC variant, n-gram frequency analysis, and the practical reasons researchers still reach for them.
What a Concordance Actually Is
A concordance is an index of every occurrence of every word in a text, with surrounding context. Each entry is a line: the target word in the middle, a fixed window of characters or words on either side. The classic display format is keyword-in-context (KWIC), where every line is centered on the target word so the eye can scan the column of keywords vertically:
and so it was that he became afraid of the dark, and would not
The little prince was deeply afraid to admit he might be wrong
 there is nothing more to be afraid of, said the fox, gently,
The vertical alignment is the entire point. By scanning the column you can quickly see the collocates — the words that appear alongside the target. Patterns emerge: afraid of, afraid to, afraid of the dark. Concordances make collocates visible in a way that raw frequency tables never can.
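The alignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool: the function names `tokenize`, `kwic`, and `format_kwic` are hypothetical, the tokenizer is a deliberately crude regular expression, and the context window is counted in words.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def kwic(text: str, keyword: str, window: int = 4) -> list[tuple[str, str, str]]:
    """Return (left, keyword, right) triples with `window` words of context per side."""
    tokens = tokenize(text)
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match on the whole token
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


def format_kwic(rows: list[tuple[str, str, str]]) -> list[str]:
    """Right-align the left context so every keyword starts in the same column."""
    if not rows:
        return []
    width = max(len(left) for left, _, _ in rows)
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in rows]


if __name__ == "__main__":
    sample = "He was afraid of the dark. She was not afraid to admit it."
    for line in format_kwic(kwic(sample, "afraid", window=2)):
        print(line)
```

Right-aligning the left context is what produces the keyword column. Everything else is bookkeeping.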
Why Frequency Alone Isn't Enough
A word frequency table tells you that "good" appears 247 times in a document. It doesn't tell you whether "good" is used predicatively ("the food is good"), attributively ("good news"), or as part of a fixed expression ("good morning"). The semantic load differs by context, and a frequency count collapses all of those contexts into one number.
A concordance preserves context. You see all 247 occurrences with their neighbors. Patterns of use jump out: which adverbs modify "good", which nouns follow it, whether it's mostly literal or mostly figurative. Linguists call this collocation analysis, and it's the bread and butter of corpus linguistics.
Frequency tables remain useful as a starting point — they tell you which words are worth investigating. Concordances are the follow-up that tells you what each word is actually doing.
N-grams: Two-Word and Three-Word Patterns
An n-gram is a contiguous sequence of n words. Unigrams (single words), bigrams (pairs), and trigrams (triples) are the most common.
A raw bigram frequency table for English prose is dominated by function-word pairs: "of the", "in the", "to the". With a stopword filter applied, the same table surfaces content patterns: "machine learning", "neural network", "training data".
Trigrams catch longer fixed phrases: "in order to", "in this paper", "as shown in". Trigram tables are useful for detecting stock phrasing in academic writing or marketing copy.
Higher-order n-grams (4-grams, 5-grams) get sparse fast. Most 5-grams appear once in any given document, so the frequency table degenerates into a list of unique sequences. For most analysis, unigrams through trigrams are the sweet spot.
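As a rough sketch of n-gram counting, assuming a simple lowercase regex tokenizer (the names `tokenize`, `ngrams`, and `ngram_counts` are illustrative, not taken from any tool): slide a window of n tokens across the text and count each tuple. Note that this version ignores sentence boundaries, so an n-gram can span a full stop.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text: str, n: int) -> Counter:
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(tokenize(text), n))


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    for gram, count in ngram_counts(sample, 2).most_common(3):
        print(" ".join(gram), count)
```

A text of N tokens yields N - n + 1 n-grams, so the number of n-grams barely changes as n grows. The number of distinct ones climbs quickly, though, which is the sparsity problem described above.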
Stopwords: The Filter That Makes Frequencies Useful
The 100 most common English words ("the", "of", "and", "to", "a", "in", "is", "it", "you", "that", …) make up roughly half of all words in typical running English text. A raw unigram frequency table is therefore dominated by function words. The top-20 list of an English novel is essentially the same list every time.
Filtering out a stopword list before counting reveals the content words that distinguish the text. The choice of stopword list matters:
- Short lists (50–100 words) catch the highest-frequency function words but let less common ones through.
- Long lists (1,000+ words) aggressively prune but can drop words that matter in some contexts (negation, modal verbs).
- Domain-specific lists add field-specific high-frequency words ("paper", "section", "Figure" for academic writing).
A reasonable default is a 100–200-word list of the most common English function words. The Utilora Text Concordance ships with such a list and lets you toggle it off when you want raw frequencies.
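A hedged sketch of stopword filtering follows. The `STOPWORDS` set here is a tiny illustrative sample, not the list any real tool ships, and the function names are hypothetical. For bigrams, one reasonable design is to form the pairs first and then drop any pair containing a stopword. Deleting stopwords before pairing would glue together words that were never adjacent in the text.

```python
import re
from collections import Counter

# Tiny illustrative sample; a practical list would hold 100-200 function words.
STOPWORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "is", "it", "you", "that",
    "he", "was", "for", "on", "are", "as", "with", "his", "they", "at",
})


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def content_counts(text: str, stopwords: frozenset = STOPWORDS) -> Counter:
    """Unigram counts with stopwords removed."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


def content_bigrams(text: str, stopwords: frozenset = STOPWORDS) -> Counter:
    """Counts of adjacent word pairs in which neither word is a stopword."""
    tokens = tokenize(text)
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs if p[0] not in stopwords and p[1] not in stopwords)


if __name__ == "__main__":
    sample = "The training data is in the training set and the training data is clean"
    print(content_counts(sample).most_common(3))
    print(content_bigrams(sample).most_common(2))
```

Passing an empty `frozenset()` as `stopwords` gives the raw counts back, which is the equivalent of toggling the filter off.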
Type-Token Ratio
A simple lexical-diversity metric: the type-token ratio (TTR) is the number of unique words (types) divided by the number of total words (tokens).
- A short text has high TTR (most words are unique).
- A long text has lower TTR (vocabulary saturates).
- A text with lots of repetition has lower TTR than a text with varied vocabulary, at the same length.
TTR varies dramatically with text length, so it's not directly comparable across texts of different sizes. For same-length samples, TTR is a useful rough indicator of vocabulary richness. For comparing texts of different lengths, use moving-average TTR or vocd-D.
For most practical purposes — "is this writer's vocabulary varied or repetitive?" — eyeballing the TTR on a representative sample is enough. The Concordance tool shows TTR alongside the token count for this reason.
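To make the definition concrete, here is a minimal sketch of plain TTR and a moving-average variant. The function names and the default window of 50 tokens are assumptions chosen for illustration. Published MATTR implementations differ in details such as window size and how they handle texts shorter than the window.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ttr(tokens: list[str]) -> float:
    """Type-token ratio: unique tokens divided by total tokens (0.0 if empty)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: the mean TTR over every run of `window` consecutive tokens.

    Falls back to plain TTR when the text is no longer than one window.
    """
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    tokens = tokenize("a b c a b c")
    print(ttr(tokens))              # 3 types over 6 tokens
    print(mattr(tokens, window=3))  # every 3-token window has 3 distinct tokens
```

Because each window has the same length, the moving average removes most of the length dependence that makes raw TTR hard to compare across texts.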
KWIC Window Size
The width of the context window around each keyword is a design choice with real consequences.
- Narrow windows (10–20 characters) show immediate neighbors. Useful for collocation analysis.
- Medium windows (40–80 characters) show the phrase or short clause. Useful for grammatical patterns.
- Wide windows (100+ characters) show the surrounding sentence. Useful for thematic patterns and quote extraction.
A wider window is more informative per occurrence but takes more vertical space, so you see fewer occurrences at once. The right window depends on what you're looking for. Switching between windows is a common workflow — start narrow to spot patterns, then expand to extract specific quotes.
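The trade-off is easy to try with a character-based window. The sketch below assumes whole-word, case-insensitive matching and collapses line breaks so that each occurrence fits on one line. `kwic_chars` and its `width` parameter are hypothetical names.

```python
import re


def kwic_chars(text: str, keyword: str, width: int = 30) -> list[str]:
    """One line per match, with `width` characters of context on each side.

    The left context is padded to exactly `width` characters, so the keyword
    always starts at column `width`.
    """
    flat = " ".join(text.split())  # collapse newlines and runs of whitespace
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}}{m.group()}{right}")
    return lines


if __name__ == "__main__":
    sample = (
        "There is nothing more to be afraid of, said the fox.\n"
        "He was afraid to go, and afraid to stay."
    )
    for width in (10, 40):
        print(f"--- width {width} ---")
        for line in kwic_chars(sample, "afraid", width):
            print(line)
```

Running it at two widths shows the workflow described above: the narrow view makes the immediate neighbors easy to scan, and the wide view recovers enough of the sentence to quote.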
Use Cases for Concordance Analysis
Concordances and KWIC views earn their keep in five recurring contexts:
1. Qualitative coding of transcripts. Researchers coding interview transcripts or open-ended survey responses use concordances to find every appearance of a theme. A KWIC view of "manager" in a workplace survey transcript surfaces every time the word appears with surrounding context — exactly the input qualitative coding needs.
2. Author style audits. Writers self-audit by running their own draft through a bigram frequency tool. The crutches appear at the top: "in fact", "it is important to", "it should be noted". Once you see your own tics in a table, you stop using them as reflexively.
3. Term consistency in technical writing. A long technical document accumulates inconsistent terminology: "user" in one chapter, "customer" in another, "client" in a third. A frequency table flags the variation; a KWIC view shows whether the terms are interchangeable or refer to distinct concepts.
4. Literary analysis. Comparing how an author uses a specific word across a chapter or novel. The classic case is following a motif: every appearance of "garden" in The Secret Garden, every appearance of "white" in Moby Dick.
5. Corpus-based language learning. Language learners use concordance lines from large corpora to see how a target word is actually used by native speakers. The collocations and grammatical patterns visible in the KWIC view fill gaps that dictionaries don't address.
What a Browser Concordance Tool Can and Can't Do
A client-side tool that processes text in your browser tab has natural limits.
It can handle:
- Single documents up to chapter length comfortably.
- Multi-document corpora when concatenated into one input.
- Real-time interaction with sliders and filters because everything is in memory.
- Full privacy because nothing leaves your device.
It can't handle:
- Million-word corpora without performance degradation. Bigram/trigram counting is O(n) but the dominant cost is memory, and JavaScript object allocation gets slow at scale.
- Cross-document statistics (concordancing across a large library where each document is a separate file). For that, dedicated tools like AntConc or Sketch Engine remain the right choice.
- POS-tagged or lemmatized analysis. Word-level frequency counts can be case-sensitive or case-folded, but either way they don't know that "ran" and "runs" are forms of "run".
For chapter-length inputs and most qualitative coding work, the browser-tab limit is comfortable. For million-word corpora, use a dedicated tool.
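The lemmatization limit is easy to see in a small sketch (the function name `word_counts` is illustrative). Case folding merges "Run" and "run", but with no lemmatizer the inflected forms stay in separate rows of the table.

```python
import re
from collections import Counter


def word_counts(text: str, fold_case: bool = True) -> Counter:
    """Count surface word forms, optionally folding case. No lemmatization."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)


if __name__ == "__main__":
    sample = "Run fast. I run daily, she runs weekly, he ran once."
    folded = word_counts(sample)
    print(folded["run"], folded["runs"], folded["ran"])
```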
A Brief History
The conceptual lineage of concordance tools:
- Hugh of Saint-Cher, ~1230. First serious Bible concordance, compiled by Dominican friars under his direction; tradition puts the number of contributors at 500.
- Alexander Cruden, 1737. A Complete Concordance to the Holy Scriptures. Still in print.
- Father Roberto Busa, 1949. Began the Index Thomisticus, a complete concordance of Aquinas's works. Used IBM punched-card machines. Often cited as a founding project of computational text analysis.
- John Sinclair and the COBUILD project, 1980s. Built one of the first very large electronic corpora (later the Bank of English) and helped establish corpus linguistics as a field.
- Laurence Anthony's AntConc, 2002. Free desktop concordance tool that brought the technique to undergraduate teaching and small-corpus researchers.
- Web-native tools, 2010s onward. Concordance moved into the browser, alongside the broader trend of putting "applications you used to install" onto the web.
The fundamental technique hasn't changed in 800 years: index every word with context, present in aligned columns, look for patterns. What's changed is the work required to do it.
Practical Tips
A few things that make concordance analysis more useful in practice:
- Compare with and without stopwords. The two views surface different patterns. Function words show grammatical structure; content words show topical structure.
- Start with bigrams. The top-20 bigrams of any text are surprisingly informative. They're often the title-ish phrases that summarize the content.
- Use case-sensitive search for proper nouns. A KWIC for "Apple" the company should ignore "apple" the fruit.
- Use whole-word matching for short queries. Searching for "in" without whole-word matching also matches the letters inside larger words such as "thing" and "inside".
- Iterate on the window size. Start narrow, widen when you need to extract quotes, narrow again when you want to scan for patterns.
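The case-sensitivity and whole-word tips map directly onto regular-expression options. A minimal sketch, with `build_pattern` and `count_matches` as hypothetical helper names:

```python
import re


def build_pattern(query: str, whole_word: bool = True, case_sensitive: bool = False) -> re.Pattern:
    """Compile a search pattern for a literal query string."""
    body = re.escape(query)  # treat the query as plain text, not as a regex
    if whole_word:
        body = r"\b" + body + r"\b"  # require word boundaries on both sides
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(body, flags)


def count_matches(text: str, query: str, **options) -> int:
    """Count non-overlapping matches of the query in the text."""
    return len(build_pattern(query, **options).findall(text))


if __name__ == "__main__":
    sample = "Apple shipped an apple pie in a tin inside the building."
    print(count_matches(sample, "in", whole_word=True))
    print(count_matches(sample, "in", whole_word=False))
    print(count_matches(sample, "Apple", case_sensitive=True))
```

In the sample sentence the whole-word search finds only the preposition, while the substring search also picks up "tin", "inside", and "building".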
Conclusion
Concordance and KWIC analysis are not glamorous. They're not AI. They're not the cutting edge of NLP. They are, however, the tool that linguists, qualitative researchers, and careful writers actually reach for when they want to understand how a word is being used. The technique is 800 years old and still earns its place in the toolkit.
For browser-native work, the Utilora Text Concordance gives you unigram, bigram, and trigram frequencies, a stopword filter, and a KWIC view with adjustable window — all running locally on whatever text you paste in. Nothing uploads. Use it on a chapter, a transcript, a draft, or your own writing to see what you're actually saying.
Pair it with Word Counter for quick length checks and Remove Duplicate Lines when you're pre-processing a corpus from line-delimited sources.