What is Named Entity Recognition (NER)? Complete Guide 2026

Every invoice, payslip, and contract you receive is full of named entities — people, companies, dates, amounts, and locations buried in paragraphs and tables. Manually finding each one is slow and error-prone. Named Entity Recognition (NER) is the NLP technique that automates that work: it scans text and labels spans like John Smith as a person, Acme Ltd as an organisation, and €4,250.00 as money.

If you search for named entity recognition, NER NLP, or entity extraction, you are usually trying to solve a practical problem — turn unstructured documents into structured data you can import, analyse, or search. NER is the foundation of modern information extraction, powering everything from news aggregators to compliance systems and AI document platforms like Inputo.

This guide explains what NER is, how it works under the hood, which entity types it detects, how to run it in Python with NLTK and SpaCy, and why AI-powered extraction goes further than traditional NER when you need to understand document context — not just label isolated phrases.

What is Named Entity Recognition?

At its core, Named Entity Recognition is a Natural Language Processing task that identifies and classifies rigid designators in text — proper names and other fixed references that point to specific real-world objects. Philosophers use “rigid designator” to mean a term that refers to the same entity across possible worlds; in NLP, that translates to categories such as Person, Organisation, Location, Date, and Money.

Unlike keyword search, which looks for exact strings, NER understands that “Apple” in a business headline is likely an organisation, while “apple” in a recipe is food. The model assigns each detected span a label so downstream systems can route data: a CRM import, a compliance alert, or a spreadsheet column.

Where the term came from: MUC-6

The modern NER task was formalised at the Message Understanding Conference (MUC-6) in 1995, sponsored by DARPA. Researchers needed shared benchmarks for information extraction from news wire text. MUC-6 defined entity types — person, organisation, location, and more — and evaluation metrics so different labs could compare systems fairly. That conference effectively standardised “named entity recognition” as a distinct NLP subtask, separate from full parsing or machine translation.

Three decades later, label sets descended from MUC still appear in libraries like SpaCy (ORG, PERSON, GPE) and in enterprise extraction pipelines. The datasets have changed (social media, clinical notes, invoices), but the idea remains: find spans and assign types.

A classic NER example

Consider this sentence from a financial news wire:

“Apple is looking at buying U.K. startup for $1 billion.”

A trained NER model would typically output:

Apple → ORG (organisation)
U.K. → GPE (geo-political entity / location)
$1 billion → MONEY

Notice what NER does not tell you: which entity is the acquirer versus the target, whether the deal closed, or whether “U.K.” modifies “startup” or indicates jurisdiction. It labels local spans. That limitation matters when you move from news sentences to multi-page PDFs — a topic we return to in the NER vs AI section.

How NER Works

Production NER pipelines follow a predictable sequence. Whether the implementation uses hand-written rules or a transformer neural network, the logical stages are similar.

1. Tokenization

Raw text is split into tokens — usually words and punctuation. Tokenization must handle contractions, hyphenated names, and currency symbols so “$1” and “billion” can be merged into a single monetary entity later. For European documents, tokenizers also deal with decimal commas, VAT prefixes, and national ID formats.
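
The tokenization concerns above can be sketched with Python's standard re module. The pattern and the tokenize function below are illustrative assumptions, not the rules NLTK or SpaCy apply internally; a production tokenizer needs many more cases.

```python
import re

# Illustrative pattern (an assumption, not a library's real tokenizer rules).
# Alternatives are tried left to right, so order matters.
TOKEN_RE = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}        # dotted abbreviations such as U.K.
  | \d+(?:[.,]\d+)*           # numbers with thousands or decimal separators
  | \w+(?:[-'’]\w+)*          # words, hyphenated names, contractions
  | [$€£]                     # currency symbols as separate tokens
  | \S                        # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Split text into tokens using TOKEN_RE."""
    return TOKEN_RE.findall(text)

print(tokenize("Apple is looking at buying U.K. startup for $1 billion."))
print(tokenize("Total: 4.250,00 € (IVA 21%)"))
```

Keeping the currency symbol and the number as separate tokens leaves the decision to merge them into one monetary entity to the later detection stage.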

2. Part-of-speech (POS) tagging

Each token receives a grammatical tag: noun, verb, adjective, and so on. POS features help the model distinguish “March” the month from “march” the verb, and they flag proper nouns such as “Washington” as entity candidates; deciding whether “Washington” is a person or a place is left to the classification stage. Classical systems relied heavily on POS; modern neural NER usually learns syntax implicitly from raw tokens, though explicit linguistic preprocessing can still help on noisy OCR text.

3. Entity detection and classification

The model decides which token spans constitute entities and assigns a type. Detection answers where the entity is; classification answers what kind it is. Multi-word entities such as “European Central Bank” must be grouped into one span rather than three separate organisation hits.

IOB tagging format

Sequence labelling models often use the Inside-Outside-Beginning (IOB) scheme (also called BIO):

B- prefix marks the beginning of an entity (e.g. B-ORG for the first token of an organisation)
I- prefix marks tokens inside a continuing entity (e.g. I-ORG for “Bank” in “Deutsche Bank”)
O marks tokens outside any entity

For the Apple headline, tokens might be tagged: B-ORG (Apple), O (is), O (looking) … B-GPE (U.K.), … B-MONEY ($), I-MONEY (1), I-MONEY (billion). Training data in CoNLL-2003 and similar shared tasks uses IOB labels so CRF and LSTM models can learn consistent span boundaries.
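
To make the scheme concrete, here is a minimal sketch that turns a tagged token sequence back into entity spans. The function name iob_to_spans and the hand-written tags are assumptions for illustration; a trained model would supply the tags.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts (a stray I- tag is treated as a beginning).
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open entity
        else:  # "O": close any open entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Apple", "is", "looking", "at", "buying", "U.K.",
          "startup", "for", "$", "1", "billion", "."]
tags = ["B-ORG", "O", "O", "O", "O", "B-GPE",
        "O", "O", "B-MONEY", "I-MONEY", "I-MONEY", "O"]
print(iob_to_spans(tokens, tags))
# [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$ 1 billion', 'MONEY')]
```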

Machine learning vs rule-based NER

Rule-based NER uses dictionaries, regular expressions, and pattern grammars. A rule might say: “two capitalised tokens followed by Ltd or S.L. → organisation.” Rules are fast, interpretable, and excellent for structured identifiers (IBAN patterns, NHS numbers) but brittle on typos and novel phrasing.
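
A rule set of this kind might look like the sketch below. The RULES list and the find_entities function are hypothetical names, and the two patterns are deliberately small; they show the mechanism and are not a usable rule base.

```python
import re

# Hypothetical rules for illustration; real systems need far larger pattern sets.
RULES = [
    # One to three capitalised tokens followed by a company suffix.
    ("ORG", re.compile(r"\b(?:[A-Z][\w&-]*\s+){1,3}(?:Ltd|S\.L\.)(?!\w)")),
    # Currency symbol, digits, optional thousands groups and two decimals.
    ("MONEY", re.compile(r"[€$£]\s?\d+(?:[.,]\d{3})*(?:[.,]\d{2})?(?!\d)")),
]

def find_entities(text):
    """Return (start, end, label, text) for every rule match, sorted by position."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)

sample = "Invoice from Acme Iberia S.L. for €4,250.00, payable to Northwind Ltd."
for hit in find_entities(sample):
    print(hit)
```

The brittleness is easy to reproduce: a lower-case “acme iberia s.l.” or an OCR slip such as “S.I.” no longer matches either pattern.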

Machine learning NER learns from annotated examples. Classical approaches combined hand-crafted features with Conditional Random Fields (CRF). Today, BERT, RoBERTa, and domain-specific transformers dominate benchmarks. They generalise to new vocabulary and handle context (“Paris Hilton” vs “Paris, France”) far better than regex alone.

Most real systems blend both: regex for high-precision IDs, ML for prose, and post-processing rules to fix systematic OCR errors. When documents arrive as scanned PDFs, OCR runs first — otherwise NER has nothing reliable to parse.

Types of Entities NER Can Detect

Entity inventories vary by model and training corpus. SpaCy’s English core model, for example, ships with a fixed label set suited to general text. The table below summarises common standard types and typical business examples.

Entity type	Label (typical)	Example
Person	`PERSON`	María García, Dr James Chen
Organisation	`ORG`	Deutsche Bank, NHS Trust
Location	`GPE`, `LOC`	Spain, River Thames
Date	`DATE`	15 March 2026, Q3 FY25
Time	`TIME`	14:30, midnight
Money	`MONEY`	€4,250.00, $1 billion
Percentage	`PERCENT`	21% VAT, 4.5% interest
Product	`PRODUCT`	iPhone 16, Office 365
Event	`EVENT`	Board meeting, Olympic Games

Custom and domain-specific entities

Off-the-shelf NER models are trained on news and web text. Payroll PDFs, medical forms, and legal contracts need custom entities: invoice numbers, contract clause IDs, DNI/NIE national IDs, social-security numbers (NSS, NIF, Codice Fiscale), policy numbers, SKU codes, and line-item descriptions tied to amounts.

Teams extend NER by fine-tuning transformers on labelled in-house documents or by layering regex and validation (checksum digits on tax IDs). Inputo’s approach combines general language understanding with field schemas for payroll exports — so “12345678Z” maps to an employee ID column, not merely a generic alphanumeric string.
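
As an example of checksum validation, the control letter of a Spanish DNI or NIE can be verified in a few lines. This is a minimal sketch of the public mod-23 rule; the function name is_valid_dni_nie is an assumption, and nothing here describes Inputo's internal code.

```python
import re

DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"      # control letter = number mod 23
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}  # NIE prefix letter maps to a digit

def is_valid_dni_nie(value: str) -> bool:
    """Check the control letter of a Spanish DNI or NIE."""
    match = re.fullmatch(r"([XYZ]?)(\d{7,8})([A-Z])", value.strip().upper())
    if not match:
        return False
    prefix, digits, letter = match.groups()
    # DNI: 8 digits + letter. NIE: X/Y/Z + 7 digits + letter.
    if (prefix and len(digits) != 7) or (not prefix and len(digits) != 8):
        return False
    number = int(NIE_PREFIX.get(prefix, "") + digits)
    return DNI_LETTERS[number % 23] == letter

print(is_valid_dni_nie("12345678Z"))  # True
print(is_valid_dni_nie("12345678A"))  # False
```

A check like this lets a pipeline reject an OCR misread before the value reaches an export column.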

Practical Applications of NER

NER is rarely the final product; it is the bridge between unstructured language and structured systems. Here are six high-impact use cases.

Document data extraction

This is Inputo’s core domain. Invoices, payslips, tax forms, and HR packets contain dozens of entities per page. NER (or LLM-based equivalents) locates names, amounts, and dates so they can populate Excel rows, CSV imports, or Word templates without retyping. When extraction must scale across hundreds of employees or suppliers, entity labelling is the first automated step — often after OCR on scans.

Customer support automation

Support tickets mention order numbers, product names, and account holders. NER routes tickets to the right queue, pre-fills CRM cases, and redacts personal data before logs are stored. Identifying ORG and PRODUCT entities helps link complaints to known incidents.
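
Redaction is mostly bookkeeping once entity offsets are known. The sketch below assumes spans arrive as (start_char, end_char, label) tuples from whatever NER step runs upstream; the redact function and the sample ticket are illustrative.

```python
def redact(text, spans, labels_to_hide=("PERSON",)):
    """Replace selected entity spans with a [LABEL] placeholder."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        if label not in labels_to_hide or start < cursor:
            continue  # keep non-sensitive spans; skip overlapping ones
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

ticket = "John Smith reported that order 4471 from Acme Ltd arrived damaged."
person_start = ticket.index("John Smith")
org_start = ticket.index("Acme Ltd")
spans = [
    (person_start, person_start + len("John Smith"), "PERSON"),
    (org_start, org_start + len("Acme Ltd"), "ORG"),
]
print(redact(ticket, spans))
# [PERSON] reported that order 4471 from Acme Ltd arrived damaged.
```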

Medical record processing

Clinical notes reference patients, medications, dosages, and conditions. Specialised medical NER models label entities for coding (ICD-10), adverse-event monitoring, and de-identification. Regulatory constraints mean human review remains common, but NER reduces manual annotation time dramatically.

Legal document review

Contracts and court filings name parties, jurisdictions, effective dates, and monetary obligations. NER accelerates due diligence: reviewers filter clauses mentioning specific organisations or date ranges instead of reading every page linearly.

Financial data extraction

Earnings reports, credit memos, and KYC packets mix prose and tables. NER pulls MONEY, PERCENT, and ORG entities into risk models and compliance checks. Pairing NER with table detection handles line items that pure sentence models miss.

Search and recommendation

News aggregators, knowledge graphs, and enterprise search index entities so users can query “all articles mentioning Company X in France during 2025.” Recommendation engines use entity overlap to suggest related documents. Without NER, search degrades to brittle keyword matching.
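
An entity index for that kind of query can be sketched with a dictionary of sets. The document ids, entity lists, and function names below are made up for illustration; a real system would fill the index from NER output and store it in a search engine.

```python
from collections import defaultdict

def build_entity_index(docs):
    """Map (label, normalised text) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for text, label in entities:
            index[(label, text.casefold())].add(doc_id)
    return index

def search(index, *criteria):
    """Return ids of documents that contain every (label, text) criterion."""
    sets = [index.get((label, text.casefold()), set()) for label, text in criteria]
    return set.intersection(*sets) if sets else set()

# Hypothetical documents with entities already extracted.
docs = {
    "a1": [("Company X", "ORG"), ("France", "GPE"), ("2025", "DATE")],
    "a2": [("Company X", "ORG"), ("Spain", "GPE"), ("2025", "DATE")],
    "a3": [("Company Y", "ORG"), ("France", "GPE"), ("2024", "DATE")],
}
index = build_entity_index(docs)
print(search(index, ("ORG", "Company X"), ("GPE", "France"), ("DATE", "2025")))
# {'a1'}
```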

How to Perform NER with Python

Developers usually start with NLTK for teaching examples or SpaCy for production prototypes. Both ship pre-trained English models; SpaCy additionally offers multilingual pipelines.

NLTK with `ne_chunk`

NLTK’s classical approach tokenises text, tags parts of speech, then runs a pre-trained classifier-based chunker (ne_chunk) to group named entities. It is educational and lightweight but less accurate than modern neural models on messy business text.

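import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

# One-time model downloads. Resource names differ between NLTK releases:
# recent versions look for the *_tab / *_eng variants, older ones for the rest.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
                 "maxent_ne_chunker", "maxent_ne_chunker_tab", "words"):
    nltk.download(resource, quiet=True)

sentence = "Apple is looking at buying U.K. startup for $1 billion."
tokens = word_tokenize(sentence)   # split into word and punctuation tokens
tagged = pos_tag(tokens)           # attach a POS tag to every token
tree = ne_chunk(tagged)            # group tokens into labelled entity subtrees

for chunk in tree:
    if isinstance(chunk, Tree):    # plain (token, tag) tuples are not entities
        entity = " ".join(token for token, _tag in chunk)
        print(f"{entity} → {chunk.label()}")
# Output depends on the NLTK version and bundled model. Expect labels such as
# PERSON, ORGANIZATION, and GPE; this chunker may tag "Apple" as GPE and
# usually leaves "$1 billion" unlabelled.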

NLTK returns a nested tree structure; you walk it to collect entity strings and labels. For batch processing PDFs, you would extract plain text first (via pdfplumber or OCR), then run each paragraph through this pipeline.

SpaCy with `doc.ents`

SpaCy is the more common choice for applied work. Load a model once, process raw strings, and iterate doc.ents for start/end offsets and labels — useful for highlighting entities in UI or building training data.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text} → {ent.label_} ({spacy.explain(ent.label_)})")
# Apple → ORG (Companies, agencies, institutions, etc.)
# U.K. → GPE (Countries, cities, states)
# $1 billion → MONEY (Monetary values, including unit)

SpaCy also exposes ent.start_char and ent.end_char for span highlighting. For other languages, swap in es_core_news_sm, de_core_news_sm, and similar models — critical for European HR documents that mix languages on one page.

Neither snippet understands that “$1 billion” is an acquisition price tied to Apple’s bid. They label spans in isolation — which leads naturally to the next topic.

NER vs AI-Powered Document Extraction

Traditional NER answers: “What labelled spans appear in this sentence?” It does not reliably answer: “What is the employee’s gross salary?” or “Which line item belongs to which VAT rate?” Context, layout, and cross-field relationships sit outside classic NER’s design.

Consider a payslip fragment after OCR:

Employee: John Smith · Gross pay: €50,000 · Employer: Acme Iberia S.L.

NER returns three entities: PERSON (John Smith), MONEY (€50,000), ORG (Acme Iberia S.L.). Useful, but a payroll import needs structured fields: employee_name, gross_salary, employer_name. A dense payslip carries several MONEY amounts (gross, net, deductions), and span-only models cannot tell them apart unless rules encode relative position.
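
One way such a positional rule might look is to key each value by the printed label in front of it. In this sketch the LABEL_TO_FIELD table and the map_fields function are hypothetical, and the “Net pay” amount is invented to show two MONEY values landing in different fields.

```python
import re

# Hypothetical mapping from printed labels to export field names.
LABEL_TO_FIELD = {
    "employee": "employee_name",
    "gross pay": "gross_salary",
    "net pay": "net_salary",
    "employer": "employer_name",
}

def map_fields(line: str) -> dict[str, str]:
    """Split 'Label: value' segments and key each value by its label."""
    fields = {}
    for segment in re.split(r"\s*·\s*", line):
        label, sep, value = segment.partition(":")
        field = LABEL_TO_FIELD.get(label.strip().casefold())
        if sep and field:
            fields[field] = value.strip()
    return fields

fragment = ("Employee: John Smith · Gross pay: €50,000 · "
            "Net pay: €38,400 · Employer: Acme Iberia S.L.")
print(map_fields(fragment))
```

The rule works only while the layout stays fixed; a payslip that prints labels in a table header and values in the row below defeats it, which is the gap the following paragraphs address.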

AI-powered extraction (large language models such as Claude, used in Inputo) reads the full document — including tables, headers, and repeated sections — and outputs schema-aligned JSON. The model infers that €50,000 beside “Gross pay” is salary, not tax withheld, because it understands labels and layout, not just entity types.
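
Whatever model produces the JSON, the receiving code should check it against the expected schema before import. The sketch below uses a hard-coded string as a stand-in for a model response and makes no API call; the SCHEMA fields reuse the names from the payslip example and are not Inputo's real schema.

```python
import json
from decimal import Decimal

# Hypothetical schema: field name -> converter applied to the raw value.
SCHEMA = {
    "employee_name": str,
    "employer_name": str,
    "gross_salary": Decimal,
}

def parse_extraction(raw_json: str) -> dict:
    """Validate a model response against SCHEMA and convert value types."""
    data = json.loads(raw_json)
    missing = SCHEMA.keys() - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {name: convert(data[name]) for name, convert in SCHEMA.items()}

# Stand-in for a model response (illustrative values only).
response = ('{"employee_name": "John Smith", '
            '"employer_name": "Acme Iberia S.L.", '
            '"gross_salary": "50000.00"}')
record = parse_extraction(response)
print(record)
```

Converting amounts to Decimal instead of float avoids rounding surprises when the values are later summed or exported.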

Another gap: coreference and roles. NER tags “John Smith” but has no way to connect the name to a later “the employee”; AI can link them. NER tags all dates equally; AI can distinguish hire date, pay period, and payment date from surrounding wording.

That does not make NER obsolete. NER remains fast, cheap, and interpretable for high-volume filtering. Modern platforms use a tiered strategy: OCR for scans, NER or regex for obvious identifiers, and LLM extraction where semantic understanding and field mapping matter. For a deeper walkthrough of PDF pipelines, see our guide on how to extract data from PDF files using AI.

How Inputo Uses AI for Intelligent Data Extraction

Inputo is built for European payroll, HR, and finance teams drowning in PDFs. The pipeline combines OCR, entity-aware AI, and export formats your software already accepts.

OCR for scanned documents

Many payslips and government forms arrive as image-only PDFs. Inputo runs multi-language OCR (Spanish, English, French, German, Italian, Portuguese, Dutch) before any entity work begins. Without a clean text layer, even the best NER model assigns wrong labels to garbled tokens.

Claude AI for context-aware extraction

After text is available, Claude analyses document structure: tables, headings, repeated employee blocks, footnotes. It extracts entities and assigns them to the correct semantic fields — DNI/NIE to national ID, NSS to social security, gross and net amounts to separate columns — rather than returning a flat list of MONEY spans.

Field mapping and validation

Business rules validate formats (Spanish DNI checksum, date normalisation to ISO). Country-specific payroll exports map fields to A3Nom, TeamSystem, PHC GO, Moneysoft, and Silae layouts so gestorías import without manual column remapping.
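
Date normalisation of this kind can be sketched with the standard library. The format list and the to_iso_date function are assumptions for illustration, with day-first order assumed for European documents; this is not Inputo's implementation.

```python
from datetime import datetime

# Candidate input formats, tried in order (an illustrative, day-first list).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y")

def to_iso_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognised date: {value!r}")

print(to_iso_date("15/03/2026"))  # 2026-03-15
print(to_iso_date("15.03.2026"))  # 2026-03-15
```

Raising on unknown formats, instead of guessing, keeps ambiguous dates visible for human review.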

Structured exports

Results download as Excel, CSV, or filled Word templates. The same engine powers the public PDF to Excel converter for ad-hoc table extraction. Files are processed and deleted — not stored for training.

Upload a payslip or invoice and see AI extraction go beyond traditional NER — mapped fields, not just labelled spans.

Try Inputo free →

Frequently asked questions

What is named entity recognition?

Named Entity Recognition (NER) is an NLP technique that scans text, detects meaningful spans (names, places, amounts, dates), and assigns each span a category. It is a core step in turning unstructured documents into structured data for search, analytics, and automated imports.

What types of entities can NER detect?

Standard models cover persons, organisations, locations, dates, times, money, percentages, products, and events. Custom training adds domain labels — invoice IDs, contract numbers, medical codes — when off-the-shelf tagsets are not enough for your industry.

What is the difference between NER and NLP?

NLP is the entire discipline of processing human language with computers. Its tasks include translation, summarisation, sentiment analysis, parsing, and NER. Think of NER as one specialised tool in the broader NLP toolbox.

Is NER accurate?

On clean newswire English, neural NER often exceeds 90% F1 on benchmarks such as CoNLL-2003. Real documents introduce OCR noise, unusual fonts, tables, and domain jargon, so accuracy falls unless you fine-tune, add rules, or combine NER with human review. For payroll and compliance, always validate extracted fields before import.

How is NER used in document processing?

Document pipelines use NER to locate candidate fields, redact sensitive spans, or populate database columns. In production, NER often pairs with layout analysis, OCR, and LLM-based field mapping — especially when documents mix prose and tables across multiple pages.

Can NER handle multiple languages?

Yes, with language-specific or multilingual models. European HR workflows frequently need Spanish, English, French, German, Italian, Portuguese, and Dutch on the same platform. Inputo’s OCR covers those seven languages; AI extraction then maps entities to the correct payroll schema regardless of where they appear on the page.

Conclusion

Named Entity Recognition is the quiet workhorse behind modern information extraction. Born at MUC-6, refined through IOB tagging and neural models, and available in every major NLP library, NER turns raw text into labelled spans you can act on. For simple sentences — “Apple … U.K. … $1 billion” — it works beautifully.

Business documents demand more: relationships between entities, table structure, multilingual scans, and export-ready field names. That is where AI-powered platforms extend NER’s foundation. Whether you experiment with NLTK and SpaCy in Python or upload your next payslip batch to Inputo, you are building on the same idea — find what matters in text, automatically.

Ready to move from entity labels to structured payroll and invoice data? Launch Inputo, upload a document, and download mapped Excel or CSV in minutes.

NER finds the names and amounts — Inputo maps them to the fields your software expects. Try the full app today.

Launch Inputo app →
