Every invoice, payslip, and contract you receive is full of named entities — people, companies, dates, amounts, and locations buried in paragraphs and tables. Manually finding each one is slow and error-prone. Named Entity Recognition (NER) is the NLP technique that automates that work: it scans text and labels spans like John Smith as a person, Acme Ltd as an organisation, and €4,250.00 as money.
If you search for named entity recognition, NER NLP, or entity extraction, you are usually trying to solve a practical problem — turn unstructured documents into structured data you can import, analyse, or search. NER is the foundation of modern information extraction, powering everything from news aggregators to compliance systems and AI document platforms like Inputo.
This guide explains what NER is, how it works under the hood, which entity types it detects, how to run it in Python with NLTK and SpaCy, and why AI-powered extraction goes further than traditional NER when you need to understand document context — not just label isolated phrases.
What is Named Entity Recognition?
At its core, Named Entity Recognition is a Natural Language Processing task that identifies and classifies rigid designators in text — proper names and other fixed references that point to specific real-world objects. Philosophers use “rigid designator” to mean a term that refers to the same entity across possible worlds; in NLP, that translates to categories such as Person, Organisation, Location, Date, and Money.
Unlike keyword search, which looks for exact strings, NER understands that “Apple” in a business headline is likely an organisation, while “apple” in a recipe is food. The model assigns each detected span a label so downstream systems can route data: a CRM import, a compliance alert, or a spreadsheet column.
Where the term came from: MUC-6
The modern NER task was formalised at the Message Understanding Conference (MUC-6) in 1995, sponsored by DARPA. Researchers needed shared benchmarks for information extraction from news wire text. MUC-6 defined entity types — person, organisation, location, and more — and evaluation metrics so different labs could compare systems fairly. That conference effectively standardised “named entity recognition” as a distinct NLP subtask, separate from full parsing or machine translation.
Three decades later, MUC-style labels still appear in libraries like SpaCy (ORG, PERSON, GPE) and in enterprise extraction pipelines. The datasets have changed — social media, clinical notes, invoices — but the idea remains: find spans and assign types.
A classic NER example
Consider this sentence from a financial news wire:
“Apple is looking at buying U.K. startup for $1 billion.”
A trained NER model would typically output:
- Apple → ORG (organisation)
- U.K. → GPE (geo-political entity / location)
- $1 billion → MONEY
Notice what NER does not tell you: which entity is the acquirer versus the target, whether the deal closed, or whether “U.K.” modifies “startup” or indicates jurisdiction. It labels local spans. That limitation matters when you move from news sentences to multi-page PDFs — a topic we return to in the NER vs AI section.
How NER Works
Production NER pipelines follow a predictable sequence. Whether the implementation uses hand-written rules or a transformer neural network, the logical stages are similar.
1. Tokenization
Raw text is split into tokens — usually words and punctuation. Tokenization must handle contractions, hyphenated names, and currency symbols so “$1” and “billion” can be merged into a single monetary entity later. For European documents, tokenizers also deal with decimal commas, VAT prefixes, and national ID formats.
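As a rough illustration of this step, the sketch below tokenises with a single regular expression. The pattern and the `tokenize` name are assumptions made for this example; NLTK and SpaCy use their own, more elaborate tokenisers.

```python
import re

# Hypothetical pattern for illustration. Alternatives are tried in order,
# so the more specific ones (abbreviations, numbers) come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # dotted abbreviations such as U.K.
    | \d+(?:[.,]\d+)*       # numbers with thousands or decimal separators
    | \w+(?:[-'’]\w+)*      # words, hyphenated names, contractions
    | [^\w\s]               # any other single symbol, such as a currency sign
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Return the tokens of `text` in order, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Apple is looking at buying U.K. startup for $1 billion."))
# ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']
```

Keeping the currency symbol and the number as separate tokens leaves the merging decision to the later entity stage, which is where "$", "1", and "billion" become one MONEY span.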
2. Part-of-speech (POS) tagging
Each token receives a grammatical tag: noun, verb, adjective, and so on. POS features help the model distinguish “March” the month from “march” the verb. Telling “Washington” the person from “Washington” the place takes surrounding context, since both are proper nouns. Classical systems relied heavily on POS; modern neural NER often learns implicit syntax but still benefits from linguistic preprocessing on noisy OCR text.
3. Entity detection and classification
The model decides which token spans constitute entities and assigns a type. Detection answers where the entity is; classification answers what kind it is. Multi-word entities such as “European Central Bank” must be grouped into one span rather than three separate organisation hits.
IOB tagging format
Sequence labelling models often use the Inside-Outside-Beginning (IOB) scheme (also called BIO):
- B- prefix marks the beginning of an entity (e.g. B-ORG for the first token of an organisation)
- I- prefix marks tokens inside a continuing entity (e.g. I-ORG for “Bank” in “Deutsche Bank”)
- O marks tokens outside any entity
For the Apple headline, tokens might be tagged: B-ORG (Apple), O (is), O (looking) … B-GPE (U.K.), … B-MONEY ($), I-MONEY (1), I-MONEY (billion). Training data in CoNLL-2003 and similar shared tasks uses IOB labels so CRF and LSTM models can learn consistent span boundaries.
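To make the scheme concrete, here is a minimal sketch that turns a sequence of IOB tags back into entity spans. The `iob_to_spans` helper is a name invented for this example, and treating a stray I- tag as the start of a new entity is a lenient choice, not part of any standard.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        # An I- tag with the same label extends the entity that is open.
        if prefix == "I" and current and tag_label == label:
            current.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current:
            spans.append((" ".join(current), label))
            current, label = [], None
        # B- starts a new entity; a stray I- is treated as a start as well.
        if prefix in ("B", "I"):
            current, label = [token], tag_label
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Apple", "is", "looking", "at", "buying", "U.K.",
          "startup", "for", "$", "1", "billion", "."]
tags = ["B-ORG", "O", "O", "O", "O", "B-GPE",
        "O", "O", "B-MONEY", "I-MONEY", "I-MONEY", "O"]
print(iob_to_spans(tokens, tags))
# [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$ 1 billion', 'MONEY')]
```

Tokens are rejoined with spaces here for simplicity; a production pipeline would usually keep character offsets so the original text can be recovered exactly.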
Machine learning vs rule-based NER
Rule-based NER uses dictionaries, regular expressions, and pattern grammars. A rule might say: “two capitalised tokens followed by Ltd or S.L. → organisation.” Rules are fast, interpretable, and excellent for structured identifiers (IBAN patterns, NHS numbers) but brittle on typos and novel phrasing.
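A rule set of this kind might look like the sketch below. Both patterns are simplified assumptions for illustration: the organisation rule accepts one to three capitalised tokens before the suffix, and the IBAN rule only covers the Spanish layout (ES plus 22 digits) without verifying check digits.

```python
import re

# Hypothetical rules: (label, compiled pattern).
RULES = [
    # One to three capitalised tokens followed by a company suffix.
    ("ORG", re.compile(r"\b(?:[A-Z][\w&-]*\s+){1,3}(?:Ltd\b|S\.L\.)")),
    # Spanish IBAN layout only: ES, two check digits, five groups of four digits.
    ("IBAN", re.compile(r"\bES\d{2}(?: ?\d{4}){5}\b")),
]

def rule_based_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group()))
    return sorted(entities)

sample = "Employer: Acme Iberia S.L. Pay to ES91 2100 0418 4502 0005 1332 by Friday."
for start, end, label, value in rule_based_entities(sample):
    print(f"{value} → {label}")
# Acme Iberia S.L. → ORG
# ES91 2100 0418 4502 0005 1332 → IBAN
```

The brittleness is easy to reproduce: a lowercase "acme iberia s.l." or an OCR error inside the IBAN would slip past both rules.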
Machine learning NER learns from annotated examples. Classical approaches combined hand-crafted features with Conditional Random Fields (CRF). Today, BERT, RoBERTa, and domain-specific transformers dominate benchmarks. They generalise to new vocabulary and handle context (“Paris Hilton” vs “Paris, France”) far better than regex alone.
Most real systems blend both: regex for high-precision IDs, ML for prose, and post-processing rules to fix systematic OCR errors. When documents arrive as scanned PDFs, OCR runs first — otherwise NER has nothing reliable to parse.
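One simple way to blend the two sources is to let the high-precision rule spans win whenever they overlap a model span. The sketch below assumes spans are `(start, end, label)` tuples with an exclusive end offset; the `merge_spans` helper and the sample offsets are made up for this example.

```python
def overlaps(a, b):
    """True when two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def merge_spans(rule_spans, model_spans):
    """Keep every rule span; keep a model span only if it overlaps no rule span."""
    kept = list(rule_spans)
    for span in model_spans:
        if not any(overlaps(span, rule) for rule in rule_spans):
            kept.append(span)
    return sorted(kept)

rule_spans = [(10, 19, "DNI")]                           # from a regex
model_spans = [(0, 8, "PERSON"), (12, 19, "CARDINAL")]   # from an ML model
print(merge_spans(rule_spans, model_spans))
# [(0, 8, 'PERSON'), (10, 19, 'DNI')]
```

Giving rules priority is only one possible policy; a pipeline could equally prefer the longer span or the higher-confidence prediction.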
Types of Entities NER Can Detect
Entity inventories vary by model and training corpus. SpaCy’s English core model, for example, ships with a fixed label set suited to general text. The table below summarises common standard types and typical business examples.
| Entity type | Label (typical) | Example |
|---|---|---|
| Person | PERSON | María García, Dr James Chen |
| Organisation | ORG | Deutsche Bank, NHS Trust |
| Location | GPE, LOC | Spain, River Thames |
| Date | DATE | 15 March 2026, Q3 FY25 |
| Time | TIME | 14:30, midnight |
| Money | MONEY | €4,250.00, $1 billion |
| Percentage | PERCENT | 21% VAT, 4.5% interest |
| Product | PRODUCT | iPhone 16, Office 365 |
| Event | EVENT | Board meeting, Olympic Games |
Custom and domain-specific entities
Off-the-shelf NER models are trained on news and web text. Payroll PDFs, medical forms, and legal contracts need custom entities: invoice numbers, contract clause IDs, DNI/NIE national IDs, social-security and tax identifiers (NSS, NIF, Codice Fiscale), policy numbers, SKU codes, and line-item descriptions tied to amounts.
Teams extend NER by fine-tuning transformers on labelled in-house documents or by layering regex and validation (checksum digits on tax IDs). Inputo’s approach combines general language understanding with field schemas for payroll exports — so “12345678Z” maps to an employee ID column, not merely a generic alphanumeric string.
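As an example of checksum validation, the sketch below checks the control letter of a Spanish DNI: the eight-digit number modulo 23 indexes into a fixed letter table. The `is_valid_dni` helper is illustrative only, and NIE numbers (which start with X, Y, or Z) are deliberately left out.

```python
import re

DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"   # control letter for each remainder 0..22
DNI_PATTERN = re.compile(r"(\d{8})([A-Z])")

def is_valid_dni(candidate: str) -> bool:
    """Return True when the string is 8 digits plus the matching control letter."""
    match = DNI_PATTERN.fullmatch(candidate.strip().upper())
    if match is None:
        return False
    number, letter = match.groups()
    return DNI_LETTERS[int(number) % 23] == letter

print(is_valid_dni("12345678Z"))  # True
print(is_valid_dni("12345678A"))  # False
```

A check like this catches many single-character OCR errors before a value reaches the export, which a generic alphanumeric match cannot do.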
Practical Applications of NER
NER is rarely the final product; it is the bridge between unstructured language and structured systems. Here are six high-impact use cases.
Document data extraction
This is Inputo’s core domain. Invoices, payslips, tax forms, and HR packets contain dozens of entities per page. NER (or LLM-based equivalents) locates names, amounts, and dates so they can populate Excel rows, CSV imports, or Word templates without retyping. When extraction must scale across hundreds of employees or suppliers, entity labelling is the first automated step — often after OCR on scans.
Customer support automation
Support tickets mention order numbers, product names, and account holders. NER routes tickets to the right queue, pre-fills CRM cases, and redacts personal data before logs are stored. Identifying ORG and PRODUCT entities helps link complaints to known incidents.
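Redaction is straightforward once entity spans are available. The sketch below assumes non-overlapping spans given as `(start_char, end_char, label)` tuples, the shape you would build from any NER output; the `redact` helper and the sample ticket are invented for this example.

```python
def redact(text, spans, labels_to_hide=("PERSON",)):
    """Replace spans whose label is in `labels_to_hide` with a [LABEL] marker."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if label not in labels_to_hide:
            continue
        parts.append(text[cursor:start])   # text before the sensitive span
        parts.append(f"[{label}]")         # marker in place of the span
        cursor = end
    parts.append(text[cursor:])            # remainder after the last span
    return "".join(parts)

ticket = "John Smith says the Acme router failed."
spans = [(0, 10, "PERSON"), (20, 24, "ORG")]
print(redact(ticket, spans))
# [PERSON] says the Acme router failed.
```

Organisation and product spans stay visible so the ticket can still be linked to known incidents.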
Medical record processing
Clinical notes reference patients, medications, dosages, and conditions. Specialised medical NER models label entities for coding (ICD-10), adverse-event monitoring, and de-identification. Regulatory constraints mean human review remains common, but NER reduces manual annotation time dramatically.
Legal document review
Contracts and court filings name parties, jurisdictions, effective dates, and monetary obligations. NER accelerates due diligence: reviewers filter clauses mentioning specific organisations or date ranges instead of reading every page linearly.
Financial data extraction
Earnings reports, credit memos, and KYC packets mix prose and tables. NER pulls MONEY, PERCENT, and ORG entities into risk models and compliance checks. Pairing NER with table detection handles line items that pure sentence models miss.
Search and recommendation
News aggregators, knowledge graphs, and enterprise search index entities so users can query “all articles mentioning Company X in France during 2025.” Recommendation engines use entity overlap to suggest related documents. Without NER, search degrades to brittle keyword matching.
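The indexing idea can be sketched as an inverted index keyed by entity. Everything below (document IDs, entity lists, the `build_entity_index` name) is hypothetical sample data, with entities assumed to come from an upstream NER step.

```python
from collections import defaultdict

def build_entity_index(documents):
    """Map (label, lowercased entity text) to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, entities in documents.items():
        for text, label in entities:
            index[(label, text.lower())].add(doc_id)
    return index

documents = {
    "article-1": [("Company X", "ORG"), ("France", "GPE"), ("2025", "DATE")],
    "article-2": [("Company X", "ORG"), ("Spain", "GPE")],
    "article-3": [("France", "GPE")],
}
index = build_entity_index(documents)

# Documents mentioning both Company X and France: intersect the two sets.
matches = index[("ORG", "company x")] & index[("GPE", "france")]
print(sorted(matches))  # ['article-1']
```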
How to Perform NER with Python
Developers usually start with NLTK for teaching examples or SpaCy for production prototypes. Both ship pre-trained English models; SpaCy additionally offers multilingual pipelines.
NLTK with ne_chunk
NLTK’s classical approach tokenises text, tags parts of speech, then applies a pre-trained, classifier-based chunker to group named entities. It is educational and lightweight but less accurate than modern neural models on messy business text.
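```python
import nltk

# Resource names can differ between NLTK releases; if a LookupError names a
# different package, download that one instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Apple is looking at buying U.K. startup for $1 billion."
tokens = word_tokenize(sentence)   # split into words and punctuation
tagged = pos_tag(tokens)           # attach a POS tag to each token
tree = ne_chunk(tagged)            # group tagged tokens into entity subtrees

# Entity chunks are subtrees with a label; plain tokens are (word, tag) tuples.
for chunk in tree:
    if hasattr(chunk, 'label'):
        entity = " ".join(c[0] for c in chunk)
        print(f"{entity} → {chunk.label()}")

# Illustrative output; labels vary with the NLTK version and bundled model:
# Apple → ORGANIZATION
# U.K. → GPE
# The bundled chunker focuses on people, organisations, and places, so it may
# not tag "$1 billion" as MONEY.
```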
NLTK returns a nested tree structure; you walk it to collect entity strings and labels. For batch processing PDFs, you would extract plain text first (via pdfplumber or OCR), then run each paragraph through this pipeline.
SpaCy with doc.ents
SpaCy is the more common choice for applied work. Load a model once, process raw strings, and iterate doc.ents for start/end offsets and labels — useful for highlighting entities in UI or building training data.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text} → {ent.label_} ({spacy.explain(ent.label_)})")

# Apple → ORG (Companies, agencies, institutions, etc.)
# U.K. → GPE (Countries, cities, states)
# $1 billion → MONEY (Monetary values, including unit)
```
SpaCy also exposes ent.start_char and ent.end_char for span highlighting. For other languages, swap in es_core_news_sm, de_core_news_sm, and similar models — critical for European HR documents that mix languages on one page.
Neither snippet understands that “$1 billion” is an acquisition price tied to Apple’s bid. They label spans in isolation — which leads naturally to the next topic.
NER vs AI-Powered Document Extraction
Traditional NER answers: “What labelled spans appear in this sentence?” It does not reliably answer: “What is the employee’s gross salary?” or “Which line item belongs to which VAT rate?” Context, layout, and cross-field relationships sit outside classic NER’s design.
Consider a payslip fragment after OCR:
Employee: John Smith · Gross pay: €50,000 · Employer: Acme Iberia S.L.
NER returns three entities: PERSON (John Smith), MONEY (€50,000), ORG (Acme Iberia S.L.). Useful, but a payroll import needs structured fields: employee_name, gross_salary, employer_name. A dense payslip carries several MONEY amounts (gross, net, deductions), and these confuse span-only models unless rules encode relative position.
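A positional rule of that kind can be sketched for the fragment above by pairing each value with the label printed next to it. The `FIELD_ALIASES` mapping and the `map_fields` helper are hypothetical, and the approach only works while the "label: value" layout survives OCR.

```python
# Hypothetical mapping from printed labels to export field names.
FIELD_ALIASES = {
    "employee": "employee_name",
    "gross pay": "gross_salary",
    "employer": "employer_name",
}

def map_fields(fragment: str) -> dict:
    """Split a 'Label: value · Label: value' line into schema fields."""
    record = {}
    for part in fragment.split("·"):
        label, separator, value = part.partition(":")
        field = FIELD_ALIASES.get(label.strip().lower())
        if separator and field:
            record[field] = value.strip()
    return record

fragment = "Employee: John Smith · Gross pay: €50,000 · Employer: Acme Iberia S.L."
print(map_fields(fragment))
# {'employee_name': 'John Smith', 'gross_salary': '€50,000', 'employer_name': 'Acme Iberia S.L.'}
```

Every new layout needs another alias or another rule, which is the maintenance cost that pushes teams towards model-based extraction.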
AI-powered extraction (large language models such as Claude, used in Inputo) reads the full document — including tables, headers, and repeated sections — and outputs schema-aligned JSON. The model infers that €50,000 beside “Gross pay” is salary, not tax withheld, because it understands labels and layout, not just entity types.
Another gap: coreference and roles. NER tags “John Smith” and “the employee” separately; AI can link them. NER tags all dates equally; AI can distinguish hire date, pay period, and payment date from surrounding wording.
That does not make NER obsolete. NER remains fast, cheap, and interpretable for high-volume filtering. Modern platforms use a tiered strategy: OCR for scans, NER or regex for obvious identifiers, and LLM extraction where semantic understanding and field mapping matter. For a deeper walkthrough of PDF pipelines, see our guide on how to extract data from PDF files using AI.
How Inputo Uses AI for Intelligent Data Extraction
Inputo is built for European payroll, HR, and finance teams drowning in PDFs. The pipeline combines OCR, entity-aware AI, and export formats your software already accepts.
OCR for scanned documents
Many payslips and government forms arrive as image-only PDFs. Inputo runs multi-language OCR (Spanish, English, French, German, Italian, Portuguese, Dutch) before any entity work begins. Without a clean text layer, even the best NER model hallucinates labels on garbled tokens.
Claude AI for context-aware extraction
After text is available, Claude analyses document structure: tables, headings, repeated employee blocks, footnotes. It extracts entities and assigns them to the correct semantic fields — DNI/NIE to national ID, NSS to social security, gross and net amounts to separate columns — rather than returning a flat list of MONEY spans.
Field mapping and validation
Business rules validate formats (Spanish DNI checksum, date normalisation to ISO). Country-specific payroll exports map fields to A3Nom, TeamSystem, PHC GO, Moneysoft, and Silae layouts so gestorías import without manual column remapping.
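Date normalisation of this kind can be sketched with the standard library alone. The list of accepted input formats is an assumption for this example (day-first, as is usual on European documents) and is not a description of Inputo's actual rules.

```python
from datetime import datetime

# Assumed input formats, tried in order. Month names parse in English
# under Python's default locale.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%Y-%m-%d")

def to_iso_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print(to_iso_date("15/03/2026"))     # 2026-03-15
print(to_iso_date("15 March 2026"))  # 2026-03-15
```

Raising on unknown formats, instead of guessing, keeps ambiguous values such as "03/04/2026" in month-first documents from being imported silently with the wrong meaning.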
Structured exports
Results download as Excel, CSV, or filled Word templates. The same engine powers the public PDF to Excel converter for ad-hoc table extraction. Files are processed and deleted — not stored for training.
Upload a payslip or invoice and see AI extraction go beyond traditional NER — mapped fields, not just labelled spans.
Try Inputo free →
Frequently asked questions
What is named entity recognition?
Named Entity Recognition (NER) is an NLP technique that scans text, detects meaningful spans (names, places, amounts, dates), and assigns each span a category. It is a core step in turning unstructured documents into structured data for search, analytics, and automated imports.
What types of entities can NER detect?
Standard models cover persons, organisations, locations, dates, times, money, percentages, products, and events. Custom training adds domain labels — invoice IDs, contract numbers, medical codes — when off-the-shelf tagsets are not enough for your industry.
What is the difference between NER and NLP?
NLP is the entire discipline of processing human language with computers. Its tasks include translation, summarisation, sentiment analysis, parsing, and NER. Think of NER as one specialised tool in the broader NLP toolbox.
Is NER accurate?
On clean newswire English, neural NER often reaches F1 scores in the low-to-mid 90s on benchmarks such as CoNLL-2003. Real documents introduce OCR noise, unusual fonts, tables, and domain jargon, so accuracy falls unless you fine-tune, add rules, or combine NER with human review. For payroll and compliance, always validate extracted fields before import.
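F1 is the harmonic mean of precision and recall. A minimal sketch of an entity-level score follows, assuming exact-match scoring where a prediction counts only if start, end, and label all agree; the `entity_f1` helper and the sample spans are illustrative.

```python
def entity_f1(gold: set, predicted: set):
    """Return (precision, recall, f1) for exact-match entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")}
predicted = {(0, 5, "ORG"), (27, 31, "LOC"), (44, 54, "MONEY")}  # one wrong label
precision, recall, f1 = entity_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
# P=0.67 R=0.67 F1=0.67
```

Running the same calculation on a sample of your own documents gives a more realistic figure than any published benchmark score.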
How is NER used in document processing?
Document pipelines use NER to locate candidate fields, redact sensitive spans, or populate database columns. In production, NER often pairs with layout analysis, OCR, and LLM-based field mapping — especially when documents mix prose and tables across multiple pages.
Can NER handle multiple languages?
Yes, with language-specific or multilingual models. European HR workflows frequently need Spanish, English, French, German, Italian, Portuguese, and Dutch on the same platform. Inputo’s OCR covers those seven languages; AI extraction then maps entities to the correct payroll schema regardless of where they appear on the page.
Conclusion
Named Entity Recognition is the quiet workhorse behind modern information extraction. Born at MUC-6, refined through IOB tagging and neural models, and available in every major NLP library, NER turns raw text into labelled spans you can act on. For simple sentences — “Apple … U.K. … $1 billion” — it works beautifully.
Business documents demand more: relationships between entities, table structure, multilingual scans, and export-ready field names. That is where AI-powered platforms extend NER’s foundation. Whether you experiment with NLTK and SpaCy in Python or upload your next payslip batch to Inputo, you are building on the same idea — find what matters in text, automatically.
Ready to move from entity labels to structured payroll and invoice data? Launch Inputo, upload a document, and download mapped Excel or CSV in minutes.
NER finds the names and amounts — Inputo maps them to the fields your software expects. Try the full app today.
Launch Inputo app →