30 May 2026

How to turn a scanned document into a searchable PDF — OCR guide.

A scanned book, contract, invoice or ID can all be turned into a searchable PDF with OCR. This guide covers documents in Serbian and English.

If someone sent you a scanned PDF, you probably tried to select text and — nothing. Selection captures a rectangle instead of a word. The reason: the PDF contains an image of the page, not real text. The fix: OCR.

What OCR does

OCR (Optical Character Recognition) is a technique that "reads" page images and recognises letters, words and sentences. The output is usually a searchable PDF in which an invisible text layer sits over the page image: the text is there, but the page looks exactly like the original.

Practical consequences: you can select text, search the PDF (Ctrl+F), copy passages into Word, and use the document with a screen reader.
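To make the idea of a text layer concrete, here is a minimal conceptual sketch in Python. It is not how a PDF stores text internally; the `Word` record, the `find_word` helper and the sample coordinates are all hypothetical, chosen only to show why search works once every recognised word is tied to a position on the page image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word and where it sits on the page image (in pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_word(layer: list[Word], query: str) -> list[Word]:
    """Return every word in the text layer that contains the query, ignoring case."""
    needle = query.casefold()
    return [w for w in layer if needle in w.text.casefold()]


# Hypothetical OCR result for one line of a scanned invoice.
layer = [
    Word("Invoice", x=120, y=80, width=210, height=42),
    Word("No.", x=345, y=80, width=80, height=42),
    Word("2024-117", x=440, y=80, width=230, height=42),
]

# A viewer could highlight the rectangle of each hit on top of the image.
for hit in find_word(layer, "invoice"):
    print(hit.text, (hit.x, hit.y, hit.width, hit.height))
```

Without the layer there is nothing to match against, which is why selecting in a plain scan only grabs a rectangle of pixels.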


What our OCR service does under the hood

We use Tesseract 4, the open-source OCR engine whose development Google sponsored for many years, with Serbian and English language models. Additional techniques we apply:

Deskew: straightens slightly tilted scans.
Noise cleaning: removes black dots, smudges and artefacts.
Skip-text mode: pages that already contain a text layer are left as they are instead of being run through OCR again.
UTF-8 encoding: Cyrillic and Latin script work the same.
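The article does not say how these steps are wired together. One open-source tool that exposes options matching this list is OCRmyPDF, which drives Tesseract. The sketch below assumes that tool and only builds the command line, so it runs even where OCRmyPDF is not installed. The function name, the file names and the choice of language models (`srp` for Serbian Cyrillic, `srp_latn` for Serbian Latin, `eng` for English) are assumptions made for illustration, not the service's actual configuration.

```python
import shlex


def build_ocr_command(
    src: str,
    dst: str,
    languages: tuple[str, ...] = ("srp", "srp_latn", "eng"),
) -> list[str]:
    """Build an OCRmyPDF command line covering deskew, cleaning and skip-text."""
    return [
        "ocrmypdf",
        "--language", "+".join(languages),  # Tesseract models, joined with "+"
        "--deskew",      # straighten slightly tilted pages
        "--clean",       # clean specks from the image handed to OCR (needs unpaper)
        "--skip-text",   # leave pages that already contain text untouched
        src,
        dst,
    ]


# Hypothetical file names.
command = build_ocr_command("scan.pdf", "scan-searchable.pdf")
print(shlex.join(command))
```

With OCRmyPDF installed, the list could be passed to `subprocess.run(command, check=True)`. No flag is needed for the encoding point: the recognised text is Unicode either way.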

What OCR does not do perfectly

Handwriting is not well supported: Tesseract's models are trained on printed text.
Very poor scans (blurry, faded or marked up) produce many recognition errors.
Text inside diagrams or images may be skipped.
Tables are often recognised as plain text, losing their row and column structure.

Practical tips for better OCR

If you scan with a phone, use an app such as Microsoft Lens (formerly Office Lens) or Adobe Scan. Both auto-correct perspective and contrast.
Scan at 300 DPI if you can. Lower DPI = less detail for OCR.
Pre-process the image: boost contrast, and convert to greyscale if the original has yellow stains.
Mixed languages (Serbian + English in the same document) are fine: our service uses both language models simultaneously.
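Two of these tips come down to simple arithmetic, sketched here in plain Python. The first function shows why resolution matters: a point is 1/72 inch, so the pixel height of a line of type scales directly with DPI. The other two show what "convert to greyscale" and "boost contrast" mean for individual pixels. All three helpers and the sample pixel values are hypothetical. A real pipeline would use an image library, and a linear stretch is only one of several ways to raise contrast.

```python
def em_height_px(font_size_pt: float, dpi: int) -> float:
    """Height in pixels of the font's em box: 1 point is 1/72 inch."""
    return font_size_pt / 72 * dpi


def to_grey(pixel: tuple[int, int, int]) -> int:
    """Collapse an RGB pixel to one brightness value (ITU-R BT.601 weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def stretch_contrast(greys: list[int]) -> list[int]:
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    lo, hi = min(greys), max(greys)
    if hi == lo:  # flat image: nothing to stretch
        return list(greys)
    return [round((g - lo) * 255 / (hi - lo)) for g in greys]


for dpi in (150, 300):
    print(dpi, "DPI:", round(em_height_px(10, dpi), 1), "px for 10 pt text")

# Hypothetical row of pixels: yellowed paper, faded ink, yellowed paper.
row = [(230, 215, 160), (90, 85, 70), (228, 212, 158)]
greys = [to_grey(p) for p in row]
print(greys, "->", stretch_contrast(greys))
```

Halving the resolution halves the pixels available for each letter. After the stretch, the faded ink and the yellowed paper end up at opposite ends of the brightness range.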

What to do next with an OCR'd PDF

The most common next actions:

Convert to Word for text editing. Our PDF to Word tool runs OCR on scans automatically, but the result is usually cleaner if you OCR first and then convert.
Extract clean text as .txt for quotes, analysis, copy-paste.
Compress — an OCR'd PDF is usually 30-50% larger than the original (because of the extra text).
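As a rough planning aid, the 30-50% figure quoted above can be turned into a size estimate. This is only arithmetic on that rule of thumb. The function name and the 4 MB example are hypothetical, and real files will vary.

```python
def ocr_size_range_mb(
    original_mb: float,
    growth: tuple[float, float] = (0.30, 0.50),
) -> tuple[float, float]:
    """Estimate the size range after OCR, using the article's 30-50% rule of thumb."""
    low, high = growth
    return round(original_mb * (1 + low), 2), round(original_mb * (1 + high), 2)


# Hypothetical 4 MB scan.
low_mb, high_mb = ocr_size_range_mb(4.0)
print(f"Expect roughly {low_mb} to {high_mb} MB after OCR")
```

If the upper estimate is above an upload or e-mail limit, compress after OCR.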


OCR is not magic: it gives you raw text, and you decide what to do with it. But once you have a searchable PDF, every other operation becomes much easier.

