How to turn a scanned document into a searchable PDF — OCR guide.
A scanned book, contract, invoice or ID — all of it can be turned into a searchable PDF via OCR. Guide for working with Serbian and English text.
If someone sent you a scanned PDF, you probably tried to select text and — nothing. Selection captures a rectangle instead of a word. The reason: the PDF contains an image of the page, not real text. The fix: OCR.
What OCR does
OCR (Optical Character Recognition) is a technique that "reads" images and recognises letters, words and sentences. The output is usually a searchable PDF with an invisible text layer placed over the image: the text is there, but it does not change the look of the original.
Practical consequences: you can select text, search the PDF (Ctrl+F), copy text into Word, and use the document with a screen reader.
What our OCR service does under the hood
We use Tesseract 4, an open-source OCR engine whose development Google sponsored for many years, with Serbian and English language models. On top of that we apply:
- Deskew: straightens slightly tilted scans (this is separate from rotating pages that are sideways or upside down).
- Noise cleaning — removes black dots, smudges and artefacts.
- Skip-text mode — pages that already have text are not processed twice.
- UTF-8 encoding — Cyrillic and Latin work the same.
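The text above does not say which tool wraps Tesseract. As a rough sketch, assuming something like the open-source OCRmyPDF command-line tool is installed, the first three items in the list map onto its flags as shown below. The helper names and file names are made up for illustration; `srp`, `srp_latn` and `eng` are Tesseract's language codes for Serbian Cyrillic, Serbian Latin and English.

```python
import subprocess


def build_ocr_command(src, dst, languages=("srp", "srp_latn", "eng")):
    """Build an OCRmyPDF command line.

    Hypothetical helper for illustration, not the service's actual code.
    """
    return [
        "ocrmypdf",
        "--language", "+".join(languages),  # Tesseract codes joined with "+"
        "--deskew",      # straighten slightly tilted pages
        "--clean",       # remove specks before OCR (needs the unpaper tool)
        "--skip-text",   # leave pages that already contain text untouched
        src,
        dst,
    ]


def run_ocr(src, dst):
    """Run the command; raises CalledProcessError if OCRmyPDF reports failure."""
    subprocess.run(build_ocr_command(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command, so this runs even without OCRmyPDF installed.
    print(" ".join(build_ocr_command("scan.pdf", "searchable.pdf")))
```

Building the argument list separately from running it makes the configuration easy to inspect. Flag names can change between OCRmyPDF releases, so check them against the installed version.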
What OCR does not do perfectly
- Handwriting is not well supported. Tesseract is trained on print.
- Very poor scans (blurry, faded, marked) produce many recognition errors.
- Text inside diagrams or images may be skipped.
- Tables are often recognised as text (without table structure).
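One way to spot pages where these limits bite is to look at per-word confidence. The sketch below assumes the TSV output that the Tesseract command line can produce (for example `tesseract page.png out tsv`), in which each word row carries a `conf` value from 0 to 100. The threshold of 60 and the function name are arbitrary choices for illustration, not settings the service is known to use.

```python
import csv
import io


def low_confidence_words(tsv_text, threshold=60.0):
    """Return (word, confidence) pairs below the threshold from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        # Layout rows (page, block, line) have conf == -1 and no text; skip them.
        if text and 0 <= conf < threshold:
            flagged.append((text, conf))
    return flagged


# Tiny hand-written sample in Tesseract's TSV column layout.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t300\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t80\t20\t96\tUgovor\n"
    "5\t1\t1\t1\t1\t2\t100\t10\t60\t20\t41\tbr0j\n"
)

if __name__ == "__main__":
    print(low_confidence_words(SAMPLE))
```

A page with many flagged words is a good candidate for rescanning or manual proofreading.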
Practical tips for better OCR
- If you scan with a phone, use an app such as Microsoft Lens (formerly Office Lens) or Adobe Scan: they auto-correct perspective and contrast.
- Scan at 300 DPI if you can. Lower DPI = less detail for OCR.
- Pre-process the image: boost contrast, and convert to grayscale if the original is yellowed or stained.
- Mixed-language documents (Serbian + English in the same document) are fine: our service uses both language models simultaneously.
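The 300 DPI advice can be checked with simple arithmetic: pixels = inches × DPI, with 25.4 mm per inch. The sketch below (helper names are made up for illustration) computes the pixel size an A4 page needs at 300 DPI and estimates the effective DPI of a phone photo, assuming the page width fills the frame.

```python
MM_PER_INCH = 25.4


def pixels_needed(width_mm, height_mm, dpi=300):
    """Pixel dimensions required to capture a page of the given size at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def effective_dpi(pixels_across, width_mm):
    """Approximate DPI of an image whose width spans width_mm of paper."""
    return pixels_across / (width_mm / MM_PER_INCH)


if __name__ == "__main__":
    print(pixels_needed(210, 297))          # A4 page at 300 DPI
    print(round(effective_dpi(3024, 210)))  # 3024 px across the width of an A4 page
```

An A4 page needs roughly 2480 × 3508 pixels at 300 DPI, so a 12-megapixel photo (about 3024 pixels on the short side) is sufficient only when the page fills most of the frame.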
What to do next with an OCR'd PDF
The most common next actions:
- Convert to Word for text editing (our PDF to Word tool automatically runs OCR on scans, but results are usually better if you OCR first and then convert).
- Extract clean text as .txt for quotes, analysis, copy-paste.
- Compress — an OCR'd PDF is usually 30-50% larger than the original (because of the extra text).