OCR: extract text from images and PDFs

Drop an image or PDF and get the recognised text. OCR runs entirely in your browser (on-device, and offline after the first load); nothing is uploaded.

How OCR · image/PDF to text works

OCR converts a scanned image or image-based PDF into text you can copy, search and edit, using tesseract.js running entirely inside your browser. You choose the document language from the selector, the relevant language model downloads to your browser once, and all subsequent recognition runs offline from that cached model. Your scanned files are never transmitted to any server during the conversion.

Recognition accuracy depends strongly on scan quality. Clean, high-contrast scans at 200 DPI or above, with minimal background noise and straight page alignment, produce the best results. Blurry, low-resolution or heavily compressed JPEGs, pages with columns or complex layouts, and handwritten text all reduce accuracy. The tool outputs a plain text block; for structured output such as preserved tables or multi-column layout, post-processing is needed. Running the PDF Deskew tool on crooked scans before OCR typically improves recognition rates.

Written by Bastien Sulyan

How to use OCR · image/PDF to text, step by step

Drop your scanned image (PNG, JPG, TIFF) or image-based PDF onto the upload area.
Select the primary language of the document from the language dropdown.
If this is your first time using that language, wait for the language model to download (this happens once).
Click "Extract text" and wait for tesseract.js to process each page.
Copy the recognised text or download it as a plain text file.
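
For readers curious what the steps above look like in code, here is a minimal TypeScript sketch of the per-page loop. It is an illustration, not this tool's source: `PageImage`, `PageRecognizer` and `extractText` are hypothetical names, and the recognizer is passed in as a function so the sketch does not depend on any particular engine. With recent tesseract.js versions, that function would typically wrap a worker from `createWorker(lang)` and return `data.text` from `worker.recognize(image)`.

```typescript
// Hypothetical types for illustration; not taken from this tool's source.
type PageImage = { pageNumber: number; data: Uint8Array };
type PageRecognizer = (page: PageImage, lang: string) => Promise<string>;

// Recognise pages one at a time and join the results into one plain-text block.
async function extractText(
  pages: PageImage[],
  lang: string,
  recognize: PageRecognizer,
): Promise<string> {
  const parts: string[] = [];
  for (const page of pages) {
    // Sequential on purpose: only one page is being recognised at any moment.
    const text = await recognize(page, lang);
    parts.push(text.trim());
  }
  // A blank line between pages keeps page boundaries visible in the output.
  return parts.join("\n\n");
}
```

Processing pages sequentially is one reasonable choice for long PDFs; a real implementation might also report progress per page.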

Common use cases

A scanned receipt needs its line items extracted into a spreadsheet; run OCR to get the text, then paste into your accounting software.
An archive of scanned journal articles needs to be made text-searchable; convert each to text with OCR for indexing.
A photographed whiteboard from a meeting contains notes that need to be turned into an editable document.
A historical scanned document in German needs its text extracted for translation; select German as the language before running OCR.

Frequently asked questions

Why do I need to download a language model before OCR works?

tesseract.js uses trained neural network data files specific to each language. These files are several megabytes each and are downloaded once from this site (we host them ourselves, no third-party CDN) the first time you select that language. After the initial download the model is cached by your browser, and all further recognition for that language runs completely offline.
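
The download-once behaviour can be pictured as a cache keyed by language code. The TypeScript sketch below is an illustration only: `ModelFetcher` and `createModelLoader` are hypothetical names, and the cache lives in memory, whereas the browser cache described above persists between visits.

```typescript
// Hypothetical fetcher type: downloads the model bytes for one language code.
type ModelFetcher = (lang: string) => Promise<Uint8Array>;

// Wraps a fetcher so each language is downloaded at most once.
function createModelLoader(fetchModel: ModelFetcher): ModelFetcher {
  const cache = new Map<string, Promise<Uint8Array>>();
  return (lang) => {
    let model = cache.get(lang);
    if (model === undefined) {
      // First request for this language: start the download and remember it.
      model = fetchModel(lang);
      // Forget failed downloads so a later call can retry.
      model.catch(() => cache.delete(lang));
      cache.set(lang, model);
    }
    return model;
  };
}
```

Caching the promise rather than the finished bytes means two requests for the same language made at the same moment still share a single download.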

What factors most affect OCR accuracy?

Scan resolution (200 DPI minimum, 300 DPI recommended), image sharpness, contrast between text and background, and whether the page is straight all strongly affect accuracy. Heavily compressed JPEG scans, very small fonts, and pages with mixed orientations or complex column layouts are the most common sources of recognition errors.
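
If you are unsure what resolution a scan has, the arithmetic is simple: effective DPI is the pixel width divided by the physical page width in inches. The TypeScript sketch below uses hypothetical helper names (`effectiveDpi`, `rateScan`) and assumes you know the paper size of the original page; the thresholds are the ones given above.

```typescript
// Effective resolution: pixels across divided by inches across, rounded.
function effectiveDpi(widthPx: number, pageWidthInches: number): number {
  if (widthPx <= 0 || pageWidthInches <= 0) {
    throw new RangeError("width and page size must be positive");
  }
  return Math.round(widthPx / pageWidthInches);
}

type ScanRating = "too low" | "acceptable" | "recommended";

// Thresholds follow the guidance above: 200 DPI minimum, 300 DPI recommended.
function rateScan(dpi: number): ScanRating {
  if (dpi < 200) return "too low";
  if (dpi < 300) return "acceptable";
  return "recommended";
}

// An A4 page is 210 mm wide; 25.4 mm per inch gives about 8.27 inches.
const A4_WIDTH_INCHES = 210 / 25.4;
const a4Rating = rateScan(effectiveDpi(2480, A4_WIDTH_INCHES));
```

Here an A4 scan that is 2480 pixels wide works out to 300 DPI, so `a4Rating` is "recommended"; the same page at 1240 pixels wide would be about 150 DPI and fall below the minimum.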

Can OCR read handwritten text?

The language models tesseract.js uses are trained primarily on printed text. Handwriting recognition accuracy is generally low and unreliable, especially for cursive script. For handwritten documents, dedicated handwriting recognition tools produce better results.

Are my scanned documents sent anywhere during text extraction?

No. Recognition always runs entirely in your browser using tesseract.js. The only network transfer is the one-time download of the language model, and your files are not part of that request. Scanned documents may contain personal or confidential content; local-only processing means that content never reaches a server.

Does the tool preserve the layout of the original scan?

The output is a plain text stream in reading order. Tables, columns, headers and other layout elements are not preserved as structure; the tool outputs the text content only. For layout-preserving output, a more advanced OCR pipeline with layout analysis is required.

Can I OCR a PDF that already contains selectable text?

The tool is meant for image-based PDFs, where each page is a raster image with no embedded text. If your PDF already has a text layer (you can select and copy text in a viewer), running OCR is unnecessary: the existing text layer gives you the same information without the recognition step.

Can I run OCR on a photo taken with my phone?

Yes, and tesseract.js works on mobile browsers, so you can even open this page on the phone that took the picture. Photos taken at an angle or with uneven lighting recognise worse than a flatbed scan; straightening the shot and cropping out the background first helps.

Do I need to create an account or pay to use OCR?

No. There is no sign-up and no fee. The only download involved is the language model tesseract.js needs, fetched once per language; it is not a subscription or a paywall.
