Introducing: PDF to Text (2024)

PDFs hold tons of valuable information that we’d like to set free using the power of Alteryx! And they are so ubiquitous that they feel familiar and easy. But when the Alteryx Intelligence Suite team sat down to design our new PDF to Text tool, we realized there was a lot more to the Portable Document Format than meets the eye. That complexity shaped the choices we made as we designed the new tool. We hope pulling back the curtain on that process will be interesting and helpful as you start using the tool!


Fundamentally, a PDF is a file created following the rules in the Portable Document Format. The PDF specification was first introduced by Adobe in 1993 and was released as an open standard managed by the International Organization for Standardization (ISO) in 2008. The current version of the ISO standard for PDFs is almost 1000 pages long, and between the original introduction and the current standard, there have been several intermediate specifications. These standards have, in turn, been implemented by many different PDF writing programs that made different choices in how to apply the specifications. The result of this evolution over time and the flexibility of the 1000-page standard:

Two identical-looking PDFs can have very different internal structures and content.


If you’ve ever opened a PDF in a text editor to look for the text and other elements that you see in a PDF viewer, you may have been met with a wall of unreadable characters instead.

That being said, any given PDF file may contain some of the following elements:

- Bitmap graphics (photographs, scans, other images specified pixel-by-pixel)
- Vector graphics (instructions for creating drawings using shapes and lines)
- Text stored as content streams (instructions on where and how to draw text on the page)
- Multimedia objects, links, and other embedded content
- Fonts packaged with the file so they can travel with the document
- Instructions for how and where to draw or embed each element on each page
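
To make “text stored as content streams” a little more concrete, here is a minimal Python sketch that uses only the standard library. The sample stream and its invoice strings are made up for illustration, and a real PDF wraps its streams in objects, fonts, and encodings that this sketch ignores. It shows one reason a text editor may not reveal the words you see in a viewer: PDF writers commonly compress content streams with the Flate (zlib) filter.

```python
import zlib

# A tiny, made-up page content stream. BT/ET begin and end a text object,
# Tf selects a font and size, Td moves the text position, Tj draws a string.
content_stream = (
    b"BT\n"
    b"/F1 12 Tf\n"
    b"72 720 Td (Invoice number: 12345) Tj\n"
    b"0 -16 Td (Invoice date: 2024-01-01) Tj\n"
    b"0 -16 Td (Invoice total: 99.00) Tj\n"
    b"ET\n"
)

# Compress the stream the way a FlateDecode filter would (zlib format).
compressed = zlib.compress(content_stream)

# Roughly what a text editor sees: the readable word is gone from the bytes.
print(b"Invoice" in compressed)

# Roughly what a PDF reader does first: decompress, then interpret operators.
recovered = zlib.decompress(compressed)
print(recovered.decode("ascii"))
```

Even after decompression, the text is still a set of drawing instructions rather than sentences, which is part of why text extraction needs a dedicated tool.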

When it comes specifically to text, there is a spectrum of approaches to creating PDFs that made it more complicated for us to design a good PDF text extraction tool:

Common PDF Creation Techniques	Implications for Text Storage and Extraction
Taking a picture or scanning a document	Text is stored as bitmap graphics and requires Optical Character Recognition (OCR) to extract text
Using OCR to overlay transparent text on top of a scanned or photo-based document	Text appears twice in the document - once as bitmap graphics in the image, and again as an invisible text content overlay to support copy-pasting and searching
Optimizing PDF size by converting characters in a non-typical font into vector graphics (drawings of the letters) instead of embedding the whole font in the document	Text is stored as vector graphics and requires OCR to extract text
Combining pictures of text, drawings of text, and text content on a single page	Text is stored as bitmap graphics, vector graphics, and text content, so extracting all the words requires both reading the text content and applying OCR to the text stored as bitmap and vector graphics
Writing a digital “True PDF” document with all text stored as text content	Huzzah! Text content extraction will retrieve all the text in this document! (Unless there are words embedded in images like logos or diagrams or pictures.)


In 2020, Alteryx Intelligence Suite was launched with tools designed to extract data from PDFs. In our original approach, we first convert all PDFs to images using Image Input. Then we apply OCR to the image of each page using Image to Text. This is great because it always works, regardless of variability in how the PDF was created!

However, even an excellent OCR model applied to the most pristine images of text only has ~97% accuracy. Which is also great! But if a page of text has hundreds of characters, small inaccuracies may accumulate. (Also, the OCR models can be a bit slow.) Since at least some PDFs have text content that might be read directly (and quickly! with near 100% accuracy, in most cases!), we started to wonder if there might be a way to bring that text content into Alteryx.
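
Here is a rough back-of-the-envelope sketch of how those small inaccuracies add up. It assumes every character is recognized independently with the same ~97% accuracy, which is a simplification, and the 500-character page and 10-character field are hypothetical numbers chosen for illustration.

```python
def expected_misreads(char_count: int, char_accuracy: float) -> float:
    """Expected number of misread characters on a page."""
    return char_count * (1 - char_accuracy)


def chance_all_correct(char_count: int, char_accuracy: float) -> float:
    """Probability that every character in a field is read correctly,
    assuming characters are recognized independently."""
    return char_accuracy ** char_count


CHAR_ACCURACY = 0.97    # from the ~97% figure above
PAGE_CHARS = 500        # hypothetical page with "hundreds of characters"
FIELD_CHARS = 10        # hypothetical field, such as an invoice number

print(f"Expected misreads per page: {expected_misreads(PAGE_CHARS, CHAR_ACCURACY):.0f}")
print(f"Chance a {FIELD_CHARS}-character field is perfect: "
      f"{chance_all_correct(FIELD_CHARS, CHAR_ACCURACY):.1%}")
```

Under these assumptions, a 500-character page would have about 15 misread characters, and a 10-character field would come through perfectly only about 74% of the time.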


Enter: PDF to Text! Our initial goal with PDF to Text was just to extract the text content from PDF documents. Then we met the invoice below:

This is a real invoice that Alteryx was sent by one of our vendors (although all the names and numbers have been anonymized for everyone’s privacy). For this page, text content alone will get us about half the text, while the rest is stored as graphic content. And depending on the use case, the text content might contain everything we need, or... it might not.


So we realized we needed to do a few things:

Give users the ability to combine text content with OCR results from the graphic content of each page. We called this “magic” internally during the development process, as it took some creative thinking to make the solution work. This is the Read Text and Image Content Text Extraction Option in PDF to Text. It gives the most complete and accurate result for text on the page but takes a bit longer (~1-2 seconds per page, depending on the document and your computer hardware).


Give users the ability to Read Text Content Only for the times when all the content they care about is available as text content, and they don’t want to take the time to run OCR on each page. This can be much faster (~0.2 - 1 second per page, again depending on the document and your computer hardware)! But also… a little scary! Because it’s hard to tell what you might be missing in graphic text!


Give users guard rails that will let them experiment with Read Text Content Only while assessing whether they might be losing critical content present as graphic text. Specifically:
- Output Image of Page Graphics results in an image BLOB (binary large object) in the Image output column with the Output Option column value “pdf graphics”. This image can be rendered by connecting an Image tool with the Get Image from Binary Data in Field option and visually inspected with a Browse tool attached to the Image tool. It shows only what is “left behind” by the text content extraction.

- Risk Score for Text Encoded as Graphics goes one step further and applies OCR to only the graphic elements of each page. It counts the number of graphic text words and outputs that in the Graphic Text Word Count column. It also assigns a Graphic Text Risk level to each page based on that word count.
  - 9 or fewer graphic text words (such as might be found in a logo): “low” risk
  - 10-29 words: “medium” risk
  - 30 or more words: “high” risk

We developed those thresholds by looking at a representative set of documents, but you can calibrate your own risk levels using the raw word counts and images of page graphics for your documents and assign those risk levels using a Formula tool. You can also use the Risk level or the Graphic Text Word Count to filter your pages downstream into different processing workflows.
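
The thresholds above, and the idea of calibrating your own, can be sketched in a few lines of Python. This is only an illustration of the logic, not how the tool implements it: the function name, the dictionary keys, and the sample word counts are all hypothetical. Inside a workflow you would express the same rule in a Formula tool.

```python
def graphic_text_risk(word_count: int, medium_at: int = 10, high_at: int = 30) -> str:
    """Map a graphic text word count to a risk level.

    Defaults mirror the thresholds described above; pass your own
    cut-offs to calibrate the levels for your documents.
    """
    if word_count >= high_at:
        return "high"
    if word_count >= medium_at:
        return "medium"
    return "low"


# Hypothetical per-page results, one entry per page.
pages = [
    {"page": 1, "graphic_text_word_count": 3},
    {"page": 2, "graphic_text_word_count": 12},
    {"page": 3, "graphic_text_word_count": 45},
]

# Route pages: low-risk pages keep the fast text-only result,
# everything else is flagged for OCR or manual review.
text_only_ok = [p["page"] for p in pages
                if graphic_text_risk(p["graphic_text_word_count"]) == "low"]
needs_review = [p["page"] for p in pages
                if graphic_text_risk(p["graphic_text_word_count"]) != "low"]

print("Text-only is probably fine for pages:", text_only_ok)
print("Consider OCR or review for pages:", needs_review)
```

Passing different `medium_at` and `high_at` values is one way to try stricter or looser cut-offs before committing to them downstream.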

Combining the Read Text Content Only option with the Risk Score for Text Encoded as Graphics option is not significantly faster than the Read Text and Image Content option, as both are reading in text content and applying OCR to each page. This combination does, however, give users the opportunity to explore what risks they would be taking if they implemented Read Text Content Only without the risk score in exchange for the speed improvements that come with dispensing with the OCR.
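
To get a feel for that speed trade-off, here is a small sketch that scales the approximate per-page timings quoted earlier up to a whole batch. The timings are the rough ranges from this post, and the 1,000-page batch is a hypothetical example. Your own numbers will depend on the documents and your hardware.

```python
# Approximate seconds per page (low, high), taken from the ranges quoted above.
SECONDS_PER_PAGE = {
    "Read Text and Image Content": (1.0, 2.0),
    "Read Text Content Only": (0.2, 1.0),
}


def estimate_runtime(page_count: int, option: str) -> tuple[float, float]:
    """Return a (best case, worst case) runtime estimate in seconds."""
    low, high = SECONDS_PER_PAGE[option]
    return page_count * low, page_count * high


PAGE_COUNT = 1000  # hypothetical batch size

for option in SECONDS_PER_PAGE:
    best, worst = estimate_runtime(PAGE_COUNT, option)
    print(f"{option}: about {best / 60:.0f} to {worst / 60:.0f} minutes")
```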


We also give users the ability to Preview what the Read Text Content Only vs. Read Text and Image Content options might extract. When a single file is selected with the “Browse” button in the PDF to Text configuration window, the Preview window below will show what content each text extraction option can access. For instance, in the example below we can see that for this file, most of the text would be extracted by Read Text Content Only (right), but text embedded in the images of the toolbars will be skipped (for better or for worse, depending on the way the data will be used downstream).

A bonus of Read Text Content Only mode: more languages! The OCR used in Read Text and Image Content and Risk Score for Text Encoded as Graphics uses the languages specified in the Language selection to refine its results. However, the text content extraction is reading characters directly from the PDF, and as long as it can read those characters, it does not care what language they are from!


Thanks for joining us on this journey through the inner space of PDFs and the resulting options we’ve provided in PDF to Text! We’re looking forward to seeing what you can do with the tool!

To find additional resources on the AIS tools, click here:

Alteryx Intelligence Suite Learning Path
Alteryx Intelligence Suite Tools Help Main Page
