Introducing: PDF to Text (2024)

PDFs hold tons of valuable information that we’d like to set free using the power of Alteryx! And they are so ubiquitous that they feel familiar and easy. But when the Alteryx Intelligence Suite team sat down to design our new PDF to Text tool, we realized there was a lot more to the Portable Document Format than meets the eye. That complexity shaped the choices we made as we designed the new tool. We hope pulling back the curtain on that process will be interesting and helpful as you start using the tool!

Fundamentally, a PDF is a file created following the rules in the Portable Document Format. The PDF specification was first introduced by Adobe in 1993 and was released as an open standard managed by the International Organization for Standardization (ISO) in 2008. The current version of the ISO standard for PDFs is almost 1000 pages long, and between the original introduction and the current standard, there have been several intermediate specifications. These standards have, in turn, been implemented by many different PDF writing programs that made different choices in how to apply the specifications. The result of this evolution over time and the flexibility of the 1000-page standard:

Two identical-looking PDFs can have very different internal structures and content.

If you’ve ever tried to open up a PDF with a text editor to look for the text and other elements that you see with a PDF viewer, you may have found that what shows up in the editor looks nothing like the page.

That being said, any given PDF file may contain some of the following elements:

  • Bitmap graphics (photographs, scans, other images specified pixel-by-pixel)
  • Vector graphics (instructions for creating drawings using shapes and lines)
  • Text stored as content streams (instructions on where and how to draw text on the page)
  • Multimedia objects, links, and other embedded content
  • Fonts packaged with the file so they can travel with the document
  • Instructions for how and where to draw or embed each element on each page

When it comes specifically to text, there is a spectrum of approaches to creating PDFs that made it more complicated for us to design a good PDF text extraction tool:

| Common PDF Creation Techniques | Implications for Text Storage and Extraction |
| --- | --- |
| Taking a picture or scanning a document | Text is stored as bitmap graphics and requires Optical Character Recognition (OCR) to extract text |
| Using OCR to overlay transparent text on top of a scanned or photo-based document | Text appears twice in the document - once as bitmap graphics in the image, and again as an invisible text content overlay to support copy-pasting and searching |
| Optimizing PDF size by converting characters in a non-typical font into vector graphics (drawings of the letters) instead of embedding the whole font in the document | Text is stored as vector graphics and requires OCR to extract text |
| Combining pictures of text, drawings of text, and text content on a single page | Text is stored as bitmap graphics, vector graphics, and text content, so extracting all the words requires both reading the text content and applying OCR to the text stored as bitmap and vector graphics |
| Writing a digital “True PDF” document with all text stored as text content | Huzzah! Text content extraction will retrieve all the text in this document! (Unless there are words embedded in images like logos or diagrams or pictures.) |

In 2020, Alteryx Intelligence Suite was launched with tools designed to extract data from PDFs. In our original approach, we first convert all PDFs to images using Image Input. Then we apply OCR to the image of each page using Image to Text. This is great because it always works, regardless of variability in how the PDF was created!

However, even an excellent OCR model applied to the most pristine images of text only has ~97% accuracy. Which is also great! But if a page of text has hundreds of characters, small inaccuracies may accumulate. (Also, the OCR models can be a bit slow.) Since at least some PDFs have text content that might be read directly (and quickly! with near 100% accuracy, in most cases!), we started to wonder if there might be a way to bring that text content into Alteryx.
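
To make the accumulation concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the ~97% figure is a per-character accuracy and that errors are independent of one another, neither of which is stated above. The function names are made up for illustration and are not part of any Alteryx tool.

```python
def expected_ocr_errors(num_chars: int, char_accuracy: float = 0.97) -> float:
    """Expected number of misread characters (assumes independent errors)."""
    return num_chars * (1.0 - char_accuracy)


def chance_of_perfect_page(num_chars: int, char_accuracy: float = 0.97) -> float:
    """Probability that every character on the page is read correctly."""
    return char_accuracy ** num_chars


if __name__ == "__main__":
    # Compare a short snippet, a modest page, and a dense page of text.
    for num_chars in (100, 500, 2000):
        errors = expected_ocr_errors(num_chars)
        perfect = chance_of_perfect_page(num_chars)
        print(f"{num_chars} characters: about {errors:.0f} expected errors, "
              f"{perfect:.2%} chance of a perfect page")
```

Under these assumptions, a 500-character page would average around 15 misread characters, which is why reading text content directly is attractive whenever it is available.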

Enter: PDF to Text! Our initial goal with PDF to Text was just to extract the text content from PDF documents. Then we met the invoice below:

[Figure: the anonymized vendor invoice]

This is a real invoice that Alteryx was sent by one of our vendors (although all the names and numbers have been anonymized for everyone’s privacy). Text content alone will get us about half the text on this page; the rest is stored as graphic content. And depending on the use case, the text content might contain everything we need, or… it might not.

So we realized we needed to do a few things:

  • Give users the ability to combine text content with OCR results from the graphic content of each page. We called this “magic” internally during the development process, as it took some creative thinking to make the solution work. This is the Read Text and Image Content Text Extraction Option in PDF to Text. It gives the most complete and accurate result for text on the page but takes a bit longer (~1-2 seconds per page, depending on the document and your computer hardware).

  • Give users the ability to Read Text Content Only for the times when all the content they care about is available as text content, and they don’t want to take the time to run OCR on each page. This can be much faster (~0.2 - 1 second per page, again depending on the document and your computer hardware)! But also… a little scary! Because it’s hard to tell what you might be missing in graphic text!

  • Give users guard rails that will let them experiment with Read Text Content Only while assessing whether they might be losing critical content present as graphic text. Specifically:
    • Output Image of Page Graphics results in an image BLOB (binary large object) in the Image output column with the Output Option column value “pdf graphics”. This image can be rendered by connecting an Image tool with the Get Image from Binary Data in Field option and visually inspected with a Browse tool attached to the Image tool. It shows only what is “left behind” by the text content extraction.

    • Risk Score for Text Encoded as Graphics goes one step further and applies OCR to only the graphic elements of each page. It counts the number of graphic text words and outputs that in the Graphic Text Word Count column. It also assigns a Graphic Text Risk level to each page based on that word count.
      • 9 or fewer graphic text words (such as might be found in a logo): “low” risk
      • 10-29 words: “medium” risk
      • 30 or more words: “high” risk

We developed those thresholds by looking at a representative set of documents, but you can calibrate your own: review the raw word counts and images of page graphics for your documents, then assign your own risk levels using a Formula tool. You can also use the Graphic Text Risk level or the Graphic Text Word Count to filter your pages downstream into different processing workflows.
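
The thresholds listed above boil down to a couple of comparisons. In a workflow you would express this logic in a Formula tool; the Python sketch below only illustrates it, and the function name and parameters are hypothetical rather than anything exposed by PDF to Text. The cutoffs are arguments so that you can try your own calibration.

```python
def graphic_text_risk(word_count: int, medium_at: int = 10, high_at: int = 30) -> str:
    """Map a graphic text word count to a risk level.

    Defaults mirror the thresholds described in the post:
    9 or fewer words is "low", 10-29 is "medium", 30 or more is "high".
    """
    if word_count < 0:
        raise ValueError("word_count cannot be negative")
    if word_count >= high_at:
        return "high"
    if word_count >= medium_at:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Example: stricter cutoffs for documents where any graphic text matters.
    for count in (3, 12, 45):
        print(count, graphic_text_risk(count), graphic_text_risk(count, 5, 15))
```

Lowering `medium_at` and `high_at` makes the check stricter, which may suit documents such as invoices where even a few missed words matter.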

Combining the Read Text Content Only option with the Risk Score for Text Encoded as Graphics option is not significantly faster than the Read Text and Image Content option, as both are reading in text content and applying OCR to each page. This combination does, however, give users the opportunity to explore what risks they would be taking if they implemented Read Text Content Only without the risk score in exchange for the speed improvements that come with dispensing with the OCR.
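
For a rough sense of what that speed trade-off means on a large batch, the per-page timings quoted earlier can be scaled up. This sketch assumes run time grows linearly with page count, which ignores startup costs and differences between documents and hardware. The dictionary and function names are illustrative only.

```python
# Approximate (fastest, slowest) seconds per page, as quoted in the post.
SECONDS_PER_PAGE = {
    "Read Text Content Only": (0.2, 1.0),
    "Read Text and Image Content": (1.0, 2.0),
}


def estimate_runtime_seconds(pages: int, option: str) -> tuple[float, float]:
    """Return a (low, high) estimate of total seconds for the given option."""
    low, high = SECONDS_PER_PAGE[option]
    return pages * low, pages * high


if __name__ == "__main__":
    for option in SECONDS_PER_PAGE:
        low, high = estimate_runtime_seconds(10_000, option)
        print(f"{option}: {low / 3600:.1f} to {high / 3600:.1f} hours")
```

For 10,000 pages, that works out to roughly 33 minutes to 2.8 hours for Read Text Content Only, versus about 2.8 to 5.6 hours for Read Text and Image Content.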

  • We also give users the ability to Preview what the Read Text Content Only vs. Read Text and Image Content options might extract. When a single file is selected with the “Browse” button in the PDF to Text configuration window, the Preview window below will show what content each text extraction option can access. For instance, in the example below we can see that for this file, most of the text would be extracted by Read Text Content Only (right), but text embedded in the images of the toolbars will be skipped (for better or for worse, depending on the way the data will be used downstream).

[Figure: the PDF to Text Preview window comparing the two text extraction options, with Read Text Content Only on the right]

  • A bonus of Read Text Content Only mode: more languages! The OCR used in Read Text and Image Content and Risk Score for Text Encoded as Graphics uses the languages specified in the Language selection to refine its results. However, the text content extraction is reading characters directly from the PDF, and as long as it can read those characters, it does not care what language they are from!

Thanks for joining us on this journey through the inner space of PDFs and the resulting options we’ve provided in PDF to Text! We’re looking forward to seeing what you can do with the tool!

To find additional resources on the AIS tools, click here:

  1. Alteryx Intelligence Suite Learning Path
  2. Alteryx Intelligence Suite Tools Help Main Page