PDF Text Extraction using KNIME, Regex, and Python

August 17, 2022 - Online

In this webinar, we will parse PDF documents using the free, no-code tool KNIME and integrate it with two code-based tools: Regex and Python.

PDFs bring a number of unique challenges. For instance, how do we know whether a PDF is text-based or image-based? If it is text-based, the text can be extracted with one node and a few clicks in KNIME. If it is image-based, we first need to perform Optical Character Recognition (OCR) to extract the text. But what if we have thousands of PDFs of mixed types? Similarly, tables found in PDFs are almost always tough to extract, so what techniques does KNIME offer in this case? And can KNIME handle non-English or non-ASCII text? Join us for this one-hour presentation with Victor Palacios (KNIME Team Member), who will tackle each of these problems.
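
To make the mixed-types question concrete, here is a minimal Python sketch of one possible heuristic, not necessarily the approach shown in the webinar. It assumes each PDF has already been run through a plain text parser, and it treats a file that yields almost no visible characters as probably image-based and in need of OCR. The function names, the 20-character threshold, and the sample data are all hypothetical.

```python
import re

def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a PDF is image-based from the text a parser returned."""
    # Count only non-whitespace characters; scanned PDFs often yield none.
    visible = re.sub(r"\s+", "", extracted_text or "")
    return len(visible) < min_chars

def route(documents: dict[str, str]) -> dict[str, list[str]]:
    """Split file names into a direct-parsing group and an OCR group."""
    groups = {"text": [], "ocr": []}
    for name, text in documents.items():
        groups["ocr" if needs_ocr(text) else "text"].append(name)
    return groups

# Hypothetical parser output: file name -> text extracted without OCR.
sample = {
    "report.pdf": "Quarterly revenue grew in all regions during the period.",
    "scan.pdf": " \n\n ",
}
print(route(sample))
```

A threshold like this is only a rough filter: a text-based PDF with a nearly empty page, or a scanned PDF with an embedded text layer, can land in the wrong group, so the cutoff would need tuning on real files.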

In this webinar, we will:

  1. Learn different ways to read text- or image-based PDFs in KNIME.

  2. Examine the quality of our input PDFs to understand our output.

  3. Extract text from PDFs using KNIME, Regex, and Python integrations. 
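
As a rough illustration of the third point, the sketch below applies two regular expressions in Python to text that has already been extracted from a PDF. The sample text, the field labels, and the patterns are invented for this example and are not taken from the webinar materials. It also touches on the non-ASCII question: for Python 3 string patterns, `\w` matches Unicode letters by default.

```python
import re

# Hypothetical text, as it might come out of a PDF parser or OCR step.
text = """Invoice No: INV-2022-0817
Date: 2022-08-17
Kunde: Jörg Müller
Total: 1,250.00 EUR"""

# ISO-style dates (YYYY-MM-DD).
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# "Label: value" pairs, one per line; \w also matches non-ASCII letters.
FIELD = re.compile(r"^([\w ]+):\s*(.+)$", re.MULTILINE)

dates = DATE.findall(text)
fields = {key.strip(): value.strip() for key, value in FIELD.findall(text)}

print(dates)
print(fields["Kunde"])
```

Patterns like these depend on how cleanly the text was extracted. OCR output in particular may contain misread characters, so real documents usually call for looser patterns and some validation of the matches.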


How do I join the webinar?

You’ll receive a link with your registration confirmation. Make sure you have a stable internet connection!

Will I be able to ask questions?

Absolutely - fire away!

Where do I find the latest version of KNIME Analytics Platform?

Download the latest free, open-source version of KNIME here: knime.com/download.

What other resources will help me to get started in KNIME?