Working concepts of the PDF Connector

The PDF Connector allows developers to search, extract, and fill in text and annotations in PDF documents. A developer can create an automation that audits received documents, finds specific information quickly, and displays that information to an end user for review or further decision-making. For example, an insurance company receives medical insurance claim documents from medical practitioners. An automation can scan those documents, extract specific data values, and present a health summary of the patient to an agent, who then determines how the claim should proceed.

To develop an automation that extracts data and completes form fields, you need to understand how the PDF Connector analyzes a document.

Using the PDF Connector, you interrogate the document structure to locate the items that the automation needs. For example, the following image shows an invoice for Astend Technologies. Every invoice of this document type follows the same layout, much as an application interface places its fields, buttons, and tabs in the same positions every time it runs.

Sample PDF document

Pre-defined document elements

Pre-defined elements make it easier to work with PDF files in an automation. The elements are based on how a human visually perceives a file. The PDF Connector uses whitespace thresholds to define the elements. Whitespace thresholds vary by document type because different fonts, font sizes, and table formats call for different threshold configurations.

There are four pre-defined elements.

  • Pages
  • Lines
  • Segments
  • Words

Pages

The page is the largest pre-defined element in a PDF document. Understanding how the PDF Connector numbers pages helps when you use the page element in automations. The connector uses absolute numbering: the first page of the document is always page 1, and the remaining pages are numbered sequentially, regardless of any page numbers printed in the document.

For example, the United States passport application document contains 6 pages. The first four pages are instructions and information; the remaining two pages are the application form. To complete the form itself, the automation needs to reference page 5 of the document to locate any fields for input.
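The absolute numbering in the passport example can be illustrated with a short Python sketch. This is not PDF Connector code: the `Page` class and the `get_page` helper are hypothetical names, used only to show that page numbers start at 1 and count every page in the file.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int  # absolute, 1-based page number
    text: str


def get_page(pages: list[Page], number: int) -> Page:
    """Return the page with the given absolute (1-based) number."""
    if not 1 <= number <= len(pages):
        raise ValueError(f"page {number} is outside 1..{len(pages)}")
    return pages[number - 1]  # the underlying list index is 0-based


# Six-page document: four instruction pages, then a two-page form.
document = [Page(n, "instructions") for n in range(1, 5)]
document += [
    Page(5, "application form, part 1"),
    Page(6, "application form, part 2"),
]

# The form starts on absolute page 5, even if the form itself is labeled "page 1".
first_form_page = get_page(document, 5)
```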

You can use the PDF Connector methods to return a Page object, scan the elements defined on that page, and perform operations on that page.

Lines

Lines subdivide a page. The PDF Connector analyzes the whitespace above and below each line to determine where that line sits relative to the other lines on the same page.

Use PDF Connector methods to return a Line object, which tells you the page that the line is on, the line number, and the line text.
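As a rough illustration of how vertical whitespace can separate lines, the following Python sketch groups positioned words into lines. It is a simplified model, not the PDF Connector's implementation: the `Word` class, the `group_into_lines` function, and the `line_gap` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # distance of the word's baseline from the top of the page


def group_into_lines(words: list[Word], line_gap: float = 5.0) -> list[str]:
    """Group words into lines; a vertical jump larger than line_gap starts a new line."""
    lines: list[list[Word]] = []
    last_y = None
    for word in sorted(words, key=lambda w: w.y):
        if last_y is None or word.y - last_y > line_gap:
            lines.append([])
        lines[-1].append(word)
        last_y = word.y
    # Order the words in each line from left to right before joining them.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


words = [
    Word("Invoice", 50, 100), Word("Total:", 50, 130),
    Word("#1042", 120, 101), Word("$250.00", 120, 130),
]
page_lines = group_into_lines(words)
# page_lines == ["Invoice #1042", "Total: $250.00"]
```

Line numbers in this model are simply positions in the returned list plus one.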

Segments

Like lines, segments subdivide a page. Whereas lines are determined by the whitespace above and below the text, segments are determined by the whitespace before and after characters, which the PDF Connector uses to break a line into segments. A segment never spans more than one line.

Use PDF Connector methods to return a Segment object, from which you can retrieve the page number, the segment text, other nearby segments, and other objects.
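The effect of a horizontal whitespace threshold can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the connector's algorithm: `Word`, `split_into_segments`, and the `segment_gap` value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float   # left edge of the word
    right: float  # right edge of the word


def split_into_segments(line: list[Word], segment_gap: float = 15.0) -> list[str]:
    """Split one line into segments wherever the horizontal gap exceeds segment_gap."""
    segments: list[list[str]] = []
    previous_right = None
    for word in sorted(line, key=lambda w: w.left):
        if previous_right is None or word.left - previous_right > segment_gap:
            segments.append([])
        segments[-1].append(word.text)
        previous_right = word.right
    return [" ".join(parts) for parts in segments]


line = [
    Word("Bill", 40, 60), Word("To:", 64, 80),
    Word("Astend", 140, 180), Word("Technologies", 184, 250),
]
default_segments = split_into_segments(line)
# default_segments == ["Bill To:", "Astend Technologies"]

# A tighter threshold treats every ordinary word space as a segment break.
tight_segments = split_into_segments(line, segment_gap=3.0)
```

The second call shows why thresholds need tuning for each document type: the same line yields two segments with one threshold and four with another.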

You can also use segments to search for a word or character pattern. This lets you parse out that segment or collect data from an adjoining segment. For example, suppose you want to extract a phone number from the PDF file, and the number is stored like this:

Phone:          123-456-7890

In the PDF Connector component, this is two segments:

Segment 1       Segment 2
Phone:          123-456-7890

You could not search for a specific phone number because it would differ in each file. Instead, you could use the FindRelativeSegment method to search for ‘Phone:’ (segment 1), and get the segment after it (segment 2), whose text would be the actual phone number.
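The anchor-and-neighbor idea can be sketched in Python. The function below only mimics the concept; it is not the FindRelativeSegment method, and its name, parameters, and return value are assumptions made for this example.

```python
from typing import Optional


def find_segment_relative_to(
    segments: list[str], anchor: str, offset: int = 1
) -> Optional[str]:
    """Find the first segment whose text equals anchor and return the
    segment that is offset positions away, or None if there is no match."""
    for index, text in enumerate(segments):
        if text.strip() == anchor:
            target = index + offset
            if 0 <= target < len(segments):
                return segments[target]
            return None
    return None


# Segments of one document, in reading order.
segments = ["Name:", "Jane Doe", "Phone:", "123-456-7890"]

# Search for the stable label, then read the segment that follows it.
phone = find_segment_relative_to(segments, "Phone:")
# phone == "123-456-7890"
```

Because the search targets the label rather than the value, the same lookup works for every file in which the label text stays constant.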

Other element types

There are two other element types that you can use in automations with the PDF Connector.

  • Phrases
  • Annotations

Phrases

Phrases differ from the pre-defined line and segment elements in that they typically are not located in the same place in each occurrence of a document. As a result, phrase data exists only in memory while the automation executes a method that works with the phrase. When a phrase is found, the resulting object returns only the text of the phrase, which the automation can then use, for example, for annotation.

Annotations

You can read, add, and delete annotations in a file in Pega Robot Studio. Because annotations occupy specific positions on a page, you can access an annotation object through another found element (a line, segment, word, or phrase). Annotations can be saved within a document, but because they may occur anywhere within a document, they cannot be configured as part of the document type.

Through PDF Connector methods, you can create annotations to include text and highlight the PDF text to reference the annotations. Other PDF Connector methods allow you to delete annotations and return annotation text or a list of annotations. 
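One way to picture access to annotations through a found element is by position: an annotation belongs to an element when their areas on the page overlap. The Python sketch below models that idea only. The `Box` and `Annotation` classes and the `annotations_on` function are hypothetical and do not represent the PDF Connector's objects or methods.

```python
from dataclasses import dataclass


@dataclass
class Box:
    page: int
    left: float
    top: float
    right: float
    bottom: float

    def overlaps(self, other: "Box") -> bool:
        """True when both boxes are on the same page and their areas intersect."""
        return (
            self.page == other.page
            and self.left < other.right and other.left < self.right
            and self.top < other.bottom and other.top < self.bottom
        )


@dataclass
class Annotation:
    text: str
    box: Box


def annotations_on(element_box: Box, annotations: list[Annotation]) -> list[str]:
    """Return the text of every annotation that overlaps the element's box."""
    return [a.text for a in annotations if a.box.overlaps(element_box)]


# Box of a segment that was already found on page 1.
segment_box = Box(page=1, left=140, top=95, right=250, bottom=110)
notes = [
    Annotation("Verify vendor name", Box(1, 150, 90, 200, 105)),
    Annotation("Check total", Box(1, 140, 300, 250, 315)),
    Annotation("Other page", Box(2, 150, 90, 200, 105)),
]
found = annotations_on(segment_box, notes)
# found == ["Verify vendor name"]
```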
