Working concepts of the PDF Connector

The PDF Connector allows developers to search, extract, and fill in text and annotations in PDF documents. A developer can create an automation that audits received documents, finds specific information quickly, and displays that information to an end user for review or a further decision. For example, an insurance company receives medical insurance claim documents from medical practitioners. An automation can quickly scan those documents, extract specific data values, and present the patient's health summary to an agent, who determines how the claim should proceed.

For the developer, it is important to understand how the PDF Connector analyzes the document so that they know how to develop an automation to extract data or fill in form data correctly.

Using the PDF Connector, a developer ‘interrogates’ the document structure to locate the specific items needed for the automation. For example, the following image shows an invoice for Astend Technologies. This PDF document type is always created with the same structure, in the same way that an application has the same fields, buttons, and tabs in the same place every time that it runs.

Pre-defined document elements

Pre-defined elements are objects that make it easier to work with PDF files in automations. The elements are based on how a human visually perceives a file. The PDF Connector uses whitespace thresholds to define the elements. Whitespace thresholds vary by document type because different fonts, font sizes, and table formats call for different threshold configurations.

There are four pre-defined elements.

Pages
Lines
Segments
Words


Pages

The page is the largest pre-defined element of a PDF document. Understanding how the PDF Connector numbers pages helps when you use the page element in automations. The connector uses absolute numbering, which means that the first page of the document is page 1, and the remaining pages are numbered sequentially.

For example, the United States passport application document contains six pages. The first four pages are instructions and information; the remaining two pages are the application form. To complete the form itself, the automation needs to reference page 5 of the document to locate any fields for input.
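The effect of absolute numbering can be sketched in a few lines of Python. This is an illustration only and not the PDF Connector API: the page_index function and the pages list are hypothetical stand-ins for the six-page passport application described above.

```python
def page_index(page_number: int, page_count: int) -> int:
    """Convert an absolute, 1-based page number to a 0-based list index."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


# Hypothetical stand-in for the six-page document: four pages of
# instructions followed by two pages of application form.
pages = [
    "instructions 1",
    "instructions 2",
    "instructions 3",
    "instructions 4",
    "form 1",
    "form 2",
]

# The first page of the form is absolute page 5.
first_form_page = pages[page_index(5, len(pages))]
```

The range check is a design choice in this sketch: it makes an out-of-range page number fail loudly instead of silently selecting the wrong page.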

You can use the PDF Connector methods to return a Page object, scan the elements defined on that page, and perform operations on that page.

Lines

Lines divide up the page element. By analyzing the whitespace above and below the characters, the PDF Connector determines where each line sits relative to the other lines on the same page.

Use PDF Connector methods to return a Line object, from which you can find the page it is on, its line number, and its text.
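To picture how vertical whitespace can separate lines, consider the following Python sketch. It is not the PDF Connector implementation or API. The Word class, the group_into_lines function, the coordinates, and the min_gap threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the word
    top: float     # distance from the top of the page to the top of the word
    bottom: float  # distance from the top of the page to the bottom of the word


def group_into_lines(words, min_gap=2.0):
    """Group words into lines using the vertical whitespace between them.

    A word starts a new line when its top edge lies at least `min_gap`
    below the bottom of the line that is currently being built.
    """
    lines = []          # each entry is a list of Word objects
    line_bottom = None  # lowest bottom edge seen in the current line
    for word in sorted(words, key=lambda w: w.top):
        if line_bottom is None or word.top >= line_bottom + min_gap:
            lines.append([word])
            line_bottom = word.bottom
        else:
            lines[-1].append(word)
            line_bottom = max(line_bottom, word.bottom)
    # Order the words in each line from left to right and join their text.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


# Hypothetical sample data: four words that form two lines of an invoice.
words = [
    Word("Total:", 50, 120, 132),
    Word("Invoice", 50, 100, 112),
    Word("#1042", 120, 100, 112),
    Word("$99.00", 120, 120, 132),
]
line_texts = group_into_lines(words)
```

A larger font or looser line spacing would need a different min_gap value, which is one way to see why thresholds vary by document type.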

Segments

Like lines, segments also divide up a page. Whereas lines are determined by the whitespace above and below characters, segments are determined by the whitespace before and after characters, which the PDF Connector uses to break a line up into segments. Segments do not span more than one line.

Use PDF Connector methods to return a Segment object to retrieve the page number, the segment text, other nearby segments, and other objects.
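The idea of breaking a line into segments at wide horizontal gaps can be sketched as follows. This Python is illustrative only and is not how the PDF Connector is implemented. The Word class, the split_into_segments function, the coordinates, and the min_gap threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float   # left edge of the word
    right: float  # right edge of the word


def split_into_segments(line_words, min_gap=15.0):
    """Split the words of one line into segments.

    A new segment starts when the horizontal whitespace between a word
    and the previous word is at least `min_gap`.
    """
    segments = []      # each entry is a list of word texts
    prev_right = None  # right edge of the previous word
    for word in sorted(line_words, key=lambda w: w.left):
        if prev_right is None or word.left - prev_right >= min_gap:
            segments.append([word.text])
        else:
            segments[-1].append(word.text)
        prev_right = word.right
    return [" ".join(parts) for parts in segments]


# Hypothetical sample line: a label, a wide gap, and then a value.
line = [
    Word("Bill", 50, 70),
    Word("To:", 74, 90),
    Word("Astend", 130, 170),
    Word("Technologies", 174, 250),
]
segment_texts = split_into_segments(line)
```

In this sketch the narrow gaps between words keep "Bill To:" together, while the wide gap before "Astend" starts a second segment.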

You can also use segments to search for a word or character pattern. You can then parse that segment or collect data from an adjoining segment. For example, suppose you want to extract a phone number from the PDF file and the number is stored like this:

Phone: 123-456-7890

To the PDF Connector component, this is two segments:

Segment 1: Phone:
Segment 2: 123-456-7890

You could not search for a specific phone number because it would differ in each file. Instead, you could use the FindRelativeSegment method to search for ‘Phone:’ (segment 1), and get the segment after it (segment 2), whose text would be the actual phone number.
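The label-then-neighbor lookup described above can be sketched in Python. This is not the FindRelativeSegment method or its signature, which are not reproduced here. The find_segment_after function, the offset parameter, and the sample segments are hypothetical and only illustrate the idea.

```python
import re


def find_segment_after(segments, label, offset=1):
    """Return the segment `offset` positions after the first segment whose
    text equals `label`, or None if there is no such segment."""
    for i, text in enumerate(segments):
        if text == label:
            j = i + offset
            return segments[j] if 0 <= j < len(segments) else None
    return None


# Hypothetical segments as they might be read from one page, in order.
segments = ["Name:", "Pat Example", "Phone:", "123-456-7890"]

# Search for the stable label, then take the segment that follows it.
phone = find_segment_after(segments, "Phone:")

# Optionally confirm that the value has the expected character pattern.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")
is_valid = phone is not None and PHONE_PATTERN.fullmatch(phone) is not None
```

Searching for the label rather than the value works because the label is the same in every file, while the value changes.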


Other element types

There are two other element types that you can use in automations with the PDF Connector.

Phrases
Annotations

Phrases

Phrases differ from the pre-defined line and segment elements in that they typically are not located in the same place in each occurrence of a document. As a result, phrase data exists only in memory while the automation executes a method relevant to that phrase. When a phrase is found, the resulting object returns only its text, which the automation can use to locate the data to annotate.

Annotations

You can read, add, and delete annotations in a file in Pega Robot Studio. Because annotations occupy certain positions on a page, you can access the annotation object through another found element (a line, segment, word, or phrase). Annotations can be saved within a document, but because they may occur anywhere within a document, they cannot be configured as part of the document type.

Through PDF Connector methods, you can create annotations that include text, and you can highlight the PDF text that the annotations refer to. Other methods allow you to delete annotations, return annotation text, or return a list of annotations.
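One way to picture annotations that are reached through a found element is the small Python model below. It is a sketch under stated assumptions and not the PDF Connector object model: the Element, Annotation, and AnnotationStore classes, their fields, and their methods are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str  # "line", "segment", "word", or "phrase"
    page: int  # absolute, 1-based page number
    text: str


@dataclass
class Annotation:
    anchor: Element          # the found element that the annotation is attached to
    note: str                # the annotation text
    highlight: bool = False  # whether the anchored text is highlighted


@dataclass
class AnnotationStore:
    items: list = field(default_factory=list)

    def add(self, anchor, note, highlight=False):
        """Create an annotation on a found element and return it."""
        annotation = Annotation(anchor, note, highlight)
        self.items.append(annotation)
        return annotation

    def on_page(self, page):
        """Return the annotations whose anchor element is on the given page."""
        return [a for a in self.items if a.anchor.page == page]

    def delete(self, annotation):
        """Remove an annotation from the store."""
        self.items.remove(annotation)


# Hypothetical usage: annotate a segment and a line, list, then delete.
store = AnnotationStore()
total = Element("segment", 1, "$99.00")
flag = store.add(total, "Verify total against purchase order", highlight=True)
store.add(Element("line", 2, "Terms: Net 30"), "Confirm payment terms")
notes_page_1 = [a.note for a in store.on_page(1)]
store.delete(flag)
```

Anchoring each annotation to an element, rather than to the document type, reflects the point above that annotations can appear anywhere in a document.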
