
Data extraction using whitespace thresholds

Understanding how the PdfConnector evaluates the whitespace thresholds of a document is crucial when you use PDF documents in an automation. A whitespace threshold is the amount of space allowed between two similar elements before the connector treats them as separate elements. The PdfConnector measures that space in points. The following image illustrates the size of a point compared to an inch, though it is not to scale.

inch to point illustration
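
As a rough aid for reasoning about threshold values, the following sketch converts between points and inches. It assumes the standard PDF point of 1/72 inch; the function names are illustrative and are not part of the PdfConnector.

```python
# Assumption: one point is 1/72 of an inch, the standard PDF unit.
POINTS_PER_INCH = 72


def points_to_inches(points: float) -> float:
    """Convert a distance in points to inches."""
    return points / POINTS_PER_INCH


def inches_to_points(inches: float) -> float:
    """Convert a distance in inches to points."""
    return inches * POINTS_PER_INCH


# A 10-point threshold is a little under one seventh of an inch.
print(round(points_to_inches(10), 3))
print(inches_to_points(0.5))
```

Thinking in inches can help when you estimate a threshold from a printed or on-screen page before converting it to points.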

The PdfConnector identifies sets of characters and groups them in one of the following ways:

  • Lines
  • Segments
  • Words

Lines are sets of characters whose top and bottom locations both fall within the line threshold of one another. Because lines are the highest-level element, you adjust the whitespace threshold for lines first.

After the connector recognizes lines, you adjust the whitespace threshold for segments, which are groups of characters that do not span lines, such as a group of words separated by a certain amount of whitespace. Segment whitespace thresholds are essential because, in columnar documents, you can use the segment threshold to extract a column.

Words are characters grouped by the space between each character. They are the most granular element. The dividing line between two segments or two words occurs where the space between two characters exceeds the corresponding threshold.

Lines contain segments, but the segments in a line cannot cross line boundaries. Segments, in turn, contain words, but the words in a segment cannot cross segment boundaries. Unless the segment threshold is larger than the word threshold, each segment has only one word, because the component evaluates segments first and then evaluates words within each segment.
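
The nesting described above can be illustrated with a minimal sketch. This is not the PdfConnector implementation; the `Char` class, the helper names, and the coordinates are all hypothetical. The sketch splits one line into segments by horizontal gap and then splits each segment into words.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    left: float   # x position of the left edge, in points
    right: float  # x position of the right edge, in points


def split_by_gap(chars, threshold):
    """Start a new group when the gap to the previous character exceeds threshold."""
    groups = []
    for ch in chars:
        if groups and ch.left - groups[-1][-1].right <= threshold:
            groups[-1].append(ch)
        else:
            groups.append([ch])
    return groups


def segments_and_words(line_chars, segment_threshold, word_threshold):
    """Split a line into segments first, then split each segment into words."""
    result = []
    for segment in split_by_gap(line_chars, segment_threshold):
        words = split_by_gap(segment, word_threshold)
        result.append(["".join(c.text for c in word) for word in words])
    return result


def make_line(items, char_width=5.0):
    """Hypothetical helper: build characters from (text, gap_before) pairs."""
    chars, x = [], 0.0
    for text, gap in items:
        x += gap
        chars.append(Char(text, x, x + char_width))
        x += char_width
    return chars


# Gaps between characters: 0.5, 2, 0.5, 12, 0.5 points.
line = make_line([("A", 0), ("B", 0.5), ("C", 2), ("D", 0.5), ("E", 12), ("F", 0.5)])

print(segments_and_words(line, segment_threshold=6, word_threshold=1))
# [['AB', 'CD'], ['EF']]
print(segments_and_words(line, segment_threshold=1, word_threshold=1))
# [['AB'], ['CD'], ['EF']]
```

In the second call the segment threshold is not larger than the word threshold, so every gap that would divide words has already divided segments, and each segment ends up with exactly one word.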

You use thresholds to extract data the way you need it. If your business requires you to read a line that contains "500 7th AVENUE" as an address, you adjust the segment whitespace threshold to extract the first part as a segment, the green pipe character as the next segment, "8th FLOOR" as the next, and so forth.

 

example pdf segments
Note: The PdfConnector does not require adjusting thresholds to read form fields. Documents that contain text-based identifiers do require adjustments, because changes in whitespace thresholds can cause issues in your automation. For more information about identifiers, see PDF Usage and Configuration.
 
 

Analyzing whitespace thresholds

The PdfConnector measures the amount of whitespace between two similar elements. It then compares that amount to the whitespace thresholds set for the document to determine whether the elements form a single element or separate elements.

For example, when comparing lines, the component first compares the top positions of two text strings to determine if they are on the same line and then compares the bottom positions of the strings.

pdf lines

If the difference between the tops of the two strings and the difference between the bottoms of the two strings are both equal to or less than the threshold setting, the system considers the text strings to be on the same line.
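
The line comparison above can be written as a small predicate. The following is a sketch under stated assumptions, not the connector's actual code: `TextString`, `on_same_line`, and the coordinates are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextString:
    text: str
    top: float     # vertical position of the top edge, in points
    bottom: float  # vertical position of the bottom edge, in points


def on_same_line(a: TextString, b: TextString, line_threshold: float) -> bool:
    """Both the top difference and the bottom difference must be within the threshold."""
    return (abs(a.top - b.top) <= line_threshold
            and abs(a.bottom - b.bottom) <= line_threshold)


label = TextString("Invoice", top=100.0, bottom=112.0)
value = TextString("No. 42", top=101.5, bottom=113.0)

print(on_same_line(label, value, line_threshold=2))  # True: differences are 1.5 and 1.0
print(on_same_line(label, value, line_threshold=1))  # False: the top difference is 1.5
```

Raising the line threshold tolerates slightly misaligned text, such as superscripts or mixed font sizes, at the cost of possibly merging rows that sit close together.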

line threshold

The component compares the space between two characters by using vertical spacing for lines and horizontal spacing for segments and words. For instance, when the connector analyzes an invoice for segments, a threshold of 10 points consolidates the dollar signs and values in the Quantity and Unit Price columns of the table into segments that stretch across columns. Reducing the segment threshold to 3 points lets the connector separate these elements, which frees the developer from creating automations that adjust for that difference.
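
To make the effect of the two threshold values concrete, the sketch below merges the text pieces of one hypothetical invoice row by horizontal gap. The coordinates and the `merge_segments` function are invented for illustration and do not come from the PdfConnector.

```python
def merge_segments(pieces, segment_threshold):
    """Merge left-to-right (text, left, right) pieces whose gap is within the threshold."""
    segments = []
    for text, left, right in sorted(pieces, key=lambda p: p[1]):
        if segments and left - segments[-1][2] <= segment_threshold:
            prev_text, prev_left, _ = segments[-1]
            segments[-1] = (prev_text + " " + text, prev_left, right)
        else:
            segments.append((text, left, right))
    return [text for text, _, _ in segments]


# Hypothetical row: a quantity, a dollar sign, and a unit price.
# The gaps between the pieces are 8 points and 4 points.
row = [("2", 100.0, 106.0), ("$", 114.0, 119.0), ("15.00", 123.0, 150.0)]

print(merge_segments(row, segment_threshold=10))  # ['2 $ 15.00']
print(merge_segments(row, segment_threshold=3))   # ['2', '$', '15.00']
```

With the larger threshold, the quantity and the price collapse into one segment that spans both columns; with the smaller one, each piece can be read on its own.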


