Content ingestion and chunking for Pega GenAI Knowledge Buddy

Data ingestion is the process of importing and processing assorted data files from multiple sources into a storage or computing system, such as a data warehouse or database, where users or systems can access and analyze them. If you use Pega GenAI Knowledge Buddy™ with Pega Knowledge™, you can also include PDFs and other text-formatted attachments for content ingestion.

Data chunking, also known as content chunking, is the process of breaking down large pieces of data into smaller, more manageable segments called chunks. The core principle is to separate processed data into logical units while preserving the semantic meaning and relevance of the content. For a large document with a complex structure that is difficult to process as a whole, chunking produces segments that are optimized for processing by AI systems. This structured approach enables AI models to retrieve and process information more effectively, which leads to better results in applications such as natural language processing, machine learning, and information retrieval.

The following figure illustrates how a large document flows through the chunking process to create manageable segments for AI applications:

Figure: The content chunking process

How Pega GenAI™ Knowledge Buddy chunks content

Pega GenAI Knowledge Buddy uses an automated content chunking process that begins when text-based content is ingested. The system automatically applies chunking to all ingested content, which also helps maintain robust content security. Each text chunk is secured by one or more roles, and the system checks the roles assigned to a user's profile before providing access to specific chunks. This check ensures that only authorized users can access secured content, which makes Knowledge Buddy particularly valuable for organizations that handle sensitive or confidential information.
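The role check described above can be illustrated with a minimal sketch. This is not Knowledge Buddy's implementation; the Chunk class, the accessible_chunks function, and the role names are hypothetical and only show the idea that a chunk is returned when the user holds at least one of the roles that secure it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    roles: frozenset  # roles that are allowed to read this chunk (hypothetical model)


def accessible_chunks(chunks, user_roles):
    """Return only the chunks that share at least one role with the user."""
    user_roles = set(user_roles)
    return [chunk for chunk in chunks if chunk.roles & user_roles]


chunks = [
    Chunk("Public return policy.", frozenset({"Agent", "Customer"})),
    Chunk("Internal escalation steps.", frozenset({"Manager"})),
]

# A user with only the Agent role sees the first chunk but not the second.
visible = accessible_chunks(chunks, ["Agent"])
print([chunk.text for chunk in visible])  # ['Public return policy.']
```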

Knowledge Buddy offers three chunking methods to accommodate different content types and your organizational needs: Size, Title, and None.

Chunking methods

The following table describes the chunking methods that Knowledge Buddy supports as well as their default parameters:

| Chunking method | Description | Default parameters |
| --- | --- | --- |
| Size | This is the default approach that defines chunks by character count. By default, each chunk consists of 1000 characters with a 200-character overlap between consecutive chunks. This overlap is crucial because it maintains context and continuity, ensuring that important information isn't lost at chunk boundaries. | Chunk size: 1000; Chunk overlap: 200 |
| Title | This combines the SIZE method with an additional feature: it prefixes each chunk with the document's title. This approach enables Knowledge Buddy to easily associate text chunks with their source documents, providing better context when generating responses to user questions. | Chunk size: 1000 + title; Chunk overlap: 200 |
| None | This gives you complete control over the chunking process. When set to NONE, Knowledge Buddy doesn't perform automatic chunking, allowing you to either send entire documents or manually divide content into meaningful segments based on your specific requirements. | Chunk size: NA; Chunk overlap: NA |
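To make the Size and Title methods concrete, the following is a minimal sketch of character-based chunking with overlap. It is an illustration under stated assumptions, not Knowledge Buddy's actual algorithm: the function names are hypothetical, and the newline used to join the title to each chunk is an assumed separator.

```python
def chunk_by_size(text, chunk_size=1000, overlap=200):
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


def chunk_by_title(title, text, chunk_size=1000, overlap=200):
    """Size-based chunking with the document title prefixed to every chunk."""
    return [f"{title}\n{chunk}" for chunk in chunk_by_size(text, chunk_size, overlap)]


document = "".join(str(i % 10) for i in range(2500))  # 2500 characters of sample text
parts = chunk_by_size(document)
print([len(part) for part in parts])     # [1000, 1000, 900]
print(parts[0][-200:] == parts[1][:200])  # True: consecutive chunks overlap
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of a chunk are repeated at the start of the next.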

Note: You set the chunking method and parameters when you create a data collection. However, you can override the chunking method and parameters at the data source level, or at the content level.
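The override behavior in the note can be sketched as a simple merge of settings. This sketch assumes that the most specific level wins (content over data source over data collection); the function name and the chunkSize and chunkOverlap keys are hypothetical and are not taken from the Knowledge Buddy API.

```python
def effective_chunking(collection, data_source=None, content=None):
    """Merge chunking settings; more specific levels override more general ones (assumed order)."""
    settings = dict(collection)
    for override in (data_source, content):
        if override:
            settings.update(override)
    return settings


collection = {"chunkingMethod": "SIZE", "chunkSize": 1000, "chunkOverlap": 200}

# The data source switches the method; the size and overlap fall through from the collection.
print(effective_chunking(collection, data_source={"chunkingMethod": "TITLE"}))
# {'chunkingMethod': 'TITLE', 'chunkSize': 1000, 'chunkOverlap': 200}
```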

The following diagram shows how Pega GenAI Knowledge Buddy uses content chunks:

Figure: Content chunking in Pega GenAI Knowledge Buddy

Content ingestion using a KM article and REST API

Steps to ingest a KM article into Knowledge Buddy:

  1. When an article arrives, the Buddy Ingestion API (REST API) ingests new content or updates existing content.
  2. Knowledge Buddy breaks the article into smaller chunks.
  3. Knowledge Buddy generates embeddings for each chunk by using the Pega GenAI™ gateway and stores them in the database.

For example, you receive an article, "How to add, change, or update your address." You use the Buddy Ingestion API to ingest the article into Knowledge Buddy. Knowledge Buddy then breaks the article into smaller chunks, generates embeddings for each chunk, and stores them in the database, which makes the content available to users who search for customer service information in e-commerce.
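The ingest, chunk, embed, and store flow can be sketched end to end as follows. Everything here is a stand-in: fake_embedding replaces the call to the Pega GenAI gateway with a deterministic hash, the store is a plain dictionary instead of a database, and all function names are hypothetical. The sketch only shows the shape of the flow, including how ingesting the same object ID again updates existing content.

```python
import hashlib


def fake_embedding(text, dims=8):
    """Stand-in for a real embedding call: a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def split_text(text, size=1000, overlap=200):
    """Character-based chunking with overlap (assumed defaults: 1000 and 200)."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def ingest(store, object_id, title, text):
    """Insert or replace all chunk records stored under object_id."""
    records = [
        {"title": title, "text": chunk, "embedding": fake_embedding(chunk)}
        for chunk in split_text(text)
    ]
    is_update = object_id in store
    store[object_id] = records
    return "updated" if is_update else "created"


store = {}
title = "How to add, change, or update your address"
article = "To update your address, open your profile. " * 50

print(ingest(store, "KC-0039", title, article))  # created
print(ingest(store, "KC-0039", title, article))  # updated
```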

Note: While ingested content might contain videos and images, Knowledge Buddy cannot answer questions about images or videos. Additionally, Knowledge Buddy cannot provide image or video responses to questions or generate images or videos as output for a question.

Sample JSON structures for ingesting content with REST API

The following examples show JSON structures that you can use for content ingestion with the REST API:

Example 1:

Figure: API example 1

Example 2:

Figure: API example 2

The JSON structures include the following properties:

objectID: The most important property. The objectID is a single value that must be pushed into the database and serves as the key value for the content. The objectID is required whether the content is a webpage, URL, document type, or reference such as KC-0039.

dataSource: The name of the data source defined in Knowledge Buddy. If you attempt to push an objectID without access to the corresponding data source, the objectID is rejected.

title: The title of the content, which should accurately reflect what you ingest.

chunkingMethod: The chunking method to apply (Size, Title, or None), which determines the default chunk size and overlap.

roles: Roles are mandatory to secure content. Every piece of content pushed into the system must have an associated role.

text: The array of text values that the REST API pushes to Knowledge Buddy.

attributes: The attributes (global attributes) are name-value pairs that you can add as needed; you can use multiple attributes. The REST API automatically applies these attributes to all text chunks.

In addition, you can add your own attributes at the content level for each piece of content you ingest. For example, you can push the category or URL of an article as the content-level attribute.

When you tag an article or put an article in a category, that category becomes the value for the "category" attribute, as shown in the following sample code:

"attributes": [
  {
    "name": "category",
    "values": [
      {
        "value": "{{category}}"
      }
    ]
  }
]
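As a rough illustration of how the listed properties fit together, the following sketch assembles an ingestion payload in Python. The property names and the shape of attributes come from the descriptions and sample above; everything else is an assumption, including the build_payload helper, the representation of roles and text as lists of strings, the "SIZE" default, and the sample values. Check the actual API contract before you rely on this structure.

```python
import json


def build_payload(object_id, data_source, title, roles, text,
                  chunking_method="SIZE", attributes=None):
    """Assemble an ingestion payload (hypothetical helper; field shapes are assumed)."""
    if not object_id:
        raise ValueError("objectID is the key value for the content and cannot be empty")
    if not roles:
        raise ValueError("every piece of content must have at least one role")
    return {
        "objectID": object_id,
        "dataSource": data_source,
        "title": title,
        "chunkingMethod": chunking_method,
        "roles": list(roles),  # assumed shape: list of role names
        "text": list(text),    # array of text values
        "attributes": [
            {"name": name, "values": [{"value": value} for value in values]}
            for name, values in (attributes or {}).items()
        ],
    }


payload = build_payload(
    object_id="KC-0039",
    data_source="Knowledge_Smartphones",
    title="How to add, change, or update your address",
    roles=["Agent"],  # hypothetical role name
    text=["To update your address, open your profile."],
    attributes={"category": ["Billing"]},  # hypothetical category value
)
print(json.dumps(payload, indent=2))
```

The attributes argument maps each attribute name to a list of values, which the helper expands into the name and values structure shown in the sample.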

When you ingest content from Pega Knowledge, these steps are automated for you. The attributes that are pushed include the title, article ID, URL of the article, and so on. However, if you are ingesting your own content, you can select the values that you want to push as part of your ingestion process.

Methods to have Knowledge Buddy ingest Pega Knowledge articles

You can use the following methods to have Knowledge Buddy ingest content from Pega Knowledge articles:

1. Publish content 

By default, when a knowledge article is published in the Pega Knowledge portal, Knowledge Buddy automatically ingests the article based on the content type selected for the knowledge article.

When you create a new content type in the Pega Knowledge portal, that content type creates a corresponding data source for Pega GenAI Knowledge Buddy with the name Knowledge_{ContentType ID}. For example, if there is a content type with the name Smartphones, the corresponding data source is Knowledge_Smartphones.
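The naming convention can be expressed as a one-line helper. The function name is hypothetical; the pattern itself is the one described above.

```python
def data_source_name(content_type_id):
    """Return the data source name that corresponds to a Pega Knowledge content type."""
    return f"Knowledge_{content_type_id}"


print(data_source_name("Smartphones"))  # Knowledge_Smartphones
```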

2. Bulk publish content

On the Content landing page, you can filter content by category, and then select multiple articles. To publish all the selected articles, in the Actions list, select Change status, and then select Resolved-Published.

3. Sync all

If you update to a new build, or if you want Knowledge Buddy to ingest all published content at once, use the Sync all action, which ingests all the published content and creates the data source.

On the Taxonomy landing page, on the Article synchronization tab, click Sync all to synchronize all categories with articles.

Select the checkbox to re-index the article text in Knowledge Buddy. 

4. Article attachments

You can create an Attachments article by selecting the Article type is an attachment article only checkbox in the Display settings section when you create a knowledge article. You can select the checkbox for any content type or category.

Note: You can set the Article type is an attachment article only checkbox to be selected by default for particular content types, depending on your organizational needs. To do this, in the Pega Knowledge portal, navigate to Configurations > Content types, and then edit the chosen content type.

Figure: The Display settings section with the Article type is an attachment article only checkbox

Unlike a regular knowledge article, an Attachments article does not have an article body. Instead, you upload a single file, which is then attached to the article. When you publish the article, the system automatically ingests the attached content into Knowledge Buddy, depending on the content type that you selected for the knowledge article.

When you create an Attachments article, use one of the following recommended file types:

  1. Microsoft Word files
  2. PDF
  3. Text content, which can include HTML or Markdown format

Knowledge Buddy is currently unable to answer questions concerning images or video, and it cannot produce images or video as an output to a question.

Figure: An Attachments article type created in the Pega Knowledge portal

Note: Every single piece of content that you push to the system must have a role associated with it.

You have reached the end of this video. What have you learned?

  • How to ingest content using a KM article and REST API.
  • How Knowledge Buddy can ingest Pega Knowledge articles.
