
Ingesting customer data into Pega Customer Decision Hub

To make the best, most relevant next-best-action decisions, Pega Customer Decision Hub™ requires up-to-date data about customers: their demographics, product holdings, activity data, and other information that third-party systems store.

Because this information updates frequently, ensure that the updates propagate to Pega Customer Decision Hub. Not all data updates occur with the same regularity. Based on the frequency of updates, customer data can fall into two categories: non-volatile or volatile.

Non-volatile data consists of attributes that update at regular intervals, such as daily, weekly, or monthly.

Volatile data updates typically occur in real time, for example transactional updates and mobile and web click-throughs.

A key design principle to follow is that customer data required by Pega Customer Decision Hub should be available at rest. Avoid making real-time calls to access the data when making a next-best-action decision because the end-to-end time to make a decision depends on the performance of these real-time calls.

For processing non-volatile data, establish a file transfer mechanism that uses Secure File Transfer Protocol (SFTP) to support any required batch transfers. The system processes and ingests files from a secure repository through a guided, prescriptive method by using the Data Jobs that you define in Customer Decision Hub. For volatile data, establish services that process the updates as they occur, to ensure optimum results. For details and best practices for managing data, see the Pega Customer Decision Hub Implementation Guide.

After the data mapping workshop, and after the data is transformed into a structure that matches the customer data model defined in the context dictionary, a key set of activities must occur to ingest the data.

As a best practice, use the Data Jobs that you define in Customer Profile Designer so that the system ingests data into Pega Customer Decision Hub. Three types of files are involved when processing data: data files, manifest files, and token files.

The client's extraction-transformation-load (ETL) team generates the data files that contain the actual customer data. The files may contain new and updated customer data and data for purging or deleting existing records. A data file's column structure should match the data model in Customer Decision Hub to avoid complex mapping and data manipulation.

A best practice is to divide large files into multiple smaller files to allow parallel transfer and faster processing. Files can be in CSV or JSON format and can be compressed. File encryption is also supported.
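As an illustration of this practice, the following Python sketch splits a large CSV into smaller, independently ingestible chunks. The file names and chunk size are arbitrary; the only requirement carried over from the text is that each chunk keeps the same column structure (header) as the original file.

```python
import csv
import os

def split_csv(path, out_dir, rows_per_file=100_000):
    """Split a large CSV into smaller files, repeating the header
    in each chunk so every file is independently ingestible."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)          # keep the column structure intact
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                outputs.append(_write(out_dir, part, header, chunk))
                chunk, part = [], part + 1
        if chunk:                      # flush the final partial chunk
            outputs.append(_write(out_dir, part, header, chunk))
    return outputs

def _write(out_dir, part, header, rows):
    name = os.path.join(out_dir, f"customers_{part:03d}.csv")
    with open(name, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return name
```

The resulting files can then be transferred in parallel and listed individually in the manifest.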


The manifest files are XML files that contain metadata about the transferred data files. The primary purpose of the manifest file is record and file validation. If file and record validation is required, there must be one manifest file for each ingestion or purge file to be transferred.

The processType element in the manifest XML file can take two values: DataIngest and DataDelete. Use the DataIngest process type to ingest data into the target data source, and use DataDelete to delete records from the target data source. The system uses other elements, such as totalRecordCount, size, and customAttributes, for additional validations. Data jobs that do not use a manifest file can only be used to ingest data.


The following table details the elements in a manifest file:

| Name | Parent | Count | Description |
| --- | --- | --- | --- |
| manifest | - | 1 | Root element. |
| processType | manifest | 1 | Whether the current import adds or removes records. Possible values are DataIngest and DataDelete. |
| totalRecordCount | manifest | 0..1 | The number of records in all data files; used in post-validation if specified. |
| files | manifest | 1 | List of files. |
| file | files | 1..n | Contains the metadata of one import file. |
| name | file | 1 | Path to the file relative to the repository root. |
| size | file | 0..1 | Size of the file in bytes; used in pre-validation if specified. |
| customAttributes | file | 0..1 | List of the custom attributes of a file. |
| customAttribute | customAttributes | 1..n | Custom attribute of a file; used for custom pre-validation. |
| name | customAttribute | 1 | Name of the custom attribute. |
| value | customAttribute | 1 | Value of the custom attribute. |
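Assembled from the elements in the table above, a minimal manifest might look like the following. The file names, counts, sizes, and the checksum custom attribute are illustrative only; verify the exact schema against the documentation for your Customer Decision Hub version.

```xml
<manifest>
  <processType>DataIngest</processType>
  <totalRecordCount>200000</totalRecordCount>
  <files>
    <file>
      <name>customers_000.csv</name>
      <size>10485760</size>
      <customAttributes>
        <customAttribute>
          <name>checksum</name>
          <value>1a2b3c4d</value>
        </customAttribute>
      </customAttributes>
    </file>
    <file>
      <name>customers_001.csv</name>
      <size>10485760</size>
    </file>
  </files>
</manifest>
```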

The system uses token files to signal that all file transfers to the repository are complete and that the data ingestion process can begin. As a best practice, keep all files for a data job in a dedicated folder in the file repository so that the detection of a token file does not trigger another data job. The system only needs a token file if the data job uses file detection; scheduled data jobs do not require one. Generate the token (.TOK) file last, after all file transfers are complete.
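The ordering rule can be sketched as follows. This is a local-filesystem stand-in for the actual SFTP transfer (folder and file names are hypothetical); the point is the sequence: data files first, then the manifest, and the token only when everything else is in place.

```python
import shutil
from pathlib import Path

def transfer_job_files(data_files, manifest, repo_job_dir):
    """Copy data files and the manifest into the job's dedicated
    repository folder, then write the token file LAST so that file
    detection never fires before the transfer is complete."""
    job_dir = Path(repo_job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    for f in data_files:
        shutil.copy(f, job_dir)      # 1. data files first
    shutil.copy(manifest, job_dir)   # 2. then the manifest
    token = job_dir / "job.tok"      # 3. token signals completion
    token.touch()
    return token
```

If the token were written first, a file listener could start the run against an incomplete set of files.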


After the ETL team prepares the files in an external environment, the team uploads the files to a repository that Customer Decision Hub can access. As a best practice, use a dedicated folder in the repository for every job. The folder contains all the necessary files, and the manifest specifies the relative path to each data file.


After the files are in the repository, the system can initiate a data import job in one of the following ways:

  • File detection: This trigger initiates the run when the token file is uploaded to the file repository.
  • Scheduler: A scheduled time and frequency triggers the data import job.

Every time a data import job begins, the system creates a work object (PegaMKT-Work-JobImport). This work object is the run of the Data Job. The run has four main stages:

Data Validation: This stage is the initialization of the run, and you can extend the stage to complete additional custom validations. To perform custom validation, complete the following two actions:

  • Provide custom attributes for each file in the manifest file.
  • Define an activity that uses these attributes to verify whether the file content is correct (pyDataImportCustomPreValidation).
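Outside Pega, the idea behind such a custom pre-validation can be sketched in Python. The `md5` custom attribute is a hypothetical example of what a client might declare in the manifest; a real implementation runs as the pyDataImportCustomPreValidation activity inside Customer Decision Hub, not as standalone Python.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

def pre_validate(manifest_path, repo_root):
    """Check each file against its manifest metadata: the declared
    size, plus a hypothetical 'md5' custom attribute when present."""
    root = ET.parse(manifest_path).getroot()
    errors = []
    for f in root.find("files").findall("file"):
        path = Path(repo_root) / f.findtext("name")
        size = f.findtext("size")
        # size is optional (0..1): only check it when declared
        if size is not None and path.stat().st_size != int(size):
            errors.append(f"{path.name}: size mismatch")
        for attr in f.iter("customAttribute"):
            if attr.findtext("name") == "md5":
                digest = hashlib.md5(path.read_bytes()).hexdigest()
                if digest != attr.findtext("value"):
                    errors.append(f"{path.name}: checksum mismatch")
    return errors
```

An empty error list means the run can proceed to staging; any entry fails the Data Validation stage.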

Data Staging: In this stage, the available records can pass through intermediate storage. By default, the staging data flow has an abstract destination that helps with post-validation. You can customize the corresponding data flow to use an intermediate data store. This stage primarily helps with the following scenarios:

  • Performing post-validation of records to ensure that the record counts specified in the manifest file match the actual counts.
  • Record de-duplication.
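Both staging scenarios can be sketched together in Python. The CSV layout and the `CustomerID` key column are assumptions for illustration; in Customer Decision Hub this logic lives in the staging data flow, not in standalone code.

```python
import csv

def post_validate(staged_files, expected_total, key="CustomerID"):
    """Count staged records across all data files, compare the count
    with the manifest's totalRecordCount, and de-duplicate on a key."""
    seen, unique, total = set(), [], 0
    for path in staged_files:
        with open(path, newline="") as f:
            for record in csv.DictReader(f):
                total += 1
                if record[key] not in seen:   # keep first occurrence
                    seen.add(record[key])
                    unique.append(record)
    if total != expected_total:
        raise ValueError(
            f"expected {expected_total} records, staged {total}")
    return unique
```

A count mismatch fails the run before any data reaches the customer data source.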

Data Processing: The actual data ingestion into the customer data source occurs in this stage. You can customize the ingestion data flow if you require different ingestion logic.

Data Post-Processing: This stage occurs after data is ingested into or deleted from the destination data source. The import run performs the following activities:

  • Data archival: The system archives the source files in the ArchivedDataFiles folder, relative to the root folder. Each run is archived in a subfolder that corresponds to its completion date.
  • Data cleanup: Cleanup of the current source folder occurs after the files are archived successfully. You can manage the retention policy for the archived files in the general Data Import settings in App Studio. The default values are seven days for the files and 30 days for the runs.
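Customer Decision Hub applies the retention policy itself; the following Python sketch only illustrates what the default seven-day file retention means in practice (the archive layout and function name are hypothetical).

```python
import time
from pathlib import Path

def apply_retention(archive_root, file_retention_days=7):
    """Delete archived data files older than the retention window.
    Mirrors the default policy described above: 7 days for files."""
    cutoff = time.time() - file_retention_days * 86400
    removed = []
    for path in Path(archive_root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()                 # file is past retention
            removed.append(path.name)
    return removed
```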

You manage the flow of data from one source to another through data sets and data flows. Data sets access different types of stored data and insert or update the data in various destinations. Data flows are scalable and resilient data pipelines that ingest, process, and move data from one or more sources to one or more destinations. In a data ingestion case, there are two types of data sets: a file data set accesses the data in the repository, and a database data set stores the data in the corresponding database tables. Customer Profile Designer automatically configures all the data sets and data flows to process the data for you.


In summary, Customer Decision Hub uses data jobs to process data.

First, the data is structured externally to match the data model in Customer Decision Hub and then compressed. For file- and record-level validations, a manifest file is necessary; it holds additional information about the data file that it represents.

The ETL team loads the files from an external environment to a file repository to which Customer Decision Hub has access.

File detection or a given schedule can trigger a data job. When the file detection trigger is selected, the system requires a token file to initiate the data job. File listeners continuously monitor the file repository for the token file; when they detect it, the ingestion process begins. For a scheduled data job, the process begins at the defined time.

Case types streamline and automate the flow of files from the repository into their destinations, because the process is visible and provides various error-handling options. Customer Decision Hub includes a preconfigured case type for you.

After the process begins, the system creates a data job work object. During this process, the files are validated, staged, and moved to their final destination. Finally, the files are archived.

