Ingesting customer data into Pega Customer Decision Hub
To make the best, most relevant next-best-action decisions, Pega Customer Decision Hub™ requires up-to-date data about customers: their demographics, product holdings, activity data, and other information that third-party systems store.
Because this information updates frequently, ensure that the updates propagate to the Customer Insights Cache. Not all data updates occur with the same regularity. Based on the frequency of updates, customer data can fall into two categories: non-volatile or volatile.
Non-volatile data is the type of data that changes at regular intervals, for example, on a daily, weekly, or monthly basis. With batch import data jobs, you can import non-volatile customer data by using a file repository and files as the source data.
Volatile data updates typically occur in real-time, in scenarios such as transactional updates and mobile and web click-throughs. Real-time data jobs use stream data sources, such as Kafka, Kinesis, or Pega Stream.
A key design principle to follow is that customer data required by Pega Customer Decision Hub should be available at rest. Avoid making real-time calls to access the data when making a next-best-action decision because the end-to-end time to make a decision depends on the performance of these real-time calls.
Processing non-volatile data
To process non-volatile data, establish a file transfer mechanism by using Secure File Transfer Protocol (SFTP) to support any required batch transfers. As a best practice, use the batch import data jobs that you define in Customer Profile Designer so that the system ingests data into the Customer Insights Cache.
There are three types of files that you use when processing data: data files, manifest files, and token files.
The extraction-transformation-load (ETL) team of an organization generates the data files that contain the actual customer data. The files might contain new and updated customer data and data for purging or deleting existing records. The column structure of a data file should match the data model in Customer Insights Cache to avoid complex mapping and data manipulation.
As a best practice, divide large files into multiple smaller files to allow parallel transfer and faster processing. Files can be in a CSV or JSON format, and you can compress these files. File encryption is also supported.
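For example, a data file for a hypothetical customer profile data source might look like the following CSV snippet. The column names here are illustrative only; in practice, they must match the properties of your Customer Insights Cache data model:

```csv
CustomerID,FirstName,LastName,City,LifetimeValue
C-100001,Ana,Silva,Lisbon,1250.50
C-100002,John,Doe,Boston,310.00
C-100003,Mei,Chen,Singapore,987.25
```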
The manifest files are XML files that contain metadata about the transferred data files. The primary purpose of the manifest file is for record and file validations. If your system requires file and record validation, there must be one manifest file for each ingestion or purge file.
The processType attribute in the manifest XML file can take two values: DataIngest and DataDelete. Use the DataIngest process type to ingest data into the target data source, and use DataDelete to delete records from the target data source. The system uses other attributes, such as totalRecordCount, size, and customAttributes, for additional validations. Data jobs that do not use a manifest file can only ingest data.
The following table details the elements in a manifest file:
| Name | Parent | Count | Description |
| --- | --- | --- | --- |
| manifest | - | 1 | Root element. |
| processType | manifest | 1 | Indicates whether the current import adds or removes records. Possible values are DataIngest and DataDelete. If the property does not exist in the manifest, DataIngest is the default value. |
| totalRecordCount | manifest | 0..1 | The number of records in all data files; used in post-validation if specified. |
| files | manifest | 1 | List of files. |
| file | files | 1..n | Contains the metadata of one import file. |
| name | file | 1 | Path to the file relative to the repository root. |
| size | file | 0..1 | Size of the file in bytes; used in pre-validation if specified. |
| customAttributes | file | 0..1 | List of the custom attributes of a file. |
| customAttribute | customAttributes | 1..n | Custom attribute of a file; used for custom pre-validation. |
| name | customAttribute | 1 | Name of the custom attribute. |
| value | customAttribute | 1 | Value of the custom attribute. |
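The following XML is a minimal sketch of a manifest file that follows the element structure in the preceding table. The file name, record count, and custom attribute are hypothetical, and the exact serialization can vary by product version, so confirm the expected format for your release:

```xml
<manifest>
  <!-- Ingest the listed data files into the target data source -->
  <processType>DataIngest</processType>
  <!-- Total records across all data files; used for post-validation -->
  <totalRecordCount>3</totalRecordCount>
  <files>
    <file>
      <!-- Path relative to the repository root -->
      <name>CustomerProfile_20240101_part1.csv</name>
      <!-- Size in bytes; used for pre-validation -->
      <size>2048</size>
      <customAttributes>
        <customAttribute>
          <name>region</name>
          <value>EMEA</value>
        </customAttribute>
      </customAttributes>
    </file>
  </files>
</manifest>
```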
The system uses token files to signal that all file transfers to the repository are complete and that the data ingestion process can begin. As a best practice, keep all files for a batch import data job in a dedicated folder in the file repository to ensure that the detection of a token file does not trigger another data job. The system needs a token file only if the data job uses file detection; scheduled data jobs do not require a token file. Generate the TOK files last, after all file transfers are complete.
After the ETL team prepares the files in an external environment, they upload the files to a repository that Customer Decision Hub can access. As a best practice, use a dedicated folder in the repository for every data job. The folder contains all the necessary files, and the manifest dictates the relative path to the data files.
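For example, the dedicated folder for one batch import data job that uses file detection might be organized as follows. The folder and file names are hypothetical:

```text
customer_profile_ingest/
├── CustomerProfile_20240101.csv   # data file
├── CustomerProfile_20240101.xml   # manifest file (required only for file and record validation)
└── CustomerProfile_20240101.tok   # token file, transferred last
```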
Once the files are in the repository, the system can initiate a batch import data job with one of the following methods:
- File detection: This trigger initiates the run when the token file is uploaded to the file repository.
- Scheduler: A scheduled time and frequency trigger the data import job.
When a batch import data job begins, the system creates a work object (PegaMKT-Work-JobImport). This work object represents the run of the data job. The run has four main stages:
Data Validation: This stage is the initialization of the run, and you can extend the stage to complete additional custom validations. To perform custom validation, complete the following two actions:
- Provide custom attributes for each file in the manifest file.
- Define an activity that uses these attributes to verify whether the file content is correct (pyDataImportCustomPreValidation).
Data Staging: In this stage, the available records can pass through intermediate storage. By default, the staging data flow has an abstract destination that helps with post-validation. You can customize the corresponding data flow to use intermediate data storage. This stage primarily helps with the following scenarios:
- Post-validation of records to ensure that the record counts specified in the manifest file match the actual counts.
- Record de-duplication.
Data Processing: The actual data ingestion to the customer data source occurs in this stage. You can customize the ingestion data flow if you have a different ingestion logic.
Data Post-Processing: This stage occurs after data ingestion into, or deletion from, the destination data source. The import run performs the following activities:
- Data archival: The system archives the source files in the ArchivedDataFiles folder, which is relative to the root folder. The system places each run in the folder that corresponds to its completion date.
- Data cleanup: Cleanup occurs in the current source folder after the successful archiving of files. You can manage the retention policy for the archived files in the Data Import general settings in App Studio. The default values are seven days for the files and 30 days for the runs.
You manage the flow of data from one source to another source through Data Sets and Data Flows. Data Sets access different types of stored data and insert or update the data into various destinations. Data Flows are scalable and resilient data pipelines that ingest, process, and move data from one or more sources to one or more destinations. In a data ingestion case, there are two types of Data Sets. A File Data Set accesses the data in the repository, and a Database Data Set stores the data in corresponding tables in the database.
Processing volatile data
For volatile data, establish services that process data updates in the corresponding systems and send them to Customer Decision Hub as they occur. Customer Decision Hub uses real-time import data jobs to process such volatile data and provide Next Best Actions on the most up-to-date snapshot of the customer data in its Customer Insights Cache.
While active, real-time import data jobs continuously process new data as it becomes available in the source Data Set and update the relevant profile data sources.
Real-time import data jobs also support data reconciliation: the system uses a date timestamp property to determine whether the incoming data or the data in the destination profile data source is the most recent, and whether an update is required. This feature is optional and usually works in tandem with batch import data jobs.
Real-time data import jobs require Stream Data Sets as their source. Pega supports connecting to various streaming data sources such as Kafka, Amazon Kinesis, and Pega Stream. You can create new Data Sets directly from Customer Profile Designer.
When new data becomes available in the Stream Data Set, the system processes the data through a Data Flow in real time. Customer Decision Hub provides extension points through a Sub Data Flow component, where you can perform data manipulation and specific condition-based filtering.
By default, the Data Flow updates the full record in its destination, and hence the incoming data must contain all the fields of the destination record. This process is known as a full record update. In some cases, only some fields of the target record might require an update. In this scenario, use the Sub Data Flow extension to create a new extension Data Flow that merges the incoming fields with the original record. The pyImportJobProcessingType property determines the processing type of the data job. If this property is not set, the system ingests the data. To delete records instead, set the property to DataDelete.
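As an illustration only, an incoming stream record that deletes a customer record might resemble the following JSON payload. All field names except pyImportJobProcessingType are hypothetical and must match the properties of the destination profile data source; a regular ingest record would carry all destination fields and leave the property unset:

```json
{
  "CustomerID": "C-100002",
  "FirstName": "John",
  "LastName": "Doe",
  "pyImportJobProcessingType": "DataDelete"
}
```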
The destination of a real-time data job must be a profile data source that is already on the allow list. Depending on the type of destination data set, the system allows only certain operations.
In summary, Customer Decision Hub uses data jobs to process volatile and non-volatile data.
To process non-volatile data, the data files are structured externally to match the data model in Customer Insights Cache and can be compressed before transfer to the file repository. For file-level and record-level validations, a manifest file is necessary to hold additional information about the data file that it represents.
The ETL team loads the files from an external environment to a file repository to which Customer Decision Hub has access.
File detection or a given schedule can trigger the data jobs. When you select a file detection trigger, the system requires a token file to initiate the data job: file listeners continuously check the file repository for the token file, and when they detect it, the ingestion process begins. For a scheduled data job, the process begins when the defined time arrives.
Case types streamline and automate the flow of files from the repository into their destinations because the process is visible and provides various error-handling options. Customer Decision Hub includes a preconfigured case type for you.
After the process begins, the system creates a data job work object. During this process, the files are validated, staged, and moved to their final destination. Finally, the files are archived.
Customer Decision Hub receives and stores the data in Stream Data Sets to process volatile data. As new data becomes available in the Stream, the system processes the data, and insertion, update, or deletion of the target record occurs in real time. Data reconciliation is available to update the target record only when the source data is more recent.