Data Sets overview
Learn about the types of Data Set rules that you can use in Pega Platform™ to access data.
Data Sets help you to manage and organize your data efficiently. They bridge your application and external systems of record, such as file repositories and real-time data streams. For example, if you have a business process that requires customer data, you can use a Data Set to fetch this data from your database and make it available to your application. Data Sets help your application interact with data in a structured and efficient manner.
Data Sets can contain different types of data, such as customer attributes, product holdings, or customer activity on various channels. With Data Sets, you can store, enrich, and summarize data to derive customer insights and use them as input for decision-making and AI modeling.
You can group your Data Sets under four main categories. Each category contains different types of Data Sets. The functionalities and configuration of these Data Sets depend on the type that you select.
Database Management (DBM) to store and manage data in an internal database table or a backing database, such as Cassandra.
File system to read and write data from and into files in various destinations.
General to create, aggregate, and summarize data.
Stream to process real-time data.
Database table Data Sets
By using database table Data Sets, you can query data stored in the relational database internal to Pega Platform or in an external database, such as PostgreSQL or Oracle. You can quickly access the data by using a key that you define in the Data Set, and you can define multiple keys to query the source. Database table Data Sets support the Browse, Browse by keys, Delete by keys, Save, and Truncate operations.
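The five operations named above can be illustrated with a minimal Python sketch. This is not Pega Platform code: the KeyedTable class, its method names, and the customer_id and segment fields are hypothetical, and an in-memory dictionary stands in for the database table. It only shows how records might be saved, browsed, and deleted by the keys that you define.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, object]


class KeyedTable:
    """Hypothetical in-memory stand-in for a table queried by defined keys."""

    def __init__(self, key_fields: List[str]) -> None:
        self.key_fields = key_fields
        self._rows: Dict[Tuple[object, ...], Record] = {}

    def _key_of(self, record: Record) -> Tuple[object, ...]:
        return tuple(record[f] for f in self.key_fields)

    def save(self, record: Record) -> None:
        # Insert the record, or overwrite the row with the same key values.
        self._rows[self._key_of(record)] = dict(record)

    def browse(self) -> Iterator[Record]:
        # Return every stored record.
        return iter(list(self._rows.values()))

    def browse_by_keys(self, **keys: object) -> List[Record]:
        # Return the records whose fields match all supplied key values.
        return [r for r in self._rows.values()
                if all(r.get(k) == v for k, v in keys.items())]

    def delete_by_keys(self, **keys: object) -> int:
        # Remove the matching records and report how many were removed.
        doomed = [k for k, r in self._rows.items()
                  if all(r.get(f) == v for f, v in keys.items())]
        for k in doomed:
            del self._rows[k]
        return len(doomed)

    def truncate(self) -> None:
        # Remove all records.
        self._rows.clear()


table = KeyedTable(key_fields=["customer_id"])
table.save({"customer_id": "C-1", "segment": "gold"})
table.save({"customer_id": "C-2", "segment": "silver"})
matches = table.browse_by_keys(customer_id="C-1")
```

In this sketch, Save acts as an insert-or-update on the key, which is an assumption made for illustration rather than a statement about how the platform resolves duplicate keys.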
Decision Data Store Data Sets
Decision Data Store Data Sets stage the data for fast decision management. Horizontally scalable and supported by decision data nodes, Decision Data Store Data Sets make data available for real-time and batch processing. The data is always stored in a Cassandra-backed table. The data stored in a Decision Data Store can also contain nested structures.
HBase Data Sets
Apache HBase is an open-source, distributed NoSQL big data store. HBase Data Sets read data from and save data to an external Apache HBase storage.
File Data Sets
File Data Sets read data from and write data to files. To read from or write to a destination in a repository, select the Files on repository option.
To upload or download a file, use the Embedded file option. This option is not recommended for larger files.
HDFS Data Sets
HDFS Data Sets read data from and save data to an external Apache Hadoop Distributed File System (HDFS). This type of Data Set supports partitioning to create distributed runs with Data Flows.
Monte Carlo Data Sets
Monte Carlo Data Sets generate any number of random data records for various information types. Their main purpose is to provide test data when real data is not available. Monte Carlo Data Sets can run in batch mode, where the Data Set creates a set number of records, or in real-time mode, where the Data Set creates records continuously at a given rate.
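The difference between the batch and real-time modes can be sketched in Python. This is an illustration only, not the platform's implementation: the function names, the record fields, and the sample channel values are all hypothetical.

```python
import itertools
import random
import time
from typing import Dict, Iterator, List

Record = Dict[str, object]
CHANNELS = ["web", "email", "call center"]  # hypothetical sample values


def random_record(rng: random.Random, index: int) -> Record:
    # Build one synthetic customer record from random values.
    return {
        "customer_id": f"C-{index}",
        "age": rng.randint(18, 90),
        "channel": rng.choice(CHANNELS),
    }


def generate_batch(count: int, seed: int = 0) -> List[Record]:
    # Batch mode: create a fixed number of records and stop.
    rng = random.Random(seed)
    return [random_record(rng, i) for i in range(count)]


def generate_stream(rate_per_second: float, seed: int = 0) -> Iterator[Record]:
    # Real-time mode: keep creating records at roughly the given rate.
    rng = random.Random(seed)
    for i in itertools.count():
        yield random_record(rng, i)
        time.sleep(1.0 / rate_per_second)


batch = generate_batch(count=5, seed=42)
first_three = list(itertools.islice(generate_stream(rate_per_second=1000), 3))
```

Seeding the generator makes a test run repeatable, which is often useful when the generated records feed automated tests.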
Summary Data Sets
Summary Data Sets aggregate various types of data to limit and refine it for use in decision strategies, models, or Data Flows. Summary Data Sets source their data from stream Data Sets or Data Flows with a stream source and an abstract destination.
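As a rough illustration of this kind of aggregation, the following Python sketch reduces a sequence of events to one summary per customer. The event fields (customer_id, amount) and the choice of count and sum as aggregates are assumptions made for the example, not the configuration options of a Summary Data Set.

```python
from collections import defaultdict
from typing import Dict, Iterable


def summarize(events: Iterable[Dict[str, object]]) -> Dict[str, Dict[str, float]]:
    # Keep a running count and total amount for each customer.
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "sum": 0.0}
    )
    for event in events:
        summary = totals[str(event["customer_id"])]
        summary["count"] += 1
        summary["sum"] += float(event["amount"])
    return dict(totals)


events = [
    {"customer_id": "C-1", "amount": 10.0},
    {"customer_id": "C-2", "amount": 5.0},
    {"customer_id": "C-1", "amount": 20.0},
]
result = summarize(events)
```

Because the function consumes events one at a time, the same logic applies whether the events come from a finite list or from a continuous stream.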
Kafka Data Sets
Kafka Data Sets connect to Apache Kafka, a high-throughput, low-latency platform for handling real-time data feeds. Kafka Data Sets offer high performance and horizontal scalability for event and message queuing. You can partition them to distribute the load across the Kafka cluster.
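Partitioning by key is what allows the load to be spread across a cluster. The Python sketch below shows the general idea: hash the record key and take the result modulo the number of partitions, so that records with the same key always land in the same partition. Kafka's default partitioner uses a murmur2 hash; this sketch substitutes MD5 from the standard library as a stand-in, and the partition_for function and the sample keys are hypothetical.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash of the key (MD5 here as a stand-in for Kafka's murmur2),
    # reduced to a partition index in the range [0, num_partitions).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


keys = [f"customer-{i}" for i in range(1000)]
load = Counter(partition_for(k, num_partitions=6) for k in keys)
```

Here, load maps each partition index to the number of keys assigned to it, which gives a quick view of how evenly the keys are spread.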
Kinesis Data Sets
Kinesis Data Sets connect to an instance of Amazon Kinesis Data Streams to access data. Kinesis Data Streams captures, processes, and stores high volumes of data in real time. Examples of such data include IT infrastructure logs, application logs, social media, market data feeds, and web clickstream data.
Stream Data Sets
Stream Data Sets process a continuous data stream of events (records). Use a Pega REST connector Rule to populate the stream Data Set with external data. The stream Data Set also exposes REST and WebSocket endpoints.
Kafka and Kinesis are both distributed streaming platforms designed for real-time data processing, but they come from different providers and differ in how you operate them. Apache Kafka is an open-source distributed event streaming platform maintained by the Apache Software Foundation. It is known for high throughput, fault tolerance, and scalability when storing and processing large volumes of data. Amazon Kinesis is a fully managed streaming service provided by AWS that offers similar capabilities while simplifying infrastructure management tasks. Kafka provides more flexibility because you can deploy it on various infrastructures, while Kinesis offers ease of use and integration with other AWS services, which makes it a common choice for users within the Amazon Web Services ecosystem. The choice between Kafka and Kinesis depends on your project requirements and preferred cloud environment.
Visual Business Director Data Sets - Internal
Visual Business Director (VBD) Data Sets store data that you can view in the Visual Business Director planner to assess the success of your business strategy.