Architectural patterns for Data Sets
Lead System Architects typically start with Data Pages (for caching and reading) and Case data (for authoring and persistence). These are necessary but often insufficient when applications must deliver high-throughput, low-latency access to operational decisioning data, resilient ingestion from external systems, or durable, queryable staging that participates in analytical and real-time pipelines.
Data Sets provide a storage and connection abstraction that:
- Encapsulates where data resides and how it is accessed. Examples include internal relational database tables, Decision Data Store (DDS, backed by Apache Cassandra), messaging topics (Kafka/HStream), cloud analytics platforms (BigQuery or Snowflake), and file systems (files or Hadoop Distributed File System (HDFS)).
- Aligns with Decision Management runtime needs, including keys for fast lookups, partitioning for scale, and consistent read/write semantics.
- Acts as a first-class source and destination in Data Flows (stream and batch pipelines), enabling transformations, strategies, text analytics, and event processing without being tightly coupled to UI Rules or Case processing.
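The abstraction described in the list above can be pictured with a minimal sketch. This is an illustration only, not the Pega API: the class names DataSet and InMemoryKeyedDataSet, the method names, and the copy_all helper are all hypothetical. The point is that a pipeline step depends on the interface, not on where the records physically live.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional

Record = Dict[str, object]


class DataSet(ABC):
    """Hypothetical storage abstraction (illustrative, not the Pega API)."""

    @abstractmethod
    def save(self, record: Record) -> None: ...

    @abstractmethod
    def browse(self) -> Iterator[Record]: ...

    @abstractmethod
    def browse_by_key(self, key: str) -> Optional[Record]: ...


class InMemoryKeyedDataSet(DataSet):
    """Stand-in for any keyed backing store (table, DDS, and so on)."""

    def __init__(self, key_field: str) -> None:
        self._key_field = key_field
        self._rows: Dict[str, Record] = {}

    def save(self, record: Record) -> None:
        # Upsert by key: a second save with the same key replaces the row.
        self._rows[str(record[self._key_field])] = dict(record)

    def browse(self) -> Iterator[Record]:
        return iter(list(self._rows.values()))

    def browse_by_key(self, key: str) -> Optional[Record]:
        return self._rows.get(key)


def copy_all(source: DataSet, destination: DataSet) -> int:
    """A pipeline step that only knows the interface, not the storage."""
    count = 0
    for record in source.browse():
        destination.save(record)
        count += 1
    return count
```

Because copy_all accepts any DataSet, swapping the source from a file-backed implementation to a topic-backed one would not change the pipeline code, which is the decoupling the list describes.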
When your application needs to process or transform large volumes of data at scale and maintain fast operational reads for informed decisions (for example, offers, eligibility, risk, or Next Best Action), you require a layer designed for throughput, partitioning, and operational consistency. That layer is Data Sets (commonly DDS for decisioning workloads).
Common scenarios for Data Sets
The following architectural patterns illustrate how Data Sets enable scalable, low-latency data handling across diverse decisioning scenarios:
Operational decisioning caches for low-latency lookups
In real-time decisioning environments, performance is critical. Applications often retrieve customer profiles, counters, or model inputs in milliseconds to meet Service-Level Agreements (SLAs). Relying only on relational tables or Data Pages can introduce latency and scalability constraints because they lack partition-aware distribution. Data Sets, particularly DDS Data Sets, address this by defining partition and sort keys that distribute data across decisioning nodes for horizontal scalability. Architects can choose between strong and eventual consistency to balance accuracy and performance, which supports predictable, low-latency access under heavy load.
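To make the role of partition and sort keys concrete, here is a small sketch under stated assumptions. The PartitionedStore class and node_for function are hypothetical, and MD5 is used only as a stable stand-in hash; it is not the partitioner that Cassandra or DDS actually uses. The sketch shows the general idea: the partition key decides which node owns a row, and the sort key orders rows within that partition.

```python
import hashlib
from typing import Dict, List


def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node index with a stable hash (illustrative)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class PartitionedStore:
    """Hypothetical store: one dict per node, keyed by partition key."""

    def __init__(self, node_count: int) -> None:
        self.node_count = node_count
        # nodes[i] maps partition key -> {sort key -> row}
        self.nodes: List[Dict[str, Dict[str, dict]]] = [
            {} for _ in range(node_count)
        ]

    def save(self, partition_key: str, sort_key: str, row: dict) -> None:
        node = self.nodes[node_for(partition_key, self.node_count)]
        node.setdefault(partition_key, {})[sort_key] = row

    def read_partition(self, partition_key: str) -> List[dict]:
        """Single-node lookup; rows come back ordered by sort key."""
        node = self.nodes[node_for(partition_key, self.node_count)]
        rows = node.get(partition_key, {})
        return [rows[key] for key in sorted(rows)]
```

A read for one customer touches exactly one node, which is why a well-chosen partition key keeps lookups fast as the number of nodes grows.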
Streaming ingestion and event pipelines (for example, clickstream or transactions)
Modern applications increasingly rely on real-time event processing, such as clickstream data, financial transactions, or IoT signals. These scenarios require continuous ingestion, enrichment, and routing of events without bottlenecks. Data Sets backed by Kafka integrate directly with real-time Data Flows for concurrent and resilient stream processing. This architecture supports high-volume pipelines while maintaining observability and fault tolerance.
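The ingest, enrich, and route stages can be sketched with plain Python generators. This is a conceptual illustration, not a Kafka client or a Pega Data Flow: the function names, the event fields (customer_id, type), and the dead_letter list are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def enrich(
    events: Iterable[Event], profiles: Dict[str, Dict[str, object]]
) -> Iterator[Event]:
    """Attach a customer segment to each event as it streams through."""
    for event in events:
        profile = profiles.get(str(event.get("customer_id")), {})
        yield {**event, "segment": profile.get("segment", "unknown")}


def route(
    events: Iterable[Event],
    sinks: Dict[str, List[Event]],
    dead_letter: List[Event],
) -> None:
    """Send each event to the sink for its type; keep unroutable events."""
    for event in events:
        sink = sinks.get(str(event.get("type")))
        # An empty list is a valid sink, so compare against None explicitly.
        (sink if sink is not None else dead_letter).append(event)
```

Events are processed one at a time without buffering the whole stream, and events with no matching sink are retained in dead_letter rather than dropped, a simple form of the fault tolerance mentioned above.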
Staging layers for batch transformations and data exchange
Enterprise systems often need to land large external datasets, such as files from HDFS or tables from Snowflake or BigQuery, before cleansing, joining, and publishing downstream. Data Sets formalize these storage formats and allow distributed batch processing through Data Flows. This approach provides repeatability, observability, and governance for extract, transform, and load (ETL) operations. Architects can use Data Flow run limits and priorities to protect resources and maintain system stability during heavy batch loads.
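As a rough sketch of the staging pattern, the following example cleanses staged rows, joins them to reference data, and caps concurrency with a worker limit. The max_workers cap is only an analogy for Data Flow run limits, not the Pega setting itself, and all function and field names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional

Row = Dict[str, str]


def cleanse(row: Row) -> Optional[Row]:
    """Trim whitespace, normalize country codes, drop rows with no key."""
    customer_id = row.get("customer_id", "").strip()
    if not customer_id:
        return None
    return {
        "customer_id": customer_id,
        "country": row.get("country", "").strip().upper(),
    }


def process_partition(rows: List[Row], regions: Dict[str, str]) -> List[Row]:
    """Cleanse one staged partition and join it to the region lookup."""
    out: List[Row] = []
    for row in rows:
        clean = cleanse(row)
        if clean is not None:
            clean["region"] = regions.get(clean["country"], "UNKNOWN")
            out.append(clean)
    return out


def run_batch(
    partitions: List[List[Row]], regions: Dict[str, str], max_workers: int = 2
) -> List[Row]:
    """Process partitions in parallel, never using more than max_workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda part: process_partition(part, regions), partitions)
        # pool.map preserves the input order of partitions.
        return [row for part in results for row in part]
```

Lowering max_workers trades batch throughput for headroom on shared resources, which is the same trade-off that run limits and priorities let architects manage.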
Feature stores for Adaptive or Predictive Models and monitoring marts
Adaptive and Predictive Models depend on historical data and feature-rich inputs. To monitor performance and support analytics, applications need to persist model snapshots, outcomes, and key performance indicators. Pega Platform™ includes built-in Data Sets, such as pyModelSnapshots, for this purpose. Architects can extend these structures into custom monitoring marts that support governance and transparency for AI-driven decision-making. This capability is especially important for organizations that prioritize explainability and compliance in Predictive Models.
Low-latency, high-volume counters and aggregates
Certain business rules depend on real-time counters, for example, the number of offers for each customer per day or the number of failed login attempts for each device. Implementing these only with Case data or relational tables can introduce contention and latency. DDS Data Sets, with their partitioning and key-based design, keep counters close to decision nodes, ensuring fast updates and reads. Use Data Flows to continuously maintain these counters and support dynamic decision strategies without compromising performance.
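A keyed counter of this kind can be sketched as follows. The DailyCounter class is hypothetical and holds its state in process memory; in a real deployment the counts would live in a partitioned store such as DDS. The composite key (customer, day) plays the role of the partition key and sort key.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, Tuple


class DailyCounter:
    """Hypothetical per-customer, per-day counter with a daily cap."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self._counts: DefaultDict[Tuple[str, date], int] = defaultdict(int)

    def try_increment(self, customer_id: str, day: date) -> bool:
        """Count one more offer unless the customer has reached the cap."""
        key = (customer_id, day)
        if self._counts[key] >= self.daily_limit:
            return False
        self._counts[key] += 1
        return True

    def count(self, customer_id: str, day: date) -> int:
        return self._counts.get((customer_id, day), 0)
```

A decision strategy could call try_increment before presenting an offer and suppress the offer when it returns False. Because each key is independent, updates for different customers do not contend with one another.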