Resilience
Resilience is the ability of a system to withstand disruptions and recover to a stable state without compromising functionality. A resilient system minimizes service downtime and maintains consistent performance to support uninterrupted user experiences. To enhance resilience, apply well-defined access controls to limit code exposure. This approach reduces vulnerabilities and helps to prevent security breaches.
Need for system resilience
As a Lead System Architect, you design and oversee systems and applications that are robust, reliable, and capable of recovering efficiently from failures. To support resilience, apply best practices for error handling, redundancy, failover mechanisms, and continuous monitoring to detect and resolve issues proactively. These practices help maintain high availability and performance under adverse conditions so that the system meets its service-level agreements (SLAs) and supports consistent user experiences.
Resilience helps you prepare for incidents in advance, which enables you to:
- Increase service reliability.
- Minimize the impact of incidents.
- Reduce downtime.
- Provide clear instructions for incident response.
- Automatically repair specific malfunctions.
A production-ready system delivers consistent performance and a reliable user experience. When designing a system, evaluate the following parameters:
- Stability: Works reliably and behaves as consumers expect.
- Scalability: Handles increased demand while maintaining performance.
- Performance: Processes tasks quickly and efficiently to deliver expected business value.
- Resilience: Absorbs failures while serving traffic and meeting service-level objectives.
- Observability: Supports monitoring, diagnostics, and issue detection in production.
- Documentation: Offers clear guidance to help users understand the system and resolve issues.
Pega-specific resilience features
Pega Platform™ and Pega Cloud® include built-in resilience features that support high availability, fault tolerance, and seamless recovery across enterprise environments. These features help maintain consistent performance and service continuity during system failures, maintenance activities, and unexpected outages.
Clustering and load balancing
- Enterprise-grade clustering in Pega Platform enables multiple nodes to share a database while operating independently to improve scalability and reliability.
- Uniform configuration across nodes is essential for seamless failover.
- Integration with enterprise load balancers ensures efficient traffic distribution and prevents overload.
- Redundancy is built into connectors and services to eliminate single points of failure.
Application server failover
- Automatic failover redirects users to healthy servers during outages.
- Quiesce functionality allows graceful server maintenance without data loss.
- Rolling restarts enable updates without service disruption.
- Crash recovery preserves user sessions, allowing seamless continuation post-failure.
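The failover behavior described in this list can be illustrated with a minimal, generic sketch. It is not how Pega Platform or any specific load balancer implements failover. The function name `route_request`, the node names, and the `is_healthy` callback are all hypothetical and exist only for illustration. The sketch scans the nodes in round-robin order and returns the first one that reports as healthy:

```python
from typing import Callable, Sequence


def route_request(
    nodes: Sequence[str],
    is_healthy: Callable[[str], bool],
    start_index: int = 0,
) -> str:
    """Return the first healthy node, scanning round-robin from start_index.

    Hypothetical helper: real load balancers run their own health checks
    and routing policies; this only shows the basic idea of skipping
    unhealthy nodes.
    """
    for offset in range(len(nodes)):
        node = nodes[(start_index + offset) % len(nodes)]
        if is_healthy(node):
            return node
    # Every node failed its health check, so there is nowhere to fail over to.
    raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    # Illustrative node names and health states (assumptions, not real hosts).
    health = {"node-a": False, "node-b": True, "node-c": True}
    print(route_request(list(health), lambda n: health[n]))
```

In this sketch, a request that would have gone to the unhealthy `node-a` is redirected to the next healthy node in the list.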
Database connection resilience
- Dynamic connection pooling in Pega Platform manages high-frequency database interactions.
- Automated recovery handles outages up to 120 seconds without manual intervention.
- Failover strategies include clustered or replicated databases with standby switchover.
- Customizable Java Database Connectivity (JDBC) settings allow tuning for performance and stability.
- Best practices guide optimal pool configurations for various deployments.
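The idea of recovering automatically from a short database outage can be sketched as a retry loop with a time limit. This is a generic illustration under stated assumptions, not the Pega Platform implementation: the function `call_with_recovery`, its parameters, and the backoff values are hypothetical. Only the 120-second window comes from the text above.

```python
import time


def call_with_recovery(
    operation,
    max_outage_seconds=120.0,  # outage window taken from the text above
    initial_delay=1.0,         # assumed backoff values, for illustration only
    max_delay=15.0,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Retry `operation` until it succeeds or the outage window elapses."""
    deadline = clock() + max_outage_seconds
    delay = initial_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                # The outage lasted longer than the window: stop retrying
                # and let the caller (or an operator) handle the failure.
                raise
            # Wait before the next attempt, but never past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

The `clock` and `sleep` parameters are injectable so that the retry logic can be tested without waiting in real time. An outage that ends within the window is invisible to the caller, while a longer outage surfaces as the original `ConnectionError`.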
Pega Cloud resilience
- Pega Cloud deployments use multi-availability zone architectures, automated scaling, and geographic redundancy to support resilience.
- Integration with cloud-native load balancers enhances health checks and traffic distribution.
Standards for system resilience
Use metrics, tracers, and logs effectively for root cause analysis, because these tools help you understand what went wrong. CPU utilization is an important metric for an overall health check. Also monitor the frequency and duration of Java garbage collections, as well as inbound and outbound network traffic.
Testing in Kubernetes
Testing in a Kubernetes environment can trigger internal Kubernetes events, such as pod scale-up and scale-down. Monitor these events as part of the test procedure. Because Kubernetes pods are ephemeral, design services to tolerate hardware and software failures. It is a best practice to run at least two pod replicas of each service so that the service stays available when one pod fails, which helps keep its uptime higher than that of the services that depend on it.
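The benefit of running at least two pod replicas can be estimated with simple arithmetic. The sketch below is a simplification that assumes pods fail independently of each other, which is not always true in practice (for example, when replicas share a node). The function name and the 99% figure are hypothetical and used only for illustration. Under that assumption, a service is unavailable only when all replicas are down at the same time:

```python
def replicated_availability(pod_availability: float, replicas: int) -> float:
    """Estimate service availability for `replicas` independent pods.

    Assumes independent failures: the service is down only if every
    replica is down, so availability = 1 - (1 - a) ** n.
    """
    if not 0.0 <= pod_availability <= 1.0:
        raise ValueError("pod_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - pod_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical figure: each pod is available 99% of the time.
    for n in (1, 2, 3):
        print(n, round(replicated_availability(0.99, n), 6))
```

With these illustrative numbers, moving from one replica to two raises the estimated availability from 99% to approximately 99.99%.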
Role of automation
The automated CI/CD build and deployment process checks for issues during the development cycle, which helps ensure that developers deploy only reliable and secure code. Developers should implement unit tests, integration tests, and security tests to identify defects and vulnerabilities. Every application developer improves resilience by following secure coding standards, establishing best practices, and continuously refining both as standards evolve. Building a resilient system requires coordination between development, security, and testing teams. Automated resilience testing, such as chaos engineering, helps create a robust system.
Microbenchmark testing
Benchmark testing is the process of identifying metrics for an application and using those metrics to evaluate and maintain its quality. In a microbenchmark, you run small units of performance-critical code several times in a controlled environment with a standard processor and memory configuration to obtain consistent results. Record the execution time of each run and consolidate the data into a mean execution time, which serves as the microbenchmark result.
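A minimal sketch of this procedure, using the `timeit` module from the Python standard library, might look as follows. The helper `microbenchmark`, the sample function `build_key`, and the run counts are hypothetical choices for illustration. A production benchmark would typically also add warm-up runs and control the environment more strictly.

```python
import statistics
import timeit


def microbenchmark(func, runs=5, calls_per_run=1000):
    """Time `func` over several runs; return per-call times and their mean.

    Each run calls `func` `calls_per_run` times. The total time of a run
    is divided by the call count to get a per-call time in seconds.
    """
    timer = timeit.Timer(func)
    totals = timer.repeat(repeat=runs, number=calls_per_run)
    per_call = [total / calls_per_run for total in totals]
    return per_call, statistics.mean(per_call)


def build_key():
    """Stand-in for a unit of performance-critical code (illustrative)."""
    return "-".join(str(i) for i in range(100))


if __name__ == "__main__":
    times, mean_time = microbenchmark(build_key)
    for index, seconds in enumerate(times, start=1):
        print(f"run {index}: {seconds * 1e6:.2f} microseconds per call")
    print(f"mean: {mean_time * 1e6:.2f} microseconds per call")
```

The absolute numbers depend on the machine, so compare results only between runs made with the same processor and memory configuration.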