
Resilience

Resilience is a system's ability to withstand issues and recover to its original state. A resilient system improves the user experience by minimizing service disruptions. One way to improve resilience is to reduce code exposure by granting only appropriate access permissions, which reduces vulnerabilities and the risk of security breaches.

Need for system resilience

For a Lead System Architect, resilience involves ensuring that the systems and applications they design and oversee are robust, reliable, and capable of recovering quickly from failures. This responsibility includes implementing best practices for error handling, redundancy, failover mechanisms, and continuous monitoring to detect and address issues proactively. The goal is to maintain high availability and performance, even under adverse conditions, to meet service-level agreements (SLAs) and ensure a seamless user experience.
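Failover mechanisms take many forms. As a minimal sketch, assuming a service can reach the same data through a primary endpoint and one or more backups, a client can try each endpoint in order and return the first successful result. The `fetch_with_failover` helper and the endpoint functions are hypothetical, and the endpoints are modeled as plain Python callables rather than real network clients.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the failover chain has failed."""


def fetch_with_failover(endpoints: Sequence[tuple[str, Callable[[], T]]]) -> tuple[str, T]:
    """Call each (name, endpoint) pair in order and return the first success.

    Hypothetical helper: real systems usually add timeouts, health checks,
    and circuit breakers on top of simple ordered failover.
    """
    errors: list[str] = []
    for name, call in endpoints:
        try:
            return name, call()
        except Exception as exc:  # any failure moves on to the next endpoint
            errors.append(f"{name}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


def primary() -> str:
    raise ConnectionError("primary unreachable")  # simulated outage


def secondary() -> str:
    return "order #42 from secondary"


if __name__ == "__main__":
    served_by, result = fetch_with_failover([("primary", primary), ("secondary", secondary)])
    print(served_by, result)  # secondary order #42 from secondary
```

Ordering the endpoints explicitly keeps the failover policy visible in one place; the collected error messages give monitoring a record of which endpoints failed and why.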

Planning for resilience helps you prepare for incidents in advance, which enables you to:

  • Increase the resilience of your service.
  • Reduce the impact of an incident.
  • Shorten downtime.
  • Provide clear instructions on what to do in the event of an incident.
  • Repair some malfunctions automatically.
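Repairing some malfunctions automatically often starts with retrying operations that fail transiently. The following is a minimal sketch of retry with exponential backoff, assuming the failure is transient and the operation is safe to repeat (idempotent). The `retry_with_backoff` helper and its parameters are hypothetical; the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation, doubling the delay after each failed attempt.

    Assumes `operation` is idempotent. Re-raises the last error once
    `max_attempts` is exhausted so that monitoring can see the failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure
            sleep(base_delay * 2 ** (attempt - 1))  # 0.1 s, 0.2 s, 0.4 s, ...
    raise AssertionError("unreachable when max_attempts >= 1")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("transient timeout")
        return "ok"

    delays: list[float] = []
    print(retry_with_backoff(flaky, sleep=delays.append), delays)  # ok [0.1, 0.2]
```

Capping the number of attempts matters: unbounded retries can turn a brief outage into a sustained overload of the failing dependency.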

A production-ready system aims to provide users with a good experience. When building a system, consider parameters that determine the health of a system, failover mechanisms, ways to foresee issues, and diagnostic mechanisms. For a system to be production-ready, it must meet the following standards:

  • Stability: Works reliably and behaves as consumers expect.
  • Scalability: Meets increased demand while maintaining performance.
  • Performance: Processes tasks quickly and efficiently, delivering expected business value.
  • Resilience: Absorbs failures while continuing to serve traffic and meet service-level objectives (SLOs).
  • Observability: Allows comprehension, interrogation, and probing while in production.
  • Documentation: Provides proper guidance to help users understand the system and diagnose issues. 
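Meeting an SLO can be checked with simple arithmetic. For example, a 99.9% availability target over a 30-day window (43,200 minutes) allows 0.1% of that time, or 43.2 minutes, of unavailability. The sketch below applies the same idea to request counts, assuming a request-based availability indicator; the `evaluate_slo` function, the `SloReport` fields, and the traffic numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float       # fraction of requests that succeeded (the SLI)
    meets_slo: bool
    budget_remaining: float   # fraction of the error budget left; negative if overspent


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target such as 0.999."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1 (exclusive)")
    availability = 1 - failed_requests / total_requests
    allowed_failures = (1 - target) * total_requests  # the error budget, in requests
    budget_remaining = 1 - failed_requests / allowed_failures
    return SloReport(availability, availability >= target, budget_remaining)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 of which failed, against a 99.9% SLO.
    report = evaluate_slo(1_000_000, 250, 0.999)
    print(f"{report.availability:.3%} available, meets SLO: {report.meets_slo}, "
          f"{report.budget_remaining:.0%} of error budget left")
    # 99.975% available, meets SLO: True, 75% of error budget left
```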

Standards for system resilience

Use metrics, traces, and logs effectively for root cause analysis; together they help you understand what went wrong. CPU usage is also an important metric for an overall health check. In addition, monitor the frequency and duration of Java garbage collections, as well as inbound and outbound network traffic.
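The health signals listed above can be combined into a simple threshold check. This is a minimal sketch, assuming the metrics for one monitoring window have already been collected (for example, from a JVM monitoring agent); the `MetricWindow` fields, the `health_findings` function, and every threshold value are hypothetical and should come from your own baseline.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricWindow:
    """Metrics collected over one monitoring window (field names are hypothetical)."""
    window_seconds: float
    cpu_samples: list[float]   # CPU utilization samples, 0.0-1.0
    gc_pauses_ms: list[float]  # duration of each garbage collection pause
    net_in_bytes: int
    net_out_bytes: int


# Hypothetical alert thresholds; derive real ones from your own baseline.
MAX_AVG_CPU = 0.80
MAX_GC_PER_MINUTE = 30
MAX_GC_TIME_FRACTION = 0.05  # share of wall-clock time spent paused in GC
MAX_NET_MB_PER_SECOND = 100.0


def health_findings(m: MetricWindow) -> list[str]:
    """Return a human-readable finding for every threshold the window breaches."""
    findings: list[str] = []
    avg_cpu = mean(m.cpu_samples)
    if avg_cpu > MAX_AVG_CPU:
        findings.append(f"high CPU: {avg_cpu:.0%}")
    gc_per_minute = len(m.gc_pauses_ms) / (m.window_seconds / 60)
    if gc_per_minute > MAX_GC_PER_MINUTE:
        findings.append(f"frequent GC: {gc_per_minute:.1f}/min")
    gc_fraction = sum(m.gc_pauses_ms) / 1000 / m.window_seconds
    if gc_fraction > MAX_GC_TIME_FRACTION:
        findings.append(f"long GC pauses: {gc_fraction:.1%} of time")
    net_mb_per_second = (m.net_in_bytes + m.net_out_bytes) / 1e6 / m.window_seconds
    if net_mb_per_second > MAX_NET_MB_PER_SECOND:
        findings.append(f"heavy network traffic: {net_mb_per_second:.1f} MB/s")
    return findings


if __name__ == "__main__":
    window = MetricWindow(
        window_seconds=60,
        cpu_samples=[0.90, 0.85, 0.95],
        gc_pauses_ms=[100.0] * 40,
        net_in_bytes=50_000_000,
        net_out_bytes=20_000_000,
    )
    print(health_findings(window))
    # ['high CPU: 90%', 'frequent GC: 40.0/min', 'long GC pauses: 6.7% of time']
```

Tracking both GC frequency and total pause time is deliberate: many short collections and a few long ones degrade a service in different ways.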

Testing in Kubernetes

Testing in the Kubernetes environment can trigger internal Kubernetes events, such as pods scaling up and down. Monitor these events as part of the test procedure. Because Kubernetes pods are ephemeral, design services to withstand hardware and software failures. A best practice is to run at least two pod replicas so that the service stays available when one pod fails, which keeps its uptime higher than that of the services that depend on it.
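The benefit of a second replica can be estimated with basic probability. In a Kubernetes Deployment, the replica count is set by `spec.replicas`. The sketch below assumes that pod failures are independent, which pods sharing a node or zone may violate, and uses a hypothetical 99% availability per pod; the function names are illustrative only.

```python
def replicated_availability(per_pod: float, replicas: int) -> float:
    """Chance that at least one replica is up, assuming pods fail independently."""
    return 1 - (1 - per_pod) ** replicas


def serial_availability(*components: float) -> float:
    """A request that needs every component succeeds only if all of them are up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    per_pod = 0.99  # hypothetical availability of a single pod
    for n in (1, 2, 3):
        print(f"{n} replica(s): {replicated_availability(per_pod, n):.4%}")
    # A caller that is itself 99.9% available can never exceed its dependency.
    print(f"caller on 1 pod:  {serial_availability(0.999, replicated_availability(per_pod, 1)):.4%}")
    print(f"caller on 2 pods: {serial_availability(0.999, replicated_availability(per_pod, 2)):.4%}")
```

Under these assumptions, two replicas raise availability from 99% to 99.99%, and the serial calculation shows why a dependency needs higher uptime than the services that call it.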

Role of automation

The automated CI/CD build and deployment process checks for issues throughout the development cycle, which helps ensure that developers deploy only reliable and secure code. Developers should implement unit tests, integration tests, and security tests to identify vulnerabilities. Every application developer can improve resilience by strengthening secure coding standards, establishing best practices, and continuously refining them as standards evolve. Building a resilient system requires coordination between the development, security, and testing teams. Automated testing practices, such as chaos engineering, help create a robust system.
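Chaos engineering proper injects faults into running systems; a smaller step in the same spirit is a fault-injection test that can run in a CI pipeline. The sketch below assumes a hypothetical `InventoryService` that should fall back to its last known value when a dependency fails, and a hypothetical `FlakyInventoryClient` stub that raises errors at a configurable rate. The test functions follow pytest naming conventions and also run as a plain script.

```python
import random
from typing import Dict, Optional


class FlakyInventoryClient:
    """Hypothetical dependency stub that injects connection failures at a set rate."""

    def __init__(self, failure_rate: float, seed: int = 0) -> None:
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so CI runs are reproducible

    def stock_level(self, sku: str) -> int:
        if self.rng.random() < self.failure_rate:
            raise ConnectionError(f"injected fault for {sku}")
        return 7


class InventoryService:
    """Hypothetical caller under test: serves the last known value if the dependency fails."""

    def __init__(self, client: FlakyInventoryClient) -> None:
        self.client = client
        self.cache: Dict[str, int] = {}

    def stock_level(self, sku: str) -> Optional[int]:
        try:
            value = self.client.stock_level(sku)
        except ConnectionError:
            return self.cache.get(sku)  # degrade gracefully instead of crashing
        self.cache[sku] = value
        return value


def test_survives_full_dependency_outage() -> None:
    service = InventoryService(FlakyInventoryClient(failure_rate=0.0))
    assert service.stock_level("A1") == 7  # healthy call warms the cache
    service.client = FlakyInventoryClient(failure_rate=1.0)  # inject a total outage
    assert service.stock_level("A1") == 7  # served from cache
    assert service.stock_level("B2") is None  # nothing cached, but no exception


def test_partial_faults_never_raise() -> None:
    service = InventoryService(FlakyInventoryClient(failure_rate=0.5, seed=42))
    results = [service.stock_level("A1") for _ in range(100)]
    assert set(results) <= {7, None}


if __name__ == "__main__":
    test_survives_full_dependency_outage()
    test_partial_faults_never_raise()
    print("fault-injection tests passed")
```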

Microbenchmark testing

Benchmark testing is the process of identifying metrics for an application and using those metrics to evaluate and maintain its quality. Microbenchmarks apply this process to small units of performance-critical code: run each unit several times in a specific environment with a fixed processor and memory configuration for consistent results, record the execution time of each run, and consolidate the data to obtain the mean execution time.
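The steps described above (repeat the unit, time each run, and average the results) can be sketched in Python as follows. The `microbenchmark` helper, its `warmup` parameter, and the `build_report` workload are hypothetical. Reporting the standard deviation and minimum alongside the mean shows whether the runs were consistent.

```python
import statistics
import time
from typing import Callable, Dict


def microbenchmark(func: Callable[[], object], runs: int = 30, warmup: int = 5) -> Dict[str, float]:
    """Run `func` repeatedly and summarize its execution times in seconds."""
    if runs < 2:
        raise ValueError("need at least two runs to estimate spread")
    for _ in range(warmup):
        func()  # discarded runs: warm caches, lazy initialization, JIT on some runtimes
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings),
        "min": min(timings),
    }


def build_report(n: int = 10_000) -> str:
    """Hypothetical performance-critical unit: join many strings."""
    return ",".join(str(i) for i in range(n))


if __name__ == "__main__":
    stats = microbenchmark(build_report)
    print({name: f"{value * 1e3:.3f} ms" for name, value in stats.items()})
```

For real measurements, dedicated tools handle these details more rigorously: Python's standard `timeit` module temporarily disables garbage collection while timing, and JMH manages JVM warm-up for Java code.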
