Mission Critical Software: Navigating the ‘Always On’ Challenge

Many systems undergo routine maintenance, causing planned downtime that is usually scheduled outside regular working hours and often goes unnoticed by users. However, critical systems can’t afford any downtime, because the potential consequences are catastrophic, including loss of life and significant financial damage. Examples include hospital systems, airplane or railway navigation systems, and systems managing the energy grid. In addition, some systems must adhere to strict compliance regulations.

Achieving uninterrupted availability for such systems requires specific measures. Implementing these measures is not as simple as pushing a button. Incorporating them into early designs and development is often more efficient and cost-effective than trying to apply them when something has already gone wrong.

Development measures establish a resilient system that seamlessly transitions between versions, reduces complexity, handles errors effectively, and ensures redundancy. Operational measures involve ongoing management, encompassing updates, service levels, security, resource monitoring, and risk diversification. This distinction underscores the comprehensive approach necessary for maintaining constant software system availability, with development building resilience and operations sustaining uninterrupted performance.

This blog post provides an overview of crucial considerations when developing and operating mission-critical software systems that must remain ‘always on’.

Development Considerations

Zero Downtime Deployments

Deployments should never cause downtime for an ‘always on’ system. How to achieve this goal varies depending on the software type. An essential requirement is ensuring backward compatibility, allowing the coexistence of old and new software versions. This enables a seamless or gradual transition from the old version to the new one. Achieving backward compatibility may involve introducing default values for new parameters and making new parameters optional. Such an approach allows for a new interface while still supporting the old one.
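
As a minimal sketch of this idea, the hypothetical handler below accepts both the old and the new payload shape by defaulting a field that was added in the new version (the field names are illustrative, not from any specific system):

```python
# Hypothetical request handler: v2 of the API adds an optional "priority"
# field. Old clients that omit it still work because a default is supplied.

def handle_order(request: dict) -> dict:
    """Process an order request; tolerates both v1 and v2 payloads."""
    order_id = request["order_id"]                 # present in v1 and v2
    priority = request.get("priority", "normal")   # new in v2, defaulted for v1
    return {"order_id": order_id, "priority": priority}
```

Old clients keep sending the v1 payload unchanged while new clients adopt the v2 field, so both versions can run side by side during the transition.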

Simpler is more reliable

Reducing the number of components in a chain can significantly increase its overall availability. This principle is rooted in the concept of “simpler is more reliable.” With fewer components, there are fewer points of potential failure, which leads to a lower probability of system downtime. Additionally, streamlined chains are easier to monitor, maintain, and troubleshoot, allowing for quicker identification and resolution of issues. By minimizing complexity and dependencies, organizations can achieve higher levels of system availability and reliability, ultimately providing a more robust and resilient infrastructure for their operations.
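
The underlying arithmetic can be sketched as follows: components in series are only up when every component is up, so overall availability is the product of the individual availabilities:

```python
def chain_availability(availabilities):
    """Availability of components in series: all must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Five components at 99.9% each vs. three — fewer links, higher availability:
print(chain_availability([0.999] * 5))  # ~0.9950
print(chain_availability([0.999] * 3))  # ~0.9970
```

Removing two 99.9%-available links from a five-link chain recovers roughly 0.2 percentage points of availability, which translates into many hours per year.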

Stateful vs. Stateless Systems

For stateful systems like databases, solutions such as edition-based redefinition (available in Oracle databases) allow running multiple versions of database code concurrently, enabling switching between them as needed. Alternatively, you can operate multiple databases alongside each other, one with the old code and another with the new code, connecting your application to the new database while maintaining access to the old one. In that case, you may also need to keep the databases synchronized.

Stateless applications require different solutions. Transitioning from an old to a new version often relies on a component that abstracts application access. This component can, for example, be a load balancer or an explicit proxy service, which typically supports seamless migration to new versions. Additional tools like API Gateways, Service Buses, or Service Meshes can serve similar purposes. Some platforms even offer built-in functionality, such as deployment slots for Azure Functions in Microsoft Azure or rolling updates on container platforms.
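
To illustrate the gradual-transition role such a component plays, here is a hypothetical canary router that sends a configurable fraction of traffic to the new version; real load balancers and service meshes offer this as configuration rather than code:

```python
import random

def route(new_fraction: float, rng=random.random) -> str:
    """Send roughly `new_fraction` of requests to the new version (v2),
    the rest to the old version (v1)."""
    return "v2" if rng() < new_fraction else "v1"

# Gradually raise new_fraction from 0.0 to 1.0 while watching error rates;
# roll back instantly by setting it to 0.0 again.
```

Because both versions stay deployed throughout the rollout, a bad release is undone by shifting traffic back rather than by redeploying, which keeps the transition downtime-free.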

Error Handling

An ‘always on’ application must be robust, ensuring continuous operation. It should be capable of handling issues like out-of-memory errors, memory leaks, or resource-intensive code that can lead to performance degradation. Implementing safeguards, such as limiting concurrent requests per instance or adopting resource-efficient request processing techniques (e.g., using non-blocking code, for example reactive frameworks in Java), can enhance application resilience. These measures also fortify the system against distributed denial-of-service (DDoS) attacks, which often exploit resource vulnerabilities.
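
One such safeguard can be sketched as a limiter that sheds load beyond a fixed number of in-flight requests instead of letting them queue up and exhaust memory; the class and its behavior are illustrative, not a specific framework’s API:

```python
import threading

class ConcurrencyLimiter:
    """Reject requests beyond a fixed in-flight limit instead of queueing
    them, so one slow endpoint cannot exhaust the whole instance."""

    def __init__(self, max_in_flight: int):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler):
        if not self._sem.acquire(blocking=False):
            return "rejected"          # shed load instead of degrading
        try:
            return handler()
        finally:
            self._sem.release()
```

Rejecting the excess requests early keeps the instance responsive for the requests it did accept, which is also the behavior you want under a DDoS attack.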

Maintaining system availability necessitates having at least one active instance of each component, which can be achieved through redundancy and failover mechanisms, ensuring that if one component fails, another can seamlessly take over to sustain uninterrupted operation.


In the event of an unexplained instance failure, it’s essential to detect and rectify the issue promptly to prevent recurrence. Implement robust alerting mechanisms that notify relevant personnel of any problems. Furthermore, preemptive action can prevent the escalation of problems caused by resources nearing their limits. In distributed applications, monitoring the communication between loosely coupled components or applications, such as through queues or topics (e.g., MQTT, JMS, Kafka), becomes crucial. Dead letter queues should also be monitored. If the volume of queued messages increases significantly in a short amount of time, it’s a sign that further investigation is necessary. Loose coupling simplifies the monitoring of communication between components and enhances the availability of individual components, as their uptime is not directly dependent on the availability of other components.
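
A minimal version of the queue-depth check described above might look like this, assuming depth is sampled periodically; the window and threshold are illustrative:

```python
def queue_depth_alert(samples, window=3, max_growth=100):
    """Flag when queue depth grew by more than `max_growth` messages over
    the last `window` samples — a sign consumers are falling behind."""
    if len(samples) < window:
        return False
    return samples[-1] - samples[-window] > max_growth
```

In practice the samples would come from the broker’s metrics endpoint, and a positive result would page an operator or trigger consumer scale-out.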

Parallel Communication

To reduce the risk of component or platform failure, implement multiple communication paths from point A to point B. This can involve deploying multiple instances of components on different platforms. Nevertheless, even within an active/passive failover setup, there could be a brief period of downtime while detecting the failure and executing the failover. Cloud providers typically specify recovery time objectives (RTO) and recovery point objectives (RPO) that provide insights into the duration of a disaster recovery event and potential data loss. An active/active solution, in which traffic is routed through diverse paths, can minimize downtime but may increase resource usage, including components, network, and storage. Data synchronization between physically separated instances may also be necessary, depending on your specific requirements and technology stack. In such a case you should also consider the required consistency across instances.
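
An active/active setup can be sketched as issuing the same request over several independent paths and taking the first successful answer, so a failure on one path is masked by the others; this is a simplified illustration that ignores the idempotency and cost concerns noted above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def first_success(callables):
    """Issue the same request over several paths and return the first
    successful result; raise only if every path fails."""
    with ThreadPoolExecutor(max_workers=len(callables)) as pool:
        futures = [pool.submit(c) for c in callables]
        errors = []
        for fut in as_completed(futures):
            try:
                return fut.result()
            except Exception as exc:     # this path failed; try the others
                errors.append(exc)
        raise RuntimeError(f"all paths failed: {errors}")
```

The extra resource usage is the price of eliminating the failover window: the healthy path answers immediately instead of waiting for failure detection.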

The Bus Factor

The bus factor refers to the minimum number of team members who, if they were suddenly unable to work on a project (e.g., getting hit by a bus), could jeopardize its progress or continuity. This emphasizes the need to avoid concentrating knowledge in a single person’s head. Ensuring that project members are replaceable is essential for maintaining constant availability. Documentation plays a pivotal role in this context. While it doesn’t need to be exhaustive, it should provide sufficient information to enable developers to locate and fix bugs efficiently.

Performance and Load Testing

Building a robust platform requires a clear understanding of potential bottlenecks and weak points to address them proactively. Furthermore, if you anticipate a significant increase in load, you must determine whether your platform can handle it through capacity planning. Implementing performance tests can identify areas for improvement before high loads disrupt production. Performance optimization can involve standard component tuning based on suppliers’ recommendations or focusing on specific use cases to pinpoint delays and make targeted improvements. Monitoring individual component performance is crucial for identifying issues quickly, enabling informed decisions about necessary actions.
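
As a minimal sketch of performance measurement (not a substitute for a dedicated load-testing tool), one can time individual requests and report latency percentiles:

```python
import time

def measure_latencies(func, requests):
    """Call `func` once per request and return (p50, p95) latency in ms."""
    latencies = []
    for req in requests:
        start = time.perf_counter()
        func(req)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    return p50, p95
```

Tracking the p95 rather than only the average surfaces the tail latencies that users actually notice, and comparing runs before and after a change shows whether an optimization helped.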

Operational Considerations

The Stack

Comprehensive consideration of the entire technology stack is vital for achieving an ‘always on’ solution. Hypervisor, operating system, container platform, and application server upgrades, as well as application updates, must all occur without causing downtime. Clustering solutions or features like live kernel upgrades (available in some operating systems) can facilitate seamless upgrades. In some cases, deploying a parallel stack alongside the existing one can enable transparent failover.

Local Hosting, IaaS, PaaS

Platform-as-a-Service (PaaS) solutions can simplify infrastructure management and leverage cloud providers’ uptime guarantees. Be mindful of the types of updates performed by cloud providers and design resilient architectures accordingly. For instance, in Microsoft Azure, scheduled and unscheduled maintenance of a component is typically applied concurrently across all availability zones in a region, and can thus cause downtime for that component. One approach to address this challenge is to host the application across distinct paired regions.

Service Level Agreements (SLA)

Understanding the Service Level Agreements (SLAs) of components and platforms is crucial. Real-world availability measurements should align with these SLAs. Consider latency and throughput requirements in addition to uptime guarantees. It’s essential to assess the cumulative availability of chained components, as multiple components with individually high SLAs can still result in lower overall availability when chained. Parallelizing execution can help boost availability, particularly in multi-component or multi-platform scenarios.
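
The effect of parallelizing can be quantified: with redundant paths, the service is only down when every path is down, so the combined availability is one minus the product of the individual downtimes (a simplified model that assumes independent failures):

```python
from math import prod

def parallel_availability(paths):
    """Availability of redundant paths: down only if all paths are down."""
    return 1 - prod(1 - p for p in paths)

# Two independent 99.9% paths combine to roughly "six nines":
print(parallel_availability([0.999, 0.999]))  # ~0.999999
```

This is the mirror image of chained SLAs: serial composition multiplies availabilities downward, while parallel composition multiplies downtimes downward.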

External Influences

Protection against external threats, especially Distributed Denial of Service (DDoS) attacks, is critical for resources exposed through public endpoints. Implement rate limiting, employ protection mechanisms close to the source, and set timeouts for unreliable external APIs. You can also consider measures like IP allowlisting or geo-blocking, or only initiating connections from within your system and not allowing external access at all. When dealing with external APIs, consider both authentication and authorization methods, especially when they entail potentially expensive operations like Active Directory (AD) lookups. Stateless communication methods like UDP may also enhance security, provided proper authentication and authorization are in place (e.g., via DTLS, a protocol, or AEAD, a cryptographic construction). The trade-off is that you lose delivery guarantees.
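
Rate limiting of the kind mentioned above is often implemented as a token bucket; the sketch below is a minimal single-process version, with the clock passed in only to make the example testable:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow at most `rate` requests per second
    on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        current = self.now()
        # Refill tokens proportionally to the time elapsed since last call.
        self.tokens = min(self.capacity,
                          self.tokens + (current - self.last) * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests arriving when the bucket is empty are rejected (or delayed) close to the edge, before they can consume backend resources.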

Proactive Resource Monitoring and Management

Proactive resource monitoring is a vital practice that allows you to detect and resolve potential issues before they have a chance to impact system availability. This involves implementing features like autoscaling and auto recovery to effectively manage dynamic resources. Autoscaling can help in mitigating distributed denial-of-service (DDoS) attacks. Additionally, automatic instance restart and failover measures provide a crucial buffer, allowing you time to address faults and minimize downtime. Even in bare-metal operations, scaling with zero downtime should be a consideration, which includes evaluating hardware-related factors.

Effective Component and Flow Performance Monitoring

Monitoring the performance of individual components and the overall flow of your system is essential for ensuring reliability. Consider implementing heartbeats and performance metrics tracking as part of your monitoring strategy. Heartbeats can identify issues such as communication failures between components, offering insights even when individual components appear functional. Visualizing these heartbeats through dashboards simplifies issue detection. Furthermore, monitoring individual components helps answer supplier inquiries about component availability and uncovers bottlenecks that can affect the entire flow.
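
A heartbeat check of the kind described can be sketched as follows, assuming each component periodically reports a timestamp; the component names are illustrative:

```python
def stale_components(last_heartbeat, now, timeout):
    """Return components whose last heartbeat is older than `timeout`
    seconds — candidates for alerting or automatic restart."""
    return sorted(name for name, ts in last_heartbeat.items()
                  if now - ts > timeout)
```

A dashboard can run this check on every refresh, turning silent communication failures into a visible list of suspect components.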

By integrating these monitoring mechanisms early in your project, you simplify their implementation and accumulate historical data for assessing the effectiveness of various measures. Implementing smart anomaly detection adds an extra layer, alerting you to deviations from normal behavior, which is valuable for addressing both gradual and abrupt changes in system conditions.

Risk Spreading

Diversifying risk can encompass the distribution of components across technologically diverse platforms. This strategy mitigates the impact of problems that may affect a single platform, such as global Azure-wide issues, while leaving other providers largely unaffected. Another method involves dividing customers or data flows among multiple platforms, a practice known as sharding. This approach guarantees uninterrupted service for a segment of users or data flows, even in the event of a platform or component failure. Additionally, geographical diversification, achieved by hosting applications in various regions or data centers, diminishes the likelihood of simultaneous outages caused by natural disasters or localized problems.
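
Sharding of the kind described can be sketched with a stable hash, so each customer consistently lands on the same platform; the platform names are placeholders:

```python
import hashlib

def shard_for(customer_id: str, shards: list[str]) -> str:
    """Deterministically assign a customer to one platform/shard, so a
    platform failure affects only that shard's customers."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

Because the assignment depends only on the customer ID, every component in the system can compute the same routing decision without shared state.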

Supplier Country of Origin

Consider the trustworthiness of external suppliers and evaluate the geopolitical landscape, as alliances and policies can change over time. Dependence on foreign suppliers, especially from countries with shifting policies related to privacy and security, carries risks. Diversifying your technology stack and potentially hosting parts of your application on-premises can mitigate such dependencies. Assess the suitability of specific components, as certain solutions may be challenging to run locally or on alternative cloud platforms.

Disaster Recovery Testing

Regularly test your disaster recovery procedures, ensuring that backups are usable and contain all necessary data for restoration. Testing rollbacks of deployments is also crucial to facilitate swift recovery in case of issues. Establish clear responsibilities and escalation paths for incident response, as manual actions and human decision-making are integral to effective disaster recovery. Having thorough documentation and incident response plans in place before emergencies occur is essential.


Documentation is invaluable for operational teams, especially when troubleshooting issues outside their core expertise. Maintain detailed documentation that includes initial troubleshooting steps, system access procedures, contact information, and escalation paths. This documentation ensures efficient incident resolution and minimizes downtime during critical situations.


In conclusion, achieving an ‘always on’ software system demands meticulous planning and proactive measures at every stage, from development to operations. Prioritize resilience, redundancy, and continuous monitoring to ensure your system remains highly available, even in the face of unforeseen challenges. By considering these key factors, you can build and maintain software systems that truly deliver on the promise of being ‘always on.’
