Network Redundancy & Resilience
True network resilience is not defined by the presence of spare parts, but by predictable, bounded system behaviour during and after a failure - ensuring operations continue within known, acceptable limits.
Designing Networks That Continue to Operate When Things Go Wrong
The Redundancy Fallacy: When Spare Components Don't Create Resilience
Many industrial networks possess redundancy in name only - their recovery mechanisms are slow, disruptive, or unpredictable, causing operational failure despite the presence of backup equipment.
The critical distinction lies between redundancy (having alternative components) and resilience (the system's ability to absorb failure and continue functioning). A network can be fully redundant yet completely non-resilient if the process of failing over introduces latency spikes, packet reordering, or control instability that the connected operational technology (OT) cannot tolerate.
In these scenarios, the redundancy mechanism itself becomes the source of the outage. The problem is not a lack of backup paths, but a lack of architectural discipline around failure behaviour. Resilience requires engineering the network's response to faults with the same rigour as its normal operation.
Why Deterministic Recovery Is Non-Negotiable
For control and safety systems, the time bound of recovery is often more critical than the fact of recovery. Resilience is measured in milliseconds of acceptable disruption, not minutes of eventual restoration.
Signalling sequences, closed-loop control, and protection schemes operate within strict timing windows. If network communication is interrupted beyond these bounds - even if connectivity is restored seconds later - the control system will likely enact a fail-safe response, such as tripping a motor or halting a production line. The network has technically recovered, but the operation has failed.
Therefore, resilient network design mandates bounded recovery times. This is achieved through protocols and topologies engineered for fast, predictable convergence, such as Parallel Redundancy Protocol (PRP), High-availability Seamless Redundancy (HSR), or deterministic Ethernet rings with sub-50ms switchover guarantees.
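As an illustration, a sub-50ms switchover guarantee can be treated as a hard budget: resilience is judged on the worst observed switchover, not the average. The following minimal Python sketch makes that check explicit (the 50ms budget and the sample values are illustrative assumptions, not vendor specifications):

```python
# Sketch: checking measured switchover times against a control-system
# recovery budget. Resilience is bounded by the worst case, not the mean.

RECOVERY_BUDGET_MS = 50  # e.g. a deterministic-ring switchover guarantee

def within_budget(switchover_samples_ms, budget_ms=RECOVERY_BUDGET_MS):
    """Return (worst_case, ok): the design passes only if every
    observed switchover stays inside the budget."""
    worst = max(switchover_samples_ms)
    return worst, worst <= budget_ms

# Ten simulated failover drills (milliseconds to restore traffic)
samples = [12, 18, 9, 31, 22, 14, 47, 11, 26, 19]
worst, ok = within_budget(samples)
print(f"worst-case switchover: {worst} ms, within budget: {ok}")
```

A single outlier above the budget fails the whole design, which mirrors how a control system experiences the network: one late recovery is one trip.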
Designing with Failure as the Primary Input
Resilient architecture begins not with designing for success, but by analysing and designing for known failure modes. This inverts the traditional "design-then-harden" approach.
Effective failure analysis for industrial networks must consider, at a minimum:
- Single points of failure at every layer - physical routing, power, active devices, and end devices.
- Correlated failures between duplicated components that share a conduit, power source, firmware version, or environment.
- The behaviour of the failover itself: switchover time, and any latency, jitter, or path changes introduced by the backup route.
- The erosion of the original failure assumptions over the network lifecycle through uncontrolled change.
Achieving End-to-End Resilience, Not Isolated Redundancy
Partial redundancy creates illusory safety. True resilience requires examining and securing every link in the operational chain, from power supply to the endpoint device.
A common failure pattern is the "weakest link" scenario: dual core switches with a single power supply; diverse fibre routes that converge into the same cabinet; or redundant network paths leading to a single-homed PLC. Resilience demands an end-to-end audit of the entire data path.
| Network Element | Common Partial Redundancy Pitfall | Resilient Design Principle |
|---|---|---|
| Physical Layer | Dual fibres in the same conduit or cable sheath. | Diverse physical routing (different conduits, opposite sides of corridor). |
| Power | Redundant switches fed from the same UPS or electrical panel. | Diverse power sources (different circuits, separate UPS/generator). |
| Active Devices | Stacked or clustered switches sharing a control plane that can fail. | Fully independent devices with deterministic failover protocols (PRP, DLR). |
| End Devices | Critical PLC or RTU with only one network interface. | Dual-homed devices or use of redundant protocol end-stations. |
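The end-to-end audit this table implies can be sketched as a comparison of the dependencies behind each path: any resource shared by the primary and secondary is a correlated failure point. A minimal Python model follows (the element names and attributes are illustrative; a real audit would draw this data from asset or cable records):

```python
# Sketch: auditing a modelled primary/secondary path pair for shared
# dependencies - each match is "redundancy in name only".

def shared_dependencies(primary, secondary):
    """Return the attributes where both paths rely on the same resource."""
    findings = []
    for key in primary:
        if key in secondary and primary[key] == secondary[key]:
            findings.append((key, primary[key]))
    return findings

# Illustrative path models: same conduit and same UPS despite dual switches
primary   = {"conduit": "east-duct", "power": "UPS-A", "switch": "SW-1"}
secondary = {"conduit": "east-duct", "power": "UPS-A", "switch": "SW-2"}

for attr, resource in shared_dependencies(primary, secondary):
    print(f"single point of failure: both paths share {attr} = {resource}")
```

Even this trivial model catches the "dual fibres in the same conduit" and "same UPS" pitfalls from the table above; the value is in forcing every dependency to be written down.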
The Critical Role of Diversity in Reducing Correlated Risk
Duplication is not diversity. Two identical components sharing the same risk profile will likely fail together. Diversity deliberately introduces differences to break failure correlation.
Strategic diversity does not always mean doubling costs. It means making deliberate choices to separate risk. For example:
- Route Diversity: Running primary and secondary fibres on opposite sides of a rail line or using separate poles.
- Technology Diversity: Using fibre as the primary backbone and licensed microwave as the secondary, avoiding a common cut risk.
- Vendor/Model Diversity: Using different switch models for primary and backup to avoid a common firmware or hardware bug affecting both.
- Operational Diversity: Scheduling maintenance on redundant paths at different times.
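The cost of skipping diversity can be made concrete with simple arithmetic: two truly independent units fail together with probability p², while correlated duplicates fail together far more often. The figures below are illustrative assumptions, not field failure rates, and the mixture model is a deliberate simplification:

```python
# Sketch: joint failure probability for independent vs correlated duplicates.

p_fail = 0.01   # assumed annual failure probability of one component
rho = 0.8       # assumed correlation between identical units
                # (shared firmware, shared conduit, shared environment)

# Fully independent duplicates: both must fail on their own.
p_both_independent = p_fail * p_fail

# Correlated duplicates, simple mixture model:
# P(both) = rho * p + (1 - rho) * p^2
p_both_correlated = rho * p_fail + (1 - rho) * p_fail ** 2

print(f"independent pair : {p_both_independent:.6f}")
print(f"correlated pair  : {p_both_correlated:.6f}")
```

Under these assumptions the correlated pair is roughly eighty times more likely to fail together, which is why duplication without diversity buys so little.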
Preserving Behavioural Consistency During Failover
Some redundancy mechanisms restore connectivity but alter fundamental network characteristics - such as latency, jitter, or path symmetry - in ways that destabilise applications.
In industrial networks, consistency is safety. A control system tuned for a 2ms round-trip latency may malfunction if a failover introduces a 20ms latency, even though the link is technically "up." Resilient design selects and configures redundancy mechanisms to preserve these critical behavioural parameters.
This often means avoiding protocols that rely on complex reconvergence (like traditional spanning-tree) in favour of those that maintain active-active parallel paths (like PRP) or that guarantee sub-millisecond switchover with no change in path characteristics.
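A failover can therefore be judged against behavioural bounds, not just link state: the backup path must deliver latency and jitter the control loop was tuned for. A minimal sketch (the thresholds and latency samples are illustrative assumptions):

```python
# Sketch: a failover "succeeds" only if post-failover behaviour stays
# within the bounds the application was engineered against.

def behaviour_preserved(after_ms, max_latency_ms, max_jitter_ms):
    """Check post-failover latency samples against behavioural bounds."""
    latency = max(after_ms)
    jitter = max(after_ms) - min(after_ms)
    return latency <= max_latency_ms and jitter <= max_jitter_ms

after_good = [2.2, 2.1, 2.3, 2.2]      # active-active parallel path
after_bad  = [20.5, 19.8, 21.2, 20.1]  # reconverged backup route, link "up"

print(behaviour_preserved(after_good, max_latency_ms=5, max_jitter_ms=1))
print(behaviour_preserved(after_bad,  max_latency_ms=5, max_jitter_ms=1))
```

The second case is the 2ms-to-20ms scenario described above: connectivity is restored, but the behavioural contract is broken.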
Operational Simplicity and the Testability Imperative
Complexity is the enemy of resilience. Overly intricate redundancy designs are difficult to maintain, understand, and - most critically - test.
A resilient architecture must be inherently understandable and testable. Network operators should be able to predict the exact sequence of events during a failure. This requires clear documentation, logical topologies, and built-in mechanisms for safe, non-disruptive testing.
Regular, scheduled failover testing is not an optional best practice; it is a core operational requirement. It validates the design, builds team confidence, and uncovers hidden dependencies that have crept in over time. A network that cannot be safely tested cannot be considered resilient.
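A scheduled drill can be framed as a pass/fail test against the designed recovery budget. The `Network` class below is a toy stand-in for real orchestration and monitoring hooks; the timings are illustrative:

```python
# Sketch of a failover drill: disable the primary path (simulated) and
# assert traffic recovers on the secondary within the designed bound.

import time

class Network:
    """Toy model: switching to the secondary path takes `failover_ms`."""
    def __init__(self, failover_ms):
        self.failover_ms = failover_ms
        self.active = "primary"

    def fail_primary(self):
        time.sleep(self.failover_ms / 1000)  # stand-in for real switchover
        self.active = "secondary"

def failover_drill(net, budget_ms):
    """Pass only if the secondary takes over within the budget."""
    start = time.monotonic()
    net.fail_primary()
    elapsed_ms = (time.monotonic() - start) * 1000
    return net.active == "secondary" and elapsed_ms <= budget_ms

net = Network(failover_ms=20)
print("drill passed:", failover_drill(net, budget_ms=50))
```

In practice the same structure applies with real hooks: trigger the fault through a controlled mechanism, measure recovery from the application's point of view, and record the result so drift is visible over time.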
Lifecycle Management of Resilience
Resilience degrades silently over the network lifecycle through ad-hoc changes, expansions, and technology refreshes that were not evaluated for their impact on the overall fault model.
Maintaining resilience requires governance. Each proposed change must be assessed against the original failure assumptions. Does a new server connected to both network rings introduce a new bridging loop risk? Does a software upgrade alter the failover timing? A formal change control process that includes a resilience impact assessment is essential to prevent the gradual erosion of designed-in safety.
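Such a resilience impact assessment can be enforced as a simple gate in change control. The checklist fields below are illustrative, mirroring the questions in this section; a real process would tie each to evidence:

```python
# Sketch: a resilience impact gate - a change is approved only if every
# assessment question has been answered affirmatively.

REQUIRED_CHECKS = (
    "failure_modes_reviewed",    # change assessed against original fault model
    "failover_timing_retested",  # e.g. after a software/firmware upgrade
    "no_new_shared_dependency",  # e.g. a server dual-homed across two rings
)

def change_approved(assessment):
    """Return (approved, missing_checks) for a proposed change."""
    missing = [c for c in REQUIRED_CHECKS if not assessment.get(c)]
    return (len(missing) == 0, missing)

ok, missing = change_approved({"failure_modes_reviewed": True,
                               "failover_timing_retested": True})
print("approved:", ok, "missing:", missing)
```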
Resilience is the measurable ability of a network to maintain predictable, acceptable behaviour in the face of expected - and unexpected - faults.
Throughput Technologies advises on network redundancy and resilience as a systems engineering discipline. We focus on moving beyond component checklists to architect networks with deterministic recovery, deliberate diversity, and end-to-end fault tolerance that aligns with the real-world safety and timing requirements of industrial operations.
Talk with a Resilience & Availability Specialist to conduct a failure mode analysis of your current network and design an architecture that delivers genuine operational confidence.