The previous posts in this series sketched out how the route from 10 Gbps to 100 Gbps and beyond approaches the theoretical capacity limit of a DWDM channel. Any system operated at the edge of the envelope tends to fail spectacularly, and high-capacity optics are no exception. Lower-bit-rate transceivers had a narrow range of degraded operation, where bit error rate (BER) would increase as the received signal level approached the lower limit. As we push channel capacity to the limit, operating margins are reduced, and the margin for error all but disappears.
From Information Theory 101, we know that increasing throughput 10x, from 10 Gbps to 100 Gbps, would require a 10x improvement in OSNR, all other things being equal. Transmitting 100 Gbps with more sophisticated PM-QPSK modulation, rather than simple OOK, provided a 4x reduction in symbol rate by coding two bits per symbol in each of two polarization modes. That left a 2.5x gap that needed to be filled for full backwards compatibility of 100 Gbps waves on existing systems designed for 10 Gbps per DWDM channel.
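The 10x, 4x, and 2.5x factors above are simple enough to sanity-check in a few lines. This sketch just restates them in decibels; the `db` helper is ours, not part of any library:

```python
import math

def db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10 * math.log10(ratio)

# Scaling throughput 10x (10 Gbps -> 100 Gbps) with the same modulation
# would need a 10x (10 dB) OSNR improvement.
osnr_gap_naive = db(10)

# Packing 4 bits/symbol (2 bits/symbol x 2 polarizations) vs. 1 for OOK
# cuts the symbol rate -- and the required OSNR -- by 4x (~6 dB).
modulation_gain = db(4)

# The remaining gap that FEC has to close: 10/4 = 2.5x (~4 dB).
remaining_gap = db(10 / 4)

print(f"naive OSNR gap:  {osnr_gap_naive:.1f} dB")
print(f"modulation gain: {modulation_gain:.1f} dB")
print(f"left for FEC:    {remaining_gap:.1f} dB")
```

In dB terms, the coding gain FEC must deliver to preserve backwards compatibility is roughly 4 dB.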
If this OSNR gap could not be filled, then deployment of 100 Gbps waves would require costly and disruptive re-engineering of installed networks, limiting its utility. Once again, technology originally developed and deployed for wireless communications provided a solution. The secret weapon used to close this gap was improved forward error correction (FEC). But FEC is a double-edged sword.
By adding redundant bits to the bit stream, FEC allows bit errors from forward transmission to be corrected at the receiver, without requesting retransmission on a backward channel. This is analogous to RAID arrays in disk storage. By including an additional disk drive, and adding redundant data to each disk, a RAID array can tolerate complete failure of any one disk without data loss. Likewise, by breaking a bit stream into blocks and adding redundant bits, FEC can correct a limited number of random bit errors, recovering the corrupted receive data as originally transmitted, without loss.
But like everything else, FEC has limits. For a given number of redundant bits added, a corresponding number of bit errors can be corrected. Once the input bit error rate reaches a particular FEC algorithm’s limit, the error correction process breaks down, and bit errors appear in the output data. The FEC algorithm fails completely if the bit error rate increases further, and the output data becomes unusable. This catastrophic failure mode exacerbates the so-called “cliff effect” of rapid degradation in digital transmission on noisy links.
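A toy example makes the correction limit concrete. The sketch below uses a Hamming(7,4) code — far simpler than any code used in optical transport, but it exhibits the same behavior: one flipped bit per block is corrected perfectly, while two flipped bits exceed the code's limit and are silently miscorrected:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
cw = hamming74_encode(data)

one_err = list(cw); one_err[3] ^= 1                    # one flipped bit
two_err = list(cw); two_err[3] ^= 1; two_err[5] ^= 1   # two flipped bits

print(hamming74_decode(one_err) == data)   # corrected
print(hamming74_decode(two_err) == data)   # miscorrected: beyond the limit
```

Note that the two-error case does not merely fail to correct — the decoder confidently “fixes” the wrong bit, making the output worse. Stronger codes push this breakdown point further out, but every code has one.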
Without FEC, the bit error rate would increase more gradually as the OSNR decreased. With FEC, the BER remains near zero as the OSNR degrades, because the algorithm cleans up low-level bit errors. Once the received BER stretches the FEC algorithm’s ability to compensate, however, each small decrease in OSNR produces a far larger increase in output BER than it would without FEC. So FEC delays the onset of degraded performance, but it can only do this by reducing the margin for error.
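The cliff shows up in a back-of-the-envelope calculation. Assume a generic block code over n bits that corrects up to t bit errors per block; the n = 255, t = 8 parameters below are illustrative only, loosely inspired by the RS(255,239) code of first-generation optical FEC (which actually corrects symbol errors, not bit errors). The post-FEC error rate is vanishingly small at low input BER, then climbs steeply:

```python
from math import comb

def post_fec_ber(p, n=255, t=8):
    """Upper bound on output BER for a block code over n bits that
    corrects up to t bit errors per block. A block fails when more
    than t of its n bits are in error; a bit can only be wrong if
    its block fails, so per-bit BER <= P(block failure)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(t + 1, n + 1))

for p_in in (1e-4, 1e-3, 3e-3, 1e-2):
    print(f"input BER {p_in:.0e} -> post-FEC BER <= {post_fec_ber(p_in):.2e}")
```

Running this shows output error rates around 10^-20 at an input BER of 10^-4, but only around 10^-3 at an input BER of 10^-2 — two decades of input degradation spanning the full range from effectively perfect to unusable.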
Getting throughput closer to the theoretical OSNR limit requires more efficient FEC algorithms. With these more efficient algorithms, bit errors are corrected down to an even lower level of OSNR. FEC does not move the theoretical OSNR limit, however; it just allows error-free operation closer to that edge. Once OSNR approaches the limit, the more efficient FEC algorithm still breaks down, but the slippery slope is even steeper.
The key takeaway here is that empirical “plug-and-pray” deployments of optical gear become even more untenable as data rates increase, leading to brick-wall failure modes that provide little or no warning of impending failure. Many operators have foolishly relied on degradation of output BER to serve as a warning system. Increasing dependence on FEC to improve throughput makes this pure folly.
Without proper design up front, rigorous validation of the as-built system against the design parameters, and constant vigilance over the system lifetime, reliable operation will remain an elusive goal. The margin for degraded operation, where intervention can preempt catastrophic failure, becomes vanishingly small as the channel capacity is stretched. Poor practices that have worked in the past will no longer produce the desired results.
The rapid increase in BER near the OSNR limit with FEC does not matter in the case of a fiber cut, but this sudden failure mode is relatively rare. It is much more common to see a gradual degradation of the fiber link over a span of days to months. This can be caused by an accumulation of many small macro-bending losses over time, or by a single mechanical instability that slowly worsens (e.g. a loose connector, cracked fiber, or kinked cable). With proper network performance monitoring, the erosion in optical margin or quality factor (Q-factor) can be detected and addressed at the network operations level in the normal course of business.
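As a sketch of what that monitoring might look like — all names, readings, and thresholds below are hypothetical — the snippet raises an alarm when a trailing average of daily Q-factor measurements erodes to within a set margin of the design limit, well before the FEC cliff is reached:

```python
def q_margin_alarm(samples, design_q_db, alarm_margin_db=2.0, window=7):
    """Flag when the trailing average of measured Q (dB) erodes to
    within alarm_margin_db of the design limit. Returns the index of
    the first alarming sample, or None if margin stays healthy."""
    for i in range(window, len(samples) + 1):
        avg = sum(samples[i - window:i]) / window
        if avg <= design_q_db + alarm_margin_db:
            return i - 1
    return None

# Hypothetical daily Q readings (dB) on a slowly degrading span:
q_daily = [17.8, 17.7, 17.6, 17.4, 17.1, 16.8,
           16.4, 16.0, 15.5, 15.1, 14.6, 14.2]

idx = q_margin_alarm(q_daily, design_q_db=14.0)
print(f"alarm raised on day {idx}" if idx is not None else "margin healthy")
```

The windowed average is deliberately dumb; the point is that Q-factor trends are actionable while output BER is still a flat zero, so the alarm fires while there is still time to dispatch a technician.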
Without proactive maintenance, problems propagate up the layers of the network stack. Adverse influences accumulating in the network at layer-0 eventually produce bit errors at layer-1. In an IP network, this causes CRC errors at layer-2 that require packet retransmission under TCP at layer-4. This leads to sluggish application performance at layer-7, which generates angry phone calls at layer-8. At this point, the problem is no longer a purely technical issue, because too many people outside the networking organization are adversely affected.
With FEC, this cascading failure chain snaps more quickly. The next post in this series will address how to make FEC an asset, rather than a liability, and expand on improving network reliability as more complex transmission schemes are necessarily employed to increase fiber capacity.