Backbone incident

Resolved

Partial outage

Reported 2 months ago. Lasted approximately 7 hours.

Affected

Backbone

Updates

Update
Friday 28/02/2025 at 16:03
Network Incident Report - 14/02/2025
Event Timeline
- 10:24: Detection of multiple fluctuations between several network devices, likely caused by an intervention in a data center.
- 10:29: Reporting of anomalies in the system infrastructure following the fluctuations.
- 11:00: Numerous alerts persist in the monitoring tool. Some devices appear unreachable, though no physical disconnection is detected.
- 11:04: An impacted OLT is selected for analysis:
  Accessible from administrative bastions.
  Unreachable from the monitoring system.
  A traceroute performed from the monitoring server stops at a specific network device.
- 11:10: Temporary deactivation of the direct physical link between two identified network devices.
  The OLT becomes reachable again from the monitoring server.
  After reactivating the link, the OLT becomes unreachable again, narrowing down the diagnostic focus to a specific network area.
- 11:18: Log analysis of a router reveals saturation of FIB resources.
```
RP/0/RP0/CPU0:Feb 14 09:24:30.259 UTC: fib_mgr[311]: %ROUTING-FIB-4-RSRC_LOW : FIB running low on DATA_TYPE_TABLE_SET resource memory. FIB will now begin resource constrained forwarding. Only route deletes will be handled in this state, which may result in mismatch between RIB/FIB. Traffic loss on certain prefixes can be expected. FIB will automatically resume normal operation, once the resource utilization returns to normal level.
```
- 11:41: Decision to restart a network device to free up resources.
- 11:44: Restart successfully completed.
  Traffic returns to normal operation.
  System infrastructure stabilizes.
- 11:57: Confirmation of a return to normal under increased monitoring.
- 12:30: Reports of disruptions on L2 circuits in a specific area.
- 13:03: Detection of a RIB/FIB inconsistency on a network device still affecting some services.
- 13:23: Technical management requests a log extraction before any further restart operation.
- 13:27: Request for validation before restarting the device.
- 14:03: Follow-up with support and operations teams.
- 14:54: Restart approved by technical management.
- 14:56: Implementation of the max-metric router-lsa directive in OSPF configuration to isolate the device before restart.
  Services restored via an alternative path.
- 14:57: Restart canceled as applying the directive resolved the RIB/FIB inconsistency.
```
RP/0/RP0/CPU0:Feb 14 13:56:34.603 UTC: fib_mgr[346]: %ROUTING-FIB-6-RSRC_OK : FIB resource state has returned to normal. FIB has exited resource constrained operation and normal forwarding has been restored.
```
- 14:59: Removal of the max-metric router-lsa configuration, normal state confirmed.
- 15:01: The device is fully reintegrated into the OSPF topology. FIB resources are stable and available again. Increased monitoring is maintained.
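The router logs quoted in the timeline are stamped in UTC, while the timeline entries appear to be in local time (13:56 UTC in the log corresponds to the 14:56 entry). The following is a minimal sketch of how such a line could be parsed and aligned with the timeline. The regular expression is derived only from the two log lines shown above, and the year (2025) and the UTC+1 offset are assumptions, since neither appears in the log line itself.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumptions (not present in the log lines): the year is 2025 and the
# timeline uses CET (UTC+1), the local time in France in February.
ASSUMED_YEAR = 2025
TIMELINE_TZ = timezone(timedelta(hours=1))

# Pattern inferred from the two log lines quoted in the report only.
LOG_RE = re.compile(
    r"^(?P<node>\S+?):(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC: "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"%(?P<category>[A-Z_]+)-(?P<group>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+) : "
    r"(?P<text>.*)$"
)


def parse_fib_log(line):
    """Return the parsed fields of one log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    utc = datetime.strptime(f"{ASSUMED_YEAR} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
    utc = utc.replace(tzinfo=timezone.utc)
    return {
        "node": m["node"],
        "process": m["process"],
        "severity": int(m["severity"]),
        "mnemonic": m["mnemonic"],
        "utc": utc,
        "local": utc.astimezone(TIMELINE_TZ),
    }


# Second log line from the report, split across string literals for readability.
RSRC_OK_LINE = (
    "RP/0/RP0/CPU0:Feb 14 13:56:34.603 UTC: fib_mgr[346]: "
    "%ROUTING-FIB-6-RSRC_OK : FIB resource state has returned to normal. "
    "FIB has exited resource constrained operation and normal forwarding "
    "has been restored."
)

if __name__ == "__main__":
    event = parse_fib_log(RSRC_OK_LINE)
    print(event["mnemonic"], event["local"].strftime("%H:%M"))  # RSRC_OK 14:56
```

Feeding the parsed `mnemonic` and `severity` fields into a supervision tool would be one way to alert on `RSRC_LOW` as soon as it is logged, rather than discovering it during log analysis.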
Analysis and Lessons Learned
Root Causes
The incident resulted from FIB resource saturation on a key network router, leading to RIB/FIB inconsistencies and connectivity losses. The data center intervention contributed to network instability. However, the root cause of L2 interruptions following the initial core device restart remains unidentified.
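To illustrate how the constrained state described in the `RSRC_LOW` log message can produce a RIB/FIB mismatch, here is a toy simulation. It is a sketch only: the `Fib` class, the `program` helper, the documentation prefixes and the next-hop names are all hypothetical, and the model simply encodes the one rule stated in the log ("only route deletes will be handled in this state"). It does not represent the router's actual implementation.

```python
class Fib:
    """Toy forwarding table; not a model of any real router implementation."""

    def __init__(self):
        self.routes = {}
        self.constrained = False

    def apply(self, op, prefix, next_hop=None):
        if op == "delete":
            # Deletes are always processed, even in the constrained state.
            self.routes.pop(prefix, None)
        elif op == "add" and not self.constrained:
            self.routes[prefix] = next_hop
        # In the constrained state, adds and updates are silently skipped.


def program(rib, fib, op, prefix, next_hop=None):
    """Apply one route change to the RIB, then hand it to the FIB."""
    if op == "delete":
        rib.pop(prefix, None)
    else:
        rib[prefix] = next_hop
    fib.apply(op, prefix, next_hop)


def mismatched_prefixes(rib, fib):
    """Prefixes whose RIB next hop is missing from, or differs in, the FIB."""
    return sorted(p for p, nh in rib.items() if fib.routes.get(p) != nh)


if __name__ == "__main__":
    rib, fib = {}, Fib()
    program(rib, fib, "add", "192.0.2.0/24", "R1")     # installed normally
    fib.constrained = True                              # resources run low
    program(rib, fib, "add", "198.51.100.0/24", "R2")  # new route, never installed
    program(rib, fib, "add", "192.0.2.0/24", "R3")     # reroute, FIB keeps R1
    print(mismatched_prefixes(rib, fib))
    # ['192.0.2.0/24', '198.51.100.0/24']
```

In this model, a topology change that arrives while the table is constrained leaves the forwarding plane pointing at stale or missing next hops, which is consistent with devices being reachable from some vantage points and not others.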
Corrective Actions
- First Event: Isolating and restarting the core router eliminated the RIB/FIB inconsistencies.
- Second Event: Applying the max-metric router-lsa directive restored stability without requiring a router restart, minimizing operational impact.
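The effect of `max-metric router-lsa` on path selection can be sketched with a small shortest-path computation. With OSPF stub router advertisement (RFC 6987), a router advertises its own transit links at the maximum metric (0xFFFF), so neighbours prefer any alternative path while the router itself stays reachable. The four-node topology and its costs below are hypothetical and are not taken from the ielo network.

```python
import heapq

MAX_LINK_METRIC = 0xFFFF  # 65535, the metric used by OSPF stub router advertisement


def shortest_path(graph, src, dst):
    """Dijkstra over directed costs: graph[u][v] is the cost u advertises toward v."""
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return best[dst], path[::-1]


def with_max_metric(graph, router):
    """Copy of the graph in which `router` advertises every link at the maximum metric."""
    copy = {u: dict(links) for u, links in graph.items()}
    copy[router] = {v: MAX_LINK_METRIC for v in graph[router]}
    return copy


# Hypothetical topology: R is the preferred transit router, X the backup.
TOPOLOGY = {
    "A": {"R": 10, "X": 20},
    "R": {"A": 10, "B": 10},
    "X": {"A": 20, "B": 20},
    "B": {"R": 10, "X": 20},
}

if __name__ == "__main__":
    print(shortest_path(TOPOLOGY, "A", "B"))
    # (20, ['A', 'R', 'B'])
    print(shortest_path(with_max_metric(TOPOLOGY, "R"), "A", "B"))
    # (40, ['A', 'X', 'B'])
```

Only the costs advertised by R change, so A still reaches R directly at cost 10. That matches the use described in the timeline: transit traffic moves to an alternative path while the device remains manageable.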
Responsibility
- First Event: The incident is attributed to a software bug in a version of our core routers (Ielo), due to inadequate FIB resource management. A software and hardware update campaign has been ongoing since August, along with reinforced alert metrics to anticipate such situations.
- Second Event: The factors causing traffic loss on L2 links remain unidentified.
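As an illustration of what reinforced alert metrics could look like, the sketch below raises an alert on FIB resource utilisation before exhaustion and uses hysteresis so that the alert does not flap around a single threshold. The report does not describe the actual metrics or thresholds: the 80 % and 70 % values, the function name and the sample series are all hypothetical.

```python
def alert_states(samples, raise_at=80.0, clear_at=70.0):
    """Return (sample, alerting) pairs using two thresholds (hysteresis).

    The alert is raised when utilisation reaches `raise_at` and is only
    cleared once utilisation falls back to `clear_at` or below.
    Both default thresholds are illustrative, not values from the report.
    """
    alerting = False
    states = []
    for pct in samples:
        if not alerting and pct >= raise_at:
            alerting = True
        elif alerting and pct <= clear_at:
            alerting = False
        states.append((pct, alerting))
    return states


if __name__ == "__main__":
    # Hypothetical utilisation samples, in percent.
    for pct, alerting in alert_states([60, 75, 82, 78, 72, 69, 85]):
        print(f"{pct:>3}% {'ALERT' if alerting else 'ok'}")
```

With these sample values the alert is raised at 82 %, held through 78 % and 72 %, cleared at 69 %, and raised again at 85 %.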
Conclusion
The incident on 14/02/2025 was caused by FIB resource saturation on a core network router, resulting in RIB/FIB inconsistencies and connectivity losses. Following isolation and configuration restoration, impacted devices were recovered. This intervention triggered a second event in another network section, causing L2 service interruption. Applying the max-metric router-lsa directive resolved the issue without requiring a restart. Enhanced monitoring is in place to prevent recurrence.
Resolved
Friday 14/02/2025 at 16:18
Hello,
Since our last intervention at 14:57, we have observed no further impact. As a result, we are closing this generic incident.
We apologize for any inconvenience caused and thank you for your understanding.
Regards
Update
Friday 14/02/2025 at 14:26
Hello,
Our teams have brought the core network device back to its standard state after checking its configuration and rebooting it. This restoration was completed at 14:57 local time.
At this stage, we do not observe any impact. However, we will continue to actively monitor the situation until 18:00.
Regards
Monitoring
Friday 14/02/2025 at 10:54
Hello,
We suspect that an active device in our core network is the source of the incident. This device has been isolated, and we are observing an improvement in the situation.
We are currently assessing the status of reported services based on incident tickets. Meanwhile, our engineering and operations teams are continuing their investigation on the identified equipment.
We will keep you informed of any significant updates.
Regards
Detected
Friday 14/02/2025 at 09:24
Hello,
A disruption affecting our backbone infrastructure has been identified. We are observing interruptions on certain L2 links within our core. We are also observing cases of unidirectional traffic between certain operator backhaul links and the access routers.
Our teams are mobilised to analyse and resolve the incident as quickly as possible. We will keep you informed of any significant developments.
Regards
ielo Operations Department
