Faults in a network can cause downtime or severely degrade performance. That is why it is crucial to detect them automatically and, where possible, remediate them or notify an administrator via email, SMS or a push message in an app. A management system can detect faults through events it receives from devices (e.g. SNMP traps). Faults can also be detected or predicted by analysing telemetry data, which is also called trend analysis.
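As a rough illustration of trend analysis, the sketch below fits a straight line through a series of utilisation samples and estimates when a threshold would be crossed. It is only a minimal example under stated assumptions: the function names, the daily sampling and the 80 % threshold are hypothetical, and a real management system would likely use more robust forecasting.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(
    days: Sequence[float],
    utilisation_pct: Sequence[float],
    threshold_pct: float,
) -> Optional[float]:
    """Estimate the days left (after the last sample) until the trend line
    reaches threshold_pct. Returns None if the trend is flat or falling."""
    slope, intercept = fit_line(days, utilisation_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    # Clamp to zero if the trend line is already above the threshold.
    return max(0.0, crossing_day - days[-1])


# Hypothetical samples: link utilisation (%) measured once per day.
sample_days = [0, 1, 2, 3, 4]
sample_util = [50, 52, 54, 56, 58]
remaining = days_until_threshold(sample_days, sample_util, threshold_pct=80)
```

With these sample values the utilisation grows by two percentage points per day, so the estimate could be used to raise a warning well before the link is actually saturated.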
1) Reduce downtime through infrastructure monitoring
A proactive way to minimize downtime and reduce the risk of security incidents is proper configuration management. This includes backing up configurations and tracking changes. Ideally, it also provides provisioning functionality linked with an approval or review process.
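Change tracking can be pictured as comparing the latest configuration backup with the previous one. The following is a minimal sketch using Python's standard `difflib` module; the function name, the device name and the sample configuration text are made up for illustration and do not reflect how any particular tool stores or compares backups.

```python
import difflib
from typing import List


def config_diff(old: str, new: str, name: str = "device") -> List[str]:
    """Return a unified diff between two configuration backups.

    An empty list means the configuration did not change.
    """
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"{name} (previous backup)",
            tofile=f"{name} (current backup)",
            lineterm="",
        )
    )


# Hypothetical backups of the same device taken at two points in time.
previous = "hostname r1\ninterface eth0\n description uplink\n"
current = "hostname r1\ninterface eth0\n description uplink to core\n"

for line in config_diff(previous, current, name="r1"):
    print(line)
```

A non-empty diff could then be attached to a notification or fed into a review step, in line with the approval process mentioned above.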
To ensure performance stays at the targeted level, it needs to be monitored. Usually, network devices are polled at regular intervals to collect metrics and statistics. Modern network devices can also stream telemetry data (e.g., using gRPC), which is more efficient and lets a management system receive data without being bound to the common 5-minute polling interval. This data supports informed decisions, e.g., adding more bandwidth to a link.
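To make the polling case concrete, the sketch below derives link utilisation from two readings of an interface octet counter taken one polling interval apart. It is an illustration only: the function name and sample values are assumptions, and the wrap handling presumes the counter rolled over at most once between the two polls.

```python
def link_utilisation_pct(
    prev_octets: int,
    curr_octets: int,
    interval_s: float,
    link_bps: float,
    counter_bits: int = 64,
) -> float:
    """Return average link utilisation in percent between two counter polls."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Modulo arithmetic handles a single counter wrap between the polls.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Hypothetical example: a 1 Gbit/s link polled every 5 minutes (300 s).
utilisation = link_utilisation_pct(
    prev_octets=0,
    curr_octets=3_750_000_000,
    interval_s=300,
    link_bps=1_000_000_000,
)
```

Note that the result is an average over the whole interval: a short burst inside a 5-minute window is smoothed out, which is one reason shorter intervals or streaming telemetry can give a more accurate picture.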
2) Some tools for managing and monitoring your infrastructure
These functions can be implemented with various tools. ngworx.ag engineers have successfully implemented several free tools that each cover one of these topics. These include, among others, Cacti (collecting performance data via SNMP), Icinga (availability monitoring), Influx (time series database) and NetShot (configuration backup). Another option is StableNet from Infosim, which can cover all these topics in a single tool.