An Anomaly-based Detection System for Monitoring Kubernetes Infrastructures
Keywords: Anomaly Detection, Cloud Native, Deep Learning, Kubernetes, LATAM-DDoS-IoT Dataset, Machine Learning, One-Class Classification, Online Learning
Network monitoring is crucial for analyzing infrastructure baselines and alerting whenever abnormal behavior is observed. However, human effort is limited in time and scope, since many variables must be considered in real time. In addition, infrastructures such as Kubernetes are complex by nature: they do not rely on fixed equipment from which to gather data, but on distributed, event-driven, and ephemeral containers, which makes metrics difficult to capture and track. Artificial Intelligence models have demonstrated high detection rates for anomaly detection; therefore, there is a need to design and implement a global solution that collects this complex data and orchestrates the whole Machine Learning Operations workflow. This document shares the findings and lessons learned from defining a cloud-native Artificial Intelligence infrastructure at Aligo to develop an anomaly-based detection system for monitoring on-premise Kubernetes infrastructures. Chaos Engineering experiments show that the deployed system is robust when alerting on outliers, and the work has produced an end-to-end infrastructure for conducting future Artificial Intelligence projects at the company.