Botnets are one of the major cyber infections used in several criminal activities. In most botnets, a Domain Generation Algorithm (DGA) is used by bots to make DNS queries aimed at establishing the connection with the Command and Control (C&C) server. The identification of such queries by monitoring the network DNS traffic is then crucial for bot detection. In this paper we present a methodology to detect DGA generated domain names based on a supervised machine learning process, trained with a dataset of known benign and malicious domain names. The proposed approach represents the domain names through a set of features which express the similarity between the 2-grams and 3-grams in a single unclassified domain name and those in domain names known as malicious or benign. We used the Kullback-Leibner divergence and the Jaccard Index to estimate the similarity, and we tested different machine learning algorithms to classify each domain name as benign or DGA-based (with both binary and multi-class approach). The results of our experiments demonstrate that the proposed methodology, which only exploits lexical features of domain names, attains a good level of accuracy and results in a general model able to classify previously unseen domains in an effective way. It is also able to outperform some of the state-of-the-art featur eless classification methods based on deep learning. © 2020 Elsevier Ltd

Algorithmically generated malicious domain names detection based on n-grams features

Morbidoni C.;
2021

Abstract

Botnets are one of the major cyber infections used in several criminal activities. In most botnets, a Domain Generation Algorithm (DGA) is used by bots to make DNS queries aimed at establishing the connection with the Command and Control (C&C) server. The identification of such queries by monitoring the network DNS traffic is then crucial for bot detection. In this paper we present a methodology to detect DGA generated domain names based on a supervised machine learning process, trained with a dataset of known benign and malicious domain names. The proposed approach represents the domain names through a set of features which express the similarity between the 2-grams and 3-grams in a single unclassified domain name and those in domain names known as malicious or benign. We used the Kullback-Leibner divergence and the Jaccard Index to estimate the similarity, and we tested different machine learning algorithms to classify each domain name as benign or DGA-based (with both binary and multi-class approach). The results of our experiments demonstrate that the proposed methodology, which only exploits lexical features of domain names, attains a good level of accuracy and results in a general model able to classify previously unseen domains in an effective way. It is also able to outperform some of the state-of-the-art featur eless classification methods based on deep learning. © 2020 Elsevier Ltd
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0957417420311957-main.pdf

Solo gestori archivio

Descrizione: Article
Tipologia: PDF editoriale
Dimensione 3.5 MB
Formato Adobe PDF
3.5 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11564/754841
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 5
social impact