A space-filling sampling approach for collective classification of social media data
Gobbo, Emiliano del; Fontanella, Lara; Ippoliti, Luigi; Zio, Simone Di; Fontanella, Sara; Cucco, Alex
2026-01-01
Abstract
Node classification on graph data is an important problem in many real-world applications. However, it requires labels for training, which can be difficult or expensive to obtain in practice. Consequently, typically only a small fraction of the accessible data is labeled. Recognizing this limitation, we consider the problem of spreading the labels from a small, carefully chosen set of labeled data, also referred to as seeds, to a larger set of unlabeled data. Based on the common graph smoothness assumption, we cast this classification problem within the semi-supervised learning framework and propose a graph sampling design strategy for the seeds to improve the performance of the well-known label propagation algorithm. In particular, we show that more accurate predictions can be achieved if the seeds are “optimally” spread over the graph by means of a space-filling design, a sampling strategy particularly suited to cases in which no other attributes are available on the nodes. Both theoretical results and competitive experimental results on a variety of simulations and a real-world dataset demonstrate the effectiveness of the proposed methodology.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
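The idea described in the abstract, spreading seeds over the graph and then running label propagation, can be illustrated with a minimal sketch. The abstract does not specify the design criterion, so the greedy maximin (farthest-point) rule on hop distance used here is only an assumed stand-in for a space-filling design, and the function names `bfs_distances`, `maximin_seeds`, and `propagate_labels` are hypothetical. The propagation step is the standard neighbor-averaging scheme with seed labels clamped, not necessarily the exact variant used by the authors.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def maximin_seeds(adj, k, start):
    """Greedy farthest-point choice of k seeds (assumed space-filling heuristic).

    Assumes k is not larger than the number of nodes.
    """
    seeds = [start]
    nearest = bfs_distances(adj, start)  # distance to the closest seed so far
    while len(seeds) < k:
        # Pick the node farthest from all current seeds; ties go to the
        # smallest node identifier. Unreachable nodes count as infinitely far.
        candidate = max(sorted(adj), key=lambda n: nearest.get(n, float("inf")))
        seeds.append(candidate)
        for node, d in bfs_distances(adj, candidate).items():
            nearest[node] = min(nearest.get(node, float("inf")), d)
    return seeds


def propagate_labels(adj, seed_labels, classes, n_iter=100):
    """Neighbor-averaging label propagation with seed labels held fixed."""
    scores = {n: {c: 0.0 for c in classes} for n in adj}
    for n, c in seed_labels.items():
        scores[n][c] = 1.0
    for _ in range(n_iter):
        updated = {}
        for n in adj:
            if n in seed_labels or not adj[n]:
                updated[n] = scores[n]  # seeds and isolated nodes stay as-is
            else:
                updated[n] = {
                    c: sum(scores[m][c] for m in adj[n]) / len(adj[n])
                    for c in classes
                }
        scores = updated
    # Predicted class = highest score; ties go to the first class in `classes`.
    return {n: max(classes, key=lambda c: scores[n][c]) for n in adj}


# Toy example: a path graph 0-1-2-3-4-5 whose two halves have different labels.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
truth = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
classes = ["a", "b"]

spread = maximin_seeds(adj, k=2, start=0)  # seeds at the two ends: [0, 5]
spread_pred = propagate_labels(adj, {s: truth[s] for s in spread}, classes)

clustered = [0, 1]  # two adjacent seeds, both from class "a"
clustered_pred = propagate_labels(adj, {s: truth[s] for s in clustered}, classes)
```

On this toy graph the spread-out seeds recover all six labels, whereas the two adjacent seeds never observe class "b" and so assign "a" to every node. This is only a small illustration of why seed placement matters, not a reproduction of the paper's experiments.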


