|
Audio-Visual Object Classification for Human-Robot Collaboration
A. Xompero, Y. L. Pang, T. Patten, A. Prabhakar, B. Calli, A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore and Virtual, 22-27 May 2022.
[abstract]
[paper]
[arxiv]
[challenge webpage]
[bibtex]
Human-robot collaboration requires the contactless estimation of the physical properties of containers manipulated by a person, for example while pouring content in a cup or moving a food box. Acoustic and visual signals can be used to estimate the physical properties of such objects, which may vary substantially in shape, material and size, and also be occluded by the hands of the person. To facilitate comparisons and stimulate progress in solving this problem, we present the CORSMAL challenge and a dataset to assess the performance of the algorithms through a set of well-defined performance scores. The tasks of the challenge are the estimation of the mass, capacity, and dimensions of the object (container), and the classification of the type and amount of its content. A novel feature of the challenge is our real-to-simulation framework for visualising and assessing the impact of estimation errors in human-to-robot handovers.
@InProceedings{Xompero2022ICASSP,
title = {Audio-Visual Object Classification for Human-Robot Collaboration},
author = {Xompero, A. and Pang, Y. L. and Patten, T. and Prabhakar, A. and Calli, B. and Cavallaro, A.},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Singapore},
month = "22--27~" # may,
year = {2022},
}
|
|
Towards safe human-to-robot handovers of unknown containers
Y. L. Pang, A. Xompero, C. Oh, and A. Cavallaro
IEEE International Conference on Robot and Human Interactive Communication (RO-MAN),
Virtual, 8-12 Aug 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
Safe human-to-robot handovers of unknown containers require accurate estimation of the human and object properties, such as hand pose, object shape, trajectory, and weight. However, accurate estimation requires the use of expensive equipment, such as motion capture systems, markers, and 3D object models. Moreover, developing and testing on real hardware can be dangerous for the human and object. We propose a real-to-simulation framework to conduct safe human-to-robot handovers with visual estimations of the physical properties of unknown cups or drinking glasses and of the human hands from videos of a person manipulating the object. We complete the handover in simulation and we quantify the safeness of the human and object using annotations of the real object properties. To further increase the safety for the human, we estimate a safe grasp region, i.e. not occluded by the hand of the person holding the object, for the robot to grasp the object during the handover. We validate the framework using public recordings of objects manipulated before a handover and show the safeness of the handover when using noisy estimates from a range of perceptual algorithms.
@InProceedings{Pang2021ROMAN,
title = {Towards safe human-to-robot handovers of unknown containers},
author = {Yik Lang Pang and Alessio Xompero and Changjae Oh and Andrea Cavallaro},
booktitle = {IEEE International Conference on Robot and Human Interactive Communication},
address = {Virtual},
month = "8--12~" # aug,
year = {2021},
}
|
|
Improving filling level classification with adversarial training
A. Modas, A. Xompero, R. Sanchez-Matilla, P. Frossard, and A. Cavallaro
IEEE International Conference on Image Processing (ICIP),
Anchorage, Alaska, USA, 19-22 Sep 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[data]
[models]
We investigate the problem of classifying -- from a single image -- the level of content in a cup or a drinking glass. This problem is made challenging by several ambiguities caused by transparencies, shape variations and partial occlusions, and by the availability of only small training datasets. In this paper, we tackle this problem with an appropriate strategy for transfer learning. Specifically, we use adversarial training in a generic source dataset and then refine the training with a task-specific dataset. We also discuss and experimentally evaluate several training strategies and their combination on a range of container types of the CORSMAL Containers Manipulation dataset. We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set and limits the overfitting of the classifier to specific features of the training data.
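For readers who want a concrete picture of the training recipe above, the following Python/PyTorch sketch shows one possible instantiation: adversarial training on a generic source dataset followed by standard fine-tuning on the task-specific filling-level data. The single-step FGSM perturbation, the loader names and the hyperparameters are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch (assumed setup, not the paper's exact attack or schedule):
# adversarially train a PyTorch image classifier on a generic source dataset,
# then fine-tune it on the task-specific filling-level dataset.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=4 / 255):
    """Craft adversarial examples with a single FGSM step."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).clamp(0, 1).detach()

def adversarial_pretrain(model, source_loader, optimizer, epochs=10):
    """Adversarial training on the generic source dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in source_loader:
            adv = fgsm_perturb(model, images, labels)
            loss = F.cross_entropy(model(adv), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def finetune(model, target_loader, optimizer, epochs=10):
    """Standard fine-tuning on the task-specific (filling-level) dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in target_loader:
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()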
@InProceedings{Modas2021ICIP,
title = {Improving filling level classification with adversarial training},
author = {Apostolos Modas and Alessio Xompero and Ricardo Sanchez-Matilla and Pascal Frossard and Andrea Cavallaro},
booktitle = {IEEE International Conference on Image Processing},
address = {Anchorage, Alaska, USA},
month = "19--22~" # sep,
year = {2021},
}
|
|
Audio classification of the content of food containers and drinking glasses
S. Donaher, A. Xompero, and A. Cavallaro
European Signal Processing Conference (EUSIPCO),
Dublin, Ireland, 23-27 August 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[code]
[data]
[data2]
[models]
Food containers, drinking glasses and cups handled by a person generate sounds that vary with the type and amount of their content. In this paper, we propose a new model for sound-based classification of the type and amount of content in a container. The proposed model decomposes the problem into two steps, namely action recognition and content classification. We consider the scenario of the recent CORSMAL Containers Manipulation dataset, with two actions (shaking and pouring) and seven combinations of content material and filling level. The first step identifies the action a person performs while manipulating a container. The second step applies a classifier trained for the specific interaction identified by the first step to estimate the amount and type of content. Experiments show that the proposed model achieves weighted average F1 scores of 76.02, 78.24, and 41.89 on the three test sets, respectively, and outperforms baselines and existing approaches that classify content level and content type either independently or jointly as a single combined class.
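The two-step decomposition can be summarised with a short Python sketch; the classifier objects and their predict interface are hypothetical placeholders, not the networks used in the paper.

# Minimal sketch of the two-step pipeline (hypothetical classifier objects):
# first recognise the action from the audio, then dispatch to a content
# classifier trained for that specific action.
def classify_content(audio, action_classifier, content_classifiers):
    """Two-step audio pipeline: action recognition, then content classification.

    action_classifier.predict(audio) -> e.g. "shaking" or "pouring"
    content_classifiers[action].predict(audio) -> (content type, filling level)
    """
    action = action_classifier.predict(audio)          # step 1: action recognition
    return content_classifiers[action].predict(audio)  # step 2: action-specific classifier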
@InProceedings{Donaher2021EUSIPCO,
title = {Audio classification of the content of food containers and drinking glasses},
author = {Santiago Donaher and Alessio Xompero and Andrea Cavallaro},
booktitle = {European Signal Processing Conference},
address = {Dublin, Ireland},
month = "23--27~" # aug,
year = {2021},
}
|
|
Multi-view shape estimation of transparent containers
A. Xompero, R. Sanchez-Matilla, A. Modas, P. Frossard, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Barcelona, Spain, 4-8 May 2020.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[data]
[code]
[video]
The 3D localisation of an object and the estimation of its properties, such as shape and dimensions, are challenging under varying degrees of transparency and lighting conditions. In this paper, we propose a method for jointly localising container-like objects and estimating their dimensions using two wide-baseline, calibrated RGB cameras. Under the assumption of circular symmetry along the vertical axis, we estimate the dimensions of an object with a generative 3D sampling model of sparse circumferences, iterative shape fitting and image re-projection to verify the sampling hypotheses in each camera using semantic segmentation masks. We evaluate the proposed method on a novel dataset of objects with different degrees of transparency and captured under different backgrounds and illumination conditions. Our method, which is based on RGB images only, outperforms in terms of localisation success and dimension estimation accuracy a deep-learning based approach that uses depth maps.
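As an illustration of the hypothesis-and-verify idea described above, the following Python (NumPy) sketch scores a single radius hypothesis for a vertically symmetric object by projecting sparse circumference samples into two calibrated views and checking the semantic segmentation masks. The function names, the scoring rule and the inputs (3x4 projection matrices P1 and P2 mapping the object frame to each image, binary masks mask1 and mask2) are assumptions for illustration, not the paper's iterative shape fitting.

# Minimal sketch (assumed setup): score a radius hypothesis for a vertically
# symmetric container by projecting sparse circumference points into two
# calibrated views and checking the segmentation masks.
import numpy as np

def circumference_points(radius, z, num_points=36):
    """3D points on a circle of given radius at height z (object frame)."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full(num_points, z)], axis=1)  # (N, 3)

def project(P, points_3d):
    """Project Nx3 points with a 3x4 projection matrix; returns Nx2 pixels."""
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    uvw = (P @ homog.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def hypothesis_score(radius, heights, P1, P2, mask1, mask2):
    """Fraction of projected circumference samples falling inside both masks."""
    inliers, total = 0, 0
    for z in heights:                       # sparse circumferences along the axis
        pts = circumference_points(radius, z)
        for P, mask in ((P1, mask1), (P2, mask2)):
            uv = np.round(project(P, pts)).astype(int)
            valid = ((uv[:, 0] >= 0) & (uv[:, 0] < mask.shape[1]) &
                     (uv[:, 1] >= 0) & (uv[:, 1] < mask.shape[0]))
            inliers += mask[uv[valid, 1], uv[valid, 0]].sum()
            total += len(pts)               # out-of-image samples count as misses
    return inliers / total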
@InProceedings{Xompero2020ICASSP_LoDE,
title = {Multi-view shape estimation of transparent containers},
author = {Alessio Xompero and Ricardo Sanchez-Matilla and Apostolos Modas and Pascal Frossard and Andrea Cavallaro},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Barcelona, Spain},
month = "4--8~" # may,
year = {2020},
}
|
|
Accurate target annotation in 3D from multimodal streams
O. Lanz, A. Brutti, A. Xompero, X. Qian, M. Omologo, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Brighton, UK, 12-17 May 2019.
[abstract]
[pdf]
[bibtex]
[CAV3D]
Accurate annotation is fundamental to quantify the performance of multi-sensor and multi-modal object detectors and trackers. However, invasive or expensive instrumentation is needed to automatically generate these annotations. To mitigate this problem, we present a multi-modal approach that leverages annotations from reference streams (e.g. individual camera views) and measurements from unannotated additional streams (e.g. audio) to infer 3D trajectories through an optimization. The core of our approach is a multi-modal extension of Bundle Adjustment with a cross-modal correspondence detection that selectively uses measurements in the optimization. We apply the proposed approach to fully annotate a new multi-modal and multi-view dataset for multi-speaker 3D tracking.
@InProceedings{Lanz2019ICASSP,
title = {Accurate target annotation in 3D from multimodal streams},
author = {Lanz, Oswald and Brutti, Alessio and Xompero, Alessio and Qian, Xinyuan and Omologo, Maurizio and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Brighton, UK},
month = "12--17~" # may,
year = {2019},
}
|
|
MORB: a multi-scale binary descriptor
A. Xompero, O. Lanz, and A. Cavallaro
IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7-10 October 2018.
[abstract]
[pdf]
[bibtex]
[poster]
Local image features play an important role in matching images under different geometric and photometric transformations. However, as the scale difference across views increases, the matching performance may considerably decrease. To address this problem we propose MORB, a multi-scale binary descriptor that is based on ORB and that improves the accuracy of feature matching under scale changes. MORB describes an image patch at different scales using an oriented sampling pattern of intensity comparisons in a predefined set of pixel pairs. We also propose a matching strategy that estimates the cross-scale match between MORB descriptors across views. Experiments show that MORB outperforms state-of-the-art binary descriptors under several transformations.
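The following Python (OpenCV) sketch illustrates the general idea of multi-scale binary description and cross-scale Hamming matching; it uses standard ORB on an image pyramid as a stand-in and is not an implementation of MORB, which uses a shared oriented sampling pattern across scales and its own cross-scale matching strategy.

# Minimal sketch (assumed setup, standard ORB as a stand-in for MORB):
# describe an image at several scales with a binary descriptor and match two
# views with Hamming distance, keeping the closest matches over all scale pairs.
import cv2

def pyramid_descriptors(image, num_levels=4, scale=1.25):
    """Binary descriptors computed on a downscaled image pyramid."""
    orb = cv2.ORB_create()
    descriptors = []
    for level in range(num_levels):
        factor = 1.0 / (scale ** level)
        resized = cv2.resize(image, None, fx=factor, fy=factor)
        kps, desc = orb.detectAndCompute(resized, None)
        if desc is not None:
            descriptors.append(desc)
    return descriptors

def cross_scale_match(desc_a, desc_b, max_distance=64):
    """Match per-scale descriptor sets of two views over all scale pairs."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = []
    for da in desc_a:
        for db in desc_b:
            matches += [m for m in matcher.match(da, db) if m.distance < max_distance]
    return matches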
@InProceedings{Xompero2018ICIP_MORB,
title = {{MORB}: a multi-scale binary descriptor},
author = {Xompero, Alessio and Lanz, Oswald and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Image Processing},
address = {Athens, Greece},
month = "7--10~" # oct,
year = {2018},
}
|
|
Multi-camera Matching of Spatio-Temporal Binary Features
A. Xompero, O. Lanz, and A. Cavallaro
International Conference on Information Fusion (FUSION), Cambridge, United Kingdom, 10-13 July 2018.
[abstract]
[pdf]
[bibtex]
[slides]
Local image features are generally robust to different geometric and photometric transformations on planar surfaces or under narrow baseline views. However, the matching performance decreases considerably across cameras with unknown poses separated by a wide baseline. To address this problem, we accumulate temporal information within each view by tracking local binary features, which encode intensity comparisons of pixel pairs in an image patch. We then encode the spatio-temporal features into fixed-length binary descriptors by selecting temporally dominant binary values. We complement the descriptor with a binary vector that identifies intensity comparisons that are temporally unstable. Finally, we use this additional vector to ignore the corresponding binary values in the fixed-length binary descriptor when matching the features across cameras. We analyse the performance of the proposed approach and compare it with baselines.
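A minimal Python sketch of the masked matching rule described above, assuming each feature carries a fixed-length binary descriptor plus a same-length stability vector (1 = temporally stable comparison); this is an illustration of the idea, not the paper's implementation.

# Minimal sketch (assumed representation): Hamming distance computed only over
# the bits marked as temporally stable in both features.
import numpy as np

def masked_hamming(desc_a, mask_a, desc_b, mask_b):
    """Fraction of differing bits among those stable in both descriptors.

    desc_* and mask_* are equal-length numpy arrays of 0/1 values.
    Normalising by the number of stable bits avoids favouring features
    with many unstable comparisons.
    """
    stable = (mask_a & mask_b).astype(bool)
    if not stable.any():
        return 1.0  # no comparable bits: treat as maximally distant
    return np.count_nonzero(desc_a[stable] != desc_b[stable]) / stable.sum()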
@InProceedings{Xompero2018FUSION,
title = {Multi-camera Matching of Spatio-Temporal Binary Features},
author = {Xompero, Alessio and Lanz, Oswald and Cavallaro, Andrea},
booktitle = {International Conference on Information Fusion},
address = {Cambridge, UK},
month = "10--13~" # jul,
year = {2018},
}
|
|
3D Mouth Tracking from a Compact Microphone Array Co-located with a Camera
X. Qian, A. Xompero, A. Brutti, O. Lanz, M. Omologo, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 15-20 April 2018.
[abstract]
[pdf]
[bibtex]
We address the 3D audio-visual mouth tracking problem when using a compact platform with co-located audio-visual sensors and without a depth camera. In particular, we propose a multi-modal particle filter that combines a face detector and 3D hypothesis mapping to the image plane. The audio likelihood computation, which relies on a GCC-PHAT based acoustic map, is assisted by the video modality. By combining audio and video inputs, the proposed approach can cope with a reverberant and noisy environment, and can deal with situations when the person is occluded, outside the field of view, or not facing the sensors. Experimental results show that the proposed tracker is accurate both in 3D and on the image plane.
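For reference, the Python (NumPy) sketch below shows a standard GCC-PHAT time-delay estimate between two microphone signals, the cross-correlation measure underlying the acoustic map mentioned above; the acoustic map construction and the particle filter itself are not shown, and the interpolation factor is an illustrative choice.

# Minimal sketch of GCC-PHAT time-delay estimation between two signals.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=4):
    """Return the estimated time delay (seconds) between sig and ref."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)           # interpolated cross-correlation
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)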
@InProceedings{Qian2018ICASSP,
title = {{3D} Mouth Tracking from a Compact Microphone Array Co-located with a Camera},
author = {Qian, Xinyuan and Xompero, Alessio and Brutti, Alessio and Lanz, Oswald and Omologo, Maurizio and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Calgary, Canada},
month = "15--20~" # apr,
year = {2018},
}
|