|
Audio-Visual Object Classification for Human-Robot Collaboration
A. Xompero, Y. L. Pang, T. Patten, A. Prabhakar, B. Calli, A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore and Virtual, 22-27 May 2022.
[abstract]
[paper]
[arxiv]
[challenge webpage]
[bibtex]
Human-robot collaboration requires the contactless estimation of the physical properties of containers manipulated by a person, for example while pouring content in a cup or moving a food box. Acoustic and visual signals can be used to estimate the physical properties of such objects, which may vary substantially in shape, material and size, and also be occluded by the hands of the person. To facilitate comparisons and stimulate progress in solving this problem, we present the CORSMAL challenge and a dataset to assess the performance of the algorithms through a set of well-defined performance scores. The tasks of the challenge are the estimation of the mass, capacity, and dimensions of the object (container), and the classification of the type and amount of its content. A novel feature of the challenge is our real-to-simulation framework for visualising and assessing the impact of estimation errors in human-to-robot handovers.
@InProceedings{Xompero2022ICASSP,
title = {Audio-Visual Object Classification for Human-Robot Collaboration},
author = {Xompero, A. and Pang, Y. L. and Patten, T. and Prabhakar, A. and Calli, B. and Cavallaro, A.},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Singapore},
month = "22--27~" # may,
year = {2022},
}
|
|
Towards safe human-to-robot handovers of unknown containers
Y. L. Pang, A. Xompero, C. Oh, and A. Cavallaro
IEEE International Conference on Robot and Human Interactive Communication (RO-MAN),
Virtual, 8-12 Aug 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
Safe human-to-robot handovers of unknown containers require accurate estimation of the human and object properties, such as hand pose, object shape, trajectory, and weight. However, accurate estimation requires the use of expensive equipment, such as motion capture systems, markers, and 3D object models. Moreover, developing and testing on real hardware can be dangerous for the human and object. We propose a real-to-simulation framework to conduct safe human-to-robot handovers with visual estimations of the physical properties of unknown cups or drinking glasses and of the human hands from videos of a person manipulating the object. We complete the handover in simulation and we quantify the safeness of the human and object using annotations of the real object properties. To further increase the safety for the human, we estimate a safe grasp region, i.e. not occluded by the hand of the person holding the object, for the robot to grasp the object during the handover. We validate the framework using public recordings of objects manipulated before a handover and show the safeness of the handover when using noisy estimates from a range of perceptual algorithms.
@InProceedings{Pang2021ROMAN,
title = {Towards safe human-to-robot handovers of unknown containers},
author = {Yik Lang Pang and Alessio Xompero and Changjae Oh and Andrea Cavallaro},
booktitle = {IEEE International Conference on Robot and Human Interactive Communication},
address = {Virtual},
month = "8--12~" # aug,
year = {2021},
}
|
|
Improving filling level classification with adversarial training
A. Modas, A. Xompero, R. Sanchez-Matilla, P. Frossard, and A. Cavallaro
IEEE International Conference on Image Processing (ICIP),
Anchorage, Alaska, USA, 19-22 Sep 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[data]
[models]
We investigate the problem of classifying -- from a single image -- the level of content in a cup or a drinking glass. This problem is made challenging by several ambiguities caused by transparencies, shape variations and partial occlusions, and by the availability of only small training datasets. In this paper, we tackle this problem with an appropriate strategy for transfer learning. Specifically, we use adversarial training in a generic source dataset and then refine the training with a task-specific dataset. We also discuss and experimentally evaluate several training strategies and their combination on a range of container types of the CORSMAL Containers Manipulation dataset. We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set and limits the overfitting of the classifier to specific features of the training data.
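For readers who want a concrete picture of the training recipe above, the following Python/PyTorch sketch shows one possible instantiation: adversarial training on a generic source dataset followed by standard fine-tuning on the task-specific filling-level data. The single-step FGSM perturbation, the loader names and the hyperparameters are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch (assumed setup, not the paper's exact attack or schedule):
# adversarially train a PyTorch image classifier on a generic source dataset,
# then fine-tune it on the task-specific filling-level dataset.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=4 / 255):
    """Craft adversarial examples with a single FGSM step."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).clamp(0, 1).detach()

def adversarial_pretrain(model, source_loader, optimizer, epochs=10):
    """Adversarial training on the generic source dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in source_loader:
            adv = fgsm_perturb(model, images, labels)
            loss = F.cross_entropy(model(adv), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def finetune(model, target_loader, optimizer, epochs=10):
    """Standard fine-tuning on the task-specific (filling-level) dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in target_loader:
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()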
@InProceedings{Modas2021ICIP,
title = {Improving filling level classification with adversarial training},
author = {Apostolos Modas and Alessio Xompero and Ricardo Sanchez-Matilla and Pascal Frossard and Andrea Cavallaro},
booktitle = {IEEE International Conference on Image Processing},
address = {Anchorage, Alaska, USA},
month = "19--22~" # sep,
year = {2021},
}
|
|
Audio classification of the content of food containers and drinking glasses
S. Donaher, A. Xompero, and A. Cavallaro
European Signal Processing Conference (EUSIPCO),
Dublin, Ireland, 23-27 August 2021.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[code]
[data]
[data2]
[models]
Food containers, drinking glasses and cups handled by a person generate sounds that vary with the type and amount of their content. In this paper, we propose a new model for sound-based classification of the type and amount of content in a container. The proposed model decomposes the problem into two steps, namely action recognition and content classification. We consider the scenario of the recent CORSMAL Containers Manipulation dataset, with two actions (shaking and pouring) and seven combinations of content material and filling level. The first step identifies the action a person performs while manipulating a container. The second step applies a classifier trained for the specific interaction identified by the first step to estimate the amount and type of content. Experiments show that the proposed model achieves weighted average F1 scores of 76.02, 78.24, and 41.89 on the three test sets, respectively, and outperforms baselines and existing approaches that classify content level and content type either independently or jointly as a single combined class.
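The two-step decomposition can be summarised with a short Python sketch; the classifier objects and their predict interface are hypothetical placeholders, not the networks used in the paper.

# Minimal sketch of the two-step pipeline (hypothetical classifier objects):
# first recognise the action from the audio, then dispatch to a content
# classifier trained for that specific action.
def classify_content(audio, action_classifier, content_classifiers):
    """Two-step audio pipeline: action recognition, then content classification.

    action_classifier.predict(audio) -> e.g. "shaking" or "pouring"
    content_classifiers[action].predict(audio) -> (content type, filling level)
    """
    action = action_classifier.predict(audio)          # step 1: action recognition
    return content_classifiers[action].predict(audio)  # step 2: action-specific classifier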
@InProceedings{Donaher2021EUSIPCO,
title = {Audio classification of the content of food containers and drinking glasses},
author = {Santiago Donaher and Alessio Xompero and Andrea Cavallaro},
booktitle = {European Signal Processing Conference},
address = {Dublin, Ireland},
month = "23--27~" # aug,
year = {2021},
}
|
|
Multi-view shape estimation of transparent containers
A. Xompero, R. Sanchez-Matilla, A. Modas, P. Frossard, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Barcelona, Spain, 4-8 May 2020.
[abstract]
[paper]
[arxiv]
[webpage]
[bibtex]
[data]
[code]
[video]
The 3D localisation of an object and the estimation of its properties, such as shape and dimensions, are challenging under varying degrees of transparency and lighting conditions. In this paper, we propose a method for jointly localising container-like objects and estimating their dimensions using two wide-baseline, calibrated RGB cameras. Under the assumption of circular symmetry along the vertical axis, we estimate the dimensions of an object with a generative 3D sampling model of sparse circumferences, iterative shape fitting and image re-projection to verify the sampling hypotheses in each camera using semantic segmentation masks. We evaluate the proposed method on a novel dataset of objects with different degrees of transparency and captured under different backgrounds and illumination conditions. Our method, which is based on RGB images only, outperforms in terms of localisation success and dimension estimation accuracy a deep-learning based approach that uses depth maps.
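As an illustration of the hypothesis-and-verify idea described above, the following Python (NumPy) sketch scores a single radius hypothesis for a vertically symmetric object by projecting sparse circumference samples into two calibrated views and checking the semantic segmentation masks. The function names, the scoring rule and the inputs (3x4 projection matrices P1 and P2 mapping the object frame to each image, binary masks mask1 and mask2) are assumptions for illustration, not the paper's iterative shape fitting.

# Minimal sketch (assumed setup): score a radius hypothesis for a vertically
# symmetric container by projecting sparse circumference points into two
# calibrated views and checking the segmentation masks.
import numpy as np

def circumference_points(radius, z, num_points=36):
    """3D points on a circle of given radius at height z (object frame)."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full(num_points, z)], axis=1)  # (N, 3)

def project(P, points_3d):
    """Project Nx3 points with a 3x4 projection matrix; returns Nx2 pixels."""
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    uvw = (P @ homog.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def hypothesis_score(radius, heights, P1, P2, mask1, mask2):
    """Fraction of projected circumference samples falling inside both masks."""
    inliers, total = 0, 0
    for z in heights:                       # sparse circumferences along the axis
        pts = circumference_points(radius, z)
        for P, mask in ((P1, mask1), (P2, mask2)):
            uv = np.round(project(P, pts)).astype(int)
            valid = ((uv[:, 0] >= 0) & (uv[:, 0] < mask.shape[1]) &
                     (uv[:, 1] >= 0) & (uv[:, 1] < mask.shape[0]))
            inliers += mask[uv[valid, 1], uv[valid, 0]].sum()
            total += len(pts)               # out-of-image samples count as misses
    return inliers / total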
@InProceedings{Xompero2020ICASSP_LoDE,
title = {Multi-view shape estimation of transparent containers},
author = {Alessio Xompero and Ricardo Sanchez-Matilla and Apostolos Modas and Pascal Frossard and Andrea Cavallaro},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Barcelona, Spain},
month = "4--8~" # may,
year = {2020},
}
|
|
Accurate target annotation in 3D from multimodal streams
O. Lanz, A. Brutti, A. Xompero, X. Qian, M. Omologo, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Brighton, UK, 12-17 May 2019.
[abstract]
[pdf]
[bibtex]
[CAV3D]
Accurate annotation is fundamental to quantify the performance of multi-sensor and multi-modal object detectors and trackers. However, invasive or expensive instrumentation is needed to automatically generate these annotations. To mitigate this problem, we present a multi-modal approach that leverages annotations from reference streams (e.g. individual camera views) and measurements from unannotated additional streams (e.g. audio) to infer 3D trajectories through an optimization. The core of our approach is a multi-modal extension of Bundle Adjustment with a cross-modal correspondence detection that selectively uses measurements in the optimization. We apply the proposed approach to fully annotate a new multi-modal and multi-view dataset for multi-speaker 3D tracking.
@InProceedings{Lanz2019ICASSP,
title = {Accurate target annotation in 3D from multimodal streams},
author = {Lanz, Oswald and Brutti, Alessio and Xompero, Alessio and Qian, Xinyuan and Omologo, Maurizio and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Brighton, UK},
month = "12--17~" # may,
year = {2019},
}
|
|
MORB: a multi-scale binary descriptor
A. Xompero, O. Lanz, and A. Cavallaro
IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7-10 October 2018.
[abstract]
[pdf]
[bibtex]
[poster]
Local image features play an important role in matching images under different geometric and photometric transformations. However, as the scale difference across views increases, the matching performance may considerably decrease. To address this problem we propose MORB, a multi-scale binary descriptor that is based on ORB and that improves the accuracy of feature matching under scale changes. MORB describes an image patch at different scales using an oriented sampling pattern of intensity comparisons in a predefined set of pixel pairs. We also propose a matching strategy that estimates the cross-scale match between MORB descriptors across views. Experiments show that MORB outperforms state-of-the-art binary descriptors under several transformations.
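The following Python (OpenCV) sketch illustrates the general idea of multi-scale binary description and cross-scale Hamming matching; it uses standard ORB on an image pyramid as a stand-in and is not an implementation of MORB, which uses a shared oriented sampling pattern across scales and its own cross-scale matching strategy.

# Minimal sketch (assumed setup, standard ORB as a stand-in for MORB):
# describe an image at several scales with a binary descriptor and match two
# views with Hamming distance, keeping the closest matches over all scale pairs.
import cv2

def pyramid_descriptors(image, num_levels=4, scale=1.25):
    """Binary descriptors computed on a downscaled image pyramid."""
    orb = cv2.ORB_create()
    descriptors = []
    for level in range(num_levels):
        factor = 1.0 / (scale ** level)
        resized = cv2.resize(image, None, fx=factor, fy=factor)
        kps, desc = orb.detectAndCompute(resized, None)
        if desc is not None:
            descriptors.append(desc)
    return descriptors

def cross_scale_match(desc_a, desc_b, max_distance=64):
    """Match per-scale descriptor sets of two views over all scale pairs."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = []
    for da in desc_a:
        for db in desc_b:
            matches += [m for m in matcher.match(da, db) if m.distance < max_distance]
    return matches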
@InProceedings{Xompero2018ICIP_MORB,
title = {{MORB}: a multi-scale binary descriptor},
author = {Xompero, Alessio and Lanz, Oswald and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Image Processing},
address = {Athens, Greece},
month = "7--10~" # oct,
year = {2018},
}
|
|
Multi-camera Matching of Spatio-Temporal Binary Features
A. Xompero, O. Lanz, and A. Cavallaro
International Conference on Information Fusion (FUSION), Cambridge, United Kingdom, 10-13 July 2018.
[abstract]
[pdf]
[bibtex]
[slides]
Local image features are generally robust to different geometric and photometric transformations on planar surfaces or under narrow baseline views. However, the matching performance decreases considerably across cameras with unknown poses separated by a wide baseline. To address this problem, we accumulate temporal information within each view by tracking local binary features, which encode intensity comparisons of pixel pairs in an image patch. We then encode the spatio-temporal features into fixed-length binary descriptors by selecting temporally dominant binary values. We complement the descriptor with a binary vector that identifies intensity comparisons that are temporally unstable. Finally, we use this additional vector to ignore the corresponding binary values in the fixed-length binary descriptor when matching the features across cameras. We analyse the performance of the proposed approach and compare it with baselines.
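A minimal Python sketch of the masked matching rule described above, assuming each feature carries a fixed-length binary descriptor plus a same-length stability vector (1 = temporally stable comparison); this is an illustration of the idea, not the paper's implementation.

# Minimal sketch (assumed representation): Hamming distance computed only over
# the bits marked as temporally stable in both features.
import numpy as np

def masked_hamming(desc_a, mask_a, desc_b, mask_b):
    """Fraction of differing bits among those stable in both descriptors.

    desc_* and mask_* are equal-length numpy arrays of 0/1 values.
    Normalising by the number of stable bits avoids favouring features
    with many unstable comparisons.
    """
    stable = (mask_a & mask_b).astype(bool)
    if not stable.any():
        return 1.0  # no comparable bits: treat as maximally distant
    return np.count_nonzero(desc_a[stable] != desc_b[stable]) / stable.sum()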
@InProceedings{Xompero2018FUSION,
title = {Multi-camera Matching of Spatio-Temporal Binary Features},
author = {Xompero, Alessio and Lanz, Oswald and Cavallaro, Andrea},
booktitle = {International Conference on Information Fusion},
address = {Cambridge, UK},
month = "10--13~" # jul,
year = {2018},
}
|
|
3D Mouth Tracking from a Compact Microphone Array Co-located with a Camera
X. Qian, A. Xompero, A. Brutti, O. Lanz, M. Omologo, and A. Cavallaro
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 15-20 April 2018.
[abstract]
[pdf]
[bibtex]
We address the 3D audio-visual mouth tracking problem when using a compact platform with co-located audio-visual sensors and without a depth camera. In particular, we propose a multi-modal particle filter that combines a face detector and 3D hypothesis mapping to the image plane. The audio likelihood computation, which relies on a GCC-PHAT based acoustic map, is assisted by the video modality. By combining audio and video inputs, the proposed approach can cope with a reverberant and noisy environment, and can deal with situations when the person is occluded, outside the field of view, or not facing the sensors. Experimental results show that the proposed tracker is accurate both in 3D and on the image plane.
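For reference, the Python (NumPy) sketch below shows a standard GCC-PHAT time-delay estimate between two microphone signals, the cross-correlation measure underlying the acoustic map mentioned above; the acoustic map construction and the particle filter itself are not shown, and the interpolation factor is an illustrative choice.

# Minimal sketch of GCC-PHAT time-delay estimation between two signals.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=4):
    """Return the estimated time delay (seconds) between sig and ref."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)           # interpolated cross-correlation
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)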
@InProceedings{Qian2018ICASSP,
title = {{3D} Mouth Tracking from a Compact Microphone Array Co-located with a Camera},
author = {Qian, Xinyuan and Xompero, Alessio and Brutti, Alessio and Lanz, Oswald and Omologo, Maurizio and Cavallaro, Andrea},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
address = {Calgary, Canada},
month = "15--20~" # apr,
year = {2018},
}
|