A spatio-temporal multi-scale binary descriptor

Queen Mary University of London - Centre for Intelligent Sensing
Fondazione Bruno Kessler - Technologies of Vision

Abstract

Binary descriptors are widely used for multi-view matching and robotic navigation. However, their matching performance decreases considerably under severe scale and viewpoint changes in non-planar scenes. To overcome this problem, we propose to encode the varying appearance of selected 3D scene points tracked by a moving camera with compact spatio-temporal descriptors. To this end, we first track interest points and capture their temporal variations at multiple scales. Then, we validate feature tracks through 3D reconstruction and compress the temporal sequence of descriptors by encoding the most frequent and stable binary values. Finally, we determine multi-scale correspondences across views with a matching strategy that handles severe scale differences. The proposed spatio-temporal multi-scale approach is generic and can be used with a variety of binary descriptors. We show the effectiveness of the joint multi-scale extraction and temporal reduction through comparisons of different temporal reduction strategies and the application to several binary descriptors.
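To illustrate the temporal reduction step described above, the sketch below compresses a track's sequence of binary descriptors into a single descriptor by per-bit majority vote, and flags the bits that remain stable across frames. The stability threshold and the exact encoding are assumptions of this sketch, not the paper's formulation.

```python
import numpy as np

def temporal_reduce(track_descriptors, stability_threshold=0.8):
    """Reduce a track's sequence of binary descriptors to a single one.

    track_descriptors: (T, D) array-like of 0/1 bits, one row per frame.
    Returns (majority, stable): the per-bit majority vote, and a mask of
    bits whose majority value occurs in at least `stability_threshold`
    of the frames. Illustrative sketch only.
    """
    d = np.asarray(track_descriptors)
    freq = d.mean(axis=0)                          # per-bit frequency of 1s
    majority = (freq >= 0.5).astype(np.uint8)      # most frequent bit value
    agreement = np.where(majority == 1, freq, 1.0 - freq)
    stable = agreement >= stability_threshold      # bits stable over time
    return majority, stable
```

For example, a track observed in three frames with descriptors [1,1,0,0], [1,0,0,1] and [1,1,0,0] reduces to the majority descriptor [1,1,0,0], with the first and third bits marked as stable.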

TIP2020

Dataset

We use pairs of sequences, captured with hand-held cameras, from publicly available datasets: TUM-RGB-D SLAM; courtyard; and gate.
- TUM-RGB-D SLAM: two clips of 50 frames (640x480 pixels) with sufficient overlap from desk (similar camera motion) and office (cameras moving in opposite directions).
- courtyard: the first 50 frames (800x450 pixels) of the first and fourth video, after sub-sampling the videos from 50 to 25 fps.
- gate: the first 100 frames (1280x720 pixels) of the four sequences, after down-sampling the video from 30 to 10 fps. We pair the first sequence with each of the other three sequences and refer to each pair as gate-1, gate-2 and gate-3, respectively.
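The integer-ratio sub-sampling used above (50 to 25 fps, 30 to 10 fps) amounts to keeping every second or every third frame. A minimal sketch:

```python
def subsample_indices(n_frames, src_fps, dst_fps):
    """Indices of the frames to keep when down-sampling a video from
    src_fps to dst_fps, assuming an integer fps ratio (as in the
    50->25 and 30->10 fps cases above)."""
    step = src_fps // dst_fps          # keep one frame every `step`
    return list(range(0, n_frames, step))
```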

A sample frame for each sequence of the four sets (desk, office, courtyard and gate). Note the differences in viewpoint and/or scale between sequences within the same set.

Results

Matching results with the nearest neighbour strategy and Lowe's ratio test using ORB features on the six testing sequence pairs. Best results in bold, second best in italic.

Legend: M: number of matches. P: precision. R: recall. F1: F1 score.
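The nearest-neighbour strategy with Lowe's ratio test can be sketched for binary descriptors as follows; the ratio threshold of 0.8 is illustrative, not necessarily the value used in these experiments.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Nearest-neighbour matching of binary descriptors (Hamming
    distance) with Lowe's ratio test. desc1: (N, D) and desc2: (M, D)
    arrays of 0/1 bits, with M >= 2. Returns a list of (i, j) index
    pairs that pass the test. Illustrative sketch."""
    desc1 = np.asarray(desc1)
    desc2 = np.asarray(desc2)
    matches = []
    for i, d in enumerate(desc1):
        dist = np.count_nonzero(desc2 != d, axis=1)   # Hamming distances
        order = np.argsort(dist)
        best, second = order[0], order[1]             # two nearest neighbours
        if dist[best] < ratio * dist[second]:         # Lowe's ratio test
            matches.append((i, int(best)))
    return matches
```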


desk
  Method    M     P      R      F1
  SetDesc   444   54.28  18.84  27.97
  LMED      321   47.04  11.81  18.87
  T-D       328   46.65  11.96  19.04
  T-DS      481   44.28  16.65  24.20
  MST-S     388   44.85  13.60  20.88
  MST       533   42.40  17.67  24.94

office
  Method    M     P      R      F1
  SetDesc   560   38.21  8.16   13.45
  LMED      453   27.37  4.73   8.06
  T-D       454   33.48  5.79   9.88
  T-DS      692   26.45  6.98   11.04
  MST-S     541   43.44  8.96   14.85
  MST       834   36.57  11.63  17.65

courtyard
  Method    M     P      R      F1
  SetDesc   853   73.27  18.13  29.06
  LMED      632   57.44  10.53  17.79
  T-D       671   63.64  12.38  20.73
  T-DS      1021  50.93  15.08  23.27
  MST-S     1214  85.17  29.99  44.36
  MST       1610  74.91  34.98  47.69

gate-1
  Method    M     P      R      F1
  SetDesc   895   57.21  21.10  30.82
  LMED      693   46.18  13.19  20.51
  T-D       700   50.00  14.42  22.39
  T-DS      1036  41.89  17.88  25.06
  MST-S     892   56.50  20.77  30.37
  MST       1293  48.18  25.67  33.49

gate-2
  Method    M     P      R      F1
  SetDesc   338   19.23  4.79   7.67
  LMED      282   14.18  2.95   4.88
  T-D       265   16.23  3.17   5.30
  T-DS      508   11.42  4.27   6.22
  MST-S     319   18.81  4.42   7.16
  MST       584   14.55  6.26   8.76

gate-3
  Method    M     P      R      F1
  SetDesc   880   52.05  19.26  28.12
  LMED      668   43.41  12.20  19.04
  T-D       741   45.34  14.13  21.55
  T-DS      1112  36.33  16.99  23.15
  MST-S     1095  55.07  25.36  34.73
  MST       1562  46.09  30.28  36.55
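The F1 entries in the table are the harmonic mean of precision and recall, F1 = 2PR/(P+R), with P and R in percent. A quick check against two table entries:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)
```

For example, f1_score(54.28, 18.84) reproduces (up to rounding) the 27.97 of SetDesc on desk, and f1_score(74.91, 34.98) the 47.69 of MST on courtyard.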

Comparison of multi-scale temporal feature with Bag of Binary Visual Words

F1 score comparison between MST and the BoW approach. For each sequence pair, we report three BoW cases. BoW-based matching results are obtained by running ORB-SLAM 30 times, processing both camera sequences simultaneously.

Legend
- MST: multi-scale temporal features.
- BoW-A: the best image match is estimated among all keyframes of both cameras.
- BoW-L2: the best image match is estimated by matching the last keyframe of the second camera against all the keyframes of the first camera.
- BoW-L1: the best image match is estimated by matching the last keyframe of the first camera against all the keyframes of the second camera.
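In each BoW case, the best image match is selected by scoring bag-of-words histograms of keyframes. The generic sketch below uses cosine similarity between histograms; ORB-SLAM's actual DBoW2 vocabulary-tree scoring differs in detail, so this is only an approximation of the idea.

```python
import numpy as np

def best_keyframe_match(query_hist, keyframe_hists):
    """Return the index and score of the keyframe whose bag-of-words
    histogram is most similar to the query (cosine similarity).
    Generic sketch, not ORB-SLAM's exact scoring."""
    q = np.asarray(query_hist, dtype=float)
    k = np.asarray(keyframe_hists, dtype=float)
    sims = (k @ q) / (np.linalg.norm(k, axis=1) * np.linalg.norm(q) + 1e-12)
    return int(np.argmax(sims)), float(sims.max())
```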

Comparison of the spatio-temporal approaches applied to a range of different image-based binary descriptors

Average F1 score and standard deviation across all sequence pairs, targeting a maximum of 2000 local features per frame during localisation. Comparison between binary (ORB), histogram-based (SIFT), and CNN-based (DeepBit) descriptors. Note that for SIFT, we compute only the set of SIFT descriptors over time (SetDesc) and, as reduction, the average within the set (T-D).
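For SIFT, the T-D reduction mentioned above averages a track's descriptors over time. A minimal sketch; the final renormalisation to unit length is an assumption of this sketch, since SIFT descriptors are conventionally normalised.

```python
import numpy as np

def average_sift_track(descriptors):
    """T-D reduction for SIFT: average a track's descriptors over time,
    then renormalise to unit length (the renormalisation is an
    assumption of this sketch)."""
    mean = np.mean(np.asarray(descriptors, dtype=float), axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean
```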

Reference

Plain text format
  A. Xompero, O. Lanz, A. Cavallaro, A spatio-temporal multi-scale binary descriptor,
  IEEE Transactions on Image Processing, vol. 29, no. 1, pp. 4362-4375, Dec. 2020
Bibtex format
 
  @Article{Xompero2020TIP_MST,
    title={A spatio-temporal multi-scale binary descriptor},
    author={Alessio Xompero and Oswald Lanz and Andrea Cavallaro},
    journal={IEEE Transactions on Image Processing},
    volume={29},
    number={1},
    pages={4362--4375},
    month={Dec},
    year={2020},
    issn={1057-7149},
    doi={10.1109/TIP.2020.2965277}
  }