A spatio-temporal multi-scale binary descriptor

Queen Mary University of London - Centre for Intelligent Sensing
Fondazione Bruno Kessler - Technologies of Vision

Abstract

Binary descriptors are widely used for multi-view matching and robotic navigation. However, their matching performance decreases considerably under severe scale and viewpoint changes in non-planar scenes. To overcome this problem, we propose to encode the varying appearance of selected 3D scene points tracked by a moving camera with compact spatio-temporal descriptors. To this end, we first track interest points and capture their temporal variations at multiple scales. Then, we validate feature tracks through 3D reconstruction and compress the temporal sequence of descriptors by encoding the most frequent and stable binary values. Finally, we determine multi-scale correspondences across views with a matching strategy that handles severe scale differences. The proposed spatio-temporal multi-scale approach is generic and can be used with a variety of binary descriptors. We show the effectiveness of the joint multi-scale extraction and temporal reduction through comparisons of different temporal reduction strategies and the application to several binary descriptors.
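To illustrate the temporal reduction step described above, the sketch below compresses a track's sequence of binary descriptors into a single descriptor by per-bit majority vote, and flags the bits that remain stable across frames. The stability threshold and the exact encoding are assumptions of this sketch, not the paper's formulation.

```python
import numpy as np

def temporal_reduce(track_descriptors, stability_threshold=0.8):
    """Reduce a track's sequence of binary descriptors to a single one.

    track_descriptors: (T, D) array-like of 0/1 bits, one row per frame.
    Returns (majority, stable): the per-bit majority vote, and a mask of
    bits whose majority value occurs in at least `stability_threshold`
    of the frames. Illustrative sketch only.
    """
    d = np.asarray(track_descriptors)
    freq = d.mean(axis=0)                          # per-bit frequency of 1s
    majority = (freq >= 0.5).astype(np.uint8)      # most frequent bit value
    agreement = np.where(majority == 1, freq, 1.0 - freq)
    stable = agreement >= stability_threshold      # bits stable over time
    return majority, stable
```

For example, a track observed in three frames with descriptors [1,1,0,0], [1,0,0,1] and [1,1,0,0] reduces to the majority descriptor [1,1,0,0], with the first and third bits marked as stable.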

TIP2020

Dataset

We use pairs of sequences, captured with hand-held cameras, from publicly available datasets: TUM-RGB-D SLAM; courtyard; and gate.
- TUM-RGB-D SLAM: two clips of 50 frames (640x480 pixels) with sufficient overlap from desk (similar camera motion) and office (cameras moving in opposite directions).
- courtyard: the first 50 frames (800x450 pixels) of the first and fourth video, after sub-sampling the videos from 50 to 25 fps.
- gate: the first 100 frames (1280x720 pixels) of the four sequences, after down-sampling the video from 30 to 10 fps. We pair the first sequence with each of the other three sequences and refer to each pair as gate-1, gate-2 and gate-3, respectively.
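The integer-ratio sub-sampling used above (50 to 25 fps, 30 to 10 fps) amounts to keeping every second or every third frame. A minimal sketch:

```python
def subsample_indices(n_frames, src_fps, dst_fps):
    """Indices of the frames to keep when down-sampling a video from
    src_fps to dst_fps, assuming an integer fps ratio (as in the
    50->25 and 30->10 fps cases above)."""
    step = src_fps // dst_fps          # keep one frame every `step`
    return list(range(0, n_frames, step))
```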

A sample frame for each sequence of the four sets (desk, office, courtyard and gate). Note the differences in viewpoint and/or scale between sequences within the same set.

Results

Matching results with the nearest neighbour strategy and Lowe's ratio test using ORB features on the six testing sequence pairs. Best results in bold, second best in italic.

Legend: M: number of matches. P: precision. R: recall. F1: F1 score.
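The nearest-neighbour strategy with Lowe's ratio test can be sketched for binary descriptors as follows; the ratio threshold of 0.8 is illustrative, not necessarily the value used in these experiments.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Nearest-neighbour matching of binary descriptors (Hamming
    distance) with Lowe's ratio test. desc1: (N, D) and desc2: (M, D)
    arrays of 0/1 bits, with M >= 2. Returns a list of (i, j) index
    pairs that pass the test. Illustrative sketch."""
    desc1 = np.asarray(desc1)
    desc2 = np.asarray(desc2)
    matches = []
    for i, d in enumerate(desc1):
        dist = np.count_nonzero(desc2 != d, axis=1)   # Hamming distances
        order = np.argsort(dist)
        best, second = order[0], order[1]             # two nearest neighbours
        if dist[best] < ratio * dist[second]:         # Lowe's ratio test
            matches.append((i, int(best)))
    return matches
```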


desk
  Method    M     P      R      F1
  SetDesc   444   54.28  18.84  27.97
  LMED      321   47.04  11.81  18.87
  T-D       328   46.65  11.96  19.04
  T-DS      481   44.28  16.65  24.20
  MST-S     388   44.85  13.60  20.88
  MST       533   42.40  17.67  24.94

office
  Method    M     P      R      F1
  SetDesc   560   38.21  8.16   13.45
  LMED      453   27.37  4.73   8.06
  T-D       454   33.48  5.79   9.88
  T-DS      692   26.45  6.98   11.04
  MST-S     541   43.44  8.96   14.85
  MST       834   36.57  11.63  17.65

courtyard
  Method    M     P      R      F1
  SetDesc   853   73.27  18.13  29.06
  LMED      632   57.44  10.53  17.79
  T-D       671   63.64  12.38  20.73
  T-DS      1021  50.93  15.08  23.27
  MST-S     1214  85.17  29.99  44.36
  MST       1610  74.91  34.98  47.69

gate-1
  Method    M     P      R      F1
  SetDesc   895   57.21  21.10  30.82
  LMED      693   46.18  13.19  20.51
  T-D       700   50.00  14.42  22.39
  T-DS      1036  41.89  17.88  25.06
  MST-S     892   56.50  20.77  30.37
  MST       1293  48.18  25.67  33.49

gate-2
  Method    M     P      R      F1
  SetDesc   338   19.23  4.79   7.67
  LMED      282   14.18  2.95   4.88
  T-D       265   16.23  3.17   5.30
  T-DS      508   11.42  4.27   6.22
  MST-S     319   18.81  4.42   7.16
  MST       584   14.55  6.26   8.76

gate-3
  Method    M     P      R      F1
  SetDesc   880   52.05  19.26  28.12
  LMED      668   43.41  12.20  19.04
  T-D       741   45.34  14.13  21.55
  T-DS      1112  36.33  16.99  23.15
  MST-S     1095  55.07  25.36  34.73
  MST       1562  46.09  30.28  36.55
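The F1 entries in the table are the harmonic mean of precision and recall, F1 = 2PR/(P+R), with P and R in percent. A quick check against two table entries:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)
```

For example, f1_score(54.28, 18.84) reproduces (up to rounding) the 27.97 of SetDesc on desk, and f1_score(74.91, 34.98) the 47.69 of MST on courtyard.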

Comparison of multi-scale temporal feature with Bag of Binary Visual Words

F1 score comparison between MST and the BoW approach. For each sequence pair, we report three BoW cases. BoW-based matching results are obtained by running ORB-SLAM 30 times, processing both camera sequences simultaneously.

Legend
- MST: multi-scale temporal features.
- BoW-A: the best image match is estimated among all keyframes of both cameras.
- BoW-L2: the best image match is estimated by matching the last keyframe of the second camera against all the keyframes of the first camera.
- BoW-L1: the best image match is estimated by matching the last keyframe of the first camera against all the keyframes of the second camera.
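In each BoW case, the best image match is selected by scoring bag-of-words histograms of keyframes. The generic sketch below uses cosine similarity between histograms; ORB-SLAM's actual DBoW2 vocabulary-tree scoring differs in detail, so this is only an approximation of the idea.

```python
import numpy as np

def best_keyframe_match(query_hist, keyframe_hists):
    """Return the index and score of the keyframe whose bag-of-words
    histogram is most similar to the query (cosine similarity).
    Generic sketch, not ORB-SLAM's exact scoring."""
    q = np.asarray(query_hist, dtype=float)
    k = np.asarray(keyframe_hists, dtype=float)
    sims = (k @ q) / (np.linalg.norm(k, axis=1) * np.linalg.norm(q) + 1e-12)
    return int(np.argmax(sims)), float(sims.max())
```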

Comparison of the spatio-temporal approaches applied to a range of different image-based binary descriptors

Average F1 score and standard deviation across all sequence pairs, targeting a maximum of 2000 local features per frame during localisation. Comparison between binary (ORB), histogram-based (SIFT), and CNN-based (DeepBit) descriptors. Note that for SIFT, we compute only the set of SIFT descriptors over time (SetDesc) and, as reduction, the average within the set (T-D).
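For SIFT, the T-D reduction mentioned above averages a track's descriptors over time. A minimal sketch; the final renormalisation to unit length is an assumption of this sketch, since SIFT descriptors are conventionally normalised.

```python
import numpy as np

def average_sift_track(descriptors):
    """T-D reduction for SIFT: average a track's descriptors over time,
    then renormalise to unit length (the renormalisation is an
    assumption of this sketch)."""
    mean = np.mean(np.asarray(descriptors, dtype=float), axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean
```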

Reference

Plain text format
  A. Xompero, O. Lanz, A. Cavallaro, A spatio-temporal multi-scale binary descriptor,
  IEEE Transactions on Image Processing, vol. 29, no. 1, pp. 4362-4375, Dec. 2020
Bibtex format
 
  @Article{Xompero2020TIP_MST,
    title={A spatio-temporal multi-scale binary descriptor},
    author={Alessio Xompero and Oswald Lanz and Andrea Cavallaro},
    journal={IEEE Transactions on Image Processing},
    volume={29},
    number={1},
    pages={4362--4375},
    month={Dec},
    year={2020},
    issn={1057-7149},
    doi={10.1109/TIP.2020.2965277}
  }