Image Retrieval based Robot Global Localization

In this post, we would like to share our experience in image retrieval problem for robot global localization with machine learning. We adopted the method proposed by HF-Net [2]. Our work was mainly to apply and test this HF-Net on our environment. Our test pipeline implementation can be found in [17].

Before we start, we’d like to explain “global localization” with a simple example. Assume the situation that you fall asleep on the subway, and suddenly wake up and need to get off. The first thing you would do is looking for the signs that indicate where you are, then find your place in a subway map.

In navigation field, this kind of problem is called “global localization” or “relocalization”. Global localization is localization with a given map, but does not use recent location records. Typical localization, on the other hand, can use both on determining current location [1].

Global localization is generally considered to be a more difficult problem than localization because recent location records are not available [1]. For example, if a person was in Seoul one minute ago, his current location is likely to still be Seoul. He can’t move to Paris in one minute. However, if there is no recent records, we have to estimate his current location across all cities around the globe. Search area is far bigger.

This problem is more frequent for robots. Robots can be easily turned on and off. If the robot has been moved by person while it was turned off, (this happens a lot in the real world) the robot must determine its current location without recent location records(=global localization). This function is essential for robots to continue its work without human intervention.

  • Figure 1. Where am I?

Problem formulation

First, our target is an indoor environment, such as a house, an office and a shopping mall.

Second, a map and a current RGB camera image are given as inputs. We choose camera as our primary sensor because we believe that global localization should utilize vision data. In global localization, due to the shortage of available information, a robot has to utilize various information at current status. For various information, vision data is essential.

Third, we assume that the map has the form of a graph, which is similar to a subway map. We adopt a graph map representation from “topological navigation” [3],[14]. In this representation, each node represents a point on a 2D space. A single or multiple RGB camera images on each point are assigned to each node.

Fourth, the output is the ID of the nearest node from the graph map. Typical global localization algorithms output numerical 6 DOF pose of the robot, such as [2],[4]. In “topological navigation”, the localization algorithm outputs the nearest node ID rather than 6 DOF pose, like in [3].

But why topological representation? Topological representations of maps help to handle vision information relatively easily. We don’t have to care about kinematics, coordinates or bundle adjustment. In the future, we should consider every information available. But for now, we will focus on vision data as our first step. We still store 2D (x, y) pose information at each node, so we can get approximate pose information when we find the closest node ID. Even this can greatly improve the performance of following localization process.

Finally, here is an example of the pipeline.

  1. place a robot in a random position
  2. update a graph map for current space
  3. update a RGB camera image as current observation
  4. run global localization process with the map and the current observation
  5. get the closest node ID among the map nodes
  6. get pose information assigned to the closest node
  7. start following localization or navigation process with the pose information

Problem summary

As you may have already noticed, this simplification of the problem makes it very similar to “visual place recognition” or “image retrieval" problem [5]. It can be converted to the problem finding the most similar (=closest in distance) image (=node) to the query image (=current observation) in the image database (=graph map).

  • Figure 2. Problem formulation
  • Figure 3. Example of graph map & included information

Hierarchical localization [2]

As mentioned above, we simplified the problem so that it is similar to image retrieval problem. Therefore, we looked for a suitable method in the field of image retrieval.

Traditional image retrieval approaches include methods using global descriptor and methods using a local descriptor. Global descriptor methods use feature that describes the whole image. HOG (histogram of oriented gradients) and HOF (histogram of optical flow) are examples of this category. Local descriptor methods use feature that describes a part of an image, a patch. Therefore, it is possible to extract several dozen or even thousands of feature vectors from a single image. Representative methods are ORB [11], SIFT [12], etc.

These two methods have opposite advantages and disadvantages. Global descriptor methods cost relatively small computing resources because there is a single feature vector per image. However, it is not robust to figure out fine or partial changes. Local descriptor methods typically have better image retrieval performance. In contrary, since a number of features need to be compared, the calculation takes a long time.

In image retrieval using deep learning, there are also two similar approaches, and characteristics tend to be similar. Global descriptor methods, commonly referred to as the end-to-end method in the deep learning field, include NetVLAD [5], Siamese network [10], etc. Local descriptor methods, represented by methods such as Superpoint [15], also exist.

We searched for the method that combines the advantages of these two approaches, and finally adopted Hierarchical localization (HF-Net) [2]. This method combines two approaches in hierarchical manner.

  • Figure 4. HF-Net pipeline (This picture is from HF-Net paper [2])

To briefly explain HF-Net, first, “prior retrieval” stage uses the global descriptor method to select k nearest neighbor images. Then, at “local feature matching” stage, the pose of the robot is estimated by applying the local descriptor method (Superpoint) to the k images. (There is an intermediate stage called “SfM reconstruction” between “prior retrieval” and “local feature matching”, but it is excluded because it is not necessary in our problem formulation.)

In the original, localization stage solves “perspective-n-point”(PnP) problem with RANSAC [13], and outputs numeric 6 DOF pose. In our article, we don’t need 6 DOF pose information, so we only compared matching scores between k images and current observation image. The matching score is obtained with local descriptors(Superpoint). The image that has maximum matching score is our final answer.

  • Figure 5. Image retrieval version of Hierarchical localization for this test



We chose the simulation environment as the first environment for testing global localization methods. We want our applying method to work well in various environments. For this purpose, simulation is essential.

We used AI Habitat [6], developed by FAIR, as the simulator. For the indoor environment dataset, we used Matterport3D dataset [7] released by Matterport, Inc. This dataset includes 90 buildings consisting of 1 to 3 floors, so we can test indoor environments on a total of 120 floors.

  • Figure 6. AI Habitat & Matterport3D dataset (This picture is from Habitat paper [6])


We also tested the methods in real-world environment. Due to practical limitations, we chose one level of office building. The real robot used for this experiment is a typical two wheel mobile robot made by CalmanTech [18]. Images are acquired through Intel Realsense D435 camera. Wheel odometry and Hokuyo lidar sensor are utilized for building top-down grid map. This grid map is only used for evaluating global localization performance.

  • Figure 7. Real world top-down map & Observation trajectory


Building graph map

Before we test global localization, we have to generate graph maps first.

In previous studies, graph maps are generated from pre-recorded exploration trajectories [8] or series of random actions [3]. However, the exploration trajectories available in our accessible dataset did not cover whole area of total 120 floors. In case of the random action method, the robot could not easily get out of a room because of narrow doorways.

Due to this issue, we extracted our own exploration trajectories from 2D top-down map. We got the 2D top-down maps of the entire space using the function in simulator, and applied skeletonization method [9] on the map image to get the graph that covers all paths in the space (blue dots in figure 8). Each dot is a single node for the map.

Then, we stored camera observations at each node location on this graph to build the map. We captured four images per node with 90 degree intervals as shown in figure 3 and figure 9. To emphasize “global” localization problem, the direction of camera view is not fixed. The observation of each node has a different yaw angle that is generated randomly.

For the real world test, we generated a relatively simple exploration trajectory by our own hands (red dots in figure 7). We captured three images per node with 90 degree intervals as shown in figure 10.

For the implementation for building graph map and test, you can see our repository [17].

  • Figure 8. Graph map(blue dots) from simulation top-down map & Node distance metric


  • Accuracy: Is the predicted node located within 1.0 m from the ground-truth nearest node?
    • if (error ≤ 1.0 m): True, else: False
    • error: Euclidean distance between predicted node and ground truth node
  • Node distance: How far is the predicted node from the ground-truth nearest node? (fig. 8)
    • Euclidean distance
    • Average distance of all samples

Test detail

Current position sampling. For simulation, position for test is sampled randomly in the entire gray area in fig.8 where the robot can explore. It is not limited to the nodes on the graph map. For real-world, position is sampled among the explored trajectory (green dots in figure 7).

Statistics. For simulation, 100 current positions are sampled for test at each level. We used total 37 spaces in Matterport dataset, so the number of total position samples(=query) is 3,700. The number of nodes in the graph map per level varies from 150 to 3200. The accumulated number of total map nodes(=database) is 53,439. For real-world, there are 539 current position samples, and 891 nodes in the graph map.

Observation. An observation is a concatenated image of multiple images obtained from multiple cameras looking at different directions. For simulation, four cameras are used. For real-world, three cameras are used due to the shortage of camera installation space. The observation images are obtained in the same shape and manner as the observations in the map. The cameras have 69 FOV and the size of the single image is 640 X 480 in both simulation and real-world.

Algorithm implementation. We used Hierarchical localization implementation from [16]. This repository is maintained by the authors of HF-Net.

  • Figure 9. Observation for simulation test
  • Figure 10. Observation for real-world test

Result & Conclusion

Environment Method Accuracy Node distance [m] (std)
simulator ORB (brute force match) 0.930 0.338 (0.574)
simulator NetVLAD 0.967 0.234 (0.383)
simulator NetVLAD + Superpoint 0.982 0.174 (0.289)
real world ORB (brute force match) 0.799 1.373 (3.714)
real world NetVLAD 0.839 0.596 (0.838)
real world NetVLAD + Superpoint 0.892 0.465 (0.766)

In addition to the excellent accuracy result, the average error distance is within 0.2 m in simulation and within 0.5 m in real-world. This result is acceptable for global localization. As mentioned in the HF-Net paper, the result from the method combined with Superpoint performed better than NetVLAD alone. Considering that the largest simulation level has a space of 100 m in diameter and the real-world environment has a diameter of 40 m, this is a performance that can sufficiently assist the subsequent localization process.

One thing to note is that we did not perform any fine-tuning or few-shot training with our dataset. We applied NetVLAD and Superpoint as off-the-shelf weights. It worked well even though the resolution and ratio of the images were different. Another surprising thing is that NetVLAD was trained using dataset acquired in outdoor environment (Google roadview) rather than indoor space. In addition to NetVLAD, we also trained Siamese network with indoor dataset from Habitat, and we found out that off-the-shelf NetVLAD performs better than our trained network.



Sebastian Thrun. Probabilistic robotics. Communications of the ACM, 2002.


Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Computer Vision and Pattern Recognition (CVPR), 2019.


Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations (ICLR), 2018.


A. Kendall, M. Grimes, and R. Cipolla. PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In International Conference on Computer Vision (ICCV), 2015.


R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.


Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In International Conference on Computer Vision (ICCV), 2019.


A. Chang et al. Matterport3d: Learning from rgb-d data in indoor environments. In International Conference on 3D Vision (3DV), 2017.


X. Meng, N. Ratliff, Y. Xiang, and D. Fox. Scaling local control to large-scale topological navigation. In IEEE The International Conference on Robotics and Automation (ICRA), 2020.


T. Y. Zhang and C. Y. Suen. A fast parallel algorithm for thinning digital patterns. Communications of the ACM, Volume 27, Number 3, 1984.


Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015.


R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.


David G Lowe. Distinctive image features from scaleinvariant keypoints. IJCV, 2004.


O. Chum and J. Matas. Optimal Randomized RANSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1472 –1482, 2008.


D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta. Neural topological slam for visual navigation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.


Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshop, 2018.


Paul-Edouard Sarlin et al. Hierarchical localization [Source code].


C. Lee, J. Choi, C. Park. Image Retrieval Testbed for Indoor Robot Global Localization [Source code].


“Calman Tech homepage”,



Chaehyeuk LEE

Research Scientist

github   github