2D Modalities

Sasha Sax edited this page Feb 17, 2017 · 3 revisions

Intro

The dataset contains densely sampled RGB images per scan location. These images were sampled from equirectangular images that were generated per scan location and modality using the raw data captured by the scanner. All images in the dataset are stored in full high-definition at 1080x1080 resolution. For more details on the random sampling of RGB images, see Section 4.2 of the paper. We provide the camera metadata for each generated image.

We also provide depth images that were computed on the 3D mesh instead of directly on the 3D point cloud, as well as surface normal images. 2D semantic annotations are computed for each image by projecting the 3D mesh labels onto the image plane. Because of geometric artifacts in the mesh model, caused mainly by the level of detail in the reconstruction, the 2D annotations occasionally show small local misalignments with the underlying pixels, especially for points that are close to the camera. This issue can be addressed by fusing image content with the projected annotations using graphical models. The dataset also includes 3D coordinate encoded images where each pixel encodes the X, Y, Z location of the point in the world coordinate system. Last, an equirectangular projection is also provided per scan location and modality.

Technical Details

Pose: The pose files contain camera metadata for each image and are given in the /pose subdirectories. Their filenames are globally unique because camera uuids are not shared between areas. Each file is stored as json and contains the following fields:

{
  "camera_k_matrix":  # The 3x3 camera K matrix. Stored as a list-of-lists, 
  "field_of_view_rads": #  The Camera's field of view, in radians, 
  "camera_original_rotation": #  The camera's initial XYZ-Euler rotation in the .obj, 
  "rotation_from_original_to_point": 
  #  Apply this to the original rotation in order to orient the camera for the corresponding picture, 
  "point_uuid": #  alias for camera_uuid, 
  "camera_location": #  XYZ location of the camera, 
  "frame_num": #  The frame_num in the filename, 
  "camera_rt_matrix": #  The 4x3 camera RT matrix, stored as a list-of-lists, 
  "final_camera_rotation": #  The camera Euler in the corresponding picture, 
  "camera_uuid": #  The globally unique identifier for the camera location, 
  "room": #  The room that this camera is in. Stored as roomType_roomNum_areaNum 
}
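As a minimal sketch of how these fields might be consumed, the helper below converts an already-decoded pose dict into NumPy arrays. The name `parse_pose` and the choice of returned keys are hypothetical and not part of the repo's utils.py; only the json field names come from the listing above. The room string is split from the right and its parts are kept as strings, since nothing above guarantees that they are purely numeric.

```python
import numpy as np

def parse_pose(pose):
    """Convert a decoded pose dict (e.g. from json.load) into NumPy-friendly fields.

    Hypothetical helper: the key names read here follow the wiki listing,
    the returned key names are illustrative only.
    """
    # "room" is documented as roomType_roomNum_areaNum; split from the right
    # and keep the parts as strings (assumption: they may not be plain integers).
    room_type, room_num, area_num = pose["room"].rsplit("_", 2)
    return {
        "K": np.asarray(pose["camera_k_matrix"], dtype=float),    # camera intrinsics
        "RT": np.asarray(pose["camera_rt_matrix"], dtype=float),  # camera extrinsics
        "location": np.asarray(pose["camera_location"], dtype=float),
        "fov_degrees": float(np.degrees(pose["field_of_view_rads"])),
        "room_type": room_type,
        "room_num": room_num,
        "area_num": area_num,
    }
```

A typical call would pass the result of `json.load` on one file from a /pose subdirectory, for example `parse_pose(json.load(open(path)))` with `path` pointing at a pose file.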

Note that for equirectangular images the recorded FOV will be 90 degrees, even though the image is actually panoramic and therefore covers a 360-degree view.

RGB: RGB images are in the /rgb folder and contain synthesized but accurate RGB images of the scene.

Depth: Depth images are stored as 16-bit PNGs and have a maximum depth of 128 m and a sensitivity of 1/512 m. Missing values are encoded with the value 2^16 - 1. Note that depth in the sampled images is defined relative to the plane of the camera (z-depth), whereas in the panoramas it is defined as the distance from the point-center of the camera.
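The encoding above implies that a raw pixel value divided by 512 gives the depth in meters. The sketch below shows that conversion on an array that has already been loaded. The function name `depth_to_meters` is hypothetical, and it assumes the image loader preserved the 16-bit values rather than rescaling them to 8 bits.

```python
import numpy as np

DEPTH_SCALE = 512.0        # one raw unit corresponds to 1/512 m
MISSING_DEPTH = 2**16 - 1  # raw value reserved for missing depth

def depth_to_meters(raw):
    """Convert a raw 16-bit depth array to float32 meters, with NaN where depth is missing."""
    raw = np.asarray(raw)
    meters = raw.astype(np.float32) / DEPTH_SCALE
    meters[raw == MISSING_DEPTH] = np.nan
    return meters
```

Using NaN for missing pixels is a design choice made here so that they cannot be mistaken for a valid reading near 128 m; a separate boolean mask would work equally well.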

Global XYZ: Global XYZ images contain the ground-truth location of each pixel in the mesh. They are stored as 16-bit 3-channel OpenEXR files, and a convenience read-in function is provided in /assets/utils.py. These can be used for generating point correspondences, e.g. for optical flow. Missing values are encoded as #000000.

Normal: Normals are 127.5-centered per-channel surface normal images. For panoramic images, these normals are relative to the global coordinate system. Since the global coordinate system is impossible to determine from a sampled image, the normal images in /data have their normals defined relative to the direction the camera is facing. The axis-color convention for the normals is the same one used by NYU RGB-D. Areas where the mesh is missing have pixel color #808080.
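A minimal sketch of decoding such an image into vectors follows. The text above fixes the center at 127.5; dividing by 127.5 so that each component lands in [-1, 1] is an assumption, as is the hypothetical function name `decode_normals`. Pixels with the missing-mesh color #808080 are reported in a separate mask.

```python
import numpy as np

def decode_normals(rgb):
    """Map an 8-bit normal image to float vectors and flag missing-mesh pixels.

    Assumption: components are scaled by 127.5 around the 127.5 center,
    so 0 maps to -1 and 255 maps to +1.
    """
    rgb = np.asarray(rgb)[..., :3]  # ignore an alpha channel if the loader adds one
    normals = (rgb.astype(np.float32) - 127.5) / 127.5
    missing = np.all(rgb == 0x80, axis=-1)  # #808080 marks areas with no mesh
    return normals, missing
```

Because of 8-bit quantization the decoded vectors are only approximately unit length, so renormalizing them before use may be worthwhile.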

Semantic: Semantic images come in two variants, semantic and semantic_pretty. They both include information from the point cloud annotations, but only the semantic version should be used for learning. The labels can be found in assets/semantic_labels.json, and images can be parsed using some of the convenience functions in utils.py. Specifically, the semantic images are encoded as 3-channel 8-bit PNGs. Each pixel's three channels are interpreted as a 24-bit base-256 integer, which is an index into the labels array in semantic_labels.json.
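The base-256 decoding can be sketched for a whole image at once as shown below. This is an illustrative helper named `semantic_index_map`, not the repo's get_index, and it assumes the first (red) channel is the most significant digit; check that assumption against utils.py before relying on it.

```python
import numpy as np

def semantic_index_map(img):
    """Decode every pixel of a semantic image into its integer label index.

    Assumption: channel 0 is the most significant base-256 digit.
    """
    rgb = np.asarray(img)[..., :3].astype(np.uint32)  # widen to avoid uint8 overflow
    return rgb[..., 0] * 256 * 256 + rgb[..., 1] * 256 + rgb[..., 2]
```

The result has the same height and width as the input, so the indices of all pixels can be looked up in the labels array without a Python loop.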

To make this concrete, take the following semantic panorama:

Semantic image from teaser

Let's say that you've loaded the image into memory as a numpy array called img and want the label for the pixel at (1500, 2000), which is the leftmost sofa chair in this image. utils.py provides get_index, load_labels and parse_label for extracting the label information. Here is what your code might look like:

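# Note: scipy.misc.imread was removed in SciPy 1.2; on newer installs another reader such as imageio's imread can be used instead
from scipy.misc import imread
from assets.utils import get_index, load_labels, parse_label  # Assets should be downloaded from this repo

labels = load_labels('/path/to/assets/semantic_labels.json')

img = imread('/path/to/image.png')
pix = img[1500, 2000]  # color of the pixel at index (1500, 2000)
instance_label = labels[get_index(pix)]  # decode the color into an index into labels
instance_label_as_dict = parse_label(instance_label)
print(instance_label_as_dict)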

This gives {'instance_num': 5, 'instance_class': u'sofa', 'room_num': 3, 'room_type': u'office', 'area_num': 3}. Here we can see that this is the 5th instance of class 'sofa', located in office 3 of area 3.

Finally, note that pixels where the data is missing are encoded with the color #0D0D0D. Decoded as an index, this value (855309) is larger than len( labels ), so it does not correspond to any entry in the labels array.
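One way to exclude these pixels before looking up labels is to build a boolean mask from the missing-data color, as in the sketch below. The name `missing_mask` is hypothetical and the function expects an image already loaded as an 8-bit array.

```python
import numpy as np

MISSING_COLOR = np.array([0x0D, 0x0D, 0x0D], dtype=np.uint8)  # #0D0D0D

def missing_mask(img):
    """Return a boolean mask that is True where a semantic pixel has the missing-data color."""
    rgb = np.asarray(img)[..., :3]  # ignore an alpha channel if the loader adds one
    return np.all(rgb == MISSING_COLOR, axis=-1)
```

Pixels where the mask is True should be skipped or assigned an ignore label, since indexing the labels array with their decoded value would fail.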
