We propose a robust approach to annotating independently moving objects (IMOs) captured by head-mounted stereo cameras worn by an ambulatory (and visually impaired) user. First, sparse optical flow is extracted from a single image stream, alongside dense depth maps. Then, under the assumption that the apparent motion induced by camera egomotion is dominant, the flow corresponding to IMOs is robustly segmented using MLESAC. Next, the mode depth of the feature points defining this flow (the foreground) is obtained by aligning them with the depth maps. Finally, a bounding box is scaled in proportion to this mode depth and robustly fitted to the foreground points such that the number of inliers is maximised.
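
The last two steps (mode depth, then an inlier-maximising bounding box) can be illustrated with a minimal sketch. It is not the implementation described here, only one plausible reading under stated assumptions: the mode is taken over a fixed-width depth histogram, the box size shrinks inversely with depth as under a pinhole camera model, and candidate boxes are centred on each foreground point in turn. The function names `mode_depth` and `fit_box`, and the parameters `bin_width`, `ref_size` and `ref_depth`, are hypothetical.

```python
from collections import Counter


def mode_depth(depths, bin_width=0.25):
    """Return the centre of the most populated depth bin.

    Assumption: depths are in metres and a fixed-width histogram is an
    adequate mode estimator. Ties are broken in favour of the nearer bin.
    """
    bins = Counter(int(d // bin_width) for d in depths)
    best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) * bin_width


def fit_box(points, depth, ref_size=(200.0, 400.0), ref_depth=1.0):
    """Fit a depth-scaled box to 2D points so that the inlier count is maximal.

    Assumption: the box measures ref_size pixels at ref_depth and scales
    inversely with depth. Each point is tried as the box centre, and the
    first centre with the highest inlier count wins.
    Returns ((x0, y0, width, height), inlier_count).
    """
    w = ref_size[0] * ref_depth / depth
    h = ref_size[1] * ref_depth / depth
    best_box, best_inliers = None, -1
    for cx, cy in points:
        x0, y0 = cx - w / 2.0, cy - h / 2.0
        inliers = sum(
            1 for x, y in points
            if x0 <= x <= x0 + w and y0 <= y <= y0 + h
        )
        if inliers > best_inliers:
            best_box, best_inliers = (x0, y0, w, h), inliers
    return best_box, best_inliers
```

In use, the depths would be sampled from the dense depth map at the foreground feature locations, and the points would be the image coordinates of those same features. Stray features left over from the flow segmentation then fall outside the fitted box rather than inflating it.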