Thanks to the ultra-reliable low-latency communication (URLLC) capability of emerging 5G mobile networks, information derived from static roadside surveillance or on-board moving IoT sensors (e.g., video cameras, Radars, and Lidars) can be jointly exploited by mobile edge computing (MEC) and shared in real time among all local connected users for various smart city applications. To achieve this goal of coordinated mining of different modalities of IoT data, all detected/segmented and tracked human/vehicle objects need to be 3D-localized in world coordinates for effective 3D understanding of the evolving local scene.

In this talk, I will discuss several challenges and potential solutions. Specifically, a robust tracking and 3D localization method for objects detected from either static or moving monocular video cameras is proposed, based on a variant of the Cascade R-CNN detector trained with a triplet loss to obtain, in one shot, both accurate localization and discriminative identity-aware features for the tracking association of each detected object, even under long-term occlusion.

When cameras fail to perform these tasks reliably due to poor lighting or adverse weather conditions, Radars and Lidars can offer more robust localization than monocular cameras. However, the semantic information provided by radio or point cloud data is limited and difficult to extract. I will therefore also introduce a radio object detection network (RODNet) that detects objects purely from radio signals captured by Radar, based on an innovative cross-modal supervision framework: the rich information extracted from the camera is used to teach object detection to the Radar, without tedious and laborious human labelling of ground truth on the Radar signals. Moreover, to compensate for the weakness of Lidar detection on far-away small objects, Lidar-based detections are effectively integrated with 2D object detections and 3D localization from monocular images through 3D tracking associations, achieving superior tracking and 3D localization performance.

Finally, an efficient 3D human pose estimation method for describing the actions of detected humans in natural monocular videos is also presented, enabling finer-grained 3D scene understanding for smart city applications.
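
To make the identity-aware feature idea concrete, below is a minimal sketch of training an embedding head with a triplet loss, in the spirit of the Cascade R-CNN variant described above. All names and values here (EmbeddingHead, the margin, the feature dimensions) are illustrative assumptions, not the talk's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Maps per-object RoI features to an L2-normalized identity embedding.
    Hypothetical module for illustration; dimensions are assumptions."""
    def __init__(self, in_dim=1024, emb_dim=128):
        super().__init__()
        self.fc = nn.Linear(in_dim, emb_dim)

    def forward(self, roi_feats):
        return F.normalize(self.fc(roi_feats), dim=-1)

triplet_loss = nn.TripletMarginLoss(margin=0.3)

def identity_loss(head, anchor_feats, positive_feats, negative_feats):
    # anchor/positive: RoI features of the same object identity in
    # different frames; negative: a different identity. The loss pulls
    # same-identity embeddings together and pushes different identities
    # apart, which is what makes the features usable for tracking
    # association across long-term occlusions.
    a = head(anchor_feats)
    p = head(positive_feats)
    n = head(negative_feats)
    return triplet_loss(a, p, n)
```

At inference time, the embedding is computed once per detection alongside the bounding box (hence "one shot") and matched against the embeddings of existing tracks.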
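
The cross-modal supervision framework can be illustrated with a hedged sketch: camera detections are projected into the radar's range-azimuth plane and rendered as Gaussian confidence maps that serve as pseudo ground truth, so no human labels the radar data. The grid sizes, the Gaussian width, and the projection step are assumptions for illustration, not RODNet's published configuration.

```python
import numpy as np

def render_confidence_map(object_locations, num_range_bins=128,
                          num_azimuth_bins=128, sigma=2.0):
    """object_locations: list of (range_bin, azimuth_bin) indices obtained
    by projecting camera-detected objects into radar coordinates.
    Returns a pseudo ground-truth confidence map for training."""
    conf = np.zeros((num_range_bins, num_azimuth_bins), dtype=np.float32)
    r_idx, a_idx = np.meshgrid(np.arange(num_range_bins),
                               np.arange(num_azimuth_bins), indexing="ij")
    for (r, a) in object_locations:
        g = np.exp(-((r_idx - r) ** 2 + (a_idx - a) ** 2) / (2 * sigma ** 2))
        conf = np.maximum(conf, g)  # keep the strongest response per cell
    return conf
```

A radar detection network is then trained to regress such maps directly from raw radio frames, e.g. with a pixel-wise loss, replacing manual annotation with camera-derived supervision.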
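
For the monocular 3D localization mentioned above, a common minimal approach is to back-project the bottom center of a 2D bounding box onto the ground plane (z = 0 in world coordinates). The sketch below assumes known camera intrinsics K and world-to-camera extrinsics (R, t) from calibration; it is a flat-ground approximation for illustration, not the talk's full method.

```python
import numpy as np

def localize_on_ground(bbox_xyxy, K, R, t):
    """bbox_xyxy: (x1, y1, x2, y2) in pixels. K: 3x3 intrinsics.
    R, t: world-to-camera rotation and translation (x_cam = R @ x_world + t).
    Returns the (X, Y) world position of the object's footprint."""
    u = (bbox_xyxy[0] + bbox_xyxy[2]) / 2.0  # bottom-center pixel of the box
    v = bbox_xyxy[3]
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray, camera frame
    ray_world = R.T @ ray_cam                           # rotate ray into world frame
    cam_center = -R.T @ t                               # camera center in world frame
    # Intersect the ray cam_center + s * ray_world with the plane z = 0.
    s = -cam_center[2] / ray_world[2]
    point = cam_center + s * ray_world
    return point[0], point[1]
```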
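
Finally, a hedged sketch of the 3D tracking association used to fuse Lidar detections with camera-based estimates: matching can be posed as a linear assignment over pairwise 3D distances. The gating threshold is an illustrative assumption; the full pipeline described in the talk also exploits the identity-aware appearance features.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, lidar_positions, max_dist=2.0):
    """track_positions: (N, 3) predicted 3D positions of existing tracks.
    lidar_positions: (M, 3) 3D positions of Lidar detections.
    Returns a list of (track_idx, detection_idx) matches."""
    if len(track_positions) == 0 or len(lidar_positions) == 0:
        return []
    # Pairwise Euclidean distances form the assignment cost matrix.
    cost = np.linalg.norm(track_positions[:, None, :] -
                          lidar_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # Reject matches beyond the gating distance (assumed threshold).
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```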