CVPR 2022 Recap; Meituan Proposes YOLOv6; Tencent Invests in Data Processing Unit Firm
Weekly China AI News from June 20 to June 26
In this week’s issue, we zero in on CVPR 2022 and research highlights from Chinese institutes, companies, and universities. Also, meet YOLOv6, proposed by Chinese food delivery giant Meituan.
News of the Week
CVPR 2022 Recap
What’s new: CVPR 2022, the premier computer vision conference, returned in a hybrid format with both in-person and virtual attendance, the first time in-person attendance has been allowed since 2019. Over 6,000 participants flocked to New Orleans.
Stats: The conference set a new record with 8,161 paper submissions, of which 2,064 were accepted, an acceptance rate of about 25.3%. Of the 23,389 authors, 44.59% come from China, followed by the U.S. and Korea.
R.I.P.: The CVPR committee also commemorated Dr. Jian Sun, Chief Scientist at Megvii Technology and Dean of Megvii Research Center, who passed away last week due to sudden illness. For more about Dr. Sun, check out our last issue.
Best Papers: The paper Learning to Solve Hard Minimal Problems, from ETH Zurich, University of Washington, Georgia Institute of Technology, and Czech Technical University, won the Best Paper Award. The paper EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation, from Tongji University and Alibaba, won the Best Student Paper Award. Its abstract reads:
Locating 3D objects from a single RGB image via Perspective-n-Points (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distribution. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based method and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.
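To make the "Softmax in the continuous domain" idea a little more concrete, here is a toy one-dimensional sketch. It assumes the pose is a single scalar offset y, that a 3D point x "projects" to x + y, and that the pose density is proportional to exp(-cost(y)), where the cost is half the sum of squared weighted residuals. With a target concentrated at the ground-truth pose, the KL loss then reduces (up to a constant) to cost(y_gt) plus the log of the normalizing integral. The function names, the toy projection, and the grid-based integration are all illustrative assumptions, not the authors' implementation, which works with poses on the SE(3) manifold.

```python
import math

def cost(y, points, observed, weights):
    """Half the sum of squared weighted residuals for a candidate pose y.

    Toy assumption: a 3D point x 'projects' to x + y (a 1D stand-in for PnP).
    """
    return 0.5 * sum((w * (x + y - u)) ** 2
                     for x, u, w in zip(points, observed, weights))

def kl_loss(y_gt, points, observed, weights, lo=-5.0, hi=5.0, steps=2001):
    """cost(y_gt) + log of the integral of exp(-cost(y)) over [lo, hi].

    The integral plays the role of the Softmax denominator; here it is
    approximated with a simple Riemann sum on a fixed grid.
    """
    dy = (hi - lo) / (steps - 1)
    grid = [lo + i * dy for i in range(steps)]
    neg_costs = [-cost(y, points, observed, weights) for y in grid]
    m = max(neg_costs)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(c - m) for c in neg_costs) * dy)
    return cost(y_gt, points, observed, weights) + log_z

points = [0.0, 1.0, 2.0]
weights = [1.0, 1.0, 1.0]
good = [x + 0.5 for x in points]  # correspondences consistent with y_gt = 0.5
bad = [x - 1.0 for x in points]   # correspondences that favor y = -1.0

loss_good = kl_loss(0.5, points, good, weights)
loss_bad = kl_loss(0.5, points, bad, weights)
print(f"loss with consistent correspondences:   {loss_good:.4f}")
print(f"loss with inconsistent correspondences: {loss_bad:.4f}")
```

In this sketch the loss is lower when the 2D-3D correspondences agree with the ground-truth pose, which is the signal that would let the coordinates and weights be learned as intermediate variables.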
Dr. Fei-Fei Li from Stanford University won the Thomas Huang Memorial Prize. An AI visionary, Dr. Li has set an admirable example with her outstanding efforts in education, research & services in our AI community and beyond.
Chinese CV leader SenseTime had a strong presence at CVPR as usual, with 71 papers accepted. A research team from Nanyang Technological University, Sun Yat-Sen University, UCLA, and SenseTime Research proposed a novel music-to-dance framework named Bailando, which consists of two components: 1) a choreographic memory that learns to summarize meaningful dancing units from 3D pose sequences into a quantized codebook, and 2) an actor-critic Generative Pre-trained Transformer (GPT) that composes these units into a fluent dance coherent with the music. The paper’s abstract continues:
With the learned choreographic memory, dance generation is realized on the quantized units that meet high choreography standards, such that the generated dancing sequences are confined within the spatial constraints. To achieve synchronized alignment between diverse motion tempos and music beats, we introduce an actor-critic-based reinforcement learning scheme to the GPT with a newly-designed beat-align reward function. Extensive experiments on the standard benchmark demonstrate that our proposed framework achieves state-of-the-art performance both qualitatively and quantitatively. Notably, the learned choreographic memory is shown to discover human-interpretable dancing-style poses in an unsupervised manner.
Papers & Projects
YOLOv6: a single-stage object detection framework dedicated to industrial applications
Researchers from Meituan introduced (MT-)YOLOv6, a single-stage object detection framework dedicated to industrial applications, with a hardware-friendly, efficient design and high performance. Positioned as an improvement over YOLOv5 and YOLOX, YOLOv6-nano achieves 35.0 mAP on the COCO val2017 dataset at 1242 FPS on a T4 GPU using TensorRT FP16 with batch size 32, and YOLOv6-s achieves 43.1 mAP on COCO val2017 at 520 FPS under the same settings. It remains unclear, though, whether Meituan is entitled to name its models after the YOLO series.
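For readers who think in latency rather than throughput, the reported FPS figures can be converted with simple arithmetic. The sketch below assumes that FPS here means images per second aggregated over the whole batch of 32, which is an assumption about how the numbers were measured; the helper names are hypothetical.

```python
def per_image_ms(fps):
    """Average time per image in milliseconds at a given throughput."""
    return 1000.0 / fps

def per_batch_ms(fps, batch_size):
    """Average time to process one full batch in milliseconds."""
    return batch_size * 1000.0 / fps

# Throughput figures as reported above (T4, TensorRT FP16, batch size 32).
reported_fps = {"YOLOv6-nano": 1242, "YOLOv6-s": 520}
BATCH_SIZE = 32

for name, fps in reported_fps.items():
    print(f"{name}: {per_image_ms(fps):.3f} ms/image, "
          f"{per_batch_ms(fps, BATCH_SIZE):.2f} ms/batch")
```

Note that a batched throughput figure says little about single-image latency, since a request may wait for the whole batch to finish.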
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Researchers from ByteDance AI Lab proposed a new method called X-VLM to perform multi-grained vision language pre-training. The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts while aligning the texts with those visual concepts at multiple granularities. Experimental results show that X-VLM effectively applies the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
Jaguar Microsystems, a Tencent-backed data processing unit developer, has raised hundreds of millions of RMB in a new financing round, valuing the company at RMB 9 billion. Founded in 2020, the Shenzhen-based company develops a new generation of DPUs and advanced silicon solutions for modern data centers.
XYZ Robotics, an AI-powered robotic technology company, has raised almost $40 million in its Series B+ funding round led by Capital Today, followed by Gaorong Capital, 5Y Capital, and Source Code Capital. Founded in 2018, the company, based in Shanghai and Massachusetts, has developed leading 3D vision, motion planning, and end-of-arm tooling technologies. Building on these, XYZ provides turnkey solutions for piece picking, palletizing and depalletizing, deep bin picking, and assembly for logistics and manufacturing customers.
Helixon (华深智药), a startup that builds next-generation AI solutions for protein-based therapeutics, has raised RMB 500 million in its Series A funding round led by 5Y Capital. Founded in 2021, the company aims to develop AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies.