Baidu Proposes Largest Text-to-Image Model; XPeng Doubles Down on Self-Driving Tech; AI Firms Reportedly Laying Off Employees
Weekly China AI News from Oct 24 to Oct 30
Dear readers, the text-to-image model party has a new face: ERNIE-ViLG 2.0, which claims better zero-shot performance than DALL-E 2 and Stable Diffusion. XPeng flexed its muscles in advanced driver assistance systems (ADAS) and revealed plans for a robotaxi rollout in 2023. Could that pull the embattled EV maker out of its slump? And several major Chinese AI firms reportedly carried out massive layoffs. What happened?
Weekly News Roundup
Meet ERNIE-ViLG 2.0, Baidu’s 24B Text-to-Image Model
What’s new: Last week, Baidu researchers proposed ERNIE-ViLG 2.0, a Chinese text-to-image (TTI) model with 24 billion parameters, which they claim makes it the largest TTI model to date. By comparison, OpenAI’s DALL-E 2 has only 3.5 billion parameters.
By scaling up, ERNIE-ViLG 2.0 produces better image quality than DALL-E 2 and Stable Diffusion in human evaluations and achieves state-of-the-art performance in high-quality image generation on the popular MS-COCO dataset.
Contributions: How did ERNIE-ViLG 2.0 get so big? Baidu researchers adopted a technique called “Mixture-of-Denoising-Experts,” motivated by the idea of employing different expert networks for different stages of the denoising process. The model consists of 10 expert networks, each a denoising U-Net with 2.2 billion parameters, plus a 1.3-billion-parameter Transformer text encoder.
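The routing idea behind a mixture of denoising experts can be sketched in a few lines: the diffusion process is split into blocks of timesteps, and each expert specializes in one block. The timestep count, the uniform split, and the function name below are illustrative assumptions, not the paper’s exact configuration.

```python
# Minimal sketch of Mixture-of-Denoising-Experts routing: a diffusion
# process with T timesteps is divided across N experts, each handling
# one contiguous block of steps. The uniform split is an assumption
# for illustration, not ERNIE-ViLG 2.0's actual schedule.

NUM_TIMESTEPS = 1000   # hypothetical number of denoising steps
NUM_EXPERTS = 10       # ERNIE-ViLG 2.0 uses 10 denoising U-Net experts

def expert_for_timestep(t: int) -> int:
    """Route timestep t (0 = least noisy, T-1 = most noisy) to an expert index."""
    if not 0 <= t < NUM_TIMESTEPS:
        raise ValueError(f"timestep {t} out of range")
    block = NUM_TIMESTEPS // NUM_EXPERTS  # 100 steps per expert here
    return min(t // block, NUM_EXPERTS - 1)

# Only one 2.2B-parameter expert runs at each sampling step, so total
# model capacity grows with the number of experts while per-step
# compute stays that of a single expert.
```

For example, `expert_for_timestep(999)` routes the noisiest step to the last expert, while `expert_for_timestep(0)` routes the final refinement step to the first.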
What distinguishes ERNIE-ViLG 2.0 from other TTI models is knowledge enhancement. Instead of training the model directly on raw image-text pairs, the researchers inject both textual and visual knowledge to guide the model to better understand the semantics of text prompts and to detect objects in the scene.
Paper and demo: The paper, “ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts,” is available on arXiv, and the ERNIE-ViLG demo is hosted on Hugging Face.
XPeng Unveils Vision-Based Neural Nets for Autonomous Driving and Robotaxi Plan
What’s new: XPeng has long advertised its self-driving capabilities to stand out in an increasingly fierce EV battleground. On October 24, celebrated in China as the annual “1024” Programmers’ Day and also XPeng’s Tech Day, the Guangzhou-based EV maker revealed a series of innovations in self-driving tech, including neural networks, robotaxis, and flying cars.
Full-scenario ADAS: XPeng’s ADAS, named XNGP, is branded as China’s most advanced ADAS that does not rely on HD maps. XNGP can navigate everywhere from highways and parking lots to urban expressways and complex city roads.
XNGP is now available on XPeng’s newest premium SUV, the G9, which offers 508 TOPS of computing power, a dual-LiDAR system, 8-megapixel HD cameras, and a new software architecture called XNet.
XNet: XPeng also introduced XNGP’s underlying software architecture, XNet. As a vision-based perception architecture, XNet closely resembles the Tesla FSD neural network revealed at the first Tesla AI Day: both feed raw images collected from multiple cameras through modules such as BiFPN and Transformers, then branch into different tasks such as lane detection. XNet is being trained at Fuyao, Alibaba’s supercomputing center, which provides 600 PFLOPS of computing power.
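At a shape level, the pipeline described above can be sketched as follows. The camera count, feature dimensions, stand-in backbone, fusion step, and head are all illustrative assumptions for this sketch, not XPeng’s actual design.

```python
import numpy as np

# Shape-level sketch of a multi-camera, vision-based perception pipeline:
# per-camera backbones extract features, a fusion stage (BiFPN/Transformer
# modules in XNet) merges them, and task heads consume the shared
# representation. All dimensions below are hypothetical.

N_CAMERAS, H, W, C = 6, 64, 64, 3    # hypothetical camera rig and image size
FEAT_DIM = 128                        # hypothetical shared feature width

def backbone(image: np.ndarray) -> np.ndarray:
    """Stand-in feature extractor: global-average pool + fixed random projection."""
    pooled = image.mean(axis=(0, 1))               # (C,)
    w = np.random.default_rng(0).standard_normal((C, FEAT_DIM))
    return pooled @ w                              # (FEAT_DIM,)

def fuse(features):
    """Stand-in for cross-camera fusion (BiFPN / Transformer in XNet)."""
    return np.stack(features).mean(axis=0)         # (FEAT_DIM,)

def lane_head(fused: np.ndarray) -> np.ndarray:
    """One of several task heads, e.g. lane-detection logits."""
    w = np.random.default_rng(1).standard_normal((FEAT_DIM, 4))
    return fused @ w                               # (4,) hypothetical lane classes

images = [np.zeros((H, W, C)) for _ in range(N_CAMERAS)]
fused = fuse([backbone(img) for img in images])
logits = lane_head(fused)
print(logits.shape)  # (4,)
```

The key design point the sketch illustrates is that all cameras share one fused representation, so additional task heads (traffic lights, depth, and so on) can be attached without re-running the per-camera backbones.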
Robotaxi: XPeng said its G9 SUV has passed an autonomous driving test, clearing the way for a robotaxi rollout. XPeng plans to put a robotaxi fleet on public roads as early as 2023.
Chinese AI Firms Reportedly Carry Out Massive Layoffs
What’s new: Last week, MiningLamp, an AI and big data company known as “China’s Palantir,” reportedly fired 70% of its staff, or thousands of employees. The company had denied a similar report in July. However, posts about layoffs, canceled bonuses, and hiring freezes have flooded China’s recruitment platforms and social media.
Who’s MiningLamp? Founded in 2014, the Beijing-based MiningLamp provides enterprises with big data solutions and transforms data into insights. The company has bagged $786.6M in funding from big-name investors like Sequoia China and Tencent.
Chinese media speculate that aggressive expansion without sustainable revenue growth, combined with a worsening economic environment, is behind the layoffs. US restrictions on advanced AI technologies and semiconductors also cast a shadow over the future of Chinese AI firms.
Other layoffs: MiningLamp is not the only Chinese AI firm gutting its workforce. SCMP reported last week that Megvii, a Chinese facial recognition developer, started a fresh round of job cuts across multiple departments. Biren Technology, a leading Chinese GPU maker that aims to rival Nvidia, also reportedly cut one-third of its employees after the US imposed new chip export controls and TSMC suspended silicon production for the startup.
Trending Research
Towards artificial general intelligence via a multimodal foundation model
Researchers from Renmin University of China, the Beijing Key Laboratory of Big Data Management and Analysis Methods, King Abdullah University of Science and Technology, and the University of Surrey developed a foundation model pre-trained on massive multimodal data that can be quickly adapted to various downstream cognitive tasks. The model exhibits strong imagination ability. The paper has been accepted by Nature Communications.
RCareWorld: A Human-centric Simulation World for Caregiving Robots
Researchers from Shanghai Jiao Tong University, Cornell University, and Columbia University present RCareWorld, a human-centric simulation world for physical and social robotic caregiving, designed with input from stakeholders. The researchers say RCareWorld takes the first step toward building a realistic simulation world for robotic caregiving, one that would enable researchers worldwide to contribute to this impactful field. The paper won the Best RoboCup Paper award at IROS 2022, a premier robotics conference.
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Researchers from Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab, and Sun Yat-Sen University proposed DetCLIP, a paralleled visual-concept pre-training method for open-world detection that draws on knowledge enrichment from a designed concept dictionary. The framework demonstrates strong zero-shot detection performance on multiple datasets. The paper has been accepted at NeurIPS 2022.