🐧 China’s Capital Proposes AGI Measures; Midjourney Enters Chinese Market; Tsinghua University Open Sources Multimodal VisualGLM-6B

Weekly China AI News from May 15 to May 21

May 23, 2023

Dear readers, in this week’s issue, I will discuss a new draft measure by Beijing authorities aimed at promoting the growth of AGI. Midjourney is teaming up with Tencent’s QQ to capitalize on the Chinese market. After gaining huge popularity with ChatGLM-6B, Tsinghua University has now open-sourced another cutting-edge multimodal model, VisualGLM-6B.

China’s Capital Proposes Draft Measures for AGI Development

What’s new: China’s capital is seeking public feedback on a proposal aimed at fostering the growth of Artificial General Intelligence (AGI). On May 12, the Beijing Municipal Science and Technology Commission, along with the Zhongguancun Administrative Committee and others, released a draft document entitled Beijing Measures for Facilitating the Innovative Development of Artificial General Intelligence (2023–2025).

Key Highlights: This draft comprises 21 suggested actions designed to develop large AI models in Beijing. These actions span five fundamental areas: computing, data, technology, applications, and regulations. The full document can be accessed here.

Computing: The draft recommends building robust partnerships with leading public cloud service providers, tapping into their computational capabilities, and facilitating the construction of new data centers.
Data: The draft recognizes the scarcity of high-quality Chinese text corpora and proposes integrating existing open-source Chinese data and internet data to address it. Furthermore, it proposes the creation of new multimodal databases that include text, images, audio, and video.
Technology: The draft notably recommends exploring new avenues for AGI. These include embodied intelligence, general AI entities, and neuromorphic (brain-like) intelligence. This is in addition to following the scaling approach of OpenAI.
Applications: The document suggests applications for AGI across various sectors, including government services, healthcare, scientific research, autonomous driving, finance, and urban governance.
Regulations: The draft encourages the creation of an inclusive and balanced regulatory environment. Notably, the draft advocates proactively seeking support from the Cyberspace Administration of China (CAC), China’s top internet regulatory authority, to establish pilot programs in Zhongguancun.

Midjourney Collaborates with Tencent’s QQ for Entry into the Chinese Market

What’s new: Midjourney, the popular text-to-image software, is making its initial stride into China’s vast generative art market.

On May 17, a WeChat account named “Midjourney AI” announced the beta testing of the Chinese version of Midjourney, starting from 6 pm on May 15. The test, running on Tencent’s QQ platform, is open on Mondays and Fridays at 6 pm and closes to new participants once a maximum number is reached.

However, the “Midjourney AI” account quickly removed its official announcement blog.

Real or not: Chinese tech media 36Kr reported that the beta test for the Chinese version of Midjourney was indeed real. An industry insider told 36Kr that Midjourney headquarters had previously mentioned the beta test of the Chinese version during a user meeting, and QQ is fully supporting the commercialization of Midjourney.

How it works: Each user participating in the beta test can generate 25 images for free in the Midjourney QQ channel. Similar to Midjourney Discord, users can type “/imagine + Prompt” in the general channel to call a Midjourney bot to generate a digital painting.

In addition to generating images, users can also input commands to adjust the generated images, including Upscale, Variation, and Remix. Other features, such as Image Prompt, DM to Bot, and Gallery for mobile, are open to members only.
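To make the command flow above concrete, here is a minimal sketch of how a channel bot might parse an “/imagine” message and enforce the 25-image free allowance. It is an illustration only: the `BetaChannel` class, its `handle` method, and the reply strings are hypothetical, and it assumes each command counts as one image. It does not reflect Midjourney’s or QQ’s actual implementation.

```python
from dataclasses import dataclass, field

FREE_IMAGE_QUOTA = 25  # free generations per beta user, as reported in the article


@dataclass
class BetaChannel:
    """Hypothetical stand-in for a bot channel; not Midjourney's or QQ's real API."""

    used: dict = field(default_factory=dict)  # user id -> images generated so far

    def handle(self, user: str, message: str) -> str:
        # Split "/imagine <prompt>" into the command word and the prompt text.
        command, _, prompt = message.strip().partition(" ")
        prompt = prompt.strip()
        if command != "/imagine":
            return "ignored"  # ordinary chat message, not a generation command
        if not prompt:
            return "error: empty prompt"
        if self.used.get(user, 0) >= FREE_IMAGE_QUOTA:
            return "error: free quota exhausted"
        # Assumption: one command consumes one image from the free allowance.
        self.used[user] = self.used.get(user, 0) + 1
        remaining = FREE_IMAGE_QUOTA - self.used[user]
        return f"queued: {prompt!r} ({remaining} free images left)"


channel = BetaChannel()
print(channel.handle("user_a", "/imagine a panda in ink wash style"))
```

Tracking the count per user keeps one participant’s usage from affecting another’s, which matches the per-user allowance described in the report.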

Why it matters: Midjourney is the first Western generative art company to enter the Chinese market, where it will challenge domestic alternatives like Baidu’s Wenxin Yige. Previously, users in China could only access Midjourney through a VPN. This move has also put a spotlight on Tencent’s QQ, as users are now revisiting the fading social platform. If the launch succeeds, QQ could make a comeback as China’s version of Discord.

Meet VisualGLM-6B, Tsinghua University’s Open-Source Multimodal Model

What’s new: ChatGLM-6B, an open-source dialog language model developed by Tsinghua University, has gained huge popularity in the AI research community. Built on this model, Tsinghua University researchers further open-sourced VisualGLM-6B, a multimodal dialog language model that can understand both text and image inputs.

How it works: To bridge the gap between visual and language models, VisualGLM-6B incorporates the training of the BLIP2-Qformer, bringing the total parameter count to 7.8 billion.

For pre-training, the model uses 30 million high-quality Chinese image-text pairs and 300 million filtered English image-text pairs from the CogView dataset. Chinese and English are weighted equally, an approach intended to align visual information well with ChatGLM’s semantic space. In the fine-tuning phase, VisualGLM-6B was trained on extensive visual question-answering data to generate answers that align with human preferences.

The model is trained using the SwissArmyTransformer (SAT) library, which facilitates the flexible modification and training of Transformer models. This library also supports efficient fine-tuning methods, including LoRA and P-tuning. To be more user-friendly, the project provides a HuggingFace interface, along with an interface based on SAT.
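For readers wondering why LoRA counts as an efficient fine-tuning method, the sketch below compares the number of trainable parameters when updating a full weight matrix with the number needed for a low-rank update. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not VisualGLM-6B’s actual configuration.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
# prints "full: 16,777,216  LoRA: 65,536  ratio: 0.39%"
```

Under these assumed sizes, the low-rank update trains well under 1% of the parameters of the full matrix, which is why this family of methods suits limited hardware.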

Despite these impressive features, VisualGLM-6B is still in its early v1 stage, with several known limitations. Users may experience factual inaccuracy or model hallucination in image descriptions, a lack of detail in image capture, and limitations inherited from the language model. These issues will be addressed in future versions of VisualGLM, with a focus on optimization.

Model quantization allows users to deploy the model locally on consumer-grade graphics cards, requiring as little as 8.7 GB of GPU memory at the INT4 quantization level. This speaks to the exciting accessibility of this ground-breaking model, even as it continues to develop and evolve.
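As a rough back-of-the-envelope check on why quantization matters, the sketch below estimates the memory needed for the language model’s weights alone at different precisions. The 6-billion parameter count is an assumption taken from the “6B” in the model name, and the helper `weight_memory_gb` is hypothetical. Real deployments need more memory than this, for activations and for any components kept at higher precision, so the estimate is a lower bound and not a reproduction of the 8.7 GB figure.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


LLM_PARAMS = 6e9  # assumption: the "6B" in the model name, rounded

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(LLM_PARAMS, bits):.1f} GB")
# prints:
# FP16: 12.0 GB
# INT8: 6.0 GB
# INT4: 3.0 GB
```

Moving from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what brings a model of this size within reach of consumer-grade graphics cards.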

Weekly News Roundup

⚡️ In Tencent’s Q1 2023 earnings call, Tencent CEO Pony Ma made his first remarks on AI. “At first, we thought AI was a once-in-a-decade opportunity for the internet, but the more we thought about it, the more we felt that this is a once-in-hundreds-of-years opportunity, similar to the industrial revolution brought by the invention of electricity.” Ma also stated that Tencent is not in a rush to showcase semi-finished products. “The most important thing is scenario implementation, and we are currently contemplating it. Nowadays, many companies are in a hurry, and it feels like they are trying to boost their stock prices.”

🔍 In Baidu’s Q1 2023 earnings call, Baidu CEO Robin Li said the company is beta-testing an upgraded version of Baidu Search, which is powered by the ERNIE Bot. Li is confident that AI will create new job opportunities for humans, and believes the greatest threat to humanity is to stop innovation.

🚲 A Chinese media outlet reported that Meituan has been developing its foundation models for two months and expanding its algorithm team.

🏪 360 has unveiled the 360 AI Store, an AI hub that consolidates mainstream AI tools globally. The platform functions similarly to a web portal from the PC era. As it stands, the 360 AI Store houses hundreds of AI products across 18 diverse categories, including AI art, AI writing, and AI audio production. The platform features renowned products from industry giants such as Baidu’s ERNIE Bot, ByteDance’s Huoshan Writing, and iFlytek’s Xunfei ZhiZuo.

Trending Research

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Affiliations: Institute of Automation at the Chinese Academy of Sciences, School of Future Technology, University of Chinese Academy of Sciences
X-LLM is a proposed model that empowers Large Language Models (LLMs) with multi-modal capabilities by converting various modalities, such as images, speech, and videos, into languages through X2L interfaces. The model architecture involves training X2L interfaces to align multimodal information, aligning these representations with the LLM, and integrating multimodal abilities into the LLM. X-LLM shows impressive multimodal chat abilities, with experimental results yielding an 84.5% relative score compared to GPT-4 on a synthetic multimodal instruction-following dataset. The model could propel the advent of LLM-based speech recognition.

REASONER: An Explainable Recommendation Dataset with Multi-aspect Real User Labeled Ground Truths

Affiliations: Beijing Key Laboratory of Big Data Management and Analysis Methods, Gaoling School of Artificial Intelligence at Renmin University of China, Huawei Noah’s Ark Lab, Department of Computer Science at Hong Kong Baptist University
Researchers have developed an explainable recommendation dataset that addresses current limitations in evaluating recommender models. They have created a video recommendation platform and have collected feedback from around 3000 users of diverse backgrounds. This novel dataset consists of multi-aspect real user-labeled ground truths. Alongside the dataset, the researchers have built a library that implements ten renowned explainable recommender models in a unified framework. This initiative offers new opportunities for the explainable recommendation field. The dataset, library, and related documents are accessible on the given website.

HuaTuo (华驼): Tuning LLaMA Model with Chinese Medical Knowledge

Affiliations: Research Center for Social Computing and Information Retrieval at the Harbin Institute of Technology
Large Language Models (LLMs), such as the LLaMA model, have demonstrated their effectiveness in various general-domain natural language processing (NLP) tasks. Nevertheless, LLMs have not yet performed optimally in biomedical domain tasks, because responses in this domain require medical expertise. To address this challenge, the researchers propose HuaTuo, a LLaMA-based model that has undergone supervised fine-tuning on generated question-answer (QA) instances. The experimental results show that HuaTuo generates responses that reflect more reliable medical knowledge. The HuaTuo model is publicly available.