This article is based on a recent lecture I gave at Peking University to a large group of European students studying in the dual degree programs between PKU and LSE / Sciences Po. The lecture centered on the US-China tech competition, covering 3 vectors: AI, data, and venture capital. My theses on data and venture capital have been covered in our previous articles (on data, on venture capital). This article will flesh out my thesis on AI.
GPT-4, Sora, GPT-4o - ever since ChatGPT stormed into being in late 2022, the world has been mesmerized by miracle after miracle. For China observers, a question looms large: will China ever be able to catch up with the US in this brave new world?
Of course, there has been a string of successful LLM startups in China. There is Moonshot AI, whose star product, Kimi Chat, specializes in long-context conversations and has been wildly successful, propelling the company to a $3 billion valuation in its recent all-star funding round. Besides Moonshot AI, there are also Baichuan AI, Zhipu AI, and Minimax, already collectively dubbed China's "Four New AI Tigers".
But does the success of individual companies necessarily mean that China can collectively catch up with the US, especially given the poor quality of Chinese-language training data and the US embargo on advanced chips?
I am not an AI expert, but I hope the first-principles reasoning below provides some baseline understanding of this topic.
To start with, I'd like to point out that there are 3 key things that make LLM-based AI technology possible: algorithm, compute, and data.
The algorithm
The algorithm is the "how".
How does something as miraculous as ChatGPT emerge from meaningless bits? In terms of algorithms, China does lag behind US firms like OpenAI. But still, the basic concept of an LLM is publicly available knowledge.
LLM is not a scientific revolution. It's not as if the US has discovered the theory of relativity while China is kept in the dark. As long as there is no step-change in the basic scientific understanding of the world, there is no hard reason why technological know-how won't eventually converge.
(In fact, we know very little about the science behind LLM technology. There still isn't a scientific explanation for the magic that LLMs have achieved. It is, as far as anyone is concerned, still a black box. It's a black box for OpenAI. It's a black box for everyone else. In this sense, everyone is equal in their ignorance.)
What everyone does know is that today's LLMs are 1) all based on the ground-breaking Transformer architecture, and 2) that scaling is what matters most: scale up enough and you will, somehow, magically, reach "emergence". Organizations like OpenAI, as pioneers, are much more sophisticated than Chinese companies in the specific ways of making this emergence happen. But any sophistication can be learned. Although it's true that Chinese innovation has not yet produced anything substantially new in this field, there are ample examples of Chinese companies catching up, learning, iterating, reverse-engineering, and even improving on a technology once a specific technological path has been proven. It's China's "second-mover advantage". There are no hard barriers preventing Chinese companies from doing the same this time.
There are already early signs of this. For instance, a detailed article by a fellow newsletter listed several sophisticated open-source LLM models built by China-based teams.

The compute
In terms of compute, there seems to be a major hurdle. Training LLMs relies on the massive deployment of advanced chips, especially advanced GPUs. It's true that, despite many loopholes, the embargo on advanced chips is debilitating for China's AI development. So the question here is the old one: can China catch up with the US in chip manufacturing?
But here, again, there is no fundamental step-change difference in human understanding of the basic science of the world. Everyone knows how a chip works in theory. What China lacks is only the technological sophistication needed to make the most cutting-edge chips. In fact, we argued before that the so-called "Chip War" might end up giving the Chinese chip industry the one thing it previously lacked but has long wanted: market demand.
Accumulating technological know-how always takes time and numerous rounds of trial and error. After all, China has been connected to the global industrial supply chain for no more than 4 decades and has only recently been forced to reduce its dependence on the cutting-edge technologies of the West. It's only natural that China will need some more time to attain the remaining crown jewels of contemporary technology.
But it's only a matter of time, and it may happen faster than many people imagine. The collaboration between Huawei and SMIC on 5nm chip manufacturing is just a prelude to things to come.
The data
Data represents a more intractable problem, because the machine-readable Chinese-language corpus is far smaller than the English one. We touched on this topic before. But consider a question as simple as this: what is China's equivalent of Wikipedia? You might be tempted to say Baidu Baike, but its content is orders of magnitude weaker than Wikipedia's in both breadth and depth. Another example: in the US, politicians' never-ending speeches are published in long transcripts, but in China most speeches are delivered in private, and only a few make it into public view, usually in a highly stylized format.
Last week, there was an interesting article by a Chinese blogger named 何加盐 He Jiayan titled "中文互联网正在加速崩塌 The Chinese-language internet's collapse is accelerating". In the article, He Jiayan detailed how difficult it now is to find any useful information from the pre-mobile-internet era. For example, when searching for news about Jack Ma between 1998 and 2005, he was shocked to find not the few thousand articles he had expected, but only one. (Ironically but fittingly, the original article has since disappeared, but you can still check a copy of it here.)
Why? Censorship certainly plays its part, but He argued it has more to do with business incentives, as well as widespread "self-censorship" by internet platforms seeking to avoid controversy.
But I'd argue there is a deeper reason here. It’s the culture.
Chinese culture in general accords less recognition to public writing and public speaking, and places more value on practical concerns. Deeds are more important than words. Getting rich is more important than getting informed. So the upside of speaking out loud is limited.
On the other hand, there are far more lessons warning people against the downside risks of speaking, encapsulated in the idiom "言多必失 Speaking too much will invariably lead to mistakes". Chinese people live within deeply interconnected social webs. What if, in the process of making a point, I end up hurting some specific individual or group? The comedian whose "dog jokes" incurred the wrath of the PLA and many veterans is a case in point.
In the West, there is a term called "cancel culture". The term exists within the Western context, as something to be discussed and something deserving of a special name, because the practice is legitimate to some members of society yet conflicts with the overall principle of freedom of expression. There is no clear corresponding term in China. This is not because "cancel culture" doesn't exist in China, but because it is the norm here. Many people, before they say anything, have already thought through all the potential scenarios in which they could get "canceled", which stops them from speaking in the first place.
Here, censorship in the political sense is only the tip of the iceberg; the greatest censorship comes in the social and cultural sense.
Since the reason is cultural, it won't change in the foreseeable future. So if you wonder whether China can produce an LLM as powerful in Chinese as ChatGPT is in English, I think the answer is a decisive no. It's the equivalent of asking whether the Chinese-language internet is as rich and informative as the English-language internet, or whether Baidu Baike can ever be as good as Wikipedia. No, and no. LLMs can't invent stuff. The internet represents the full limit of an LLM's knowledge base.
Still, this limitation won’t stop 2 things from happening.
First, there is the "Baidu model": Chinese companies can still build products used by over a billion people in the Chinese-language world, the same way Baidu acts as the go-to search engine within China. Even though Baidu has no global competitiveness and is less dominant in China's search market than Google is globally, that has not stopped Baidu from becoming a multi-billion-dollar business in itself.
The second likelihood is the "ByteDance model": poor Chinese-language training material will not prevent Chinese companies from making globally competitive LLM products that cater to a global audience. After all, the global internet is freely accessible to everyone. Also, much training data, such as code, logical statements, images, videos, and private-domain materials, is not language-specific. It's not inconceivable that a Chinese company could train its models on non-Chinese-language material and create a powerful product for global consumers, just as ByteDance has done with TikTok on the mobile internet.
The longer-term worries
In summary, in terms of algorithms and compute, I do not see a fundamental limit preventing China from catching up. While data represents a deeper problem, if the goal is to build a globally competitive product, I can't see a hard limit there either.
But despite my optimism, there is still one area where I see China continuing to lag behind the US.
[The content below was not touched upon during my PKU lecture, and is reserved for paid subscribers of Baiguan]