Is Manus a technological breakthrough or a marketing bubble?
And how far away is a universal AI agent from maturity?
Since last Thursday, the tech community, as well as many investors in China, has been enthralled by Manus, a new AI agent product. Asking around for an invitation code has become almost obligatory, and people whisper among themselves: is this the "2nd DeepSeek" out of China?
We have to be totally frank here: our team has not yet gotten its hands on an invitation code. (Manus team, if you can hear us, we are still waiting.) But thanks to a number of good, impartial third-party studies, we think we can safely answer that, no, Manus is not a "2nd DeepSeek." If you are an investor, do not expect a shock as large as the one DeepSeek delivered.
Nor does Manus seem like a marketing scam, and it cannot be written off simply as a "wrapper" around foundational models. Manus is a genuinely powerful product with a good user experience, made by a highly transparent team that is ready to share its recipes and engage with users, and yet there is also ample room for improvement.
For today's emergency issue, we translated the article we found most informative on this subject to date. It was originally written by Yuan Gan, an engineer/researcher/product manager/investor in the AI field, first published on his WeChat blog 甘源有话说, and translated by Baiguan with the author's permission. What I love about Yuan's article are the sharp, succinct questions he raises and, at the end, the thorough review of the foundational bottlenecks we have yet to clear before a genuinely universal AI agent can be achieved.
Before reading Yuan's article, here is one final comment from us at Baiguan on this topic: Manus may not be the new DeepSeek, but the level of creativity and transparency exhibited by its team proves that DeepSeek is not an isolated event; a whole community of internationally competitive creators and innovators is coming of age in China.
The world should stop responding with surprised oohs and ahhs every time this happens.
Is Manus a Technological Breakthrough or a Marketing Bubble? I Checked the Scores on a 467-Question Benchmark
By Yuan Gan
With the release of Manus' first demo video on Thursday night, the entire Chinese-speaking world seemed to ignite, with people declaring that the arrival of a universal agent meant the end of any vertical agent startup. Yet just 48 hours later, public opinion had swung the other way. Doubts emerged that gating the product experience behind invitation codes was a form of showmanship, and various "open-source versions of Manus" quickly appeared. As of this writing, the standout among them, Owl, which went live on Friday, has received 3.7k stars on GitHub, while OpenManus has garnered 15k stars; both are among the fastest-rising projects on GitHub.
So, is Manus an empty marketing frenzy or a technological breakthrough comparable to DeepSeek? To answer this question, I reviewed every official example provided by Manus, read the GAIA benchmark paper with its 467 questions, compared Manus' performance on GAIA with that of its competitors, and looked at the code of the "open-source versions of Manus." Here are my findings:
Is Manus an empty marketing frenzy or a technological breakthrough comparable to DeepSeek? Neither. Manus is neither a marketing gimmick nor a fundamental technological revolution; it represents a breakthrough at the product level. Unlike DeepSeek, which is a fundamental breakthrough in foundational model capabilities, Manus has made significant progress in the direction of AI agents, reaching SOTA (state-of-the-art) levels on the authoritative GAIA benchmark and standing significantly ahead of peer products.
Can the various open-source alternatives that have emerged over the past few days replace Manus? No. The current open-source alternatives show a clear gap relative to Manus: actual testing and benchmark data show that Manus' success rate on complex tasks is several times higher than that of the various open-source versions. Moreover, Manus has been tuned for particular application scenarios, an optimization that a quick open-source replication cannot match.
Is Manus a mature universal agent? No, Manus has not yet become a truly universal agent. To achieve this goal, it still needs to overcome three "major mountains": a fundamental improvement in foundational model capabilities, a rich and diverse ecosystem of partners, and a solid and scalable engineering infrastructure.
Next, I will elaborate on how I arrived at these three conclusions.
Is Manus a marketing scam?
Even though Manus AI spent the past year quietly building, doubts from the AI field keep emerging, because many professionals can quickly deduce the working principle behind Manus. Each task is executed in three stages:
Planning phase: Use a long-reasoning model such as OpenAI's o1 to run a planning prompt that breaks the user's request into execution steps and determines the final deliverable. For example, if the user wants to analyze Tesla's stock, the final product should be a webpage containing recent stock prices, market share, a SWOT analysis, and so on; the model then works backward to determine what must be done to gather this information.
Execution phase: Use Claude 3.7's computer-use capabilities to obtain this information step by step according to the breakdown from the previous stage. In the Tesla example, Manus would write a piece of code to fetch Tesla's historical stock price through an API.
Induction phase: Use Claude 3.7's extended-thinking capabilities to summarize and synthesize all the information collected in the second step and produce the final product. In the Tesla example, that would be a webpage containing the various pieces of information. (A minimal sketch of this three-stage loop follows below.)
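To make the three stages concrete, here is a minimal sketch in Python of a plan-execute-summarize loop. The function names and placeholder logic are purely illustrative; they are not Manus' actual code, prompts, or API.

```python
# A toy version of the three-stage loop described above.
# call_planner / call_executor / call_summarizer stand in for real model calls
# (e.g. a long-reasoning model for planning, a tool-using model for execution).

def call_planner(goal: str) -> list[str]:
    """Break the user's goal into ordered execution steps (normally an LLM call)."""
    return [f"Gather data for: {goal}", f"Analyze data for: {goal}"]

def call_executor(step: str, context: dict) -> str:
    """Carry out one step, e.g. browse the web or run generated code."""
    return f"result of '{step}'"

def call_summarizer(goal: str, context: dict) -> str:
    """Synthesize everything collected so far into the final deliverable."""
    return f"Report for '{goal}' built from {len(context)} intermediate results."

def run_agent(goal: str) -> str:
    plan = call_planner(goal)                  # 1. planning phase
    context: dict[str, str] = {}
    for step in plan:                          # 2. execution phase, step by step
        context[step] = call_executor(step, context)
    return call_summarizer(goal, context)      # 3. induction / summary phase

if __name__ == "__main__":
    print(run_agent("Analyze Tesla's stock"))
```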
However, Manus' value lies in being a product, not a scientific concept. The success of a product depends on its actual results and user experience, not on the concept itself. From the standpoint of the idea itself, none of these agent systems, whether Manus or overseas offerings such as OpenAI's Operator, Khoj, and Jace.ai, has gone beyond the scope Andrew Ng explored in his talk at Sequoia Capital 11 months ago.
In that talk, titled "What's Next for Agentic Reasoning," Andrew Ng demonstrated how, through planning, execution, reflection, and multi-agent collaboration, large language models could double their success rate on deterministic tasks (such as programming) and complete tasks the models could not originally handle, such as processing images in a specific style. What fewer people know is that in a subsequent talk, Ng showed that his team had found the greatest strength of large language models to be divide and conquer: breaking a problem down, solving the small problems, and then combining those answers to solve the original big problem. In that talk, Ng demonstrated how his team used this planning-execution-summary approach to measure the distance between a surfer and a shark in a video. This is exactly Manus' approach.
Now that we know the principles behind Manus are not new, how effective is it in practice?
According to the performance data the team provides, Manus has indeed reached SOTA levels on the GAIA benchmark. From this perspective, Manus is indeed one of the best-performing AI agent products currently on the market. Wait, what is GAIA?
GAIA: A Benchmark for General AI Assistants
GAIA, an abbreviation of General AI Assistant, is a benchmark published by Yann LeCun's team at Meta (FAIR), together with researchers from Hugging Face and AutoGPT, in January 2024, used to measure the extent to which an AI agent can act as a general-purpose intelligent assistant.
Unlike traditional benchmarks such as MMLU, GAIA consists of real-world, open-ended questions that have unambiguous correct answers and that humans can solve, such as
calculating the minimum cost of buying ingredients based on supermarket flyers,
answering how many transfers are needed from station A to station B on a subway map,
and inferring where a night-sky observation was made based on constellation maps and latitude-longitude data.
GAIA consists of 467 such "real-world comprehensive application questions." When GAIA was first released, the most advanced large language model at the time, GPT-4, could only answer 15% of the questions correctly. In contrast, humans could answer 92% of the questions correctly.
GAIA divides the questions into three levels of complexity:
Level 1 (Beginner): Usually does not require tools, or at most requires one tool and no more than five steps.
Level 2 (Intermediate): Requires about 5-10 steps and must combine different tools.
Level 3 (Advanced): Requires AI to perform arbitrary-length action sequences, use any number of tools, and have broad access to world knowledge.
I pulled a few questions from the question bank so you can get a feel for the difficulty from Level 1 to Level 3:
Level 1 (Beginner): What is the name of the only Malko Competition winner from the 20th century (after 1977) whose nationality was recorded as a country that no longer exists?
Level 2 (Intermediate): How many animals in the "Twelve Zodiac Signs" exhibition named after the Chinese zodiac in the Metropolitan Museum of Art in 2015 have visible hands?
Level 3 (Advanced): In a 360-degree VR video uploaded to YouTube in March 2018 and narrated by the actor who voiced Gollum in "The Lord of the Rings," what number does the narrator mention immediately after the first appearance of a dinosaur?
With a computer connected to the internet, how many questions do you think you can answer correctly?
According to the data provided by Manus, it answered 86.5% of Level 1 questions, 70.1% of Level 2 questions, and 57.7% of Level 3 questions correctly on GAIA, far higher than OpenAI's Deep Research.
Can the various open-source versions emerging these past few days replace Manus?
Since there is an objective evaluation standard like GAIA, we can now use data to answer a key question: Can the open-source versions of Manus that appeared within a day pull Manus off its pedestal? Data-wise, the answer is no.
The official GAIA leaderboard on Hugging Face lists the top three publicly tested systems, which currently do not include Manus. They are
Trase Agent: Developed by the American startup Trase Systems, it is closed-source and not open to public trials, with an accuracy rate of 83% on Level 1, 69% on Level 2, and 46% on Level 3.
H2O GPT Agent: Developed by the American AI company H2O.ai, also closed-source and not open to trials, with an accuracy rate of 67% on Level 1, 67% on Level 2, and 42% on Level 3.
Owl: A product built by the Chinese-led open-source community Camel AI based on its open-source agent framework, with an accuracy rate of 81% on Level 1, 54% on Level 2, and 23% on Level 3.
Comparing the open-source solution Owl with Manus, we can see that
on basic tasks (Level 1), Owl's accuracy rate is 81%, relatively close to Manus' 86.5%.
On high-difficulty tasks (Level 3), Owl answered only 23% correctly, while Manus achieved 57.7%, a lead of 34.7 percentage points.
Even the top-ranked Trase Agent trails Manus by 11.6 percentage points on Level 3 questions.
In fact, Manus not only leads the other solutions in success rate on high-difficulty tasks; among these closed-source products, it is also the only one open to public trial. This means Manus is ahead not just on technical metrics but also in product delivery. So although the open-source community reacted quickly, there remains a qualitative gap, one that cannot be closed overnight, between Manus and the various hastily built "open-source versions" when it comes to handling complex, multi-step tasks. The gap is especially evident in the success rate on high-difficulty, multi-step, multimodal problems.
So, the third question: how far is Manus, as a universal agent, from being fully mature and usable?

How far away is a universal agent from maturity?
Whether it's from the test experiences of netizens over the past few days or from the scores Manus has obtained on the GAIA benchmark, we can see that Manus and other universal AI agents are not yet mature.
So, how far away is a universal agent from being mature and commercially available?
I believe that to be mature and usable, it still needs to overcome three major challenges: foundational model capabilities, partner ecosystems, and engineering infrastructure.
Foundational model capabilities
Currently, universal agents still rely on foundational large language models for task decomposition and execution, especially in the execution phase, where the models face significant challenges both in using web information and in operating computers. The ability to use web information determines the agent's capacity to gather information, while the ability to operate a computer determines its capacity to process information.
In the 8 selected examples provided by Manus, 7 require obtaining internet information through a browser, and all examples involve using computers to create files, analyze data, and export data.
The two most well-known benchmarks for evaluating the use of web information and computer operations are WebArena and OSWorld.
WebArena is a benchmark proposed by a team from Carnegie Mellon University to measure AI's ability to use the web. It includes four categories of application scenarios: e-commerce, social forum discussions, collaborative software development (such as using GitHub), and content management (such as posting on Weibo and commenting).
WebArena simulates a real web environment in its task design, requiring agents to:
Understand various web page information in real environments, such as forums, shopping websites, and GitHub.
Perform multi-step web tasks, such as registering an account, placing an order, and posting content.
Deal with dynamically changing web content and states, such as a shopping cart that refreshes after an order is placed, or a distance readout that updates after a destination is entered on a map. (A minimal sketch of such a multi-step task follows below.)
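To make this concrete, here is a minimal sketch, in Python with Playwright, of the kind of multi-step, stateful browser task WebArena expects an agent to carry out. The URL and CSS selectors are placeholders rather than part of the real WebArena environment, and a real agent would choose each action from model output instead of following a hard-coded script.

```python
# Illustrative only: a hard-coded version of a multi-step e-commerce task.
# In a WebArena-style setup, an agent would decide each of these actions itself.
from playwright.sync_api import sync_playwright

def place_order(base_url: str, query: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        page.goto(base_url)                          # step 1: open the shop
        page.fill("input#search", query)             # step 2: search for an item
        page.press("input#search", "Enter")
        page.click("a.product-item")                 # step 3: open the first result
        page.click("button#add-to-cart")             # step 4: add it to the cart
        page.goto(f"{base_url}/checkout")            # step 5: inspect updated cart state
        total = page.inner_text("span#order-total")  # read a dynamically updated value

        browser.close()
        return total

if __name__ == "__main__":
    # "http://localhost:7770" is a placeholder address for a self-hosted shop.
    print(place_order("http://localhost:7770", "wireless mouse"))
```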
Unfortunately, in the official leaderboard of WebArena, the current top-ranked IBM CUGA can only complete 61.7% of tasks. This means that even the most advanced agents still fail to complete nearly 40% of tasks in real-world web environments.
OSWorld is a benchmark jointly released by the University of Hong Kong, Salesforce, Carnegie Mellon University, and the University of Waterloo to measure an agent's computer usage capabilities. It covers the ability to use productivity tools such as Excel and VSCode, simulating a real operating system environment.
OSWorld mainly measures the agent's ability in the following areas:
File system operations, such as creating, moving, and deleting files.
Using productivity tools, such as editing text and using Excel.
Editing and executing code, such as code editing, compiling, and running.
Multi-application collaboration, such as switching between different applications and transferring data.
From the official leaderboard, the current top-ranked OpenAI CUA (another "Open" AI product that is closed-source and does not offer an open API) can only complete 38.1% of tasks. This indicates that even the most advanced agents have significant limitations in handling basic computer operations.
Of course, the examples show that Manus leans more on the large language model's code-generation capabilities to avoid direct computer operations, which raises the success rate on tasks involving computer interaction: rather than clicking through a website, the agent can simply generate and run a short script, as sketched below.
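As an illustration of this "generate code instead of driving a GUI" approach, the snippet below fetches Tesla's price history through the yfinance library rather than navigating a finance website. yfinance is only a convenient public example; we do not know which data source or API Manus actually generates code against.

```python
# Illustrative only: obtaining stock data via code instead of GUI operations.
import yfinance as yf

def fetch_price_summary(ticker: str = "TSLA", period: str = "1y") -> dict:
    """Download daily price history and return a few summary figures."""
    history = yf.Ticker(ticker).history(period=period)
    return {
        "latest_close": float(history["Close"].iloc[-1]),
        "period_high": float(history["High"].max()),
        "period_low": float(history["Low"].min()),
    }

if __name__ == "__main__":
    print(fetch_price_summary())
```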
Partner ecosystem
In the actual experience of OpenAI Operator, a significant issue is the restricted interaction between agents and external services. For example, when Operator accesses Reddit, GitHub, or other websites to complete tasks, it is often identified as abnormal traffic and blocked.
Currently, most agents access network services anonymously or with a generic identity, lacking a clear identity marker, leading to:
Being identified and blocked by websites' anti-crawling mechanisms, including search engines like Google.
Inability to access services that require account login, such as obtaining information from Twitter and Facebook.
Inability to access personalized content and services, such as letting the agent view one's own email.
Of course, there are proposals in the industry to address this, such as sending an identifier like "AI-agent" when accessing a website's services, so that the site recognizes the visitor as an agent acting for a specific company rather than blocking it as abnormal traffic (see the sketch below). However, this also means that the future usability of AI agents depends largely on who can establish an agent cooperation ecosystem the fastest. Once foundational model capabilities are ready, whoever is quickest to establish cooperation with the majority of mainstream network service providers will be able to dominate the market through the richness of its agent scenarios.
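As a rough sketch of what such self-identification could look like, the request below declares itself as agent traffic through its headers. The header names and values are hypothetical; no such convention has been standardized yet, which is precisely the ecosystem gap described above.

```python
# Hypothetical example: an agent declaring its identity instead of posing as a browser.
import requests

AGENT_HEADERS = {
    "User-Agent": "ExampleAgent/0.1 (+https://example.com/agent-policy)",  # placeholder identity
    "X-AI-Agent": "example-company/research-assistant",                    # hypothetical field
}

def fetch_as_agent(url: str) -> str:
    """Fetch a page while openly declaring that the request comes from an AI agent."""
    response = requests.get(url, headers=AGENT_HEADERS, timeout=10)
    response.raise_for_status()
    return response.text[:500]  # first 500 characters, just for the demo

if __name__ == "__main__":
    print(fetch_as_agent("https://example.com"))
```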
Engineering Infrastructure
Unlike traditional internet services, which can usually be abstracted into instantaneous, stateless microservice calls, agent services are almost all long-running, stateful, multi-turn interactions. Once product capabilities mature, efficiently serving agents to millions, or even tens of millions, of users becomes a significant engineering challenge. The sketch below illustrates the difference.
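Here is a minimal sketch of why this is harder than stateless serving: each agent session carries state (its plan and intermediate results) and stays alive across many slow steps, so the server must juggle and persist large numbers of them concurrently. The session structure and steps are illustrative, not any real system's design.

```python
# Illustrative only: many long-running, stateful agent sessions served concurrently.
import asyncio
from dataclasses import dataclass, field

@dataclass
class AgentSession:
    session_id: str
    goal: str
    history: list[str] = field(default_factory=list)  # state that must survive between turns

    async def run_step(self, step: str) -> None:
        await asyncio.sleep(0.1)  # stands in for a slow tool or model call
        self.history.append(f"done: {step}")

async def serve_session(session: AgentSession, steps: list[str]) -> AgentSession:
    for step in steps:  # one long-running, multi-turn interaction
        await session.run_step(step)
    return session

async def main() -> None:
    sessions = [AgentSession(f"user-{i}", "analyze a stock") for i in range(3)]
    steps = ["plan", "gather data", "write report"]
    finished = await asyncio.gather(*(serve_session(s, steps) for s in sessions))
    for s in finished:
        print(s.session_id, s.history)

if __name__ == "__main__":
    asyncio.run(main())
```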
Conclusion
One can imagine that the speed with which the media and the open-source community responded to Manus' release caught its team off guard. Even though it is not a perfect product, it is more "sincere" than OpenAI's $200-per-month Operator and the products on the GAIA leaderboard that cannot be tried at all.
The road is tortuous, but the future is bright.
I hope everyone can spend more time making good products.