While the tech world buzzes this week about Chinese AI lab DeepSeek, one of its major domestic rivals, Alibaba, has been making moves of its own.
On Monday, Alibaba’s Qwen team unveiled a new family of AI models, dubbed Qwen2.5-VL, that can perform a range of text and image analysis tasks. The models can parse documents, interpret videos, count objects in images, and even control a computer, akin to OpenAI’s recently introduced Operator model.
According to the Qwen team’s own benchmarks, the top model in the Qwen2.5-VL series outperforms OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash on evaluations of video comprehension, mathematical reasoning, document analysis, and question answering.

Available for testing in Alibaba’s Qwen Chat application and for download on the AI development platform Hugging Face, Qwen2.5-VL can analyze graphs and images, extract information from scanned invoices and forms, and “understand” videos that are several hours long, the Qwen team claims. The model can also identify “IPs from movies and TV shows, along with numerous products,” according to the team, which suggests it may have been trained in part on copyrighted material.
However, because Qwen2.5-VL was developed by a Chinese company, it is restricted in the subjects it will discuss, particularly within Qwen Chat. When the most advanced model in the series, Qwen2.5-VL-72B, was asked about “Xi Jinping’s mistakes,” the service returned an error message.
China’s internet regulatory body evaluates many domestically developed models to ensure that their outputs align with “core socialist values.” Numerous Chinese AI systems refrain from discussing controversial topics that could provoke regulatory scrutiny, such as Taiwan’s independence.
Among the notable capabilities of Qwen2.5-VL is its ability to operate software on both PCs and mobile devices. A video shared on X by Philipp Schmid, a technical lead at Hugging Face, showed Qwen2.5-VL launching the Booking.com app on Android and proceeding to book a flight from Chongqing to Beijing.
Don’t Miss @Alibaba_Qwen 2.5 VL! Despite all the Deepseek Hype, Qwen just dropped the best open Multimodal! Qwen 2.5 VL is a Vision Language Model that can control your computer, similar to the @OpenAI operator, extract structured information from charts, and more!! TL;DR; 3️⃣… pic.twitter.com/GeEGVdl0tI
— Philipp Schmid (@_philschmid) January 27, 2025
In another video, a Qwen2.5-VL model is shown controlling applications on a Linux desktop, though it does not appear to accomplish much beyond switching between tabs. Notably, Qwen’s own benchmarking indicates that Qwen2.5-VL performs poorly on OSWorld, a benchmark designed to replicate a real computing environment.
LMAO Qwen 2.5 VL can perform Computer Use, out of the box, taking on OpenAI Operator HEAD ON! 🐐 pic.twitter.com/lwMECXzNSu
— Vaibhav (VB) Srivastav (@reach_vb) January 27, 2025
The two smaller, less capable models in the Qwen2.5-VL series, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are offered under a permissive license. The flagship model, Qwen2.5-VL-72B, is instead governed by Alibaba’s custom license, which requires companies and developers with more than 100 million monthly active users to obtain permission from Qwen/Alibaba before using the model commercially.
Compiled by Techarena.au.