Milestone Systems has announced the release of an advanced vision language model (VLM) designed specifically for traffic understanding, powered by NVIDIA Cosmos Reason. The new VLM underpins two offerings: a Video Summarization tool for XProtect Video Management Software and a Vision Language Model as a Service (VLMaaS) for third-party integrations.
The Video Summarization tool, a generative AI-powered plug-in for the XProtect Smart Client, is designed to help operators quickly extract insights from large volumes of video data. Instead of manually reviewing footage, users can submit a short video clip along with a prompt, and the system generates a text-based summary within seconds. Early reports indicate that video summarization could reduce operator false alarm fatigue by up to 30 percent.
Key capabilities of the Video Summarization tool include converting video segments into structured text summaries, searching video content based on descriptions rather than timestamps, bookmarking and filtering summaries to streamline reviews, and triggering automated summaries through existing XProtect events and rules. The tool also enables users to focus on valid incidents by filtering out irrelevant motion or noise. Sovereign, region-specific VLMs are available starting with the US and EU, with additional regions planned.
The Video Summarization plug-in is free to download and can be installed directly within the XProtect Smart Client in minutes. Customers incur costs only when they actively prompt the VLM.
Alongside this launch, Milestone introduced Hafnia Vision Language Model as a Service (VLMaaS), which gives developers, integrators, and partners API access to production-ready video intelligence built on NVIDIA technology and fine-tuned on responsibly sourced data. The service lets organizations add generative AI capabilities to existing applications without building or managing their own AI infrastructure, accelerating development from minimum viable products to large-scale deployments.
According to Milestone, using VLMaaS can cut the effort required to develop advanced video analytics by a factor of up to 70 compared with fine-tuning a model independently. The service offers API-first integration over HTTPS, supports prompt-based traffic-related operations, and includes fine-tuned models for the US and EU markets, with more regions planned. Pricing follows a pay-per-use model based on API calls, avoiding large upfront investments.
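To make the API-first, prompt-based pattern concrete, the following is a minimal sketch of what a client request might look like. Milestone's announcement does not describe the actual interface, so the endpoint URL, the bearer-token header, the JSON field names, and the `build_summary_request` helper are all hypothetical placeholders. The sketch only builds the HTTPS request and does not send it.

```python
import json
import urllib.request

# HYPOTHETICAL: this URL, the auth header, and the JSON field names are
# placeholders for illustration, not Milestone's documented API.
API_URL = "https://vlm.example.invalid/v1/summaries"

# The article mentions region-specific models for the US and EU.
SUPPORTED_REGIONS = {"us", "eu"}


def build_summary_request(
    video_url: str, prompt: str, region: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for a video summary."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")

    # Assumed payload shape: a clip reference, a free-text prompt, and a region.
    payload = {"video_url": video_url, "prompt": prompt, "region": region}

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_summary_request(
        video_url="https://example.invalid/clips/intersection-42.mp4",
        prompt="Summarize any traffic incidents in this clip.",
        region="eu",
        api_key="YOUR_API_KEY",
    )
    print(request.get_method(), request.full_url)
```

Under a pay-per-use model of the kind described, each such request would count as one billable call, so a client would typically send requests only for clips that matter rather than for continuous footage.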
Andrew Burnett, Acting Chief Technology Officer at Milestone Systems, said, “With the Vision Language Model as a Service and Video Summarization for XProtect, we’re tackling some of the most challenging bottlenecks: video overload and time-consuming manual work. Operators get immediate insight directly within XProtect; builders get API-first access to production-ready intelligence without bespoke training or heavy infrastructure. Because this model is specialized for real-world traffic video and fine-tuned on responsibly sourced data, customers can trust the results, deploy with confidence, and enhance all existing solutions in place. It’s the fastest, most advanced and impactful path to turning video into actionable outcomes.”
Milestone said early adopters, including the cities of Genoa in Italy and Dubuque in the US state of Iowa, are already preparing to use the new capabilities to enhance traffic management.
Both offerings are built on Milestone’s Hafnia VLM, which has been fine-tuned using 75,000 hours of responsibly sourced, real-world video data from Europe and the United States. Data preparation is handled using NVIDIA Cosmos Curator, with deployment available via cloud infrastructure or regional data centers. By combining NVIDIA Cosmos Reason with Milestone’s domain-specific data, the company says the platform represents one of the most advanced video AI solutions currently available.
