

Meta: DINOv3 is self-supervised learning for vision at unprecedented scale


The following is the original post from Meta's website; enjoy.
One comment from me (Tan) after reading it: the performance really is excellent, but the Apache license has been replaced with a commercial license. In other words, the Apache license, which allowed free use, modification, and even commercial use, has been swapped for a commercial license that is paid or more restrictive; anyone who wants to keep using the models has to follow the new terms.

Open Source

DINOv3: Self-supervised learning for vision at unprecedented scale

August 14, 2025

Takeaways:

  • We’re introducing DINOv3, which scales self-supervised learning for images to create universal vision backbones that achieve absolute state-of-the-art performance across diverse domains, including web and satellite imagery.
  • DINOv3 backbones produce powerful, high-resolution image features that make it easy to train lightweight adapters. This leads to exceptional performance on a broad array of downstream vision tasks, including image classification, semantic segmentation, and object tracking in video.
  • We’ve incorporated valuable community feedback, enhancing the versatility of DINOv3 by shipping smaller models that outperform comparable CLIP-based derivatives across a broad evaluation suite, as well as alternative ConvNeXt architectures for resource-constrained use cases.
  • We’re releasing the DINOv3 training code and pre-trained backbones under a commercial license to help drive innovation and advancements in the computer vision and multimodal ecosystem.
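The "frozen backbone plus lightweight adapter" recipe in the takeaways can be sketched numerically. In this toy Python example, random vectors stand in for DINOv3 embeddings (an assumption; the real backbone emits high-dimensional patch and image features), and the adapter is a closed-form ridge-regression classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen backbone features: in reality these would be
# DINOv3 embeddings; here they are random vectors (an assumption).
n, d = 200, 64
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)  # two classes

# Lightweight adapter: ridge regression in closed form,
# w = (X^T X + lam*I)^{-1} X^T y -- no backbone gradients needed.
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

accuracy = (np.sign(X @ w) == y).mean()
print(f"train accuracy of the linear adapter: {accuracy:.2f}")
```

Because the adapter is linear and solved in closed form, no gradients ever flow into the backbone; this is the sense in which downstream training stays lightweight.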

Self-supervised learning (SSL) —the concept that AI models can learn independently without human supervision—has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.

Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation.

DINOv3’s breakthrough performance is driven by innovative SSL techniques that eliminate the need for labeled data—drastically reducing the time and resources required for training and enabling us to scale training data to 1.7B images and model size to 7B parameters. This label-free approach enables applications where annotations are scarce, costly, or impossible.

For example, our research shows that DINOv3 backbones pre-trained on satellite imagery achieve exceptional performance on downstream tasks such as canopy height estimation.
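The label-free training signal behind the DINO family can be illustrated with a toy self-distillation step. This is a simplified sketch, not Meta's implementation: a student distribution over prototype dimensions is matched against a centered, sharpened teacher distribution of the same image under a different augmentation, so no labels enter the loss.

```python
import numpy as np

def softmax(z, temp):
    z = (z - z.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K = 8  # number of prototype dimensions (toy value)

# Logits from two augmented views of the same image:
# the teacher sees view 1, the student sees view 2.
teacher_logits = rng.normal(size=K)
student_logits = teacher_logits + 0.1 * rng.normal(size=K)

center = np.zeros(K)  # running center; in DINO this prevents collapse

# Teacher output is centered and sharpened (low temperature);
# the student uses a higher temperature.
p_teacher = softmax(teacher_logits - center, temp=0.04)
p_student = softmax(student_logits, temp=0.1)

# Cross-entropy H(teacher, student) -- no labels are involved.
loss = -(p_teacher * np.log(p_student + 1e-9)).sum()
print(f"self-distillation loss: {loss:.3f}")
```

In the full method the teacher is an exponential moving average of the student and the center is updated online; the point here is only that the supervision comes from another view of the same image, not from annotations.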

We believe DINOv3 will help accelerate existing use cases and also unlock new ones, leading to advancements in industries such as healthcare, environmental monitoring, autonomous vehicles, retail, and manufacturing—enabling more accurate and efficient visual understanding at scale.

We’re releasing DINOv3 with a comprehensive suite of open-source backbones under a commercial license, including a satellite backbone trained on MAXAR imagery. We’re also sharing a subset of our downstream evaluation heads, enabling the community to reproduce our results and build upon them. Additionally, we’re providing sample notebooks so the community has detailed documentation to help them start building with DINOv3 today.

Unlocking high-impact applications with self-supervised learning

DINOv3 achieves a new milestone by demonstrating, for the first time, that SSL models can outperform their weakly supervised counterparts across a wide range of tasks.

While previous DINO models set a significant lead in dense prediction tasks, such as segmentation and monocular depth estimation, DINOv3 surpasses these accomplishments.

Our models match or exceed the performance of the strongest recent models such as SigLIP 2 and Perception Encoder on many image classification benchmarks, and at the same time, they drastically widen the performance gap for dense prediction tasks.



DINOv3 builds on the breakthrough DINO algorithm, requiring no metadata input, consuming only a fraction of the training compute compared to prior methods, and still delivering exceptionally strong vision foundation models.

The novel refinements introduced in DINOv3 lead to state-of-the-art performance on competitive downstream tasks such as object detection under the severe constraint of frozen weights. This eliminates the need for researchers and developers to fine-tune the model for specific tasks, enabling broader and more efficient application.

Finally, because the DINO approach is not specifically tailored to any image modality, the same algorithm can be applied beyond web imagery to other domains where labeling is prohibitively difficult or expensive. DINOv2 already leverages vast amounts of unlabeled data to support diagnostic and research efforts in histology, endoscopy, and medical imaging. In satellite and aerial imagery, the overwhelming volume and complexity of data make manual labeling impractical.

With DINOv3, we make it possible for these rich datasets to be used to train a single backbone that can then be used across satellite types, enabling general applications in environmental monitoring, urban planning, and disaster response.

DINOv3 is already having real-world impact.

The World Resources Institute (WRI) is using our latest model to monitor deforestation and support restoration, helping local groups protect vulnerable ecosystems. WRI uses DINOv3 to analyze satellite images and detect tree loss and land-use changes in affected ecosystems. The accuracy gains from DINOv3 support automating climate finance payments by verifying restoration outcomes, reducing transaction costs, and accelerating funding to small, local groups.

For example, compared to DINOv2, DINOv3 trained on satellite and aerial imagery reduces the average error in measuring tree canopy height in a region of Kenya from 4.1 meters to 1.2 meters. WRI is now able to scale support for thousands of farmers and conservation projects more efficiently.



Scalable and efficient visual modeling without fine-tuning

We built DINOv3 by training a 7x larger model on a 12x larger dataset than its predecessor, DINOv2. To showcase the model’s versatility, we evaluate it across 15 diverse visual tasks and more than 60 benchmarks. The DINOv3 backbone particularly shines on all dense prediction tasks, showing an exceptional understanding of the scene layout and underlying physics.

The rich, dense features capture measurable attributes or characteristics of each pixel in an image and are represented as vectors of floating-point numbers. These features are capable of parsing objects into finer parts, even generalizing across instances and categories. This dense representation power makes it easy to train lightweight adapters with minimal annotations on top of DINOv3, meaning a few annotations and a linear model are sufficient to obtain robust dense predictions.

Pushing things further and using a more sophisticated decoder, we show that it’s possible to achieve state-of-the-art performance on long-standing core computer vision tasks without fine-tuning the backbone.

We show such results on object detection, semantic segmentation, and relative depth estimation.
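The claim that "a few annotations and a linear model are sufficient" for dense prediction can be mimicked per patch. Here random vectors stand in for DINOv3's per-patch features (an assumption), and a single least-squares linear map fit on 30 labelled patches yields a dense mask over the whole grid:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "feature map": a 16x16 grid of patches, each with an 8-dim feature.
H, W, d = 16, 16, 8
feats = rng.normal(size=(H, W, d))

# Ground-truth mask driven by a (hypothetical) linear direction in
# feature space; labels are +1 / -1.
true_dir = rng.normal(size=d)
labels_full = np.where(feats @ true_dir > 0, 1.0, -1.0)

# Sparse annotations: only 30 of the 256 patches are labelled.
idx = rng.choice(H * W, size=30, replace=False)
X = feats.reshape(-1, d)[idx]
y = labels_full.reshape(-1)[idx]

# The "lightweight adapter": one linear map fit by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Dense prediction for every patch from the same frozen features.
mask = (feats.reshape(-1, d) @ w > 0).reshape(H, W)
agreement = (mask == (labels_full > 0)).mean()
print(f"dense agreement from 30 labels: {agreement:.2f}")
```

The sketch only works because the label is (by construction) linearly readable from the features; the article's claim is that DINOv3's dense features make many real labels similarly readable.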

Because state-of-the-art results can be achieved without fine-tuning the backbone, a single forward pass can serve multiple applications simultaneously.

This enables the inference cost of the backbone to be shared across tasks, which is especially critical for edge applications that often require running many predictions at once.

DINOv3’s versatility and efficiency make it the perfect candidate for such deployment scenarios, as demonstrated by NASA’s Jet Propulsion Laboratory (JPL), which is already using DINOv2 to build exploration robots for Mars, enabling multiple vision tasks with minimal compute.
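The amortization argument, one backbone forward pass shared by many task heads, can be sketched as follows. The shapes, the tanh "backbone", and both heads are illustrative stand-ins, not DINOv3's real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 512, 256

# Stand-in for the frozen backbone: one fixed, expensive transform.
W_backbone = rng.normal(size=(d_in, d_feat)) * 0.05

def backbone(image):
    return np.tanh(image @ W_backbone)

# Two lightweight task heads that share the same features.
W_cls = rng.normal(size=(d_feat, 10)) * 0.01   # classification head
W_depth = rng.normal(size=(d_feat, 1)) * 0.01  # depth head

image = rng.normal(size=(1, d_in))
feats = backbone(image)       # the costly forward pass runs once

class_logits = feats @ W_cls  # cheap per-task adapters reuse feats
depth = feats @ W_depth

print(class_logits.shape, depth.shape)
```

Adding a third task costs only another small matrix multiply on `feats`, which is why a shared frozen backbone suits compute-constrained edge deployments.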

A family of deployment-friendly models

Scaling DINOv3 to 7B parameters shows SSL’s full potential. However, a 7B model is impractical for many downstream applications. Following feedback from the community, we built a family of models spanning a large range of inference compute requirements to empower researchers and developers across diverse use cases.

By distilling the ViT-7B model into smaller, high-performing variants like ViT-B and ViT-L, DINOv3 outperforms comparable CLIP-based models across a broad evaluation suite.

Additionally, we introduce alternative ConvNeXt architectures (T, S, B, L) distilled from ViT-7B, that can accommodate varying compute constraints. We’re also releasing our distillation pipeline to enable the community to build upon this foundation.
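Distilling the 7B teacher into smaller students generally means training the student to reproduce the teacher's outputs. A minimal sketch under that assumption (plain feature matching with gradient descent on a squared error; the released pipeline is considerably more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 256

# Frozen "teacher": a fixed linear map standing in for the 7B model.
T = rng.normal(size=(d_in, d_out))
X = rng.normal(size=(n, d_in))
teacher_out = X @ T

# Student starts at zero and learns to match the teacher's features.
S = np.zeros((d_in, d_out))
lr = 0.5
for _ in range(200):
    err = X @ S - teacher_out  # residual against teacher features
    grad = X.T @ err / n       # gradient step on the squared residual
    S -= lr * grad

final_mse = np.mean((X @ S - teacher_out) ** 2)
print(f"feature-matching MSE after distillation: {final_mse:.2e}")
```

No labels appear anywhere: the teacher's outputs are the training target, which is what lets a distilled ViT-B or ConvNeXt variant inherit the 7B model's representation at a fraction of the inference cost.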

Note: This repost commentary is the author's own original work and is for reference only.

Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.
