ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Technical Advances in Plant Science

Volume 16 - 2025 | doi: 10.3389/fpls.2025.1546756

Reducing Annotation Effort in Agricultural Data: Simple and Fast Unsupervised Coreset Selection with DINOv2 and K-Means

Provisionally accepted
Laura Gómez-Zamanillo1,2*, Nagore Portilla1, Artzai Picon1,2, Itziar Egusquiza1,2, Ramon Navarra3, Andoni Elola2, Arantza Bereciartua-Perez1
  • 1Tecnalia Research & Innovation, San Sebastian, Spain
  • 2University of the Basque Country, Bilbao, Basque Country, Spain
  • 3BASF (Germany), Ludwigshafen am Rhein, Germany

The final, formatted version of the article will be published soon.

The need for large amounts of annotated data is a major obstacle to adopting deep learning in agricultural applications, where annotation is typically time-consuming and requires expert knowledge. To address this issue, methods have been developed to select data for manual annotation that represent the variability present in the dataset, thereby avoiding redundant information. Coreset selection methods aim to choose a small subset of samples that best represents the entire dataset. These methods can therefore be used to select a reduced set of samples for annotation, optimizing the training of a deep learning model for the best possible performance. In this work, we propose a simple yet effective coreset selection method that combines the recent foundation model DINOv2, used as a powerful feature extractor, with the well-known K-Means clustering algorithm. Samples are selected from each computed cluster to form the final coreset. The proposed method is validated by comparing the performance of a multiclass classification model trained on datasets reduced either randomly or with the proposed method. This validation is conducted on two different datasets, and in both cases the proposed method achieves better results, with improvements of up to 0.15 in F1 score at substantial reductions of the training sets. Additionally, we study the importance of using DINOv2 as the feature extractor for achieving these results.
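The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: random vectors stand in for DINOv2 image embeddings, and the within-cluster selection rule (taking the sample nearest each K-Means centroid) is an assumption, since the abstract does not specify how samples are drawn from each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def select_coreset(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Cluster feature vectors with K-Means and return, for each cluster,
    the index of the sample closest to the centroid (assumed selection rule)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    # One representative per cluster: the nearest sample to each centroid.
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, features)
    return idx


# Stand-in for DINOv2 embeddings: 200 images, 384-dimensional features
# (the dimensionality of DINOv2 ViT-S/14; purely illustrative here).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 384))

coreset = select_coreset(features, k=20)
print(len(coreset))  # 20 samples selected for manual annotation
```

In practice, the `features` array would come from a frozen DINOv2 backbone applied to the unlabeled image pool, and only the selected `coreset` images would be sent to experts for annotation.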

Keywords: label-efficient learning, coreset selection, foundation models, agriculture, unsupervised clustering

Received: 17 Dec 2024; Accepted: 22 Apr 2025.

Copyright: © 2025 Gómez-Zamanillo, Portilla, Picon, Egusquiza, Navarra, Elola and Bereciartua-Perez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Laura Gómez-Zamanillo, Tecnalia Research & Innovation, San Sebastian, Spain

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.