Self-Supervised Vision Transformers for Agricultural Disease Detection in Unstructured Field Environments

Otis A. Powell; Beremy Thornton; Kdwin Moran; ZhenTian Tang

Authors

Otis A. Powell, Department of Computer Science, University of Alabama at Birmingham, Birmingham, AL, USA
Beremy Thornton, School of Information Technology, University of Cincinnati, Cincinnati, OH, USA
Kdwin Moran, Department of Computer Science, University of New Hampshire, Durham, NH, USA
ZhenTian Tang, School of Computing, Clemson University, Clemson, SC, USA

Keywords:

self-supervised learning, vision transformers, agricultural disease detection, unstructured environments, deep learning deployment, socio-technical systems

Abstract

The detection of crop diseases in unstructured field environments remains a critical bottleneck for global food security, as traditional supervised deep learning models require extensive labeled datasets that are costly to obtain and often fail to generalize across diverse agronomic conditions. This paper investigates the application of self-supervised vision transformers (ViTs) as a foundational architecture for agricultural disease detection, emphasizing system-level design, deployment trade-offs, and socio-technical implications. We argue that self-supervised pretraining on unlabeled field imagery enables ViTs to learn robust visual representations that are invariant to lighting, occlusion, and scale variations inherent in real-world agriculture. The paper examines the architectural trade-offs between convolutional neural networks and vision transformers, the computational infrastructure required for large-scale pretraining, and the governance challenges of deploying such systems in low-resource settings. We further explore the sustainability of self-supervised pipelines in terms of energy consumption, data privacy, and model fairness across different crop types and geographic regions. By drawing on cross-domain comparisons with medical imaging and autonomous driving, we highlight how self-supervised ViTs can reduce annotation burdens while maintaining high diagnostic accuracy. Policy implications for open-data agricultural frameworks and equitable access to AI-driven diagnostics are discussed. The paper concludes with forward-looking recommendations for integrating self-supervised vision transformers into national agricultural extension systems and precision farming initiatives.
