An Erudite Fine-Grained Visual Classification Model

Dongliang Chang¹ Yujun Tong¹ Ruoyi Du¹ Timothy Hospedales² Yi-Zhe Song³ Zhanyu Ma¹*

¹Beijing University of Posts and Telecommunications, CN ²University of Edinburgh, UK
³SketchX, CVSSP, University of Surrey, UK

CVPR 2023

Figure 1. How can the fine-grained label of an object be identified? Current paradigms require two stages: coarse-grained visual classification followed by fine-grained visual classification. This paper merges the two stages into a single erudite fine-grained visual classification model, which can directly recognise the fine-grained labels of objects across different coarse-grained label spaces.

Introduction

Current fine-grained visual classification (FGVC) models are isolated. In practice, we first need to identify the coarse-grained label of an object, then select the corresponding FGVC model for recognition. This hinders the application of FGVC algorithms in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained on several different datasets, which can efficiently and accurately predict an object’s fine-grained label across the combined label space. A pilot study showed that positive and negative transfer co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between different datasets. In detail, we reduce negative transfer by decoupling the deep features through multiple dataset-specific feature extractors. Subsequently, the decoupled features are re-fused channel-wise to facilitate positive transfer. Finally, we propose a meta-learning-based dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic. Experimental results across 11 different mixed datasets built on four different FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can be easily combined with existing FGVC methods to obtain state-of-the-art results.
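
To make the idea of a combined label space concrete, here is a minimal Python sketch. It assumes that each (dataset, class) pair simply receives one index in a single shared output space, so one classifier head can cover every dataset. The function name `build_combined_label_space` and the toy dataset and class names are hypothetical and do not come from the paper or its code.

```python
from typing import Dict, List, Tuple

LabelKey = Tuple[str, str]  # (dataset name, class name)


def build_combined_label_space(
    datasets: Dict[str, List[str]],
) -> Tuple[Dict[LabelKey, int], List[LabelKey]]:
    """Give every (dataset, class) pair one index in a single shared label space."""
    to_index: Dict[LabelKey, int] = {}
    from_index: List[LabelKey] = []
    for dataset_name, class_names in datasets.items():
        for class_name in class_names:
            # The next free index is the current size of the combined space.
            to_index[(dataset_name, class_name)] = len(from_index)
            from_index.append((dataset_name, class_name))
    return to_index, from_index


# Hypothetical toy datasets, used only for illustration.
toy_datasets = {
    "birds": ["albatross", "cardinal"],
    "cars": ["sedan", "coupe", "suv"],
}
to_index, from_index = build_combined_label_space(toy_datasets)
```

With such a mapping, a single prediction index identifies both the coarse-grained category (the dataset) and the fine-grained class, so no separate coarse-grained stage is needed at inference time.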

Pilot Study

Figure 2. Performance differences (∆ (%)) between multi-dataset training and training on each dataset alone. Subplots indicate target datasets, and bars correspond to the extra data used for training.
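
The quantity ∆ in Figure 2 can be sketched as a simple difference in accuracy. The snippet below assumes ∆ is the accuracy after multi-dataset training minus the accuracy of training on the target dataset alone, in percentage points. The helper names and all numbers are placeholders invented for illustration; they are not results from the paper.

```python
from typing import Dict


def transfer_deltas(
    joint_acc: Dict[str, float], single_acc: Dict[str, float]
) -> Dict[str, float]:
    """Return joint-minus-single accuracy (percentage points) per target dataset."""
    return {name: round(joint_acc[name] - single_acc[name], 2) for name in single_acc}


def label_transfer(delta: float) -> str:
    """Name the kind of transfer implied by the sign of delta."""
    if delta > 0:
        return "positive"
    if delta < 0:
        return "negative"
    return "neutral"


# Placeholder accuracies (%), invented for illustration; NOT results from the paper.
single = {"dataset_a": 80.0, "dataset_b": 90.0}
joint = {"dataset_a": 81.5, "dataset_b": 89.0}

deltas = transfer_deltas(joint, single)
labels = {name: label_transfer(delta) for name, delta in deltas.items()}
```

A mix of positive and negative entries in `labels` corresponds to the co-occurrence of positive and negative transfer that the pilot study reports.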

Table 1. Evaluation of the feature distribution of test samples after joint training. ♭ denotes the distribution of samples within each individual dataset, and † denotes the distribution of samples between any two datasets. Underlining indicates the best results.

Our Solution

Figure 3. A schematic illustration of the proposed method. The input x contains data belonging to multiple datasets. A mix of three datasets is shown as an example (i.e., N = 3). The dataset-specific classifiers (G1, G2, and G3) are used only in the training stage.
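
As a rough illustration of channel-wise re-fusion, the sketch below combines the outputs of N dataset-specific branches one channel at a time. This page does not specify the actual re-fusion module, so the softmax-gated weighted sum used here is an assumption, and the names `refuse_channelwise` and `gate_logits` are hypothetical.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def refuse_channelwise(
    branch_features: List[List[float]], gate_logits: List[List[float]]
) -> List[float]:
    """Fuse N branch feature vectors into one vector, channel by channel.

    branch_features[n][c] is channel c of dataset-specific branch n.
    gate_logits[n][c] scores how much branch n contributes to channel c
    (assumed to be learned; here they are passed in directly).
    """
    n_branches = len(branch_features)
    n_channels = len(branch_features[0])
    fused = []
    for c in range(n_channels):
        # Normalise the branch scores for this channel so they sum to one.
        weights = softmax([gate_logits[n][c] for n in range(n_branches)])
        fused.append(sum(w * branch_features[n][c] for n, w in enumerate(weights)))
    return fused


# Two hypothetical branches with two channels each.
features = [[1.0, 0.0], [3.0, 4.0]]
uniform = refuse_channelwise(features, [[0.0, 0.0], [0.0, 0.0]])
first_only = refuse_channelwise(features, [[50.0, 50.0], [-50.0, -50.0]])
```

With equal gate scores, each fused channel is the plain average of the branches; with strongly skewed scores, a channel is taken almost entirely from one branch. This is the sense in which such a gate could pass along helpful channels from other datasets while suppressing harmful ones.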

Results

Table 2. Comparisons with different baselines. Underlining indicates the best results.

Visualization

Figure 4. We highlight the supporting visual regions for the attention layers of the two compared models. The red circles denote the visual regions that only our model focuses on.

Bibtex

If this work is useful for you, please cite it:

@inproceedings{Chang2023Erudite,
    title={An Erudite Fine-Grained Visual Classification Model},
    author={Chang, Dongliang and Tong, Yujun and Du, Ruoyi and Hospedales, Timothy and Song, Yi-Zhe and Ma, Zhanyu},
    booktitle={CVPR},
    year={2023}
}