An Erudite Fine-Grained Visual Classification Model

Dongliang Chang1      Yujun Tong1      Ruoyi Du1      Timothy Hospedales2      Yi-Zhe Song3      Zhanyu Ma1*

1Beijing University of Posts and Telecommunications, CN       2University of Edinburgh, UK
3SketchX, CVSSP, University of Surrey, UK

CVPR 2023



Figure 1. How to identify the fine-grained labels of an object? Current paradigms require two stages: coarse-grained visual classification followed by fine-grained visual classification. This paper replaces the two stages with a single erudite fine-grained visual classification model, which directly recognises the fine-grained labels of objects across different coarse-grained label spaces.
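The combined label space mentioned in Figure 1 can be illustrated with a small sketch. This is only one plausible way to merge per-dataset labels into a single index range, not necessarily how the authors implement it; the dataset names, class counts, and helper functions below are hypothetical placeholders.

```python
from itertools import accumulate

# Hypothetical class counts per dataset (placeholder values, not from the paper).
class_counts = {"birds": 3, "cars": 2, "aircraft": 4}


def build_offsets(counts):
    """Map each dataset name to the first index it occupies in the combined label space."""
    names = list(counts)
    running_totals = list(accumulate(counts[name] for name in names))
    starts = [0] + running_totals[:-1]
    return dict(zip(names, starts))


def to_global(dataset, local_label, offsets):
    """Convert a per-dataset label into an index in the combined label space."""
    return offsets[dataset] + local_label


def to_local(global_label, counts, offsets):
    """Recover (dataset, per-dataset label) from a combined-space index."""
    for name, start in offsets.items():
        if start <= global_label < start + counts[name]:
            return name, global_label - start
    raise ValueError(f"label {global_label} is outside the combined label space")


offsets = build_offsets(class_counts)
total_classes = sum(class_counts.values())
```

With a mapping like this, a single classifier with `total_classes` outputs can predict the fine-grained label directly, and `to_local` recovers the coarse-grained dataset as a by-product instead of requiring a separate first stage.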


Current fine-grained visual classification (FGVC) models are isolated. In practice, we first need to identify the coarse-grained label of an object, then select the corresponding FGVC model for recognition. This hinders the application of FGVC algorithms in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained on several datasets, which can efficiently and accurately predict an object’s fine-grained label across the combined label space. A pilot study shows that positive and negative transfer co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between datasets. In detail, we reduce negative transfer by decoupling the deep features through several dataset-specific feature extractors. Subsequently, the decoupled features are re-fused channel-wise to facilitate positive transfer. Finally, we propose a meta-learning based dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic. Experimental results across 11 mixed datasets built from four FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can be easily combined with existing FGVC methods to obtain state-of-the-art results.
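The disentangle-then-re-fuse idea from the abstract can be sketched on a single channel vector. This is a toy illustration under stated assumptions, not the authors' implementation: each dataset-specific extractor is modelled as a plain linear map, and the channel-wise re-fusion is modelled as a softmax-weighted sum over branches. The gating scheme and all names (`disentangle`, `refuse`, `gate_logits`) are assumptions introduced here.

```python
import math
import random


def linear(weights, x):
    """Apply a C_out x C_in weight matrix to a channel vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def disentangle(shared, extractors):
    """Run every dataset-specific extractor on the same shared feature."""
    return [linear(weights, shared) for weights in extractors]


def refuse(branches, gate_logits):
    """Channel-wise re-fusion (assumed form): for each channel, take a
    softmax-weighted sum of that channel across all branches."""
    n_branches = len(branches)
    n_channels = len(branches[0])
    fused = []
    for c in range(n_channels):
        weights = softmax([gate_logits[i][c] for i in range(n_branches)])
        fused.append(sum(weights[i] * branches[i][c] for i in range(n_branches)))
    return fused


random.seed(0)
N, C = 3, 4  # N datasets, C feature channels (toy sizes)
shared = [random.random() for _ in range(C)]
extractors = [
    [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(C)]
    for _ in range(N)
]
gate_logits = [[0.0] * C for _ in range(N)]  # all-zero logits give a plain average

branches = disentangle(shared, extractors)
fused = refuse(branches, gate_logits)
```

In this toy version the gates are fixed; in a trained model they would presumably be learned so that each channel draws on the branches that help and suppresses those that hurt.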

Pilot Study


Figure 2. Performance differences (∆ (%)) between multi-dataset training and training on each dataset alone. Each subplot corresponds to a target dataset, and each bar corresponds to the extra data used for training.
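The quantity plotted in Figure 2 can be read as the accuracy after joint training minus the accuracy of the single-dataset model on the same target. The sketch below computes it for placeholder numbers; the accuracies, the dataset names "A" and "B", and the function name are illustrative assumptions, not results from the paper.

```python
def transfer_deltas(alone_acc, joint_acc):
    """Return delta (%) per (target, extra) pair: joint accuracy minus alone accuracy.

    alone_acc: {target: accuracy when trained on the target dataset only}
    joint_acc: {(target, extra): accuracy on target when extra data is added}
    """
    return {
        (target, extra): round(acc - alone_acc[target], 2)
        for (target, extra), acc in joint_acc.items()
    }


def transfer_kind(delta):
    """Label a delta as positive, negative, or neutral transfer."""
    if delta > 0:
        return "positive"
    if delta < 0:
        return "negative"
    return "neutral"


# Placeholder accuracies (%), for illustration only.
alone = {"A": 80.0, "B": 90.0}
joint = {("A", "B"): 81.5, ("B", "A"): 89.0}

deltas = transfer_deltas(alone, joint)
kinds = {pair: transfer_kind(d) for pair, d in deltas.items()}
```

A positive ∆ corresponds to positive transfer from the extra dataset, and a negative ∆ to negative transfer; the pilot study reports that both occur.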


Table 1. Evaluation of the feature distribution of test samples after joint training. ♭ denotes the distribution of samples within each individual dataset, and † denotes the distribution of samples between any two datasets. Underlining indicates the best results.

Our Solution


Figure 3. A schematic illustration of the proposed method. The input x contains data belonging to multiple datasets. A mix of three datasets is shown as an example (i.e., N = 3). The dataset-specific classifiers (G1, G2, and G3) are used only in the training stage.
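One way to read the note that G1, G2, and G3 are training-only is as auxiliary heads that add a loss term during training and are dropped at inference. The sketch below is a guess at that pattern, not the authors' objective: it assumes each sample supervises only the head of its own dataset, and `aux_weight`, `training_loss`, and `predict` are hypothetical names.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def training_loss(global_logits, global_target,
                  branch_logits, dataset_id, local_target,
                  aux_weight=1.0):
    """Main loss over the combined label space plus an auxiliary loss from the
    dataset-specific classifier of the sample's own dataset (assumed setup)."""
    main = cross_entropy(global_logits, global_target)
    aux = cross_entropy(branch_logits[dataset_id], local_target)
    return main + aux_weight * aux


def predict(global_logits):
    """Inference uses only the combined-space classifier; the G_i heads are unused."""
    return max(range(len(global_logits)), key=global_logits.__getitem__)


# Toy example: N = 3 datasets with 2 classes each, so 6 combined classes.
global_logits = [0.0] * 6
branch_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = training_loss(global_logits, 3, branch_logits, dataset_id=1, local_target=1)
```

Because `predict` never touches `branch_logits`, the auxiliary heads add no cost at test time, which matches the caption's statement that they are used only during training.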



Table 2. Comparisons with different baselines. Underlining indicates the best results.



Figure 4. We highlight the supporting visual regions for the attention layers of the two compared models. The red circles denote the visual regions that only our model focuses on.


If this work is useful for you, please cite it:
    title={An Erudite Fine-Grained Visual Classification Model},
    author={Chang, Dongliang and Tong, Yujun and Du, Ruoyi and Hospedales, Timothy and Song, Yi-Zhe and Ma, Zhanyu},

Proudly created by Dongliang Chang @ BUPT