Your "Flamingo" is My "Bird": Fine-Grained, or Not

Dongliang Chang¹ Kaiyue Pang² Yixiao Zheng¹ Zhanyu Ma¹* Yi-Zhe Song² Jun Guo¹

¹Beijing University of Posts and Telecommunications, CN ²SketchX, CVSSP, University of Surrey, UK

CVPR 2021

Figure 1. The definition of what is fine-grained is subjective. Your “flamingo” is my “bird”.

Introduction

Whether what you see in Figure 1 is a "flamingo" or a "bird" is the question we ask in this paper. While fine-grained visual classification (FGVC) strives to arrive at the former, for the majority of us non-experts just "bird" would probably suffice. The real question is therefore: how can we tailor for different fine-grained definitions under divergent levels of expertise? For that, we re-envisage the traditional setting of FGVC, moving from single-label classification to top-down traversal of a pre-defined coarse-to-fine label hierarchy, so that our answer becomes "bird"-->"Phoenicopteriformes"-->"Phoenicopteridae"-->"flamingo". To approach this new problem, we first conduct a comprehensive human study, in which we confirm that most participants prefer multi-granularity labels, regardless of whether they consider themselves experts. We then discover the key intuition that coarse-level label prediction harms fine-grained feature learning, whereas fine-level features improve the learning of the coarse-level classifier. This discovery enables us to design a very simple yet surprisingly effective solution to our new problem, where we (i) leverage level-specific classification heads to disentangle coarse-level features from fine-grained ones, and (ii) allow finer-grained features to participate in coarser-grained label predictions, which in turn helps with better disentanglement. Experiments show that our method achieves superior performance in the new FGVC setting, and also performs better than the state of the art on the traditional single-label FGVC problem. Thanks to its simplicity, our method is parameter-free and can be easily implemented on top of any existing FGVC framework.

Human Study

Figure 2. Human study on the CUB-200-2011 bird dataset. Order, family, and species are the three coarse-to-fine levels of the label hierarchy for a bird image. A higher group id represents a group of people with better domain knowledge of birds, with group 5 interpreted as domain experts. (a) Human preference between single and multiple labels. (b) Impact of human familiarity with birds on single-label choice. (c) Impact of human familiarity with birds on multi-label choice.

Cooperation or Confrontation?

To explore the transfer effect in the joint learning of multi-granularity labels, we design an image classification task for predicting two labels at different granularities.

Figure 3. Joint learning of two-granularity labels under different weighting strategies on the CUB-200-2011 bird dataset. (a) x-axis: β value that controls the relative importance of the fine-grained classifier; y-axis: performance of the coarse-grained classifier. (b) x-axis: α value that controls the relative importance of the coarse-grained classifier; y-axis: performance of the fine-grained classifier.
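
The weighting in Figure 3 can be pictured with a small sketch. It assumes, as an illustration rather than as the paper's stated objective, that the joint loss is a weighted sum of two cross-entropy terms, with α scaling the coarse-grained term and β scaling the fine-grained term. The function names (`softmax`, `cross_entropy`, `joint_loss`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(coarse_logits, coarse_target, fine_logits, fine_target,
               alpha=1.0, beta=1.0):
    """Assumed form of the two-granularity objective:
    alpha * CE(coarse) + beta * CE(fine)."""
    coarse_term = cross_entropy(coarse_logits, coarse_target)
    fine_term = cross_entropy(fine_logits, fine_target)
    return alpha * coarse_term + beta * fine_term


# Example: 2 coarse classes, 4 fine classes, fine term down-weighted.
loss = joint_loss([2.0, 0.5], 0, [1.0, 0.2, 0.1, 3.0], 3, alpha=1.0, beta=0.5)
```

Under this assumed form, setting `beta=0` trains the coarse-grained classifier alone and setting `alpha=0` trains the fine-grained classifier alone, so sweeping one weight while fixing the other mirrors the two panels of Figure 3.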

Our Solution

Figure 4. A schematic illustration of our FGVC model with multi-granularity label output. BP: backpropagation.
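
The wiring described in the introduction (level-specific heads, with finer-grained features also feeding coarser-grained predictions) might be sketched as follows. This is a forward-pass illustration only, in plain Python. The contiguous per-level feature split, the chunk sizes, and the helper names `split_features` and `head_inputs` are all assumptions made for the example, not details confirmed by this page.

```python
def split_features(features, sizes):
    """Split a flat feature vector into consecutive per-level chunks,
    ordered from coarsest to finest level (assumed layout)."""
    if sum(sizes) != len(features):
        raise ValueError("chunk sizes must sum to the feature length")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(features[start:start + size])
        start += size
    return chunks


def head_inputs(chunks):
    """Build the input of each level-specific head.

    The head at level i receives its own chunk plus every finer chunk,
    so finer-grained features take part in coarser predictions.
    """
    inputs = []
    for level, own in enumerate(chunks):
        finer = [x for chunk in chunks[level + 1:] for x in chunk]
        # Assumption: in an autograd framework, `finer` would be detached
        # here so that the coarse loss does not update fine-level features.
        inputs.append(own + finer)
    return inputs


# Example with three levels (order, family, species) and 2 features each.
features = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
order_in, family_in, species_in = head_inputs(split_features(features, [2, 2, 2]))
```

In this sketch the species head sees only its own two features, the family head sees four, and the order head sees all six. Reusing existing features in this way, rather than adding new layers, is in the spirit of the parameter-free claim above.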

Results

Table 1. Comparisons with different baselines for the FGVC task under the multi-granularity label setting.

Table 2. Performance comparisons in the traditional FGVC setting with single fine-grained label output.

Visualization

Figure 5. We highlight the supporting visual regions for classifiers at different granularities for the two compared models. Order, Family, and Species represent three coarse-to-fine classifiers trained on the CUB-200-2011 bird dataset.

Bibtex

If this work is useful for you, please cite it:

@inproceedings{dongliang2021flamingo,
    title={Your "Flamingo" is My "Bird": Fine-Grained, or Not},
    author={Chang, Dongliang and Pang, Kaiyue and Zheng, Yixiao and Ma, Zhanyu and Song, Yi-Zhe and Guo, Jun},
    booktitle={CVPR},
    year={2021}
}