As the dataset contains 80,000 images, it is not included in this repo in full. The dataset can be generated trivially by installing requirements.txt and running main.py. It is then loaded into memory as PyTorch Dataset objects that can be used for downstream tasks.
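For illustration, here is a minimal sketch of how the generated splits might be wrapped for downstream training, assuming main.py produces standard PyTorch Dataset objects; the function and argument names are hypothetical, not the repo's actual API:

```python
from torch.utils.data import DataLoader, Dataset

def make_loaders(train_set: Dataset, test_sets: dict, batch_size: int = 64):
    """Wrap the generated splits in DataLoaders for training and evaluation.

    `train_set` and `test_sets` stand in for the Dataset objects produced by
    main.py; the exact names and structure are assumptions for this sketch.
    """
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loaders = {name: DataLoader(ds, batch_size=batch_size) for name, ds in test_sets.items()}
    return train_loader, test_loaders
```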
Large-scale image datasets, such as ImageNet, almost always contain spurious correlations between the object of interest and the background of the image. As these datasets are used for (pre-)training large foundation models, it is of interest to investigate whether image networks learn from these correlations to make their predictions. To make this explicit with an example: if an image of a dog is also likely to contain grass in the background, will a CNN rely on the presence of grass to predict the presence of a dog?
This blog introduces a control dataset designed to test this property. The dataset combines two image datasets, Describable Textures Dataset (DTD) and MNIST, by replacing the MNIST backgrounds with textures from DTD.
There have been many recent studies that have shown the tendency for CNNs to use background or texture signals for predictions, rather than only using object features. I list some of these papers below:
- Noise or Signal: The Role of Image Backgrounds in Object Recognition: This paper examines the reliance on background features for classification on the ImageNet dataset. The authors propose a framework for separating foreground from background on ImageNet, which they use to reach several conclusions. They show that models can achieve better-than-random accuracy from the background alone, and that adversarially chosen backgrounds can lower classification accuracy.
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness: In a similar vein, this paper shows that CNNs rely more on recognising textures than on shape features. This is another property tested by this control dataset, as the backgrounds from DTD are strongly textured rather than carrying "higher-level" shape features.
- Impacts of Background Removal on Convolutional Neural Networks for Plant Disease Classification In-Situ: Lastly, this paper investigates the impact of removing image backgrounds for classification tasks. Background removal can greatly improve sample efficiency, as the model does not need to learn to overcome, or rely on, spurious background correlations. My control dataset tests a similar property, as one of the test splits removes the background and keeps only the object under test. My setup differs slightly, however, in that my training set does include backgrounds, which was not the case in that paper.
These papers motivate the need for highly controlled datasets that each test a single property. Even though there has been plenty of research on the effect of the background on image classification, each paper tests slightly different hypotheses, often on the highly diverse ImageNet dataset. A simpler and more controlled dataset makes it possible to evaluate a single hypothesis more precisely, without confounding explanations.
This section outlines the controlled dataset and how it was generated. The dataset combines MNIST for the class labels with DTD for the background noise. I created four splits: one training set and three test sets that each probe a different property. For the training set, each digit is assigned a texture class from DTD, and for each element in the MNIST train set a random texture from the corresponding DTD class is added as the background of the digit. For the test sets, the digits receive three different backgrounds: 1) the correct DTD background class, 2) an uncorrelated DTD background class, and 3) no background at all. The splits aim to test the following properties (a short sketch of the split logic follows the list):
- Tests whether the model exploits the foreground-background correlations for its predictions.
- Tests how robust the model is when the foreground-background correlation is broken.
- Tests whether the model has learned to use the foreground digit features for classification.
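To make the three test splits concrete, the sketch below shows how the background texture category could be chosen for a single test image. The digit-to-texture mapping and the split names are illustrative assumptions (the categories are real DTD class names, but the actual assignment used in main.py may differ):

```python
import random
from typing import Optional

# Hypothetical fixed digit -> DTD texture-category mapping; the real assignment
# used by main.py may differ.
DIGIT_TO_TEXTURE = {
    0: "banded", 1: "dotted", 2: "knitted", 3: "cracked", 4: "bubbly",
    5: "grid", 6: "marbled", 7: "striped", 8: "woven", 9: "zigzagged",
}

def pick_texture_class(digit: int, split: str) -> Optional[str]:
    """Choose the DTD category used as the background for one test image."""
    if split == "correlated":    # 1) same digit-texture pairing as in training
        return DIGIT_TO_TEXTURE[digit]
    if split == "uncorrelated":  # 2) break the correlation with a random other category
        others = [t for d, t in DIGIT_TO_TEXTURE.items() if d != digit]
        return random.choice(others)
    if split == "plain":         # 3) no background: keep the original MNIST digit
        return None
    raise ValueError(f"unknown split: {split}")
```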
Some samples from each split are visualized below.
The concrete steps of how the dataset was generated are listed below:
- Download MNIST handwritten digit dataset (28×28 grayscale images, 10 classes) and Describable Textures Dataset (larger images)
- Assign each of the 10 digit classes to a specific texture category from DTD, giving a fixed digit-texture mapping
- Extract the digit foreground using a pixel intensity threshold (>0.1), then resize the selected texture image to 28×28 pixels and convert it to grayscale
- Composite images by overlaying digit pixels onto the texture background using the formula composite = mask × digit + (1 − mask) × texture (a minimal compositing sketch follows this list)
- Generate four dataset splits: a training set with fixed digit-texture correlations, a test set with the same correlations as training, a test set with random texture backgrounds, and a test set with the original MNIST digits without textures
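As referenced above, here is a minimal sketch of the masking and compositing steps, assuming the digit is a 28×28 grayscale array already normalised to [0, 1]; the helper name is illustrative:

```python
import numpy as np
from PIL import Image

def composite_digit_on_texture(digit: np.ndarray, texture: Image.Image) -> np.ndarray:
    """Overlay a 28x28 MNIST digit (values in [0, 1]) onto a DTD texture image."""
    # Resize the texture to 28x28, convert it to grayscale, and scale to [0, 1].
    tex = np.asarray(texture.convert("L").resize((28, 28)), dtype=np.float32) / 255.0
    # Binary foreground mask from the pixel-intensity threshold (> 0.1).
    mask = (digit > 0.1).astype(np.float32)
    # composite = mask * digit + (1 - mask) * texture
    return mask * digit + (1.0 - mask) * tex
```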
Further work could concretely test this dataset with different image classification architectures. Beyond an analysis of the accuracy scores on the different splits, it would be interesting to use techniques such as saliency maps to find out which parts of the input contribute most to the gradients/predictions of the network. Some research (e.g. the "Noise or Signal" paper) has shown that models achieving higher accuracy focus less on background features, so investigating the correlation between background exploitation and model complexity could be an interesting research direction. Additionally, another control dataset could overlay the texture on the foreground pixels rather than the background, to test how much models rely on the texture of the class rather than its shape, although this is very similar to the current setup and may not result in much difference.
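As a starting point for such an analysis, here is a minimal vanilla-gradient saliency sketch in PyTorch. It assumes `model` is any classifier trained on the 28×28 inputs, and it is only one of several attribution methods one could use:

```python
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor, label: int) -> torch.Tensor:
    """Absolute input gradient of the target-class logit for one (1, 1, 28, 28) image."""
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    score = model(image)[0, label]  # logit of the ground-truth class
    score.backward()
    # Higher values indicate pixels with more influence on the prediction.
    return image.grad.abs().squeeze()
```

Comparing these maps between the correlated and uncorrelated test splits would indicate whether the model's attention shifts towards background texture pixels.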
[1] Dataset Link (GitHub)
[2] Describable Textures Dataset (DTD)
[3] MNIST
[4] Noise or Signal: The Role of Image Backgrounds in Object Recognition
[5] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
[6] Impacts of Background Removal on Convolutional Neural Networks for Plant Disease Classification In-Situ
