Enhanced Semantic Similarity Learning Framework for Image-Text Matching

License: MIT

Official PyTorch implementation of the paper Enhanced Semantic Similarity Learning Framework for Image-Text Matching.

If you use any resources from this repository, please cite the paper with the following BibTeX entry.

@article{zhang2023enhanced,
  title={Enhanced Semantic Similarity Learning Framework for Image-Text Matching},
  author={Zhang, Kun and Hu, Bo and Zhang, Huatian and Li, Zhe and Mao, Zhendong},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2023},
  publisher={IEEE}
}

Our codebase builds on the implementation of X-Pool.

Motivation

Squares denote local dimension elements of a feature. Circles denote measure-units, i.e., the minimal basic components used to examine semantic similarity. (a) Existing methods typically default to a static mechanism that examines only single-dimensional cross-modal correspondence. (b) Our key idea is to dynamically capture and learn multi-dimensional enhanced correspondence: the number of dimensions constituting a measure-unit is extended from a single dimension to hierarchical multi-level groups, which enriches the granularity of the examined information and promotes more comprehensive semantic similarity learning.

Introduction

In this paper, in contrast to single-dimensional correspondence with its limited semantic expressive capability, we propose a novel Enhanced Semantic similarity Learning (ESL) framework, which generalizes both the measure-units and their correspondences into a dynamic, learnable framework that examines multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise intra-modal multi-dimensional aggregators with an iterative enhancing mechanism, which dynamically capture new measure-units integrated over hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities that provide richer and more discriminative information for similarity examination. We then devise inter-modal enhanced correspondence learning with sparse contribution degrees, which determines the cross-modal semantic similarity comprehensively and efficiently. Extensive experiments verify its superiority in achieving state-of-the-art performance.
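To make the idea concrete, the following is a minimal conceptual sketch in PyTorch, not the official implementation: it aggregates local feature dimensions into measure-units at several granularities and averages the per-level cosine similarities. The level sizes, the pooling choice, and the uniform weighting are illustrative assumptions.

import torch
import torch.nn.functional as F

# Aggregate the dimensions of x (B, D) into measure-units at several granularities.
def multi_level_units(x, levels=(1, 2, 4)):
    units = []
    for k in levels:
        # Group every k adjacent dimensions into one measure-unit (average pooling here).
        units.append(F.avg_pool1d(x.unsqueeze(1), kernel_size=k, stride=k).squeeze(1))
    return units

# Combine per-level cosine similarities into one score (uniform weights for simplicity).
def enhanced_similarity(img, txt, levels=(1, 2, 4)):
    sims = [F.cosine_similarity(u, v, dim=-1)
            for u, v in zip(multi_level_units(img, levels), multi_level_units(txt, levels))]
    return torch.stack(sims, dim=0).mean(dim=0)

img_feat = torch.randn(8, 512)  # toy image embeddings
txt_feat = torch.randn(8, 512)  # toy text embeddings
print(enhanced_similarity(img_feat, txt_feat).shape)  # torch.Size([8])

In the actual framework, the aggregation and the contribution of each level are learned rather than fixed, which is what makes the measure-units and their correspondences dynamic.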

Image-text Matching Results

The following tables show partial image-to-text retrieval results on the COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models using CLIP encoders as the backbone; please switch to the BERT-based branch for the corresponding code and pre-trained models.

The pre-trained models for MS-COCO are available as model_best_heuristic_coco_clip_based.pth and model_best_adaptive_coco_clip_based.pth. The pre-trained models for Flickr30K were lost because they were not saved in time; you can reproduce those results by training the model yourself.

Preparation

Environment

  • The required environment is specified here. Use conda env create -f ESL_CLIP_based.yaml to create the corresponding conda environment.

Data

The required files dataset_flickr30k.json, train_coco.json, testall_coco.json, and dev_coco.json can be found here.

You can download the raw image datasets through Flickr30K and MS-COCO.

Training

sh train_clip_based_f30k.sh
sh train_clip_based_coco.sh

For the dimensional selective mask, we design both heuristic and adaptive strategies. You can use the flag at line 32 of ./modules/transformer.py

heuristic_strategy = False

to control which strategy is selected: True selects the heuristic strategy, and False selects the adaptive strategy. A rough illustration of the difference between the two is sketched below.
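As a rough illustration only (the actual mask construction lives in ./modules/transformer.py and may differ), a heuristic strategy could apply a fixed rule such as keeping the highest-magnitude dimensions, while an adaptive strategy learns the mask from the feature itself. All names and the top-k rule below are assumptions, not the repository's API.

import torch
import torch.nn as nn

heuristic_strategy = False  # True -> heuristic mask, False -> adaptive mask

# Heuristic: keep a fixed fraction of dimensions, chosen by feature magnitude.
def heuristic_mask(feat, keep_ratio=0.5):
    k = int(feat.size(-1) * keep_ratio)
    idx = feat.abs().topk(k, dim=-1).indices
    return torch.zeros_like(feat).scatter_(-1, idx, 1.0)

# Adaptive: predict a soft per-dimension mask from the feature itself.
class AdaptiveMask(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, feat):
        return torch.sigmoid(self.gate(feat))

feat = torch.randn(4, 512)
mask = heuristic_mask(feat) if heuristic_strategy else AdaptiveMask(512)(feat)
masked_feat = feat * mask  # dimensions are weighted (or zeroed) by the selected mask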

Evaluation

Test on Flickr30K and MS-COCO

python test.py
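test.py reports the standard Recall@K retrieval metrics. As a self-contained illustration of how these metrics are typically computed from an image-text similarity matrix (this is not code from test.py, and it assumes one matching caption per image rather than the five used by COCO and Flickr30K):

import numpy as np

def recall_at_k(sims, ks=(1, 5, 10)):
    # sims[i, j] = similarity of image i and caption j; caption i is the ground truth.
    ranks = []
    for i in range(sims.shape[0]):
        order = np.argsort(sims[i])[::-1]              # captions sorted by similarity
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the matching caption
    ranks = np.asarray(ranks)
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}

sims = np.random.rand(100, 100)  # toy similarity matrix
print(recall_at_k(sims))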
