What is Image Captioning in Computer Vision?

Deep learning is a core branch of AI. Just as the human brain relies on billions of interconnected neurons to sense and interpret the world, deep learning uses artificial neural networks in an attempt to make machines mimic a “human brain.” Image captioning, as the term suggests, is the task of generating a text caption for an image: the model identifies the objects in the image and returns a sentence that describes them.

Generating captions combines natural language processing techniques with computer vision. To train such a model, a dataset of images with corresponding text captions is required. The model consists of a Convolutional Neural Network (CNN) for feature extraction and a Recurrent Neural Network (RNN) for sentence generation. Advances in computer vision have improved how well these systems understand images, while advances in language processing have made the generated captions more accurate. Image captioning has various applications, such as editing applications, virtual assistants, image indexing, aids for visually impaired people, social media, and many other natural language processing applications.

Convolutional Neural Networks

A CNN is a type of deep neural network used for image recognition. It processes data represented as a 2D matrix, such as the images given as input here. With pooling and sufficiently varied training data, it can handle images that are scaled, translated, or rotated. It analyzes an image by sliding small filters across it and extracting the relevant features. The output is a combination of features that can be used for image classification.
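To make the idea of scanning an image for features concrete, here is a minimal sketch of the sliding-filter operation used in convolutional layers, written in plain Python. The toy image and the hand-written `vertical_edge` filter are assumptions for illustration; a real CNN learns its filter values during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the image patch under it and sum the result.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)
```

The feature map is non-zero only in the middle column, where the dark region meets the bright one, which is the kind of local feature a convolutional layer passes on to later layers.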

What is LSTM?

Long Short-Term Memory (LSTM) is a type of RNN suited to sequence prediction, much like the way Google suggests the next word based on the text typed so far. Its gates filter the data as it passes through the network, carrying relevant information forward and discarding what is not relevant.
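The filtering behaviour can be sketched with a single LSTM step on scalar values. This is a simplified illustration, not a production layer: real LSTMs work on vectors with learned weight matrices, and the weight values below are hand-picked assumptions chosen to force the gates open or closed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input and scalar state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    g = math.tanh(pre_activation("candidate"))  # the new information itself
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked weights: the input gate is held shut in both settings.
remember = {
    "forget": (0.0, 0.0, 10.0),    # forget gate close to 1: keep the memory
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
discard = dict(remember, forget=(0.0, 0.0, -10.0))  # forget gate close to 0

_, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=remember)
_, c_dropped = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=discard)
print(round(c_kept, 3), round(c_dropped, 3))
```

With the forget gate open, the cell state stays close to its previous value of 0.8; with it closed, the stored value is almost entirely discarded.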

Standard CNN-based encoders do not explicitly use information about spatial relations between objects, such as relative position and size. This information is, however, critical to understanding the content of an image and is a crucial factor when we talk about the physical world. Phrases like “boy standing beside a girl” or “book lying under the table” depend on the relative positions of the objects. Similarly, “the woman behind an elephant” and “woman behind a dog” illustrate how relative size matters. In Transformer-based captioning models, supplying these positions and sizes to the visual encoder helps generate better captions for the images.
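As a rough sketch of what "position and size" information could look like as numbers, the snippet below derives relative offsets and size ratios from two bounding boxes. The feature definitions and the box coordinates are assumptions made for illustration, not the exact features used by any particular model.

```python
import math


def relative_geometry(box_a, box_b):
    """Describe box_b relative to box_a.

    Boxes are (center_x, center_y, width, height) in pixels.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return {
        "dx": (xb - xa) / wa,                  # horizontal offset in units of box_a's width
        "dy": (yb - ya) / ha,                  # vertical offset in units of box_a's height
        "log_width_ratio": math.log(wb / wa),  # above 0 means box_b is wider
        "log_height_ratio": math.log(hb / ha), # above 0 means box_b is taller
    }


# Hypothetical detections for "the woman behind an elephant".
woman = (100.0, 200.0, 50.0, 150.0)
elephant = (300.0, 180.0, 400.0, 300.0)

geom = relative_geometry(woman, elephant)
print(geom)
```

Positive log ratios here indicate that the second object is much larger than the first, which is the kind of cue that separates an elephant from a dog in the examples above.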

Working of the Model Architecture for Image Captioning

A popular dataset called Flickr8k contains 8092 images and is only about 1 GB in size. It contains images of everyday scenes, each paired with captions. The model discussed in this article uses the Flickr8k dataset. The machine first recognizes the objects, and then a description is generated based on those objects. The images are split into three disjoint sets: the training set contains 6000 images, and the remaining images are divided between the validation (development) and test sets.
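The idea of disjoint sets can be illustrated with a small sketch. The image identifiers below are made up, and the sizes of 1000 images each for validation and test are an assumption rather than a figure from this article; in practice you would normally use the split lists that accompany the dataset instead of shuffling yourself.

```python
import random


def split_image_ids(image_ids, n_train, n_val, seed=0):
    """Shuffle the ids reproducibly and cut them into three disjoint lists."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for image file names.
image_ids = [f"img_{i:04d}" for i in range(8000)]

train_ids, val_ids, test_ids = split_image_ids(image_ids, n_train=6000, n_val=1000)
print(len(train_ids), len(val_ids), len(test_ids))
```

Keeping the sets disjoint matters because a caption model evaluated on images it has already seen during training would report an overly optimistic score.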

To evaluate the model, we measure how well the generated captions match the reference captions for each image. This is done with the BLEU (Bilingual Evaluation Understudy) score, which compares a machine-generated sentence with one or more human-written reference sentences by counting the word sequences they share. BLEU was originally designed to evaluate machine translation output and is widely used for that purpose.
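A simplified sentence-level BLEU can be sketched in plain Python as follows. This version assumes a single reference caption and applies no smoothing, so it returns 0.0 whenever any n-gram order has no overlap; library implementations such as NLTK's `sentence_bleu` support multiple references and smoothing options. The example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "a brown dog runs across the grass".split()
perfect = bleu(reference, reference)
partial = bleu("a brown dog runs on the grass".split(), reference)
unrelated = bleu("two children play football outside today".split(), reference)
print(perfect, round(partial, 3), unrelated)
```

An identical caption scores 1.0, a caption sharing no words with the reference scores 0.0, and a caption with one changed word lands in between.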

The model consists of 3 phases:

  1. Image Feature Extraction – The features of the images in the dataset are extracted using a model called VGG16, which is used for object identification. It is a CNN with 16 weight layers, arranged as blocks of convolution layers each followed by a pooling layer, and ending in fully connected layers. The photo is represented by a 4096-element vector, which is passed through a dropout layer and a dense layer. The dropout helps reduce overfitting, as this model learns quickly. The result then proceeds to the decoder, where it is combined with the output of the LSTM.
  2. Sequence Processor – This phase handles the text input. It starts with a word embedding layer, which learns a feature representation of the caption words and uses a mask to ignore padded values. The embedding layer is then connected to an LSTM, whose output is used in the final phase of generating the captions.
  3. Decoder – The final phase combines the outputs of the image feature extraction phase and the sequence processor phase using an addition operation. The result is fed to a 256-neuron dense layer and then to a final output Dense layer, which produces a softmax prediction of the next word in the caption over the entire vocabulary. The vocabulary is built from the caption text processed in the sequence processor phase.
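The decoder phase can be sketched in plain Python: two encodings are merged and turned into a probability for every word in the vocabulary. This toy version uses 3-element vectors and a 4-word vocabulary instead of real layer sizes, skips the intermediate 256-neuron layer for brevity, and uses hand-written weights; all of these are assumptions for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step(image_vec, text_vec, output_weights, output_bias):
    """Merge the two encodings by element-wise addition, then score every word."""
    merged = [a + b for a, b in zip(image_vec, text_vec)]
    scores = [
        sum(w * m for w, m in zip(row, merged)) + b
        for row, b in zip(output_weights, output_bias)
    ]
    return softmax(scores)


vocab = ["startseq", "dog", "runs", "endseq"]

# Hypothetical encodings from the image branch and the text branch.
image_vec = [0.5, 0.1, 0.0]
text_vec = [0.5, 0.0, 0.2]

# Hypothetical output layer: one row of weights per vocabulary word.
output_weights = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
output_bias = [0.0, 0.0, 0.0, 0.0]

probs = decoder_step(image_vec, text_vec, output_weights, output_bias)
next_word = vocab[probs.index(max(probs))]
print(next_word, [round(p, 3) for p in probs])
```

In a trained model the weights are learned, and the word with the highest probability (or a sampled word) becomes the next word of the caption.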

In the training phase, pairs of images and their captions are provided to the image captioning model. The VGG model is pre-trained to recognize the objects in an image. The LSTM is trained to predict each word of the sentence after it has seen the image as well as all the preceding words. For each caption, two special tokens are added to mark its start and end; whenever the end token is encountered, the model stops generating the sentence, which marks the end of the process.
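The way a single caption turns into several training examples can be sketched as follows. The token names `startseq` and `endseq` are assumptions; any pair of markers that does not occur in the captions would work.

```python
def make_training_pairs(caption, start="startseq", end="endseq"):
    """Expand one caption into (words seen so far, next word) training pairs."""
    tokens = [start] + caption.lower().split() + [end]
    pairs = []
    for i in range(1, len(tokens)):
        # Input: every token before position i. Target: the token at position i.
        pairs.append((tokens[:i], tokens[i]))
    return pairs


pairs = make_training_pairs("Dog runs fast")
for seen, target in pairs:
    print(seen, "->", target)
```

Each pair is presented to the model together with the features of the corresponding image, so a three-word caption yields four training examples, the last of which teaches the model when to stop.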

The loss function is calculated from I, which represents the input image, and S, which represents the generated caption. N represents the length of the produced caption, while p_t and S_t signify the probability and the generated word at time t, respectively. The aim is to minimize the corresponding loss function.
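A loss of this kind is commonly written as the negative log-likelihood of the correct word at each time step. The form below is an assumption based on the symbol definitions in the text, with I the image, S the caption, N its length, and p_t(S_t) the probability the model assigns to word S_t at time t:

```latex
L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
```

The loss is zero only if the model assigns probability 1 to every correct word, and it grows as those probabilities shrink.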


The implementation is done in a Python SciPy environment. Keras 2.0 is used to implement the deep learning model because it provides the VGG network, which is used for identifying objects. TensorFlow is installed and used as the backend for the Keras framework for the creation and training of deep neural networks. TensorFlow is itself a deep learning library that provides a heterogeneous platform for executing algorithms: it can run on low-power devices like mobile phones and scale up to large systems containing hundreds of GPUs.

In TensorFlow, a model is defined as a computation graph. Once the graph is defined, it can be executed on any supported platform. The features of the photos are pre-computed and saved. They are then loaded as the representation of each photo in the dataset, which avoids running every photo through the network each time we test a language model configuration. This preloading is also useful for a real-time implementation of the captioning model. The VGG model first assigns probabilities to the objects most likely to be present in the image. The model then converts the image into a vector. This vector is then turned into a sentence by the LSTM cells.
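Turning the photo vector into a sentence is typically done one word at a time. The sketch below shows a greedy generation loop under stated assumptions: `predict_next` is a hypothetical stand-in for the trained model, `toy_predict_next` simply replays a fixed script, and the `startseq` and `endseq` token names are illustrative.

```python
def generate_caption(predict_next, photo_features, max_length=20,
                     start="startseq", end="endseq"):
    """Greedily build a caption by asking for the next word until the end token."""
    words = [start]
    for _ in range(max_length):
        next_word = predict_next(photo_features, words)
        if next_word == end:
            break
        words.append(next_word)
    # Drop the start token before returning the caption.
    return " ".join(words[1:])


def toy_predict_next(photo_features, words_so_far):
    """Hypothetical stand-in for a trained model: replays a fixed script."""
    script = ["a", "dog", "runs", "endseq"]
    return script[len(words_so_far) - 1]


# A 4096-element vector stands in for the pre-computed photo features.
caption = generate_caption(toy_predict_next, photo_features=[0.0] * 4096)
print(caption)
```

The `max_length` cap guards against a model that never predicts the end token. A real implementation would replace `toy_predict_next` with a call to the trained network and pick the word with the highest predicted probability.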


The Keras API is used with TensorFlow as the backend to implement the deep learning architecture, which achieves a BLEU score of 0.683. BLEU is a metric for comparing a generated sentence to a reference sentence. A perfect match produces a score of 1.0, whereas a sentence that shares no words with the reference produces 0.0. The models are still being worked on, however, to improve performance by using word vectors trained on a larger body of text, such as articles and other sources from the web.

In this article, the author has tried to provide information for all kinds of readers, beginners and experienced alike. With the popularisation of tools like DALL-E, people have become curious about how words, sentences, and images intertwine. Once you understand how images can be described in words, you can set out on a path to create the next DALL-E. The article gives an overview of how a CNN and an LSTM work together in an image captioning architecture. The author urges readers to implement the image captioning model in their preferred language, be it Python or R.

Happy Learning!

Did you like what Pooja Gera wrote? Thank them for their work by sharing it on social media.