Image Captioning

Published: 09 Oct 2015 Category: deep_learning

Im2Text: Describing Images Using 1 Million Captioned Photographs

Show and Tell: A Neural Image Caption Generator(Google)

arXiv: http://arxiv.org/abs/1411.4555
github: https://github.com/karpathy/neuraltalk
GitXiv: http://gitxiv.com/posts/7nofxjoYBXga5XjtL/show-and-tell-a-neural-image-caption-nic-generator
github: https://github.com/apple2373/chainer_caption_generation
blog(“Image caption generation by CNN and LSTM”): http://t-satoshi.blogspot.com/2015/12/image-caption-generation-by-cnn-and-lstm.html

Deep Visual-Semantic Alignments for Generating Image Descriptions

intro: “propose a multimodal deep network that aligns various interesting regions of the image, represented using a CNN feature, with associated words. The learned correspondences are then used to train a bi-directional RNN. This model is able, not only to generate descriptions for images, but also to localize different segments of the sentence to their corresponding image regions.”
arxiv: http://arxiv.org/abs/1412.2306
slides: http://www.cs.toronto.edu/~vendrov/DeepVisualSemanticAlignments_Class_Presentation.pdf

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

intro: “combines the functionalities of the CNN and RNN by introducing a new multimodal layer, after the embedding and recurrent layers of the RNN.”
arxiv: http://arxiv.org/abs/1412.6632
homepage: http://www.stat.ucla.edu/~junhua.mao/m-RNN.html
github: https://github.com/mjhucla/mRNN-CR

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

arXiv: http://arxiv.org/abs/1502.03044
github: https://github.com/kelvinxu/arctic-captions

Describing Videos by Exploiting Temporal Structure

arXiv: http://arxiv.org/abs/1502.08029
github: https://github.com/yaoli/arctic-capgen-vid

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

arxiv: http://arxiv.org/abs/1504.06692
homepage: http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
github: https://github.com/mjhucla/NVC-Dataset

Learning FRAME Models Using CNN Filters for Knowledge Visualization(CVPR 2015)

arXiv: http://arxiv.org/abs/1509.08379
project page: http://www.stat.ucla.edu/~yang.lu/project/deepFrame/main.html
code+data: http://www.stat.ucla.edu/~yang.lu/project/deepFrame/doc/deepFRAME_1.1.zip

Generating Images from Captions with Attention

arXiv: http://arxiv.org/abs/1511.02793
github: https://github.com/emansim/text2image
demo: http://www.cs.toronto.edu/~emansim/cap2im.html

Order-Embeddings of Images and Language

arXiv: http://arxiv.org/abs/1511.06361
github: https://github.com/ivendrov/order-embedding

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

arXiv: http://arxiv.org/abs/1511.07571

Expressing an Image Stream with a Sequence of Natural Sentences (CRCN. NIPS 2015)

Image Captioning with Deep Bidirectional LSTMs

arxiv: http://arxiv.org/abs/1604.00790

Video Captioning

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

arxiv: http://arxiv.org/abs/1412.4729
project page: https://www.cs.utexas.edu/~vsub/naacl15_project.html

Sequence to Sequence – Video to Text(S2VT. ICCV 2015)

arXiv: http://arxiv.org/abs/1505.00487
project: http://vsubhashini.github.io/s2vt.html
github: https://github.com/vsubhashini/caffe/tree/recurrent/examples/s2vt
github: https://github.com/jazzsaxmafia/video_to_sequence

Jointly Modeling Embedding and Translation to Bridge Video and Language

arXiv: http://arxiv.org/abs/1505.01861

Tools

CaptionBot (Microsoft)

website: https://www.captionbot.ai/

About me

Hi world~

Recent Posts

Links