Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Overview

Journal IEEE Trans Pattern Anal Mach Intell

Specialties Biomedical Engineering
Medical Informatics

Date 2016 Sep 9

PMID 27608449

Citations 228

Authors

Jeff Donahue

Lisa Anne Hendricks

Marcus Rohrbach

Subhashini Venugopalan

Sergio Guadarrama

Kate Saenko

Trevor Darrell

Abstract

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.
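The architecture the abstract describes, per-frame convolutional features feeding a recurrent sequence model, can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the toy single-layer convolution stands in for a full convolutional network, the choice of an LSTM cell as the recurrent unit and the averaging of per-frame class probabilities are assumptions not stated in the abstract, and every name here (`conv_features`, `lstm_step`, `lrcn_forward`, `init_params`) is hypothetical. Training by backpropagation, which the abstract emphasizes, is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_features(frame, filters):
    """Toy stand-in for a CNN: valid 2D correlation, ReLU, global average pool."""
    k = filters.shape[1]
    # windows has shape (H - k + 1, W - k + 1, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(frame, (k, k))
    responses = np.einsum("ijab,fab->fij", windows, filters)
    return np.maximum(responses, 0.0).mean(axis=(1, 2))  # shape (num_filters,)

def lstm_step(x, h, c, W, b):
    """One LSTM update; W maps [x, h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # nonlinear state update
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def lrcn_forward(video, params):
    """Map a variable-length video (T, H, W) to one class distribution."""
    hidden = params["b"].size // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    probs = []
    for frame in video:
        x = conv_features(frame, params["filters"])          # spatial depth
        h, c = lstm_step(x, h, c, params["W"], params["b"])  # temporal depth
        logits = params["V"] @ h
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    # Assumption: average the per-frame predictions to label the whole clip.
    return np.mean(probs, axis=0)

def init_params(num_filters=4, k=3, hidden=8, num_classes=5, seed=0):
    """Random, untrained parameters; sizes are illustrative only."""
    rng = np.random.default_rng(seed)
    return {
        "filters": rng.normal(size=(num_filters, k, k)) * 0.1,
        "W": rng.normal(size=(4 * hidden, num_filters + hidden)) * 0.1,
        "b": np.zeros(4 * hidden),
        "V": rng.normal(size=(num_classes, hidden)) * 0.1,
    }
```

Because the same weights are applied at every time step, `lrcn_forward` accepts clips of any length, for example `lrcn_forward(np.zeros((12, 16, 16)), init_params())`.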

Citing Articles

A deep learning framework for automated and generalized synaptic event analysis.

O'Neill P, Baccino-Calace M, Rupprecht P, Lee S, Hao Y, Lin M Elife. 2025; 13.

PMID: 40042890 PMC: 11882139. DOI: 10.7554/eLife.98485.

Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition.

Chen D, Chen M, Wu P, Wu M, Zhang T, Li C Sci Rep. 2025; 15(1):4982.

PMID: 39929951 PMC: 11811230. DOI: 10.1038/s41598-025-87752-8.

A New Deep Learning-Based Method for Automated Identification of Thoracic Lymph Node Stations in Endobronchial Ultrasound (EBUS): A Proof-of-Concept Study.

Ervik O, Rodde M, Hofstad E, Tveten I, Lango T, Leira H J Imaging. 2025; 11(1).

PMID: 39852323 PMC: 11766424. DOI: 10.3390/jimaging11010010.

Detection of Rat Pain-Related Grooming Behaviors Using Multistream Recurrent Convolutional Networks on Day-Long Video Recordings.

Lee C, Lui P, Gao W, Gao Z Bioengineering (Basel). 2025; 11(12).

PMID: 39767998 PMC: 11673758. DOI: 10.3390/bioengineering11121180.

AI-Driven Electrical Fast Transient Suppression for Enhanced Electromagnetic Interference Immunity in Inductive Smart Proximity Sensors.

Giangaspero S, Nicchiotti G, Venier P, Genilloud L, Pirrami L Sensors (Basel). 2024; 24(22).

PMID: 39599148 PMC: 11598082. DOI: 10.3390/s24227372.