Andrej Karpathy
Researcher at Stanford University
Publications - 21
Citations - 55,755
Andrej Karpathy is an academic researcher from Stanford University. The author has contributed to research on topics including recurrent neural networks and object detection. The author has an h-index of 20 and has co-authored 20 publications receiving 41,085 citations. Previous affiliations of Andrej Karpathy include the University of British Columbia.
Papers
Proceedings ArticleDOI
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TL;DR: In this paper, a Fully Convolutional Localization Network (FCLN) is proposed to address the localization and description task jointly, which can be trained end-to-end with a single round of optimization.
Posted Content
Visualizing and Understanding Recurrent Networks
TL;DR: This work uses character-level language models as an interpretable testbed to provide an analysis of LSTM representations, predictions and error types, and reveals the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets.
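The interpretable cells described above track simple long-range state. As a hedged illustration (this is a hand-coded stand-in, not the paper's trained LSTM), the signal such a cell learns for bracket nesting can be written as an explicit counter over characters:

```python
def bracket_depth_trace(text):
    """Return the parenthesis nesting depth after each character --
    the quantity an 'interpretable cell' in the paper is observed
    to track implicitly inside a trained character-level LSTM."""
    depth, trace = 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # clamp at zero for unbalanced input
        trace.append(depth)
    return trace
```

For the string `a(b(c)d)`, the trace rises to 2 inside the inner brackets and returns to 0 at the end, mirroring the saturating cell activations the paper visualizes.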
Posted Content
Deep Visual-Semantic Alignments for Generating Image Descriptions
Andrej Karpathy, Li Fei-Fei +1 more
TL;DR: In this article, a multimodal recurrent neural network (M-RNN) is used to align the two modalities through a multimodal embedding, and the inferred alignments are used to learn to generate novel descriptions of image regions.
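The alignment step above scores an image-sentence pair by letting each word find its best-matching image region in a shared embedding space. A minimal sketch of that scoring rule (using plain dot products on toy vectors; the learned region and word embeddings of the actual model are assumed, not implemented):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def alignment_score(regions, words):
    """Score an image-sentence pair: each word vector aligns to its
    best-matching region vector, and the per-word maxima are summed --
    a simplified form of the paper's alignment objective."""
    return sum(max(dot(w, r) for r in regions) for w in words)
```

During training, the real model maximizes this score for matching pairs relative to mismatched pairs via a ranking loss.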
Journal ArticleDOI
Grounded Compositional Semantics for Finding and Describing Images with Sentences
TL;DR: The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
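The DT-RNN composes a sentence vector by recursing over the sentence's dependency tree rather than its linear word order. A minimal sketch of that recursion on toy embeddings (the learned composition weights and nonlinearity of the real model are omitted; a node simply sums its word embedding with its dependents' vectors):

```python
def compose(tree, embed):
    """Compose a sentence vector over a dependency tree.
    `tree` is (word, [child_trees]); a node's vector is its word
    embedding plus the composed vectors of its dependents -- a
    stripped-down stand-in for the DT-RNN's learned composition."""
    word, children = tree
    vec = list(embed[word])
    for child in children:
        child_vec = compose(child, embed)
        vec = [a + b for a, b in zip(vec, child_vec)]
    return vec
```

Because composition follows the dependency structure, "the dog barks" and "barks the dog" would yield the same vector here, which is part of why tree-based models are more robust to surface word order.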
Proceedings Article
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
TL;DR: The authors proposed a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data, which works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space.
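Bidirectional retrieval with fragment embeddings reduces to scoring fragment-level matches and ranking candidates by that score. A hedged sketch on toy fragment vectors (the object detectors and dependency-relation encoders that produce the real fragments are assumed, not implemented):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def fragment_score(image_frags, sentence_frags):
    """Each sentence fragment aligns to its best-matching image
    fragment in the common space; the maxima are summed."""
    return sum(max(dot(s, i) for i in image_frags) for s in sentence_frags)

def rank_sentences(image_frags, candidate_sentences):
    """Return candidate indices ordered best-first for one image --
    one direction of the paper's bidirectional retrieval task."""
    return sorted(range(len(candidate_sentences)),
                  key=lambda k: -fragment_score(image_frags,
                                                candidate_sentences[k]))
```

Ranking images for a sentence works symmetrically, by scoring one sentence's fragments against each candidate image's fragments.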