Andrej Karpathy

Researcher at Stanford University

Publications: 21
Citations: 55,755

Andrej Karpathy is an academic researcher from Stanford University. He has contributed to research on recurrent neural networks and object detection, has an h-index of 20, and has co-authored 20 publications receiving 41,085 citations. His previous affiliations include the University of British Columbia.

Papers
Proceedings Article

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

TL;DR: This paper proposes a Fully Convolutional Localization Network (FCLN) that addresses the localization and description tasks jointly and can be trained end-to-end in a single round of optimization.
Posted Content

Visualizing and Understanding Recurrent Networks

TL;DR: This work uses character-level language models as an interpretable testbed to provide an analysis of LSTM representations, predictions and error types, and reveals the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets.
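The character-level setup this summary describes is simple to reproduce. Below is a minimal sketch, assuming PyTorch (illustrative, not the paper's code): an untrained single-cell character LSTM whose per-step cell states, the quantity the paper visualizes, can be traced to look for interpretable units.

```python
import torch
import torch.nn as nn

# Hypothetical minimal setup: a character vocabulary and a single LSTM cell.
text = 'The quick "brown" fox (jumped) over the lazy dog.'
vocab = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 32)
cell = nn.LSTMCell(32, 64)

h = torch.zeros(1, 64)
c = torch.zeros(1, 64)
cell_traces = []  # per-step cell state, the quantity the paper visualizes

for ch in text:
    x = embed(torch.tensor([char_to_idx[ch]]))
    h, c = cell(x, (h, c))
    cell_traces.append(c.squeeze(0))

# In a trained model, a unit tracking quotes or brackets would flip sign or
# saturate between the opening and closing character; trace one unit to look:
trace = torch.stack(cell_traces)[:, 0]  # unit 0 across all time steps
print([round(v, 3) for v in trace.tolist()])
```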
Posted Content

Deep Visual-Semantic Alignments for Generating Image Descriptions

TL;DR: In this article, a multimodal recurrent neural network is used to align the two modalities (images and sentences) through a multimodal embedding, and the inferred alignments are used to learn to generate novel descriptions of image regions.
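To make the generation half of this concrete, here is a minimal sketch, assuming PyTorch, of an RNN decoder conditioned on an image feature vector; the class name, dimensions, and use of a vanilla RNN are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    def __init__(self, vocab_size, img_dim=512, hidden=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # image sets the initial state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, tokens):
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)  # (1, B, H)
        x = self.embed(tokens)                                 # (B, T, H)
        y, _ = self.rnn(x, h0)
        return self.out(y)                                     # next-token logits

model = CaptionRNN(vocab_size=1000)
logits = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```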
Journal Article

Grounded Compositional Semantics for Finding and Describing Images with Sentences

TL;DR: The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA, and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Proceedings Article

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

TL;DR: The authors propose a model for bidirectional retrieval of images and sentences through a deep, multimodal embedding of visual and natural language data. Rather than mapping whole images and sentences, it works at a finer level, embedding fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space.
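As an illustration of the common-embedding-space idea behind this line of work, here is a minimal sketch, assuming PyTorch and pre-extracted fragment features; the bidirectional max-margin ranking loss below is a common simplification, not the paper's exact fragment alignment objective.

```python
import torch
import torch.nn.functional as F

def ranking_loss(img, sent, margin=0.1):
    """img, sent: (batch, dim) pre-extracted features; row i is a matched pair."""
    img = F.normalize(img, dim=1)
    sent = F.normalize(sent, dim=1)
    scores = img @ sent.t()           # (batch, batch) cosine similarities
    pos = scores.diag().unsqueeze(1)  # matched-pair scores on the diagonal
    # Hinge in both directions: image-to-sentence and sentence-to-image.
    cost_s = (margin + scores - pos).clamp(min=0)
    cost_i = (margin + scores - pos.t()).clamp(min=0)
    eye = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s[eye] = 0  # do not penalize the matched pairs themselves
    cost_i[eye] = 0
    return cost_s.sum() + cost_i.sum()

loss = ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```

Minimizing this loss pulls matched image-sentence pairs together in the shared space while pushing mismatched pairs at least a margin apart, which is what makes retrieval in either direction possible with a simple nearest-neighbor lookup.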