Home
/
Authors
/
Annemarie Friedrich

Author

Annemarie Friedrich

Other affiliations: Saarland University, Heidelberg University, University of Stuttgart

Bio: Annemarie Friedrich is an academic researcher from Bosch. The author has contributed to research in topics: Parsing & Dependency grammar. The author has an hindex of 10, co-authored 34 publications receiving 490 citations. Previous affiliations of Annemarie Friedrich include Saarland University & Heidelberg University.

Papers

PDF

Open Access

More filters

Book Chapter•DOI•

Coherent Multi-sentence Video Description with Variable Level of Detail

[...]

Anna Rohrbach¹, Marcus Rohrbach², Marcus Rohrbach¹, Wei Qiu¹, Wei Qiu³, Annemarie Friedrich³, Manfred Pinkal³, Bernt Schiele¹ - Show less +4 more•Institutions (3)

Max Planck Society¹, University of California, Berkeley², Saarland University³

02 Sep 2014

TL;DR: This paper follows a two-step approach where it first learns to predict a semantic representation from video and then generates natural language descriptions from it, and model across-sentence consistency at the level of the SR by enforcing a consistent topic.

...read moreread less

Abstract: Humans can easily describe what they see in a coherent way and at varying level of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions’ level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus of three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than related work.

...read moreread less

244 citations

Book Chapter•DOI•

Coherent Multi-Sentence Video Description with Variable Level of Detail

[...]

Anna Senina, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele - Show less +4 more

24 Mar 2014-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a two-step approach is used to predict a semantic representation from video and then generate natural language descriptions from the semantic representation, enforcing a consistent topic at the level of the representation.

...read moreread less

Abstract: Humans can easily describe what they see in a coherent way and at varying level of detail. However, existing approaches for automatic video description are mainly focused on single sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute both to the visual recognition of objects proposing a hand-centric approach as well as to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus of three levels of detail.

...read moreread less

51 citations

Proceedings Article•DOI•

Automatic prediction of aspectual class of verbs in context

[...]

Annemarie Friedrich¹, Alexis Palmer¹•Institutions (1)

Saarland University¹

01 Jun 2014

TL;DR: A semi-supervised approach using linguistically-motivated features and a novel set of distributional features based on representative verb types allows us to predict classes accurately, even for unseen verbs.

...read moreread less

Abstract: This paper describes a new approach to predicting the aspectual class of verbs in context, i.e., whether a verb is used in a stative or dynamic sense. We identify two challenging cases of this problem: when the verb is unseen in training data, and when the verb is ambiguous for aspectual class. A semi-supervised approach using linguistically-motivated features and a novel set of distributional features based on representative verb types allows us to predict classes accurately, even for unseen verbs. Many frequent verbs can be either stative or dynamic in different contexts, which has not been modeled by previous work; we use contextual features to resolve this ambiguity. In addition, we introduce two new datasets of clauses marked for aspectual class.

...read moreread less

40 citations

Proceedings Article•DOI•

The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

[...]

Annemarie Friedrich¹, Heike Adel¹, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Marusczyk, Lukas Lange¹ - Show less +3 more•Institutions (1)

Bosch¹

04 Jun 2020

TL;DR: An annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions is developed, as well as a SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts.

...read moreread less

Abstract: This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.

...read moreread less

39 citations

Proceedings Article•DOI•

Situation entity types: automatic classification of clause-level aspect.

[...]

Annemarie Friedrich¹, Alexis Palmer², Manfred Pinkal³•Institutions (3)

Heidelberg University¹, University of Cambridge², Saarland University³

01 Aug 2016

TL;DR: It is shown that the sequence labeling approach using distributional information in the form of Brown clusters, as well as syntactic-semantic features targeted to the task, is robust across genres, reaching accuracies of up to 76%.

...read moreread less

Abstract: This paper describes the first robust approach to automatically labeling clauses with their situation entity type (Smith, 2003), capturing aspectual phenomena at the clause level which are relevant for interpreting both semantics at the clause level and discourse structure. Previous work on this task used a small data set from a limited domain, and relied mainly on words as features, an approach which is impractical in larger settings. We provide a new corpus of texts from 13 genres (40,000 clauses) annotated with situation entity types. We show that our sequence labeling approach using distributional information in the form of Brown clusters, as well as syntactic-semantic features targeted to the task, is robust across genres, reaching accuracies of up to 76%.

...read moreread less

37 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Long-term recurrent convolutional networks for visual recognition and description

[...]

Jeff Donahue¹, Lisa Anne Hendricks¹, Sergio Guadarrama¹, Marcus Rohrbach¹, Subhashini Venugopalan², Trevor Darrell¹, Kate Saenko³ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

07 Jun 2015

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

4,206 citations

Posted Content•

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

[...]

Jeff Donahue¹, Lisa Anne Hendricks¹, Marcus Rohrbach¹, Subhashini Venugopalan², Sergio Guadarrama¹, Kate Saenko³, Trevor Darrell¹ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

17 Nov 2014-arXiv: Computer Vision and Pattern Recognition

...read moreread less

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep"' in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

3,935 citations

Proceedings Article•DOI•

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

[...]

Jun Xu¹, Tao Mei¹, Ting Yao¹, Yong Rui¹•Institutions (1)

Microsoft¹

01 Jun 2016

TL;DR: A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Networkbased approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on this dataset.

...read moreread less

Abstract: While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSRVideo to Text") which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Networkbased approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.

...read moreread less

933 citations

Book Chapter•DOI•

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

[...]

Gunnar A. Sigurdsson¹, Gül Varol, Xiaolong Wang¹, Ali Farhadi², Ali Farhadi³, Ivan Laptev, Abhinav Gupta¹, Abhinav Gupta³ - Show less +4 more•Institutions (3)

Carnegie Mellon University¹, University of Washington², Allen Institute for Artificial Intelligence³

08 Oct 2016

TL;DR: This work proposes a novel Hollywood in Homes approach to collect data, collecting a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities, and evaluates and provides baseline results for several tasks including action recognition and automatic description generation.

...read moreread less

Abstract: Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 s, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.

...read moreread less

865 citations

Journal Article•DOI•

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

[...]

Jeff Donahue¹, Lisa Anne Hendricks¹, Marcus Rohrbach¹, Subhashini Venugopalan², Sergio Guadarrama¹, Kate Saenko³, Trevor Darrell¹ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

01 Apr 2017-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: In this article, a class of recurrent convolutional architectures was proposed for large-scale visual understanding tasks, and demonstrated the value of these models for activity recognition, image captioning, and video description.

...read moreread less

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

...read moreread less

812 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

Collapse