TALL: Temporal Activity Localization via Language Query
Citations
446 citations
Cites background from "TALL: Temporal Activity Localizatio..."
...Most recently, moment localization was proposed in (Hendricks et al., 2017; Gao et al., 2017), where the goal is to localize a short moment from a long video sequence given a query description....
[...]
302 citations
Cites background from "TALL: Temporal Activity Localizatio..."
...Anne Hendricks et al. (2017) and Gao et al. (2017) introduce the task of Single Video Moment Retrieval (SVMR), which aims at retrieving a moment from a single video via a natural language query....
[...]
289 citations
Cites background from "TALL: Temporal Activity Localizatio..."
...t the temporal relationship between different actions (“talk” and “bend down”). Localizing natural language queries in video is an important challenge, recently studied in Hendricks et al. (2017) and Gao et al. (2017), with applications in areas such as video search and retrieval. We argue that to... [Footnote: Work done at Adobe during LAH’s summer internship.] [Figure: query “The little girl talks after bending down”, with timeline labels “Talk”, “Bend Down”, “Talk”] Fi...
[...]
...n contrast to the query “the little girl talks before bending down” where the relevant contextual moment occurs just after. A limitation of current moment-localization models (Hendricks et al., 2017; Gao et al., 2017) is that they consider query-independent video context when localizing moments. For example, when determining whether a proposed temporal region matches a natural language query, Gao et al. (2017) consider...
[...]
...rms well both on simple queries without temporal words and more complex queries requiring temporal reasoning. Moreover, our formulation is generic and unifies approaches in Hendricks et al. (2017) and Gao et al. (2017), allowing us to ablate model component choices, as well as which kind of video context is best for localizing moments described with temporal language. Though datasets used for moment localization in...
[...]
...tantly, to represent a proposed video segment, both models consider context features around a moment: Hendricks et al. (2017) uses global context by averaging features over an entire input video, and Gao et al. (2017) incorporates features adjacent to the proposed video segment. We argue that to do proper temporal reasoning, pre-determined, query independent context features may not cover all possible temporal rel...
[...]
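The two query-independent context strategies contrasted in this excerpt can be sketched in plain Python. This is a minimal illustration, not the papers' implementations: `clip_feats` and the helper names are hypothetical stand-ins for the per-clip visual features both models operate on.

```python
# Sketch of two query-independent context strategies over per-clip features
# (toy data; real systems use learned visual features per clip).

def mean_pool(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def global_context(clip_feats):
    # Hendricks et al. (2017)-style: average features over the entire video.
    return mean_pool(clip_feats)

def adjacent_context(clip_feats, start, end, span=1):
    # Gao et al. (2017)-style: pool features from clips adjacent to the
    # proposal [start, end).
    left = clip_feats[max(0, start - span):start]
    right = clip_feats[end:end + span]
    return mean_pool(left + right) if (left or right) else None

clip_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
print(global_context(clip_feats))          # mean over all four clips
print(adjacent_context(clip_feats, 1, 3))  # mean of the clips flanking [1, 3)
```

The excerpt's argument is that both variants fix the context window before seeing the query, so neither can pick the contextual clip a temporal phrase like "after bending down" actually refers to.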
...tructional videos with transcribed text (Kiddon et al., 2015; Huang et al., 2017; Malmaud et al., 2014, 2015). Our work is most related to recent work in video moment retrieval with natural language (Gao et al., 2017; Hendricks et al., 2017). Both works take a natural language query and candidate video segment as input, and output a score for how well the natural language phrase aligns with the video segment. Gao...
[...]
286 citations
Cites background or methods from "TALL: Temporal Activity Localizatio..."
...Cross-modal temporal regression localizer (CTRL) (Gao et al., 2017)...
[...]
...Another weakness is that they exploit the relationships between textual and visual modalities by conducting a simple concatenation (Gao et al., 2017) or measuring a squared distance loss (Hendricks et al., 2017), which ignores the evolving fine-grained video-sentence interactions....
[...]
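The two cross-modal interaction styles this excerpt contrasts reduce to very different operations. A minimal sketch, with hypothetical toy feature vectors in place of the learned representations:

```python
# Sketch of the two cross-modal interaction styles named in the excerpt
# (toy vectors; the real features are learned embeddings).

def concat_fusion(video_feat, text_feat):
    # Gao et al. (2017)-style: concatenate the modalities and leave their
    # interaction to a later fully-connected layer.
    return video_feat + text_feat  # list concatenation

def squared_distance(video_feat, text_feat):
    # Hendricks et al. (2017)-style: score alignment by squared L2 distance
    # in a shared space (lower = better aligned).
    return sum((v - t) ** 2 for v, t in zip(video_feat, text_feat))

v, t = [1.0, 2.0], [1.0, 0.0]
print(concat_fusion(v, t))     # [1.0, 2.0, 1.0, 0.0]
print(squared_distance(v, t))  # 4.0
```

The criticism in the excerpt is that both are one-shot operations on pooled vectors, so neither captures the evolving fine-grained interactions between words and frames.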
...Recently, several related works (Gao et al., 2017; Hendricks et al., 2017) leverage one temporal sliding window approach over video sequences to generate video segment candidates, which are then independently combined (Gao et al., 2017) or compared (Hendricks et al., 2017) with the given sentence…...
[...]
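The temporal sliding-window candidate generation described in this excerpt can be sketched in a few lines. The window lengths and stride below are illustrative choices, not the exact values used in either paper:

```python
# Minimal sliding-window segment proposal: enumerate candidate [start, end)
# windows over a clip timeline, at several scales (illustrative parameters).

def sliding_windows(num_clips, lengths=(2, 4), stride=1):
    proposals = []
    for length in lengths:
        for start in range(0, num_clips - length + 1, stride):
            proposals.append((start, start + length))
    return proposals

print(sliding_windows(5))
# Each proposal is then scored against the query sentence independently.
```

Scoring every window independently is exactly the design choice later work criticizes, since it ignores relations between candidate segments.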
...The methods proposed in (Gao et al., 2017; Hendricks et al., 2017) learn a common embedding space shared by video segment features and sentence representations, in which their similarities are measured....
[...]
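The shared-embedding idea in this excerpt amounts to projecting each modality into a common space and measuring similarity there. A sketch with hypothetical fixed projection matrices (in the papers these are learned):

```python
# Sketch of a common embedding space: modality-specific linear maps into a
# shared space, with similarity measured there (toy matrices, not learned).

def project(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

W_video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d video feature -> 2-d space
W_text = [[0.5, 0.5], [0.0, 1.0]]             # 2-d sentence feature -> 2-d space

video_emb = project(W_video, [2.0, 0.0, 7.0])
text_emb = project(W_text, [4.0, 0.0])
print(cosine(video_emb, text_emb))
```

Training then pushes matching segment/sentence pairs together in this space and mismatched pairs apart.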
267 citations
Cites background from "TALL: Temporal Activity Localizatio..."
...A wide range of tasks based on real-life videos have been designed to test such ability, including text-to-video retrieval [75, 28, 54], video captioning [54, 75, 82], video question answering [74, 23, 33, 34], and video moment retrieval [2, 18, 35]....
[...]
...Popular video-and-language tasks include text-to-video retrieval [75, 28, 54], video captioning [75, 82, 28, 54, 40], video question answering [74, 23, 33], and moment retrieval [2, 18, 35]....
[...]
References
111,197 citations
"TALL: Temporal Activity Localizatio..." refers methods in this paper
...We set the batch size to 64; the networks are optimized with the Adam [12] optimizer on an Nvidia TITAN X GPU....
[...]
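For reference, the Adam update the excerpt cites follows Kingma & Ba's standard formulation. A single-scalar sketch with the usual default hyperparameters (the batch size only affects how the gradient is estimated, not the update rule itself):

```python
# One step of the standard Adam update for a single scalar parameter.

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias correction for step t (1-indexed)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
print(theta)  # the first step moves the parameter by roughly -lr
```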
49,914 citations
27,256 citations
"TALL: Temporal Activity Localizatio..." refers background in this paper
...One of the key elements shared by those successful object detection frameworks [21, 23, 6] is the bounding box regression layer....
[...]
24,012 citations
"TALL: Temporal Activity Localizatio..." refers methods in this paper
...Skip-thought [13] learned a Sent2Vec model by applying skip-gram [19] at the sentence level and achieved top performance in the sentence-based image retrieval task....
[...]
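"Skip-gram at the sentence level" means each sentence is trained to predict its neighbors, just as skip-gram pairs a word with its context words. A data-preparation sketch only (the actual skip-thought model is a learned encoder-decoder):

```python
# Sketch of sentence-level skip-gram pairing: each sentence is paired with its
# previous and next sentences as prediction targets.

def sentence_level_pairs(sentences):
    pairs = []
    for i, s in enumerate(sentences):
        if i > 0:
            pairs.append((s, sentences[i - 1]))  # predict previous sentence
        if i < len(sentences) - 1:
            pairs.append((s, sentences[i + 1]))  # predict next sentence
    return pairs

print(sentence_level_pairs(["s1", "s2", "s3"]))
```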
21,729 citations