scispace - formally typeset
Search or ask a question
Author

Ming-Wei Chang

Bio: Ming-Wei Chang is an academic researcher from Google. The author has contributed to research in topics: Question answering & Parsing. The author has an hindex of 41, co-authored 98 publications receiving 36404 citations. Previous affiliations of Ming-Wei Chang include Microsoft & National Taiwan University.


Papers
More filters
Journal ArticleDOI
TL;DR: Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection and the differentiability of leave-one-out bounds is discussed.
Abstract: Minimizing bounds of leave-one-out errors is an important and efficient approach for support vector machine (SVM) model selection. Past research focuses on their use for classification but not regression. In this letter, we derive various leave-one-out bounds for support vector regression (SVR) and discuss the difference from those for classification. Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection. We also discuss the differentiability of leave-one-out bounds.

112 citations

Proceedings ArticleDOI
18 May 2015
TL;DR: A new QA system that mines answers directly from the Web, and meanwhile employs KBs as a significant auxiliary to further boost the QA performance, and makes the first attempt to link answer candidates to entities in Freebase, during answer candidate generation.
Abstract: Most recent question answering (QA) systems query large-scale knowledge bases (KBs) to answer a question, after parsing and transforming natural language questions to KBs-executable forms (eg, logical forms) As a well-known fact, KBs are far from complete, so that information required to answer questions may not always exist in KBs In this paper, we develop a new QA system that mines answers directly from the Web, and meanwhile employs KBs as a significant auxiliary to further boost the QA performance Specifically, to the best of our knowledge, we make the first attempt to link answer candidates to entities in Freebase, during answer candidate generation Several remarkable advantages follow: (1) Redundancy among answer candidates is automatically reduced (2) The types of an answer candidate can be effortlessly determined by those of its corresponding entity in Freebase (3) Capitalizing on the rich information about entities in Freebase, we can develop semantic features for each answer candidate after linking them to Freebase Particularly, we construct answer-type related features with two novel probabilistic models, which directly evaluate the appropriateness of an answer candidate's types under a given question Overall, such semantic features turn out to play significant roles in determining the true answers from the large answer candidate pool The experimental results show that across two testing datasets, our QA system achieves an 18%~54% improvement under F_1 metric, compared with various existing QA systems

108 citations

Posted Content
TL;DR: NQG-T5 is proposed, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model that outperforms existing approaches across several compositional generalization challenges on non-synthetic data, while also being competitive with the state of theart on standard evaluations.
Abstract: Sequence-to-sequence models excel at handling natural language variation, but have been shown to struggle with out-of-distribution compositional generalization. This has motivated new specialized architectures with stronger compositional biases, but most of these approaches have only been evaluated on synthetically-generated datasets, which are not representative of natural language variation. In this work we ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization? To better assess this capability, we propose new train and test splits of non-synthetic datasets. We demonstrate that strong existing approaches do not perform well across a broad set of evaluations. We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model. It outperforms existing approaches across several compositional generalization challenges on non-synthetic data, while also being competitive with the state-of-the-art on standard evaluations. While still far from solving this problem, our study highlights the importance of diverse evaluations and the open challenge of handling both compositional generalization and natural language variation in semantic parsing.

103 citations

Proceedings ArticleDOI
03 Jul 2014
TL;DR: It is shown how the pooling technique is adapted to address the difficulties of gathering annotations for the entity linking task, and how the task definition, issues encountered during annotation, and detailed analysis of all the participating systems are provided.
Abstract: In this paper we overview the 2014 Entity Recognition and Disambiguation Challenge (ERD'14), which took place from March to June 2014 and was summarized in a dedicated workshop at SIGIR 2014. The main goal of the ERD challenge was to promote research in recognition and disambiguation of entities in unstructured text. Unlike many past entity linking challenges, no mention segmentations were given to the participating systems for a given document. Participants were asked to implement a web service for their system to minimize human involvement during evaluation and to enable measuring the processing times. The challenge has attracted a lot of interest (over 100 teams registered, and 27 of those submitted final results).In this paper we cover the task definition, issues encountered during annotation, and provide a detailed analysis of all the participating systems. Specifically, we show how we adapted the pooling technique to address the difficulties of gathering annotations for the entity linking task. We also summarize the ERD workshop that followed the challenge, including the oral and poster presentations as well as the invited talks.

95 citations

Proceedings ArticleDOI
13 May 2013
TL;DR: A semi-supervised clustering model is proposed based on the latent structural SVM framework, and a set of effective automatic annotation rules are proposed as weak supervision to release the burden of manual annotation in identifying cross-session search tasks.
Abstract: Search tasks, comprising a series of search queries serving the same information need, have recently been recognized as an accurate atomic unit for modeling user search intent. Most prior research in this area has focused on short-term search tasks within a single search session, and heavily depend on human annotations for supervised classification model learning. In this work, we target the identification of long-term, or cross-session, search tasks (transcending session boundaries) by investigating inter-query dependencies learned from users' searching behaviors. A semi-supervised clustering model is proposed based on the latent structural SVM framework, and a set of effective automatic annotation rules are proposed as weak supervision to release the burden of manual annotation. Experimental results based on a large-scale search log collected from Bing.com confirms the effectiveness of the proposed model in identifying cross-session search tasks and the utility of the introduced weak supervision signals. Our learned model enables a more comprehensive understanding of users' search behaviors via search logs and facilitates the development of dedicated search-engine support for long-term tasks.

93 citations


Cited by
More filters
Posted Content
TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

13,994 citations

Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

12,690 citations

Journal ArticleDOI
15 Jul 2021-Nature
TL;DR: For example, AlphaFold as mentioned in this paper predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture. But the accuracy is limited by the fact that no homologous structure is available.
Abstract: Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.

10,601 citations

Christopher M. Bishop1
01 Jan 2006
TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models and combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Proceedings Article
28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

10,132 citations