Open Access Journal Article
news-please : a Generic News Crawler and Extractor
TLDR
News-please is a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites.
Abstract:
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large-scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites. Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
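The six elements named in the abstract can be pictured as one record per article. The following is a minimal sketch only, and it is not the news-please API or its extraction method: the names `ArticleRecord` and `extract_article` are hypothetical, and the sketch assumes a page that exposes its metadata through common `<meta>` tags (`author`, `article:published_time`, `og:image`) and keeps its text in `<p>` elements.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class ArticleRecord:
    """Hypothetical container for the six elements listed in the abstract."""
    title: Optional[str] = None
    lead_paragraph: Optional[str] = None
    main_content: Optional[str] = None
    publication_date: Optional[str] = None
    author: Optional[str] = None
    main_image: Optional[str] = None


class _PageCollector(HTMLParser):
    """Collects <title> text, <meta> name/property values, and <p> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")
        elif tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content"):
                self.meta[key] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1] += data


def extract_article(html: str) -> ArticleRecord:
    """Naive extraction: first paragraph is the lead, all paragraphs the body."""
    parser = _PageCollector()
    parser.feed(html)
    parser.close()
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    title = parser.meta.get("og:title") or "".join(parser.title_parts).strip()
    return ArticleRecord(
        title=title or None,
        lead_paragraph=paragraphs[0] if paragraphs else None,
        main_content="\n".join(paragraphs) or None,
        publication_date=parser.meta.get("article:published_time"),
        author=parser.meta.get("author"),
        main_image=parser.meta.get("og:image"),
    )
```

A real extractor has to cope with boilerplate, missing metadata, and site-specific markup; this sketch only shows the shape of the output for a single, well-behaved page.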
Citations
Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Journal Article
Automated identification of media bias in news articles : an interdisciplinary literature review
TL;DR: It is suggested that suitable, automated methods from computer science, primarily in the realm of natural language processing, are already available for each of the discussed forms of media bias, opening multiple directions for promising further research in computer science in this area.
Journal Article
CoVerifi: A COVID-19 News Verification System
Nikhil L. Kolluri, Dhiraj Murthy
TL;DR: This study seeks to make a timely intervention to the information landscape through a COVID-19 “fake news”, misinformation, and disinformation website and introduces CoVerifi, a web application which combines both the power of machine learning and human feedback to assess the credibility of news.
Book Chapter
Giveme5W: Main Event Retrieval from News Articles by Extraction of the Five Journalistic W Questions
TL;DR: Giveme5W is the first open-source, syntax-based 5W extraction system for news articles, which retrieves an article’s main event by extracting phrases that answer the journalistic 5Ws.
Proceedings Article
Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction
TL;DR: The software supports extraction of main text, comments and metadata, provides building blocks for web crawling tasks, and performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
References
Journal Article
RCV1: A New Benchmark Collection for Text Categorization Research
TL;DR: This work describes the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data.
Proceedings Article
Boilerplate detection using shallow text features
TL;DR: This paper analyzes a small set of shallow text features for classifying the individual text elements in a Web page and derives a simple and plausible stochastic model for describing the boilerplate creation process.
Book Chapter
PNS: A Personalized News Aggregator on the Web
TL;DR: A system that aggregates news from various electronic news publishers and distributors by using source-specific information extraction programs and parsers, organizes them according to pre-defined news categories and constructs personalized views via a Web-based interface.
Journal Article
Scraping Scientific Web Repositories : Challenges and Solutions for Automated Content Extraction
TL;DR: Examines the challenges and presents strategies to programmatically access scientometric raw data in scientific Web repositories, and demonstrates the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data.