Open Access Journal Article

news-please: A Generic News Crawler and Extractor

TLDR
The paper presents news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites.
Abstract
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large-scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites. Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
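
To make the list of extracted elements concrete, the following is a minimal sketch of what such an article record could look like. It is not the news-please implementation or API: the `ArticleRecord` class, the `extract_article` function, and the reliance on Open Graph style `<meta>` tags are all assumptions made for illustration, using only the Python standard library.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container for the elements named in the abstract."""
    title: Optional[str] = None
    lead_paragraph: Optional[str] = None
    main_content: Optional[str] = None
    publication_date: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    main_image: Optional[str] = None


class _MetaParser(HTMLParser):
    """Collects <meta> key/content pairs and the text of <p> elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumption: pages expose metadata via `property` or `name`.
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta.setdefault(key, attrs["content"])
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_article(html: str) -> ArticleRecord:
    """Naive extraction from metadata tags; real pages need far more care."""
    parser = _MetaParser()
    parser.feed(html)
    parser.close()
    m = parser.meta
    author = m.get("author")
    return ArticleRecord(
        title=m.get("og:title"),
        lead_paragraph=m.get("og:description"),
        main_content="\n".join(parser.paragraphs) or None,
        publication_date=m.get("article:published_time"),
        authors=[author] if author else [],
        main_image=m.get("og:image"),
    )
```

A generic extractor has to cope with pages that lack such metadata or embed boilerplate in paragraph tags, which is why this sketch should be read only as an illustration of the output fields, not of the extraction method.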


Citations
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: BERT is found to have been significantly undertrained and able to match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Journal Article

Automated identification of media bias in news articles: an interdisciplinary literature review

TL;DR: It is suggested that suitable, automated methods from computer science, primarily in the realm of natural language processing, are already available for each of the discussed forms of media bias, opening multiple directions for promising further research in computer science in this area.
Journal Article

CoVerifi: A COVID-19 News Verification System

TL;DR: This study seeks to make a timely intervention to the information landscape through a COVID-19 “fake news”, misinformation, and disinformation website and introduces CoVerifi, a web application which combines both the power of machine learning and human feedback to assess the credibility of news.
Book Chapter

Giveme5W: Main Event Retrieval from News Articles by Extraction of the Five Journalistic W Questions

TL;DR: Giveme5W is the first open-source, syntax-based 5W extraction system for news articles, which retrieves an article’s main event by extracting phrases that answer the journalistic 5Ws.
Proceedings Article

Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

TL;DR: The software allows for main text, comments and metadata extraction, while also providing building blocks for web crawling tasks and performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
References
Journal Article

RCV1: A New Benchmark Collection for Text Categorization Research

TL;DR: This work describes the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data.
Proceedings Article

Boilerplate detection using shallow text features

TL;DR: This paper analyzes a small set of shallow text features for classifying the individual text elements in a Web page and derives a simple and plausible stochastic model for describing the boilerplate creation process.
Book Chapter

PNS: A Personalized News Aggregator on the Web

TL;DR: A system that aggregates news from various electronic news publishers and distributors by using source-specific information extraction programs and parsers, organizes them according to pre-defined news categories and constructs personalized views via a Web-based interface.
Journal Article

Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction

TL;DR: This work describes the challenges of, and presents strategies for, programmatically accessing scientometric raw data in scientific Web repositories, and demonstrates the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data.