news-please : a Generic News Crawler and Extractor

Open Access Journal Article

Felix Hamborg, +3 more

- 01 Jan 2017 -

Ingénierie Des Systèmes D'information

- pp 218-223

TL;DR: The paper presents news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites.

Abstract:

The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large-scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites. Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
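
To make the list of extracted elements concrete, the following is a minimal sketch of metadata extraction from a single article page. It is not the news-please implementation and does not use its API: it assumes, purely for illustration, that the page exposes Open Graph style `<meta>` tags, and the names `FIELD_TAGS`, `MetaCollector`, `extract_article_fields`, and `SAMPLE_HTML` are hypothetical. Main content extraction is left out because it cannot be read from meta tags.

```python
from html.parser import HTMLParser

# Hypothetical mapping from article fields to commonly used meta-tag keys.
# Real pages vary widely; this mapping is an assumption for the sketch.
FIELD_TAGS = {
    "title": "og:title",
    "lead_paragraph": "og:description",
    "publication_date": "article:published_time",
    "author": "author",
    "main_image": "og:image",
}


class MetaCollector(HTMLParser):
    """Collects the content of <meta> tags, keyed by property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first occurrence of each key.
            self.meta.setdefault(key, content)


def extract_article_fields(html):
    """Return a dict of article fields; fields that are not found are None."""
    collector = MetaCollector()
    collector.feed(html)
    return {field: collector.meta.get(tag) for field, tag in FIELD_TAGS.items()}


# Invented sample page, used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:description" content="A short lead paragraph.">
<meta property="article:published_time" content="2017-01-01T00:00:00Z">
<meta name="author" content="Jane Doe">
<meta property="og:image" content="https://example.com/img.jpg">
</head><body><p>Body text.</p></body></html>
"""

fields = extract_article_fields(SAMPLE_HTML)
for name, value in fields.items():
    print(f"{name}: {value}")
```

A generic extractor has to cope with pages that lack such tags, which is why returning `None` for missing fields is the safer default in a sketch like this.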

Citations

Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019 -

arXiv: Computation and Language

TL;DR: BERT is found to have been significantly undertrained; it can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

Journal Article

Automated identification of media bias in news articles : an interdisciplinary literature review

Felix Hamborg, +2 more

- 01 Dec 2019 -

International Journal on Digital Librari...

TL;DR: It is suggested that suitable, automated methods from computer science, primarily in the realm of natural language processing, are already available for each of the discussed forms of media bias, opening multiple directions for promising further research in computer science in this area.

Journal Article

CoVerifi: A COVID-19 News Verification System

Nikhil L. Kolluri, +1 more

- 23 Jan 2021 -

Online Social Networks and Media

TL;DR: This study seeks to make a timely intervention to the information landscape through a COVID-19 “fake news”, misinformation, and disinformation website and introduces CoVerifi, a web application which combines both the power of machine learning and human feedback to assess the credibility of news.

Book Chapter

Giveme5W: Main Event Retrieval from News Articles by Extraction of the Five Journalistic W Questions

Felix Hamborg, +4 more

TL;DR: Giveme5W is the first open-source, syntax-based 5W extraction system for news articles, which retrieves an article’s main event by extracting phrases that answer the journalistic 5Ws.

Proceedings Article

Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

Adrien Barbaresi

TL;DR: The software extracts main text, comments, and metadata, provides building blocks for web crawling tasks, and performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
