The newscatcheR package provides a dataset of news sites and their RSS feeds, together with some characteristics of each website such as its topic, country, or language, and a few simple functions to explore the dataset and read the feeds from R, conveniently returned as tibbles.
Two functions, which work as wrappers around tidyRSS, fetch the feed of a given website. Two additional functions can be used to conveniently browse the websites dataset.
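To follow along, load the package and glance at the bundled dataset (a minimal sketch; package_rss is the feeds table the functions below take as their rss_table argument):

library(newscatcheR)

# a quick look at the bundled dataset of news sites and their feeds
head(package_rss)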
The first function, get_news(), returns a tibble of the RSS feed of a given site.
# adding a small time delay to avoid sending simultaneous requests to the server
Sys.sleep(3)
get_news(website = "ycombinator.com", rss_table = package_rss)
#> GET request successful. Parsing...
#> Warning: Predicate functions must be wrapped in `where()`.
#>
#> # Bad
#> data %>% select(is.character)
#>
#> # Good
#> data %>% select(where(is.character))
#>
#> ℹ Please update your code.
#> This message is displayed once per session.
#> # A tibble: 30 x 10
#> feed_title feed_link feed_description feed_pub_date item_title
#> <chr> <chr> <chr> <dttm> <chr>
#> 1 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Just Too …
#> 2 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Do not re…
#> 3 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Tour of R…
#> 4 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Hosting y…
#> 5 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Building …
#> 6 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 An F-22 t…
#> 7 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 The Polym…
#> 8 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Show HN: …
#> 9 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 MariaDB T…
#> 10 Hacker Ne… https://… Links for the i… 2020-07-12 06:18:12 Ruby lib/…
#> # … with 20 more rows, and 5 more variables: item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_category <list>,
#> # item_comments <chr>
The second function, get_headlines(), is a helper that returns a tibble of just the headlines instead of the full RSS feed.
# adding a small time delay to avoid sending simultaneous requests to the server
Sys.sleep(3)
get_headlines(website = "ycombinator.com", rss_table = package_rss)
#> GET request successful. Parsing...
#> feed_entries$item_title
#> 1 Just Too Efficient
#> 2 Do not remain nameless to yourself (1966)
#> 3 Tour of Rust
#> 4 Hosting your entire web application using S3 and CloudFront
#> 5 Building a self-updating profile README for GitHub
#> 6 An F-22 test pilot on the Raptor's flight control system
#> 7 The Polymath Playbook
#> 8 Show HN: Trail Router – generate running routes that prefer greenery and nature
#> 9 MariaDB Temporal Data Tables
#> 10 Ruby lib/irb/easter-egg.rb
#> 11 Venice test brings up floodgates for first time
#> 12 Scientists say you can cancel the noise but keep your window open
#> 13 The illusion of control, and how to give it up
#> 14 Reflections on Trusting Trust (1984) [pdf]
#> 15 Linux kernel in-tree Rust support
#> 16 PG: The biggest source of stress for me at YC was running HN
#> 17 Build a No-Slot MIDI Interface on the Apple ][ Game I/O Socket
#> 18 Epigrams on Programming (1982)
#> 19 How to Understand Things
#> 20 Make Your Own ColecoVision at Home
#> 21 Ask HN: What's the worst piece of software you use everyday?
#> 22 Show HN: HN Demetricator – An extension that removes upvote and comment counts
#> 23 Soup.io Will Be Discontinued
#> 24 Tracking Pico Balloons Using Ham Radio [pdf]
#> 25 Porting a Wolfenstein-type engine to the MEGA65
#> 26 Linus Torvalds: “I Hope AVX512 Dies a Painful Death”
#> 27 CRDTs: The Hard Parts [video]
#> 28 Migrating Away from Google Analytics
#> 29 How much your computer can do in a second (2015)
#> 30 Announcing The Zig Software Foundation
Because some websites have multiple feeds divided by topic, describe_url(website) can be helpful to see which topics are available for a given website.
describe_url("bbc.com")
#> Topics available for website bbc.com are: business, news, science, travel.
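Once the topics are known, one way to reach a topic-specific feed is to filter the bundled dataset and parse the matching URL with tidyRSS, the package that get_news() wraps. This is only a sketch: it assumes package_rss carries the same columns that filter_urls() returns below (clean_url, topic_unified, rss_url).

library(dplyr)

# pull the science feed URL for bbc.com from the dataset (assumed columns)
bbc_science_url <- package_rss %>%
  filter(clean_url == "bbc.com", topic_unified == "science") %>%
  pull(rss_url)

# parse the feed directly with tidyRSS
tidyRSS::tidyfeed(bbc_science_url[1])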
Finally, filter_urls(topic, country, language) can be used to browse the dataset by topic, country, or language.
filter_urls(topic = "tech", country = "IT", language = "it")
#> # A tibble: 5 x 7
#> clean_url language topic_unified main clean_country rss_url GlobalRank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 repubblic… it tech None IT http://www.r… 1086
#> 2 lastampa.… it tech None IT http://www.l… 2413
#> 3 ilsole24o… it tech None IT http://nova.… 2681
#> 4 corriere.… it tech None IT http://www.c… 1328
#> 5 ansa.it it tech None IT http://www.a… 2248
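The filtered table can be fed straight back into get_news(). This is a sketch under the assumption that get_news() accepts any website listed in the dataset, just like the ycombinator.com example above; the Sys.sleep() delay mirrors the earlier examples:

italian_tech <- filter_urls(topic = "tech", country = "IT", language = "it")

# fetch each site's feed, pausing between requests
italian_feeds <- lapply(italian_tech$clean_url, function(site) {
  Sys.sleep(3)
  get_news(site)
})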
This package comes in handy when you need to fetch news from various websites for further analysis and don’t want to manually search for the URL of each RSS feed.
Assuming we have the news sites we want to follow:
c("bbc.com", "spiegel.de", "washingtonpost.com") sites =
We can get a list of data frames with:
lapply(sites, get_news)
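If a single table is more convenient than a list, the per-site feeds can then be stacked, for example with dplyr (a sketch; it assumes the feeds parse into tibbles with compatible columns, as in the output shown earlier):

news_list <- lapply(sites, get_news)

# stack the per-site feeds into one tibble for further analysis
all_news <- dplyr::bind_rows(news_list)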