Splash has a ton of features and splashr exposes many of them. The render_ functions and DSL can return everything from simple, tiny JSON data to huge, nested list structures of complex objects.
Furthermore, web content mining can be tricky. Modern sites can present information in different ways depending on the type of browser or device you use, and many won’t serve pages to “generic” browsers.
Finally, the Dockerized Splash server containers make it really easy to get started, but you may prefer an R console over the system command line.
Let’s see what extra goodies splashr provides to make our lives easier.
splashr Objects

One of the most powerful functions in splashr is render_har(). You get every component loaded by a dynamic web page, and some sites have upwards of 100 elements for any given page. How can you get to the bits that you want? We’ll use render_har() to demonstrate how to find the resources a site loads and use the data we gather to assess how “safe” these sites are — i.e. how many third-party javascript components they load and how safely they are loaded. Note that the code in this vignette assumes a Splash instance is running locally on your system.
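Before running anything, it’s worth confirming that the instance is reachable. Here’s a quick sanity-check sketch using splashr’s splash_active() helper:

library(splashr)

# Should return TRUE if the default local Splash instance is up and
# responding; if not, start one before running the chunks below.
splash_active()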
We’ll check https://apple.com/ first since Apple claims to care about our privacy. If that’s true, they’ll load little or no third-party content.
(apple <- render_har(url = "https://apple.com/", response_body = TRUE))
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 3.3.1
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 602.1
## --------HAR PAGES--------
## Page id: 1 , Page title: Apple
## --------HAR ENTRIES--------
## Number of entries: 84
## REQUESTS:
## Page: 1
## Number of entries: 84
## - https://apple.com/
## - https://www.apple.com/
## - https://www.apple.com/ac/globalnav/4/en_US/styles/ac-globalnav.built.css
## - https://www.apple.com/ac/localnav/4/styles/ac-localnav.built.css
## - https://www.apple.com/ac/globalfooter/4/en_US/styles/ac-globalfooter.built.css
## ........
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xr/iphone_xr_5e40f_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
The HAR output shows that when you visit apple.com your browser makes at least 84 requests for resources. We can see what types of content are loaded:
har_entries(apple) %>%
  purrr::map_chr(get_content_type) %>%
  table(dnn = "content_type") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 9 x 2
## content_type n
## <chr> <int>
## 1 font/woff2 27
## 2 application/x-javascript 15
## 3 image/svg+xml 10
## 4 text/css 9
## 5 image/jpeg 7
## 6 image/png 6
## 7 application/font-woff 4
## 8 text/html 3
## 9 application/json 2
Lots of calls to fonts, 15 javascript files and even 2 JSON files. Let’s see what the domains are for these resources:
har_entries(apple) %>%
  purrr::map_chr(get_response_url) %>%
  purrr::map_chr(urltools::domain) %>%
  unique()
## [1] "apple.com" "www.apple.com" "securemetrics.apple.com"
Wow! Only calls to Apple-controlled resources.
I wonder what’s in those JSON files, though:
har_entries(apple) %>%
  purrr::keep(is_json) %>%
  purrr::map(get_response_body, "text") %>%
  purrr::map(jsonlite::fromJSON) %>%
  str(3)
## List of 2
## $ :List of 2
## ..$ locale :List of 3
## .. ..$ country : chr "us"
## .. ..$ attr : chr "en-US"
## .. ..$ textDirection: chr "ltr"
## ..$ localeswitcher:List of 7
## .. ..$ name : chr "localeswitcher"
## .. ..$ metadata : Named list()
## .. ..$ displayIndex: int 1
## .. ..$ copy :List of 5
## .. ..$ continue :List of 5
## .. ..$ exit :List of 5
## .. ..$ select :List of 5
## $ :List of 2
## ..$ id : chr "ad6ca319-1ef1-20da-c4e0-5185088996cb"
## ..$ results:'data.frame': 2 obs. of 2 variables:
## .. ..$ sectionName : chr [1:2] "quickLinks" "suggestions"
## .. ..$ sectionResults:List of 2
So, locale metadata and something to do with on-page links/suggestions.
As demonstrated, the har_entries() function makes it easy to get to the individual elements, and we used the is_json() helper with purrr functions to slice and dice the structure at will. Here are all the is_ functions you can use with HAR objects (there’s a quick example after the list):
is_binary()
is_content_type()
is_css()
is_gif()
is_html()
is_javascript()
is_jpeg()
is_json()
is_plain()
is_png()
is_svg()
is_xhr()
is_xml()
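As that quick example (a minimal sketch reusing the apple HAR from above), we can count how many entries look like javascript; the result should line up with the javascript row in the content-type table earlier:

har_entries(apple) %>%
  purrr::keep(is_javascript) %>%   # keep only the javascript responses
  length()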
You can also use various get_ helpers to avoid gnarly $ or [[]] constructs:

get_body_size() — Retrieve size of content | body | headers
get_content_size() — Retrieve size of content | body | headers
get_content_type() — Retrieve or test content type of a HAR request object
get_headers() — Retrieve response headers as a data frame
get_headers_size() — Retrieve size of content | body | headers
get_request_type() — Retrieve or test request type
get_request_url() — Retrieve request URL
get_response_url() — Retrieve response URL
get_response_body() — Retrieve the body content of a HAR entry

We’ve seen one example of them already; here’s another:
har_entries(apple) %>%
  purrr::map_dbl(get_body_size)
## [1] 0 54521 95644 98069 43183 8689 19035 794210 66487 133730 311054 13850 199928 161859 90322 343189 19035
## [18] 794210 66487 133730 554 802 1002 1160 1694 264 1082 1661 390 416 108468 108828 100064 109728
## [35] 109412 99196 108856 109360 108048 8868 10648 10380 10476 137 311054 13850 3192 3253 4130 2027 1247
## [52] 1748 582 199928 109628 107832 109068 100632 108928 97812 108312 108716 107028 65220 73628 72188 72600 70400
## [69] 73928 72164 73012 71080 1185 161859 90322 343189 0 491 60166 58509 60166 58509 53281 53281
So, a visit to Apple’s page transfers nearly 8MB of content down to your browser.
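If you’d rather total that up than eyeball the vector, a quick sketch:

har_entries(apple) %>%
  purrr::map_dbl(get_body_size) %>%
  sum()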
California also claims to care about your privacy, but is it really true?
ca <- render_har(url = "https://www.ca.gov/", response_body = TRUE)
har_entries(ca) %>%
  purrr::map_chr(~.x$response$url %>% urltools::domain()) %>%
  unique()
## [1] "www.ca.gov" "fonts.googleapis.com" "california.azureedge.net"
## [4] "portal-california.azureedge.net" "az416426.vo.msecnd.net" "fonts.gstatic.com"
## [7] "ssl.google-analytics.com" "cse.google.com" "translate.google.com"
## [10] "api.stateentityprofile.ca.gov" "translate.googleapis.com" "www.google.com"
## [13] "clients1.google.com" "www.gstatic.com" "platform.twitter.com"
## [16] "dc.services.visualstudio.com"
Yikes! It sure doesn’t look that way given all the folks they let track you when you visit their main page. Are they executing javascript from those sites?
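The chunk that produced the next table isn’t shown, but something along these lines (a sketch built from the same helpers used above) pulls out the javascript resources and the domains serving them:

har_entries(ca) %>%
  purrr::keep(is_javascript) %>%
  purrr::map_df(~dplyr::tibble(
    dom  = urltools::domain(get_response_url(.x)),  # domain serving the script
    type = get_content_type(.x)                     # declared content type
  )) %>%
  dplyr::distinct(dom, type)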
## # A tibble: 8 x 2
## dom type
## <chr> <chr>
## 1 california.azureedge.net application/javascript
## 2 california.azureedge.net application/x-javascript
## 3 az416426.vo.msecnd.net application/x-javascript
## 4 cse.google.com text/javascript
## 5 translate.google.com text/javascript
## 6 translate.googleapis.com text/javascript
## 7 www.google.com text/javascript
## 8 platform.twitter.com application/javascript
We can also examine the response headers for signs of safety (i.e. are there content security policy headers or other security-oriented headers?):
har_entries(ca) %>%
  purrr::map_df(get_headers) %>%
  dplyr::count(name, sort = TRUE) %>%
  print(n = 50)
## # A tibble: 42 x 2
## name n
## <chr> <int>
## 1 date 149
## 2 server 148
## 3 content-type 142
## 4 last-modified 126
## 5 etag 104
## 6 content-encoding 83
## 7 access-control-allow-origin 78
## 8 accept-ranges 74
## 9 vary 69
## 10 content-length 66
## 11 x-ms-ref 57
## 12 x-ms-ref-originshield 57
## 13 access-control-expose-headers 56
## 14 content-md5 51
## 15 x-ms-blob-type 51
## 16 x-ms-lease-status 51
## 17 x-ms-request-id 51
## 18 x-ms-version 51
## 19 cache-control 37
## 20 expires 34
## 21 alt-svc 30
## 22 x-xss-protection 29
## 23 x-content-type-options 27
## 24 age 22
## 25 transfer-encoding 20
## 26 timing-allow-origin 14
## 27 x-powered-by 14
## 28 access-control-allow-headers 7
## 29 pragma 6
## 30 request-context 5
## 31 x-aspnet-version 5
## 32 x-frame-options 4
## 33 content-disposition 3
## 34 access-control-max-age 2
## 35 content-language 2
## 36 p3p 2
## 37 x-cache 2
## 38 access-control-allow-methods 1
## 39 location 1
## 40 set-cookie 1
## 41 strict-transport-security 1
## 42 x-ms-session-id 1
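Note that content-security-policy doesn’t appear at all and strict-transport-security shows up exactly once. Here’s a quick sketch that narrows the count down to a hand-picked set of security-oriented header names:

har_entries(ca) %>%
  purrr::map_df(get_headers) %>%
  dplyr::filter(name %in% c(
    "content-security-policy", "strict-transport-security",
    "x-frame-options", "x-xss-protection", "x-content-type-options"
  )) %>%
  dplyr::count(name, sort = TRUE)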
Unfortunately, they do let Google and Twitter execute javascript.
They seem to use quite a bit of Microsoft tech. Let’s look at the HTTP servers they directly and indirectly rely on:
har_entries(ca) %>%
  purrr::map_chr(get_header_val, "server") %>%
  table(dnn = "server") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 14 x 2
## server n
## <chr> <int>
## 1 Apache 55
## 2 Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 50
## 3 sffe 23
## 4 Microsoft-IIS/10.0 7
## 5 ESF 3
## 6 HTTP server (unknown) 2
## 7 ECAcc (bsa/EAD2) 1
## 8 ECD (sjc/16E0) 1
## 9 ECD (sjc/16EA) 1
## 10 ECD (sjc/16F4) 1
## 11 ECD (sjc/4E95) 1
## 12 ECD (sjc/4E9F) 1
## 13 ECS (bsa/EB1F) 1
## 14 gws 1
The various render_ functions present themselves as a modern WebKit browser on Linux (which is what they are!). If you want more control, you need to go to the DSL to don a mask of your choosing. You may want to be precise and Bring Your Own User-agent string, but we’ve defined and exposed a few handy ones for you:
ua_splashr
ua_win10_chrome
ua_win10_firefox
ua_win10_ie11
ua_win7_chrome
ua_win7_firefox
ua_win7_ie11
ua_macos_chrome
ua_macos_safari
ua_linux_chrome
ua_linux_firefox
ua_ios_safari
ua_android_samsung
ua_kindle
ua_ps4
ua_apple_tv
ua_chromecast
NOTE: These can be used with curl, httr, rvest and RCurl calls as well.
We can see it in action:
URL <- "https://httpbin.org/user-agent"
splash_local %>%
  splash_response_body(TRUE) %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go(URL) %>%
  splash_html() %>%
  xml2::xml_text() %>%
  jsonlite::fromJSON()
## $`user-agent`
## [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"
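And, per the NOTE above, the same strings drop straight into other HTTP clients. For example, a quick httr sketch against the same httpbin endpoint:

httr::GET(
  "https://httpbin.org/user-agent",
  httr::user_agent(ua_macos_chrome)   # sets the User-Agent header
) %>%
  httr::content(as = "parsed")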
One more NOTE: It’s good form to say who you really are when scraping. There are times when you have no choice but to wear a mask, but try to use your own user-agent that identifies who you are and what you’re doing.
splashr Docker Interface

Helping you get Docker and the R docker package up and running is beyond the scope of this package. If you do manage to work that out (in my experience, it’s most gnarly on Windows), then we’ve got some helper functions to enable you to manage Splash Docker containers from within R.
install_splash() — Retrieve the Docker image for Splash
start_splash() — Start a Splash server Docker container
stop_splash() — Stop a running Splash server Docker container
killall_splash() — Prune all dead and running Splash Docker containers

The install_splash() function will pull the image locally for you. It takes a bit (the image size is around half a gigabyte at the time of this writing) and you can specify the tag you want if there’s a newer image produced before the package gets updated.
The best way to use start/stop is to capture the container object that start_splash() returns and hand it back to stop_splash() when you’re done, as sketched below:
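# A minimal sketch of the pattern (assumes Docker and the Splash image
# are already set up on your system).

# Start a container and keep the handle that start_splash() returns...
splash_container <- start_splash()

# ...do your render_*()/DSL work here...

# ...then shut the container down when you're finished.
stop_splash(splash_container)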
Now, if you’re like me and totally forget you started Splash Docker containers, you can use the killall_splash() function, which will try to find them, stop/kill them, and remove them from your system. It doesn’t remove the image, just running or stale containers.