crminer 0.4.0

crm_pdf() and crm_text() lose the cache parameter, which toggled whether or not to use caching. Those functions now always cache requests (#37)
crm_extract() gains parameter try_ocr (logical, default: FALSE) to optionally try Optical Character Recognition (OCR) to extract text from the pdf pages if the pdf consists of scanned images. Extraction can take a while, but the result is cached, so subsequent requests for the same article will be very fast (#37)
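
A minimal usage sketch of the try_ocr parameter, assuming crminer is installed and that crm_extract() accepts a local file through a path argument. The file name article.pdf is hypothetical, and the call is skipped when the package or the file is missing:

```r
# Hypothetical local file; replace with the path to a scanned pdf.
pdf_path <- "article.pdf"

if (requireNamespace("crminer", quietly = TRUE) && file.exists(pdf_path)) {
  # try_ocr = TRUE asks crm_extract() to attempt OCR on scanned pages.
  # The first call can be slow; per the note above, the result is cached.
  res <- crminer::crm_extract(path = pdf_path, try_ocr = TRUE)
  str(res, max.level = 1)
} else {
  message("crminer or the example pdf is not available; skipping")
}
```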

crm_plain(), crm_xml(), crm_html(), and crm_text() now cache articles, as crm_pdf() has for a while. Along with this change, caching is now split into separate folders for pdf, txt (for plain), xml, and html (#17)
internally force Pensoft publisher urls to https from http (#48)
added docs section User-agent to crm_html(), crm_pdf(), crm_plain(), crm_xml(), and crm_text() detailing how users can set a user agent string with the useragent curl option (#41) (#42)
fix a link in the README (#47) thanks @salim-b

for Wiley articles, replace the pdf part of the url with pdfdirect for better access (#40)
extracted out an internal function try_extract_pdf_errors(), initially for Wiley specific errors, that attempts to extract the various errors that occur when trying to download and extract text from pdfs (#40)
eLife specific url fix in crm_links(), older url was leading to article landing pages (#6)
fix for cases in which Elsevier returns just the first page of a pdf instead of the whole article. we show the user a warning when this occurs and delete the 1 page pdf file (#43)
fix for unusual article urls that end not in a pdf file extension, but just in the string ‘pdf’ following some other part of the url (#44)
added special handling for malformed pdfs in crm_pdf()/crm_text() (with type="pdf") - arose from a Cambridge publisher article, hopefully will handle all malformed pdfs (#45)
change crm_links() to always include a pdf link even if none is returned by Crossref - there is almost always a pdf for every article, but the link may not have been included in the metadata sent to Crossref (#37)
various fixes for Elsevier: A) fix for url parsing, was removing text after ? (as they were all likely query params that we didn’t need), but Elsevier gives the content type as a query param. B) some dois that are listed as having a non-Elsevier owner are actually owned by Elsevier now; special handling for those dois. C) (#37)
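
The url fixes above can be sketched in base R. This is an illustration only, not crminer's internal code: the helper names, the example url shapes, and the host check used for Elsevier are all assumptions.

```r
# Hypothetical helpers; names and url patterns are illustrative assumptions.

# Wiley: swap the "pdf" path segment for "pdfdirect".
wiley_pdfdirect <- function(url) {
  sub("/doi/pdf/", "/doi/pdfdirect/", url, fixed = TRUE)
}

# Treat a url as a pdf link if its path ends in "pdf", with or without
# a preceding dot (covers both ".../file.pdf" and ".../123/pdf").
looks_like_pdf <- function(url) {
  path <- sub("[?#].*$", "", url)  # drop query string and fragment
  grepl("pdf$", path, ignore.case = TRUE)
}

# Strip query parameters, except for Elsevier urls, where the content
# type is carried in the query string and must be kept.
strip_query <- function(url) {
  is_elsevier <- grepl("elsevier\\.com", url)
  ifelse(is_elsevier, url, sub("\\?.*$", "", url))
}

wiley_pdfdirect("https://onlinelibrary.wiley.com/doi/pdf/10.1002/abc.123")
looks_like_pdf("https://example.org/article/123/pdf")
strip_query("https://api.elsevier.com/content/article/PII:S1?httpAccept=text/xml")
```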

crminer 0.3.2

fix for crm_links(): fix full text links from Elsevier that have httpss instead of https (#30) thanks @njahn82
fix for crm_links(): the function wasn’t using the email header for the Crossref polite pool - now it does if you provide your email address, see docs (#31)
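
As an illustration of the httpss fix, a malformed scheme can be repaired with a single substitution. This is a sketch rather than the package's actual code; fix_scheme() is a hypothetical helper and the urls are made up:

```r
# Hypothetical helper: repair a doubled "s" in the url scheme.
fix_scheme <- function(url) {
  sub("^httpss://", "https://", url)
}

links <- c(
  "httpss://api.elsevier.com/content/article/PII:X",  # malformed scheme
  "https://example.org/article.pdf"                   # already correct
)
fix_scheme(links)
```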

crm_cache$cache_path_set() gains the ability to set the full cache path directly via its full_path parameter, made possible by an update to package hoardr (#27)
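
A minimal usage sketch, assuming crminer is installed. The directory name is arbitrary, and cache_path_get() is assumed to be available on crm_cache through hoardr:

```r
# Arbitrary example location; any writable directory should work.
cache_dir <- file.path(tempdir(), "crminer-cache")

if (requireNamespace("crminer", quietly = TRUE)) {
  # Set the complete cache path directly with full_path.
  crminer::crm_cache$cache_path_set(full_path = cache_dir)
  # Inspect the path currently in use.
  print(crminer::crm_cache$cache_path_get())
} else {
  message("crminer is not installed; skipping")
}
```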

add raw as another parameter in crm_extract() to allow raw byte extraction from a pdf (#24)
add intended application (from Crossref) to the output of crm_links() to allow filtering on the intended application (#28)

Fixed failing tests due to Crossref changing what they give back for links - made tests robust to those changes (#21)

New object crm_cache for managing cached files, see ?crm_cache after installation (#19)

Now using hoardr for managing cached files (#19)
crm_pdf() and crm_text() lose the parameter path - instead, the cache directory is now managed through crm_cache