crminer 0.4.0
NEW FEATURES
- crm_pdf() and crm_text() lose the cache parameter, which toggled whether to use caching. Those functions now always cache requests (#37)
- crm_extract() gains the parameter try_ocr (logical, default: FALSE) to optionally try Optical Character Recognition (OCR) to extract text from pdf pages when the pdf consists of scanned images. Extraction can take a while, but the result is cached, so subsequent requests for the same article will be very fast (#37)
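A minimal sketch of the try_ocr option, not taken from the package docs. The file path is hypothetical, and the file path is assumed to be the first argument of crm_extract(). The call is skipped unless crminer is installed and the file exists.

```r
# Hypothetical path to a scanned pdf; replace with a real file.
pdf_path <- file.path(tempdir(), "scanned-article.pdf")

if (requireNamespace("crminer", quietly = TRUE) && file.exists(pdf_path)) {
  # try_ocr = TRUE asks crm_extract() to attempt OCR when the pdf
  # turns out to be scanned images (assumes the file path is the first argument)
  res <- crminer::crm_extract(pdf_path, try_ocr = TRUE)
  # Inspect the top-level structure of whatever comes back
  str(res, max.level = 1)
}
```

Because the result is cached, a second call for the same article should return much faster than the first.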
MINOR IMPROVEMENTS
- crm_plain(), crm_xml(), crm_html(), and crm_text() now cache articles, as crm_pdf() has for a while. Along with this change, caching is now split into separate folders for pdf, txt (for plain), xml, and html (#17)
- internally force Pensoft publisher urls to https from http (#48)
- added docs section User-agent to crm_html(), crm_pdf(), crm_plain(), crm_xml(), and crm_text() detailing how users can set a user agent string with the useragent curl option (#41) (#42)
- fix a link in the README (#47) thanks @salim-b
BUG FIXES
- for Wiley articles, replace the pdf part of the url with pdfdirect for better access (#40)
- initially for Wiley-specific errors, extracted out the internal function try_extract_pdf_errors() to attempt to extract various errors that occur when trying to download and extract text from pdfs (#40)
- eLife-specific url fix in crm_links(); the older url was leading to article landing pages (#6)
- fix for cases in which Elsevier returns just the first page of a pdf instead of the whole article. we show the user a warning when this occurs and delete the 1 page pdf file (#43)
- fix for unusual article urls that end not in a pdf file extension, but in just the string ‘pdf’ following some other part of the url (#44)
- added special handling for malformed pdfs in crm_pdf()/crm_text() (with type="pdf"); this arose from a Cambridge publisher article and will hopefully handle all malformed pdfs (#45)
- change crm_links() to always include a pdf link even if none is returned by Crossref, as there is almost always a pdf for every article, but the link may just not have been included in the metadata sent to Crossref (#37)
- various fixes for Elsevier: A) fix for url parsing, which was removing text after ? (as it was all likely query params that we didn’t need), but Elsevier gives the content type as a query param. B) some dois that are listed as having a non-Elsevier owner are actually owned by Elsevier now; special handling for those dois. C) (#37)
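Two of the url fixes above can be sketched in base R. The helper names, url shapes, and patterns below are assumptions made for illustration; they are not crminer's internal code.

```r
# Illustrative helpers only; names and patterns are hypothetical.

# Swap the "/pdf/" path segment for "/pdfdirect/" (Wiley-style urls).
use_pdfdirect <- function(url) {
  sub("/pdf/", "/pdfdirect/", url, fixed = TRUE)
}

# Does the url, ignoring any query string or fragment, end in the
# string "pdf"? This matches both ".pdf" and a bare trailing "pdf".
ends_in_pdf <- function(url) {
  grepl("pdf$", sub("[?#].*$", "", url))
}

use_pdfdirect("https://example.org/doi/pdf/10.1234/abcd")
ends_in_pdf("https://example.org/article/1234/pdf")
```

Stripping the query string only for the check, rather than from the url itself, keeps query params such as a content type available for the actual request.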
crminer 0.3.2
MINOR IMPROVEMENTS
- now using vcr for tests that write to disk (#34)
BUG FIXES
- fix for a case where a DOI’s current owner differs from a previous owner (#36)
crminer 0.3.0
MINOR IMPROVEMENTS
- replace all xml2::xml_find_one with xml2::xml_find_first (#32)
BUG FIXES
- fix for crm_links(): fix full text links from Elsevier that have httpss instead of https (#30) thanks @njahn82
- fix for crm_links(): the function wasn’t using the email header for the Crossref polite pool; now it does if you provide your email address, see docs (#31)
crminer 0.2.0
NEW FEATURES
- crm_cache$cache_path_set() gains the ability to set the full cache path directly via its full_path parameter, through an update to the hoardr package (#27)
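A minimal sketch of setting the cache location with full_path. The directory is hypothetical, and cache_path_get() is assumed to be available on crm_cache through hoardr. The calls are skipped if crminer is not installed.

```r
# Hypothetical cache directory for this example.
cache_dir <- file.path(tempdir(), "crminer-cache")

if (requireNamespace("crminer", quietly = TRUE)) {
  # full_path sets the cache location directly
  crminer::crm_cache$cache_path_set(full_path = cache_dir)
  # Report the path now in use
  print(crminer::crm_cache$cache_path_get())
}
```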
MINOR IMPROVEMENTS
- add raw as another parameter in crm_extract() to allow raw byte extraction from a pdf (#24)
- add intended application (from Crossref) to output of crm_links() to allow filtering on the intended application (#28)
crminer 0.1.4
BUG FIXES
- Fixed failing tests due to Crossref changing what they give back for links - made tests robust to those changes (#21)
crminer 0.1.2
NEW FEATURES
- New object crm_cache for managing cached files, see ?crm_cache after installation (#19)
MINOR IMPROVEMENTS
- Now using hoardr for managing cached files (#19)
- crm_pdf() and crm_text() lose the parameter path; instead the cache directory is managed through crm_cache
crminer 0.1.0
NEW FEATURES