Generic extraction of the main text content from HTML files, removing ads, sidebars, and headers with the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of web site templates.
| Version: | 1.3 |
| Imports: | rJava |
| Suggests: | RCurl |
| Published: | 2015-05-11 |
| Author: | See AUTHORS file. boilerpipeR author details |
| Maintainer: | Mario Annau <mario.annau at gmail.com> |
| BugReports: | https://github.com/mannau/boilerpipeR/issues |
| License: | Apache License (== 2.0) |
| URL: | https://github.com/mannau/boilerpipeR |
| NeedsCompilation: | no |
| Materials: | NEWS |
| In views: | NaturalLanguageProcessing, WebTechnologies |
| CRAN checks: | boilerpipeR results |
| Reference manual: | boilerpipeR.pdf |
| Vignettes: | Introduction to the tm.plugin.webmining Package |
| Package source: | boilerpipeR_1.3.tar.gz |
| Windows binaries: | r-devel: boilerpipeR_1.3.zip, r-release: boilerpipeR_1.3.zip, r-oldrel: boilerpipeR_1.3.zip |
| macOS binaries: | r-release: boilerpipeR_1.3.tgz, r-oldrel: boilerpipeR_1.3.tgz |
| Old sources: | boilerpipeR archive |
| Reverse imports: | tm.plugin.webmining |
Please use the canonical form https://CRAN.R-project.org/package=boilerpipeR to link to this page.