With R
Introduction
R is a free software environment for statistical computing and graphics. The reticulate package provides a comprehensive set of tools for interoperability between Python and R. It runs Python code inside an R session, so Python packages can be used with minimal adaptation. This is ideal for those who would rather work from R than switch back and forth between languages and environments.
The package provides several ways to integrate Python code into R projects:
- Python in R Markdown
- Importing Python modules
- Sourcing Python scripts
- An interactive Python console within R
Complete vignette: Calling Python from R.
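The integration routes listed above can be sketched in a few lines. This is a minimal illustration rather than a complete tour: it assumes reticulate has a working Python available, and the temporary script and its `add()` function are made up for the example. Python chunks in R Markdown are not shown here, since they live in the document itself rather than in R code.

```r
library(reticulate)

# Importing a Python module: its functions are reached with R's `$` operator
os <- import("os")
os$getcwd()

# Sourcing a Python script: functions defined in it become R functions
# (the script and add() are hypothetical, written to a temporary file)
script <- tempfile(fileext = ".py")
writeLines(c("def add(x, y):", "    return x + y"), script)
source_python(script)
add(2, 3)

# Running Python code from a string and reading back its variables
res <- py_run_string("greeting = 'hello from Python'")
res$greeting

# Interactive Python console within R (type `exit` to return to R)
# repl_python()
```

Importing a module is usually the most convenient route when a Python package is used like a library, which is the approach taken in the rest of this page.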
This tutorial shows how to import a Python scraper straight from R and use the results directly with the usual R syntax: Web scraping with R: Text and metadata extraction.
Installation
The reticulate package can be easily installed from CRAN as follows:
> install.packages("reticulate")
A recent version of Python 3 is necessary. Some systems already have one installed; to check, run the following command in a terminal window:
$ python3 --version
Python 3.8.6 # version 3.6 or higher is fine
In case Python is not installed, please refer to the excellent [Djangogirls tutorial: Python installation](https://tutorial.djangogirls.org/en/python_installation/).
Trafilatura has to be installed with pip, conda, or py_install(). If reticulate offers to install Miniconda and you do not need it, skip that step; the prompt should only appear once. See also Installing Python Packages.
Here is a simple example using the py_install() function included in reticulate:
> library(reticulate)
> py_install("trafilatura")
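Before installing, it can help to see which Python interpreter reticulate has selected and whether the module is already there. The following is a possible sketch using reticulate's `py_config()` and `py_module_available()`; passing `pip = TRUE` is an assumption about the preferred installer and can be dropped.

```r
library(reticulate)

# Show the Python interpreter and version that reticulate is using
py_config()

# Install only if the module cannot be imported yet
if (!py_module_available("trafilatura")) {
  py_install("trafilatura", pip = TRUE)  # pip = TRUE is an assumption
}
```

If the module is still not found after installation, restarting the R session is often enough for reticulate to pick it up.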
Download and extraction
Text extraction from HTML documents (including downloads) is available in a straightforward way:
# getting started
> install.packages("reticulate")
> library(reticulate)
> trafilatura <- import("trafilatura")
# get an HTML document as a string
> url <- "https://example.org/"
> downloaded <- trafilatura$fetch_url(url)
# extraction
> trafilatura$extract(downloaded)
[1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..."
# extraction with arguments
> trafilatura$extract(downloaded, output_format="xml", url=url)
[1] "<doc sitename=\"example.org\" title=\"Example Domain\" source=\"https://example.org/\" hostname=\"example.org\" categories=\"\" tags=\"\" fingerprint=\"lxZaiIwoxp80+AXA2PtCBnJJDok=\">\n <main>\n <div>\n <head>Example Domain</head>\n <p>This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.</p>\n <p>More information...</p>\n </div>\n </main>\n <comments/>\n</doc>"
For a full list of arguments see extraction documentation.
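Both calls can come back empty: fetch_url() returns None when a download fails and extract() returns None when no text is found, and reticulate converts None to NULL. A small wrapper can make this explicit. The function name `extract_from_url` is hypothetical and not part of either package; this is only one way to handle the missing values.

```r
library(reticulate)

# Hypothetical helper: download a page and extract its text,
# returning NA instead of NULL when either step yields nothing
extract_from_url <- function(url, ...) {
  trafilatura <- import("trafilatura")
  downloaded <- trafilatura$fetch_url(url)
  if (is.null(downloaded)) {
    return(NA_character_)  # download failed
  }
  # further arguments (e.g. output_format) are passed on to extract()
  result <- trafilatura$extract(downloaded, url = url, ...)
  if (is.null(result)) NA_character_ else result
}
```

It could then be called as `extract_from_url("https://example.org/")` or mapped over a vector of URLs with `sapply()`, with NA marking the pages that could not be processed.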
Documents that are already stored can also be read directly from R, for example with CSV/TSV output and read_delim(); see information on data import in R.
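As a rough sketch of reading stored output, the snippet below writes a tiny tab-separated file and loads it with read_delim() from the readr package. The file and its two columns (`url`, `text`) are placeholders invented for the example; real output files may contain different columns.

```r
library(readr)

# Hypothetical TSV file standing in for previously stored extraction output
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "url\ttext",
  "https://example.org/\tThis domain is for use in illustrative examples in documents."
), tsv_file)

# Read every column as character to avoid type guessing on text data
docs <- read_delim(tsv_file, delim = "\t",
                   col_types = cols(.default = col_character()))
docs$text[1]
```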
Other functions
Specific parts of the package can also be imported on demand, which provides access to functions not directly exported by the package. For a list of relevant functions and arguments see core functions.
# using the code for link discovery in sitemaps
> sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search")
> sitemapsfunc$sitemap_search("https://www.sitemaps.org/")
[1] "https://www.sitemaps.org"
[2] "https://www.sitemaps.org/protocol.html"
[3] "https://www.sitemaps.org/faq.html"
[4] "https://www.sitemaps.org/terms.html"
...
# import the metadata part of the package as a function
> metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata")
> downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate")
> metadatafunc$extract_metadata(downloaded)
$title
[1] "rstudio/reticulate"
$author
[1] "Rstudio"
$url
[1] "https://github.com/rstudio/reticulate"
$hostname
[1] "github.com"
...
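As an alternative to py_run_string(), reticulate's import() also accepts dotted submodule names, so the same functions can be reached without writing Python import statements. This is a sketch of that variant, assuming trafilatura is installed and the network is reachable; the variable names are arbitrary.

```r
library(reticulate)

# Import the submodules directly instead of going through py_run_string()
sitemaps <- import("trafilatura.sitemaps")
metadata <- import("trafilatura.metadata")

# Same call as above, reached through the imported submodule
links <- sitemaps$sitemap_search("https://www.sitemaps.org/")
head(links)
```

Both approaches call the same Python functions; the difference is only in how the module object is obtained on the R side.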
Going further
- Quanteda is an R package for managing and analyzing text: