URL management¶

This page shows how to filter a list of URLs, with Python and on the command-line, using the functions provided by the courlan package which is included with Trafilatura.

Filtering of input URLs is useful to avoid hodgepodges like .../tags/abc or “internationalized” rubrics like .../en/..... It is best used on URL lists, before retrieving all pages and especially before massive downloads.

Hint

See the Courlan documentation for more examples.

Filtering a list of URLs¶

With Python¶

The function check_url() returns a URL and a domain name if everything is fine:

>>> from courlan import check_url

>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')

# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')

# optional argument targeting webpages in English or German
>>> my_url = 'https://www.un.org/en/about-us'
>>> url, domain_name = check_url(my_url, language='en')
>>> url, domain_name = check_url(my_url, language='de')

Other useful functions include URL cleaning and validation:

# helper function to clean URLs
>>> from courlan import clean_url

>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

# URL validation
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))

On the command-line¶

Most fonctions are also available through a command-line utility:

# display a message listing all options
$ courlan --help
# simple filtering and normalization
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
# strict filtering
$ courlan --language de --strict --inputfile mylist.txt --outputfile mylist-filtered.txt
# strict filtering including language filter
$ courlan --language de --strict --inputfile mylist.txt --outputfile mylist-filtered.txt

Sampling by domain name¶

This sampling methods allows for restricting the number of URLs to keep per host, for example:

Before: website1.com: 1000 URLs; website2.net: 50 URLs
After: website1.com: 50 URLs; website2.net: 50 URLs

With Python¶

>>> from courlan import sample_urls
>>> my_urls = ['…', '…', '…', ]  # etc.
>>> my_sample = sample_urls(my_urls, 50)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False

On the command-line¶

$ courlan --inputfile urls.txt --outputfile samples-urls.txt --sample --samplesize 50

Blacklisting¶

You can provide a blacklist of URLs which will not be processed and included in the output.

in Python: url_blacklist parameter (expects a set)
on the CLI: --blacklist arguments (expects a file containing URLs)

In Python, you can also pass a blacklist of author names as argument, see documentation.

Web crawling

Tutorials