Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
')
>>> extract(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'
Navigation
----------
Feeds
^^^^^
The function ``find_feed_urls`` is an all-in-one utility that attempts to discover feeds on a webpage if required, and/or downloads and parses feeds. It returns the extracted links as a sorted list of unique links.
.. code-block:: python

    >>> from trafilatura import feeds

    >>> mylist = feeds.find_feed_urls('https://www.theguardian.com/')
    # https://www.theguardian.com/international/rss has been found
    >>> mylist
    ['https://www.theguardian.com/...', '...'] # and so on

    # use a feed URL directly
    >>> mylist = feeds.find_feed_urls('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')
    >>> len(mylist) > 0
    True # it's not empty
.. note::
    The links are filtered according to patterns given by the user, e.g. using ``https://www.un.org/en/`` as argument implies taking all URLs corresponding to this category.
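If stricter filtering is needed, the returned list can also be post-processed in plain Python. The following sketch is an illustration rather than part of the feeds API; it simply keeps the links sharing a given prefix:

.. code-block:: python

    >>> from trafilatura import feeds

    >>> category = 'https://www.un.org/en/'
    >>> mylist = feeds.find_feed_urls(category)
    # optional extra step, not part of the API: keep only links under the given prefix
    >>> filtered = [link for link in mylist if link.startswith(category)]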
An optional argument ``target_lang`` makes it possible to filter links according to their expected target language. A series of heuristics is applied to the link path and parameters in order to discard unwanted URLs, thus saving processing time and download bandwidth.
.. code-block:: python

    >>> from trafilatura import feeds

    >>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='en')
    >>> len(mylist) > 0
    True # links found as expected

    >>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='ja')
    >>> mylist
    [] # target_lang set to Japanese, the English links were discarded this time
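The discovered links can then be fed to the rest of the toolchain. The following sketch is only an illustration, assuming the top-level ``fetch_url`` and ``extract`` functions; it downloads and extracts a handful of the discovered pages:

.. code-block:: python

    >>> from trafilatura import feeds, fetch_url, extract

    # gather feed links first, then download and extract each page
    >>> mylist = feeds.find_feed_urls('https://www.theguardian.com/')
    >>> for link in mylist[:5]:  # only a few pages for this example
    ...     downloaded = fetch_url(link)
    ...     if downloaded is not None:
    ...         result = extract(downloaded)
    ...         # further processing of the extracted text goes here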
For more information about feeds and web crawling, see:
- This blog post: `Using RSS and Atom feeds to collect web pages with Python