PODCASTPARSER(1) | podcastparser | PODCASTPARSER(1) |
podcastparser - podcastparser Documentation
podcastparser is a simple and fast podcast feed parser library in Python. The two primary users of the library are the gPodder Podcast Client and the gpodder.net web service.
The following feed types are supported:
The following specifications are supported:
These formats only specify the possible markup elements and attributes. We recommend that you also read the Podcast Feed Best Practice guide if you want to optimize your feeds for best display in podcast clients.
Where times and durations are used, the values are expected to be formatted either as seconds or as RFC 2326 Normal Play Time (NPT).
    import podcastparser
    import urllib.request  # Python 3: urlopen lives in urllib.request

    feedurl = 'http://example.com/feed.xml'
    parsed = podcastparser.parse(feedurl, urllib.request.urlopen(feedurl))

    # parsed is a dict
    import pprint
    pprint.pprint(parsed)
For both RSS and Atom feeds, only a subset of elements (those that are relevant to podcast client applications) is parsed. This section describes which elements and attributes are parsed and how the contents are interpreted/used.
For Atom feeds, podcastparser will handle the following elements and attributes:
Simplified, fast RSS parser
This exception allows users of this library to catch exceptions without having to import the XML parsing library themselves.
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
The name parameter contains the name of the element type, just as with the startElement event.
The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.
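The three SAX callbacks described above (character data, element end, element start) can be illustrated with a small handler built on the standard library's xml.sax module. This is a minimal sketch, not podcastparser's own handler; the class TitleCollector and the helper collect_titles are hypothetical names chosen for this example. It buffers character chunks because a parser may deliver one text node in several pieces.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a <title>"

    def startElement(self, name, attrs):
        # name is the raw element name; attrs holds the element's attributes.
        if name == 'title':
            self._buffer = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'title' and self._buffer is not None:
            self.titles.append(''.join(self._buffer))
            self._buffer = None


def collect_titles(xml_bytes):
    """Parse a bytes document and return the text of all <title> elements."""
    handler = TitleCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles
```

Joining the chunks only in endElement keeps the result correct no matter how the parser splits the character data.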
>>> file_basename_no_extension('/home/me/file.txt')
'file'
>>> file_basename_no_extension('file')
'file'
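The behaviour shown in these examples can be reproduced with os.path alone. The following is a sketch under that assumption, with the hypothetical name basename_no_extension; it is not necessarily how the library implements it.

```python
import os.path


def basename_no_extension(path):
    """Return the final path component with its last extension removed."""
    base, _extension = os.path.splitext(os.path.basename(path))
    return base
```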
By looking for an open tag (more or less):
>>> is_html('<h1>HELLO</h1>')
True
>>> is_html('a < b < c')
False
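One way to look for an open tag "more or less" is a regular expression that requires a letter directly after the opening angle bracket. The pattern and the name looks_like_html below are assumptions made for this sketch; the library's actual pattern may differ.

```python
import re

# Assumed pattern: '<', a tag name starting with a letter, optional
# attributes, an optional '/', then '>'.
_OPEN_TAG = re.compile(r'<[a-z][a-z0-9]*(?:\s[^>]*)?/?>', re.IGNORECASE)


def looks_like_html(text):
    """Return True if text contains something shaped like an opening tag."""
    return bool(_OPEN_TAG.search(text))
```

Because a bare "<" followed by a space never starts a tag name, comparisons such as 'a < b < c' are not mistaken for markup.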
This will also normalize feed:// and itpc:// to http://:
>>> normalize_feed_url('itpc://example.org/podcast.rss')
'http://example.org/podcast.rss'
If no URL scheme is defined (e.g. “curry.com”), we will simply assume the user intends to add an http:// feed:
>>> normalize_feed_url('curry.com')
'http://curry.com/'
It will also take care of converting the domain name to all-lowercase (because domain names are not case-sensitive):
>>> normalize_feed_url('http://Example.COM/')
'http://example.com/'
Some other minimal changes are also taken care of, e.g. a ? with an empty query is removed:
>>> normalize_feed_url('http://example.org/test?')
'http://example.org/test'
Leading and trailing whitespace is removed:
>>> normalize_feed_url(' http://example.com/podcast.rss ')
'http://example.com/podcast.rss'
Incomplete (too short) URLs are not accepted:
>>> normalize_feed_url('http://') is None
True
Unknown protocols are not accepted:
>>> normalize_feed_url('gopher://gopher.hprc.utoronto.ca/file.txt') is None
True
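The rules above can be sketched with urllib.parse from the standard library. This is an illustration, not the library's code: the name normalize_url_sketch is hypothetical, the list of accepted schemes is an assumption, and "incomplete" is interpreted here as "has no host part".

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these scheme lists are illustrative, not taken from the library.
ACCEPTED_SCHEMES = ('http', 'https', 'ftp')
ALIAS_SCHEMES = ('feed', 'itpc')  # treated as http://


def normalize_url_sketch(url):
    """Return a normalized feed URL, or None if the URL is not acceptable."""
    url = url.strip()

    # No scheme given: assume the user means an http:// feed.
    if '://' not in url:
        url = 'http://' + url

    scheme, netloc, path, query, fragment = urlsplit(url)

    # Schemes and domain names are case-insensitive.
    scheme, netloc = scheme.lower(), netloc.lower()

    if scheme in ALIAS_SCHEMES:
        scheme = 'http'

    # Reject unknown protocols and URLs without a host (e.g. 'http://').
    if scheme not in ACCEPTED_SCHEMES or not netloc:
        return None

    # urlunsplit drops a '?' that is followed by an empty query.
    return urlunsplit((scheme, netloc, path or '/', query, fragment))
```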
>>> parse_length(None)
-1
>>> parse_length('0')
-1
>>> parse_length('unknown')
-1
>>> parse_length('100')
100
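These examples suggest that -1 serves as the "unknown length" marker for missing, zero, or non-numeric values. A minimal sketch of that convention follows; the name length_or_unknown is hypothetical, and mapping negative numbers to -1 is an assumption of this sketch rather than documented behaviour.

```python
def length_or_unknown(text):
    """Return a positive length in bytes, or -1 if it is missing or invalid."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        # None raises TypeError; non-numeric strings raise ValueError.
        return -1
    # Assumption: zero and negative values are treated as unknown.
    return value if value > 0 else -1
```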
>>> parse_pubdate('Fri, 21 Nov 1997 09:55:06 -0600')
880127706
>>> parse_pubdate('2003-12-13T00:00:00+02:00')
1071266400
>>> parse_pubdate('2003-12-13T18:30:02Z')
1071340202
>>> parse_pubdate('Mon, 02 May 1960 09:05:01 +0100')
-305049299
>>> parse_pubdate('')
0
>>> parse_pubdate('unknown')
0
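The examples cover two date styles: the RFC 2822 form used in RSS and the ISO 8601 form used in Atom, with 0 returned when the value cannot be parsed. The sketch below reproduces those results using only the standard library. The name pubdate_to_timestamp is hypothetical, and treating a date without a time zone as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def pubdate_to_timestamp(text):
    """Convert an RFC 2822 or ISO 8601 date string to a Unix timestamp."""
    text = (text or '').strip()
    if not text:
        return 0
    try:
        # RFC 2822 style, e.g. 'Fri, 21 Nov 1997 09:55:06 -0600'.
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style; map a trailing 'Z' to an explicit UTC offset.
            parsed = datetime.fromisoformat(text.replace('Z', '+00:00'))
        except ValueError:
            return 0
    if parsed.tzinfo is None:
        # Assumption: dates without a time zone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```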
See RFC2326, 3.6 “Normal Play Time” (HH:MM:SS.FRACT)
>>> parse_time('0')
0
>>> parse_time('128')
128
>>> parse_time('00:00')
0
>>> parse_time('00:00:00')
0
>>> parse_time('00:20')
20
>>> parse_time('00:00:20')
20
>>> parse_time('01:00:00')
3600
>>> parse_time(' 03:02:01')
10921
>>> parse_time('61:08')
3668
>>> parse_time('25:03:30 ')
90210
>>> parse_time('25:3:30')
90210
>>> parse_time('61.08')
61
>>> parse_time('01:02:03.500')
3723
>>> parse_time(' ')
0
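The conversion from a Normal Play Time string to whole seconds can be sketched by reading the colon-separated fields from right to left as seconds, minutes, and hours, then truncating any fraction. The name npt_to_seconds is hypothetical, and raising ValueError for more than three fields is an assumption of this sketch.

```python
def npt_to_seconds(value):
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS[.FRACT]' to whole seconds."""
    value = value.strip()
    if not value:
        return 0

    fields = value.split(':')
    if len(fields) > 3:
        # Assumption: anything beyond hours:minutes:seconds is invalid.
        raise ValueError('too many fields in time value: %r' % value)

    seconds = 0.0
    # Rightmost field is seconds, then minutes, then hours.
    for power, field in enumerate(reversed(fields)):
        seconds += float(field) * 60 ** power

    # Fractional seconds are truncated, not rounded.
    return int(seconds)
```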
>>> parse_type('text/plain')
'text/plain'
>>> parse_type('text')
'application/octet-stream'
>>> parse_type('')
'application/octet-stream'
>>> parse_type(None)
'application/octet-stream'
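The examples imply a fallback to the generic binary type whenever the value does not look like a MIME type. A minimal sketch, assuming that "looks like a MIME type" simply means "contains a slash"; the name mime_type_or_default is hypothetical.

```python
DEFAULT_MIME_TYPE = 'application/octet-stream'


def mime_type_or_default(text):
    """Return text if it resembles 'type/subtype', else the generic default."""
    # Assumption: the presence of '/' is the only validity check.
    if not text or '/' not in text:
        return DEFAULT_MIME_TYPE
    return text
```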
>>> squash_whitespace(' some text with a lot of spaces ')
'some text with a lot of spaces'
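Collapsing runs of whitespace and trimming both ends can be done with a single regular-expression substitution. The following sketch uses the hypothetical name collapse_whitespace and is not necessarily how the library does it.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim the ends."""
    return re.sub(r'\s+', ' ', text.strip())
```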
This is a list of podcast-related XML namespaces that are not yet supported by podcastparser, but might be in the future.
gPodder Team
2020, gPodder Team
July 6, 2020 | 0.6.5 |