.. _topics-selectors:

=========
Selectors
=========

When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available
to achieve this, such as:

* `BeautifulSoup`_ is a very popular web scraping library among Python
  programmers which constructs a Python object based on the structure of the
  HTML code and also deals with bad markup reasonably well, but it has one
  drawback: it's slow.

* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
  API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python
  standard library.)

Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified
either by `XPath`_ or `CSS`_ expressions.

`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
defines selectors to associate those styles with specific HTML elements.

.. note::
    Scrapy Selectors is a thin wrapper around the `parsel`_ library; the
    purpose of this wrapper is to provide better integration with Scrapy
    Response objects.

    `parsel`_ is a stand-alone web scraping library which can be used without
    Scrapy. It uses the `lxml`_ library under the hood, and implements an easy
    API on top of the lxml API. This means Scrapy selectors are very similar
    in speed and parsing accuracy to lxml.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: https://lxml.de/
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
.. _parsel: https://parsel.readthedocs.io/en/latest/

Using selectors
===============

Constructing selectors
----------------------

.. highlight:: python

Response objects expose a :class:`~scrapy.selector.Selector` instance on the
``.selector`` attribute:

>>> response.selector.xpath('//span/text()').get()
'good'

Querying responses using XPath and CSS is so common that responses include two
more shortcuts: ``response.xpath()`` and ``response.css()``:

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

Scrapy selectors are instances of the :class:`~scrapy.selector.Selector`
class, constructed by passing either a :class:`~scrapy.http.TextResponse`
object or markup as a string (in the ``text`` argument).

Usually there is no need to construct Scrapy selectors manually: the
``response`` object is available in Spider callbacks, so in most cases it is
more convenient to use the ``response.css()`` and ``response.xpath()``
shortcuts. By using ``response.selector`` or one of these shortcuts you can
also ensure the response body is parsed only once.

But if required, it is possible to use ``Selector`` directly. Constructing
from text:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Constructing from response, where :class:`~scrapy.http.HtmlResponse` is one of
the :class:`~scrapy.http.TextResponse` subclasses:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'

``Selector`` automatically chooses the best parsing rules (XML vs HTML) based
on the input type.

Using selectors
---------------

To explain how to use the selectors we'll use the Scrapy shell (which
provides interactive testing) and an example page located on the Scrapy
documentation server:

    https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

.. _topics-selectors-htmlcode:

For the sake of completeness, here's its full HTML code:

.. literalinclude:: ../_static/selectors-sample1.html
   :language: html

.. highlight:: sh

First, let's open the shell::

    scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you'll have the response available as the
``response`` shell variable, and its attached selector in the
``response.selector`` attribute.

Since we're dealing with HTML, the selector will automatically use an HTML
parser.

.. highlight:: python

So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, you can construct XPath and CSS queries for the data you want to
extract.
Working with relative XPaths
----------------------------

Keep in mind that if you are nesting selectors and use an XPath that starts
with ``/``, that XPath will be absolute to the document and not relative to
the ``Selector`` you're calling it from.

For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all the ``<div>`` elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.get())

This is the proper way to do it (note the dot prefixing the ``.//p`` XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.get())

Another common case would be to extract all direct ``<p>`` children:

>>> for p in divs.xpath('p'):
...     print(p.get())

For more details about relative XPaths, see the `Location Paths`_ section in
the XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath/all/#location-paths

When querying by class, consider using CSS
------------------------------------------

Because an element can contain multiple CSS classes, the XPath way to select
elements by class is rather verbose::

    *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use ``@class='someclass'`` you may end up missing elements that have
other classes, and if you just use ``contains(@class, 'someclass')`` to make
up for that you may end up with more elements than you want, if they have a
different class name that shares the string ``someclass``.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the
time you can just select by class using CSS and then switch to XPath when
needed:

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember
to use the ``.`` in the XPath expressions that will follow.

Beware of the difference between //node[1] and (//node)[1]
----------------------------------------------------------

``//node[1]`` selects all the nodes occurring first under their respective
parents.

``(//node)[1]`` selects all the nodes in the document, and then gets only the
first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

This gets all first ``<li>`` elements under their respective parents:

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first ``<li>`` element in the whole document:

>>> xp("(//li)[1]")
['<li>1</li>']
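The ``//node[1]`` vs ``(//node)[1]`` distinction can also be experimented with
outside Scrapy. The sketch below uses the standard library's
``xml.etree.ElementTree``, which supports only a limited XPath subset, and
hypothetical markup (it is an analogue for illustration, not Scrapy's API):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<html>'
    '<ul class="list"><li>1</li><li>2</li><li>3</li></ul>'
    '<ul class="list"><li>4</li><li>5</li><li>6</li></ul>'
    '</html>'
)

# ".//li[1]" mirrors //li[1]: every <li> that is the first <li> of its parent.
first_per_parent = [li.text for li in root.findall('.//li[1]')]
print(first_per_parent)  # -> ['1', '4']

# Indexing [0] into all matches mirrors (//li)[1]: the first <li> in the document.
first_in_document = root.findall('.//li')[0].text
print(first_in_document)  # -> 1
```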
Other XPath extensions
----------------------

Scrapy selectors also provide a ``has-class`` XPath extension function, which
returns ``True`` for nodes that have all of the specified HTML classes.

For the following HTML:

.. code-block:: html

    <p class="foo bar-baz">First</p>
    <p class="foo">Second</p>
    <p class="bar">Third</p>
    <p>Fourth</p>

.. highlight:: python

You can use it like this:

>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
 <Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]

.. _selector-examples-html:

Selector examples on HTML response
----------------------------------

Here are some examples to illustrate concepts for :class:`Selector` objects
instantiated with an :class:`~scrapy.http.HtmlResponse` object::

    sel = Selector(html_response)

For example, to iterate over all ``<p>`` tags and print their class
attribute::

    for node in sel.xpath("//p"):
        print(node.attrib['class'])
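For quick experiments without a Scrapy shell, a rough standard-library
analogue of this iteration is sketched below. The markup is hypothetical, and
``xml.etree.ElementTree`` elements expose a similar ``attrib`` mapping (this
is not Scrapy's API):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a parsed response body.
doc = ET.fromstring(
    '<body>'
    '<p class="intro">first</p>'
    '<p class="body-text">second</p>'
    '</body>'
)

# Iterate over every <p> element and collect its class attribute.
classes = [node.attrib['class'] for node in doc.iter('p')]
print(classes)  # -> ['intro', 'body-text']
```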
.. _selector-examples-xml:
Selector examples on XML response
---------------------------------
Here are some examples to illustrate concepts for :class:`Selector` objects
instantiated with an :class:`~scrapy.http.XmlResponse` object::
    sel = Selector(xml_response)
1. Select all ``<product>`` elements from an XML response body, returning a
   list of :class:`Selector` objects::

       sel.xpath("//product")
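The same kind of selection can be approximated with the standard library for
quick experiments. The feed below is a minimal hypothetical sketch, not
Scrapy's ``XmlResponse``, and ``findall`` supports only a limited XPath
subset:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal product feed for illustration.
xml_body = (
    '<feed>'
    '<product><name>Widget</name></product>'
    '<product><name>Gadget</name></product>'
    '</feed>'
)
root = ET.fromstring(xml_body)

# ".//product" selects all <product> descendants, similar to //product.
products = root.findall('.//product')
names = [p.findtext('name') for p in products]
print(names)  # -> ['Widget', 'Gadget']
```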