Python lxml module

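With Python's lxml module, XML documents can be parsed and built with very little code. The examples below also use the requests module to fetch an Atom feed over HTTP.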

First, install the lxml module (and requests, which the examples use to download the feed).

$ (sudo) easy_install pip
$ (sudo) pip install lxml
$ (sudo) pip install requests

To parse a document:

$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> try:
...     from lxml import etree
... except ImportError:
...     import xml.etree.ElementTree as etree
...
>>> res = requests.get('http://rss.rssad.jp/rss/gihyo/feed/atom')
>>> with open('gihyo.xml', 'wb') as f:
...     f.write(res.content)
...
>>> tree = etree.parse('gihyo.xml')
>>> root = tree.getroot()
>>> print(root.tag)
{http://www.w3.org/2005/Atom}feed
>>> for child in root:
...     print(child)
...
<Element {http://www.w3.org/2005/Atom}title at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}subtitle at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}id at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}link at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}author at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}updated at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}rights at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}icon at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}link at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fa28>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f170>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fe18>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22f998>
<Element {http://www.w3.org/2005/Atom}entry at 0x10c22fa28>
>>> link = root.find("{http://www.w3.org/2005/Atom}link")
>>> print(link.attrib)
{'href': 'http://gihyo.jp/'}
>>> # The response body can also be parsed directly, without saving it to a file
>>> root = etree.fromstring(res.text.encode('utf-8'))
>>> print(root)
<Element {http://www.w3.org/2005/Atom}feed at 0x10c2302d8>
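
Writing the full {namespace}tag form in every lookup gets tedious, so the following is a minimal sketch of the same kind of search using a prefix-to-namespace mapping. The SAMPLE_FEED data, the 'atom' prefix, and the list_entries helper are made up for illustration and are not part of the feed or of lxml; the sketch only assumes that find, findall, and findtext accept a namespaces mapping, which holds for both lxml and the standard library ElementTree.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the downloaded feed (not real gihyo.jp data)
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample feed</title>
  <link href="http://example.com/"/>
  <entry>
    <title>first entry</title>
    <link href="http://example.com/1"/>
  </entry>
  <entry>
    <title>second entry</title>
    <link href="http://example.com/2"/>
  </entry>
</feed>"""

# The prefix 'atom' is an arbitrary local name chosen for this sketch
NS = {'atom': 'http://www.w3.org/2005/Atom'}


def list_entries(xml_bytes):
    """Return a list of (title, href) pairs, one per Atom entry."""
    root = etree.fromstring(xml_bytes)
    entries = []
    for entry in root.findall('atom:entry', NS):
        title = entry.findtext('atom:title', namespaces=NS)
        link = entry.find('atom:link', NS)
        entries.append((title, link.get('href')))
    return entries


for title, href in list_entries(SAMPLE_FEED):
    print('%s %s' % (title, href))
```

With the real feed, res.content could be passed to list_entries in place of SAMPLE_FEED, assuming every entry carries a title and a link element.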

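To build a document (the nsmap argument and tounicode() are lxml-specific, so this part requires lxml and does not work with the ElementTree fallback):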

>>> nsmap = {None: 'http://www.w3.org/2005/Atom'}
>>> new_elem = etree.Element('feed', nsmap=nsmap)
>>> sub1 = etree.SubElement(new_elem, 'title')
>>> sub1.text = 'my test feed'
>>> sub2 = etree.Element('link', attrib={'href': 'http://gihyo.jp'})
>>> new_elem.append(sub2)
>>> print(etree.tounicode(new_elem, pretty_print=True))
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>my test feed</title>
  <link href="http://gihyo.jp"/>
</feed>
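
As a complement, here is a minimal sketch that builds the same small feed with the namespace written into each tag in {namespace}tag form, then serializes it and parses it back to check the result. The build_feed helper and the ATOM constant are names invented for this example; only Element, SubElement, tostring, and fromstring are used, so it should also run with the standard library ElementTree, although the serialized prefixes may then differ from lxml's output.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

ATOM = 'http://www.w3.org/2005/Atom'


def build_feed(title, href):
    """Build a tiny Atom-like feed element (hypothetical helper)."""
    feed = etree.Element('{%s}feed' % ATOM)
    title_elem = etree.SubElement(feed, '{%s}title' % ATOM)
    title_elem.text = title
    etree.SubElement(feed, '{%s}link' % ATOM, attrib={'href': href})
    return feed


feed = build_feed('my test feed', 'http://gihyo.jp')

# Serialize to bytes, then parse the bytes back into a new tree
data = etree.tostring(feed)
reparsed = etree.fromstring(data)

print(reparsed.tag)
print(reparsed.findtext('{%s}title' % ATOM))
```

Round-tripping through tostring and fromstring is a cheap way to confirm that the elements ended up in the intended namespace.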