
    A light byte of Python

    Tim Golden

    http://www.lanzhoudaxin.com/python/iet-talk

    Data Integration

    Taking data from here and putting it there

    Files, databases, web pages: all of these (and more) are possible sources and targets for everyday data needs.

    • Python is excellent at extracting data and at emitting it again (see the sketch after this list).
    • Built-in support for regular expressions, URL reading and conventional file I/O.
    • Many useful third-party packages for reading and writing other formats.
    • On Windows, links into COM and -- with IronPython -- the world of .Net.
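
    As a minimal sketch of both halves, in the same Python 2 style as the rest of the talk: fetch a page with urllib and pull out headings with a regular expression. The URL is the one used later in the talk; the pattern is purely illustrative:

        import re
        import urllib

        # Extract: pull the page down as a single string.
        html = urllib.urlopen ("http://local.goodtoread.org/whatsnew").read ()

        # Emit: print anything that looks like an <h2> heading.
        for heading in re.findall (r"<h2>(.*?)</h2>", html):
          print heading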

    Our Task

    • The task: read the "What's New" page at local.goodtoread.org and store each book summary, with its image, in a database.
    • This appears simple, and indeed it is simple.
    • Python is excellent at this kind of thing.
    • We have limited time, so we'll go through it step by step.

    Extras

    If time permits, write all or part of the stored list out in another format:

    • Once we have the data in one structure, it's just a question of writing it out in another (see the sketch below).
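
    As one example, a minimal sketch which dumps a list of (date, title) pairs, like the summaries list built later in the talk, out to a CSV file using the stdlib csv module. The write_csv helper and the field choice are hypothetical, for illustration only:

        import csv

        def write_csv (summaries, filename):
          "Write a list of (date, title) pairs out as CSV"
          writer = csv.writer (open (filename, "wb"))
          writer.writerow (["date", "title"])
          for date, title in summaries:
            writer.writerow ([date.isoformat (), title])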

    The Source

      [div class="whatsnew-date"]
    
      [h2]DATE[/h2]
    
      [div class="book-summary"]
        [p][span class="book-title"]TITLE[/span][/p]
        [img class="thumblet" src="IMAGE_URL" /]
    
        [div class="book-synopsis-summary"]
          [p class="synopsis"]SYNOPSIS[/p]
          [p class="summary"]SUMMARY[/p]
        [/div]
    
      [/div]
      
    • I've removed some of the HTML we're not interested in.
    • We need to pull out each of the capitalised fields: DATE, TITLE, IMAGE_URL, SYNOPSIS and SUMMARY.

    The Target

        CREATE TABLE whatsnew
        (
          id INTEGER PRIMARY KEY,
          title TEXT,
          synopsis TEXT,
          summary TEXT,
          image BLOB,
          date_added DATE
        )
      
    • The id is a surrogate primary key, automatically incremented.
    • The other information comes out of the web page, including the actual image itself, stored here as a BLOB.
    • Note that date_added is stored for each row (a sketch of creating this table from Python follows).
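
    A minimal sketch, not shown in the talk itself, of setting this table up from Python with the sqlite3 module. The schema is the one above; IF NOT EXISTS is added so the script can be re-run, and books.db is the filename used later on:

        import sqlite3

        # Create the target table if it isn't already there.
        db = sqlite3.connect ("books.db")
        db.execute ("""
          CREATE TABLE IF NOT EXISTS whatsnew
          (
            id INTEGER PRIMARY KEY,
            title TEXT,
            synopsis TEXT,
            summary TEXT,
            image BLOB,
            date_added DATE
          )
        """)
        db.commit ()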

    The Tools

    Built-in modules

    Third-party modules

    • Python comes supplied with a large number of useful modules.
    • Sometimes, other people build on those to provide more accessible interfaces.
    • Other third-party modules cover things the stdlib doesn't.
    • sqlite3 is standard from Python 2.5 but exists as a third-party module for 2.4 and below.
    • The stdlib has an HTMLParser, but it relies on well-formed HTML; BeautifulSoup copes with messier real-world pages (see the sketch below).
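
    To illustrate that tolerance, a minimal sketch of BeautifulSoup parsing a fragment with unclosed tags; the fragment itself is made up:

        from BeautifulSoup import BeautifulSoup as Soup

        # An unclosed <p> which would trip up a strict parser.
        broken_html = "<div><p>Hello<p>World</div>"
        soup = Soup (broken_html)
        for p in soup.findAll ("p"):
          print p.string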

    The Outline

    1. Suck the web page into a BeautifulSoup structure
    2. Extract the information you need
      • For each date block
        • For each summary within it
          • Extract the Title, the Details, and the Image
          • Add the Date and this information to a list
    3. Write each summary in the list to the database (a skeleton of these steps follows)
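
    Put together, the outline maps onto a skeleton something like this. The function names are hypothetical stand-ins for the snippets on the following slides:

        def read_page (url):
          "Step 1: suck the web page into a BeautifulSoup structure"

        def extract_summaries (page):
          "Step 2: pull the date, title, synopsis, summary and image out of each block"

        def write_to_database (summaries, db_filename):
          "Step 3: write each summary in the list to the database"

        if __name__ == '__main__':
          page = read_page ("http://local.goodtoread.org/whatsnew")
          summaries = extract_summaries (page)
          write_to_database (summaries, "books.db")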

    Getting the web page

    import is the way to get hold of an external module as a namespace

        import os, sys
        import urllib
        from BeautifulSoup import BeautifulSoup as Soup
      
        web_connection = urllib.urlopen ("http://local.goodtoread.org/whatsnew")
        page = Soup (web_connection)
      
        print page.find ("h2")
        # or print page.h2
        

    Reading the date

        from datetime import datetime
      
        date_block = page.find ("div", "whatsnew-date")
        date_string = date_block.h2.string
        #
        # The date string reads something like "Wednesday 5th September 2007":
        # split it apart, strip the ordinal suffix from the day, and rebuild
        # a string that strptime can parse.
        #
        w, d, m, y = date_string.split ()
        print w
        print d
        print m
        print y
        d = d[:-2]
        date_string = "%s %s %s" % (d, m, y)
        date = datetime.strptime (date_string, "%d %B %Y").date ()
        

    Reading the story

        summaries = []
        for summary in date_block.findAll ("div", "book-summary"):
          title = summary.find ("span", "book-title").string
          summaries.append ((date, title))
        

    Getting the picture

        import urlparse
      
        image = summary.find ("img", "thumblet")
        href = urlparse.urljoin ("http://local.goodtoread.org/", image['src'])
        image_data = urllib.urlopen (href).read ()
      
        open ("temp.jpg", "wb").write (image_data)
        os.startfile ("temp.jpg")
        

    Writing it out

        import sqlite3
      
        reviews = []
        for review in date_block.findAll ("div", "book-summary"):
          title = review.find ("span", "book-title").string
          synopsis = review.find ("p", "synopsis").string
          summary = review.find ("p", "summary").string
          image = review.find ("img", "thumblet")
          url = urlparse.urljoin ("http://local.goodtoread.org", image['src'])
          image_data = urllib.urlopen (url).read ()
      
          reviews.append ((date, title, synopsis, summary, image_data))
      
        db = sqlite3.connect ("books.db")
        # Start from scratch on each run: clear out any previously stored rows.
        db.execute ("DELETE FROM whatsnew")
        db.executemany ("""
          INSERT INTO whatsnew (title, synopsis, summary, image, date_added)
          VALUES (?, ?, ?, ?, ?)
          """,
          reviews
        )
        db.commit ()
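
    As a quick sanity check, not part of the talk's script, a minimal sketch of reading the rows straight back out:

        import sqlite3

        db = sqlite3.connect ("books.db")
        for title, date_added in db.execute ("SELECT title, date_added FROM whatsnew"):
          print date_added, title
        db.close ()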
      
        

    The finished result

    books_to_db.py

    Other possibilities
