ciaranmcnulty (ciaranmcnulty) wrote,

Microformats let you take control of your data

I use iCal on the Mac quite a lot now - I've got my basic diary in there plus a feed from Facebook of any Events I've RSVPed to, a feed of Last.FM gigs I'm going to and a list of UK bank holidays I found somewhere (incidentally this is why I think it's much better when people organise things via Facebook!).

The whole lot syncs to my phone and helps me compensate for my terrible terrible memory. One thing that was missing, that I can never remember, is the fixture list for Crewe - I'm always being surprised on a Tuesday when the results come in and we've played.

I searched around but couldn't find an iCalendar feed for the fixtures anywhere. There were plenty of web pages with the list but nothing I could subscribe to, so I decided to try and build my own.

The first thing was to find a reliable source - I settled on the good old BBC. Their HTML is not as nice as it might be but they have lists for lots of different teams and I figured it'd be fairly accurate. I chose the page at:
http://news.bbc.co.uk/sport1/hi/football/teams/c/crewe_alexandra/fixtures/default.stm
which contains fixtures in the following format:

<div class="mvb"><b>Saturday, 01 December 2007</b></div>
<div class="pvtb"><b>The FA Cup sponsored by E.ON</b></div>
<div class="mvb"><a href="..." class="stats">Oldham</a> v
<a href="..." class="stats">Crewe</a>, R2, 15:00</div>

The second thing I realised was that I didn't know enough about iCalendar to generate it reliably, so instead I opted to generate something I knew - HTML. By marking up my HTML with the appropriate hCalendar Microformat markup, I knew I'd be able to then feed the resulting page into a proxy that converted it into a nice .ICS for me.

Grabbing the HTML from a remote source was fairly straightforward, and grabbing the lines I wanted was just a case of applying a sensible XPath:

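// Collect libxml's warnings about the BBC's untidy HTML instead of silencing them with @
libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Date, competition and fixture lines inside the main content cell
$nodes = $xpath->query("//td[@width='416']//div[@class='mvb' or @class='pvtb']");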

The XPath gets all the nodes with class="mvb" or class="pvtb" in the main content area. If that td[@width='416'] looks a bit lame then I apologise, but the BBC markup is so tangled that it was the only way I could see of identifying the main area on the page. Obviously if the BBC change their HTML much this will break, but hopefully it'll be easy enough to change.
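To see what that query picks out, here's a minimal sketch that runs it against the BBC snippet from earlier rather than the live page. The wrapping table cell is an assumption on my part (the real page has a lot more around it), and the whitespace tidying is just one way of flattening each node to a single line of text.

```php
<?php
// Sample markup modelled on the BBC snippet; the surrounding <table>/<td> is assumed.
$html = <<<HTML
<table><tr><td width="416">
<div class="mvb"><b>Saturday, 01 December 2007</b></div>
<div class="pvtb"><b>The FA Cup sponsored by E.ON</b></div>
<div class="mvb"><a href="..." class="stats">Oldham</a> v
<a href="..." class="stats">Crewe</a>, R2, 15:00</div>
</td></tr></table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//td[@width='416']//div[@class='mvb' or @class='pvtb']");

$lines = [];
foreach ($nodes as $node) {
    // Collapse whitespace runs, including the line break inside the fixture div.
    $lines[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
}

print_r($lines);
// Three entries: the date, the competition, and "Oldham v Crewe, R2, 15:00"
```

Each fixture comes out as three consecutive text lines, which is what makes the "groups of three" approach workable.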

The next thing is basically to loop through each group of three lines and parse them individually, then generate some nice(ish) hCalendar markup:

<div class="vevent">
    <h2 id="event-0">
        <a class="uid" href="http://[...snipped...]">
            <abbr class="dtstart dtstamp" 
                  title="2007-12-01T15:00:00+00:00">15:00</abbr>
            -
            <abbr class="dtend" 
                  title="2007-12-01T16:45:00+00:00">16:45</abbr> 
        1st December 2007</a>
    </h2>

    <p class="description">
        <span class="summary">Oldham v Crewe, R2</span>
         - The FA Cup sponsored by E.ON</p>
</div>
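The parsing step can be sketched roughly like this. It isn't the actual source of the translator: fixtureToHCalendar is a hypothetical helper name, the kick-off time is assumed to be UK local time, the 105 minute match length is simply inferred from the 15:00 to 16:45 example, and the uid link is left out because its URL is snipped above.

```php
<?php
// Hypothetical helper: turns one group of three text lines into an hCalendar block.
function fixtureToHCalendar(string $dateLine, string $competition, string $matchLine): string
{
    // Split "Oldham v Crewe, R2, 15:00" into the summary and the kick-off time.
    if (!preg_match('/^(.*),\s*(\d{1,2}:\d{2})$/', trim($matchLine), $m)) {
        throw new InvalidArgumentException("Unrecognised fixture line: $matchLine");
    }
    [, $summary, $time] = $m;

    // Drop the weekday ("Saturday, ") and parse the rest. Assumption: UK local time.
    $date  = preg_replace('/^\w+,\s*/', '', trim($dateLine));
    $tz    = new DateTimeZone('Europe/London');
    $start = DateTimeImmutable::createFromFormat('!d F Y H:i', "$date $time", $tz);
    if ($start === false) {
        throw new InvalidArgumentException("Unrecognised date line: $dateLine");
    }

    // Assumption: a match occupies 105 minutes, as in the example markup.
    $end = $start->modify('+105 minutes');

    $e = fn(string $s): string => htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

    return sprintf(
        '<div class="vevent">' . "\n"
        . '    <h2><abbr class="dtstart dtstamp" title="%s">%s</abbr> - '
        . '<abbr class="dtend" title="%s">%s</abbr> %s</h2>' . "\n"
        . '    <p class="description"><span class="summary">%s</span> - %s</p>' . "\n"
        . '</div>' . "\n",
        $start->format('c'), $start->format('H:i'),
        $end->format('c'), $end->format('H:i'), $start->format('jS F Y'),
        $e(trim($summary)), $e(trim($competition))
    );
}

echo fixtureToHCalendar(
    'Saturday, 01 December 2007',
    'The FA Cup sponsored by E.ON',
    'Oldham v Crewe, R2, 15:00'
);
```

The machine-readable dates live in the title attributes of the abbr elements, so the visible text can stay as friendly as you like.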

The whole page of events can be seen here.

That might not seem like much - I've turned some HTML tag soup into some other HTML with slightly more descriptive markup. The difference is that the markup in there is hCalendar and corresponds directly to the fields in iCalendar, so it can be easily converted. In fact running the whole page through an hCal->iCal proxy such as Brian Suda's will output a nicely formatted .ICS file that most calendaring apps will understand:

BEGIN:VEVENT
DESCRIPTION;CHARSET=utf-8:Oldham v Crewe\, R2 - The FA Cup sponsored by E.ON
SUMMARY;CHARSET=utf-8:Oldham v Crewe\, R2
UID: ...snipped...
DTSTART;VALUE=DATE-TIME:20071201T150000Z
DTEND;VALUE=DATE-TIME:20071201T164500Z
DTSTAMP:20071201T150000Z
END:VEVENT
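To show how direct the field mapping is, here's a rough sketch of the conversion for a single event. This is not how Brian Suda's proxy is implemented; icalEscape, icalUtc and vevent are hypothetical names, the uid value is a placeholder, and details such as folding long lines and the CHARSET/VALUE parameters are left out.

```php
<?php
// Escape text values the way iCalendar expects: backslash, semicolon, comma, newline.
function icalEscape(string $text): string
{
    $text = str_replace('\\', '\\\\', $text);
    $text = str_replace([';', ','], ['\\;', '\\,'], $text);
    return str_replace(["\r\n", "\n"], '\\n', $text);
}

// Turn an ISO 8601 value from an abbr title into a UTC iCalendar date-time.
function icalUtc(string $iso8601): string
{
    $utc = (new DateTimeImmutable($iso8601))->setTimezone(new DateTimeZone('UTC'));
    return $utc->format('Ymd\THis\Z');
}

// Hypothetical: build one VEVENT from values already read out of the hCalendar markup.
function vevent(array $event): string
{
    $lines = [
        'BEGIN:VEVENT',
        'DESCRIPTION:' . icalEscape($event['description']),
        'SUMMARY:' . icalEscape($event['summary']),
        'UID:' . $event['uid'],
        'DTSTART:' . icalUtc($event['dtstart']),
        'DTEND:' . icalUtc($event['dtend']),
        'DTSTAMP:' . icalUtc($event['dtstamp']),
        'END:VEVENT',
    ];
    // iCalendar lines end with CRLF.
    return implode("\r\n", $lines) . "\r\n";
}

echo vevent([
    'uid'         => 'placeholder-uid', // stand-in; the real UID is snipped above
    'summary'     => 'Oldham v Crewe, R2',
    'description' => 'Oldham v Crewe, R2 - The FA Cup sponsored by E.ON',
    'dtstart'     => '2007-12-01T15:00:00+00:00',
    'dtend'       => '2007-12-01T16:45:00+00:00',
    'dtstamp'     => '2007-12-01T15:00:00+00:00',
]);
```

Every hCalendar class name lines up with an iCalendar property of the same name, which is why a generic proxy can do this without knowing anything about football.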

I can create a static link to the proxy and subscribe to it in iCal. The data from the BBC pages then shows up in my calendar and will regularly update as fixtures change.

It all sounds pretty complicated, but from start to finish it took me about 20 mins to write the code to parse the BBC HTML, 5 to generate the hCalendar markup and another 5 mins of fiddling with it to check it imports into iCal correctly. Now I've got data from some random site in a form I can use with any app I like.

To me, that's the promise of the semantic web (or Web 2.0, if I must use the buzzword) in a nutshell, and it's a great testament to the power of Microformats in general for enabling HTML data to be used in new and innovative ways.

My BBC -> hCal translator is available at my new development site and should work on any of the beeb's football fixtures pages and maybe some other sports too. The source is also freely available for those interested.

Tags: crewe, development, html, microformats, php, tech