Use Perl to harness XML Data Sources
A step-by-step guide to processing the Moreover XML headline feeds
19th February 2001
Web gurus are constantly telling us that great content
is the key to an even greater website. But when the available ‘content
wizards’ and JavaScript code snippets offer only a limited way of
bringing content from external sources into your site, where do you turn
for headlines, statistics and other useful data? The answer? It must be
XML!
The Perl server-side scripting language is the ultimate partner for XML, as
it enables you to actually use the data from XML sources. From here on we
assume a basic knowledge of the Perl language and of how to upload and
maintain scripts on a server.
Perl lets developers fetch web pages (and other files) from the web with
its LWP module. The following script downloads a web page and passes it on
to the user:
#!/usr/bin/perl
use LWP::Simple;
$WebPage=get('http://www.freesticky.com/stickyweb/default.asp');
# $WebPage now holds the FreeSticky.com page
# Print headers for web browser viewing; double quotes are needed so that \n becomes a newline
print "Status: 200 OK\nContent-type: text/html\n\n";
print $WebPage;
When uploaded to a Perl/CGI-enabled host and viewed
through a web browser, this script should display the FreeSticky.com
homepage. As you’ll have guessed, get() can also be used to
retrieve XML documents.
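One detail worth guarding against: get() from LWP::Simple returns undef when a download fails, so printing its result blindly can produce an empty page. The following is a minimal sketch of such a guard. The helper name build_response and the error message are illustrative assumptions, and a fixed string stands in for the downloaded page so that the example runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): wrap fetched content in CGI headers,
# or return a short error page when the fetch failed (content is undef).
sub build_response {
    my ($content) = @_;
    my $headers = "Content-type: text/html\n\n";
    return $headers . $content if defined $content;
    return "Status: 502 Bad Gateway\n" . $headers
         . "<P>Sorry, the remote page could not be retrieved.</P>\n";
}

# In a real script the content would come from LWP::Simple, for example:
#   my $page = get('http://www.freesticky.com/stickyweb/default.asp');
my $page = '<HTML><BODY>Sample page</BODY></HTML>';   # stand-in for get()
print build_response($page);
```

Passing undef to build_response, as a failed get() would, yields the error page instead of an empty body.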
You’ll find many sources of XML-formatted data on the web, but some may
limit commercial use of their content. Moreover.com are famous for
their JavaScript-creating web wizard, but did you know that they also
offer their content in the form of XML for your own customized use? A full
list of XML addresses from Moreover is available here.
For the following example we are going to use the Microsoft Corporation
XML feed (http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml).
On Moreover, content is offered in the form:
<article id="ARTICLE_ID">
<url>ARTICLE_URL</url>
<headline_text>HEADLINE_CLIPPET</headline_text>
<source>ORIGINATION_OF_ARTICLE</source>
<media_type>text</media_type>
<cluster>moreover...</cluster>
<tagline> </tagline>
<document_url>ORIGINATION_WEB_ADDRESS</document_url>
<harvest_time>TIME_HARVESTED</harvest_time>
<access_registration> </access_registration>
<access_status> </access_status>
</article>
Of course, document headers surround these repeating
clusters of data, but these are the pieces of data we’ll be working
with.
So, to write a Perl script that collects, parses and redisplays this
data, we’ll start off with the mandatory headers:
#!/usr/bin/perl
use LWP::Simple;
$_=get('http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml');
You may want to replace the XML address with your
preferred choice, but at this point we’ll have the entire XML page in
$_. Now we can run a loop. While the script can still find the start of
a new article (<article id="ARTICLE_ID">), it will find each piece of
information (headline text, source URL, etc.) and place it in its own
array.
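The loop relies on two of Perl's special variables: after a successful match, $` holds the text before the match and $' holds the text after it. A minimal sketch of that pattern on an invented one-line string (the sample data and the variable names $data, $rest and $url are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample standing in for a fragment of the feed
my $data = '<url>http://example.com/a1</url><source>Example News</source>';

$data =~ m/<url>/;      # find the opening tag
my $rest = $';          # everything after '<url>'
$rest =~ m#</url>#;     # find the closing tag
my $url = $`;           # everything before '</url>', i.e. the URL itself
my $remaining = $';     # the data left over for the next field

print "URL: $url\n";
print "Remaining: $remaining\n";
```

The loop that follows applies this same find, keep the remainder, find again pattern to each tag in turn.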
$ArticleNumber=0; #Index of the article currently being stored
while (m/<article id="/) { #Find start of new article
#First let's get the URL
$_=$'; #Now $_ contains all data after the latest '<article id="'
m/<url>/; #Get first piece of article data - a link
$_=$'; #$_ contains URL and rest of data
m#</url>#; #$` contains text before latest find of '</url>' and $' contains text after
$URL[$ArticleNumber] = $`; #$URL[$ArticleNumber] contains the article URL
#Now retrieve headline text
$_=$'; #Set $_ to contain data after last find
m/<headline_text>/; #Get the headline start
$_=$'; #$_ contains headline and rest of data
m#</headline_text>#; #$` contains text before latest find of '</headline_text>' and $' contains text after
$Headline[$ArticleNumber] = $`; #$Headline[$ArticleNumber] contains headline
#Now retrieve source of article
$_=$'; #Set $_ to contain data after last find
m/<source>/; #Get the source start
$_=$'; #$_ contains source and rest of data
m#</source>#; #$` contains text before find of '</source>' and $' contains text after
$Source[$ArticleNumber] = $`; #$Source[$ArticleNumber] contains article headline source
#Now retrieve media type of article
$_=$'; #Set $_ to contain data after last find
m/<media_type>/; #Get the media type start
$_=$'; #$_ contains media type and rest of data
m#</media_type>#; #$` contains text before find of '</media_type>' and $' contains text after
$MediaType[$ArticleNumber] = $`; #$MediaType[$ArticleNumber] contains the article's media type
#Now retrieve tagline of article
$_=$'; #Set $_ to contain data after last find
m/<tagline>/; #Get the tagline start
$_=$'; #$_ contains tagline and rest of data
m#</tagline>#; #$` contains text before find of '</tagline>' and $' contains text after
$Tagline[$ArticleNumber] = $`; #$Tagline[$ArticleNumber] contains the article's tagline
#Now retrieve document URL of article
$_=$'; #Set $_ to contain data after last find
m/<document_url>/; #Get the document URL start
$_=$'; #$_ contains document URL and rest of data
m#</document_url>#; #$` contains text before find of '</document_url>' and $' contains text after
$DocumentURL[$ArticleNumber] = $`; #$DocumentURL[$ArticleNumber] contains the article's document URL
#Now retrieve harvest time of article
$_=$'; #Set $_ to contain data after last find
m/<harvest_time>/; #Get the harvest time start
$_=$'; #$_ contains harvest time and rest of data
m#</harvest_time>#; #$` contains text before find of '</harvest_time>' and $' contains text after
$HarvestTime[$ArticleNumber] = $`; #$HarvestTime[$ArticleNumber] contains the article's time of harvest
#Now retrieve access registration of article
$_=$'; #Set $_ to contain data after last find
m/<access_registration>/; #Get the access registration start
$_=$'; #$_ contains access registration and rest of data
m#</access_registration>#; #$` contains text before find of '</access_registration>' and $' contains text after
$AccessRegistration[$ArticleNumber] = $`; #$AccessRegistration[$ArticleNumber] contains the article's access registration
#Now retrieve access status of article
$_=$'; #Set $_ to contain data after last find
m/<access_status>/; #Get the access status start
$_=$'; #$_ contains access status and rest of data
m#</access_status>#; #$` contains text before find of '</access_status>' and $' contains text after
$AccessStatus[$ArticleNumber] = $`; #$AccessStatus[$ArticleNumber] contains the article's access status
$ArticleNumber++; #Increment the array index so the next article is stored in a new slot
}
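The same extraction can be written more compactly by capturing each article block once and then pulling out its fields in a loop. The following is only a sketch under stated assumptions: the sample XML is invented to match the layout shown earlier, the function name parse_articles and the array-of-hashes layout are illustrative, and regular expressions like these cope only with simple, predictable feeds rather than arbitrary XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Field names taken from the Moreover article layout shown earlier
my @FIELDS = qw(url headline_text source media_type tagline
                document_url harvest_time access_registration access_status);

# Hypothetical helper: return a list of hash references, one per <article>
sub parse_articles {
    my ($xml) = @_;
    my @articles;
    # /g walks through every article; /s lets '.' match across newlines
    while ($xml =~ m{<article id="([^"]*)">(.*?)</article>}gs) {
        my ($id, $body) = ($1, $2);
        my %article = (id => $id);
        for my $field (@FIELDS) {
            # A missing tag simply leaves the field undefined
            ($article{$field}) = $body =~ m{<$field>(.*?)</$field>}s;
        }
        push @articles, \%article;
    }
    return @articles;
}

# Invented sample data standing in for the downloaded feed
my $sample = <<'XML';
<article id="_1">
<url>http://example.com/a1</url>
<headline_text>First headline</headline_text>
<source>Example News</source>
</article>
<article id="_2">
<url>http://example.com/a2</url>
<headline_text>Second headline</headline_text>
<source>Example Wire</source>
</article>
XML

for my $article (parse_articles($sample)) {
    print "$article->{headline_text} ($article->{source})\n";
}
```

Keeping each article in one hash avoids having to keep several arrays in step. The rest of this article continues with the parallel arrays built by the original loop.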
We now have 9 arrays of article data, and the item at a given
index in one array corresponds to the item at the same index in every
other array. For example, the URL of the article behind $Headline[5] can
be found in $URL[5], and the address of its source in $DocumentURL[5].
What can be done with the data now held in the arrays? The main thing
you’ll probably want to do is simply display it. A simple piece of code
which can follow the last loop is:
print "Status: 200 OK\nContent-type: text/html\n\n"; # HTTP headers for viewing page through a web browser
for ($Article=0; $Article < $ArticleNumber; $Article++) { # Go through each article
print "<A HREF=\"$URL[$Article]\">$Headline[$Article]</A><BR>$HarvestTime[$Article] from <A HREF=\"$DocumentURL[$Article]\">$Source[$Article]</A><BR><BR>";
}
The possibilities for XML are clearly endless:
limitless distribution and representation of data from sources anywhere in
the world, easily parsed and automatically updatable. What is more,
Moreover.com are just one of many suppliers of harvested data, and the
market is growing as Microsoft promote this area. The outlook is certainly
great for XML and content kings in the online world.
Copyright © 2001 Adam Waude. All Rights Reserved.
Author Information:
Adam Waude - adamwaude@hotmail.com
See Also:
W3 Standards: www.w3.org/XML
The centre for standards and standards-setting information from W3.org