/**
 * @article Screen-scraping
 * counterpunch.org and building
 * a ghetto RSS feed
 *
 * @since February 9, 2011
 * @package Shortpost
 *
 * My first try at building a pirate
 * RSS feed from a site-scraper.
 *
 * @tags scrapers
 * @comments 5 comments
 *
 */
One of my favorite political news sites for several years has been Counterpunch, and lately, with all the news coming out of Tunisia, Egypt, and the Arab world, I’ve found myself visiting their site a couple of times to read up on current events and analysis.
Actually, it’s a site that I would like to keep up with more regularly, and I probably would, too, if the 1997-style website didn’t make that so hard. Seriously, it’s still maintained in Adobe PageMill. In 2011. The last time I opened that program was probably 1998, and even then it was dated.
So I got the idea to write myself a screen-scraper to scrape the front page and output all the article links as an RSS feed. I used PHP Simple HTML DOM Parser to gather the content, and a really basic RSS Writer class to format the XML output. It’s really basic (just the post titles, authors, and links in an RSS feed) and only took me an hour or so, but even so, it makes the site 100 times easier for me to keep up with.
Edit for those of you finding this page looking for a working counterpunch.org feed: The problem with unauthorized feeds based on scrapers is that when the site changes structure for one reason or another, the feed stops working. I got tired of tweaking my scraper all the time to keep up, so I’m not maintaining it any more. I’d recommend using the official feed (which seems to be working these days) at
http://www.counterpunch.org/category/article/feed
If I worked out a caching solution so that I could scrape each of the individual articles and save an excerpt along with the feed, that would be great, but to do that without caching the feed would just overwhelm my server (not to mention Counterpunch’s). In the meantime, feel free to read along if you’re interested:
http://goldenapplesdesign.com/workinprogress/DOM/counterpunch.php
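For what it’s worth, the caching idea could be sketched roughly like this. It’s a minimal file-based cache, not something the scraper in this post actually uses: the `cached_fetch` helper, the cache file location, and the one-hour lifetime are all made up for illustration, and the closure is just a stand-in for the real (expensive) scrape.

```php
<?php
// Hypothetical helper (not part of the original scraper): return the cached
// string while it is fresh, otherwise rebuild it with $build and store it.
function cached_fetch(string $cache_file, int $ttl_seconds, callable $build): string
{
    // PHP caches stat() results per request, so clear them for this file first.
    clearstatcache(true, $cache_file);

    // Serve the cached copy while it is younger than the TTL.
    if (is_file($cache_file) && (time() - filemtime($cache_file)) < $ttl_seconds) {
        $cached = file_get_contents($cache_file);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache is missing, stale, or unreadable: rebuild and store the result.
    $fresh = $build();
    file_put_contents($cache_file, $fresh, LOCK_EX);
    return $fresh;
}

// Usage: the path and the one-hour TTL are assumptions for this example.
$cache_file = sys_get_temp_dir() . '/counterpunch-feed.xml';
$xml = cached_fetch($cache_file, 3600, function () {
    // Stand-in for scraping the site and building the feed.
    return '<rss version="2.0"></rss>';
});
echo $xml;
```

With something like this in front of the scraper, the remote site would be hit at most once per hour no matter how often a feed reader polls the script.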
If you’re interested in seeing how easy it was, here’s the source code:
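<?php
include('simple_html_dom.php');
include('rss-writer.php');

// Fetch and parse the front page; bail out if the request fails
$html = file_get_html('http://counterpunch.org');
if ( !$html ) die('Could not load counterpunch.org');

$feed = new RSS();
$feed->title = "CounterPunch";
$feed->link = "http://counterpunch.org";
$feed->description = "Tells the Facts, Names the Names";

// Find the first 12 article paragraphs
foreach ( array_slice( $html->find('td.style25 p.style2'), 0, 12 ) as $article ) {
    $article_link = $article->find('a');
    // Skip paragraphs that contain no usable link
    if ( empty( $article_link ) || !$article_link[0]->href ) continue;
    $link = $article_link[0]->href;
    // Make relative links absolute
    if ( !preg_match( '#^https?://#i', $link ) ) $link = 'http://counterpunch.org/' . ltrim( $link, '/' );
    $title = trim( $article_link[0]->plaintext );
    // Whatever text is left after removing the title is the author byline.
    // It is not attached to the item here; how to do that depends on the RSS writer class.
    $author = trim( str_replace( $title, '', $article->plaintext ) );
    $item = new RSSItem();
    $item->title = $title;
    $item->link = $link;
    // No date is scraped here, so stamp each item with the time of the scrape
    $item->setPubDate( time() );
    $feed->addItem($item);
}
echo $feed->serve();
?>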
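One step in there that is easy to get subtly wrong is turning the scraped `href` values into absolute URLs. A plain substring check for `http://` misses `https://` links and would also be fooled by a relative link that merely contains a URL somewhere in its query string. Here is a small sketch of how that step could be pulled out into its own function. The `absolute_link` name and the sample paths are made up for illustration, and it assumes every relative link is relative to the site root, which may not hold for every page.

```php
<?php
// Hypothetical helper: turn an href scraped from the front page into an absolute URL.
// Assumes relative links are relative to the site root.
function absolute_link(string $href, string $base = 'http://counterpunch.org/'): string
{
    // Already absolute (http or https at the very start): leave it untouched.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Otherwise join it to the base with exactly one slash in between.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

echo absolute_link('article.html'), "\n";              // http://counterpunch.org/article.html
echo absolute_link('/article.html'), "\n";             // http://counterpunch.org/article.html
echo absolute_link('https://example.com/a.html'), "\n"; // https://example.com/a.html
```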
I’ve been trying to find a working feed to counterpunch for a while.
This is great – thanks for putting it up!
wpgrab.com is the easiest web scraper
Has held up perfectly, for the first year I’ve had it.
There are several ways to read an RSS feed in PHP, but this one is surely one of the easiest.
foreach ( $channel->item as $entry ) { echo "<a href='$entry->link' title='$entry->title'>" . $entry->title . "</a>"; }
Source:
http://phphelp.co/2012/04/23/how-to-read-rss-feed-in-php/
OR
http://addr.pk/a0401