How to scrape with MyHeadlines

A reader asked: how do you scrape with MyHeadlines?
MyHeadlines has its own engine for scraping websites into an RSS/XML feed.
How does this work? MyHeadlines has a tutorial, and once you understand it you can create scraped RSS feeds.
To explain scraping a little, I will use as an example a feed that uses almost all of the available options.
How to scrape GOOGLE WORLD News
Find the URL. The less overhead the better, so I use the lite version of the page:
http://news.google.nl/news/en/us/mainlite.html

{dump}<title>{site_name}</title>
This line is usually my ‘test’ line. If it works (the title shows up within MyHeadlines), I know I can probably scrape the page.

{dump}<a name=WORLD>
I’m searching for <a name=WORLD>; this line is unique on the page.
After this line the scraper should find the link, title and/or description
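
To show why the marker has to be unique, here is a minimal Python sketch of the idea. It is not the MyHeadlines engine, whose internals I have not seen. The helper name text_after_marker and the page fragment are made up for illustration. Everything before the first occurrence of the marker is thrown away, so a marker that appears earlier on the page would send the scraper to the wrong section.

```python
def text_after_marker(page: str, marker: str) -> str:
    """Return the text following the first occurrence of marker ('' if absent)."""
    _, found, rest = page.partition(marker)
    return rest if found else ""


# Hypothetical page fragment, not the real Google News markup.
page = (
    '<a name=TOP><a class=y href="http://example.com/top">Top story</a>'
    '<a name=WORLD><a class=y href="http://example.com/world">World story</a>'
)

# Only the part after the WORLD anchor is left for the link/title patterns.
remainder = text_after_marker(page, "<a name=WORLD>")
print(remainder)
```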

{dump}<a class=y href="{link_1}">{title_1}</a>
Here I define where the scraper should pick up the link and the title.

{dump}</b><br>{desc_1}<br>
And here the description.

So the complete template for a correctly scraped Google News feed would be:

{dump}<title>{site_name}</title>
{dump}<a name=WORLD>
{dump}<a class=y href="{link_1}">{title_1}</a>
{dump}</b><br>{desc_1}<br>

{dump}<a class=y href="{link_2}">{title_2}</a>
{dump}</b><br>{desc_2}<br>

Please note that you NEED to add {dump} at the start of every new line.
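
If you want to check a template before pasting it into MyHeadlines, the sketch below imitates the matching in Python. It rests on my own assumptions, not on the MyHeadlines source: {dump} skips any text, every other {name} captures the shortest text up to the next literal, and the template lines are matched in order. The names template_to_regex and scrape, and the sample page, are hypothetical. The template uses straight quotes around the href value, as they would appear in the page source.

```python
import re

# Matches placeholders such as {dump}, {link_1} or {site_name}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Translate a template into a regex (assumed semantics, see lead-in)."""
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(template):
        # Literal text between placeholders must match exactly.
        parts.append(re.escape(template[pos:match.start()]))
        name = match.group(1)
        if name == "dump":
            parts.append(r".*?")  # skip anything
        else:
            parts.append(rf"(?P<{name}>.*?)")  # capture under this name
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def scrape(page: str, template_lines: list) -> dict:
    """Return the captured fields, or an empty dict if the template fails."""
    pattern = template_to_regex("".join(template_lines))
    match = pattern.search(page)
    return match.groupdict() if match else {}


TEMPLATE_LINES = [
    '{dump}<title>{site_name}</title>',
    '{dump}<a name=WORLD>',
    '{dump}<a class=y href="{link_1}">{title_1}</a>',
    '{dump}</b><br>{desc_1}<br>',
    '{dump}<a class=y href="{link_2}">{title_2}</a>',
    '{dump}</b><br>{desc_2}<br>',
]

# Made-up page in the spirit of the lite news page, for illustration only.
SAMPLE_PAGE = """<html><head><title>Google News</title></head><body>
<a name=TOP></a>
<a class=y href="http://example.com/top">Top story</a>
<b>Source</b><br>Top description<br>
<a name=WORLD></a>
<a class=y href="http://example.com/a">First headline</a>
<b>Source A</b><br>First description<br>
<a class=y href="http://example.com/b">Second headline</a>
<b>Source B</b><br>Second description<br>
</body></html>"""

fields = scrape(SAMPLE_PAGE, TEMPLATE_LINES)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Because each line starts with {dump}, the lines can simply be joined: whatever sits between two patterns on the page is skipped. If one literal (for example </b><br>) is missing from the page, the whole match fails and scrape returns an empty dict, which matches my experience that one wrong line breaks the feed.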

Output:
Sharon won't tolerate roadmap violations

Jerusalem – Israeli Prime Minister Ariel Sharon warned his government would not tolerate the slightest Palestinian violation of the roadmap for peace, the Israeli media reported on Friday.

Cambodia's Sihanouk to stay out of poll deadlock

Diplomats have said the king, who commands wide respect as the father of national reconciliation in the war-torn Southeast Asian state, could end the deadlock caused by Prime Minister Hun Sen's need for a coalition partner.

One thought on “How to scrape with MyHeadlines”

  1. Posts like this is why my family stopped visiting my site…. Sure my site is supposed to be Agar Family News, but 90% of my traffic is from like minded news junkies who’ve discovered the miracle of Content Syndication. Great Post! Any questions about building scrapers can be directed to either Dennis, or myself.

    Cheers,
    Mike

    http://www.jmagar.com (MyHeadlines Developer)
