• The eternal confessions of a beautiful mind...
  • DamianM.Co.UK

    Screen Scraping Lists

    There are many articles about screen scraping, most of which cover returning an entire page or a single element. Building on those articles, we will use regular expression grouping constructs to retrieve a list of headlines from Guardian Unlimited with very little code.

    You will find the function that scrapes the HTML functionally the same as in the other articles, but with the addition of a try-catch statement, as you can never be too cautious when using resources outside of your control.
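    The scraping function itself is not reproduced in this article. As a rough sketch only, a getHTML with the try-catch in place might look something like the following. The use of System.Net.WebClient, the module name and the empty-string fallback are all assumptions made for illustration, not the original implementation.

```vb
Imports System

Module ScrapeSketch
    ' Hypothetical stand-in for getHTML; the original body is not shown in the article.
    Public Function getHTML(ByVal strURL As String) As String
        Try
            Using client As New System.Net.WebClient()
                ' Download the whole page as a single string
                Return client.DownloadString(strURL)
            End Using
        Catch ex As System.Net.WebException
            ' The remote site is outside our control, so fail soft (assumed behaviour)
            Return String.Empty
        End Try
    End Function

    Sub Main()
        Dim strHTML As String = getHTML("http://www.guardian.co.uk/syndication/service/0,11065,331-0-5,00.html")
        Console.WriteLine("Characters returned: " & strHTML.Length.ToString())
    End Sub
End Module
```

    Returning an empty string means the regular expression simply finds no matches when the request fails, so the caller gets an empty table rather than an exception.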

    News Source Code

    The getNews function is where things differ. This is where all the work takes place, and it is what we will concentrate on here. It is surprisingly small considering how large it would be without grouping constructs.

    27      Private Function getNews() As System.Data.DataTable
    28          Dim rowNewsItem As System.Data.DataRow
    29
    30          'create the table to be returned
    31          getNews = New System.Data.DataTable()
    32          getNews.Columns.Add("strURL")
    33          getNews.Columns.Add("strHeadline")
    34          getNews.Columns.Add("strSummary")
    35
    36          'set up the regular expression for the news page
    37          Dim strRegex As String
    38          strRegex = "<A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"
    39          Dim Regex As System.Text.RegularExpressions.Regex
    40          Regex = New System.Text.RegularExpressions.Regex(strRegex, System.Text.RegularExpressions.RegexOptions.Compiled)
    41
    42          'scrape the data
    43          Dim Matches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(getHTML("http://www.guardian.co.uk/syndication/service/0,11065,331-0-5,00.html"))
    44          Dim Match As System.Text.RegularExpressions.Match
    45
    46          'loop through all matches filling out the table as you go
    47          For Each Match In Matches
    48              rowNewsItem = getNews.NewRow()
    49              rowNewsItem("strURL") = Match.Groups("strURL").Value
    50              rowNewsItem("strHeadline") = Match.Groups("strHeadline").Value
    51              rowNewsItem("strSummary") = Match.Groups("strSummary").Value
    52              getNews.Rows.Add(rowNewsItem)
    53          Next
    54      End Function

    Lines 31-34 deal with the creation of a DataTable. DataTables are a useful feature of .NET: they allow you to pass data easily between functions without losing any clarity, because every value travels with its column name.
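    To show what passing data between functions looks like in practice, here is a minimal sketch. The module, the two helper functions and the sample row are made up for illustration; only the three column names come from getNews. It assumes the project references System.Data.

```vb
Imports System
Imports System.Data

Module DataTableSketch
    ' Builds a table with the same three columns that getNews uses
    Function buildSample() As DataTable
        Dim tblNews As New DataTable()
        tblNews.Columns.Add("strURL")
        tblNews.Columns.Add("strHeadline")
        tblNews.Columns.Add("strSummary")

        ' Made-up row, purely for illustration
        Dim rowNewsItem As DataRow = tblNews.NewRow()
        rowNewsItem("strURL") = "http://example.com/story"
        rowNewsItem("strHeadline") = "Example headline"
        rowNewsItem("strSummary") = "Example summary"
        tblNews.Rows.Add(rowNewsItem)
        Return tblNews
    End Function

    ' The receiving function reads the values back by column name
    Sub printNews(ByVal tblNews As DataTable)
        For Each rowNewsItem As DataRow In tblNews.Rows
            Console.WriteLine(rowNewsItem("strHeadline").ToString() & " -> " & rowNewsItem("strURL").ToString())
        Next
    End Sub

    Sub Main()
        printNews(buildSample())
    End Sub
End Module
```

    The receiving function needs to know nothing about how the table was filled, only the column names.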

    Look at line 38 and you will see the strRegex String. It is a monster of an expression, but one hell of a powerful one.

    Let us take a closer look at the construction of strRegex. The String can be broken down into four basic parts.

    Literals

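    <A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

    The literals in the expression are <A href=', the closing ', the >, </A>, <BR> and the final <. These will be exactly matched against the string, helping to locate the text in which you are interested.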

    Character sets

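    <A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

    The contents of the [] define the set of characters that may be matched at that position. A leading ^ negates the set, so [^'] matches any character except a single quote and [^<] matches any character except a <. In the above example, \s represents any white space character, \w represents any word character and \W represents any non-word character, so [\s\w\W] matches any character at all, line breaks included.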
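    A small sketch makes the behaviour of these sets easier to see. The input strings are made up for illustration, and the comparison with the dot is added here to show why [\s\w\W] is useful.

```vb
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module CharacterSetSketch
    Sub Main()
        ' Made-up input for illustration
        Dim strInput As String = "href='http://example.com/a b'"

        ' [^']+ consumes everything up to, but not including, the next single quote
        Console.WriteLine(Regex.Match(strInput, "'[^']+'").Value)
        ' -> 'http://example.com/a b'

        ' [\s\w\W] accepts any single character, including a line break
        Dim strLines As String = "a" & vbLf & "b"
        Console.WriteLine(Regex.IsMatch(strLines, "a[\s\w\W]b"))
        ' -> True

        ' By default the dot does not match a line feed
        Console.WriteLine(Regex.IsMatch(strLines, "a.b"))
        ' -> False
    End Sub
End Module
```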

    Quantifiers

    <A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

    The *? indicates that the preceding element, here a character set, is repeated zero or more times while matching as few characters as possible. This differs from the * on its own in that the * will try to match the longest possible string, whereas the *? matches the shortest. The + used inside the groups means one or more times. What is the difference, you may ask? Well, let us imagine you want to get the first cell in a table.
    "<TD>[\s\w\W]*</TD>" looks like it should work, but it would match from the <TD> of the first cell to the </TD> of the last. "<TD>[\s\w\W]*?</TD>" stops at the first </TD> it finds.
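    The table cell case can be tried out directly. This is a minimal sketch; the table row is made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module QuantifierSketch
    Sub Main()
        ' Made-up table row for illustration
        Dim strRow As String = "<TR><TD>first</TD><TD>second</TD></TR>"

        ' Greedy: * runs on to the last </TD> in the input
        Console.WriteLine(Regex.Match(strRow, "<TD>[\s\w\W]*</TD>").Value)
        ' -> <TD>first</TD><TD>second</TD>

        ' Lazy: *? stops at the first </TD> it can
        Console.WriteLine(Regex.Match(strRow, "<TD>[\s\w\W]*?</TD>").Value)
        ' -> <TD>first</TD>
    End Sub
End Module
```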

    Grouping Constructs

    <A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

    Now this is where the magic happens. Grouping constructs enable us to extract the information we are after without all that messing about with IndexOf or Substring and all the validation that goes along with it. (?<strURL>[^']+) takes the text from the current position up to, but not including, the ' and assigns it to the group named strURL, meaning you can now refer to the data by name.
    Once you have matched the data, all that is left to do is loop through the collection and fill out the table.
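    To see the named groups at work without hitting the live site, here is a small sketch that runs the expression, as broken down above, against a made-up fragment. The sample HTML is an assumption about the shape of the feed, not a copy of it.

```vb
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module GroupingSketch
    Sub Main()
        ' Made-up fragment; the real feed markup may differ
        Dim strHTML As String = _
            "<A href='http://example.com/story1'>First headline</A>" & vbLf & _
            "<BR>First summary<BR>" & vbLf & _
            "<A href='http://example.com/story2'>Second headline</A><BR>Second summary<P>"

        Dim strRegex As String = "<A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"

        ' Each match carries its three captures under the group names
        For Each m As Match In Regex.Matches(strHTML, strRegex)
            Console.WriteLine(m.Groups("strHeadline").Value & " | " & _
                              m.Groups("strURL").Value & " | " & _
                              m.Groups("strSummary").Value)
        Next
        ' -> First headline | http://example.com/story1 | First summary
        ' -> Second headline | http://example.com/story2 | Second summary
    End Sub
End Module
```

    Note that group names in .NET are case sensitive, so the name used in Groups("strURL") must match the name in the pattern exactly.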
    So there you have it, a function to retrieve the news as a DataTable. As an example, the table has been bound to a DataList inside a custom control, making a nice little control that can add content to any page.

    This entry was posted on Friday, June 15th, 2007 at 6:42 pm and is filed under ASP.NET, Coding.