Screen Scraping Lists
There are many articles about data scraping that cover returning an entire page or a particular element. Building on those articles, we will use grouping constructs to easily retrieve a list of headlines from Guardian Unlimited.
You will find the function to scrape the HTML functionally the same as in the other articles, but with the addition of a try-catch statement, as you can never be too cautious when using resources outside of your control.
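The scraping function itself is not reproduced here, so the following is only a minimal sketch of what a getHTML with a try-catch might look like. It assumes System.Net.WebClient is used for the download and that an empty string is an acceptable result on failure; both choices are assumptions rather than the code from the earlier articles.

```vbnet
Imports System
Imports System.Net

Module HtmlFetcher
    ' Downloads the page at strURL and returns its markup.
    ' Returns an empty string if the request fails (assumed behaviour).
    Public Function getHTML(ByVal strURL As String) As String
        Try
            Using client As New WebClient()
                Return client.DownloadString(strURL)
            End Using
        Catch ex As WebException
            ' The remote site is outside our control, so a failed request
            ' is treated as "no content" rather than crashing the page.
            Return String.Empty
        End Try
    End Function
End Module
```

Returning an empty string means the regular expression simply finds no matches when the site is unavailable, so the caller ends up with an empty table instead of an error.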
News Source Code
The getNews function is where things differ. This is where all the work takes place, and it is what we will concentrate on here. It is surprisingly small considering how large it would be without grouping constructs.
27  Private Function getNews() As System.Data.DataTable
28      Dim rowNewsItem As System.Data.DataRow
29
30      'create the table to be returned
31      getNews = New System.Data.DataTable()
32      getNews.Columns.Add("strURL")
33      getNews.Columns.Add("strHeadline")
34      getNews.Columns.Add("strSummary")
35
36      'set up the regular expression for the news page
37      Dim strRegex As String
38      strRegex = "<A href='(?<strURL>[^']+)'[\s]*?<(?<strHeadline>[^<]+)[\s\w\W]*?(?<strSummary>[^>]+)>"
39      Dim Regex As System.Text.RegularExpressions.Regex
40      Regex = New System.Text.RegularExpressions.Regex(strRegex, System.Text.RegularExpressions.RegexOptions.Compiled)
41
42      'scrape the data
43      Dim Matches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(getHTML("http://www.guardian.co.uk/syndication/service/0,11065,331-0-5,00.html"))
44      Dim Match As System.Text.RegularExpressions.Match
45
46      'loop through all matches filling out the table as you go
47      For Each Match In Matches
48          rowNewsItem = getNews.NewRow()
49          rowNewsItem("strURL") = Match.Groups("strURL").Value
50          rowNewsItem("strHeadline") = Match.Groups("strHeadline").Value
51          rowNewsItem("strSummary") = Match.Groups("strSummary").Value
52          getNews.Rows.Add(rowNewsItem)
53      Next
54  End Function
Lines 31-34 deal with the creation of a DataTable. DataTables are a useful feature of .NET: they allow you to pass data easily between functions without losing any clarity.
Look at line 38 and you will see the strRegex String. It is a monster of an expression, but one hell of a powerful one.
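To see why a DataTable keeps things clear, here is a small sketch in which one function builds a table with the same three columns and another reads it back by column name. The module name, function names and row values are made up for illustration.

```vbnet
Imports System
Imports System.Data

Module DataTableDemo
    ' Builds a one-row table with the same three columns as getNews.
    Public Function buildSample() As DataTable
        Dim table As New DataTable()
        table.Columns.Add("strURL")
        table.Columns.Add("strHeadline")
        table.Columns.Add("strSummary")

        Dim row As DataRow = table.NewRow()
        row("strURL") = "http://example.com/story"   ' made-up value
        row("strHeadline") = "Example headline"      ' made-up value
        row("strSummary") = "Example summary"        ' made-up value
        table.Rows.Add(row)
        Return table
    End Function

    ' A consumer only needs the column names, not the layout of the source page.
    Public Function describe(ByVal table As DataTable) As String
        Dim result As String = ""
        For Each row As DataRow In table.Rows
            result &= CStr(row("strHeadline")) & " -> " & CStr(row("strURL")) & Environment.NewLine
        Next
        Return result
    End Function

    Sub Main()
        Console.Write(describe(buildSample()))   ' Example headline -> http://example.com/story
    End Sub
End Module
```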
Let us take a closer look at the construction of strRegex. The String can be broken down into four basic parts.
Literals
<A href='(?
These will be exactly matched against the string, helping to locate the text in which you are interested.
Character sets
<A href='(?
The contents of the [] are the set of characters you wish to match; a leading ^, as in [^'], inverts the set so that it matches any character except those listed. In the above example, \s represents any white space character, \w represents any word character and \W represents any non-word character.
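The character sets can be tried out on their own. The sketch below runs a few of them against a made-up fragment of markup; the helper firstMatch and the sample string are assumptions added purely for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterSetDemo
    ' Returns the first match of strPattern in strInput, or "" if there is none.
    Public Function firstMatch(ByVal strInput As String, ByVal strPattern As String) As String
        Dim m As Match = Regex.Match(strInput, strPattern)
        If m.Success Then Return m.Value
        Return ""
    End Function

    Sub Main()
        Dim strSample As String = "href='news.html' class='top'"   ' made-up input

        Console.WriteLine(firstMatch(strSample, "[^']+"))     ' everything before the first quote: href=
        Console.WriteLine(firstMatch(strSample, "'[^']+'"))   ' the first quoted value: 'news.html'
        Console.WriteLine(firstMatch(strSample, "[\w]+"))     ' the first run of word characters: href
        Console.WriteLine(firstMatch(strSample, "[\W]+"))     ' the first run of non-word characters: ='
    End Sub
End Module
```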
Quantifiers
<A href='(?
The *? indicates that the preceding character set is repeated zero or more times. It differs from * on its own in that * will try to match the longest possible string, whereas *? matches the shortest. What is the difference, you may ask? Well, let us imagine you want to get the first cell in a table.
"<TD>[\s\w\W]*</TD>" looks like it should work, but it would match from the <TD> of the first cell to the </TD> of the last. Using "<TD>[\s\w\W]*?</TD>" instead stops at the first </TD>.
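The table cell example can be checked directly. This is a small sketch that runs both forms of the quantifier against a made-up two-cell row; the module name, function name and sample row are assumptions for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Matches a table cell with either the lazy (*?) or the greedy (*) quantifier.
    Public Function firstCell(ByVal strHtml As String, ByVal blnLazy As Boolean) As String
        Dim strPattern As String
        If blnLazy Then
            strPattern = "<TD>[\s\w\W]*?</TD>"
        Else
            strPattern = "<TD>[\s\w\W]*</TD>"
        End If
        Return Regex.Match(strHtml, strPattern).Value
    End Function

    Sub Main()
        Dim strRow As String = "<TR><TD>one</TD><TD>two</TD></TR>"   ' made-up input

        Console.WriteLine(firstCell(strRow, False))   ' greedy: <TD>one</TD><TD>two</TD>
        Console.WriteLine(firstCell(strRow, True))    ' lazy:   <TD>one</TD>
    End Sub
End Module
```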
Grouping Constructs
<A href='(?
Now this is where the magic happens. Grouping constructs enable us to extract the information we are after without all that messing about with IndexOf or Substring and all the validation that goes along with it. (?<strURL>[^']+) takes the value from the current position up to, but not including, the ' and assigns it to the strURL group, meaning you can now refer to the data by name.
Once you have matched the data, all that is left to do is loop through the collection and fill out the table.
So there you have it, a function to retrieve the news as a DataTable. As an example, the DataList has been bound to a custom control, making a nice little control that can add content to any page.
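A named group is easiest to appreciate on a single link. The sketch below uses a simplified pattern and a made-up piece of markup, not the full strRegex from getNews, and the function name firstLink is an assumption. Note that group names are case-sensitive, so the name in the pattern has to match the name passed to Groups exactly.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupDemo
    ' Returns the URL and link text of the first anchor in strHtml,
    ' or an empty array if no anchor is found.
    Public Function firstLink(ByVal strHtml As String) As String()
        ' Simplified illustrative pattern with two named groups.
        Dim strPattern As String = "<A href='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>"
        Dim m As Match = Regex.Match(strHtml, strPattern)
        If Not m.Success Then Return New String() {}
        Return New String() {m.Groups("strURL").Value, m.Groups("strHeadline").Value}
    End Function

    Sub Main()
        Dim strSample As String = "<P><A href='/story/1.html'>First headline</A></P>"   ' made-up markup
        Dim parts As String() = firstLink(strSample)

        Console.WriteLine(parts(0))   ' /story/1.html
        Console.WriteLine(parts(1))   ' First headline
    End Sub
End Module
```

No IndexOf or Substring calls are needed; the pattern itself says where each value starts and stops.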



















