.NET Screen Scraping in depth
There are many articles on data scraping; today we will be looking at the
different techniques. The WebRequest class is provided for accessing data via
the web, and we will be looking at two classes built around it: WebClient and
HttpWebRequest (paired with HttpWebResponse). HttpWebRequest derives from
WebRequest, while WebClient is a higher-level wrapper that uses WebRequest
internally. Both classes can handle most scraping tasks; it is more a case of
which to use for which job.
Here we will cover the common tasks (simple pages, forms, headers and cookies)
with the two classes and see which comes out best.
Simple scraping
Here we are looking at just scraping a simple page, where you want to do nothing
but get back the page and do not have to pass up any data.
Comparing the code for the two classes, you will notice that at this level there
are only small differences in how they are used, and WebClient has the slight
edge because its code is slightly simpler.
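As a rough illustration of that comparison, the following C# sketch fetches a
page both ways. The class name, method names and the example.com URL are
placeholders chosen for this example and are not taken from the original
listings. On recent .NET versions both types are flagged as obsolete in favour
of HttpClient, although they still compile and run.

```csharp
using System;
using System.IO;
using System.Net;

static class SimpleScrape
{
    // WebClient: a single call returns the whole page body as a string.
    public static string WithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // HttpWebRequest: create the request, get the response, then read its stream.
    public static string WithHttpWebRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main(string[] args)
    {
        // Placeholder URL (an assumption); pass a real one as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/";
        Console.WriteLine(WithWebClient(url).Length);
        Console.WriteLine(WithHttpWebRequest(url).Length);
    }
}
```

Both methods return the same text; the only real difference is how much of the
plumbing you write yourself.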
Forms
You have seen how simple it is to scrape any page using either WebClient or
HttpWebRequest; now we will be looking into how you pass form data to the
page you wish to scrape.
Here we are looking at passing the form data as a query string.
Comparing the code for the two classes, the difference between them now becomes
apparent. So which is better, Client or Request? Since Request passes the
data on the URL it is much simpler, but if you have many fields, things may
become a little unclear. Client, however, is much more structured in how it
passes the data (through its QueryString collection), and if you wanted, you
could always pass the data on the URL here too. Therefore, Client is now
looking like the better solution.
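The two approaches might look something like the sketch below. The field names,
the example.com URL and the BuildUrl helper are illustrative assumptions; only
WebClient.QueryString and the request classes themselves come from the
framework.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.IO;
using System.Net;

static class QueryFormScrape
{
    // WebClient: copy the fields into the QueryString collection and download.
    public static string WithWebClient(string url, NameValueCollection fields)
    {
        using (var client = new WebClient())
        {
            client.QueryString.Add(fields);
            return client.DownloadString(url);
        }
    }

    // Hypothetical helper: build "url?name=value&name=value" by hand.
    public static string BuildUrl(string url, NameValueCollection fields)
    {
        var pairs = new List<string>();
        foreach (string key in fields.AllKeys)
        {
            pairs.Add(Uri.EscapeDataString(key) + "=" + Uri.EscapeDataString(fields[key]));
        }
        return url + "?" + string.Join("&", pairs);
    }

    // HttpWebRequest: the form data simply travels on the URL.
    public static string WithHttpWebRequest(string url, NameValueCollection fields)
    {
        var request = (HttpWebRequest)WebRequest.Create(BuildUrl(url, fields));
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Made-up form fields; no request is sent here, only the URL is shown.
        var fields = new NameValueCollection();
        fields.Add("q", "screen scraping");
        fields.Add("page", "2");
        Console.WriteLine(BuildUrl("http://example.com/search", fields));
    }
}
```

With two fields the hand-built URL is still readable; with a dozen, the
collection-based version is easier to follow.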
Posted Forms
You should now have scraped a page and scraped the result of a passed form.
Here we are looking at posting the form data.
Client now passes the form as a byte array (the encoded form string) and, oddly,
returns the page in the same form. To post data using Request you still use the
same byte-encoded array but have to manually open a stream and write the data
yourself. This does, however, give you a good picture of how things actually
work.
So which is the best now? Client is still the most concise, but Request is
at least consistent and clearer as to what is happening. So for posting a form
it seems to be a tie.
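A minimal sketch of both posting styles follows. It assumes the form string is
already URL-encoded and that the server replies in UTF-8; the class and method
names are invented for this example.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PostedFormScrape
{
    // Turn an already URL-encoded form string ("a=1&b=2") into the request body bytes.
    public static byte[] EncodeForm(string form)
    {
        return Encoding.ASCII.GetBytes(form);
    }

    // WebClient: bytes go in, bytes come back.
    public static string WithWebClient(string url, string form)
    {
        using (var client = new WebClient())
        {
            client.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
            byte[] reply = client.UploadData(url, "POST", EncodeForm(form));
            return Encoding.UTF8.GetString(reply); // assumes a UTF-8 response
        }
    }

    // HttpWebRequest: open the request stream and write the body yourself.
    public static string WithHttpWebRequest(string url, string form)
    {
        byte[] body = EncodeForm(form);
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: <url> <form-data>");
            return;
        }
        Console.WriteLine(WithWebClient(args[0], args[1]));
        Console.WriteLine(WithHttpWebRequest(args[0], args[1]));
    }
}
```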
Passing Headers
Here we are looking at how to pass values in the header. Header values go
unnoticed by users but can carry important information such as the browser
type.
Client and Request deal with headers in the same way. However, with Request not
all values in the header can be set with this method; some of the more standard
header values, such as User-Agent, have their own property. This helps to make
things clearer, and when you start to deal with cookies this will be a big help.
When you run the standard header example, you will notice that not all header
information can be set in code. Values such as Host are preset and cannot be
modified.
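The sketch below shows one way the header handling could look. The X-Example
header name and the ExampleScraper/1.0 user-agent string are made up for
illustration; Main only builds the objects and sends nothing.

```csharp
using System;
using System.Net;

static class HeaderScrape
{
    // WebClient: every header, including User-Agent, goes through Headers.
    public static WebClient BuildClient()
    {
        var client = new WebClient();
        client.Headers.Add("X-Example", "demo");                // made-up custom header
        client.Headers.Add("User-Agent", "ExampleScraper/1.0"); // made-up agent string
        return client;
    }

    // HttpWebRequest: custom headers go through Headers, User-Agent has its own property.
    public static HttpWebRequest BuildRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Headers.Add("X-Example", "demo");
        request.UserAgent = "ExampleScraper/1.0";
        return request;
    }

    static void Main()
    {
        // Creating the request does not contact the server; it only configures it.
        HttpWebRequest request = BuildRequest("http://example.com/");
        Console.WriteLine(request.UserAgent);
        Console.WriteLine(request.Headers["X-Example"]);
    }
}
```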
Scraping & passing cookies
Finally, we are looking at how to pass values in cookies. One of the most
important uses of cookies, and one that can cause trouble when scraping, is
session variables.
Again, Client is much simpler than the Request method. It sets the cookie value
directly in the header, whereas Request uses a cookie container, which may make
things a little clearer but also has more powerful implications.
The best implication of using the cookie container is that if you are going to
be scraping multiple sites you can keep all your cookies in the same container;
the Request then only passes up the cookies with the corresponding domain.
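To make the domain matching concrete, here is a small sketch. The cookie names,
values and the example.com / example.org domains are invented for this example.
Main only asks the container which cookies it would send to each site, so no
request is made.

```csharp
using System;
using System.IO;
using System.Net;

static class CookieScrape
{
    // WebClient: the cookie is written straight into the Cookie header.
    public static string WithWebClient(string url, string cookie)
    {
        using (var client = new WebClient())
        {
            client.Headers.Add(HttpRequestHeader.Cookie, cookie); // e.g. "session=abc123"
            return client.DownloadString(url);
        }
    }

    // One shared container holding cookies for two different (made-up) sites.
    public static CookieContainer BuildContainer()
    {
        var container = new CookieContainer();
        container.Add(new Cookie("session", "abc123", "/", "example.com"));
        container.Add(new Cookie("session", "xyz789", "/", "example.org"));
        return container;
    }

    // HttpWebRequest: attach the container and let it pick the matching cookies.
    public static string WithHttpWebRequest(string url, CookieContainer container)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = container;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Show which cookie the container would send to each site.
        CookieContainer container = BuildContainer();
        Console.WriteLine(container.GetCookieHeader(new Uri("http://example.com/")));
        Console.WriteLine(container.GetCookieHeader(new Uri("http://example.org/")));
    }
}
```

A side benefit of the container is that cookies set by a response are stored
back into it, which is what keeps a session alive across several requests.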
Again, it is a close call between which to use. Since WebClient is built on top
of WebRequest, the same machinery does the work underneath, so performance is
not much of an issue. In conclusion, if you are doing simple scraping WebClient
would appear to be the most convenient, but it can become unclear if you are
passing lots of values in forms or cookies. HttpWebRequest is a little more
long-winded, but through its use of classes it is a little clearer. Therefore,
the choice is yours.
Scraping in one line using temporary objects
After finishing the series on screen scraping the thought arose: how small can
you make a working scrape? With this in mind the most extreme example was
chosen: is a one-line scrape doable?
Surprisingly the answer is yes, and more surprisingly, it is still quite clear.
In the conventional scrape, variables are declared for the WebClient, stream
and stream reader objects, and they are only used once before they are
disposed of.
In the one-line example, when an object is created it is no longer assigned to
a variable; it is just used. Placing () around the object creation allows the
object to be accessed as a temporary object. Since there is no other reference
to the object, it becomes eligible for garbage collection once the statement
has executed, although it is never explicitly disposed.
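A sketch of the two styles side by side, using invented method names and a
placeholder URL. In C# the extra parentheses around each object creation are
optional; they are kept here to mirror the description above.

```csharp
using System;
using System.IO;
using System.Net;

static class OneLineScrape
{
    // Conventional form: each object gets its own variable and is cleaned up by hand.
    public static string Conventional(string url)
    {
        WebClient client = new WebClient();
        Stream stream = client.OpenRead(url);
        StreamReader reader = new StreamReader(stream);
        string html = reader.ReadToEnd();
        reader.Close();   // closing the reader also closes the underlying stream
        client.Dispose();
        return html;
    }

    // One-line form: the temporary objects are created, used once and never named.
    public static string OneLine(string url)
    {
        return (new StreamReader((new WebClient()).OpenRead(url))).ReadToEnd();
    }

    static void Main(string[] args)
    {
        // Placeholder URL (an assumption); pass a real one as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/";
        Console.WriteLine(OneLine(url).Length);
    }
}
```

The one-line form trades deterministic cleanup for brevity, so it suits quick
scripts better than long-running services.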



















