• The eternal confessions of a beautiful mind...
  • DamianM.Co.UK

    .NET Screen Scraping in depth

    There are many articles on data scraping; today we will be looking at the
    different techniques. The System.Net namespace provides two classes for
    accessing data via the web that we will be looking at: WebClient and
    HttpWebRequest (together with its companion, HttpWebResponse).

    Both classes can do everything covered here; it is more a case of which to
    use for which job.


    Here we will cover everything you would want to do with the two classes and see
    which comes out best.


    Simple scraping

    Here we are looking at scraping a simple page, where you want nothing more
    than to get the page back and do not have to pass up any data.

    If you look at the code you will notice that, at this level, there are only
    small differences in how the classes are used, and WebClient has the slight
    edge because its code is slightly simpler.
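
    As the original listing is not reproduced here, the following is a minimal
    C# sketch of what a simple scrape with each class might look like. The class
    name, the method names and the example.com address are placeholders chosen
    for illustration.

```csharp
using System;
using System.IO;
using System.Net;

static class SimpleScrape
{
    // WebClient: a single call returns the whole page as a string.
    public static string WithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // HttpWebRequest: create the request, get the response, read its stream.
    public static string WithHttpWebRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        const string url = "http://example.com/"; // placeholder address
        Console.WriteLine(WithWebClient(url).Length);
        Console.WriteLine(WithHttpWebRequest(url).Length);
    }
}
```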


    Forms

    You have seen how simple it is to scrape any page using either WebClient or
    HttpWebRequest; now we will be looking into how you pass form data to the
    page you wish to scrape.


    Here we are looking at passing the form data as a query string.


    If you look at the code you will notice that the difference between the two
    classes is now becoming apparent. So which is better, Client or Request?
    Well, since Request passes the data on the URL it is much simpler, but as you
    can see from the example, if you have many fields things may become a little
    unclear. Client, however, is much more structured in how it passes the data,
    and if you wanted, you could always pass the data on the URL here too.
    Therefore, Client is now looking like the better solution.
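
    A rough C# sketch of the two approaches follows, assuming the structured
    route on the Client side is WebClient's QueryString collection. The field
    names, the BuildUrl helper and the example.com address are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.IO;
using System.Net;

static class QueryScrape
{
    // WebClient: the fields go into the QueryString collection and the
    // client adds them to the address when it sends the request.
    public static string WithWebClient(string url, NameValueCollection fields)
    {
        using (var client = new WebClient())
        {
            client.QueryString = fields;
            return client.DownloadString(url);
        }
    }

    // Hypothetical helper: writes the fields onto the URL by hand.
    public static string BuildUrl(string url, NameValueCollection fields)
    {
        var pairs = new List<string>();
        foreach (string key in fields)
        {
            pairs.Add(Uri.EscapeDataString(key) + "=" + Uri.EscapeDataString(fields[key]));
        }
        return url + "?" + string.Join("&", pairs.ToArray());
    }

    // HttpWebRequest: the data travels on the URL itself.
    public static string WithHttpWebRequest(string url, NameValueCollection fields)
    {
        var request = (HttpWebRequest)WebRequest.Create(BuildUrl(url, fields));
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        var fields = new NameValueCollection();
        fields.Add("q", "scraping"); // hypothetical form fields
        fields.Add("page", "1");
        const string url = "http://example.com/search"; // placeholder address
        Console.WriteLine(WithWebClient(url, fields).Length);
        Console.WriteLine(WithHttpWebRequest(url, fields).Length);
    }
}
```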


    Posted Forms

    You should now have scraped a page and scraped the result of a passed form.

    Here we are looking at posting the form data.


    If you look at the code you will see that Client now passes the form as a
    byte-encoded string and, oddly, returns the page in the same form. To post
    data using Request you use the same byte-encoded array but have to open a
    stream manually and write the data yourself. This does, however, give you a
    good picture of how things actually work.


    So which is the best now? Well, Client is still the most concise, but Request
    is at least consistent and clearer as to what is happening. So for posting a
    form it seems to be a tie.
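
    The posting described above might look something like the following C#
    sketch. It assumes the Client side uses WebClient.UploadData, that the form
    text is plain ASCII, and that the reply is UTF-8; the EncodeForm helper, the
    field names and the example.com address are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PostScrape
{
    // Hypothetical helper: turns "name=value&name=value" text into bytes.
    public static byte[] EncodeForm(string form)
    {
        return Encoding.ASCII.GetBytes(form);
    }

    // WebClient: bytes go up, and the page comes back as bytes too.
    public static string WithWebClient(string url, string form)
    {
        using (var client = new WebClient())
        {
            client.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
            byte[] reply = client.UploadData(url, "POST", EncodeForm(form));
            return Encoding.UTF8.GetString(reply); // assumes a UTF-8 page
        }
    }

    // HttpWebRequest: open the request stream and write the bytes yourself.
    public static string WithHttpWebRequest(string url, string form)
    {
        byte[] data = EncodeForm(form);
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = data.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(data, 0, data.Length);
        }
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        const string url = "http://example.com/form"; // placeholder address
        const string form = "name=Damian&topic=scraping"; // hypothetical fields
        Console.WriteLine(WithWebClient(url, form).Length);
        Console.WriteLine(WithHttpWebRequest(url, form).Length);
    }
}
```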


    Passing Headers

    Here we are looking at how to pass values in the header. Header values go
    unnoticed by users but can carry important information such as the browser
    type.

    Once you have looked at the code you will see that Client and Request both
    deal with headers in the same way. However, with Request not all values in
    the header can be set with this method; some of the more standard header
    values, such as User-Agent, have their own property. This helps to make
    things clearer, and when you start to deal with cookies it will be a big
    help.

    When you run the standard header example, you will notice that not all header
    information can be set in code. Values such as Host are preset and cannot be
    modified.
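
    To illustrate the point about headers, here is a small C# sketch. The
    X-Example header, the user-agent string and the example.com address are
    invented for the example; only User-Agent is named in the text above.

```csharp
using System;
using System.Net;

static class HeaderScrape
{
    // WebClient: every value, User-Agent included, goes into Headers.
    public static WebClient BuildClient(string userAgent)
    {
        var client = new WebClient();
        client.Headers.Add("User-Agent", userAgent);
        client.Headers.Add("X-Example", "demo"); // hypothetical custom header
        return client;
    }

    // HttpWebRequest: User-Agent has its own property; other values
    // still go through the Headers collection.
    public static HttpWebRequest BuildRequest(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        request.Headers.Add("X-Example", "demo"); // hypothetical custom header
        return request;
    }

    static void Main()
    {
        const string url = "http://example.com/"; // placeholder address
        const string agent = "DemoScraper/1.0";   // made-up user agent
        using (WebClient client = BuildClient(agent))
        {
            Console.WriteLine(client.DownloadString(url).Length);
        }
        using (var response = (HttpWebResponse)BuildRequest(url, agent).GetResponse())
        {
            Console.WriteLine(response.StatusCode);
        }
    }
}
```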


    Scraping & passing cookies

    Finally, we are looking at how to pass values in cookies. One of the most
    important uses of cookies, and one that can cause trouble when scraping, is
    session variables.

    Again, you will see in the code that the Client method is much simpler than
    the Request method. It sets the cookie value directly in the header, whereas
    Request uses a cookie container, which may make things a little clearer but
    also has more powerful implications.


    The best implication of using the cookie container is that if you are going
    to be scraping multiple sites you can keep all your cookies in the same
    container; the Request then passes up only the cookies for the corresponding
    domain.
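
    The cookie handling might be sketched in C# as below. The cookie names and
    values, the SampleCookies helper and the example.com and example.org
    addresses are all invented for illustration.

```csharp
using System;
using System.Net;

static class CookieScrape
{
    // WebClient: the cookie is written straight into the Cookie header.
    public static WebClient BuildClient(string name, string value)
    {
        var client = new WebClient();
        client.Headers.Add("Cookie", name + "=" + value);
        return client;
    }

    // Hypothetical helper: one container holding cookies for two sites.
    public static CookieContainer SampleCookies()
    {
        var cookies = new CookieContainer();
        cookies.Add(new Uri("http://example.com/"), new Cookie("session", "abc123"));
        cookies.Add(new Uri("http://example.org/"), new Cookie("session", "xyz789"));
        return cookies;
    }

    // HttpWebRequest: attach the shared container; only the cookies that
    // match the request's domain are sent.
    public static HttpWebRequest BuildRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        return request;
    }

    static void Main()
    {
        CookieContainer cookies = SampleCookies();
        // Show which cookies would be passed up to each site.
        Console.WriteLine(cookies.GetCookieHeader(new Uri("http://example.com/")));
        Console.WriteLine(cookies.GetCookieHeader(new Uri("http://example.org/")));

        HttpWebRequest request = BuildRequest("http://example.com/", cookies);
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine(response.StatusCode);
        }
    }
}
```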


    Again, it is a close call as to which to use. Since WebClient uses
    WebRequest internally, performance is not much of an issue. In conclusion,
    if you are doing simple scraping, WebClient would appear to be the most
    convenient, but it can become unclear if you are passing lots of values in
    forms or cookies. HttpWebRequest is a little more long-winded, but its use
    of classes makes it a little clearer. Therefore, the choice is yours.


    Scraping in one line using temporary objects

    After finishing the series on screen scraping the thought arose: how small
    can you make a working scrape? With this in mind the most extreme example
    was chosen: is a one-line scrape doable?


    Surprisingly the answer is yes, and more surprisingly, it is still quite clear.


    If you look at the code you will notice that in the conventional scrape
    variables are declared for the WebClient, stream and stream reader objects,
    and that they are used only once before they are disposed of.


    If you look at the one-line example you will see that when an object is
    created it is no longer assigned to a variable; it is just used. Placing ()
    around the object creation allows the object to be accessed as a temporary
    object. Since there is no other reference to the object, it becomes eligible
    for garbage collection once the statement has executed.
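
    A C# sketch of the two styles, assuming the conventional version reads the
    page through WebClient.OpenRead and a StreamReader as described above. The
    class and method names and the example.com address are placeholders.

```csharp
using System;
using System.IO;
using System.Net;

static class OneLineScrape
{
    // Conventional scrape: each object gets a variable, is used once,
    // and is then closed or disposed.
    public static string Conventional(string url)
    {
        WebClient client = new WebClient();
        Stream stream = client.OpenRead(url);
        StreamReader reader = new StreamReader(stream);
        string page = reader.ReadToEnd();
        reader.Close(); // also closes the underlying stream
        client.Dispose();
        return page;
    }

    // One-line scrape: the same objects, created and used as temporaries.
    public static string OneLine(string url)
    {
        return (new StreamReader((new WebClient()).OpenRead(url))).ReadToEnd();
    }

    static void Main()
    {
        const string url = "http://example.com/"; // placeholder address
        Console.WriteLine(Conventional(url).Length);
        Console.WriteLine(OneLine(url).Length);
    }
}
```

    One trade-off worth noting: the one-line form never calls Close or Dispose
    itself, so the stream is released only when the garbage collector gets to
    it. That is fine for a quick scrape but less so in a long-running program.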
