I am working on the spider again tonight, and I learned a few things. For one, HttpWebRequest.Timeout exists. For another, favicon.ico is a pain in the butt: it always comes up and stops the spider. I've been using Regex to catch strings that contain characters I just don't want in URLs.
I was reading about other people making web spiders, and most of them just go out and grab links, while I am recursively following links to other links. Because of that, I verify that each page actually gives a response, and I don't travel to it if it doesn't. I use a page that provides random links as a starting point, and from there I follow them as far as I can go.
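A minimal sketch of that recursive walk, not the spider's actual code. The names CrawlSketch, Crawl, responds and getLinks are hypothetical, and so are the depth limit and the visited set, which are only there so the recursion ends on pages that link back to each other.

```csharp
using System;
using System.Collections.Generic;

static class CrawlSketch
{
    // Follows links depth-first and attempts each URL at most once.
    // responds and getLinks are hypothetical stand-ins for the real
    // response check and the real link extraction.
    public static void Crawl(string url, int depth, HashSet<string> visited,
                             Func<string, bool> responds,
                             Func<string, IEnumerable<string>> getLinks)
    {
        // Stop when too deep or when this URL was already attempted.
        if (depth < 0 || !visited.Add(url)) return;

        // No response: don't travel there, and don't read its links.
        if (!responds(url)) return;

        foreach (string link in getLinks(url))
        {
            Crawl(link, depth - 1, visited, responds, getLinks);
        }
    }
}
```

Passing the two checks in as delegates keeps the walk itself separate from the network code, so it can be tried against a small in-memory set of links first.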
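Some stuff:
wreq.Timeout = 60000; // in milliseconds, so 60 seconds
Regex r = new Regex("favicon");
Match m = r.Match(url); // url is a hypothetical name for the link being checked
if (m.Success)
etc...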
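The snippets above could be wrapped up roughly like this. It is a sketch under assumptions, not the real spider: SpiderSketch, IsUnwanted and Responds are hypothetical names, skipping anything that is not http or https is an added guess, and treating only a 200 OK as "responds" is a choice made for the example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class SpiderSketch
{
    // Same pattern as above; ignoring case is an added assumption.
    static readonly Regex Unwanted = new Regex("favicon", RegexOptions.IgnoreCase);

    // True when the URL mentions favicon and should be skipped.
    public static bool IsUnwanted(string url)
    {
        return Unwanted.Match(url).Success;
    }

    // True only when the server answers 200 OK within the timeout.
    public static bool Responds(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        try
        {
            HttpWebRequest wreq = (HttpWebRequest)WebRequest.Create(uri);
            wreq.Timeout = 60000; // milliseconds, same value as above
            using (HttpWebResponse resp = (HttpWebResponse)wreq.GetResponse())
            {
                return resp.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException)
        {
            // Timeouts, DNS failures and 4xx/5xx answers all end up here.
            return false;
        }
    }
}
```

A link would then only be followed when !SpiderSketch.IsUnwanted(link) && SpiderSketch.Responds(link) holds, so a favicon.ico link gets skipped instead of stopping the spider.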