Wednesday, November 26, 2008

Favicon is Evil

I am working on the spider again tonight, and I learned a few things. For one, HttpWebRequest.Timeout exists. For another, favicon.ico is a pain in the butt: it always comes up and stops the spider. I've been using Regex to catch strings that contain chars I just don't want in URLs.

I was reading about other people making web spiders, and they are just out grabbing links, while I am recursively following links to other links. As such, I verify that each page gives a response before I travel to it, and I skip it if it doesn't. I use a page that provides random links as a starting point, and from there I follow the links as far as I can go.

Some stuff:

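wreq.Timeout = 60000; // in milliseconds, so 60 seconds
Regex r = new Regex("favicon");
Match m = r.Match(url); // Success lives on Match, not Regex; "url" is an assumed name for the link being checked
if(m.Success)
etc...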
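
For anyone who wants to see how these pieces might fit together, here is a rough sketch, not the actual spider. The class name SpiderSketch, the getLinks and responds delegates, the visited set, and the depth limit are all my own assumptions for illustration; the only parts taken from the snippet above are the 60000 ms timeout and the "favicon" pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class SpiderSketch
{
    // Pattern from the post; ignoring case is an added assumption.
    private static readonly Regex Unwanted =
        new Regex("favicon", RegexOptions.IgnoreCase);

    // True if the URL matches something we never want to request.
    public static bool ShouldSkip(string url)
    {
        return Unwanted.IsMatch(url);
    }

    // True if the server answers 200 OK within the timeout.
    public static bool Responds(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;
        try
        {
            HttpWebRequest wreq = (HttpWebRequest)WebRequest.Create(uri);
            wreq.Timeout = 60000; // milliseconds
            using (HttpWebResponse resp = (HttpWebResponse)wreq.GetResponse())
            {
                return resp.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException)
        {
            // Timeouts, DNS failures, and 4xx/5xx responses all land here.
            return false;
        }
    }

    // Recursively follows links. getLinks is a hypothetical delegate that
    // returns the links found on a page; responds decides whether a page
    // is reachable (pass SpiderSketch.Responds for real requests).
    public static void Crawl(
        string url,
        Func<string, IEnumerable<string>> getLinks,
        Func<string, bool> responds,
        HashSet<string> visited,
        int depth)
    {
        if (depth < 0 || ShouldSkip(url)) return;
        if (!visited.Add(url)) return;   // already seen, avoid loops
        if (!responds(url)) return;      // don't travel to dead pages

        foreach (string link in getLinks(url))
        {
            Crawl(link, getLinks, responds, visited, depth - 1);
        }
    }
}
```

Passing the response check in as a delegate means the recursion can be tried out against a made-up link table without touching the network. The depth limit and the visited set are only there so the sketch cannot recurse forever.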
