I am working on the spider again tonight, and I learned a few things. For one, HttpWebRequest.Timeout exists. For another, favicon.ico is a pain in the butt: it always comes up and stops the spider. I've also been using Regex to strip out characters I just don't want in URLs.
I was reading about other people making web spiders, and they are mostly just out grabbing links, while I am recursively following links to other links. Because of that, I verify that each page returns a response, and I don't travel to it if it doesn't. I use a page that provides random links as a starting point, and from there I follow the links as far as I can go.
Some stuff:
wreq.Timeout = 60000; // milliseconds, so 60 seconds
Regex r = new Regex("favicon");
if (r.IsMatch(stringURL)) // Success lives on Match, not on Regex
etc...
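Here is a rough sketch of how those pieces might fit together. It is not the spider's actual code: the class name SpiderChecks, the method names, and the exact set of "unwanted" characters are all assumptions made up for illustration. The idea is just to skip favicon URLs, clean the rest, and only travel to pages that answer within the timeout.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SpiderChecks
{
    // Same pattern as in the post: anything containing "favicon" gets skipped.
    private static readonly Regex Favicon = new Regex("favicon");

    // Assumed set of unwanted characters: whitespace, quotes, angle brackets.
    private static readonly Regex Unwanted = new Regex("[\\s\"'<>]");

    // True when the URL should be skipped instead of fetched.
    public static bool ShouldSkip(string url)
    {
        return Favicon.IsMatch(url);
    }

    // Returns the URL with the unwanted characters removed.
    public static string Clean(string url)
    {
        return Unwanted.Replace(url, "");
    }

    // True when the page answers with 200 OK before the timeout runs out.
    public static bool IsReachable(string url, int timeoutMs)
    {
        try
        {
            HttpWebRequest wreq = WebRequest.Create(url) as HttpWebRequest;
            if (wreq == null)
            {
                return false; // not an http/https URL
            }
            wreq.Timeout = timeoutMs; // milliseconds
            using (HttpWebResponse resp = (HttpWebResponse)wreq.GetResponse())
            {
                return resp.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException) { return false; }           // timeout, DNS failure, 4xx/5xx
        catch (UriFormatException) { return false; }     // malformed URL
        catch (NotSupportedException) { return false; }  // unknown scheme
    }
}
```

A call might look like `if (!SpiderChecks.ShouldSkip(url) && SpiderChecks.IsReachable(url, 60000)) { ... }`, which keeps the favicon check in front of the slower network check.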
Wednesday, November 26, 2008
Tuesday, November 18, 2008
Pretty much
Well, the spider is running pretty much by itself. I used some conditional breakpoints for debugging today. I was dealing with stringURL getting too long to be stored in a database record, so basically I am truncating it. I have rearranged the code so that instead of me clicking a button to find each new URL, it runs in a loop. I actually went out to dinner, came back, and it was still finding URLs.
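The loop and the truncation could be sketched roughly like this. To be clear about what is assumed: the 255-character limit, the names CrawlLoop, TruncateForDb and Run, and the getLinks delegate (a stand-in for "fetch the page and pull out its links") are all hypothetical, and the queue with a seen-set is just one simple way to keep going without a button click, not necessarily how the real spider does it.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlLoop
{
    // Assumed column width; use whatever the real database column allows.
    public const int MaxUrlLength = 255;

    // Cuts the URL down so it fits in the database column.
    public static string TruncateForDb(string url, int maxLength)
    {
        return url.Length <= maxLength ? url : url.Substring(0, maxLength);
    }

    // Keeps following links from startUrl until none are left or maxPages is reached.
    public static List<string> Run(string startUrl,
                                   Func<string, IEnumerable<string>> getLinks,
                                   int maxPages)
    {
        List<string> found = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        Queue<string> pending = new Queue<string>();

        pending.Enqueue(startUrl);
        seen.Add(startUrl);

        while (pending.Count > 0 && found.Count < maxPages)
        {
            string url = pending.Dequeue();

            // This is where the database insert would go.
            found.Add(TruncateForDb(url, MaxUrlLength));

            foreach (string link in getLinks(url))
            {
                // HashSet.Add returns false for URLs already seen,
                // so the loop never revisits a page.
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return found;
    }
}
```

The maxPages cap is there so a test run ends on its own; for a go-out-to-dinner run it could be set very high.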
Thursday, November 13, 2008
Saturday, November 8, 2008
Thursday, November 6, 2008
Searchin'
I'm gonna move back to the search engine part of my project, but first... I had an idea for collecting pages to mine. If I increment through IP-address URLs and look at the response each one gives, I can categorize them and store them in a DB. This gives me a big base for looking around without running into a wall.
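A minimal sketch of the incrementing part, assuming IPv4 addresses treated as 32-bit numbers. The names IpWalker, ToUrl, Walk and Categorize are made up for this example, and the categories are only a guess at what "categorize" might mean (grouping by HTTP status code range).

```csharp
using System;
using System.Collections.Generic;

public static class IpWalker
{
    // Turns a 32-bit number into a URL such as http://1.2.3.4/
    public static string ToUrl(uint address)
    {
        return string.Format("http://{0}.{1}.{2}.{3}/",
            (address >> 24) & 0xFF,
            (address >> 16) & 0xFF,
            (address >> 8) & 0xFF,
            address & 0xFF);
    }

    // Yields count URLs starting at start, adding one to the address each time.
    public static IEnumerable<string> Walk(uint start, int count)
    {
        for (int i = 0; i < count; i++)
        {
            yield return ToUrl(unchecked(start + (uint)i));
        }
    }

    // Assumed categories, grouped by HTTP status code range.
    public static string Categorize(int statusCode)
    {
        if (statusCode >= 200 && statusCode < 300) return "ok";
        if (statusCode >= 300 && statusCode < 400) return "redirect";
        if (statusCode >= 400 && statusCode < 500) return "client error";
        if (statusCode >= 500 && statusCode < 600) return "server error";
        return "other";
    }
}
```

Treating the address as a single number means the carry between octets happens for free: the address after 1.2.3.255 comes out as 1.2.4.0.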
So I changed the part where:
if (affinity >= highest)
to:
if (affinity >= highest)
{
    highest = affinity;
    affinityTwin = whatever;
}
because I made the rookie mistake of not updating highest to the current affinity when trying to find the highest of all values in a list.
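For the record, here is a small self-contained version of that find-the-highest loop. It is a sketch under assumptions: affinities are taken to be doubles, affinityTwin is taken to be the index of the best match, and the names AffinitySearch and IndexOfHighest are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class AffinitySearch
{
    // Returns the index of the highest affinity, or -1 for an empty list.
    public static int IndexOfHighest(IList<double> affinities)
    {
        double highest = double.NegativeInfinity;
        int affinityTwin = -1;

        for (int i = 0; i < affinities.Count; i++)
        {
            double affinity = affinities[i];
            if (affinity >= highest)
            {
                // Without this line every value is compared against the
                // starting value, so the last item always wins.
                highest = affinity;
                affinityTwin = i;
            }
        }
        return affinityTwin;
    }
}
```

Because the comparison is >=, a tie goes to the later item in the list; switching to > would keep the first one instead.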
So I've got the AI components going on. It's a little bit Fuzzy Logic and a little bit Neural Net.
More on this later.