Wednesday, November 26, 2008

Favicon is Evil

I am working on the spider again tonight, and I learned a few things. For one, HttpWebRequest.Timeout exists. For another, favicon.ico is a pain in the butt: it always comes up and stops the spider. I've been using Regex to catch strings that have chars I just don't want in URLs.

I was reading about other people making web spiders, and they are just out grabbing links, while I am recursively following links to other links. As such, I verify that each page gives a response, and I don't travel to it if it doesn't. I use a page that provides random links as a starting point, and I follow them as far as I can go.
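Roughly, the recursion looks like the sketch below. This is not the spider's actual code: Crawler, getLinks, responds and maxDepth are names made up for illustration, and the two delegates stand in for the real HTTP work so the shape of the recursion is easy to see.

```csharp
using System;
using System.Collections.Generic;

public static class Crawler
{
    // Follows links depth-first from start and returns the URLs that responded.
    // getLinks and responds are hypothetical stand-ins for the real HTTP code.
    public static List<string> Crawl(
        string start,
        Func<string, IEnumerable<string>> getLinks,
        Func<string, bool> responds,
        int maxDepth)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        Visit(start, 0, maxDepth, getLinks, responds, visited, order);
        return order;
    }

    private static void Visit(
        string url,
        int depth,
        int maxDepth,
        Func<string, IEnumerable<string>> getLinks,
        Func<string, bool> responds,
        HashSet<string> visited,
        List<string> order)
    {
        if (depth > maxDepth) return;   // assumed safety limit on recursion
        if (!visited.Add(url)) return;  // already seen, avoid going in circles
        if (!responds(url)) return;     // no response, so don't travel there
        order.Add(url);
        foreach (string link in getLinks(url))
        {
            Visit(link, depth + 1, maxDepth, getLinks, responds, visited, order);
        }
    }
}
```

The visited set is what keeps two pages that link to each other from bouncing the spider back and forth forever.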

some stuff:

wreq.Timeout = 60000; // milliseconds
Regex r = new Regex("favicon");
if (r.IsMatch(stringURL)) // Success lives on Match, not on Regex
etc...
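A slightly fuller sketch of the favicon check, assuming the candidate URL is sitting in a string before the request is made. UrlFilter and ShouldSkip are names I made up for illustration, not what the spider really uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlFilter
{
    // Matches any URL that mentions favicon, ignoring case.
    private static readonly Regex Favicon =
        new Regex("favicon", RegexOptions.IgnoreCase);

    // Returns true when the spider should not request this URL.
    public static bool ShouldSkip(string url)
    {
        if (string.IsNullOrWhiteSpace(url)) return true; // nothing to request
        return Favicon.IsMatch(url);
    }
}
```

Calling UrlFilter.ShouldSkip before creating the request means the spider never stops on favicon.ico in the first place.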

Tuesday, November 18, 2008

Pretty much

Well, the spider is running pretty much by itself. I used some conditional breakpoints for debugging today. I was dealing with stringURL getting too long to be stored in a database record; basically, I am truncating it. I have rearranged the code so that I don't click a button to find each new URL; instead it runs in a loop. I actually went out to dinner, came back, and it was still finding URLs.
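The truncation amounts to something like this sketch. The 255 limit is an assumption for the example (the real number depends on how the column is defined), and UrlStorage and Truncate are made-up names.

```csharp
using System;

public static class UrlStorage
{
    // Hypothetical column size; the real limit depends on the table definition.
    public const int MaxUrlLength = 255;

    // Cuts the URL down so it fits in the database column.
    public static string Truncate(string url, int maxLength)
    {
        if (url == null) return string.Empty;
        return url.Length <= maxLength ? url : url.Substring(0, maxLength);
    }
}
```

One catch with this approach: a truncated URL may no longer point at a real page, so it is only good as a record, not something to crawl again.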

Thursday, November 13, 2008

Regexing

I was doing some Regexing last night to deal with the search input string. An interesting function it is, that Regex. Basically I want to keep weird chars out.
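Here is a minimal sketch of keeping weird chars out, assuming "weird" means anything that is not a letter, digit, underscore or whitespace. That definition and the names SearchInput and Clean are my assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchInput
{
    // Anything that is not a word character or whitespace.
    private static readonly Regex Unwanted = new Regex(@"[^\w\s]");

    // Runs of whitespace, collapsed to one space after stripping.
    private static readonly Regex Spaces = new Regex(@"\s+");

    // Replaces unwanted chars with spaces, then tidies up the spacing.
    public static string Clean(string input)
    {
        if (input == null) return string.Empty;
        string stripped = Unwanted.Replace(input, " ");
        return Spaces.Replace(stripped, " ").Trim();
    }
}
```

Replacing with a space rather than deleting keeps words apart when punctuation was the only thing separating them.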

Saturday, November 8, 2008

How Would I?

So how would I handle one-word search queries from the textbox on the search page?

Thursday, November 6, 2008

Searchin'

I'm gonna move back to the search engine part of my project, but first... I had an idea for collecting pages to mine. If I increment through IP addresses used as URLs and look at the response each one gives, I can categorize them and store them in a DB. This gives a big base for looking around without running into a wall.
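The incrementing part could look something like this sketch, assuming IPv4 addresses. IpWalker and Next are made-up names, and the request and categorizing steps are left out.

```csharp
using System;
using System.Net;

public static class IpWalker
{
    // Returns the address that follows the given one,
    // wrapping around to 0.0.0.0 after 255.255.255.255.
    public static IPAddress Next(IPAddress address)
    {
        // Bytes come back most significant first, so carry from the end.
        byte[] bytes = address.GetAddressBytes();
        for (int i = bytes.Length - 1; i >= 0; i--)
        {
            bytes[i] = unchecked((byte)(bytes[i] + 1));
            if (bytes[i] != 0) break; // no carry into the next byte
        }
        return new IPAddress(bytes);
    }
}
```

Each address can then be turned into a URL with something like "http://" + address and handed to the same response check the spider already uses.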

So I changed the part where:

if(affinity>=highest)

to:

if (affinity >= highest)
{
    highest = affinity;
    affinityTwin = whatever;
}

because I made the rookie mistake of not updating highest to the current affinity when trying to find the highest of all the values in a list.
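Spelled out as a whole loop, the fix looks something like the sketch below. It assumes affinities are doubles in a list; AffinityFinder and IndexOfHighest are made-up names, and the returned index plays the role of affinityTwin.

```csharp
using System;
using System.Collections.Generic;

public static class AffinityFinder
{
    // Returns the index of the highest affinity, or -1 for an empty list.
    // With >= a tie goes to the later item, same as the snippet above.
    public static int IndexOfHighest(IList<double> affinities)
    {
        int best = -1;
        double highest = double.NegativeInfinity;
        for (int i = 0; i < affinities.Count; i++)
        {
            if (affinities[i] >= highest)
            {
                highest = affinities[i]; // the line that was missing
                best = i;
            }
        }
        return best;
    }
}
```

Without the highest update, every value that beats the starting value wins, so the loop ends up reporting the last such item instead of the biggest one.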


So I've got the AI components going on. It's a little bit Fuzzy Logic and a little bit Neural Net.
More on this later.