Thursday, July 9, 2015

Hierarchical search

I had an idea for a website that lets you search for web pages by category, like in the picture. Each drop-down box would list perhaps the 100 most popular categories or subcategories, and you could drill deeper and deeper into the categories as far as you want. The site would then give you a list of web pages that fit at that point in the hierarchy...


**Google to Hierarchical**

I think a way to seed the site from Google results would be to use the following to check whether one keyword could be a subcategory of another... 


R(S) is the number of results when searching for keyword S
For example:
R(sport) = 4,040,000,000
means there are that many search results on Google for the word "sport"
R(baseball) = 480,000,000
R(sport, baseball) = 476,000,000
So, R(sport, baseball) is the number of pages that match both keywords, baseball and sport, and that is almost as many (about 99%) as searching for baseball by itself. Since nearly every baseball page also matches sport, and R(sport) is much larger than R(baseball), perhaps we can conclude that baseball is a subcategory of sport...
R(animal) = 2,200,000,000
R(anteater) = 864,000
R(animal, anteater) = 561,000
By percentage this is not as good a match for the subcategory relation (about 65% of the anteater results), but because it is more than half we can keep it as a possible candidate... There might be a keyword other than animal that fits anteater better...
R(anteater, baseball) = 194,000
That is much less than 50% of the results for anteater alone (about 22%), so these two probably don't have a subcategory relationship...
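
Here is a rough Python sketch of the check described above, just to make the arithmetic concrete. The function names and the 0.9 cutoff for a "strong" match are my own assumptions (the only threshold mentioned above is 50%), and the counts are typed in by hand from the examples rather than fetched from Google.

```python
def overlap_ratio(r_child, r_both):
    """Fraction of the child keyword's results that also match the parent."""
    if r_child <= 0:
        raise ValueError("r_child must be positive")
    return r_both / r_child


def classify(r_parent, r_child, r_both, strong=0.9, candidate=0.5):
    """Label a (parent, child) keyword pair from its result counts.

    strong=0.9 is an assumed cutoff; candidate=0.5 is the
    'more than half' rule from the post.
    """
    # The parent must be the broader term, i.e. have more results.
    if r_parent <= r_child:
        return "no relation"
    ratio = overlap_ratio(r_child, r_both)
    if ratio >= strong:
        return "subcategory"
    if ratio > candidate:
        return "candidate"
    return "no relation"


# Counts copied from the examples above.
print(classify(4_040_000_000, 480_000_000, 476_000_000))  # sport / baseball
print(classify(2_200_000_000, 864_000, 561_000))          # animal / anteater
print(classify(480_000_000, 864_000, 194_000))            # baseball / anteater
```

With these numbers the three pairs come out as "subcategory", "candidate" and "no relation", which matches the reasoning above.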

So one way to start might be to look at R(x,y) where x and y are each among the 1,000 most-searched terms on Google, and use the method above to form a category tree from the data...
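
As a minimal sketch of how that tree could be formed, the Python below gives each keyword the broader keyword with the highest overlap ratio as its parent, as long as the ratio is above 50%. The name build_parent_map, the dictionary layout and the "best ratio wins" rule are assumptions of mine, not a worked-out design. It uses only the counts from the examples above; pairs with no count are simply skipped rather than guessed.

```python
def build_parent_map(single, pair, threshold=0.5):
    """Pick at most one parent per keyword.

    single: {keyword: R(keyword)}
    pair:   {frozenset({x, y}): R(x, y)}
    Returns {keyword: parent keyword, or None for a top-level keyword}.
    """
    parents = {}
    for child, r_child in single.items():
        best_parent, best_ratio = None, threshold
        if r_child > 0:
            for parent, r_parent in single.items():
                # A parent must be a different, broader keyword.
                if parent == child or r_parent <= r_child:
                    continue
                r_both = pair.get(frozenset((parent, child)))
                if r_both is None:
                    continue  # no count for this pair, so skip it
                ratio = r_both / r_child
                if ratio > best_ratio:
                    best_parent, best_ratio = parent, ratio
        parents[child] = best_parent
    return parents


# Counts copied from the examples above.
single = {
    "sport": 4_040_000_000,
    "animal": 2_200_000_000,
    "baseball": 480_000_000,
    "anteater": 864_000,
}
pair = {
    frozenset(("sport", "baseball")): 476_000_000,
    frozenset(("animal", "anteater")): 561_000,
    frozenset(("anteater", "baseball")): 194_000,
}

parents = build_parent_map(single, pair)
print(parents)
```

On this tiny data set baseball ends up under sport and anteater under animal, while sport and animal stay at the top level. For 1,000 keywords that means roughly half a million pair counts to collect, so the lookups would be the expensive part, not the tree building.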

