|
#1
|
|||
|
|||
|
Hi everyone,
I would first like to address a problem in the search engine world: PageRank. Google uses PageRank (PR) to rank pages by how popular they are, not by what content is on them. This sometimes puts irrelevant results first simply because those pages have high PageRank.
Vezto was launched several weeks ago as a new type of search engine, one that ranks pages solely on content, not on who links to the content. I am one of the developers, and we need people to test our search engine and provide feedback. We use linguistics and analysis of word patterns to determine which words are important on a page, and then match those words to a user's query. We find this approach most useful for general-information searches like 'Lance Armstrong' or 'ipod': the user gets results about the topic in question, not popular sites that merely mention the query a few times.
Our only problem is that we have not crawled the entire web, so the user may get irrelevant results simply because we have not yet visited the pages relevant to the query.
So please take a look at the engine and provide some feedback. I would be happy to answer any questions about how we built this project from the ground up.
The link is: http://www.vezto.com
Thanks,
-Steve |
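To make the content-only idea concrete, here is a minimal sketch that ranks pages purely by their text. It uses plain TF-IDF as a stand-in, since the post does not say how Vezto actually weighs "important words"; the function names, the example URLs, and the page texts are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into whole words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_content(query, pages):
    """Return (score, url) pairs, best first, using only the page text.

    TF-IDF is an assumption here, not Vezto's real scoring method.
    """
    docs = {url: Counter(tokenize(text)) for url, text in pages.items()}
    n_docs = len(docs)
    results = []
    for url, counts in docs.items():
        total = sum(counts.values()) or 1
        score = 0.0
        for term in tokenize(query):
            # Number of pages that contain the term at least once.
            df = sum(1 for c in docs.values() if term in c)
            if df == 0:
                continue
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += (counts[term] / total) * idf
        results.append((score, url))
    return sorted(results, reverse=True)


# Hypothetical pages: one about the topic, one that only mentions it.
pages = {
    "a.example/cycling": "lance armstrong wins the tour lance armstrong cycling",
    "b.example/portal": "news weather sports lance armstrong mentioned once "
                        "among many other popular topics",
    "c.example/cars": "new cars and trucks for sale",
}
ranking = rank_by_content("Lance Armstrong", pages)
for score, url in ranking:
    print(f"{score:.3f}  {url}")
```

The page that is actually about the query outranks the page that mentions it in passing, and no link data is consulted at any point.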
|
#2
|
||||
|
||||
|
Wouldn't it be better to focus on building the index?
"Searching from 122,947 pages"
That's probably a little small to test anything very meaningful...
Gigablast: 2,024,193,536 pages indexed
Google: 8,058,044,651 web pages |
|
#3
|
|||
|
|||
|
Thanks for the reply Chris,
We are working on the index, and it is updated every second. If you refresh the page every few seconds you will see the number climb. We are indexing about 25,000 pages each day; you found us at day 5, with about 122,000 pages. As we index more of the web, we are also trying to get feedback on how to improve the engine while it crawls. If we had the computers that Google and Gigablast have, crawling the web would not be a problem. As of now our project consists of a few spare computers and a server. However, from the small piece of the web that we have indexed, we have found that a search for something general returns very good results.
-Steve |
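For a rough sense of scale, the figures quoted in this thread can be turned into a back-of-the-envelope estimate. This sketch assumes the crawl rate stays flat at 25,000 pages per day, which is a simplification; the function name is made up for illustration.

```python
import math

PAGES_PER_DAY = 25_000   # crawl rate quoted above
INDEXED_NOW = 122_947    # index size quoted earlier in the thread


def days_to_reach(target_pages, indexed_now=INDEXED_NOW, rate=PAGES_PER_DAY):
    """Days needed to reach target_pages at a constant crawl rate."""
    remaining = max(target_pages - indexed_now, 0)
    return math.ceil(remaining / rate)


# Index sizes as quoted in post #2.
for name, size in [("Gigablast", 2_024_193_536), ("Google", 8_058_044_651)]:
    days = days_to_reach(size)
    print(f"{name}: {days:,} days (about {days / 365:.0f} years)")
```

At a constant rate the gap is measured in centuries, so closing it would take far more crawl capacity than a few spare computers provide.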
|
#4
|
||||
|
||||
|
testing, testing, microphone check
Following are the results that came up for the term "search engine watch." Good job on the blog coming up first, but it is very strange that it happened to pick that particular blog post. This makes the title look fairly irrelevant.
The idea of using the "important words on page" is cool, but it seems to be completely unrelated to the search. There are quite a few examples in the following top 9 results that make you wonder why the heck they are listed. For example, I was wondering why #9, about the Chevys, came up. It just seemed way further off than the others. So I looked at the source and found some interesting Overture code as well as a referral tracker, yet no instance of the word "engine" or "watch" on the page. There are plenty of instances of "search," including any parsed out of "research."
Does your engine not require that all words in a particular search phrase are present? Especially if you rely on content only and no links, you would think this would be a major requirement. Needs a lot of work, I think...
Quote:
|
|
#5
|
|||
|
|||
|
Thanks for the feedback,
The reason an irrelevant site was listed at number 9 is that the engine has not indexed even 1 percent of the World Wide Web. The pool of pages the engine is picking from is severely limited. The auto page was a low-relevancy match for the query, but it was displayed ninth because it was one of the best pages found. I will admit, though, that about 5 other pages should have been placed above the auto page. I am also looking into why that page came up for having the word 'search' in it when the word is not mentioned even in the source code. I am the first to agree that the engine needs work and needs more pages in the index, but I am more concerned about the idea of a content-only search: can it work? Looking at only the best results from Vezto, in some cases it does. |
|
#6
|
||||
|
||||
|
Quote:
Then they switched to content-only for the second generation, but that had problems with spam detection and relevancy. It's also an issue for sites with difficult-to-read content. Then they switched to content+links, which is generation 3, and it's been working fairly well, since spammers have to spam both links and content, which is harder to do. I'm not sure what generation 4 will be, but I suspect it will be third-party information that is not controlled by the website owner at all, but of course I don't know.
My concern would be regarding your spam detection, since content-only was abandoned due to spam issues. Do you have a new and improved spam detection system that's better than the ones that were being used when Google first started?
Ian
__________________
International SEO |
|
#7
|
|||
|
|||
|
Thanks for all the feedback everyone,
As far as spam protection goes, the engine has been given handpicked domains to crawl, and it crawls within close proximity to these trusted domains, like aol.com or msn.com. In the future, if this idea ever takes off, we will allow webmasters or users to submit sites for review; once approved, those sites will be crawled by our spiders. While this method may compromise true comprehensiveness, Google and Yahoo's 'comprehensive' search revolves around the same old pages with the most links pointing to them. Our idea is to give all pages an equal opportunity in the results ranking. Weeding out the spam and bad sites is almost impossible to do by machine, but takes about 5 seconds of someone's time. Again, in the future the idea is that any webmaster can have a domain crawled by filling out a simple form.
As far as the performance of the search engine goes, I agree that it is poor right now. I looked into the problem myself and after much testing found that the results seen earlier for 'search engine watch' had a crucial flaw: the engine counted words that merely contain 'search', such as 'research', just as predicted. Also, the previous version of Vezto gave high importance to articles with many words on them, but after much review we decided that people surf the web to 'see' stuff, not to 'read' all the time. So, in short, we completely redesigned the scoring and crawling mechanisms for Vezto and are re-crawling the web. From initial testing, the next version will not disappoint.
If you would like to test the new version, it is currently crawling askmen.com, a portal we used for testing the program in the beginning. We should get a more comprehensive database in the future by crawling the already popular sites you guys are used to seeing on Google (as much as we hate to admit it, we do rely on popular sites for something...).
So again, thanks for the feedback, administrators; without you this new version would not have been possible.
-Steve |
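The 'search' versus 'research' flaw described above, and the earlier question about requiring every query word, can be illustrated with a small sketch. This is not Vezto's code; the function names and the sample page texts are hypothetical, and the "buggy" version is only a guess at what substring matching looks like.

```python
import re


def words(text):
    """Return the set of whole, lowercased words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def substring_match(query, page_text):
    """Flawed matcher: a term counts if it appears anywhere, even inside
    another word, and a single matching term is enough."""
    text = page_text.lower()
    return any(term in text for term in query.lower().split())


def whole_word_match(query, page_text):
    """Stricter matcher: every query term must appear as a whole word."""
    query_words = words(query)
    return bool(query_words) and query_words <= words(page_text)


# Hypothetical page texts standing in for the results discussed above.
chevy_page = "Chevy research center: research new trucks and compare prices"
sew_page = "Search Engine Watch blog: search engine news and tips"

print(substring_match("search engine watch", chevy_page))   # True
print(whole_word_match("search engine watch", chevy_page))  # False
print(whole_word_match("search engine watch", sew_page))    # True
```

Splitting on word boundaries keeps 'research' from counting as 'search', and the subset check enforces that all query words are present before a page is considered at all.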
|
#8
|
||||
|
||||
|
wow you guys move fast and decisively...gotta love that
good luck with v2 and bring it by for a spin. |
|
#9
|
||||
|
||||
|
I am concerned, like Ian, that this engine will either be way too easy to spam or have an index too small to be worth anything.
Quote:
I am sorry, but to me it doesn't sound like a very scalable model. |
|
#10
|
|||
|
|||
|
Hey Guys,
Just to clear one thing up: we hand-pick DOMAINS, not web pages. There are 8 billion web PAGES, including pages with absolutely nothing on them, spam, useless information and the like. We plan to find the most popular domains at first (maybe 1000 or so) and then crawl those domains in addition to the domains directly linked from the original 1000. For instance, if we start with page A and it links to B and C, then pages A, B, and C will be crawled. However, if page B links to D, E, and F, these additional 3 pages will NOT be crawled, because they were not directly linked from the original page.
I see your point about scalability. Getting to 1000 good domains will require work, but they link to many, many other good domains, which require no screening, assuming we did our work correctly in picking the first 1000 sites.
As far as our success goes, we have no idea where the idea will go. Our organization is just some programmers doing this as a hobby. We have yet to invest a cent in our search engine; we just want to try the idea, and if it happens to work, great. Our second version is just about finished now, and we have to build up the database again. Like I mentioned before, countless stupid errors and mistakes from the last version were corrected. I'll post a reply when we feel v2 is ready for critique, to help us develop v3.
Take Care,
-Steve |
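The A/B/C example above amounts to a crawl that stops one hop away from the hand-picked seeds. A minimal sketch, assuming the link graph is already known as a dictionary (a real crawler would discover links while fetching); `crawl_frontier` and `max_depth` are names invented for this illustration.

```python
from collections import deque


def crawl_frontier(seeds, links, max_depth=1):
    """Return the seeds plus everything reachable within max_depth hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# The example from the post: A links to B and C; B links to D, E and F.
links = {"A": ["B", "C"], "B": ["D", "E", "F"]}
crawled = crawl_frontier(["A"], links)
print(sorted(crawled))  # ['A', 'B', 'C']
```

With `max_depth=1`, D, E and F are never queued because they are two hops from the seed. Raising the limit widens the crawl, at the cost of trusting links that nobody reviewed.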
|
#11
|
||||
|
||||
|
You cannot detect spam on a domain level! This is definitely NOT going to work. Just look up the recent WordPress case for an example of this, and that is certainly NOT the only case of its kind (maybe one of the bigger ones, but not the only one).
It gets even worse if you believe that an editorially verified domain will only link directly to quality sites. What if they have a blog that is getting spammed? What if they have an open forum where people post links? What if they have an editor who decides to give a link to a "good friend"? Or what if they simply have bad judgement, or a different opinion of quality than you? |