Thursday, December 02, 2004

<a noref>?

Check out http://www.martinlutherking.org/

Notice anything strange? Well... maybe not at a very quick first glance. But if you click around for a moment or two, you'll probably realize rather quickly that it's a hate site run by a white supremacist group (the same group that runs www.stormfront.org - look at how many people participate in their online forums. Scary.).

The shocking thing is that in a Google search for "Martin Luther King", the above site comes up 4th (as of today, December 2nd, 2004).

I did a bit of investigating into this a couple of months back. It turns out that this site is listed 4th because of the large number of sites on the Internet that link to it as an example of racism and/or misinformation.

Interesting, isn't it? Web sites of elementary school libraries link to www.martinlutherking.org as an example of a site students should *not* use, and in doing so, contribute to its rankings in the major search engines (search engines compute rank in large part based on how many other web sites link to a given site).

This is not news. The site has been around for quite some time, and many a blogger/journalist/librarian has written about the situation. However, for my own education, I began a one-person email campaign to the maintainers of the sites linking to www.martinlutherking.org. I explained how their linking to it actually helps its search engine rank, and in turn only helps the spread of misinformation -- exactly what many of these people were attempting to combat.

There were 3 classes of people who replied:

A. People who were not aware of the problem, thanked me for the notification, and removed the link (but perhaps kept the text without an actual link).

B. People who actually didn't realize the site was a racist/white supremacist site and immediately removed the link and thanked me.

C. People who understood the issue (either before or after receiving my email), but decided to keep the link intact for one of the following reasons:
  1. Search engine rank is something for the search engines to deal with. If the search engines are ranking something higher than they should be, it's their own problem.

  2. Removing the link makes it difficult for some users to access the information. They won't understand why clicking on the text does not work.

I certainly agreed with both points. Personally, I think the benefits of removing the link given the situation outweigh both of the above points, but they're certainly valid nonetheless.

So, I started thinking about ways to solve this problem. The fundamental idea behind using number of links to determine page rank is the notion of link structure as a recommender system. That is, by linking to a site, you are in fact recommending it to your users.
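To make the "links as recommendations" idea concrete, here is a minimal sketch that simply counts inbound links. The site names and the `inbound_link_counts` helper are made up for illustration, and real ranking algorithms such as PageRank also weight each link by the rank of the page it comes from, so treat this as a simplification rather than how any search engine actually works.

```python
from collections import Counter

# Hypothetical link graph: each site maps to the sites it links to.
links = {
    "school-library.example": ["hate-site.example", "encyclopedia.example"],
    "news-blog.example": ["hate-site.example"],
    "encyclopedia.example": ["school-library.example"],
}


def inbound_link_counts(link_graph):
    """Count how many distinct sites link to each target site."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):  # count each source at most once per target
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


counts = inbound_link_counts(links)
```

In this toy graph the site that others link to as a warning ends up with the most inbound links, which is exactly the problem described above.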

But is this always the case? Certainly not. Even Google utilizes server-side scripts to exclude some sites to which they link (such as Blogger blogs) from inheriting PageRank from www.google.com.

So, I came up with this mind-blowingly simple solution. Why not provide a simple way for one site to link to another without "recommending" it?

A simple way to do this would be to allow for an optional "noref" attribute in the HTML anchor tag (<a>). So, a link that would not be seen as a recommendation would look something like:

<a href="http://www.google.com" noref>Google</a>

An optional additional attribute would be backwards compatible most if not all of the time, since browsers ignore attributes they don't recognize. So why not implement it?
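As a rough sketch of how a crawler might honor this, the Python below uses the standard library's `html.parser` to separate links that count as recommendations from links marked `noref`. The `noref` attribute is only the proposal from this post, not part of any HTML standard, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class RecommendedLinkParser(HTMLParser):
    """Collect hrefs from <a> tags, split by whether the tag carries 'noref'."""

    def __init__(self):
        super().__init__()
        self.recommended = []      # links that would count toward rank
        self.not_recommended = []  # links marked with the proposed noref

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # valueless attributes such as noref map to None
        href = attr_map.get("href")
        if href is None:
            return
        if "noref" in attr_map:
            self.not_recommended.append(href)
        else:
            self.recommended.append(href)


def split_links(html):
    """Return (recommended, not_recommended) hrefs found in an HTML string."""
    parser = RecommendedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.recommended, parser.not_recommended


page = (
    '<a href="http://www.google.com" noref>Google</a> '
    '<a href="http://example.org/">Example</a>'
)
recommended, not_recommended = split_links(page)
```

A search engine following this convention would still let users click through both links, but only the links in `recommended` would feed into its ranking.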

Thoughts? Suggestions? Just a thought off the top of my head.. but something that could certainly be built upon..

6 comments:

Mike Swingler said...

The fundamental problem is that search engines are implying more semantic meaning to links than simply "take the user here on a click". Your suggestion of adding a noref parameter simply builds on the idea of the falsely implied "approval" of a site. Would the noref param have any greater meaning than "users go here, not search engines"? The real fix to this problem is moving toward something like the semantic web and finally eliminating HTML as the data/logic/presentation crap model that it is now.

And really, an MLK-bashing site is, in fact, about MLK. While I in no way approve of or agree with the site, so-called "hate speech" is still, in fact, an expression of free speech. Conceptually, this site should come up in a Google search, no matter how distorted or untrue its contents are. The internet, after all, was never designed to be an authoritative library of facts.

adamjh said...

Mike,

Well said!

This idea is absolutely not a 'real fix'. It's an incremental improvement to an inherently flawed system.

Without context (such as that which could be provided by the semantic web, for example), search engine algorithms are optimized to return the sites most useful to most people first. I would venture to say that most people would not find www.martinlutherking.org a most useful resource, despite its ranking of 4 out of 4,920,000 sites (Google).

Thus, while the system itself most certainly is flawed due to lack of context, we should be open to making incremental improvements while it continues to evolve.

Right now, search engines view publishers as recommenders, even if they are specifically saying "don't go to this site". Ideally, a good recommender system should take into account both positive and negative recommendations.

Whether the web should be represented as a recommender system is another question entirely. I think the principle has its merits. Even with contextual enhancements (relevance), it's still necessary to have a method to determine rank (usefulness). That is a problem that the semantic web does not solve.

Phil said...

I still think the best way of doing this is to create a tinyurl link to the site. That way people can still click on the link and go directly to the site, but search engines are not going to see that as a direct link, so it won't increase page rank at all. Solves all the problems, and still lets people link to it.

adamjh said...

First off, I'm pretty sure that PageRank isn't lost through redirection. If it's an actual HTTP redirect, I know for a fact that no PageRank is lost. So I'm pretty sure a tinyurl doesn't help at all.

Second, you're adding a single point of failure into the system. What if tinyurl becomes lagged, unavailable, or goes out of business altogether? Can it scale operationally? Can it scale financially?

Third, you're giving control over navigation to a 3rd party with whom you have no rights or recourse should they choose to change their service (to insert ads, or otherwise).

Hmm...

Allie said...

Stumbled across your site while googling for the exact syntax of Google's rel="nofollow" tag. You probably already know about it, but just in case you don't, here's some info: http://www.google.com/googleblog/2005/01/preventing-comment-spam.html

adamjh said...

Allie, good point. I actually posted briefly about this in January:

http://adamjh.blogspot.com/2005/01/google-msn-and-yahoo-implement-my.html

(should've posted a comment or update to this post too!)

Best regards,
- Adam