Graphic Header
Sites are very often designed with a graphic header, typically an image of the company logo that occupies the full width of the page. Do not do it! The upper part of a page is a very valuable place where you should insert your most important keywords for best SEO. With a graphic image, that prime position is wasted since search engines cannot make use of images. Sometimes you may come across completely absurd situations: the header contains text information, but to make its appearance more attractive, it is created in the form of an image. The text in it cannot be indexed by search engines and so it will not contribute toward the page rank. If you must present a logo, the best way is to use a hybrid approach: place the graphic logo at the top of each page and size it so that it does not occupy the entire width. Use a text header to make up the rest of the width.
Graphic Navigation Menu
The situation is similar to the previous one: internal links on your site should contain keywords, which will give an additional advantage in SEO ranking. If your navigation menu consists of graphic elements to make it more attractive, search engines will not be able to index the text of its links. If it is not possible to avoid using a graphic menu, at least remember to specify correct ALT attributes for all images.
Script Navigation
Sometimes scripts are used for site navigation. As an SEO worker, you should understand that search engines cannot read or execute scripts. Thus, a link specified with the help of a script will not be available to the search engine; the search robot will not follow it, and so parts of your site will not be indexed. If you use site navigation scripts then you must provide regular HTML duplicates to make them visible to everyone, both your human visitors and the search robots.
Session Identifier
Some sites use session identifiers. This means that each visitor gets a unique parameter (&session_id=) when he or she arrives at the site. This ID is added to the address of each page visited on the site. Session IDs help site owners to collect useful statistics, including information about visitors' behavior. However, from the point of view of a search robot, a page with a new address is a brand new page. This means that each time the search robot comes to such a site, it will get a new session identifier and will consider the pages as new ones whenever it visits them.
Search engines do have algorithms for consolidating mirrors and pages with the same content. Sites with session IDs should, therefore, be recognized and indexed correctly. However, it is difficult to index such sites and sometimes they may be indexed incorrectly, which has an adverse effect on SEO page ranking. If you are interested in SEO for your site, I recommend that you avoid session identifiers if possible.
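To see why session IDs multiply addresses, here is a minimal Python sketch. It assumes the parameter is literally named session_id, as in the example above, and uses made-up example.com URLs. It only illustrates the consolidation idea; it is not how any particular search engine handles it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name, taken from the "&session_id=" example above.
SESSION_PARAMS = {"session_id"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two crawls of the same page receive two different session identifiers.
first_visit = "http://www.example.com/catalog.php?item=7&session_id=abc123"
second_visit = "http://www.example.com/catalog.php?item=7&session_id=xyz789"

# The raw addresses differ, but they match once the session parameter is ignored.
print(first_visit == second_visit)
print(strip_session_id(first_visit) == strip_session_id(second_visit))
```

A robot that compares raw addresses sees two pages here; only after the session parameter is dropped do they collapse into one.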
Redirects
Redirects make site analysis more difficult for search robots, with resulting adverse effects on SEO. Do not use redirects unless there is a clear reason for doing so.
Hidden Text, A Deceptive SEO Method
The last two issues are not really mistakes but deliberate attempts to deceive search engines using illicit SEO methods. Hidden text (when the text color coincides with the background color, for example) allows site owners to cram a page with their desired keywords without affecting page logic or visual layout. Such text is invisible to human visitors but will be seen by search robots. The use of such deceptive optimization methods may result in banning of the site. It could be excluded from the index (database) of the search engine.
One-Pixel Links, SEO Deception
This is another deceptive SEO technique. Search engines consider the use of tiny, almost invisible, graphic image links just one pixel wide and high as an attempt at deception, which may lead to a site ban.
External Ranking Factors
Why Inbound Links To Sites Are Taken Into Account
As you can see from the previous section, many factors influencing the ranking process are under the control of webmasters. If these were the only factors then it would be impossible for search engines to distinguish between a genuine high-quality document and a page created specifically to achieve high search ranking but containing no useful information. For this reason, an analysis of inbound links to the page being evaluated is one of the key factors in page ranking. This is the only factor that is not controlled by the site owner.
It makes sense to assume that interesting sites will have more inbound links. This is because owners of other sites on the Internet will tend to publish links to a site if they think it is a worthwhile resource. The search engine will use this inbound link criterion in its evaluation of document significance.
Therefore, two main factors influence how pages are stored by the search engine and sorted for display in search results:
* Relevance, as described in the previous section on internal ranking factors.
* Number and quality of inbound links, also known as link citation, link popularity or citation index. This will be described in the next section.
Link Importance (Citation Index, Link Popularity)
You can easily see that simply counting the number of inbound links does not give us enough information to evaluate a site. It is obvious that a link from www.microsoft.com should mean much more than a link from some homepage like www.hostingcompany.com/~myhomepage.html. You have to take into account link importance as well as number of links.
Search engines use the notion of citation index to evaluate the number and quality of inbound links to a site. Citation index is a numeric estimate of the popularity of a resource expressed as an absolute value representing page importance. Each search engine uses its own algorithms to estimate a page citation index. As a rule, these values are not published.
As well as the absolute citation index value, a scaled citation index is sometimes used. This relative value indicates the popularity of a page relative to the popularity of other pages on the Internet. You will find a detailed description of citation indexes and the algorithms used for their estimation in the next sections.
Link Text (Anchor Text)
The link text of any inbound site link is vitally important in search result ranking. The anchor (or link) text is the text between the HTML tags <a> and </a> and is displayed as the text that you click in a browser to go to a new page. If the link text contains appropriate keywords, the search engine regards it as an additional and highly significant recommendation that the site actually contains valuable information relevant to the search query.
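As a rough illustration of what "the text between the anchor tags" means in practice, the following Python sketch pulls link targets and their anchor text out of an HTML fragment using the standard library. The fragment and the example.com addresses are invented for the example.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an open link with an href counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


parser = AnchorTextParser()
parser.feed(
    '<p>See <a href="http://example.com/cars">used car sales</a> or '
    '<a href="http://example.com/">click here</a>.</p>'
)
for href, text in parser.links:
    print(href, "->", text)
```

The first link carries keywords in its anchor text; the second ("click here") tells a search engine nothing about the page it points to.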
Relevance Of Referring Pages
As well as link text, search engines also take into account the overall information content of each referring page.
Example: Suppose we are using SEO to promote a car sales resource. In this case a link from a site about car repairs will have much more importance than a similar link from a site about gardening. The first link is published on a resource having a similar topic so it will be more important for search engines.
Google PageRank – Theoretical Basics
The Google company was the first company to patent the system of taking into account inbound links. The algorithm was named PageRank. In this section, we will describe this algorithm and how it can influence search result ranking.
PageRank is estimated separately for each web page and is determined by the PageRank (citation) of other pages referring to it. It is a kind of “virtuous circle.” The main task is to find the criterion that determines page importance. In the case of PageRank, it is the possible frequency of visits to a page.
I shall now describe how a user’s behavior when following links to surf the network is modeled. It is assumed that the user starts viewing sites from some random page. Then he or she follows links to other web resources. There is always a possibility that the user may leave a site without following any outbound link and start viewing documents from a random page. The PageRank algorithm estimates the probability of this event as 0.15 at each step. The probability that our user continues surfing by following one of the links available on the current page is therefore 0.85, assuming that all links are equal in this case. If he or she continues surfing indefinitely, popular pages will be visited many more times than the less popular pages.
The PageRank of a specified web page is thus defined as the probability that a user may visit the web page. It follows that the sum of probabilities for all existing web pages is exactly one, because the user is assumed to be visiting at least one Internet page at any given moment.
Since it is not always convenient to work with these probabilities, PageRank can be mathematically transformed into a more easily understood number for viewing. For instance, we are used to seeing a PageRank number between zero and ten on the Google Toolbar.
According to the ranking model described above:
* Each page on the Net (even if there are no inbound links to it) initially has a PageRank greater than zero, although it will be very small. There is a tiny chance that a user may accidentally navigate to it.
* Each page that has outbound links distributes part of its PageRank to the referenced page. The PageRank contributed to these linked-to pages is inversely proportional to the total number of links on the linked-from page: the more links it has, the lower the PageRank allocated to each linked-to page.
* A “damping factor” is applied to this process so that the total distributed PageRank is reduced by 15%. This is equivalent to the probability, described above, that the user will not visit any of the linked-to pages but will navigate to an unrelated website.
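The random-surfer model above can be sketched in a few lines of Python. This is a simplified illustration, not Google's implementation. The four-page link graph is invented, the number of iterations is arbitrary, and the treatment of pages without outbound links (their rank is spread evenly over all pages) is an assumption that the text does not cover.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate visit probabilities for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the "random jump" share, so no rank is ever zero.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The damped rank is split equally among the outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web; page "d" has no inbound links at all.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

Page "d" still ends up with a small non-zero value, and the values add up to one, which matches the first bullet and the probability interpretation given earlier.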
Let us now see how PageRank might influence the process of ranking search results. We say “might” because the pure PageRank algorithm just described has not been used in the Google algorithm for quite a while now. We will discuss a more current and sophisticated version shortly. There is nothing difficult about the PageRank influence: after the search engine finds a number of relevant documents (using internal text criteria), they can be sorted according to PageRank, since it would be logical to suppose that a document having a larger number of high-quality inbound links contains the most valuable information.
Thus, the PageRank algorithm "pushes up" those documents that are most popular outside the search engine as well.
Google PageRank – Practical Use
Currently, PageRank is not used directly in the Google algorithm. This is to be expected since pure PageRank characterizes only the number and the quality of inbound links to a site, but it completely ignores the text of links and the information content of referring pages. These factors are important in page ranking and they are taken into account in later versions of the algorithm. It is thought that the current Google ranking algorithm ranks pages according to thematic PageRank. In other words, it emphasizes the importance of links from pages with content related by similar topics or themes. The exact details of this algorithm are known only to Google developers.
You can determine the PageRank value for any web page with the help of the Google Toolbar, which shows a PageRank value within the range from 0 to 10. It should be noted that the Google Toolbar does not show the exact PageRank probability value, but the PageRank range a particular site is in. Each range (from 0 to 10) is defined according to a logarithmic scale.
Here is an example: each page has a real PageRank value known only to Google. To derive a displayed PageRank range for their Toolbar, they use a logarithmic scale as shown in this table:
Real PR          Toolbar PR
1-10             1
10-100           2
100-1,000        3
1,000-10,000     4
Etc.
This shows that the PageRank ranges displayed on the Google Toolbar are not all equal. It is easy, for example, to increase PageRank from one to two, while it is much more difficult to increase it from six to seven.
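The table can be expressed as a small Python function. The base of 10 comes from the illustrative table only, since the real scale is not published. Treating each range as including its lower bound, and mapping scores below 1 to a Toolbar value of 0, are assumptions made to get a well-defined function.

```python
def toolbar_value(real_pr: float, base: int = 10, top: int = 10) -> int:
    """Map a hypothetical "real" PageRank score to a 0-10 Toolbar value.

    Assumes the illustrative base-10 ranges from the table, with each range
    including its lower bound and excluding its upper bound.
    """
    if real_pr < 1:
        return 0  # assumption: the table does not define this case
    level, upper = 1, base
    while real_pr >= upper and level < top:
        level += 1
        upper *= base
    return level


for score in (5, 50, 500, 5000):
    print(score, "->", toolbar_value(score))
```

On this scale, each additional Toolbar point requires roughly ten times the underlying score of the previous one, which is why the step from six to seven is so much harder than the step from one to two.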
In practice, page rank is mainly used for two purposes:
1. Quick check of the site’s popularity. PageRank does not give exact information about referring pages, but it allows you to quickly and easily get a feel for the site’s popularity level and to follow trends that may result from your SEO work. You can use the following “rule of thumb” measures for English language sites: PR 4-5 is typical for most sites with average popularity. PR 6 indicates a very popular site while PR 7 is almost unreachable for a regular webmaster. You should congratulate yourself if you manage to achieve it. PR 8, 9, 10 can only be achieved by the sites of large companies such as Microsoft, Google, etc. PageRank is also useful when exchanging links and in similar situations. You can compare the quality of the pages offered in the exchange with pages from your own site to decide if the exchange should be accepted.
2. Evaluation of the competitiveness level for a search query is a vital part of SEO work. Although PageRank is not used directly in the ranking algorithms, it allows you to indirectly evaluate relative site competitiveness for a particular query. For example, if the search engine displays sites with PageRank 6-7 in the top search results, a site with PageRank 4 is not likely to get to the top of the results list using the same search query.
It is important to recognize that the PageRank values displayed on the Google Toolbar are recalculated only occasionally (every few months) so the Google Toolbar displays somewhat outdated information. This means that the Google search engine tracks changes in inbound links much faster than these changes are reflected on the Google Toolbar.
AltaVista
In a time when Google is pre-eminent in the search business, it is difficult to conceive that it was ever otherwise. Once upon a time AltaVista was the 800-pound gorilla in the search engine jungle. AltaVista was originally conceived to showcase Digital Equipment Corporation’s technology. In the spring of 1995 DEC launched the Alpha 8400, a high performance database server. The AltaVista spider first started indexing the Web on the 4th of July, 1995. A team led by Dr Louis Monier unveiled AltaVista on the 15th of December 1995. This search engine used several hundred robots running in parallel to index a much larger portion of the Web than its predecessors. The system was also fast. Monier’s team had growth in mind. The system could be expanded to cope with increasing popularity. More than 300,000 people used AltaVista on the first day and within 12 months the system was handling 19 million requests per day. For the unsophisticated Web of the mid-1990s it also delivered pretty good results based largely on on-page factors, especially for those searchers who mastered the advanced query interface. AltaVista added more services, in particular Babelfish. Named after a creature in the book The Hitchhiker’s Guide to the Galaxy, the Babelfish could automatically translate Web pages into a myriad of languages.
AltaVista’s subsequent decline, caused by a mixture of ambition and hubris, should serve as a lesson for anyone who bases their business around the results delivered by a single search engine.
At the start of 1998 Compaq, which had grown from a maker of luggable IBM PC clones to the world’s largest personal computer manufacturer, swallowed the once mighty DEC. It was the second wave of the dot.com boom. Compaq spun out AltaVista with the idea of an Initial Public Offering (IPO). Other search engines such as Yahoo! and Excite had already gone down the same road and had brought their founders and investors vast wealth. However, the window of IPO opportunity was fast closing.
By 1999 search engines were viewed as being passé. Portals were all the rage. A portal would act as a focus for a surfer’s activity on the Web and would provide the owner multiple channels to market products. AltaVista recast itself as a portal and even started to offer Internet access. In the United Kingdom it went as far as to announce unmetered access, at a time when AOL, the biggest online provider, charged by the hour. Unfortunately for AltaVista the telecommunications market wasn’t ready. The botched announcement cost the UK boss, Andy Mitchell, his job and damaged AltaVista’s reputation.
The move to a portal also detracted from the core search business. Users had to cut through the cruft to get to search, then found the results cluttered with sponsored links. It then emerged that, with the notable exception of paid inclusion, the index hadn’t been updated in months. By the end of 1999 MSN Search dropped AltaVista as its provider. With stale content and untrustworthy results, users began to desert in droves to the simple, search focussed interface of the new kid: Google. In February 2003 Overture acquired AltaVista for $140 million, a fraction of its $2.3 billion valuation at the height of the dot.com boom. Although AltaVista had survived the dot.bomb, this led some wags to dub the search engine AltaBusta.
AltaVista holds a number of search related US patents including methods for identifying duplicate content in indexes (5,970,497 and 6,138,113) and a method for spidering and indexing the Web (6,021,409).
Canonical URLs
The term canonical is derived from mathematics; a canonical URL is a URL in its simplest or standard form. The term is widely used within SEO circles. For example, a home page could have multiple URLs:
• http://www.mysite.com/
• http://mysite.com/
• http://www.mysite.com/index.php
These are different URLs as far as a search engine is concerned, as technically a Web server could return different content for each. However, many web servers are configured to return exactly the same content:
index.html
In this case we should pick one version of the URL, the canonical form, and use this both internally and externally. All other forms should use an HTTP 301 permanent redirect to send search engine robots (and users) to the correct version.
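The following Python sketch shows one way the redirect decision could be expressed. It assumes that http://www.mysite.com/ is the form chosen as canonical and that index.php and index.html are the server's default documents. Both are illustrative choices, and a real site would normally configure this in the web server rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: the chosen canonical host, its bare variant,
# and the default documents the server treats as equivalent to "/".
CANONICAL_HOST = "www.mysite.com"
BARE_HOST = "mysite.com"
DEFAULT_DOCUMENTS = ("index.php", "index.html")


def canonical_url(url: str) -> str:
    """Return the canonical form of a URL on the example site."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST
    path = parts.path or "/"
    for name in DEFAULT_DOCUMENTS:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url: str):
    """Return (301, target) if the URL needs a permanent redirect, else None."""
    target = canonical_url(url)
    return None if target == url else (301, target)


for variant in (
    "http://www.mysite.com/",
    "http://mysite.com/",
    "http://www.mysite.com/index.php",
):
    print(variant, "->", redirect_for(variant))
```

Only the canonical form is served directly; every other variant answers with a 301 pointing at it, so inbound links to any variant are credited to a single address.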
Concept-based search
Concept-based search identifies and suggests alternative search queries that are closely related to the user’s search query. The idea is to focus a user’s search activity from the general, where lots of results are returned, to the more specific, with fewer, better matching results.
One way for search engines to implement concept based search is to examine how closely the results match those obtained from other searches. If there is a close match it is likely that the two search queries are related and the second query can be suggested as an alternative. Analyzing clicks can also reveal relationships. If two different queries both result in a large number of clicks on the same link, these queries may be considered as related.
The popularity of searches can also be used to match independent queries. Microsoft has filed a patent application (Method for finding semantically related search engine queries; 20060248068; 2nd November, 2006) based on this concept. As an example, the change in popularity for searches about the “winter olympics” might match those for “curling” or “Bode Miller” (a downhill skier). Microsoft’s invention analyzes the density of a given query at various points in time, that is, how many searches there are for “winter olympics” compared to the overall number of searches. This removes global effects, such as a rise in the overall popularity of the search engine, that would otherwise affect the results. A mathematical process called Fourier analysis can be used to make rapid comparisons between the various results.
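The density idea can be illustrated with a short Python sketch. All the weekly counts are made up. The comparison uses a plain Pearson correlation rather than the Fourier analysis mentioned in the patent application, so it should be read as a simplified stand-in for the technique, not a reproduction of it.

```python
from math import sqrt


def density(query_counts, total_counts):
    """Share of all searches taken by one query in each time period."""
    return [q / t for q, t in zip(query_counts, total_counts)]


def correlation(xs, ys):
    """Pearson correlation between two equally long, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up weekly counts; overall search volume doubles over the period.
total = [1000, 1200, 1500, 1800, 2000]
olympics = [10, 12, 60, 90, 20]
curling = [2, 3, 14, 20, 5]
tax_forms = [40, 30, 20, 15, 10]

related = correlation(density(olympics, total), density(curling, total))
unrelated = correlation(density(olympics, total), density(tax_forms, total))
print(round(related, 2), round(unrelated, 2))
```

Dividing by the total volume first means that a query which grows only because the search engine itself is growing does not look like a trend.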
Domain Parking
It is possible to make money from domain names without even setting up a website. Services such as http://domainspa.com/, http://namedrive.com/ and http://trafficparking.com/ let you park domains. If type-in traffic arrives at the parked domain it is redirected to a template page. Advertisers that are part of the domain parking service’s network bid for keywords. If any of these match keywords in the parked domain they are displayed on the template page either in the form of adverts or links. Advertisers are charged for click through traffic and a percentage goes to the domain owner. Ads are usually geo-targeted, by language, country or even city. At the same time it is usually possible to advertise the domain as “for sale”.
Although not an SEO technique in itself, domain parking can be an option for domains prior to building a website. With domains costing relatively little to register, domain parking can be a viable option for well researched domain names. However, you share a lot of your revenue with the domain parking service and you are very unlikely to get repeat visits or links, unless the domain was previously owned. If the parked domain is getting significant traffic you should consider developing a minisite.
Duplicate Content
A great deal of the Web is duplicate or near-duplicate content. Documents may be served in different formats (HTML, PDF, text) for different audiences. Documents may get mirrored to avoid delays or to provide fault tolerance. Content is syndicated and re-branded for different audiences and markets. Some websites aggregate or incorporate content from other sources on the Web, the most common example being RSS news feeds. Affiliate websites present identical storefronts with only cosmetic changes. Press releases are often duplicated by many media outlets. Businesses wishing to protect their trademarks often register different versions of their domain name which all point to the same content but look like different websites from the point of view of a search engine. Content management systems, forums and blogs are often designed to let the same content be accessed through alternative URLs.
Finally there is a problem of plagiarism and copying from public domain sources, such as Wikipedia, the Open Directory Project and Project Gutenberg. This is often done to create large, content rich sites in order to manipulate rankings and generate revenue based on content targeted advertising.
When users submit queries to search engines they do not want the results pages stuffed with many duplicate or near duplicate pages. Indexing and filtering near duplicate content also puts a load on search engines in terms of storage and computational resources. Algorithms already exist for efficiently classifying duplicate content. For example, a hash function can generate a numeric fingerprint representing a page’s content. Pages with identical fingerprints can be dropped from search results and excluded by robots when they next index pages.
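A minimal Python sketch of the fingerprint idea follows. The choice of SHA-256 and the normalization step (lower-casing and collapsing whitespace) are assumptions made for the example, and the pages and addresses are invented. Search engines do not publish the exact functions they use.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the page text after collapsing whitespace and case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages):
    """Keep the first URL seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for url, text in pages:
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique


pages = [
    ("http://example.com/a", "Press release: new product launched."),
    ("http://mirror.example.net/a", "Press  release: NEW product launched."),
    ("http://example.com/b", "A different article."),
]
print(drop_exact_duplicates(pages))
```

Comparing fixed-length fingerprints is cheap, but changing a single word produces a completely different hash, which is why this approach only catches exact duplicates.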
Near duplicate pages are more complicated. Both AltaVista (now owned by Yahoo!; patents 5,970,497 and 6,138,113) and Google (6,615,209 and 6,658,423) have been awarded US patents that improve on existing methods for classifying duplicate content. The secret is to make comparisons quickly without doing some kind of word-by-word matching. One of AltaVista’s patents looks for similarities in the outbound links on a page. Google’s patents focus on generating hashes or fingerprints for parts of a page rather than the whole page. Now to you and me neither of these ideas would seem to be that novel, and they probably took less than a wet Sunday afternoon in Menlo Park to conceive, but you have to remember that the US patent office also gave a patent for how to use a garden swing (US Patent No. 6,368,227). The patent land-grab is also about having some bargaining chips with other companies; many of these patents would stand up about as well as a beach condo in a Florida hurricane if tested in court. However, they do have the effect of discouraging new entrants to the market.
Microsoft has also gotten into the game with a patent application (20060248066) for a “system and method for optimizing search results through equivalent results collapsing”. This patent is based on a method known as shingleprints which is the subject of a previous patent application (20050210043). A shingleprint reduces a document to a set of features that are representative of the document. For example, this could be all the proper nouns in the document. The number of common features divided by the total number of features gives a number between 0 and 1. Essentially similar documents will have a score closer to 1.
Both Microsoft’s and Google’s patents are capable of identifying duplicate content that is either a subset of another document or substantially similar. Google suggests that the most relevant document is returned in the results pages. This could be the most recent (although to my mind most recent would imply a copy) or the document with the highest page rank. Microsoft says that user clicks could be used to select the most popular version to return in future queries. Probably the biggest target in Google’s sights at the moment is the many duplicates of public domain content such as Wikipedia. Some webmasters have found their original pages have been dropped in favor of mirrors, so the system is not without flaws. The system should also foil domain spammers who register many different domain names under different keywords all pointing to the same website. Google keeps many of what it considers duplicate pages in its secondary supplemental results index.
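The ratio described above (common features divided by total features) can be sketched in Python as follows. Using overlapping three-word sequences as the features is an assumption made for the example. The patent application mentions proper nouns as one possibility, and this sketch does not claim to reproduce Microsoft's method.

```python
def features(text: str, size: int = 3) -> set:
    """Reduce a document to a set of overlapping word sequences."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Common features divided by all distinct features, between 0 and 1."""
    fa, fb = features(a), features(b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


original = "the quick brown fox jumps over the lazy dog"
near_copy = "the quick brown fox leaps over the lazy dog"
unrelated = "search engines index pages on the web"

print(similarity(original, original))
print(similarity(original, near_copy))
print(similarity(original, unrelated))
```

A search engine could then treat any pair scoring above some threshold as near duplicates; where that threshold sits is a tuning decision that this sketch leaves open.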
The implication of all this from an optimization perspective is that search engines are getting increasingly sophisticated in identifying duplicate content. Building a site using duplicate content to inflate rankings will become increasingly difficult.
Mini site
A mini site is a website that is focussed on a single topic. The aim is not to build relationships with visitors or provide a wide coverage of a subject but to get users to take an action such as buy a product (typically an eBook), click on an affiliate link or sign up to a newsletter, or all three. This should be achieved without burning up a lot of resources such as bandwidth. You could also use a minisite to provide information about a current trend; for example, on 11 November 2006 Hershey announced a recall of their chocolate bars due to Salmonella. I checked and the domains:
hershey-recall.com
hersheyrecall.com
were available. According to Wordtracker this was one of the top searches in November. I would not expect much type-in traffic for this subject but you could register the hyphenated version, slap up a mini site with facts you find on the recall, add some good inbound links so the site gets spidered quickly and hope to make some money from advertising before the trend dies. However, before you dash off to your nearest registrar, I checked on Google AdWords and advertisers were only bidding around 20 cents for clicks on Salmonella, although they were paying a dollar on Hershey.
By being focussed, visitor choice is streamlined and product, affiliate and content targeted advertising can be extremely relevant. Mini sites range from one to a number of pages; there is no hard and fast rule except that they are single topic. By concentrating on one subject there are possibilities for search engine optimization and type-in traffic in terms of keywords in domains, URLs and on-page elements as well as inter and cross-linking. The generally small number of pages makes it easier to experiment with structure and layout. Mini sites will usually have a shallow structure (the famous “2-clicks” rule), which makes it easier for search engine robots to find and index the content.
Although some mini sites are extremely popular and earn a lot of money, most will bring in much more modest revenue. The aim should be to build the site rapidly and then do very little in the way of updates. A network of mini sites could earn more than a single site covering a lot of bases and be much lower maintenance. This obviously has an effect on the subject matter. Spreading effort over a number of sites also spreads the risk if one of the sites suffers a drop in popularity due to increased competition or a change in market.
Building a network of mini sites is not simply a case of taking your old macro site and splitting it down by subject area. The SEO benefit will be minimal; there will be no increase in page rank as you have the same amount of content. However, there is an argument for spinning off sections of large sites as mini sites. You can benefit from a keyword rich domain name. If the site has useful content, people will link directly to the site using this domain, which has page rank and anchor text benefits.
Because revenue, at least initially, can be very low, hosting costs need to be kept to a minimum. Some have had success building mini sites on free hosting packages, either using the free host’s domain name or by a redirect. However, hosting services frequently place their own advertising on these sites. The other solution is to run your own web server or virtual web server. Packages are not that expensive. This lets you direct as many domains as you like (and your web server can cope with) to a single Internet address. There is a caveat: having a number of sites on a single address is not unusual (this is how many hosting packages work), but a deeply interlinked network of sites on a single address may look like a link farm to a search engine. The aim of your mini sites is to garner inbound links.
Misspellings
To err is human, and all too common it would seem. For example Google ran a project to analyze misspellings of the first name of Britney Spears, a singer, over a three month period from information provided by their spelling correction system:
http://www.google.com/jobs/britney.html
Over 20% of the queries were incorrectly spelt, with the two most common errors, brittany and brittney, covering around 16% of searches. Assuming people don’t accept the correction suggested by Google or Yahoo!, that is an awful lot of searches going somewhere. Britney Spears may not be the easiest name to spell; there is an urban legend that her parents named her after the province of Brittany in Western France, where they had once taken a vacation, but didn’t know how to spell the name correctly.
Not all errors are misspellings. Some are good old fashioned typos; these commonly involve forgotten letters and reversed letter pairs in a word. Examples are traslation instead of translation and eihgt instead of eight.
Domain Names
There are two ways we can use misspellings. We can register domains for misspellings of popular keywords, brands and existing domain names in the hope of piggy-backing off other people’s SEO efforts. This borders on cyber-squatting and can have legal ramifications if the name is a trademark. Sometimes you can find that a link from an authority or high PageRank site will use the misspelled domain name buying you some instant credibility with search engines, until the webmaster notices his error. This also has the corollary that if we are going to invest money establishing a domain name we should also consider registering common misspellings in addition to registering in different countries to protect our brand. This can begin to cost quite a bit of money in registration fees and may only be worthwhile for well funded sites.
As an example, common misspellings of the domain: google.com are gogle.com, googel.com and goolge.com. All of these redirect to Google’s home page. Google missed a few though. Googl.com, gpogle.com and goolgle.com redirect to sites totally unrelated to Google and would appear to exist simply to profit from the Google brand.
Misspellings can also be incorporated into the content of pages. I recently noticed that a lot of searches to a site I manage used a common misspelling. This occurred in a couple of places in the text and, because it was a proprietary term, it was not found by the spell checker built into some search engines. I was about to fix the page when I decided to check out the Wordtracker and Overture databases to see how many people searched on the correct and incorrect spelling. I was surprised to find that the misspelling was actually more popular. Checking Google and Yahoo!, it was clear that most websites spelt the term correctly, so my page had risen to number one in the search results because it was well optimized and there was very little competition.
Common Misspellings
It is fairly easy to come up with misspellings for your target keywords. Try reversing letter combinations, dropping letters or substituting letters that are close to each other on the keyboard; for example, the letters R and T may get swapped. Phonetic spellings are also common, as Ms Spears demonstrates. Before creating pages full of misspellings you should check whether anyone actually searches on the terms; Overture and Wordtracker are your friends. See how much competition there is from orthographically challenged webmasters. Wikipedia, amongst other resources, has data on frequently misspelled words:
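The three tricks just described (dropped letters, reversed pairs, keyboard neighbours) are easy to automate. Here is a minimal Python sketch; the function name and the small neighbour map are my own assumptions for illustration, and the map covers only a handful of keys.

```python
# Hypothetical, deliberately tiny map of keyboard neighbours (an assumption);
# a real one would cover the whole keyboard layout.
KEYBOARD_NEIGHBOURS = {"r": "t", "t": "r", "e": "w", "o": "p", "i": "o"}


def misspellings(word):
    """Return a sorted list of simple typo variants of a lowercase word."""
    variants = set()
    for i in range(len(word)):
        # Missing letter: drop the character at position i.
        variants.add(word[:i] + word[i + 1:])
        # Reversed letter pair: swap the characters at positions i and i + 1.
        if i < len(word) - 1:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        # Keyboard slip: replace the character with a neighbouring key.
        neighbour = KEYBOARD_NEIGHBOURS.get(word[i])
        if neighbour:
            variants.add(word[:i] + neighbour + word[i + 1:])
    # Swapping a doubled letter reproduces the original word, so remove it.
    variants.discard(word)
    return sorted(variants)


if __name__ == "__main__":
    print(misspellings("google"))
```

The output is only a candidate list. Each variant still needs checking against real search volumes before it is worth targeting.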
http://en.wikipedia.org/wiki/List_of_common_misspellings
Forums are also a source of common misspellings for specialized areas. For example in skiing many Anglophones spell the French winter resort of Courchevel as Courcheval.
Incorporating misspellings into web pages is more challenging. Having a site full of spelling mistakes won’t impress visitors and potential advertisers much. It would be possible to use techniques such as entry pages which send the user to the correct version of the page. Search engines may see this as a Black Hat technique, although if the content of the real and entry pages is identical it is really providing a service to the end user. If your website is database driven you could take a list of common misspellings, words such as effect spelt as affect, and automatically generate duplicate content pages substituting misspellings for the real words. With the introduction of spell checking of queries by search engines the effects will be somewhat diluted. Google for one also seems to be aware of common misspellings and language differences (e.g. color and colour) and indexes pages with alternate versions of words.
You could also use inbound-links with misspellings in the anchor text. Given the value of good inbound-links this technique should be used sparingly, although some kind of internal, search engine friendly site map with major misspellings could be an idea.
Automatic Correction of URLs
Not directly related to search engine optimization is the problem of mistyped Uniform Resource Locators (URLs), whether typed by users directly into the browser address bar or entered by webmasters who make errors in links. A typo will generate an error on the web server, commonly known as a 404 Not Found error after its HTTP (HyperText Transfer Protocol) status code. It is a good idea to trap these errors and redirect the user either to the home page or to a site map so they can try to find the right link. As part of this process it is also possible to spell check the URL to try and locate the correct resource name. Filters such as mod_speling
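As a rough illustration of spell checking a URL, the Python sketch below compares a requested path against a list of known paths and falls back to the site map when nothing is close. It uses the standard library difflib module. The path list, function names and similarity cutoff are assumptions of mine, and this is not how mod_speling itself works.

```python
import difflib

# Hypothetical list of valid paths on the site (an assumption for this sketch).
KNOWN_PATHS = [
    "/index.html",
    "/sitemap.html",
    "/articles/translation.html",
    "/contact.html",
]


def suggest_path(requested, known_paths=KNOWN_PATHS, cutoff=0.8):
    """Return the closest known path for a mistyped path, or None."""
    matches = difflib.get_close_matches(requested, known_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def handle_not_found(requested):
    """Pick a redirect target for a request that would otherwise return 404."""
    # Fall back to the site map when no known path is similar enough.
    return suggest_path(requested) or "/sitemap.html"


if __name__ == "__main__":
    print(handle_not_found("/articles/traslation.html"))
    print(handle_not_found("/zzzz"))
```

The cutoff is a trade-off: set it too low and visitors get sent to unrelated pages, set it too high and obvious typos fall through to the site map.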
Spam
The term Spam comes from a Monty Python comedy sketch set in a trucker’s café. All the dishes on the menu come with spam - a type of tinned spiced ham. In the computer world spam is used to denote excessive repetition: multiple posts, usually commercial, to forums and unsolicited email are the two most frequent examples. For SEOers the term includes the excessive use of keywords, duplicate content, unnatural link structures and the posting of links to guestbooks and membership lists.
Blog comment, guestbook and member list spam
The blog or weblog phenomenon has done a great deal to revitalize interest in the Internet following the dot-com bust. By using a pre-packaged content management system (CMS), blogs enable even technical neophytes (aka newbies) to publish their words. Blogs range from personal diaries right through to online newspapers written by professional writers and journalists who enjoy the editorial freedom the medium offers.
Blogs also have two features which attract high search engine rankings. Bloggers link freely to other sites, creating dense inter-linking between highly themed content. Bloggers are also prolific, creating large quantities of fresh content. Blogs were designed from the start to be interactive. Readers can post comments and usually include links to other sources. These features mean that the most popular blogs have PageRanks of 7.
The popularity of blogs was quickly spotted by people wishing to manipulate search engine results. They could boost the rankings of their own sites by using the comment, guestbook or member list features that are part of most blog software. Typing blog, weblog or guestbook into Google will bring up many high-ranking targets, especially when the query is combined with the inurl operator. Usually a spammer’s comment is completely irrelevant and is posted to multiple blogs as part of the same campaign:
Great article about global warming, why don’t you cool off a bit check out this page on hot babes?
Spammers even run automated scripts known as spambots. These attempt to post comment spam to sites running well known blog software. The aim is quantity rather than quality but it can mean that a single site gets hit by huge numbers of comments, often posted at the same time. Spammers are hard to trace as the spambots are frequently run on networks of hijacked machines referred to as botnets.
Blog spam has the advantage of keyword rich anchor text coupled with highly ranked pages. The aim is not just to get click-through traffic but to subvert the ranking algorithms used by search engines. The fresh content offered by blogs means they get frequent visits from search engines. A day spent spamming the most popular blogs can rapidly boost a website to the top of the search engine results pages. As is often the case on the Web some of the most virulent spammers are pushing adult content sites and cover their tracks using anonymous proxies and compromised zombie hosts.
The popularity of this technique has spread rapidly and blog spammers have soon found themselves in an arms race. They have to visit the best blogs on an ever more frequent basis as other messages soon push their links off the coveted and highly ranked home page into search engine oblivion.
Needless to say blog owners are none too happy with this state of affairs. Some have removed comment pages or disabled the capability to post links. Others, wishing to preserve the spirit of the medium, spend hours moderating and removing a veritable tidal wave of spam. Technical solutions have been adopted, disguising outbound-links using JavaScript or rerouting links via a hidden page to stop anchor text and PageRank benefits from being transferred. Automated systems block links to known spammers or links using popular spam anchor text words.
Comments: I just wanted to say WOW! your site is really good and im proud to
be one of your perm. surfers, be sure to my penis enlargement pills project
site, dont laugh! here is my penis enlargement pills site: penis enlargement pills
Spam protection may have the effect of intensifying spam as spambots may take an ever more scattergun approach to posting. One theory on why spammers are so poor at grammar and spelling is that it helps trick automatic (Bayesian) spam filters. I suspect that after typing in 500 spam messages in a session they just get lazy.
Referrer spam
Referrer spam shows just how ingenious people can be in finding ways to manipulate search engine rankings. When someone clicks on a hyperlink their browser opens the new web page. As part of the communications process (called HTTP, which stands for HyperText Transfer Protocol) their browser sends the web address (URL) of the page that contained the hyperlink. This address is called the Referrer. The web server hosting the new page will log this address, which is useful for traffic analysis, for example to judge the effectiveness of inbound-links.
Referrer spam has become an increasing problem. A spammer makes repeated requests to a site with the Referrer set to the address of the page being promoted; if the site publishes its log reports, that address appears there as a link. Spammers have armies of zombie hosts or botnets at their command ready to launch a campaign. These zombies are computers on the Internet where the spammer has installed a server by using some security flaw in the operating system, usually Windows. Often a scattergun approach is adopted: the spammer doesn’t know whether the log file is indexed by search engines and hopes that at least a percentage of the spam will make it through. Webmasters running Apache can look at the mod_security package as a way to combat this kind of spam by blocking popular keywords in referrer addresses; examples would be poker, Viagra and loans.
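The keyword blocking idea can be sketched in a few lines of Python. This is not mod_security rule syntax, just the same check in plain code. The function name is hypothetical and the keyword list is the three examples from the text, so a real blocklist would be much longer.

```python
import re

# The three example keywords from the text; a real blocklist would be longer.
BLOCKED_KEYWORDS = ("poker", "viagra", "loans")
BLOCKED_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in BLOCKED_KEYWORDS),
    re.IGNORECASE,
)


def is_referrer_spam(referrer):
    """Return True if the referrer address contains a blocked keyword."""
    # An empty or missing referrer is not treated as spam here.
    return bool(referrer) and BLOCKED_PATTERN.search(referrer) is not None


if __name__ == "__main__":
    for address in ("http://cheap-Viagra.example/", "http://www.example.org/skiing"):
        print(address, is_referrer_spam(address))
```

Keyword matching is blunt. A legitimate site about poker strategy would be blocked too, so the list needs to suit the subject of your own site.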
The technique is definitely frowned upon by search engines and can get you banned from their index. It manipulates search engine rankings by creating what are in effect fake inbound links. It subverts the HTTP Referrer mechanism. It clogs log files with bogus information and it consumes resources on the target web server.
Spammers may counter that it is up to server administrators to protect against this form of manipulation but that is like saying that homeowners must lock their doors or risk being robbed. There is usually no good reason to have log files publicly viewable. The log files should be password protected and preferably not visible to the Internet. Webmasters can also use a robots.txt file to tell search engines not to index the directory containing their logs and can turn off the referrer feature in their CMS. Log reports have many outbound-links on a single page so the overall benefit of each link is limited.
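To check that a robots.txt file really does keep spiders out of the log directory, you can test it with the robots.txt parser in the Python standard library. In this sketch the /logs/ directory name, the example.com addresses and the helper function are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /logs/ is an assumed directory name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/logs/referrers.html"))
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/index.html"))
```

Bear in mind that robots.txt is only a request to well-behaved robots. It does not stop anyone from reading the files, which is why password protection remains the better safeguard.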
Keyword Spam
Keyword spam is the excessive repetition of keywords on a page. It is usually done using HTML elements that are indexed by search engines but are not prominent to users, including Title, Meta, and Alt text. Spammers have found that they can disguise keywords in the contents of the page by making the text the same color as the background and tucking it away at the bottom of the page. However, this still takes up space so may be noticed by competitors, particularly if they type CTRL-A to highlight all the text on a page. It is possible for search engines to detect text which is the same color as the background and this could flag that the page is using spammy techniques. Microsoft Search claims to automatically penalize such pages.
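One crude way to see whether a page is at risk of looking stuffed is to measure what share of its words is taken up by the most frequent term. The Python sketch below does that. The function names and the 20% threshold are arbitrary assumptions of mine, not figures published by any search engine.

```python
import re
from collections import Counter


def keyword_density(text, top=3):
    """Return (word, share of all words) pairs for the most frequent words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]


def looks_stuffed(text, threshold=0.2):
    """Flag text where one word exceeds the (assumed) density threshold."""
    density = keyword_density(text, top=1)
    return bool(density) and density[0][1] > threshold


if __name__ == "__main__":
    sample = "cheap hotels cheap hotels cheap hotels book cheap hotels now"
    print(keyword_density(sample))
    print(looks_stuffed(sample))
```

On real pages you would want to ignore very common words such as "the" before counting, otherwise perfectly natural text can trip the threshold.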
An extension of the hidden text idea is to hide the keyword spam using style-sheets (CSS). This gives the spammer great scope for stuffing keywords into important elements such as Headings without them being noticed. A style could, for example, format all Heading 1 text as 1pt high white text.
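The style itself is not reproduced in this copy of the article, so the rule in the Python sketch below is my own guess at what such a style would look like. The sketch shows how a checker might flag rules that combine a tiny font with white text. The function name, the 2pt limit and the list of "white" values are all assumptions, and it presumes the page background is white.

```python
import re

# Hypothetical rule of the kind described above (not taken from the article).
SAMPLE_CSS = "h1 { font-size: 1pt; color: white; }"


def hidden_text_rules(css, max_points=2.0, hidden_colors=("white", "#fff", "#ffffff")):
    """Return selectors whose rules combine a tiny font with white text."""
    flagged = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        size = re.search(r"font-size\s*:\s*(\d+(?:\.\d+)?)pt", body, re.IGNORECASE)
        # The lookbehind skips properties such as background-color.
        color = re.search(r"(?<![-\w])color\s*:\s*([^;]+)", body, re.IGNORECASE)
        tiny = size is not None and float(size.group(1)) <= max_points
        white = color is not None and color.group(1).strip().lower() in hidden_colors
        if tiny and white:
            flagged.append(selector.strip())
    return flagged


if __name__ == "__main__":
    print(hidden_text_rules(SAMPLE_CSS))
```

A regular expression is only good enough for a quick audit of your own style sheets. It will miss text hidden by other means, such as positioning it off screen.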
Search Engines and Spam
Tackling spam in results has been one of the major efforts of search engines over the last couple of years. For example in November 2006 Microsoft filed patent application 20060248072 outlining a system and method for spam identification. The method takes a multi-pronged approach including identifying pages that look like spam and incorporating user feedback into search results. Microsoft says that its user base of searchers is the best way of identifying whether results are spam. It suggests that something as simple as a toolbar button could be used to flag a page as spam. To prevent a spammer marking competitor pages as spam the user would be tracked via their IP address or network to identify the type and quantity of sites being marked as spam and to compare this with other user input from different queries. An obvious weakness is that a botnet could be used to generate a large amount of feedback from random IP addresses.
Microsoft’s patent also suggests that user feedback would be combined with other algorithmic techniques. For example they could examine the percentage of content that is advertising (the so called MFA or Made for AdSense sites), whether there is keyword stuffing or if the site is part of a bad neighbourhood of spam related sites. It may also use intelligence from its content targeted advertising to identify the value of query terms, so called money words. These are terms where advertisers bid high rates such as “hotel” or “viagra”. Pages that satisfy these terms would have more aggressive spam filtering than non-commercial websites. Less aggressive filtering may also apply to sites that a user visits regularly and sites that they link to, so called authority sites. This data could be gathered through the user’s tool bar.
Stem Words
Stemming is the ability to automatically search for different forms of a keyword. If the word computers is queried, the search engine may also return pages containing computing, computed, computer, computation etc. Computer is the stem or root word.
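To get a feel for how stemming groups word forms together, here is a deliberately naive suffix-stripping sketch in Python. The suffix list and function names are my own assumptions. Real search engines use far more careful algorithms, and a crude stemmer like this one produces a truncated stem (comput) rather than a dictionary word.

```python
# Suffixes to strip, longest first so "ers" is tried before "er" and "s".
# This short list is an assumption chosen to cover the example words only.
SUFFIXES = ("ation", "ers", "ing", "er", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip one known suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def matches(query_word, page_words):
    """Return the page words that share a stem with the query word."""
    target = naive_stem(query_word)
    return [word for word in page_words if naive_stem(word) == target]


if __name__ == "__main__":
    print(matches("computers", ["computing", "computed", "keyboard", "computation"]))
```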
Yahoo! and Google support stemming by default. Google introduced stemming around the time of the Florida update leading some pundits to suggest this was the cause of the major upheavals that some highly optimized commercial sites suffered. Google’s stemming algorithm provides a wider choice of results where the keywords used are too restrictive. You will notice it most on queries with three or more keywords.
Stemming means that it is no longer necessary to target different forms of a single word in optimizations. However the specific keywords will rank better than their stemmed variants.
Stop Words
Stop Words are words that are so common that they have little relevance to the context of a web page. Examples would be adverbs, conjunctions and prepositions. Excluding stop words saves resources on search engines with little effect on the quality of results.
Common stop words include
about, an, and, are, as, at, be, by, for, from, how
in, is, it, of, or, that, the, this, to, was, what, when
which, who, why, will, with
Searchers can ask search engines to include stop words by using the ‘+’ symbol before the stop word or by putting the entire search phrase in quotes but such searches are the exception rather than the norm. They are often used where the searcher knows an exact phrase from a page. A good example is the start of Hamlet’s soliloquy, “To be or not to be, that is the question”. On a search engine that ignores stop words the results will be very different. Google started indexing stop words in 2005.
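The behaviour described above can be sketched in Python as follows. The stop word set is the list given earlier; the function name and the exact handling of the ‘+’ prefix and quoted phrases are my assumptions about how such a filter might work, not a description of any particular search engine.

```python
# Stop words taken from the list above.
STOP_WORDS = {
    "about", "an", "and", "are", "as", "at", "be", "by", "for", "from", "how",
    "in", "is", "it", "of", "or", "that", "the", "this", "to", "was", "what",
    "when", "which", "who", "why", "will", "with",
}


def filter_query(query):
    """Return the query terms a stop-word-aware engine might actually use."""
    query = query.strip()
    # A fully quoted phrase is kept intact, stop words included.
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        return query[1:-1].lower().split()
    kept = []
    for token in query.lower().split():
        if token.startswith("+"):
            # A leading '+' forces the word to be kept.
            kept.append(token[1:])
        elif token not in STOP_WORDS:
            kept.append(token)
    return kept


if __name__ == "__main__":
    print(filter_query("how to bake the bread"))
    print(filter_query('"to be or not to be"'))
```

Run on Hamlet’s line without the quotes, the filter leaves only the word "not", which shows why exact phrase searches need the stop words.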
Except for these specific cases stop words may be avoided in phrases that target keywords. Examples would be in Anchor Text, Title elements and ALT (alternative) text in image links. This should not be taken to extremes; for example, headings should still include stop words where it helps the readability of the content.
Supplemental Results
Supplemental results were introduced to Google searches during 2003. Google says that the results are part of an auxiliary index with fewer constraints placed on pages. For example pages may be orphans, doorway pages with no inbound links, empty pages or have content that Google cannot index (the results relying on meta data). Results from the supplemental index are shown only where there are very few matches from the main index. It is like a final throw of the search dice to throw up some useful information. Supplemental cache results are frozen at the time they were indexed and will often be stale and may show information you no longer want to be public. Supplemental updates are infrequent and results can stick around for up to a year.
Cloaking
Cloaking describes the process of returning different content depending on whether the visitor is a search engine spider or end user. The content seen by the robot indexing the site can be highly optimized for that search engine and may even be completely different from the page the user will see. Search engines do not like this kind of manipulation of their results and cloaked pages can result in a ban. A software business selling spyware was kicked off both Google and Yahoo! when, they claim, their SEO company used cloaking to optimize their site. All the more reason to understand the techniques any paid SEO outfit may be thinking of using.
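If you want to check whether an SEO outfit has cloaked your own pages, one simple test is to request the same address twice with different user agent strings and compare what comes back. The Python sketch below shows the comparison only. The fetch function is passed in, and the two fake fetchers, the agent strings and the example address are stand-ins I have assumed so the example runs without a network connection.

```python
import difflib


def cloaking_score(fetch, url, browser_agent="Mozilla/5.0", spider_agent="Googlebot"):
    """Return the similarity (0.0 to 1.0) of the pages served to each agent."""
    browser_page = fetch(url, browser_agent)
    spider_page = fetch(url, spider_agent)
    return difflib.SequenceMatcher(None, browser_page, spider_page).ratio()


# Hypothetical stand-ins for a real HTTP fetch, used only for demonstration.
def honest_fetch(url, user_agent):
    return "<html>Welcome to our shop</html>"


def cloaked_fetch(url, user_agent):
    if "Googlebot" in user_agent:
        return "<html>keyword keyword keyword</html>"
    return "<html>Welcome to our shop</html>"


if __name__ == "__main__":
    print(cloaking_score(honest_fetch, "http://www.example.com/"))
    print(cloaking_score(cloaked_fetch, "http://www.example.com/"))
```

A score of 1.0 means both visitors saw the same page. A lower score is only a hint, since pages with rotating adverts or dates will also differ between requests, and cloaking that keys on the spider’s IP address will not be caught by changing the user agent alone.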
Competition
It is important to remember that you are not just trying to second guess how search engines work but are competing with thousands of other websites to get into the top ten of search engine results pages. In the SEO game there are only a few winners for a given set of keywords. Beating the competition is not a question of luck or chance but strategy. A campaign should be planned by selecting keywords, then studying the competition, analysing their strategy and then either doing it better or targeting points of weakness.
Content Management Systems (CMS)
Content Management Systems (CMS) are becoming increasingly popular for managing today's large and complex web sites. The actual content of the website is held in a database; MySQL is a very popular relational database choice as it is free and is often supplied as part of a web hosting package. The content is retrieved from the database and packaged into web pages by a software system running on the web server. The format of the pages can be highly customized by using templates and style sheets (CSS). From the user viewpoint the site looks like normal web pages.
Content Management Systems let website owners concentrate on the information in the site without worrying about detail such as creating pages in the Hyper Text Markup Language (HTML). Many large websites, particularly anything interactive such as news sites, blogs and forums, are driven by a CMS. Owners of complex sites with specific requirements often write their own software but many off-the-shelf packages, both free and commercial, are available. These are often written in the PHP or Perl programming languages. As with MySQL these two computer languages are free and frequently come as an integral part of web hosting packages. Search engines can spider pages generated by Perl, PHP, ASP.NET, ColdFusion, Python and Java, amongst other languages, provided the pages are reachable. Movable Type and pMachine are examples of popular Content Management Systems.
Just as the standard look and feel of a CMS will not suit most websites, most CMS packages are also poorly optimized for search engines straight out of the box. The focus of CMS designers is information delivery to human users, not search engine robots. There are a number of customizations that make Content Management Systems more search engine friendly.
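The database-plus-template idea can be shown in miniature with Python’s standard library. This sketch uses an in-memory SQLite database as a stand-in for MySQL, and the table layout, template and function names are all assumptions made for illustration rather than the design of any real CMS.

```python
import html
import sqlite3
from string import Template

# Hypothetical page template; a real CMS would load this from a file.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def build_db():
    """Create an in-memory database holding one sample page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
    conn.execute(
        "INSERT INTO pages VALUES (?, ?, ?)",
        ("stemming", "Stem Words", "Stemming searches for different forms of a keyword."),
    )
    return conn


def render_page(conn, slug):
    """Fetch the content for slug and package it into an HTML page."""
    row = conn.execute(
        "SELECT title, body FROM pages WHERE slug = ?", (slug,)
    ).fetchone()
    if row is None:
        return None
    title, body = row
    # Escape the stored text so it cannot break the page markup.
    return PAGE_TEMPLATE.substitute(title=html.escape(title), body=html.escape(body))


if __name__ == "__main__":
    print(render_page(build_db(), "stemming"))
```

Because the title comes straight from the database, the template decides what ends up in the Title element and the Heading 1, which is exactly where a search engine friendly CMS needs care.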
Thursday 13 December 2007
Common SEO Mistakes