Search Engines vs. SEO Spam: Statistical Methods
In the best case, SEO specialists create relevant, well-structured, keyword-rich pages that not only please the eyes of a search engine crawler but also have value for the human visitor. Unfortunately, it takes months for this strategic approach to produce tangible results, so many search engine optimizers resort to so-called "black-hat" SEO.
'Black Hat' SEO and Search Engine Spam
The oldest and simplest "black-hat" SEO strategy is adding a variety of popular keywords to web pages to make them rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became largely immune to this sort of manipulation. However, "black-hat" SEO went one step further, creating so-called "doorway" pages – tightly focused pages consisting of a bunch of keywords relevant to a single topic. Thanks to their keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic.
Another trend is abusing link-popularity-based ranking algorithms such as PageRank with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of "black-hat" SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.
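The arithmetic behind such link farms can be illustrated with the classic, un-normalized PageRank formula. The figures below are hypothetical, and we assume each generated page has exactly one outgoing link, pointing at the target:

```python
d = 0.85                 # typical damping factor
pr_min = 1 - d           # minimum PageRank guaranteed to any indexed page
n = 10_000               # hypothetical number of machine-generated pages

# each generated page passes its entire (minimal) PageRank
# to the target through its single outgoing link
target_pr = (1 - d) + d * n * pr_min
print(round(target_pr, 2))  # 1275.15 – thousands of tiny endorsements add up
```

Even though every individual page contributes almost nothing, the sum grows linearly with the number of pages, which is exactly why dynamically generated pages are attractive to spammers.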
"Black-hat" SEO is responsible for the immense amount of search engine spam—pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out the web spam search engines can use statistical methods that allow computing distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engine not just because it allows excluding spam pages from their indices but also using them to train more sophisticated machine learning algorithms capable to battle web spam with higher precision.
Using Statistics to Detect Search Engine Spam
An example of the application of statistical methods to web spam detection is presented in the paper "Spam, Damn Spam, and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork of Microsoft [1]. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered – 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.
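The quoted 1.95% figure is the half-width of a 95% confidence interval for a sample proportion, which is easy to verify with the standard normal approximation (variable names are ours):

```python
import math

n = 751        # manually inspected sample size
spam = 61      # pages judged to be spam
p = spam / n   # observed spam fraction

# half-width of the 95% confidence interval (normal approximation)
z = 1.96
margin = z * math.sqrt(p * (1 - p) / n)

print(f"spam rate: {p:.1%}, margin: +/-{margin:.2%}")  # 8.1%, +/-1.95%
```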
The second set was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links; for the HTTP redirects, the source and target URLs. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).
The research concentrates on studying the following properties of web pages:
– URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.)
– Host name resolutions
– Linkage properties
– Content properties
– Content evolution properties
– Clustering properties
URL Properties
Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Since the pages are machine-generated, we can expect their URLs to look different from those created by humans. The assumptions are that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.
Manual inspection of the 100 longest host names revealed that 80 of them belonged to adult sites and 11 referred to financial and credit-related sites. Therefore, to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs have host names at least 45 characters long containing at least 6 dots, 5 dashes or 10 digits – and the vast majority of these pages appear to be spam. By changing the threshold values we can trade off the number of pages flagged as spam against the number of false positives.
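The combined rule can be expressed directly in code. The thresholds (45 characters, 6 dots, 5 dashes, 10 digits) are the ones quoted above; the function name and the example URLs are ours:

```python
from urllib.parse import urlparse

def suspicious_host(url: str) -> bool:
    """Flag a URL whose host component is at least 45 characters long
    AND contains at least 6 dots, 5 dashes, or 10 digits."""
    host = urlparse(url).hostname or ""
    if len(host) < 45:
        return False
    return (host.count(".") >= 6
            or host.count("-") >= 5
            or sum(c.isdigit() for c in host) >= 10)

print(suspicious_host("http://cheap-pills-cheap-pills-buy-cheap-pills-online.example.com/"))  # True
print(suspicious_host("http://example.com/"))  # False
```

Tightening or loosening these constants changes the recall/false-positive balance exactly as the paragraph above describes.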
Host Name Resolutions
One can notice that, given a query q, Google tends to rank a page higher if the host component of the page's URL contains keywords from q. To exploit this, search engine optimizers stuff URLs with popular keywords and keyphrases and set up DNS servers to resolve these host names to a single IP. Generally, SEOs generate a large number of host names in order to rank for a wide variety of popular queries.
This behavior can also be detected relatively easily by observing the number of host names resolving to a single IP. In Set 2, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and a record-breaking IP referred to by 8,967,154 host names.
To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample proved that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
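Given a mapping from host names to the IPs they resolve to, this check reduces to a simple group-and-count. The sketch below is ours (the paper does not publish code), with a lowered threshold and toy data for illustration:

```python
from collections import defaultdict

def flag_spam_ips(host_to_ip, threshold=10_000):
    """Return the IPs to which at least `threshold` distinct host names resolve."""
    hosts_per_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_per_ip[ip].add(host)
    return {ip for ip, hosts in hosts_per_ip.items() if len(hosts) >= threshold}

# toy data: 12 keyword-stuffed host names pointing at one server
resolutions = {f"keyword{i}.example.com": "10.0.0.1" for i in range(12)}
resolutions["normal-site.example.org"] = "10.0.0.2"
print(flag_spam_ips(resolutions, threshold=10))  # flags only 10.0.0.1
```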
Linkage Properties
The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is referred to as its out-degree, while its in-degree equals the number of links pointing to the page. By analyzing out- and in-degree values it is also possible to detect spam pages, which appear as outliers in the corresponding distributions.
In Set 2, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only about 1,700 such pages would be expected. Overall, 0.05% of the pages in Set 2 have out-degrees at least three times more common than the Zipfian distribution would suggest, and according to manual inspection of a cross-section, almost all of them are spam.
The distribution of in-degrees is computed similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only about 2,000 such pages would be expected. Overall, 0.19% of the pages in Set 2 have in-degrees at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
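One simple way to operationalize "three times more common than the trend suggests" is to fit a power law to the degree histogram in log-log space and flag degrees far above the fitted line. This is our sketch of the idea, not the paper's implementation, and the data is synthetic:

```python
import math

def degree_outliers(degree_counts, factor=3.0):
    """Fit log(count) = a + b*log(degree) by least squares and return the
    degrees whose observed count exceeds `factor` times the fitted value."""
    xs = [math.log(d) for d in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return [d for d, c in degree_counts.items()
            if c > factor * math.exp(a + b * math.log(d))]

# toy distribution following a power law, with one artificial spike
counts = {d: max(1, round(1_000_000 / d ** 2)) for d in (1, 2, 4, 8, 16, 32, 64)}
counts[32] *= 100  # an anomalously common degree, like the 1301 out-degree case
print(degree_outliers(counts))  # flags 32
```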
Content Properties
Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.
For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count for pages downloaded from a given host name, with the variance on the x-axis and the word count on the y-axis, both on a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text, and 41.5% were soft errors (pages with a message indicating that the resource is not currently available, despite the HTTP status code 200 "OK").
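The zero-variance case (many pages on one host with an identical word count) can be detected without plotting anything. The sketch below is ours, with toy data; host names are hypothetical:

```python
from collections import Counter, defaultdict

def fixed_wordcount_hosts(pages, min_pages=10):
    """pages: iterable of (host, non_markup_word_count) pairs.  Return the
    hosts where at least `min_pages` pages share exactly the same word count."""
    per_host = defaultdict(Counter)
    for host, words in pages:
        per_host[host][words] += 1
    return {host for host, counts in per_host.items()
            if max(counts.values()) >= min_pages}

sample = ([("spam-farm.example", 312)] * 12           # 12 template pages, 312 words each
          + [("blog.example", n) for n in (95, 410, 212, 87)])
print(fixed_wordcount_hosts(sample))  # flags only spam-farm.example
```

As the manual inspection above shows, a hit from this rule still needs review: identical word counts also occur on hosts serving soft-error pages rather than spam.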
Content Evolution
The natural evolution of content on the Web is slow: over a period of a week, 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam pages, generated in response to an HTTP request independent of the requested URL, will change completely on every download. Therefore, by looking at extreme cases of content mutation, search engines are able to detect web spam.
The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors, and 1 was an adult page counted as a false positive.
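A minimal way to quantify "changes completely" is to compare the word sets of successive downloads of the same URL: if two snapshots share no words at all, the page was rewritten from scratch. This is our simplification of the paper's change vectors, with invented snapshot text:

```python
def change_fraction(snapshots):
    """Fraction of successive downloads that share no words with the previous
    version, i.e. that changed completely (1.0 = changes on every fetch)."""
    pairs = list(zip(snapshots, snapshots[1:]))
    complete = sum(1 for prev, cur in pairs
                   if not set(prev.split()) & set(cur.split()))
    return complete / len(pairs) if pairs else 0.0

weekly = ["buy cheap pills now", "casino bonus offer today", "best loan rates here"]
print(change_fraction(weekly))  # 1.0 – the page is rewritten on every download
```

A page scoring 1.0 across many crawls is exactly the kind of outlier the researchers examined.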
Clustering Properties
Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and have only minor differences (such as varying keywords inserted into the template). Pages with such properties can be detected by applying cluster analysis to the samples.
To form clusters of similar pages, the 'shingling' algorithm described by Broder et al. [2] was used. Figure 7 of the paper shows the distribution of cluster sizes of near-duplicate pages in Set 1: the horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.
The outliers can be put into two groups. The first group did not contain any spam pages; pages in this group relate more to the duplicated-content issue. At the same time, the second group is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
To Sum Up
The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources rather than on technical tricks.
References:
1. Dennis Fetterly, Mark Manasse, Marc Najork. "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (2004). Microsoft Research.
2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. "Syntactic Clustering of the Web". In 6th International World Wide Web Conference, April 1997.
Oleg Ishenko, MCSE, MCDBA, BSc