Robert McAnderson, Author
As most of you know, search engines use ‘spiders’ to crawl all the known websites on the internet and update the information stored in their databases. Without spiders, changes to your website would not be recorded and therefore could not be found by potential customers searching for the product or service you provide.
Recently we were experiencing very slow internet response times, so we had one of our IT people try to identify whether anyone on our network was downloading or uploading excessive amounts of data. We run a sophisticated network that lets us see, incident by incident, who is doing what on the internet.
We were surprised to find the offender was not a staff member but the Google spider, which was chewing up our bandwidth at a rate of almost two meg per second, making internet access for all staff members slower than grass grows in winter.
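For anyone who wants to run a similar check, the culprit usually shows up just as clearly in a web server’s access logs. Below is a minimal sketch in Python, assuming an Apache/Nginx combined-format log at a hypothetical path called access.log; the log location and field positions are assumptions you would adjust for your own server.

# Minimal sketch: tally requests and bytes served per user agent from a
# combined-format access log, to see whether a crawler such as Googlebot
# is responsible for most of the traffic. The log path and format are
# assumptions; adjust them for your own server.
import re
from collections import defaultdict

# Combined log format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

requests = defaultdict(int)
bytes_sent = defaultdict(int)

with open("access.log") as log:  # hypothetical file name
    for line in log:
        match = LINE.match(line)
        if not match:
            continue
        ip, status, size, agent = match.groups()
        # Group all Googlebot variants together; everything else by agent string.
        key = "Googlebot" if "Googlebot" in agent else agent
        requests[key] += 1
        if size != "-":
            bytes_sent[key] += int(size)

# Print the top bandwidth consumers, heaviest first.
for agent in sorted(bytes_sent, key=bytes_sent.get, reverse=True)[:10]:
    print(f"{agent[:60]:60} {requests[agent]:>8} reqs {bytes_sent[agent]:>12} bytes")

If a single crawler sits at the top of that list by a wide margin, as the Google spider did for us, you have found your bandwidth thief.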
In an effort to stop Google, we restricted the bandwidth allocation assigned to its spider, and internet access speed for all users returned to normal.
When you have 80% of the market share globally, over $45 billion in cash reserves and year-to-date revenue growth of 33% through the second quarter of 2011, do you care if you are chewing up someone’s bandwidth and thereby reducing staff productivity? I guess the answer is no.
Google being the fabulous search engine that it is, I decided to use it to find out what was causing the issue, so I searched the term “Google attacking my site” and found, amongst many other things, the following link: http://www.google.com/support/forum/p/Webmasters/thread?tid=6b265391fd6167fd&hl=en
The post offers insight into the problem and demonstrates the extent of what could almost be described as a denial-of-service attack, given that Google’s spider ramps up its efforts to crawl a site when access is denied. As you can see in the extracts from the post below (and there is more at the link above), this is not the case with the Yahoo or Ask spiders, which stop after 116 and 67 requests respectively, while Google will continue for 10,000-plus attempts.
Moral of the story: you can’t live without Google, but does it have to be this hard living with them? So the next time your internet access slows down, put your IT people on the job to find out whether Google is strangling the life out of your connection speed.
Comments from a Complaining Customer
“Further attacks from Google – this one is a file inclusion attack – I would have thought the big G would be smart enough not to index or use fingerprinting attacks against unsuspecting sites.”
“This is getting crazy. Most of the entries in my mod_sec logs are now from 22.214.171.124 (a Google IP), 35 in the last few minutes.”
“I had no choice but to block this Google IP address for what is looking like a DDOS attack – the bogus requests are now coming every one to two seconds.”
“Here’s what I find frustrating – https sites/pages come and go the same way that regular sites/pages do, and while Google respects and updates 404 errors (unless it’s a feed, in which case it’ll just pound on it forever), it does not respect 501 errors when https goes away. Yahoo has made a bunch of failed https requests, the same for Ask…
What’s wrong with this picture?”
Response from Google
“These requests appear to be HTTPS requests that are not handled by the server (which is probably why you’re logging them like that). It looks like the site that we’re making these requests for has been indexed with HTTPS in the past, so it’s normal that we attempt to recrawl it using the same URLs. If you do not wish to have these URLs crawled, you can use the robots.txt file to disallow access to the HTTPS URLs, or, better yet, use a 301 redirect to the preferred URL that should replace these URLs. Both of these methods would however require being able to access the URLs via HTTPS.”
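If you go down the robots.txt route that Google suggests, it is worth sanity-checking the rule before deploying it. Here is a minimal sketch using only Python’s standard library; the Disallow rule and the example.com URLs are placeholders for illustration, not details taken from the post.

# Minimal sketch: verify that a robots.txt rule would keep Googlebot away
# from HTTPS URLs, along the lines the Google representative suggests above.
# The rule and the example.com URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# If this file were served at https://example.com/robots.txt, it would tell
# Googlebot not to crawl any URL on the HTTPS host.
print(parser.can_fetch("Googlebot", "https://example.com/old-secure-page"))  # -> False
print(parser.can_fetch("Googlebot", "https://example.com/"))                 # -> False

Note Google’s caveat above still applies: the robots.txt file (or the 301 redirect) has to be reachable over HTTPS for the spider to see it.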
Response from Customer
“Apparently this is a huge bug in the bot programming, that it would fail tens of thousands of times, and in response, ramp UP its crawling efforts.”