Discussion:
Crawler problems
Steve Holdoway
2011-11-21 20:45:02 UTC
Came to work yesterday, and had 3 (geographically) separate web servers
on their knees, looking like they were being DDoSed. Web services
unusable, databases screaming, resources depleted. 2 VPSes and a
dedicated server.

It transpires that all of them are being heavily spidered by
Baiduspider, Bingbot and Googlebot.

After a load of analysis this is what I've done:

1. Baiduspider - drop at firewall. These are English speaking sites, so
we don't need this.
2. Bingbot - the same.
3. Google - dropped the 5 most prolific crawlers.
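The analysis behind a decision like that usually starts with pulling the worst-offending source IPs out of the access log. A minimal sketch, assuming combined log format (the printf lines are stand-in sample data; point it at your real log instead):

```shell
log=access.log

# stand-in sample lines so the sketch runs on its own; replace with your real log
printf '%s\n' \
  '180.76.5.10 - - [21/Nov/2011:09:00:00 +1300] "GET /a HTTP/1.1" 200 512 "-" "Baiduspider"' \
  '180.76.5.11 - - [21/Nov/2011:09:00:01 +1300] "GET /b HTTP/1.1" 200 512 "-" "Baiduspider"' \
  '180.76.5.10 - - [21/Nov/2011:09:00:02 +1300] "GET /c HTTP/1.1" 200 512 "-" "Baiduspider"' \
  '66.249.66.1 - - [21/Nov/2011:09:00:03 +1300] "GET /d HTTP/1.1" 200 512 "-" "Googlebot/2.1"' \
  > "$log"

# count requests per source IP for the big three crawlers,
# to decide which netblocks are worth dropping at the firewall
grep -iE 'baiduspider|bingbot|googlebot' "$log" \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head
```

The resulting IPs can then be mapped back to netblocks with whois before firewalling.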

I did want to keep Bing but tell it to behave itself; it seems to just
ignore robots.txt and do what it wants anyway.

Google was an afterthought - I'd hoped I could get away with that, but it
was still slowing the site down - so I've not analyzed it as closely.

Has anyone else noticed this increased aggressiveness in crawling
policy? Any ideas on how to control it??

Cheers,

Steve
--
Steve Holdoway BSc(Hons) MNZCS <***@greengecko.co.nz>
http://www.greengecko.co.nz
MSN: ***@greengecko.co.nz
Skype: sholdowa
Mark Foster
2011-11-21 21:00:47 UTC
Post by Steve Holdoway
Came to work yesterday, and had 3 (geographically) separate web servers
on their knees, looking like they were being DDoSed. Web services
unusable, databases screaming, resources depleted. 2 VPSes and a
dedicated server.
It transpires that all of them are being heavily spidered by
Baiduspider, Bingbot and Googlebot.
1. Baiduspider - drop at firewall. These are English speaking sites, so
we don't need this.
2. Bingbot - the same.
3. Google - dropped the 5 most prolific crawlers.
I did want to keep Bing but tell it to behave itself; it seems to just
ignore robots.txt and do what it wants anyway.
Google was an afterthought - I'd hoped I could get away with that, but it
was still slowing the site down - so I've not analyzed it as closely.
Has anyone else noticed this increased aggressiveness in crawling
policy? Any ideas on how to control it??
Baidubot went nuts on my image gallery last month - blew out my
bandwidth limits. They're now blocked at the supernet in the webserver
config (returning errors to everything, now). 180.76.5.0/24 occupies 29
of the top 30 IPs to visit my image gallery this month. Each request is
zeroed out byte-wise due to the drop, but as I can't firewall the
connections off, they continue to pollute my log files.
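For the record, a webserver-level block like that can be quite small; a sketch assuming Apache 2.2-era syntax (the directory path is illustrative):

```apache
# Deny the Baiduspider netblock; every request from it gets a 403.
<Directory "/var/www/gallery">
    Order allow,deny
    Allow from all
    Deny from 180.76.5.0/24
</Directory>
```

The requests still hit Apache (hence the log pollution); only a firewall drop keeps them out of the logs entirely.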

I haven't dropped any other crawlers yet, but I'm monitoring the
situation, as I've also put in a 'block all crawlers' robots.txt file. It
really is interesting to see which crawlers remain; I'm actually seeing
'real' user-agents now, and not just bots.
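For reference, a 'block all crawlers' robots.txt is just:

```text
User-agent: *
Disallow: /
```

Of course it only keeps out the bots that bother to read it.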

Here's the top 15 user agents on my image gallery (my old one, anyway)
for month to date. Check out the percentages in terms of hits.
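That sort of breakdown can be reproduced straight from the raw log; a sketch, assuming combined log format (the printf lines are stand-in sample data; substitute your real access log):

```shell
log=access.log

# stand-in sample lines so the sketch runs on its own
printf '%s\n' \
  '1.2.3.4 - - [x] "GET / HTTP/1.1" 200 10 "-" "Baiduspider"' \
  '1.2.3.5 - - [x] "GET / HTTP/1.1" 200 10 "-" "Baiduspider"' \
  '1.2.3.6 - - [x] "GET / HTTP/1.1" 200 10 "-" "Mozilla/5.0 (X11; Linux x86_64)"' \
  > "$log"

# the user-agent is the 6th double-quote-delimited field in combined format
awk -F'"' '{print $6}' "$log" | sort | uniq -c | sort -rn | head -15
```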

[image: top 15 user agents by hits, month to date]

/rage!


Mark.



_______________________________________________
NZLUG mailing list ***@linux.net.nz
http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
Steve Holdoway
2011-11-21 21:10:11 UTC
Post by Mark Foster
Baidubot went nuts on my image gallery last month - blew out my
bandwidth limits. They're now blocked at the supernet in the webserver
config (returning errors to everything, now). 180.76.5.0/24 occupies 29
of the top 30 IPs to visit my image gallery this month. Each request is
zeroed out byte-wise due to the drop, but as I can't firewall the
connections off, they continue to pollute my log files.
Good* to know I'm not alone. I firewalled 180.76.5.0/24 and
180.76.6.0/24 to drop Baidu.

Cheers,

Steve

*Well, maybe a new definition of good, but you know what I mean.
--
Steve Holdoway BSc(Hons) MNZCS <***@greengecko.co.nz>
http://www.greengecko.co.nz
MSN: ***@greengecko.co.nz
Skype: sholdowa
Simon Lyall
2011-11-22 03:32:24 UTC
Post by Steve Holdoway
Post by Mark Foster
Baidubot went nuts on my image gallery last month - blew out my
bandwidth limits. They're now blocked at the supernet in the webserver
config (returning errors to everything, now). 180.76.5.0/24 occupies 29
of the top 30 IPs to visit my image gallery this month. Each request is
zeroed out byte-wise due to the drop, but as I can't firewall the
connections off, they continue to pollute my log files.
Good* to know I'm not alone. I firewalled 180.76.5.0/24 and
180.76.6.0/24 to drop Baidu.
Some of the bots obey this:

Crawl-Delay: 30

which will make them go slow. In the case of Googlebot, Yahoo and
MSNBot, you can register on their sites and control them (somewhat) via a
web interface. It depends how much you worry about SEO.
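For the bots that honour it, the directive has to sit under a User-agent group in robots.txt, e.g. (bingbot here is just an illustrative target):

```text
User-agent: bingbot
Crawl-delay: 30
```

A bare Crawl-delay line outside a User-agent group is ignored.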

They are all a pain though; for example, GoogleBot will recheck all the 301
redirects on your site every month or two "just in case", which can be a
lot of links if you did a site reorg a couple of years ago.

My advice would be to decide which ones you like, register and control
those, and then block the rest in robots.txt. Any that sneak through, block
explicitly. I'm a little surprised about Baidu ignoring robots.txt,
though.
--
Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/
"To stay awake all night adds a day to your life" - Stilgar | eMT.


Mark Foster
2011-11-22 04:27:47 UTC
Post by Simon Lyall
My advice would be to decide which ones you like, register and control
those, and then block the rest in robots.txt. Any that sneak through,
block explicitly. I'm a little surprised about Baidu ignoring
robots.txt, though.
Simon,

throw 'baidu ignores robots.txt' into Google.

http://web-robot-abuse.blogspot.com/2006/09/baiduspider-bad-bot-ignores-robotstxt.html

It appears that 5 years on, nothing has changed.

https://groups.google.com/group/sinatrarb/browse_thread/thread/7409a67be63ea65f
another example.

Overwhelmingly, Baidu is badly behaved. Happy to see them firewalled to
hell and back, to be honest.

Mark.



Martin D Kealey
2011-11-22 04:44:30 UTC
Based on a quick scan of my Apache logs, followed by a bit of "whois" fu, I
get these:

# create the chain and attach it to INPUT first, if not already done:
iptables -N blacklist
iptables -I INPUT -j blacklist

iptables -A blacklist -j DROP -s 119.63.192.0/21
iptables -A blacklist -j DROP -s 123.125.71.0/24
iptables -A blacklist -j DROP -s 180.76.0.0/16
iptables -A blacklist -j DROP -s 220.181.0.0/16

Of course, that blocks them from sending me spam^H^H^H^Huseful email as
well; what a pity.

-Martin
Date: Tue, 22 Nov 2011 17:27:47 +1300
Subject: Re: [nzlug] Crawler problems
Post by Simon Lyall
My advice would be to decide which ones you like, register and control
those, and then block the rest in robots.txt. Any that sneak through,
block explicitly. I'm a little surprised about Baidu ignoring
robots.txt, though.
Simon,
throw 'baidu ignores robots.txt' into Google.
http://web-robot-abuse.blogspot.com/2006/09/baiduspider-bad-bot-ignores-robotstxt.html
It appears that 5 years on, nothing has changed.
https://groups.google.com/group/sinatrarb/browse_thread/thread/7409a67be63ea65f
another example.
Overwhelmingly, Baidu is badly behaved. Happy to see them firewalled to
hell and back, to be honest.
Mark.

Jethro Carr
2011-11-22 00:11:07 UTC
Post by Mark Foster
Post by Steve Holdoway
Has anyone else noticed this increased aggressiveness in crawling
policy? Any ideas on how to control it??
Baidubot went nuts on my image gallery last month - blew out my
bandwidth limits.
heh, Baidubot went and downloaded every single RPM in my public
repository - that was a good chunk of data!

regards,
jethro
--
Jethro Carr
www.jethrocarr.com
www.amberdms.com