
I work for a (non-profit) journal publisher, and we do indeed cut off robot downloading, but not after one click of a link. We analyze traffic to determine robot downloads. I suspect the entire university did not get cut off in this incident. Blocking is usually done on a per-IP basis, and unless the university proxies all of its journal traffic through a single IP, which is not common, saying the whole university was blocked may be an exaggeration. I personally wish we had no robot monitoring, but then again we would get heavy spidering of large files. Roughly, the detection is just per-IP counting over a sliding window; a minimal sketch of that idea follows (the thresholds and names below are made up for illustration, not our actual system):
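
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 300   # look-back window per IP (assumption)
    MAX_DOWNLOADS = 100    # downloads allowed per window per IP (assumption)

    recent = defaultdict(deque)  # ip -> timestamps of recent downloads

    def record_download(ip, now=None):
        """Record a download; return True if the IP now looks like a robot."""
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        # Drop timestamps that have fallen outside the window.
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_DOWNLOADS

The point is that a single click, or even a burst of normal reading, never trips it; only sustained bulk downloading from one address does.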


Is there a reason to block instead of throttling?


We do have a CAPTCHA too, before the block. Basically, to get blocked you have to really work hard at it. We also do not mind limited robot use for cases like downloading all papers for a given search term or author, but we do not want people downloading our entire corpus either, so throttling alone is not an option. In rough terms the escalation looks like the sketch below (a toy example with made-up thresholds, not our production logic):
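
    CAPTCHA_AFTER = 1   # strikes before a CAPTCHA challenge is shown (assumption)
    BLOCK_AFTER = 3     # strikes before the IP is blocked outright (assumption)

    strikes = {}  # ip -> number of robot-detection strikes

    def handle_suspicious(ip):
        """Return the action to take for an IP that just tripped robot detection."""
        strikes[ip] = strikes.get(ip, 0) + 1
        if strikes[ip] >= BLOCK_AFTER:
            return "block"
        if strikes[ip] >= CAPTCHA_AFTER:
            return "captcha"
        return "allow"

So a hard block only happens after the CAPTCHA step has already been tripped repeatedly.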

I think the case mentioned in the article is definitely a heavy-handed approach. When it comes down to it, at my place we are just trying to block the wget -r's of the world.


Is there any reason why you don't want people downloading your entire corpus?


Or a CAPTCHA?



