Avoiding undesired web Scraping and fake Web Search engines in Ruby on Rails


If you have developed a nice web app with a lot of content, you will sooner or later face undesired web scraping.

The undesired web scraping will sometimes come from an unmasked bot, with user agents such as Go-http-client, curl, Java and others. But sometimes you will have to deal with bots pretending to be almighty Googlebot or some other legitimate bot.

In this article I will propose you a defense to mitigate undesired web scraping, and to detect fake bots disguised under a legitimate bot name (user agent), without compromising the response time.

This defense can be integrated in any rack-based web app, such as Ruby on Rails or Sinatra.

Request Throttling

If your website has a lot of content, any reasonable human visitor will not access many pages. Let's say that your visitor is a very avid reader and enjoys a lot your content. How many pages do you think it can visit:
  • per minute?
  • per hour?
  • per day?
Our defense strategy will be based on accumulating the number of requests coming from a single IP address for different slots of time.
When one IP address exceeds a pre-configured reasonably high number of requests for the given interval, our app will respond with an HTTP 429 "Too many requests" code.

To the rescue comes rack-attack: a rack middleware for blocking and throttling abusive requests.

Rack-attack stores request information in a configurable cache, with Redis and Memcached as some of the possible cache stores. If you are using Resque, you will probably want to use Redis for rack-attack too.

Here's a possible implementation of rack-attack:

Let's go through the code.

Any request whose path starts with one of these entries will be a candidate for throttling:

We set up a reasonable maximum number of requests for each of the intervals of time we will consider for request throttling:

This is arbitrary and you can choose different intervals of time.

We would like to limit the number of requests within 60 seconds coming from the same IP:

When this throttle block returns a non-falsey value, a counter will be incremented in the Rack::Attack.cache. If the throttle's limit is exceeded, the request will be blocked.

We will modify slightly the default rack_attack algorithm to allow legitimate web indexers in a timely manner.
Here's the new implementation of the algorithm:

Our new algorithm is basically the same as the original rack_attack one, except for the addition of these lines which check if the request comes from one of our allowed Search crawlers:

What this block does is:
  • Check if the request comes from a Search  Engine, identified by its user agent
  • If that's the case, assume it's true and verify offline the authenticity of the bot, so we do not delay the response. If it turns out to be fake, it will be blocked in subsequent requests

The performance of this algorithm will tipically be of just a few milliseconds.

Here's the Rails ActiveJob that will verify the authenticity of the bot. This can be implemented by a Resque queue.

Verify Bot

Let's see a possible implementation of VerifyBot.
Methods that VerifyBot will have:
  • verify: given a user agent and IP, verify the authenticity of the bot
  • allowed_user_agent: true for the user agents from bots we will allow
  • fake_bot: true for bots already verified as fake
  • allowed_bot: true for bots already verified as authentic

VerifyBot will use Redis to cache already verified bots and marked either as safe or fake. These two lists will be stored as Redis sets.

With these, only the implementation of the BotValidator is missing to complete the puzzle.

Bot Validator

Popular search engines authenticity can be verified by a reverse-forward DNS lookup. For instance, this is what Google recommends to verify Googlebot:
  1. Run a reverse DNS lookup on the accessing IP address
  2. Verify that the domain name is in either googlebot.com or google.com
  3. Run a forward DNS lookup on the domain name retrieved in step 1. Verify that it is the same as the original accessing IP address

Our BotValidator will have two main methods:
  • allowed_user_agent: true for users agents from bots we will allow
  • do_validation: true if the user agent can be authenticated. Will raise exception in case of a fake bot

Subclasses for each bot we want to validate will implement the methods:
  • validates? : true if responsible of validation for the given user agent
  • is_valid? : true when the bot is validated for the given user agent and IP address
Here's the implementation:

Subclass ReverseForwardDnsValidator implements the mentioned validation strategy that many search engines and bots follow.

To validate Googlebot or Bingbot, we will only need to subclass ReverseForwardDnsValidator and implement:
  • validates? : true if passed user_agent is the one the class validates
  • valid_hosts: array of valid reverse DNS host name terminations

Other subclasses for different validations can be added. For intance, one to validate Facebook bot, a generic one for Reverse-only DNS validation, etc.

No comments:

Post a Comment