Avoiding undesired web scraping and fake web search engines in Ruby on Rails


If you have developed a nice web app with a lot of content, you will sooner or later face undesired web scraping.

The undesired web scraping will sometimes come from an unmasked bot, with user agents such as Go-http-client, curl or Java. But sometimes you will have to deal with bots pretending to be the almighty Googlebot or some other legitimate bot.

In this article I will propose a defense that mitigates undesired web scraping and detects fake bots disguised under a legitimate bot name (user agent), without compromising response time.

This defense can be integrated into any Rack-based web app, such as Ruby on Rails or Sinatra.

Request Throttling

If your website has a lot of content, any reasonable human visitor will not access many pages. Let's say that your visitor is a very avid reader and greatly enjoys your content. How many pages do you think they can visit:
  • per minute?
  • per hour?
  • per day?
Our defense strategy will be based on accumulating the number of requests coming from a single IP address over different time windows.
When one IP address exceeds a pre-configured reasonably high number of requests for the given interval, our app will respond with an HTTP 429 "Too many requests" code.

To the rescue comes rack-attack: a rack middleware for blocking and throttling abusive requests.

Rack-attack stores request information in a configurable cache, with Redis and Memcached as some of the possible cache stores. If you are using Resque, you will probably want to use Redis for rack-attack too.

Here's a possible implementation of rack-attack:
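
A possible implementation, sketched as a config/initializers/rack_attack.rb. The throttled paths, limits and throttle names below are illustrative, not taken from any real app:

```ruby
# config/initializers/rack_attack.rb
class Rack::Attack
  # By default Rack::Attack.cache uses Rails.cache;
  # point it at Redis if you share the instance with Resque.

  # Content sections that are candidates for throttling (illustrative)
  THROTTLED_PATHS = %w[/articles /categories /search].freeze

  # Per-IP ceilings for each time window, in seconds (illustrative numbers)
  MAX_REQUESTS = { 60 => 30, 3_600 => 300, 86_400 => 1_000 }.freeze

  MAX_REQUESTS.each do |period, limit|
    throttle("content/ip/#{period}", limit: limit, period: period) do |req|
      req.ip if THROTTLED_PATHS.any? { |path| req.path.start_with?(path) }
    end
  end

  # Respond to throttled requests with HTTP 429
  self.throttled_responder = lambda do |_request|
    [429, { 'Content-Type' => 'text/plain' }, ["Too many requests\n"]]
  end
end
```

Note that `throttled_responder` is the rack-attack 6 name; older versions call it `throttled_response`.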

Let's go through the code.

Any request whose path starts with one of these entries will be a candidate for throttling:
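
For example, with a hypothetical list of content paths:

```ruby
# Hypothetical content paths; anything under them is a throttling candidate
THROTTLED_PATHS = %w[/articles /categories /search].freeze

def throttleable_path?(path)
  THROTTLED_PATHS.any? { |prefix| path.start_with?(prefix) }
end
```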

We set up a reasonable maximum number of requests for each of the intervals of time we will consider for request throttling:
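
For instance (the actual numbers should be tuned to your traffic):

```ruby
# Seconds in the window => maximum requests allowed per IP (illustrative)
MAX_REQUESTS = {
  60     => 30,    # per minute
  3_600  => 300,   # per hour
  86_400 => 1_000  # per day
}.freeze
```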

This is arbitrary and you can choose different intervals of time.

We would like to limit the number of requests within 60 seconds coming from the same IP:
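
In rack-attack's DSL this could look like (limit and path are illustrative):

```ruby
Rack::Attack.throttle('content/ip/60s', limit: 30, period: 60) do |req|
  # Returning the IP counts this request against that IP;
  # returning nil exempts the request from this throttle
  req.ip if req.path.start_with?('/articles')
end
```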

When this throttle block returns a truthy value, a counter is incremented in the Rack::Attack.cache. If the throttle's limit is exceeded, the request will be blocked.

We will slightly modify the default rack-attack algorithm to let legitimate web indexers through in a timely manner.
Here's the new implementation of the algorithm:
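
Stripped of rack-attack's plumbing, the modified algorithm can be sketched in plain Ruby. The class, limits and user-agent list are illustrative, and the store stands in for the Rack::Attack cache:

```ruby
require 'set'

# Fixed-window counter with a bypass for allowed search crawlers.
# In the real middleware this logic lives inside the throttle block.
class CrawlerAwareThrottle
  ALLOWED_BOT_AGENTS = /Googlebot|bingbot|DuckDuckBot/i

  def initialize(limit:, period:, store: Hash.new(0), fake_ips: Set.new)
    @limit    = limit
    @period   = period
    @store    = store    # counter cache: key => hits in the current window
    @fake_ips = fake_ips # IPs already proven to be fake bots
  end

  # true when the request must be blocked
  def throttle?(ip, user_agent, now = Time.now.to_i)
    if user_agent.to_s.match?(ALLOWED_BOT_AGENTS)
      # Provisionally trust the crawler; offline verification will
      # add fake ones to @fake_ips and block them from then on
      return @fake_ips.include?(ip)
    end
    key = "count:#{ip}:#{now / @period}"
    @store[key] += 1
    @store[key] > @limit
  end
end
```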

Our new algorithm is basically the same as the original rack-attack one, except for the addition of these lines, which check whether the request comes from one of our allowed search crawlers:
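
The added lines boil down to this check (VerifyBot and VerifyBotJob are introduced below; the user-agent pattern is up to you):

```ruby
if req.user_agent.to_s =~ /Googlebot|bingbot|DuckDuckBot/i
  # Provisionally trust the crawler and verify it offline
  VerifyBotJob.perform_later(req.user_agent, req.ip)
  # Throttle only if a previous verification proved this IP fake
  next req.ip if VerifyBot.fake_bot?(req.ip)
  next nil
end
```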

What this block does is:
  • Check if the request comes from a search engine, identified by its user agent
  • If so, provisionally trust it and verify the authenticity of the bot offline, so we do not delay the response. If it turns out to be fake, it will be blocked on subsequent requests

This algorithm will typically add just a few milliseconds to the response time.

Here's the Rails ActiveJob that will verify the authenticity of the bot, backed by a Resque queue.
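
A minimal sketch, assuming Resque is configured as the ActiveJob backend; the queue name is arbitrary:

```ruby
# app/jobs/verify_bot_job.rb
class VerifyBotJob < ApplicationJob
  queue_as :bot_verification

  def perform(user_agent, ip)
    VerifyBot.verify(user_agent, ip)
  end
end
```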

Verify Bot

Let's see a possible implementation of VerifyBot.
Methods that VerifyBot will have:
  • verify: given a user agent and IP, verify the authenticity of the bot
  • allowed_user_agent: true for the user agents from bots we will allow
  • fake_bot: true for bots already verified as fake
  • allowed_bot: true for bots already verified as authentic

VerifyBot will use Redis to cache already-verified bots, marking each as either safe or fake. These two lists will be stored as Redis sets.
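
A possible sketch. In production the two lists would be Redis sets (SADD / SISMEMBER); plain Ruby Sets and a pluggable validator keep the sketch self-contained:

```ruby
require 'set'

class VerifyBot
  @fake_ips    = Set.new
  @allowed_ips = Set.new

  class << self
    # The validator (e.g. BotValidator, below) is pluggable here
    attr_accessor :validator

    def verify(user_agent, ip)
      return if fake_bot?(ip) || allowed_bot?(ip) # already verified
      @allowed_ips << ip if validator.do_validation(user_agent, ip)
    rescue StandardError
      @fake_ips << ip # the validator raises for a fake bot
    end

    def allowed_user_agent?(user_agent)
      validator.allowed_user_agent?(user_agent)
    end

    def fake_bot?(ip)
      @fake_ips.include?(ip)
    end

    def allowed_bot?(ip)
      @allowed_ips.include?(ip)
    end
  end
end
```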

With these, only the implementation of the BotValidator is missing to complete the puzzle.

Bot Validator

The authenticity of popular search engine crawlers can be verified by a reverse-forward DNS lookup. For instance, this is what Google recommends for verifying Googlebot:
  1. Run a reverse DNS lookup on the accessing IP address
  2. Verify that the domain name is in either googlebot.com or google.com
  3. Run a forward DNS lookup on the domain name retrieved in step 1. Verify that it is the same as the original accessing IP address

Our BotValidator will have two main methods:
  • allowed_user_agent: true for user agents from bots we will allow
  • do_validation: true if the user agent can be authenticated; raises an exception in case of a fake bot

Subclasses for each bot we want to validate will implement the methods:
  • validates? : true if the class is responsible for validating the given user agent
  • is_valid? : true when the bot is authenticated for the given user agent and IP address
Here's the implementation:
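
A possible implementation, using Ruby's stdlib Resolv for the DNS lookups. Class names and the set of registered validators are illustrative:

```ruby
require 'resolv'

# Base class: finds the subclass that claims the user agent and
# delegates to it. Raises for a bot that fails validation.
class BotValidator
  class FakeBotError < StandardError; end

  def self.allowed_user_agent?(user_agent)
    validators.any? { |v| v.new.validates?(user_agent) }
  end

  def self.do_validation(user_agent, ip)
    validator = validators.map(&:new).find { |v| v.validates?(user_agent) }
    return false unless validator
    validator.is_valid?(user_agent, ip) ||
      raise(FakeBotError, "#{ip} failed validation for #{user_agent}")
  end

  def self.validators
    [GooglebotValidator, BingbotValidator]
  end
end

# Reverse DNS on the IP, check the host name suffix, then forward DNS
# on that host and make sure it maps back to the same IP.
class ReverseForwardDnsValidator < BotValidator
  def is_valid?(_user_agent, ip)
    host = Resolv.getname(ip)
    return false unless valid_host?(host)
    Resolv.getaddresses(host).include?(ip)
  rescue Resolv::ResolvError
    false
  end

  def valid_host?(host)
    valid_hosts.any? { |suffix| host.end_with?(suffix) }
  end
end

class GooglebotValidator < ReverseForwardDnsValidator
  def validates?(user_agent)
    user_agent.to_s.include?('Googlebot')
  end

  def valid_hosts
    %w[googlebot.com google.com]
  end
end

class BingbotValidator < ReverseForwardDnsValidator
  def validates?(user_agent)
    user_agent.to_s.include?('bingbot')
  end

  def valid_hosts
    %w[search.msn.com]
  end
end
```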

Subclass ReverseForwardDnsValidator implements the mentioned validation strategy that many search engines and bots follow.

To validate Googlebot or Bingbot, we will only need to subclass ReverseForwardDnsValidator and implement:
  • validates? : true if passed user_agent is the one the class validates
  • valid_hosts: array of valid reverse DNS host name terminations

Other subclasses for different validations can be added: for instance, one to validate the Facebook bot, a generic one for reverse-only DNS validation, etc.


Building REST services with Spring Boot, JPA and MySql: Part 2


In the first part of this tutorial we saw how to build a skeleton java app from scratch based on the Spring framework and implemented the persistence to MySql database.

In this second part, we will implement a REST web service with the Spring framework.

I'll be using maven 3, version 3.0.5 and Java 8 SDK. Google around for installation of these in your environment.

Step 2: Implement a REST endpoint with Spring

In order to use the Spring framework as the basis for our REST endpoint, we need to add the necessary dependencies to our existing pom.xml:
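
Assuming a Spring Boot POM, the web starter is the key addition (its version is managed by the Spring Boot parent):

```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
```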

We already have a model persisted to MySql, and now we will add a controller with an index method that retrieves all persisted instances of our model.

We will annotate this method so that it is published as a REST endpoint when running our app within a Servlet container.

The Spring annotations added to our code are:
  • @RestController This declares our class as a controller returning domain objects instead of views. Spring will take care of the JSON serialization automatically via Jackson serializer

  •  @RequestMapping(value="/games", method = RequestMethod.GET) This maps GET requests for the path /games to our controller method.
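
Putting both annotations together, the controller might look like this (Game and GameRepository come from Part 1 of this tutorial; the names follow the sample project but treat them as illustrative):

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class GameController {

    @Autowired
    private GameRepository repository;

    // GET /games -> all persisted games, serialized to JSON by Jackson
    @RequestMapping(value = "/games", method = RequestMethod.GET)
    public Iterable<Game> index() {
        return repository.findAll();
    }
}
```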

We can now add a test for our new REST endpoint.

In our test, instead of running our controller within an external application server, we use the Spring class MockMvc, which directs requests straight to our controller, making our test faster.
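
A sketch of such a test, assuming JUnit 4 and Spring Boot's test support (the annotation names correspond to Spring Boot 1.4+):

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.test.web.servlet.MockMvc;

@RunWith(SpringRunner.class)
@SpringBootTest
@AutoConfigureMockMvc
public class GameControllerTest {

    @Autowired
    private MockMvc mvc;

    @Test
    public void indexReturnsOk() throws Exception {
        // MockMvc dispatches directly to the controller: no servlet container needed
        mvc.perform(get("/games")).andExpect(status().isOk());
    }
}
```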

If we now run mvn clean test:

Running our REST endpoint

We are now ready to package our app and run it.

If we run mvn clean package:

We now have a jar and we can just run it. Yes!!! That's right: we can run it directly!
Spring has generated an uber jar: a jar with all the dependencies needed to run our app, including an embedded servlet container: Tomcat by default, but you can easily swap it for Jetty or any other of your preference.

If we launch the command

java -jar target/spring-boot-mysql-0.0.1-SNAPSHOT.jar

We can see on the console that Tomcat has started and is listening on port 8080 for requests!

Source Code

Source code on GitHub


Building REST services with Spring Boot, JPA and MySql: Part 1


In this tutorial, we will see how to build a skeleton Java app from scratch based on the Spring framework and capable of having an evolving model persisted on MySql and a related REST web service.

As requirements change continuously, we will handle updates to our model, which ultimately translate into updates to our underlying database schema, with Liquibase: a database migration tool.

For an overview of how you can manage Database migrations in your development lifecycle, have a look at one of my previous articles: Automatic DB migration for Java web apps with Liquibase

I'll be using maven 3, version 3.0.5 and Java 8 SDK. Google around for installation of these in your environment.

Step 1: Persist a model with JPA and Hibernate

Let's start with what Spring gives us in Spring Initializr for a maven project with the JPA and MySql dependencies.

Here's the generated POM.
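
Trimmed to its essentials, the generated POM looks something like this (the parent version depends on when you generate the project; mysql-connector-java was the MySql driver artifact at the time):

```xml
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version><!-- whatever Initializr gives you --></version>
</parent>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <scope>runtime</scope>
    </dependency>
</dependencies>
```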

In order to have a non-failing maven project, we need to add the details of the database schema to our project.

The resulting properties section in the POM:
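
For instance (URL, user and password are placeholders for your own schema):

```xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
    <jdbc.url>jdbc:mysql://localhost:3306/games_db</jdbc.url>
    <jdbc.username>games_user</jdbc.username>
    <jdbc.password>secret</jdbc.password>
</properties>
```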

And we add these properties to application.properties:
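
Spring Boot picks these up automatically (values are placeholders):

```properties
spring.datasource.url=jdbc:mysql://localhost:3306/games_db
spring.datasource.username=games_user
spring.datasource.password=secret
```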

If we launch mvn clean package, we now have a successful build.

For details on how to create and assign user permissions on MySql, Google is your friend :-)

Adding our Model and Repository

Let's add a sample Model class to our app.
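
A minimal JPA entity might look like this (the Game name matches the sample project; the fields are illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class Game {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    private String name;

    protected Game() {
        // no-arg constructor required by JPA
    }

    public Game(String name) {
        this.name = name;
    }

    public Long getId() { return id; }
    public String getName() { return name; }
}
```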

And a Repository interface to access the persisted data. Spring Data will automatically generate the implementation for us.
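
The repository is just an interface; Spring Data derives methods such as findAll from CrudRepository:

```java
import org.springframework.data.repository.CrudRepository;

public interface GameRepository extends CrudRepository<Game, Long> {
}
```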

We can now add a test to load all instances from our repository and verify it is working correctly.

In order to populate our Database for tests, we have the option of using Spring annotations directly in our Java unit test source code.

In this case, we will be using the dbunit-maven-plugin instead.

Our updated pom.xml:
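
The relevant plugin section could look like this (plugin coordinates are org.codehaus.mojo:dbunit-maven-plugin; versions and phase are indicative):

```xml
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>dbunit-maven-plugin</artifactId>
    <version>1.0-beta-3</version>
    <configuration>
        <driver>com.mysql.jdbc.Driver</driver>
        <url>${jdbc.url}</url>
        <username>${jdbc.username}</username>
        <password>${jdbc.password}</password>
    </configuration>
    <executions>
        <execution>
            <phase>test-compile</phase>
            <goals>
                <goal>operation</goal>
            </goals>
            <configuration>
                <type>CLEAN_INSERT</type>
                <src>src/test/resources/sample-data.xml</src>
            </configuration>
        </execution>
    </executions>
    <dependencies>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
    </dependencies>
</plugin>
```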

And our src/test/resources/sample-data.xml for the unit tests.
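
This is a flat XML dataset: each element names a table and its attributes map to columns (table and column names here follow the sample Game model):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <game id="1" name="Chess"/>
    <game id="2" name="Go"/>
</dataset>
```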

If we now run our test with mvn clean test, we have a build failure: we have no tables in our MySql schema and dbunit cannot insert the test data.

At this point, we need to generate a DDL script for our schema.

There are a number of options. You could opt for a Spring solution.

We will apply a more generic solution from a third party which works on Spring and non-Spring frameworks: Hibernate Maven Plugin from juplo.de. This is a completely new implementation of the Hibernate Maven plugin updated to Hibernate 5.

We need to add these lines to our pom.xml:
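
Something along these lines (coordinates are de.juplo:hibernate-maven-plugin; check the exact version and goal name against the plugin's documentation):

```xml
<plugin>
    <groupId>de.juplo</groupId>
    <artifactId>hibernate-maven-plugin</artifactId>
    <version>2.0.0</version>
    <executions>
        <execution>
            <goals>
                <goal>create</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```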

And the file src/test/resources/hibernate.properties needed by the hibernate-maven-plugin:
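
The plugin reads the connection details from this file; the ${...} placeholders are filled in by maven resource filtering from the POM properties:

```properties
hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
hibernate.connection.driver_class=com.mysql.jdbc.Driver
hibernate.connection.url=${jdbc.url}
hibernate.connection.username=${jdbc.username}
hibernate.connection.password=${jdbc.password}
```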

Notice in the updated pom.xml:

  • The hibernate-maven-plugin must appear before the dbunit-maven-plugin: the database tables will be created before the dbunit sample data is inserted.

  • Additionally, the file src/test/resources/hibernate.properties needs to be filtered by the standard maven resources plugin.

If we run mvn clean test, our test is finally passing after creating the database tables and populating them with unit test data:

We leave for a future part publishing a REST web service for our model and handling automatic Database migration with Liquibase.

Source code: GitHub

Check Part 2 of this tutorial here.