How I blocked the Stack Overflow website for the entire company for a few hours: a post about unskillfully writing a web crawler.

Stack Overflow is a platform that can turn an average programmer into an expert. Nowadays it’s hard to imagine working as a programmer without this tool. Once, I accidentally blocked this website in our company. Right after the incident, I noticed more traffic around the coffee machine and more light conversation in the open space. Some people even found an explanation for the situation, saying that we probably work too much and Stack Overflow had blocked our access. But what actually happened? What do you have to do to block such a page?

How did it happen?

I will start from the beginning. The guilty one is Michał Sadowski and his internet monitoring tool, Brand24. Actually, Michał isn’t guilty, but he was certainly an inspiration. Once, on the way to Warsaw, I explained to my sister how Brand24 works and that the concept behind it is really simple. We have an application that checks websites, looking for keywords and links. Having those links, it visits the next websites, and the next, and so on. Google, DuckDuckGo, and many other tools for indexing internet content work the same way.

OK, but what about that block? Since I was really curious about the topic, I decided to write my own simple crawler. After roughly three hours it was ready to use. My crawler tried to find my first and last name on the internet and report the statistics to me by email. Most of the results it found were about a famous TV presenter with the same name, but it also found mentions of me. The next time I worked on it, I decided to boost the application by using many threads and many starting points (the websites where the crawler starts its race), and I tested it immediately. I started the application and moved on to my work duties. I admit that when I saw Stack Overflow broken for the first time, I didn’t know what had happened.

We could debate why big companies build their own crawlers while I created mine in three hours. The sole difference was that my application got blocked by every second website, while Brand24 works non-stop. Actually, that is not the sole difference; the issue was much bigger. When writing such applications, we have to follow some rules that describe how a well-behaved robot should act. You can follow them, or you can just let your robot do whatever it wants. Unaware of these rules, I chose the second way.

From a technical point of view

I wrote the crawler using Java and the jsoup library. Under the hood, the application downloaded websites, checked their content, and visited all unvisited links. It did this forever, until I stopped it. Every few thousand results, I got a report by email.
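To give a rough idea of that loop, here is a minimal sketch using jsoup. The keyword, the starting URL, and the simple console report are illustrative assumptions, not the original code:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleCrawler {

    public static void main(String[] args) {
        String keyword = "Michał";                      // hypothetical search phrase
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        toVisit.add("https://example.com");             // hypothetical starting point

        while (!toVisit.isEmpty()) {
            String url = toVisit.poll();
            if (!visited.add(url)) {
                continue;                               // skip pages we have already seen
            }
            try {
                Document doc = Jsoup.connect(url).get();            // download the page
                if (doc.text().contains(keyword)) {
                    System.out.println("Match found on: " + url);   // report a hit
                }
                for (Element link : doc.select("a[href]")) {        // collect outgoing links
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        toVisit.add(next);
                    }
                }
            } catch (Exception e) {
                // ignore pages that fail to load and move on
            }
        }
    }
}
```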

Thanks to Java’s good support for multithreading, I could increase the speed of visiting websites. I created 20 threads, and every thread called a website a few times per second. One of the starting points was Stack Overflow. As you can imagine, 20 threads times around 5 calls per second produced atypical traffic on the website. Such websites know that this behavior can lead to issues with the server, and their anti-attack mechanisms block the offending IP in such situations. This time I was the one who tested that mechanism. The code of my application is here.
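The multithreaded setup looked roughly like the sketch below. The queue and the fetch method are assumed placeholders; the point is that 20 workers pull URLs and hit the server with no pause in between, which is exactly what triggers anti-attack protection:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerThreads {

    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        QUEUE.add("https://stackoverflow.com");          // one of the starting points
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (int i = 0; i < 20; i++) {
            pool.submit(() -> {
                while (true) {
                    try {
                        String url = QUEUE.take();
                        fetchAndEnqueueLinks(url);        // hits the server immediately, no delay
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
    }

    private static void fetchAndEnqueueLinks(String url) {
        // download the page, scan it for the keyword, and push new links onto QUEUE
        // (same logic as in the single-threaded sketch above)
    }
}
```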

What I didn’t know when writing my own crawler

When I started working on this application, I didn’t know many things about such systems. Primarily, I didn’t know that the application should act like a human. Website creators protect their sites against the unpredictable and unwanted traffic that bots can generate. To do that, they build solutions that measure the number of calls from a particular IP, add hidden links that a normal user will never see or click but a bot will follow, and so on.

The second thing I didn’t know is that not every website allows crawling. Let’s imagine we write a crawler that crawls an online English dictionary. Are we able to fetch most of the words from this dictionary? Of course. Is it honest? Of course not! Some websites state clearly in their terms of use that we cannot scrape them. We have to respect that.

How should it work so it doesn’t harm websites?

There are a few rules we have to follow in our crawlers for them to be seen as polite. As I wrote above, the application should act like a human and simulate many visits by different users with different behaviors.

  1. Connect to websites from different IPs. When strange requests come from one IP address, an anti-attack mechanism can treat them as an attack. With requests from many different IPs, it’s not so clear whether this is an attack or normal traffic.
  2. Click around the website at random intervals. Humans never click through a website at equal intervals; sometimes it’s faster, sometimes slower.
  3. Simulate mouse movement. Users mostly browse websites with a mouse, whose cursor moves across the display in a non-deterministic way.
  4. Simulate scrolling. On every website, there is something to scroll.
  5. Respect no-index links and robots.txt files. Websites explicitly “tell” us what we may do with them, and we should respect it (see the sketch after this list).
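Here is a minimal sketch of “polite” fetching, again assuming jsoup: identify the bot with a User-Agent, do a deliberately naive robots.txt check, and wait a random interval between requests. A real crawler should use a proper robots.txt parser and also honor crawl-delay and meta robots/noindex tags:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.net.URI;
import java.util.Random;

public class PoliteFetcher {

    private static final Random RANDOM = new Random();

    // Naive check: does robots.txt disallow everything for "User-agent: *"?
    // Real robots.txt parsing is more involved; this only illustrates the idea.
    static boolean isCrawlingAllowed(String url) {
        try {
            URI uri = URI.create(url);
            String robotsUrl = uri.getScheme() + "://" + uri.getHost() + "/robots.txt";
            String robots = Jsoup.connect(robotsUrl).ignoreContentType(true).execute().body();
            String normalized = robots.replace(" ", "").replace("\r", "").toLowerCase();
            return !normalized.contains("user-agent:*\ndisallow:/");
        } catch (Exception e) {
            return true; // no robots.txt reachable – assume allowed, but stay cautious
        }
    }

    static Document politeGet(String url) throws Exception {
        // random pause of 2–5 seconds so the request pattern does not look machine-like
        Thread.sleep(2000 + RANDOM.nextInt(3000));
        return Jsoup.connect(url)
                .userAgent("my-demo-crawler/0.1 (contact: me@example.com)") // identify yourself
                .get();
    }

    public static void main(String[] args) throws Exception {
        String url = "https://example.com"; // hypothetical target
        if (isCrawlingAllowed(url)) {
            Document doc = politeGet(url);
            System.out.println(doc.title());
        }
    }
}
```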

What about CAPTCHA?

CAPTCHA is a type of protection on websites that should ensure that forms are submitted by humans. I’m talking about the widget that asks users to type the characters from an image or to find similar images.

Is it possible to break CAPTCHA? From time to time, new protections are created to make CAPTCHA more secure. Today’s technology allows creating complex protections as well as breaking them. In this article, you can read more about it.

Summary

Why did I write this article? I think this story is quite interesting and shows that not everything that is public is ours, and that we cannot do whatever we want with it.

If in the future you want to create a similar application, it’s important to precede it with a very careful analysis to avoid problems. Before you start, you can ask your mentor for advice regarding the implementation.

Maybe you’re thinking, “What exactly do I need such an application for?”. Besides Google, there are a lot of applications that index websites in different ways. One of the most popular use cases for such a program is scanning real-estate services or websites with job offers.

If you liked this article, I invite you to like my Facebook page and join my private group. If you are a Polish speaker, I also invite you to visit my other blog.

About author

Hi,
my name is Michał. I’m a software engineer. I like sharing my knowledge and ideas to help people who struggle with the technologies and design of today’s IT systems, or who just want to learn something new.
If you want to stay up to date, leave me your email or send an invitation on LinkedIn, XING, or Twitter.


