How Cloudflare Bot Detection Works

With studies estimating that over 40% of all internet traffic comes from bots, there is a growing demand for software that can distinguish between human activity and bot activity. A prime example of this is Cloudflare’s bot management solution.

If you clicked on this article, you probably want to know how to bypass Cloudflare. You are in the right place! The topics are:

- What is Cloudflare Bot Management
- How Cloudflare Detects Bots
- How to Reverse Engineer and Bypass Cloudflare

Ready? Let’s get started!

What is Cloudflare Bot Management

Cloudflare is a web performance and security company. On the security front, it offers its customers a Web Application Firewall (WAF). A WAF can protect applications from a variety of security threats, including cross-site scripting (XSS), credential stuffing, and DDoS attacks.

One of the core systems included in the WAF is Cloudflare’s Bot Manager. As a bot protection solution, its main goal is to stop attacks from malicious bots without affecting real users.

Cloudflare recognizes the importance of certain bots. For example, no website wants to prevent Google or other search engines from crawling its web pages. To accommodate this, Cloudflare maintains an allow list of known good bots.

Unfortunately for web scraping enthusiasts like you, Cloudflare assumes that all non-whitelisted bot traffic is malicious. So whatever your intent, it’s entirely possible that your bot will be denied access to a Cloudflare-protected webpage.

If you have ever tried to scrape a website protected by Cloudflare, you may have encountered the following bot manager related errors:

- Error 1020: Access is denied
- Error 1010: Access has been blocked by the owner of this website
- Error 1015: Rate limited
- Error 1012: Access is denied

These are typically accompanied by a 403 Forbidden HTTP response status code.
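If you want your scraper to recognize these errors, a minimal sketch could look like the one below. It assumes the error page mentions the code as "Error 1020" or "error code: 1020", which may not hold for every page. The function name and the code-to-meaning table are made up for illustration.

```python
import re

# Error codes listed in this article, with short paraphrased meanings.
CLOUDFLARE_ERRORS = {
    1010: "access blocked by the site owner",
    1012: "access denied",
    1015: "rate limited",
    1020: "access denied",
}


def detect_cloudflare_error(status_code: int, body: str):
    """Return (code, meaning) if the response looks like a bot manager block, else None."""
    # Only error responses (such as 403 Forbidden) are worth inspecting.
    if status_code < 400:
        return None
    # Assumption: the page text contains "Error 1020" or "error code: 1020".
    match = re.search(r"error(?:\s+code)?[:\s]+(10\d{2})", body, re.IGNORECASE)
    if match is None:
        return None
    code = int(match.group(1))
    if code in CLOUDFLARE_ERRORS:
        return code, CLOUDFLARE_ERRORS[code]
    return None


print(detect_cloudflare_error(403, "Error 1020: Access is denied"))
```

Returning None for anything unrecognized keeps the helper conservative: an ordinary 403 from the origin server is not reported as a Cloudflare block.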

Can you bypass Cloudflare?

Fortunately, the answer is yes. However, developing a Cloudflare bypass is not an easy task to do on your own. First you need to have a solid understanding of how it works.

How does Cloudflare detect bots?

The bot detection methods used by Cloudflare can generally be divided into two categories: passive and active. Passive bot detection techniques consist of fingerprinting checks performed on the backend, while active detection techniques rely on checks performed on the client side. Let’s look at some examples from each category!

Cloudflare Passive Bot Detection Techniques

Below is a non-exhaustive list of some passive bot detection techniques employed by Cloudflare.

One of them is botnet detection: Cloudflare checks whether a device is associated with a known malicious bot network. Any device suspected of belonging to one of these networks will either be automatically blocked or face additional client-side challenges to solve.

IP Address Reputation

The reputation of a user’s IP address (also known as a risk score or fraud score) is based on factors such as geolocation, ISP, and reputation history. For example, an IP belonging to a data center or a well-known VPN provider will have a lower reputation than a residential IP address. A website can also restrict access from regions outside the ones it serves, since no traffic from actual customers is expected to come from there.
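To make the idea concrete, here is a toy scoring sketch. Everything in it is an assumption for illustration: the network ranges are reserved documentation ranges, not real data center or VPN ranges, and the weights and the served-country set are invented. Real reputation systems are far more elaborate.

```python
import ipaddress

# Hypothetical ranges for illustration only (reserved documentation networks).
DATACENTER_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
VPN_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]
# Hypothetical set of countries the website actually serves.
SERVED_COUNTRIES = {"US", "CA"}


def reputation_score(ip: str, country: str, past_abuse_reports: int = 0) -> int:
    """Return a toy reputation score from 0 (worst) to 100 (best)."""
    address = ipaddress.ip_address(ip)
    score = 100
    if any(address in network for network in DATACENTER_NETWORKS):
        score -= 40  # data center IPs rank lower than residential ones
    if any(address in network for network in VPN_NETWORKS):
        score -= 30  # so do well-known VPN providers
    if country not in SERVED_COUNTRIES:
        score -= 20  # traffic from outside the served regions
    score -= min(past_abuse_reports * 10, 40)  # reputation history, capped
    return max(score, 0)


print(reputation_score("203.0.113.5", "US"))
```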

HTTP Request Headers

Cloudflare uses HTTP request headers to determine if a user is a robot. If you’re using a non-browser user-agent like python-requests/2.22.0, your scraper can easily be flagged as a bot. Cloudflare can also block bots that send requests missing headers a browser would normally include, or headers that don’t match what the declared user-agent would send.
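The three checks described above can be sketched as follows. This is not Cloudflare’s actual rule set: the token list, the expected headers, and the Chrome client hint rule are assumptions chosen for illustration.

```python
# Hypothetical lists; real detection rules are not public.
AUTOMATION_TOKENS = ("python-requests", "curl", "wget", "go-http-client")
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def header_red_flags(headers: dict) -> list:
    """Return a list of reasons why these request headers look non-human."""
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    flags = []

    # 1. Non-browser user-agent.
    if not user_agent or any(token in user_agent for token in AUTOMATION_TOKENS):
        flags.append("non-browser user-agent")

    # 2. Headers that a browser would normally send are missing.
    missing = [name for name in EXPECTED_BROWSER_HEADERS if name not in normalized]
    if missing:
        flags.append("missing headers: " + ", ".join(missing))

    # 3. Headers inconsistent with the declared user-agent.
    # Assumption: recent Chrome versions send the sec-ch-ua client hint over HTTPS.
    if "chrome/" in user_agent and "sec-ch-ua" not in normalized:
        flags.append("user-agent claims Chrome but sec-ch-ua is absent")

    return flags


print(header_red_flags({"User-Agent": "python-requests/2.22.0"}))
```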

TLS Fingerprinting

This technique allows Cloudflare’s Anti-Bot to identify the client used to make requests to the server.

There are multiple ways to fingerprint TLS (JA3, JARM, CYU, etc.), but each implementation generates a static fingerprint for each requesting client. TLS fingerprinting is useful because a browser’s TLS implementation tends to differ from that of other release versions, other browsers, and request-based libraries. For example, the Chrome browser (version 104) on Windows has a different fingerprint than all of the following:

- Chrome browser (version 87) on Windows
- Firefox browser
- Chrome browser on Android devices
- Python HTTP requests library

A TLS fingerprint is constructed during the TLS handshake. Cloudflare parses fields provided in the “Client Hello” message, such as cipher suites, extensions, and elliptic curves, to compute a fingerprint hash for a particular client.
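As an illustration, here is a sketch of how a JA3-style hash is derived from Client Hello fields. JA3 is one of the public methods named earlier; whether Cloudflare uses exactly this computation is not something this article confirms. The input values in the example are made up.

```python
import hashlib

# GREASE values (RFC 8701) are placeholder IDs that JA3 leaves out: 0x0A0A, 0x1A1A, ..., 0xFAFA.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for the given Client Hello fields."""

    def join(values):
        # Values are written in decimal, in the order sent, separated by dashes.
        return "-".join(str(value) for value in values if value not in GREASE)

    ja3_string = ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical Client Hello values, for illustration only.
text, digest = ja3_fingerprint(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
print(text)
print(digest)
```

Because the order of the values is preserved, two clients that offer the same cipher suites in a different order end up with different hashes. That is part of why a request library rarely shares a fingerprint with a browser.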

This hash is then looked up in a database of pre-collected fingerprints to identify the client making the request. If the client’s hash matches an allowed fingerprint hash (that is, a browser’s fingerprint), Cloudflare compares the user-agent header of the client’s request to the user-agent associated with the stored fingerprint hash.

If they match, the security system assumes the request is from a standard browser. In contrast, a mismatch between the client’s TLS fingerprint and its advertised user-agent clearly indicates the use of custom botting software, resulting in blocked requests.
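The lookup-and-compare step can be sketched like this. The database contents, the hash strings, and the three verdict labels are hypothetical. The article does not say what happens when a hash is unknown, so the sketch simply reports that case.

```python
# Hypothetical database: fingerprint hash -> user-agent token expected for that client.
KNOWN_BROWSER_FINGERPRINTS = {
    "hash-of-chrome-104-windows": "chrome/104",
    "hash-of-firefox-103": "firefox/103",
}


def check_tls_against_user_agent(tls_hash: str, user_agent: str) -> str:
    """Compare the TLS fingerprint with the advertised user-agent."""
    expected_token = KNOWN_BROWSER_FINGERPRINTS.get(tls_hash)
    if expected_token is None:
        return "unknown"  # not a known browser fingerprint
    if expected_token in user_agent.lower():
        return "allow"  # fingerprint and user-agent agree
    return "block"  # the client is lying about what it is


print(check_tls_against_user_agent("hash-of-chrome-104-windows", "Mozilla/5.0 Firefox/103.0"))
```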

HTTP/2 Fingerprint

HTTP/2 is the second major version of the HTTP protocol, published as RFC 7540 on May 14, 2015. The protocol is supported by all popular browsers. A primary goal of HTTP/2 was to improve the performance of websites and web applications by introducing header field compression and allowing concurrent requests and responses on the same TCP connection. For this purpose, the HTTP/1.1 foundation has been extended with new parameters and values. HTTP/2 fingerprints are based on these new internal structures.

The binary framing layer is a new addition to HTTP/2 and is central to HTTP/2 fingerprinting.

If you are interested in a more in-depth analysis of HTTP/2 fingerprinting, read Akamai’s proposed fingerprinting method in Passive Fingerprinting of HTTP/2 Clients. But for now, here’s a summary:

Three main components form an HTTP/2 fingerprint:

- Frames: SETTINGS_HEADER_TABLE_SIZE, SETTINGS_ENABLE_PUSH, SETTINGS_MAX_CONCURRENT_STREAMS, SETTINGS_INITIAL_WINDOW_SIZE, SETTINGS_MAX_FRAME_SIZE, SETTINGS_MAX_HEADER_LIST_SIZE, WINDOW_UPDATE
- Stream priority information: StreamID:Exclusivity_Bit:Dependant_StreamID:Weight
- Pseudo-header field order: the order of the :method, :authority, :scheme, and :path headers
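These components can be combined into a single fingerprint string. The sketch below follows the layout of Akamai’s proposal as I understand it (settings, window update, priorities, pseudo-header order, separated by "|"). The function name and the sample values are assumptions for illustration, not measurements from a real browser.

```python
# SETTINGS parameter identifiers as defined in RFC 7540.
SETTINGS_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def http2_fingerprint(settings, window_update, priorities, pseudo_headers):
    """Build a fingerprint string from the first frames a client sends.

    settings:       list of (name, value) pairs, in the order sent
    window_update:  increment from the WINDOW_UPDATE frame
    priorities:     list of (stream_id, exclusive_bit, dependent_stream_id, weight)
    pseudo_headers: pseudo-header names, in the order sent
    """
    settings_part = ";".join(f"{SETTINGS_IDS[name]}:{value}" for name, value in settings)
    # "0" stands for "no priority frames were sent".
    priority_part = ",".join(":".join(str(f) for f in frame) for frame in priorities) or "0"
    # Each pseudo-header is abbreviated to its first letter (m, a, s, p).
    header_part = ",".join(name.lstrip(":")[0] for name in pseudo_headers)
    return "|".join([settings_part, str(window_update), priority_part, header_part])


# Illustrative values only.
print(
    http2_fingerprint(
        settings=[("HEADER_TABLE_SIZE", 65536), ("INITIAL_WINDOW_SIZE", 6291456)],
        window_update=15663105,
        priorities=[],
        pseudo_headers=[":method", ":authority", ":scheme", ":path"],
    )
)
```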

Similar to the TLS fingerprint, each requesting client has a static HTTP/2 fingerprint. To determine the legitimacy of a request, Cloudflare verifies that the fingerprint/user-agent pair of the request matches one of the whitelisted pairs stored in its database.

HTTP/2 fingerprints and TLS fingerprints are closely related. Of all the passive bot detection techniques used by Cloudflare, these two are the most technically demanding to control with request-based bots. But they are also the most important. So you have to make sure you understand them correctly. Otherwise, you risk being blocked.

All right! By now you should have a good understanding of how Cloudflare passively detects bots. But remember: this is only half the story. So let’s see how Cloudflare does it actively!

Cloudflare Active Bot Detection Techniques

When you visit a website protected by Cloudflare, many checks are constantly performed on the client side (i.e., in the local browser) to determine whether you are a bot. Here is a non-exhaustive list of the methods Cloudflare uses.

CAPTCHAs

CAPTCHAs are a familiar way to detect bots. However, they are known to degrade the end-user experience. Whether Cloudflare serves a CAPTCHA to a user depends on several factors, such as:

- Site configuration. Website administrators can choose to always show CAPTCHAs, show them only in some cases, or disable them entirely.
- Risk level. Cloudflare may choose to serve CAPTCHAs only if the traffic is suspicious. For example, a user who browses a website using the Tor client may see a CAPTCHA, while a user with a standard web browser such as Google Chrome may not.

In such cases, a Cloudflare CAPTCHA bypass is possible.

Previously, Cloudflare used reCAPTCHA as its primary CAPTCHA provider. However, since 2020 it has transitioned to using only hCaptcha.

Canvas fingerprinting

Canvas fingerprinting allows the system to identify a web client’s device class. Device class refers to the combination of browser, operating system and graphics hardware of the system used to access the website.

Canvas is an HTML5 API for drawing graphics and animations on web pages using JavaScript. To create the canvas fingerprint, the web page asks the browser’s Canvas API to render an image. The image is then hashed to produce a fingerprint.
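The hashing step can be illustrated outside the browser. In the sketch below, the two byte strings stand in for the pixel data that two different systems might produce from the same drawing commands. The pixel values and the choice of SHA-256 are assumptions for illustration; the real check runs as JavaScript in the page.

```python
import hashlib


def canvas_fingerprint(pixels: bytes) -> str:
    """Hash rendered pixel data into a fingerprint."""
    return hashlib.sha256(pixels).hexdigest()


# Hypothetical 2x2 RGBA renderings of the same drawing commands.
render_a = bytes([255, 0, 0, 255] * 4)
render_b = bytearray(render_a)
render_b[1] = 1  # one channel of one pixel differs, e.g. due to anti-aliasing

print(canvas_fingerprint(render_a) == canvas_fingerprint(bytes(render_b)))
```

Even a one-bit difference in the rendered output yields a completely different hash, which is what makes small rendering differences between systems usable as a signal.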

This technique treats the system’s graphics rendering stack as something that cannot be physically cloned. This may sound complicated, so let me explain.

Canvas fingerprinting relies on multiple layers of the computing system, including:

- Hardware: GPU
- Low-level software: GPU drivers, operating system (fonts, anti-aliasing/sub-pixel rendering algorithms)
- High-level software: web browser (image processing engine)

A variation in any of these layers changes the rendered image, so the resulting fingerprint differs between device classes.

Just to be clear, canvas fingerprints don’t contain enough information to properly track and identify individuals and bots. Instead, its main purpose is to accurately distinguish between device classes.

In the context of bot detection, this is useful because bots tend to lie (via the user-agent header) about the underlying technology. Cloudflare has a large dataset of legitimate canvas fingerprint and user-agent pairs. Using machine learning, it can detect spoofing of device properties.

Event tracking

Cloudflare adds event listeners to the webpage. These listen for user actions such as mouse movements, mouse clicks, and keystrokes. In most cases, real users need to use a mouse or keyboard to browse. If Cloudflare detects a persistent lack of mouse or keyboard usage, it can assume that the user is a bot.
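A minimal sketch of this heuristic, evaluated on a list of recorded event names. The event names match common DOM events, but the ten second threshold and the function itself are assumptions, not Cloudflare’s actual logic.

```python
# Events that indicate a human using a mouse or keyboard.
INPUT_EVENTS = {"mousemove", "click", "keydown"}


def lacks_human_input(event_kinds, session_ms: int, min_session_ms: int = 10_000) -> bool:
    """Return True if a session lasted long enough but showed no mouse or keyboard use."""
    saw_input = any(kind in INPUT_EVENTS for kind in event_kinds)
    # Only a persistent lack of input is suspicious; very short sessions are not judged.
    return session_ms >= min_session_ms and not saw_input


print(lacks_human_input(["load", "scroll"], session_ms=30_000))
```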

Environment API querying

This is a very broad category. Browsers have hundreds of web APIs that can be used for bot detection. I will do my best to classify them into four categories.

- Browser-specific APIs. Some APIs exist in one browser but not in another. For example, window.chrome is a property specific to Chrome browsers. If the data sent to Cloudflare shows that you’re using the Chrome browser, but it is sent with a Firefox user-agent, Cloudflare knows something is wrong.
- Timestamp APIs. Cloudflare tracks user speed metrics using timestamp APIs such as Date.now() and window.performance.timing.navigationStart. If the timestamps don’t look like normal human browsing activity, the user will be blocked. Some examples are inhumanly fast browsing and mismatched timestamps (e.g., a navigationStart timestamp from before the page was loaded).
- Automated browser detection. Cloudflare queries the browser for properties specific to automated web browsing environments. For example, the presence of the window.document.__selenium_unwrapped or window.callPhantom properties indicates the use of Selenium or PhantomJS respectively. For obvious reasons, you will be blocked if this is detected.
- Sandbox detection. Sandboxing here refers to trying to emulate a browser in a non-browser environment, for example in NodeJS using JSDOM. Cloudflare has checks in place to prevent users from solving challenges in an emulated browser environment. For example, a script can look for the process object, which exists only in NodeJS. It can also call Function.prototype.toString.call(functionName) on a function to check whether that function has been modified.
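To show how the four categories could be evaluated together, here is a sketch that checks a hypothetical payload collected in the browser. The payload field names, the 500 ms threshold, and the flag wording are all invented for illustration. The only grounded parts are the properties named above and the fact that native functions stringify to a body containing "[native code]".

```python
AUTOMATION_MARKERS = ("__selenium_unwrapped", "callPhantom")
MIN_HUMAN_MS = 500  # hypothetical lower bound for time spent on the page


def sensor_red_flags(sensor: dict) -> list:
    """Return the inconsistencies found in a hypothetical browser payload."""
    flags = []
    user_agent = sensor.get("user_agent", "").lower()

    # Browser-specific API: window.chrome should not exist in Firefox.
    if sensor.get("has_window_chrome") and "firefox" in user_agent:
        flags.append("window.chrome present with a Firefox user-agent")

    # Timestamp API: compare Date.now() with navigationStart.
    elapsed = sensor.get("date_now_ms", 0) - sensor.get("navigation_start_ms", 0)
    if elapsed < 0:
        flags.append("mismatched timestamps")
    elif elapsed < MIN_HUMAN_MS:
        flags.append("inhumanly fast browsing")

    # Automated browser detection: Selenium or PhantomJS properties.
    if any(marker in sensor.get("found_properties", ()) for marker in AUTOMATION_MARKERS):
        flags.append("automation property found")

    # Sandbox detection: a process object suggests NodeJS, not a browser.
    if sensor.get("has_process_object"):
        flags.append("NodeJS process object found")

    # Sandbox detection: a modified function no longer stringifies as native code.
    sources = sensor.get("native_function_sources", {})
    if any("[native code]" not in source for source in sources.values()):
        flags.append("native function was modified")

    return flags


print(sensor_red_flags({
    "user_agent": "Mozilla/5.0 Firefox/103.0",
    "has_window_chrome": True,
    "navigation_start_ms": 1000,
    "date_now_ms": 9000,
}))
```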

The core of Cloudflare bot protection

Like many other anti-bot systems, Cloudflare collects data from all the above methods as sensor data and validates it for consistency on the server side.

Phew, that was a lot of information! Now you should understand the bot detection techniques used by Cloudflare.

So far, we’ve only covered general concepts without going into too much detail about the actual Cloudflare scripts. But don’t worry. In the next section, we’ll take a look at exactly how Cloudflare’s anti-bot system puts these techniques into practice by analyzing its core: Cloudflare’s waiting room.

We will discuss bot bypassing in the next article.