Scraping Data with Python: Bypassing Sucuri Firewall Filters

by hash3liZer . 05 August 2019


Scraping data from websites and other internet sources has become quite common, especially for users who want to download data in bulk. That's where firewalls like Sucuri sometimes step in. Sucuri has some advanced filters which require javascript to be enabled by the browser.

While scraping, we don't have that facility unless we rely on external libraries. And even if we do rely on such libraries, the scraping is rarely perfect.

So, we are going to use node.js to bypass the Sucuri firewall filters, since they depend on a javascript engine. Node.js provides a REPL and can execute javascript outside the browser, which is a strong feature for us. Our case will also involve some filtering of the javascript returned by the firewall.

We could do the whole process without node, the pure Python way. But that would be a pain in the butt, because you would first have to de-obfuscate the script and port it to Python. That may include replacing strings, rewriting expressions and more.

STEP 1

Installation of Node.js

Node.js is a run-time environment that can execute javascript without a browser rendering engine. It's cross-platform, meaning it can be installed on Mac, Linux and Windows. So, we have our javascript run-time. You can download & install node for your platform from this link:

https://nodejs.org/en/download/

As for Linux, we will do the installation here using the apt package manager. If you are using a distribution other than Ubuntu/Debian, check the installation manual for your distribution on the node.js site. Update your repositories and install node.js:

$ sudo apt update
$ sudo apt install nodejs npm

As for users who are using the Snap package manager:

$ snap install --stable --classic node

Getting an error that the package is not stable? Check the available stable revisions using:
$ snap info node
And install a stable revision (pick the revision number from the output):
$ snap install --revision="revisionhere" --classic node

Check whether you have successfully installed node and have it on your execution path:

$ node --version

[Screenshot: node version output]
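
Node can also evaluate javascript straight from the command line through its --eval flag, which is exactly what we will rely on later when running the Sucuri script from Python. A quick sanity check:

$ node --eval "console.log('node is working')"
node is working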

STEP 2

Analyzing Sucuri

Before we get to the execution part, we need to analyze the javascript returned in the response from Sucuri. To do that, let's grab a site protected by Sucuri. Mine is: https://footdistrict.com. To request this site, we will use the requests library. If you don't have it installed already, use pip to install it:

$ pip install requests

Now, simply request the site and print whatever is returned in response:

import requests

reqHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://footdistrict.com',
    'Connection': 'close',
}
r = requests.get( "https://footdistrict.com", headers=reqHeaders )
print(r.text)

[Screenshot: javascript challenge returned in the response]

In the above screenshot, you can see that the response says javascript is required to be redirected to the actual page. The question is, how would you enable javascript? In fact, you don't. What we are going to do is change the javascript to some extent and execute it through the node run-time. Here's the script returned:

<html>
<title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='ej0nTTMnLnNsaWNlKDEsMikrJzMnICsgICJkc3VjdXIiLmNoYXJBdCgwKSsiM2wiLmNoYXJBdCgwKSArICAnJyArIAoiYXYiLmNoYXJBdCgwKSArIFN0cmluZy5mcm9tQ2hhckNvZGUoMHg2NikgKyAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgJzAnICsgICdaYicuc2xpY2UoMSwyKSsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICsnJysiMXAiLmNoYXJBdCgwKSArICI4Ii5zbGljZSgwLDEpICsgIjJ1Ii5jaGFyQXQoMCkgKyAiMiIuc2xpY2UoMCwxKSArICAnJyArIAonZElhJy5jaGFyQXQoMikrImIiICsgJ0p6OjAnLnN1YnN0cigzLCAxKSArU3RyaW5nLmZyb21DaGFyQ29kZSg5OCkgKyAiYiIuc2xpY2UoMCwxKSArICIiICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzkpICsgICcnICsgCiI2c3UiLnNsaWNlKDAsMSkgKyAnYScgKyAgICcnICsnJytTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzcpICsgJzg3Jy5zbGljZSgxLDIpKyAnJyArJycrJ2InICsgICc0JyArICBTdHJpbmcuZnJvbUNoYXJDb2RlKDk3KSArICI2bSIuY2hhckF0KDApICsgICcnICsnMicgKyAgICcnICsnJysnMTYnLnNsaWNlKDEsMikrIjZzZWMiLnN1YnN0cigwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzJysnc3UnLmNoYXJBdCgxKSsnc3VjdWMnLmNoYXJBdCg0KSsgJ3VzdWMnLmNoYXJBdCgwKSsgJ3JzdWN1Jy5jaGFyQXQoMCkgICsnaScrJ18nKydjc3VjdXJpJy5jaGFyQXQoMCkgKyAnc3VjdXJsJy5jaGFyQXQoNSkgKyAnbycrJycrJ3UnKydkc3UnLmNoYXJBdCgwKSArJ3AnKydzdWN1cnInLmNoYXJBdCg1KSArICdvJysnJysneCcrJ3knKydzdV8nLmNoYXJBdCgyKSsnc3UnLmNoYXJBdCgxKSsndXMnLmNoYXJBdCgwKSsnc3VpJy5jaGFyQXQoMikrJ2QnKydzdV8nLmNoYXJBdCgyKSsnYScuY2hhckF0KDApKyc0c3VjdXJpJy5jaGFyQXQoMCkgKyAnc2EnLmNoYXJBdCgxKSsnNCcrJ3N1MycuY2hhckF0KDIpKyczc3VjdXInLmNoYXJBdCgwKSsgJ2InKydzdWN1cmEnLmNoYXJBdCg1KSArICdkc3VjdXInLmNoYXJBdCgwKSsgIj0iICsgeiArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';
L=S.length;U=0;r='';
var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){
c=s[S.charAt(i)];
U=(U<<6)+c;
l+=6;
while(l>=8){
((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));
}
}
e(r);
</script>
</html>

If you analyze it, you see that there's a line at the end of the script: "e(r)". Since e is an alias for eval (see the var declarations at the top), this evaluates the decoded string r. We have to remove it from there and replace it with "console.log(r)", so we can print the de-obfuscated text on the console. Here arises another situation: we still have to execute that printed code to get the final text, which is basically a cookie.

In order to collect that, we will replace the string "document.cookie" with "var cookie" & "location.reload()" with "console.log(cookie)". This might look a little strenuous, but at the next step you will see how easy it is.
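
As a side note, the decoding loop at the bottom of the script (the part using the A alphabet) is effectively a hand-rolled base64 decoder that turns the long S string into the second-stage javascript. You can confirm that from Python. Here, s_payload is a hypothetical name holding the value of S copied from the <script> tag, truncated to its first characters:

import base64

# s_payload: the (truncated) base64 string assigned to S on the page
s_payload = "ej0nTTMnLnNsaWNlKDEsMikrJzMnICsg"
# base64.b64decode() reproduces what the javascript loop computes
print(base64.b64decode(s_payload).decode())
# prints the start of the second stage: z='M3'.slice(1,2)+'3' +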

STEP 3

Base the Initial Code

To make a script out of the conclusion above, we will basically update the get & post functions of the requests library to cope with the Sucuri javascript filters. So, let's define a function named get and pass it the same arguments as the normal get function:

import requests

def get(url, params=None, headers={}, cookies={}, proxies={}, allow_redirects=False):
    r = requests.get( url, params=params, headers=headers, cookies=cookies, proxies=proxies, allow_redirects=allow_redirects )
    return r

Now, based on the response returned from the requested site, add a condition here for the Sucuri javascript page. If the condition becomes true, the get function must execute the javascript first, store the resulting cookie, and request the site again:

from bs4 import BeautifulSoup as soup

def get(url, params=None, headers={}, cookies={}, proxies={}, allow_redirects=False):
    r = requests.get( url, params=params, headers=headers, cookies=cookies, proxies=proxies, allow_redirects=allow_redirects )
    if r.status_code == 200 and r.headers.get( 'Server' ) == "Sucuri/Cloudproxy" and "You are being redirected" in r.text:
        decode( soup( r.text, "html.parser" ) )
    return r

So, above we have a new function which will decode the script for us, while the argument passed to it is basically the parsed html of the returned response.

STEP 4

Decode the Javascript

The decode function is the core component here and does the most important task. The first thing it has to implement is getting the de-obfuscated version of the script. To get to the data, grab the text from the script tag, replace e(r) with console.log(r) as discussed, and capture the output:

import subprocess

def decode( html ):
    # Grab the obfuscated javascript from the <script> tag
    script = html.find( "script" ).text
    # Print the decoded string instead of eval'ing it
    script = script.replace( "e(r)", "console.log(r)" )
    # Execute through node and capture stdout as text
    script = subprocess.check_output([ 'node', '--eval', script ], text=True)
    print(script)

If you combine the parts together and run them, you will find that the printed script is still encoded to some extent. In order to decode that as well, we will again replace some parts of the returned script and evaluate it once more:

import subprocess

def decode( html ):
    script = html.find( "script" ).text
    script = script.replace( "e(r)", "console.log(r)" )
    script = subprocess.check_output([ 'node', '--eval', script ], text=True)
    # The second stage sets document.cookie and reloads the page;
    # make it print the cookie instead
    script = script.replace( "document.cookie", "var cookie" )
    script = script.replace( "location.reload()", "console.log(cookie)" )
    cookie = subprocess.check_output([ 'node', '--eval', script ], text=True)
    print(cookie)

In the above, the subprocess module is doing the main work for us, i.e. executing node as an external command. If you run the script now, you will see that we have the final cookie. Finally, create a new variable to store this cookie:

sucuri_cookies = {}

def decode( html ):
    script = html.find( "script" ).text
    script = script.replace( "e(r)", "console.log(r)" )
    script = subprocess.check_output([ 'node', '--eval', script ], text=True)
    script = script.replace( "document.cookie", "var cookie" )
    script = script.replace( "location.reload()", "console.log(cookie)" )
    cookie = subprocess.check_output([ 'node', '--eval', script ], text=True)
    # Cookie comes back as "name=value;path=/;max-age=86400"
    (key, value) = cookie.split( ";" )[0].split( "=" )
    sucuri_cookies[ key ] = value
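
To make the parsing concrete: the decoded second stage prints a cookie string of roughly the shape below, so splitting on ";" and then on "=" leaves us with the cookie name and value. The name and value here are placeholders, not real output:

# Placeholder cookie string in the shape the decoded script prints
cookie = "sucuri_cloudproxy_uuid_PLACEHOLDER=0123456789abcdef;path=/;max-age=86400"
(key, value) = cookie.split( ";" )[0].split( "=" )
# key   -> "sucuri_cloudproxy_uuid_PLACEHOLDER"
# value -> "0123456789abcdef"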

STEP 5

Update GET & POST functions

Now we know how we can bypass the Sucuri javascript filter. We need a proper way to request the site, and for that, we will have to update the get and post functions from the requests library. Create a new function which accepts two sets of cookies and merges the second one into the first. You will see why in a moment:

def combine( acookies, bcookies ):
    # Merge bcookies into acookies (in place); bcookies wins on conflicts
    for key in list( bcookies.keys() ):
        acookies[ key ] = bcookies[ key ]
    return acookies
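
For instance, caller-supplied cookies simply get merged over the stored Sucuri ones (the keys here are illustrative):

combine( {'sucuri_cloudproxy_uuid_x': 'abc'}, {'session': 'xyz'} )
# -> {'sucuri_cloudproxy_uuid_x': 'abc', 'session': 'xyz'}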

And create new get and post functions:

def get(url, params=None, headers=reqHeaders, cookies={}, timeout=60, allow_redirects=False):
    cookies = combine( sucuri_cookies, cookies )
    r = requests.get( url, params=params, headers=headers, cookies=cookies, timeout=timeout, allow_redirects=allow_redirects )
    if r.status_code == 200 and r.headers.get( 'Server' ) == "Sucuri/Cloudproxy" and "You are being redirected" in r.text:
        # Solve the javascript challenge, then retry with the new cookie
        decode( soup( r.text, "html.parser" ) )
        return get( url, params, headers, cookies, timeout, allow_redirects )
    return r

def post(url, data=None, json=None, headers=reqHeaders, cookies={}, timeout=60, allow_redirects=False):
    cookies = combine( sucuri_cookies, cookies )
    r = requests.post( url, data=data, json=json, headers=headers, cookies=cookies, timeout=timeout, allow_redirects=allow_redirects )
    if r.status_code == 200 and r.headers.get( 'Server' ) == "Sucuri/Cloudproxy" and "You are being redirected" in r.text:
        decode( soup( r.text, "html.parser" ) )
        return post( url, data, json, headers, cookies, timeout, allow_redirects )
    return r

From the caller's point of view, nothing has really changed. You can request the url as normal, like this:

r = get( "https://footdistrict.com" )
print(len(r.text))

[Screenshot: length of response]

As we can see from the screenshot, the returned response is quite a bit longer, which tells us that we didn't receive the javascript challenge this time.

STEP 6

Code Classification

Finally, we need to arrange the code in the right sequence, so we can execute it. As it stands, the order looks pretty scattered and it might take you some time to figure out the right sequence, so I am gluing the individual parts together to form the final script.

import requests, subprocess
from bs4 import BeautifulSoup as soup

sucuri_cookies = {}
reqHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://footdistrict.com',
    'Connection': 'close',
}

def combine( acookies, bcookies ):
    # Merge bcookies into acookies (in place)
    for key in list( bcookies.keys() ):
        acookies[ key ] = bcookies[ key ]
    return acookies

def decode( html ):
    # First stage: print the de-obfuscated script instead of eval'ing it
    script = html.find( "script" ).text
    script = script.replace( "e(r)", "console.log(r)" )
    script = subprocess.check_output([ 'node', '--eval', script ], text=True)
    # Second stage: print the cookie instead of setting it and reloading
    script = script.replace( "document.cookie", "var cookie" )
    script = script.replace( "location.reload()", "console.log(cookie)" )
    cookie = subprocess.check_output([ 'node', '--eval', script ], text=True)
    (key, value) = cookie.split( ";" )[0].split( "=" )
    sucuri_cookies[ key ] = value

def get(url, params=None, headers=reqHeaders, cookies={}, timeout=60, allow_redirects=False):
    cookies = combine( sucuri_cookies, cookies )
    r = requests.get( url, params=params, headers=headers, cookies=cookies, timeout=timeout, allow_redirects=allow_redirects )
    if r.status_code == 200 and r.headers.get( 'Server' ) == "Sucuri/Cloudproxy" and "You are being redirected" in r.text:
        decode( soup( r.text, "html.parser" ) )
        return get( url, params, headers, cookies, timeout, allow_redirects )
    return r

def post(url, data=None, json=None, headers=reqHeaders, cookies={}, timeout=60, allow_redirects=False):
    cookies = combine( sucuri_cookies, cookies )
    r = requests.post( url, data=data, json=json, headers=headers, cookies=cookies, timeout=timeout, allow_redirects=allow_redirects )
    if r.status_code == 200 and r.headers.get( 'Server' ) == "Sucuri/Cloudproxy" and "You are being redirected" in r.text:
        decode( soup( r.text, "html.parser" ) )
        return post( url, data, json, headers, cookies, timeout, allow_redirects )
    return r

if __name__ == "__main__":
    print("[>] Requesting Site")
    resp = get( "https://footdistrict.com" )
    print("[*] Status Code: %i" % resp.status_code)
    print("[*] Content Length: %s" % len(resp.text))

Save the script somewhere and execute it:
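
Assuming you saved it as sucuri_bypass.py (a filename of my choosing, any name works), a successful run should print something along these lines, with the exact length varying per request:

$ python3 sucuri_bypass.py
[>] Requesting Site
[*] Status Code: 200
[*] Content Length: ...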

[Screenshot: testing the script]

Conclusion

Just like many other factors that can disturb the process of scraping, a firewall is likely to be one of the major problems. One of those firewalls is Sucuri, which mainly protects a site from being scraped by checking whether the user has certain cookies that are generated through javascript by the browser engine. We can bypass this filter by modifying a part of that script and executing it locally with the node run-time.