A Simple Command Line Googler in Low-Level Sockets | Python

by hash3liZer · 30 October 2018


Lately, I have been busy writing a new tool, Subrake, a subdomain enumeration tool built on low-level sockets so the script can interact with HTTP applications at maximum speed. What I found most common in similar tools was the use of libraries like urllib2 and mechanize, which is a fine approach but also makes the process a bit slower. My goal was to speed up the whole process, and using raw sockets sped it up considerably.

Every high-level language provides both high-level and low-level interfaces for interacting with network applications. High-level interfaces make the job easier for us, but they sometimes fall short of a basic requirement; in our case, that requirement is the time taken to fetch a webpage. Sockets can be difficult to manage and maintain, but at other times they are rather easy. In our case, the socket work is pretty straightforward and will not take long.
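For comparison, the high-level route looks roughly like this with urllib2 (a minimal sketch; the custom User-Agent header is my own addition, since Google tends to reject the default one):

import urllib2

# One call handles DNS, the TCP connection, the HTTP request and the
# response parsing for us -- convenient, but with extra overhead.
req = urllib2.Request("http://www.google.com/search?q=python",
                      headers={"User-Agent": "Mozilla/5.0"})
print urllib2.urlopen(req).read()[:200]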

A Simple Google Crawler

In this tutorial, we will make a simple search engine crawler that accepts a query from the user, searches for it on Google, and returns the links it finds. First, the basic requirements:

import urllib

que = raw_input("Enter Search String: ")
# URL-encode the query (spaces become '+') and build a raw HTTP/1.0 request.
# HTTP/1.0 makes the server close the connection once the response is sent.
request = "GET /search?q=%s HTTP/1.0\r\nHost: google.com\r\n\r\n" % urllib.quote_plus(que)

Establishing the Connection ...

This takes the query from the user and makes it part of the request string. Next, we need to connect to google.com. A simple socket will do it:

import socket
import urllib

que = raw_input("Enter Search String: ")
request = "GET /search?q=%s HTTP/1.0\r\nHost: google.com\r\n\r\n" % urllib.quote_plus(que)

def connect():
    # AF_INET: IPv4, SOCK_STREAM: TCP, 0: default protocol for the type
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
    s.connect(("google.com", 80))

if __name__ == "__main__":
    connect()

The socket.socket class accepts three arguments:

  1. Address family: AF_INET for IPv4 addresses.
  2. Connection type: SOCK_STREAM for TCP connections and SOCK_DGRAM for UDP connections.
  3. Protocol: usually left at its default value of 0.
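
To make the first two arguments concrete, here is a minimal sketch that just creates the two kinds of sockets (nothing is connected yet):

import socket

# TCP (stream) socket over IPv4 -- the kind used throughout this tutorial
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# UDP (datagram) socket over IPv4, shown only for contrast
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)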

The connect method establishes a connection to google.com on port 80. Now that we are connected to the target site, we need to send the request headers to the server. As soon as the server receives them, it will send back an appropriate response.

import socket
import urllib

que = raw_input("Enter Search String: ")
request = "GET /search?q=%s HTTP/1.0\r\nHost: google.com\r\n\r\n" % urllib.quote_plus(que)
response = ""

def connect():
    global response
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
    s.connect(("google.com", 80))
    s.send(request)
    # Read the response in 1024-byte chunks; recv() returns an empty
    # string once the server has closed the connection.
    chunk = s.recv(1024)
    while chunk:
        response += chunk
        chunk = s.recv(1024)
    s.close()

if __name__ == "__main__":
    connect()

The send method sends the request to the server, while the recv method retrieves the response. recv takes a parameter indicating the maximum number of bytes to receive at a time. The loop continues until recv returns an empty string, which means the server has closed the connection and there is no more data to read.

Filtering Links ...

After all this, we will have the response headers along with the data in the response variable. The headers and the raw HTML can be separated by splitting on the "\r\n\r\n" string that marks the end of the headers. We then pass the HTML to the BeautifulSoup library and extract the found links:

from BeautifulSoup import BeautifulSoup as soup
import socket, sys
import urllib

que = raw_input("Enter Search String: ")
request = "GET /search?q=%s HTTP/1.0\r\nHost: google.com\r\n\r\n" % urllib.quote_plus(que)
response = ""

def connect():
    ...

def filter_links():
    if response:
        # Split on the first blank line only, in case the body itself
        # contains a "\r\n\r\n" sequence.
        html = soup(response.split("\r\n\r\n", 1)[1])
        for a in html.findAll("cite"):
            print a.text
    else:
        sys.exit("No Data Received from Server!")

if __name__ == "__main__":
    connect()
    filter_links()

The soup class creates an HTML parsing object. If a response was received, the links are extracted from it; otherwise the script exits with an error on screen.

Put it all together ...

Now, putting all the code together:

from BeautifulSoup import BeautifulSoup as soup
import socket, sys
import urllib

que = raw_input("Enter Search String: ")
# URL-encode the query and build a raw HTTP/1.0 request; HTTP/1.0 makes
# the server close the connection once the full response has been sent.
request = "GET /search?q=%s HTTP/1.0\r\nHost: google.com\r\n\r\n" % urllib.quote_plus(que)
response = ""

def connect():
    global response
    # AF_INET: IPv4, SOCK_STREAM: TCP, 0: default protocol for the type
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
    s.connect(("google.com", 80))
    s.send(request)
    # Read until the server closes the connection (recv() returns "").
    chunk = s.recv(1024)
    while chunk:
        response += chunk
        chunk = s.recv(1024)
    s.close()

def filter_links():
    if response:
        # Split headers from the body on the first blank line only.
        html = soup(response.split("\r\n\r\n", 1)[1])
        for a in html.findAll("cite"):
            print a.text
    else:
        sys.exit("No Data Received from Server!")

if __name__ == "__main__":
    connect()
    filter_links()

Fire Up!

Save the script and execute it:

$ python google.py

And the required results are printed out. This is the most basic use of sockets to retrieve data from the internet. There are various reasons to use low-level networking strategies; for example, you may want to retrieve only the headers and not the HTML body, in which case you can read just the first few bytes and break the loop once enough data has been received.
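
As a minimal sketch of that idea, the hypothetical fetch_headers helper below (not part of the script above) stops reading as soon as the blank line terminating the headers has arrived:

import socket

def fetch_headers(host, path="/"):
    # Hypothetical helper: request the page but stop reading as soon as
    # the "\r\n\r\n" that ends the headers shows up in the buffer.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host))
    data = ""
    while "\r\n\r\n" not in data:
        chunk = s.recv(256)
        if not chunk:
            break
        data += chunk
    s.close()
    return data.split("\r\n\r\n")[0]

print fetch_headers("google.com")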