Map A Website In Python: Make A Website Link Crawler

by vault · 21 May 2018


In this tutorial, we will write a simple crawler, or web spider: a script that locates and discovers all the directories and sub-directories that lead from one document to another, on the same domain or across domains. In other words, we will map a website's structure through the links the document itself provides, such as the href and action attributes used by anchor and form tags respectively.

Have you ever noticed how websites like Google and Bing process a submitted query? They use link and document crawlers. Their main focus is the provided string, the query: they try to find the most relevant documents on the Internet, and the results are then shown in a ranked order.

Web Crawler

A Crawler or Spider in web technology is just an application. It crawls through web pages and locates other web resources such as links, paragraphs, and headings. A powerful web crawler tries to unmask every aspect of a document and unveil even the smaller parts embedded in its scripts.

We will do this whole process using two libraries: mechanize and BeautifulSoup. Mechanize will act as a browser for us. We will first initiate a browser object, read the data from the given document (link) and then pass it to BeautifulSoup. That library will then be instructed to locate elements, with or without specific attributes, in the document, as we will see below.
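
Before we split the work into functions, here is a minimal sketch of how the two libraries fit together. example.com is just a placeholder page, and the one-liner at the end lists every href found on it:

import mechanize
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
response = br.open( 'http://example.com' )              # mechanize fetches the page for us
soup = BeautifulSoup( response.read() )                 # BeautifulSoup parses the downloaded HTML
print [ a.get( 'href' ) for a in soup.findAll( 'a' ) ]  # every href on the page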

STEP 1

Packages

As described before, we need two libraries to make our script work. Install them using pip:

pip install BeautifulSoup mechanize
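
If the install went through, both modules should import cleanly in a Python 2 interpreter. A quick sanity check (not part of the crawler itself):

python -c "import mechanize; from BeautifulSoup import BeautifulSoup; print 'OK'"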

STEP 2

Browser

At first, we will initiate a browser object to open the given link and read its data. Create a new function, name it browser, and pass it one argument, the link to open. So,

import mechanize

def browser( link ):
    br = mechanize.Browser()   # 1
    document = br.open( link )   # 2
    if document.code < 400:        # 3
        html = document.read()
        list__ = filterSoup( html )       # 4
        return list__
    else:
        print "Error Code: ", document.code
        import sys
        sys.exit(1)

In 1, a browser object was created. In 2, we opened the document. In 3, we checked the document's status code, i.e. whether the page exists or not. In 4, we passed the data to the filterSoup function, which will return a list, as we will see next.
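
For reference, the response object returned by mechanize's open() behaves much like a urllib2 response. Besides code and read(), you can also check the final URL and the headers, which is handy when a request gets redirected (example.com below is just a placeholder):

import mechanize

response = mechanize.Browser().open( 'http://example.com' )   # placeholder page
print response.code       # HTTP status code, e.g. 200
print response.geturl()   # final URL after any redirects
print response.info()     # the response headers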

STEP 3

Extract Elements

Now, we will extract elements with the help of BeautifulSoup. In the previous step, the document HTML was passed to a function filterSoup, which is expected to return a list of the HTML tags/elements it finds. So, let's import BeautifulSoup and define the logic to locate the required elements. Let's start with the href attribute of anchor tags.

import mechanize
from BeautifulSoup import BeautifulSoup as bsoup 

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html )      # 1
    for link in page.findAll( 'a' ):     # 2
        if link.get( 'href' ):           # skip anchors that have no href
            data__.append( link.get( 'href' ) )   # 3
    return data__

def browser( target ):
    ...
    return list__

Note the above steps:

  1. bsoup( html ) parses all the HTML data into a soup object that we can search through.
  2. The findAll method of the soup object returns a list of the required elements, which in this case are anchor tags. Every anchor tag in the document comes back in that list.
  3. Then we append the value of each found tag's href attribute to the data__ list, skipping any anchor that has no href so that None values do not sneak into the result. A tiny standalone example follows this list.
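
To see these two calls in isolation, here is a tiny self-contained parse of a made-up HTML snippet; no network access involved:

from BeautifulSoup import BeautifulSoup as bsoup

html = '<a href="/about">About</a> <a href="/contact">Contact</a>'
page = bsoup( html )
for link in page.findAll( 'a' ):
    print link.get( 'href' )    # prints /about, then /contact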

Now, you can add other elements with this very same procedure. Let's add form elements too.

...

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html ) 
    for link in page.findAll( 'a' ):
        if link.get( 'href' ):
            data__.append( link.get( 'href' ) )
    for form in page.findAll( 'form' ):
        if form.get( 'action' ):
            data__.append( form.get( 'action' ) )
    return data__

def browser( target ):
    ...
    return list__
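
One thing to keep in mind: many href and action values are relative paths like /about or post-1.html. If you would rather have full URLs in your site map, you could resolve them against the page address with urljoin from the standard library. This is an optional extra, not part of the script above:

from urlparse import urljoin     # Python 2 standard library

base = 'https://www.techchummi.com/blog/'            # the page the link was found on
print urljoin( base, '/about' )                      # https://www.techchummi.com/about
print urljoin( base, 'post-1.html' )                 # https://www.techchummi.com/blog/post-1.html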

STEP 4

Print

Until now, we have our program logic. All we need now is to print the results on the screen. Create a new function, name it main, get the target from the sys.argv list and pass it to the browser function:

...
import sys

def main():
    tgt = sys.argv[1]
    links__ = browser( str( tgt ) )
    if len( links__ ) == 0:
        sys.exit( 'Nothing Found in the Document' )
    else:
        for link_ in links__:
            print link_

if __name__ == "__main__":
    main( )
       

Up here, we did almost nothing special: we just get the data from the browser function, loop through it, and print each link on the screen.
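
One small, optional guard: if the script is run without a target URL, sys.argv[1] raises an IndexError. You could open main with a usage check like this:

def main():
    if len( sys.argv ) < 2:                              # no target given on the command line
        sys.exit( 'Usage: python spider.py <url>' )
    tgt = sys.argv[1]
    ...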

STEP 5

Wrap UP

Now put it all together in one script. Wrap it up...

import mechanize
from BeautifulSoup import BeautifulSoup as bsoup 
import sys

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html ) 
    for link in page.findAll( 'a' ):
        if link.get( 'href' ):                  # skip anchors without an href
            data__.append( link.get( 'href' ) )
    for form in page.findAll( 'form' ):
        if form.get( 'action' ):                # skip forms without an action
            data__.append( form.get( 'action' ) )
    return data__

def browser( link ):
    br = mechanize.Browser()
    document = br.open( link )
    if document.code < 400:   
        html = document.read()
        list__ = filterSoup( html )
        return list__  
    else:
        print "Error Code: ", document.code
        import sys
        sys.exit(1)

def main():
    tgt = sys.argv[1]
    links__ = browser( str( tgt ) )
    if len( links__ ) == 0:
        sys.exit( 'Nothing Found in the Document' )
    else:
        for link_ in links__:
            print link_

if __name__ == "__main__":
    main( )

STEP 6

Run it

Let's test our newly created web spider. Let techchummi.com be our target. So,

python spider.py https://www.techchummi.com
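
Since the script prints one link per line, you can also redirect the output to a file to keep the site map around:

python spider.py https://www.techchummi.com > links.txt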

Conclusion

We have created a simple web crawler with a few lines of Python. We can collect multiple attributes and elements: headings, inline and block elements, and more. BeautifulSoup keeps this simple for us by turning all the parsed data into list-like sequences that we can loop over.