by vault . 21 May 2018
In this tutorial, we will write a script, a simple crawler or web spider, that locates and discovers the directories and sub-directories leading to other documents, for both same and cross domains. In effect, we will map a website's structure through the links provided within a document itself, such as the href and action attributes used by anchor and form tags respectively.
Have you ever noticed how websites like Google and Bing process a submitted query? They use links and document crawlers. Their main focus is the provided string, the query. They try to find the most relevant documents on the Internet, and the results are then shown in a defined order.
Web Crawler
A crawler, or spider, in web technology is just an application. It crawls through web pages and locates other web resources such as links, paragraphs and headings. A powerful web crawler tries to unmask every single aspect of a document and unveil even the smaller parts included in its scripts.
We will do the whole of this process using two libraries: mechanize and BeautifulSoup. Mechanize will act as a browser for us. We will first initiate a browser object, read the data from the given document (link) and then pass it to BeautifulSoup. That library will then be instructed to locate elements, with or without specific attributes, in the document, as we will see.
STEP 1
As described before, we need two libraries to make our script work. Install them using pip:
pip install BeautifulSoup mechanize
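To confirm that both libraries installed correctly, and to preview the flow described above, here is a minimal sketch (the URL is just a placeholder) that opens a page with mechanize and parses it with BeautifulSoup:

import mechanize
from BeautifulSoup import BeautifulSoup

browser = mechanize.Browser()
response = browser.open( "http://example.com" )   # mechanize acts as our browser
soup = BeautifulSoup( response.read() )           # BeautifulSoup parses the html
print len( soup.findAll( 'a' ) ), "anchor tags found"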
STEP 2
First, we will initiate a browser object to read and parse data from the page, along with its links. Create a new function, name it browser, and pass it one argument: the link to open. So,
import mechanize

def browser( link ):
    browser = mechanize.Browser()        # 1
    document = browser.open( link )      # 2
    if document.code < 400:              # 3
        html = document.read()
        list__ = filterSoup( html )      # 4
        return list__
    else:
        print "Error Code: ", document.code
        import sys
        sys.exit(1)
At 1, a browser object was created. At 2, we opened the document. At 3, we checked the document status to see whether it exists or not. At 4, we passed the data to the filterSoup function, which will return a list, as we will see shortly.
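By default, mechanize honours robots.txt and sends its own User-Agent header, and some sites refuse such requests. If the open call fails for you, a hedged tweak (the header string is just an example) is to relax those settings before opening the link:

browser = mechanize.Browser()
browser.set_handle_robots( False )                         # do not stop on robots.txt rules
browser.addheaders = [ ( 'User-agent', 'Mozilla/5.0' ) ]   # present a browser-like User-Agent
document = browser.open( link )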
STEP 3
Now, we will extract elements with the help of BeautifulSoup. In the previous step, the document html was passed to a function filterSoup, which is expected to return a list of the html tags/elements it found. So, let's import BeautifulSoup and define the logic to locate the required elements. Let's start with the href attribute of anchor tags.
import mechanize
from BeautifulSoup import BeautifulSoup as bsoup

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html )                      # 1
    for link in page.findAll( 'a' ):          # 2
        data__.append( link.get( 'href' ) )   # 3
    return data__

def browser( target ):
    ...
    return list__
Note the above steps: at 1, we parsed the html into a BeautifulSoup object; at 2, we located every anchor tag in the page; at 3, we read each tag's href attribute and appended it to the list.
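findAll can also filter by attributes, which is handy when you only want tags that actually carry the attribute you care about. A small sketch, using the same BeautifulSoup 3 syntax as this tutorial:

page = bsoup( html )
for link in page.findAll( 'a', href=True ):       # only anchors that have an href attribute
    print link[ 'href' ]
for form in page.findAll( 'form', action=True ):  # only forms that have an action attribute
    print form[ 'action' ]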
Now, you can add other elements with this very same procedure. Let's add form elements too.
...

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html )
    for link in page.findAll( 'a' ):
        data__.append( link.get( 'href' ) )
    for link in page.findAll( 'form' ):
        data__.append( link.get( 'action' ) )
    return data__

def browser( target ):
    ...
    return list__
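Anchors without an href and forms without an action make link.get() return None, and the same link often appears more than once in a page. A hedged refinement of filterSoup that skips empty values and duplicates could look like this:

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html )
    for tag, attr in ( ( 'a', 'href' ), ( 'form', 'action' ) ):
        for element in page.findAll( tag ):
            value = element.get( attr )
            if value and value not in data__:   # skip missing attributes and duplicates
                data__.append( value )
    return data__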
STEP 4
Until now, we have built the program logic. All we need now is to print the results on the screen. Create a new function, name it main, get the argument from the sys.argv list and pass it to the browser function:
...
import sys

def main():
    tgt = sys.argv[1]
    links__ = browser( str( tgt ) )
    if len( links__ ) == 0:
        sys.exit( 'Nothing Found in the Document' )
    else:
        for link_ in links__:
            print link_

if __name__ == "__main__":
    main( )
Up here, we did almost nothing special: we just get the data from the browser function, loop through it and print it on the screen.
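Note that main() will raise an IndexError if the script is run without a target, so a small guard is worth adding. This sketch just prints a usage hint and exits:

def main():
    if len( sys.argv ) < 2:
        sys.exit( 'Usage: python spider.py <target-url>' )
    tgt = sys.argv[1]
    links__ = browser( str( tgt ) )
    ...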
STEP 5
Now put it all together in sequence and wrap it up:
import mechanize
from BeautifulSoup import BeautifulSoup as bsoup
import sys

def filterSoup( html ):
    data__ = [ ]
    page = bsoup( html )
    for link in page.findAll( 'a' ):
        data__.append( link.get( 'href' ) )
    for link in page.findAll( 'form' ):
        data__.append( link.get( 'action' ) )
    return data__

def browser( link ):
    browser = mechanize.Browser()
    document = browser.open( link )
    if document.code < 400:
        html = document.read()
        list__ = filterSoup( html )
        return list__
    else:
        print "Error Code: ", document.code
        sys.exit(1)

def main():
    tgt = sys.argv[1]
    links__ = browser( str( tgt ) )
    if len( links__ ) == 0:
        sys.exit( 'Nothing Found in the Document' )
    else:
        for link_ in links__:
            print link_

if __name__ == "__main__":
    main( )
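The script so far maps a single document. To actually walk through sub-directories as described in the introduction, the browser function can be called again on every same-domain link it finds. Here is a minimal, hedged sketch of that idea, depth-limited so it cannot run forever; the crawl function and its parameters are my own names, not part of the script above:

from urlparse import urljoin, urlparse

def crawl( start, depth=2, seen=None ):
    if seen is None:
        seen = set()
    if depth == 0 or start in seen:
        return seen
    seen.add( start )
    for found in browser( start ) or [ ]:
        absolute = urljoin( start, found )                            # resolve relative paths
        if urlparse( absolute ).netloc == urlparse( start ).netloc:   # stay on the same domain
            crawl( absolute, depth - 1, seen )
    return seen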
STEP 6
Let's test our newly created web spider. Let techchummi.com be our target. So,
python spider.py https://www.techchummi.com
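If the target returns many links, it can be handy to save the output for later inspection, for example by redirecting it to a file from the shell (links.txt is just an example name):

python spider.py https://www.techchummi.com > links.txt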
Conclusion
We have created a simple web crawler with a few lines of Python. We can collect multiple attributes, elements, headings, inline and block elements, and so on. BeautifulSoup keeps this simple for us by letting us gather all the data into a list that can be looped over.
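As the conclusion notes, the same approach extends beyond links. For instance, a hedged snippet that pulls page headings with the same BeautifulSoup 3 calls used above:

page = bsoup( html )
for heading in page.findAll( [ 'h1', 'h2', 'h3' ] ):        # findAll accepts a list of tag names
    print ' '.join( heading.findAll( text=True ) )          # join the text nodes inside the heading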