Scrape and Download All Images from a Web Page with Python

by hash3liZer · 20 February 2019

Just like information can be scraped and extracted from HTML tags, as we have seen in this tutorial, images can be downloaded as well. The slight difference lies in storing them on local storage: when we open a link to an image, we get the data back in binary form, and it must be handled carefully to produce the right image on the local disk.

Normally we download images from a website by saving them through a browser or a download manager. But what if it's hundreds of images rather than one? We can scrape images in bulk by writing a few lines of Python, and the task can be sped up by spawning multiple threads to fetch more images per second. Libraries like requests, urllib2 and mechanize can retrieve the page source and the image data, which we then save to disk through the shutil library.
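As a minimal sketch of that idea, with a placeholder URL rather than a real one, downloading a single image with requests and saving it through shutil looks roughly like this:

import requests
import shutil

# Placeholder URL purely for illustration; substitute any direct link to a .jpg/.png file
url = "https://example.com/images/sample.jpg"

r = requests.get( url, stream=True )      # stream=True keeps the body as a raw, file-like object
if r.status_code == 200:
    r.raw.decode_content = True           # let the underlying response decode gzip/deflate transfer encoding
    f = open( "sample.jpg", "wb" )        # binary mode, so the bytes land on disk unchanged
    shutil.copyfileobj( r.raw, f )        # copy the raw stream straight into the file
    f.close()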

HTML Tags

We will use the requests library to download the images' binary data. The first step is to get all the image tags from a webpage. The requests library fetches the page source for us, and the tags and other important data can then be extracted with the BeautifulSoup library, which builds an HTML object with convenient functions for fetching specific tags, even with complex attributes.

#!/usr/bin/python
import requests
import sys
from BeautifulSoup import BeautifulSoup as soup

def get_source( link ):
    r = requests.get( link )
    if r.status_code == 200:
        return soup( r.text )
    else:
        sys.exit( "[~] Invalid Response Received." )

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )

if __name__ == "__main__":
    main()

So far we have created a simple function to get the webpage source code. Now we can scrape the image tags, depending on how the website produces its image results. In the case of Facebook, for example, we would have to look for the endpoints that return images in JSON format; JSON endpoints can be scraped more quickly than HTML tags.
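As a rough sketch, with a made-up endpoint and field names rather than a real API, scraping such a JSON endpoint usually comes down to one request and a couple of dictionary lookups:

import requests

# Hypothetical endpoint and field names, purely for illustration
r = requests.get( "https://example.com/api/photos?page=1" )
if r.status_code == 200:
    data = r.json()                                       # parse the JSON body into a dict
    links = [ item["src"] for item in data["photos"] ]    # collect the image URLs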

The BeautifulSoup object provides various search functions for extracting tags that match the given attributes. There are several such functions, like find, findNext, findChildren, and findChild. We will use findAll to get all the image tags. Let's see:

#!/usr/bin/python
import requests
import sys
from BeautifulSoup import BeautifulSoup as soup

def get_source( link ):
    ...

def filter( html ):
    imgs = html.findAll( "img" )
    if imgs:
        return imgs
    else:
        sys.exit("[~] No images detected on the page.")

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )
    tags = filter( html )

if __name__ == "__main__":
    main()

The filter function extracts all the img tags from the HTML. We can make it more specific by passing a set of attributes to the attrs argument. An example:

>>> html.findAll( "img", attrs={'class': 'generalID', 'id': 'specificID'} )

However, there is another case we have to handle. Sometimes an element carries multiple classes and we don't necessarily need all of them to match. For example:

<img src="" class="aclass bclass class">

To cope with such situations we can pass a function instead of an attribute value. To make it work with the class attribute, we can supply an anonymous function that returns True when the given condition holds. The following statement makes this clear by extracting only the tags that carry both aclass and bclass:

>>> html.findAll("img", attrs={'class': lambda x: x and 'aclass' in x.split() and 'bclass' in x.split()})

A looser condition can be formed by using the or operator instead of and, to pick up every tag that carries at least one of the mentioned classes:

>>> html.findAll("img", attrs={'class': lambda x: x and ('aclass' in x.split() or 'bclass' in x.split())})

Images

Let's now loop through each image tag and request the binary data. Here we return to the requests library, but this time with a little caution about whether the link in the image tag is valid or not. To sort this out, we put a regular expression in place:

#!/usr/bin/python
import requests
import sys
import shutil
import re
from BeautifulSoup import BeautifulSoup as soup

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )
    tags = filter( html )
    for tag in tags:
        src = tag.get( "src" )
        if src:
            src = re.match( r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", src )
            if src:
                (link, name) = src.groups()
                if not link.startswith("http"):
                    link = "https://www.drivespark.com" + link
                r = requests.get( link, stream=True )
                if r.status_code == 200:
                    r.raw.decode_content = True
                    f = open( name.split("/")[-1], "wb" )
                    shutil.copyfileobj(r.raw, f)
                    f.close()
                    
if __name__ == "__main__":
    main()

So, we get the stream of data and save it as an image. The images are saved in the current directory under the name taken from the link. You can use the os module to create a separate path and save the images there.

os.mkdir( os.path.join( os.getcwd(), 'images' ) )
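As a small sketch of wiring that in, assuming name, r and shutil from the download loop above, we create the folder once and then join it into the path when opening the file:

import os

if not os.path.isdir( "images" ):                         # skip the mkdir if the folder already exists
    os.mkdir( os.path.join( os.getcwd(), 'images' ) )

path = os.path.join( "images", name.split("/")[-1] )      # e.g. images/car-image.png
f = open( path, "wb" )
shutil.copyfileobj( r.raw, f )
f.close()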

The os.mkdir call creates a directory named images inside your current folder. After this it's almost done. However, since we are making many requests to a single source, we can thread this up, i.e. spawn multiple requests at the same time. The threading module will do the work here:

Threading Requests

#!/usr/bin/python
import requests
import sys
import shutil
import re
import threading
from BeautifulSoup import BeautifulSoup as soup

THREAD_COUNTER = 0
THREAD_MAX     = 5

def requesthandle( link, name ):
    global THREAD_COUNTER
    THREAD_COUNTER += 1
    r = requests.get( link, stream=True )
    if r.status_code == 200:
        r.raw.decode_content = True
        f = open( name, "wb" )
        shutil.copyfileobj(r.raw, f)
        f.close()
        print "[*] Downloaded Image: %s" % name
    THREAD_COUNTER -= 1

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )
    tags = filter( html )
    for tag in tags:
        src = tag.get( "src" )
        if src:
            src = re.match( r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", src )
            if src:
                (link, name) = src.groups()
                if not link.startswith("http"):
                    link = "https://www.drivespark.com" + link
                _t = threading.Thread( target=requesthandle, args=(link, name.split("/")[-1]) )
                _t.daemon = True
                _t.start()

                while THREAD_COUNTER >= THREAD_MAX:
                    pass

    while THREAD_COUNTER > 0:
        pass

if __name__ == "__main__":
    main()

Exceptions

What now? We are still missing an important part here. If you execute the script at this moment, it will work as required. But what about the exceptions that can occur during the spawned requests to the image links? We have to cover this situation with a try/except statement so that a failed request does not mess up the terminal.

#!/usr/bin/python
import requests
import sys
import shutil
import re
import threading
from BeautifulSoup import BeautifulSoup as soup

THREAD_COUNTER = 0
THREAD_MAX     = 5

def requesthandle( link, name ):
    global THREAD_COUNTER
    THREAD_COUNTER += 1
    try:
        r = requests.get( link, stream=True )
        if r.status_code == 200:
            r.raw.decode_content = True
            f = open( name, "wb" )
            shutil.copyfileobj(r.raw, f)
            f.close()
            print "[*] Downloaded Image: %s" % name
    except Exception, error:
        print "[~] Error Occured with %s : %s" % (name, error)
    THREAD_COUNTER -= 1

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )
    tags = filter( html )
    for tag in tags:
        src = tag.get( "src" )
        if src:
            src = re.match( r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", src )
            if src:
                (link, name) = src.groups()
                if not link.startswith("http"):
                    link = "https://www.drivespark.com" + link
                _t = threading.Thread( target=requesthandle, args=(link, name.split("/")[-1]) )
                _t.daemon = True
                _t.start()

                while THREAD_COUNTER >= THREAD_MAX:
                    pass

    while THREAD_COUNTER > 0:
        pass


if __name__ == "__main__":
    main()

This prevents the terminal from getting messed up: any error during a download is caught and printed in a simple format.

Execution

Finally, put the code in a sequence and save it somewhere to execute:

#!/usr/bin/python
import requests
import sys
import shutil
import re
import threading
from BeautifulSoup import BeautifulSoup as soup

THREAD_COUNTER = 0
THREAD_MAX     = 5

def get_source( link ):
    r = requests.get( link )
    if r.status_code == 200:
        return soup( r.text )
    else:
        sys.exit( "[~] Invalid Response Received." )

def filter( html ):
    imgs = html.findAll( "img" )
    if imgs:
        return imgs
    else:
        sys.exit("[~] No images detected on the page.")

def requesthandle( link, name ):
    global THREAD_COUNTER
    THREAD_COUNTER += 1
    try:
        r = requests.get( link, stream=True )
        if r.status_code == 200:
            r.raw.decode_content = True
            f = open( name, "wb" )
            shutil.copyfileobj(r.raw, f)
            f.close()
            print "[*] Downloaded Image: %s" % name
    except Exception, error:
        print "[~] Error Occured with %s : %s" % (name, error)
    THREAD_COUNTER -= 1

def main():
    html = get_source( "https://www.drivespark.com/wallpapers/" )
    tags = filter( html )
    for tag in tags:
        src = tag.get( "src" )
        if src:
            src = re.match( r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", src )
            if src:
                (link, name) = src.groups()
                if not link.startswith("http"):
                    link = "https://www.drivespark.com" + link
                _t = threading.Thread( target=requesthandle, args=(link, name.split("/")[-1]) )
                _t.daemon = True
                _t.start()

                while THREAD_COUNTER >= THREAD_MAX:
                    pass

    while THREAD_COUNTER > 0:
        pass


if __name__ == "__main__":
    main()

Execute the script :

$ python scraper.py 

So, it really was that simple to download all the images from the page.

Breakdown

Now, we can break down each part of the script and analyze exactly what we are trying to achieve and how to contribute more with a few extra lines of code. The first part was where we created the get_source function. It was pretty simple: we requested a page through requests, verified the response, and returned the parsed data.

To make POST requests, you can use the post method:

>>> requests.post( "https://url", data={'key': 'value'}, headers={}, cookies={} )

The next part is where we scraped the HTML image tags, and I don't think it needs further explanation. Coming to the last part, where we looped through each image, let's start with the threading process. To initiate threads we used Thread from the threading module, but to limit the number of concurrent threads we have to hold the main loop back, either by spinning in a loop or by sleeping for a short time.

The two variables THREAD_COUNTER and THREAD_MAX are the control variables that determine when to hold the program in a waiting loop until the number of running threads falls below the limit:

while THREAD_COUNTER >= THREAD_MAX:
    time.sleep( 5 )
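Note that the sleep-based variant needs an import time at the top of the script. As an alternative sketch, not what the script above uses, the standard library's threading.active_count() can serve as the gate instead of a hand-maintained counter:

import threading
import time

THREAD_MAX = 5

def wait_for_free_slot():
    # active_count() includes the main thread, hence the + 1
    while threading.active_count() >= THREAD_MAX + 1:
        time.sleep( 0.1 )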

The expression that extracts the link and the file name is doing an important job here. It matches the common formats of image links ending in jpg or png:

$ link = "https://wwww.random.com/asdasdasd/asdasd.jpg"
$ re.match(r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", link).groups()
$ link = "/images/car-image.png"
$ re.match(r"((?:https?:\/\/.*)?\/(.*\.(?:png|jpg)))", link).groups()
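For reference, the first call returns the pair ('https://www.random.com/asdasdasd/asdasd.jpg', 'asdasd.jpg') and the second returns ('/images/car-image.png', 'images/car-image.png'). In the script, name.split("/")[-1] then trims the second element down to the bare file name, e.g. car-image.png.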

Conclusion

Scraping images from a website turns out to be not much more complicated than scraping information from ordinary HTML tags. We just have to go a step beyond normal file handling and write the images correctly by streaming and decoding the right content to disk.