Python Multiprocessing Pool Download Errors

I'm using a Python multiprocessing pool to download thousands of images and process them with Python PIL.

Everything works as it should, except when an image download is corrupt; then PIL throws an error.

I'm looking for advice on how to re-run the pool, either re-downloading just the failed image or the whole pool. The total data per pool is around 15Mb.

I check that the data array returned by the pool is the expected length, but the next step then throws the error because the image is corrupt.
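
A note on the length check: `pool.map` returns one result per input item, so a failed download still occupies a slot in the list (as `None`, if the worker returned nothing). The sketch below illustrates this with a hypothetical `fake_downloader` and made-up URL strings, and uses `ThreadPool` (same `map` API as `multiprocessing.Pool`) only so it runs without a `__main__` guard.

```python
from multiprocessing.pool import ThreadPool


def fake_downloader(url):
    # Hypothetical stand-in for url_downloader: "bad" URLs yield no data.
    if "bad" in url:
        return None
    return [b"image-bytes", 0, 0]


urls = ["good-1", "bad-2", "good-3"]  # made-up inputs for illustration

with ThreadPool(2) as pool:
    data = pool.map(fake_downloader, urls)

# The lengths always match, so the check cannot detect a failed download.
lengths_match = len(data) == len(urls)

# Checking each entry for None identifies which URLs need another attempt.
failed = [url for url, result in zip(urls, data) if result is None]
```

Collecting the failed URLs this way would allow only those items to be sent through the pool again, rather than the whole batch.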

Pool code


    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    func = partial(url_downloader, map_id)
    data = pool.map(func, url_list)
    pool.close()
    pool.join()

    if len(data) == len(url_list):
        for d in data:
            image = Image.open(BytesIO(d[0]))
            dst.paste(image, (d[1], d[2]))
    else:
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, 'data size mismatch, skipping'))
        return

    exif_data = dst.getexif()
    # https://www.awaresystems.be/imaging/tiff/tifftags/extension.html
    # 0x270 ImageDescription - A string that describes the subject of the image
    # 0x269 DocumentName - The name of the document from which this image was scanned.
    # 0x285 PageName - The name of the page from which this image was scanned.
    exif_data[0x269] = str(helpers.normalizefilename(page_meta[0]))

    dst.save(os.path.join(image_folder, master_image_name), exif=exif_data)
    helpers.write_to_file(os.path.join(os.getcwd(), 'index.txt'), 'a+', index_text)

Download function

def url_downloader(map_id, url):

    header = {"User-Agent": "Mozilla/5.0 (X11; CrOS "
                            "x86_64 12871.102.0) "
                            "AppleWebKit/537.36 (KHTML, "
                            "like Gecko) "
                            "Chrome/81.0.4044.141 "
                            "Safari/537.36"}

    try:
        response = requests.get(url[0], headers=header)
        if response.status_code == 200:
            image_data = response.content
            return [image_data, url[1], url[2]]
    except requests.exceptions.RequestException as e:
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
        return

Error as requested

Traceback (most recent call last):
  File "/home/james/mapgrabber/./map-grabber.py", line 291, in <module>
    main()
  File "/home/james/mapgrabber/./map-grabber.py", line 69, in main
    auto_map_grabber(save_path, conn)
  File "/home/james/mapgrabber/./map-grabber.py", line 166, in auto_map_grabber
    map_builder(m[1], save_path, conn)
  File "/home/james/mapgrabber/./map-grabber.py", line 247, in map_builder
    image = Image.open(BytesIO(d[0]))
TypeError: 'NoneType' object is not subscriptable
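
The traceback points at `d[0]` with `d` being `None`, which is what `url_downloader` returns when the status code is not 200 (no `return` is reached) or when the `except` branch runs. A minimal sketch of that control flow, with the network call replaced by a hypothetical `status_code` argument:

```python
def downloader_like(status_code):
    # Mirrors the branches of url_downloader; the request itself is stubbed out.
    try:
        if status_code == 200:
            return [b"image-bytes", 0, 0]
    except Exception:
        return
    # No else branch: a non-200 status falls through and returns None implicitly.


result = downloader_like(404)

try:
    result[0]
except TypeError as exc:
    message = str(exc)
```

So this particular error indicates a missing result rather than corrupt image bytes; a corrupt image would normally surface as an exception raised by PIL itself.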

Edit:

For now I have added a simple try/except block. Maybe I should also limit the number of retries? I'm guessing it's usually just a single bad download, so this should suffice.
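
A limit on the retries could look something like the sketch below. The helper name `fetch_with_retries` and its parameters are assumptions for illustration, not part of the script: `fetch` would wrap the `requests.get` call and return the bytes, and `is_valid` would decide whether the bytes are usable.

```python
import time


def fetch_with_retries(fetch, url, is_valid, attempts=3, delay=0.0):
    """Call fetch(url) until is_valid(payload) passes or attempts run out.

    Returns the payload, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            payload = fetch(url)
        except Exception:
            # Treat any fetch error as a failed attempt and try again.
            payload = None
        if payload is not None and is_valid(payload):
            return payload
        if attempt < attempts:
            time.sleep(delay)  # optional pause before the next attempt
    return None
```

For image tiles, one option for `is_valid` is to open the bytes with `Image.open(BytesIO(payload))` and call `verify()` inside a try/except; the image has to be reopened afterwards before it can be pasted.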

Edit 2:

I have tested this further by saving the tiles into a directory for troubleshooting. At first I went down the route of checking the tile sizes, as I thought the problem was failed downloads. But on checking the directory of tiles I can see that all the images download correctly, yet sometimes (about 1 in 20 or so) a tile fails to be pasted correctly onto the larger image. I wonder if I'm causing some memory issue that leads to a glitch somewhere. Checking the image size or validity doesn't help, as there seem to be no issues there, and if there are, I catch them with my requests response check.

Current code

    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    func = partial(url_downloader, map_id)
    data = pool.map(func, url_list)
    pool.close()
    pool.join()

    for d in data:
        try:
            image = Image.open(BytesIO(d[0]))
            dst.paste(image, (d[1], d[2]))
            image.close()
        except Exception as e:
            helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
            map_builder(map_id, save_path, conn)

dst is the main image, created earlier in the script using the total dimensions of the final image; each tile is then pasted in at its coordinates.

This works perfectly most of the time. I just can't seem to find the reason for the missing tiles.
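
One thing visible in the loop above: when a tile raises, `map_builder` is called again and the loop then carries on with the next tile, so the tile that failed is never pasted into this `dst`. Whether that explains the missing tiles depends on code not shown here. As a rough sketch of an alternative, the loop could record which tiles failed instead of recursing. The names `paste_all` and `paste_tile` are hypothetical; `paste_tile` would wrap the `Image.open` and `dst.paste` calls.

```python
def paste_all(tiles, paste_tile):
    """Paste every tile; return (index, error message) for each one that failed."""
    failed = []
    for index, tile in enumerate(tiles):
        try:
            if tile is None:
                # The downloader returned nothing for this tile.
                raise ValueError("download returned no data")
            payload, x, y = tile
            paste_tile(payload, x, y)
        except Exception as exc:
            failed.append((index, str(exc)))
    return failed
```

The returned indices map back to `url_list`, so only those tiles would need to be downloaded and pasted again before saving.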


