If you've heard about multiprocessing in Python and would like to learn how to use it, or you're learning to code and wondering how to make your scripts run faster and get the most out of your computer's processor, you may find this post helpful.
TL;DR
You need the following:
A queue in which you will store the jobs/tasks, a worker function that you assign to each processing core to do the actual work you want done, and a master function that fills the queue and hands the work out to the workers. The multiprocessing package will help you achieve this.
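A minimal sketch of that pattern might look like the following. The worker and master names here are just placeholders for illustration; the full, concrete example comes later in the post.

from multiprocessing import Process, Queue, cpu_count

def worker(queue):
    # each worker pulls tasks off the shared queue until it sees the stop signal
    while True:
        task = queue.get()
        if task is None:  # None is our agreed-upon stop signal
            break
        print('processing', task)

def master(tasks):
    queue = Queue()
    for task in tasks:
        queue.put(task)
    # one worker process per available core
    processes = [Process(target=worker, args=(queue,)) for _ in range(cpu_count())]
    for p in processes:
        p.start()
    for p in processes:
        queue.put(None)  # one stop signal per worker
    for p in processes:
        p.join()

if __name__ == '__main__':
    master(['task-1', 'task-2', 'task-3'])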
Prerequisites
Apart from the obvious, you should have BeautifulSoup4, Requests and termcolor installed with:
pip install beautifulsoup4 requests termcolor
Let's get into it
Python, like many other modern languages, lets you tap into more of your CPU's processing power than a single process can use, through multiprocessing.
I will not go into the details of what multiprocessing is or the full capabilities of the multiprocessing package; you can read the documentation for more on that.
I'll get right to the point and give you an example of how you can use it in your scripts or programmes to run concurrent processes on more than one core of your CPU. This will help you write programmes that execute faster.
In this example, which I'd like to think is practical enough for many to encounter at some point, we want to download over 1,750 datasets (stored as .csv files) that we've stumbled upon and would like to keep on our computer for future use.
We could solve this by writing a script that fetches the HTML page, extracts the URLs of the files we are looking for, and downloads the files locally. But to download all the files faster, we will use all the available cores on our processor to handle the task concurrently.
This is not the most elegant code, but it will get the point across. 😶
1. Import the packages we will use and declare a few variables we will need:
from multiprocessing import Process, Queue
from os import cpu_count
import requests
from bs4 import BeautifulSoup
import re
from termcolor import colored
source = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
titles = []
filenames = []
2. Declare the master function, and in it, extract the data we need from the webpage and assign jobs to the worker functions:
def master():
    urls = []
    workers = []
    # send an HTTP request and fetch the webpage
    response = requests.get(source).content
    soup = BeautifulSoup(response, 'html.parser')
    # store the titles of the datasets as listed in the table. These will be useful if we have problems with the filenames
    trs = soup.select("tr > td:nth-of-type(3)")
    for tr in trs:
        string = tr.string
        titles.append(string[:-1])
    # get all the urls that contain .csv in them
    for url in soup.find_all('a', href=re.compile('csv')):
        urls.append(url.get('href'))
    # parse urls to extract filenames as they were named originally
    for i, link in enumerate(urls):
        try:
            name = re.search(r"AER\/(.*?)\.csv", link).group(1)
        except AttributeError:
            name = titles[i]
        filenames.append(name)
    # add all the urls to the dataset files to a queue from which the workers will get their allotment
    work_queue = Queue()
    for i, url in enumerate(urls):
        work_queue.put((filenames[i], url))
    # for as many processing cores as are available on the computer, create a process to perform our task
    for i in range(cpu_count()):
        process = Process(target=download_csv, args=(work_queue,))
        workers.append(process)
        process.start()
    # we will use None as a stop signal, one per worker
    for worker in workers:
        work_queue.put(None)
    # wait for workers to quit
    for worker in workers:
        worker.join()
    print(colored('DONE', 'green'))
3. Declare a worker function that will go to the given URL, download the file, and save it under the given file name:
def download_csv(work_queue_items):
    while True:
        work_item = work_queue_items.get()
        # None is the stop signal sent by the master
        if work_item is None:
            break
        fname, url = work_item
        # replace non-word characters in the filename with hyphens
        fname = re.sub(r'(\W+)', '-', fname)
        dest_path = 'csvs/' + str(fname) + '.csv'
        try:
            response = requests.get(url)
            with open(dest_path, 'wb') as fx:
                fx.write(response.content)
            print(fname, 'successfully downloaded and written')
        except Exception:
            print(colored(f'An exception occurred for file {fname} @ {url}', 'red'))
4. And finally, run it all with:
if __name__ == '__main__':
    master()
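One thing to note: the worker writes into a csvs/ directory, so make sure it exists before running the script, for example with something like:

import os

os.makedirs('csvs', exist_ok=True)  # create the output directory if it doesn't already exist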
Conclusion
A script that performs this task without multiprocessing took an average (over five runs) of 20.856 minutes to download all the files.
The multiprocessing implementation took an average of 3.61 minutes to perform the same task.
I ran this code on a six-core Intel Core i5-9500 CPU clocked at 3 GHz.
How fast this runs on your system will depend on your system's configuration and your internet speed.
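If you want to measure this on your own machine, one simple way is to time the call to master(); this is just an illustration, not necessarily how the numbers above were produced.

import time

if __name__ == '__main__':
    start = time.perf_counter()
    master()
    print(f'Finished in {(time.perf_counter() - start) / 60:.2f} minutes')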
And that's it. I hope you learned a thing or two. If you see any way I could improve this code, or any mistakes I've made, leave a comment below or submit a pull request on the repo.