This notebook gives some small examples of how to use the web scraping utilities from cbs_utils. Two utilities are discussed: the get_page_from_url function and the UrlSearchStrings class.
The get_page_from_url function obtains the contents of a URL and stores the result in a cache. The next time you run the function, the contents are read from the cache instead of being downloaded again, which makes repeated runs much faster.
Here, a small example is given. First, import the required modules:
import logging
from pathlib import Path
from bs4 import BeautifulSoup
from cbs_utils.misc import (create_logger, merge_loggers)
from cbs_utils.regular_expressions import (KVK_REGEXP, ZIP_REGEXP, BTW_REGEXP)
from cbs_utils.web_scraping import (get_page_from_url, UrlSearchStrings)
BeautifulSoup is used to parse the contents of the web site. The create_logger and merge_loggers functions are used to quickly set up the logging system. The regular_expressions module provides standard regular expressions that we can use to find strings such as the postal code (Dutch form), the tax number, etc.
Next, set up the logging module using the cbs_utils misc function create_logger:
# set up logging
log_level = logging.DEBUG  # change to logging.INFO for less output
log_format = logging.Formatter('%(levelname)8s --- %(message)s')
logger = create_logger(console_log_level=log_level, formatter=log_format)
merge_loggers(logger, "cbs_utils.web_scraping", logger_level_to_merge=logging.INFO)
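For readers who do not have cbs_utils at hand, a roughly comparable console logger can be sketched with the standard logging module alone. This is only an assumption about what the setup amounts to, not the implementation of create_logger; the helper name make_console_logger and the logger name scraping_demo are hypothetical.

```python
import logging
import sys


def make_console_logger(name="scraping_demo", level=logging.DEBUG):
    """Hypothetical stand-in for create_logger: one console handler, one format."""
    demo_logger = logging.getLogger(name)
    demo_logger.setLevel(level)
    demo_logger.propagate = False  # avoid duplicate lines via the root logger
    # Add the handler only once, so re-running a notebook cell does not duplicate output
    if not demo_logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(levelname)8s --- %(message)s"))
        demo_logger.addHandler(handler)
    return demo_logger


demo_logger = make_console_logger()
demo_logger.info("logger is ready")
```

Calling make_console_logger a second time returns the same logger object without adding a second handler, which matters in a notebook where cells are often re-run.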
For this example, a tmp directory is made in your working directory to store the cache. First, we clean this directory in case it still exists from a previous run:
# set the cache directory and clean the cache files of a previous run
cache_directory = Path("tmp")
clean_cache = True
if clean_cache:
    if cache_directory.exists():
        logger.info(f"Cleaning cache directory {cache_directory}")
        for item in cache_directory.iterdir():
            item.unlink()
        cache_directory.rmdir()
    else:
        logger.info(f"Cache directory {cache_directory} was already removed")
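The loop above removes files only: item.unlink() raises an error if the cache directory ever contains a subdirectory. A minimal alternative sketch uses shutil.rmtree, which removes nested content as well. The function name clean_directory is hypothetical, and the demonstration runs on a temporary directory rather than on the tmp directory of this notebook.

```python
import shutil
import tempfile
from pathlib import Path


def clean_directory(directory: Path) -> bool:
    """Remove directory and everything below it; return True if something was removed."""
    if directory.exists():
        shutil.rmtree(directory)
        return True
    return False


# Demonstration on a throw-away directory with a nested subdirectory
base = Path(tempfile.mkdtemp())
demo_cache = base / "tmp"
(demo_cache / "nested").mkdir(parents=True)
(demo_cache / "nested" / "page.pkl").write_text("cached")

removed_first = clean_directory(demo_cache)   # the directory existed
removed_second = clean_directory(demo_cache)  # the directory was already gone
shutil.rmtree(base)
```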
Now we can demonstrate the get_page_from_url function.
%%time
# global needed to work around a jupyter-notebook bug where variables are only local to the cell when using %%time
global page, url
url = "https://www.example.com"
page = get_page_from_url(url, cache_directory=cache_directory)
As you can see, it took between 0.5 and 5.5 s (depending on your internet speed) to get all the information from the internet. Because a cache_to_disk decorator has been added to the get_page_from_url function, a cache file was created in the tmp directory:
for ii, item in enumerate(cache_directory.iterdir()):
    logger.info(f"Cache file {ii}: {item}")
The contents of the URL were stored in page and look like this:
soup = BeautifulSoup(page.text, 'lxml')
logger.info(soup.body)
We can run the same function again. Since we now have a cache file, it will be much faster:
%%time
global page2
page2 = get_page_from_url(url, cache_directory=cache_directory)
Indeed, the same function call now runs in about 1 ms. Now compare the results:
soup2 = BeautifulSoup(page2.text, 'lxml')
logger.info("Contents is equal: {}".format(soup.body == soup2.body))
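The speed-up seen above comes from storing the result of the first call on disk. How such a cache could work is sketched below with a hypothetical decorator, disk_cache, that pickles the return value under a file name derived from the URL. It is an illustration under these assumptions, not the caching code of cbs_utils, and fake_download stands in for a real request.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path


def disk_cache(func):
    """Hypothetical decorator: cache the result of func(url, cache_directory) on disk."""
    @wraps(func)
    def wrapper(url, cache_directory):
        cache_directory = Path(cache_directory)
        cache_directory.mkdir(parents=True, exist_ok=True)
        # A hash of the url gives a file name that is safe on any file system
        file_name = hashlib.sha256(url.encode()).hexdigest() + ".pkl"
        cache_file = cache_directory / file_name
        if cache_file.exists():
            with open(cache_file, "rb") as stream:
                return pickle.load(stream)
        result = func(url, cache_directory)
        with open(cache_file, "wb") as stream:
            pickle.dump(result, stream)
        return result
    return wrapper


download_calls = []  # records every real (non-cached) execution


@disk_cache
def fake_download(url, cache_directory):
    """Stand-in for a real download, so the sketch runs without a network."""
    download_calls.append(url)
    return f"<html>contents of {url}</html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    first = fake_download("https://www.example.com", tmp_dir)
    second = fake_download("https://www.example.com", tmp_dir)  # read from disk
```

The second call returns the same contents, but fake_download itself is executed only once.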
The UrlSearchStrings class can be used to recursively crawl a website and search for a list of regular expressions we want to obtain from the website. Again, the result is cached, so in case you want to run it again with different search strings, it will run significantly faster.
Let's set up our first search session, trying to retrieve the postal code and kvk number from a web page. The regular expressions are obtained from the regular_expressions module of cbs_utils and are discussed below.
# the regular expression are obtained from the cbs_utils.regular_expressions module
searches = dict(
postcode=ZIP_REGEXP,
kvknumber=KVK_REGEXP
)
url = "www.be-one.nl"
logger.info(f"Start crawling the url {url} and search for the following regular expressions:")
for key, reg_exp in searches.items():
    logger.info("{:10s}: {}".format(key, reg_exp))
logger.info("\n")
The postcode regular expression is quite clear: it matches any four-digit number (where the first digit cannot be a 0), followed by two letters (which must be capitals). There may be a space between the digits and the letters. So the following strings match: 1234AB and 4545 YZ.
The kvk number is a bit more complicated. It is an 8-digit number which may contain dots, such as 123.456.78 or 12345678. Normally, we would use word boundaries (\b) around the 8 digits to prevent a 10-digit number from matching as well. However, a hyphen (-) is a word boundary too, which gives a match for, for instance, M-12345678. Strings of this type occur frequently in URLs, but they are not kvk numbers. To avoid treating the hyphen as a word boundary, we have explicitly listed the characters that count as a word boundary, resulting in a better match for kvk numbers.
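The behaviour described above can be tried out with a small sketch. The patterns below are illustrative only and are written from the description in the text; the actual ZIP_REGEXP and KVK_REGEXP in cbs_utils may differ, and the set of boundary characters chosen here is an assumption.

```python
import re

# Illustrative patterns only; the actual ZIP_REGEXP and KVK_REGEXP may differ
ZIP_SKETCH = re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b")

# Characters accepted as a boundary around a kvk number; the hyphen is left out on purpose
BOUNDARY = r"[\s.,:;()]"
KVK_SKETCH = re.compile(
    rf"(?:^|(?<={BOUNDARY}))\d{{3}}\.?\d{{3}}\.?\d{{2}}(?=$|{BOUNDARY})"
)
NAIVE_KVK = re.compile(r"\b\d{8}\b")  # plain word boundaries, for comparison

zip_candidates = ["1234AB", "4545 YZ", "0123AB", "1234ab"]
kvk_candidates = ["kvk: 12345678", "123.456.78", "M-12345678", "1234567890"]

zip_hits = [s for s in zip_candidates if ZIP_SKETCH.search(s)]
kvk_hits = [s for s in kvk_candidates if KVK_SKETCH.search(s)]
naive_hits_hyphen = NAIVE_KVK.search("M-12345678") is not None

print("postcode matches:", zip_hits)
print("kvk matches:", kvk_hits)
print("plain \\b matches M-12345678:", naive_hits_hyphen)
```

With plain word boundaries the string M-12345678 is accepted, whereas the pattern with an explicit boundary list rejects both that string and the 10-digit number.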
Now let's crawl the domain for the first time
%%time
global url_analyse
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True)
It took us between 20 seconds and 2 minutes (depending on your internet speed) to crawl the whole site. The results can be viewed by just printing the url_analyse object to screen:
logger.info(url_analyse)
So we have found one postal code and one kvk number. Now, let's assume we would also like to have the tax number (btw in Dutch). We can run our search again, but much faster, because everything has been stored in the cache. We add the new search to our searches dictionary and run again:
%%time
# add a new search string to our dictionary
searches["btwnummer"] = BTW_REGEXP
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False
                               )
This time we could run our search in about 5 seconds instead of two minutes. Note that we have explicitly passed the URL scheme ("https", taken from the previous run) and set a flag so that the URLs are not validated. This was not needed the first time we ran the code, because the scheme is then determined internally. Since that takes a lot of time, we switch the validation off and simply impose the scheme.
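A bare domain such as www.be-one.nl is not a complete URL, which is why a scheme has to be either detected or imposed. The idea of imposing a scheme can be sketched with the standard library; the helper name impose_scheme is hypothetical and is not part of cbs_utils.

```python
from urllib.parse import urlparse


def impose_scheme(url: str, scheme: str = "https") -> str:
    """Return url with the given scheme prepended if it has none (hypothetical helper)."""
    if "://" in url:
        return url  # a scheme is already present, leave the url untouched
    return f"{scheme}://{url}"


bare = "www.be-one.nl"
full = impose_scheme(bare)
parsed = urlparse(full)

# Without a scheme, urlparse cannot identify the domain part of the url
print(repr(urlparse(bare).netloc))
print(parsed.scheme, parsed.netloc)
```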
The results can be seen by printing the object:
logger.info(url_analyse)
Indeed, a btw-number was added to the matches this time.
In case you want to access the search results: these are stored in the matches attribute, which is just a normal dictionary:
for key, value in url_analyse.matches.items():
    logger.info(f"The search key {key} has the following matches: {value}")
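To make the structure of such a dictionary concrete, the sketch below searches one piece of text with a dictionary of regular expressions and collects the hits per key. The patterns, the sample text and the function name find_matches are made up for illustration; the actual matches attribute may hold its values in a different form.

```python
import re

demo_searches = {
    "postcode": r"\b[1-9]\d{3}\s?[A-Z]{2}\b",  # illustrative pattern
    "kvknumber": r"\b\d{8}\b",                 # simplified, plain word boundaries
}
demo_text = "Visit us at Dorpsstraat 1, 1234 AB Ons Dorp. KvK 12345678."


def find_matches(text, search_strings):
    """Return a dict with, per search key, the sorted unique matches found in text."""
    matches = {}
    for key, pattern in search_strings.items():
        hits = re.findall(pattern, text)
        matches[key] = sorted(set(hits))
    return matches


demo_matches = find_matches(demo_text, demo_searches)
for key, value in demo_matches.items():
    print(f"The search key {key} has the following matches: {value}")
```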
There is one more trick to speed up your crawl sessions. In this example we searched the whole domain for a string, which still takes a lot of time when there are many internal hyper refs. In many cases, the string we are looking for is found in a standard location. Information on the company, for instance, is often found on a page with 'contact' or 'about-us' in the hyper ref.
We can make use of this information by giving a list of hyper ref names which we want to search first, before the rest of the hyper refs are crawled. We can also stop crawling as soon as we have found a match. Let's have a look at an example.
First, we make a list of common hyper ref names where company information may be stored. The strings in the list are treated as regular expressions, so they do not have to match exactly: if any part of a hyper ref contains a string from the list, that hyper ref matches and is searched first.
sort_order_hrefs = [
    "about",
    "over",
    "contact",
    "privacy",
    "algeme",
    "voorwaarden",
    "klanten",
    "customer",
]
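How a list like this could be used to decide which hyper refs are visited first is sketched below. The function sort_hrefs and the example hyper refs are hypothetical; the ordering rule (an earlier pattern in the list gives a higher priority) is an assumption, not necessarily what UrlSearchStrings does internally.

```python
import re

priority_patterns = ["about", "over", "contact", "voorwaarden"]
hrefs = [
    "/products/shoes.html",
    "/klantenservice/algemene-voorwaarden.html",
    "/blog/2019/news.html",
    "/contact.html",
]


def sort_hrefs(hrefs, patterns):
    """Put hrefs that match an earlier pattern first; unmatched hrefs go to the end."""
    def rank(href):
        for index, pattern in enumerate(patterns):
            if re.search(pattern, href):
                return index
        return len(patterns)
    # sorted() is stable, so hrefs with an equal rank keep their original order
    return sorted(hrefs, key=rank)


ordered = sort_hrefs(hrefs, priority_patterns)
for href in ordered:
    print(href)
```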
Now we can pass this list to the UrlSearchStrings class and crawl again. Note that we have also added 'btwnummer' to the stop_search_on_found_keys list. This argument gives a list of keys from our search_strings dictionary for which we want to stop searching as soon as we have found a match.
%%time
# add a new search string to our dictionary
searches["btwnummer"] = BTW_REGEXP
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False,
                               sort_order_hrefs=sort_order_hrefs,
                               stop_search_on_found_keys=['btwnummer']
                               )
As you can see, this time we started searching in a hyper ref which we included in our sort_order_hrefs list. As a result, we scraped the hyper ref klantenservice/algemene-voorwaarden.html first, which was almost the last page we crawled without the sort list. Since we have added stop_search_on_found_keys as well, we immediately stop crawling as soon as we have found a match for btwnummer. Combined with the fact that we were also obtaining the URL contents from the cache, this time our crawl only took 167 ms.
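The combined effect of a priority list and an early stop can be illustrated without any network access. The sketch below crawls a made-up, in-memory site; the SITE dictionary, the crawl function and the BTW pattern are all hypothetical and only mimic the behaviour described above, not the internals of UrlSearchStrings.

```python
import re

# A made-up site: page -> (text on the page, internal links on the page)
SITE = {
    "/": ("Welcome", ["/products.html", "/blog.html", "/voorwaarden.html"]),
    "/products.html": ("Shoes and boots", []),
    "/blog.html": ("News", []),
    "/voorwaarden.html": ("BTW NL001234567B01", []),
}
BTW_SKETCH = r"\bNL\d{9}B\d{2}\b"  # illustrative pattern, not BTW_REGEXP


def crawl(start, searches, sort_order=(), stop_keys=()):
    """Crawl SITE from start; return (matches per key, pages in order of visit)."""
    matches = {key: [] for key in searches}
    visited = []
    queue = [start]
    while queue:
        page = queue.pop(0)
        if page in visited:
            continue
        visited.append(page)
        text, links = SITE[page]
        for key, pattern in searches.items():
            matches[key].extend(re.findall(pattern, text))
        # Stop as soon as one of the stop keys has at least one match
        if any(matches[key] for key in stop_keys):
            break
        # Links matching sort_order go first; the stable sort keeps the rest in order
        links = sorted(
            links, key=lambda href: not any(re.search(p, href) for p in sort_order)
        )
        queue.extend(links)
    return matches, visited


demo_searches = {"btwnummer": BTW_SKETCH}
_, visited_plain = crawl("/", demo_searches)
found, visited_fast = crawl(
    "/", demo_searches, sort_order=["voorwaarden"], stop_keys=["btwnummer"]
)
print("pages visited without hints:", len(visited_plain))
print("pages visited with hints:   ", len(visited_fast))
```

In this toy site the number of visited pages drops from four to two, because the page holding the btw number is visited right after the start page and the crawl stops there.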