Web scraping with cbs_utils

In this notebook some small examples are given on how to use the web scraping utilities from cbs_utils. The following utilities are discussed:

  1. get_page_from_url : retrieve contents of url from internet or cache
  2. UrlSearchStrings : crawl a domain and search for strings

Using the get_page_from_url function

The get_page_from_url function obtains the contents of a url and stores the result in cache. The next time you run the function, the result is read from cache. The benefits of caching your data are:

  1. Significant speed up of processing time
  2. During development of a crawler you reduce the burden on a domain
  3. You can work off-line
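
To make the caching idea concrete, here is a minimal sketch of a disk cache written with the standard library only. It is not the cbs_utils implementation: the decorator name cache_to_disk_sketch, the hashed file name and the slow_fetch stand-in are all assumptions made for illustration.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_to_disk_sketch(func):
    """Hypothetical decorator: pickle the result of func per set of arguments."""

    @functools.wraps(func)
    def wrapper(*args, cache_directory="tmp", **kwargs):
        cache_dir = Path(cache_directory)
        cache_dir.mkdir(parents=True, exist_ok=True)
        # build a file name from the function name and a hash of the arguments
        raw_key = repr((args, sorted(kwargs.items()))).encode()
        key = hashlib.sha256(raw_key).hexdigest()[:16]
        cache_file = cache_dir / f"{func.__name__}_{key}.pkl"
        if cache_file.exists():
            # cache hit: read the stored result instead of calling func
            with open(cache_file, "rb") as stream:
                return pickle.load(stream)
        # cache miss: call func and store the result for the next run
        result = func(*args, **kwargs)
        with open(cache_file, "wb") as stream:
            pickle.dump(result, stream)
        return result

    return wrapper


call_count = 0


@cache_to_disk_sketch
def slow_fetch(url):
    """Stand-in for a real download; counts how often it is really executed."""
    global call_count
    call_count += 1
    return f"contents of {url}"


with tempfile.TemporaryDirectory() as tmp_dir:
    first = slow_fetch("https://www.example.com", cache_directory=tmp_dir)
    second = slow_fetch("https://www.example.com", cache_directory=tmp_dir)
```

The second call returns the pickled result, so slow_fetch itself is executed only once. This is also why a cached crawl does not burden the domain and works off-line.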

Here, a small example is given. First, import the required modules:

In [1]:
import logging
from pathlib import Path

from bs4 import BeautifulSoup

from cbs_utils.misc import (create_logger, merge_loggers)
from cbs_utils.regular_expressions import (KVK_REGEXP, ZIP_REGEXP, BTW_REGEXP)
from cbs_utils.web_scraping import (get_page_from_url, UrlSearchStrings)

BeautifulSoup is used to parse the contents of the web site. The create_logger and merge_loggers functions are used to quickly set up the logging system. The regular_expressions module provides standard regular expressions we can use to find strings such as the postal code (Dutch form), tax number, etc.

Next, set up the logging module using the create_logger function from cbs_utils.misc:

In [2]:
# set up logging
log_level = logging.DEBUG  # change to logging.INFO for less output
log_format = logging.Formatter('%(levelname)8s --- %(message)s')
logger = create_logger(console_log_level=log_level, formatter=log_format)
merge_loggers(logger, "cbs_utils.web_scraping", logger_level_to_merge=logging.INFO)
Out[2]:
<Logger cbs_utils.web_scraping (INFO)>
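
As a small illustration of what the format string does, the sketch below formats a single log record with the standard logging module only. The record contents are made up for the example; create_logger and merge_loggers are not used here.

```python
import logging

# same format string as in the cell above
formatter = logging.Formatter('%(levelname)8s --- %(message)s')

# build a log record by hand (illustrative values) and format it
record = logging.LogRecord(
    name="demo",
    level=logging.INFO,
    pathname="demo.py",
    lineno=1,
    msg="Cleaning cache directory %s",
    args=("tmp",),
    exc_info=None,
)
formatted = formatter.format(record)
print(repr(formatted))
```

The 8s in the format right-aligns the level name in a field of eight characters, which explains the leading spaces before INFO in the log output throughout this notebook.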

For this example a tmp directory is made in your working directory to store the cache. First we clean this directory in case it still exists from a previous run.

In [3]:
# create url name and clean previous cache file
cache_directory = Path("tmp")
clean_cache = True
if clean_cache:
    if cache_directory.exists():
        logger.info(f"Cleaning cache directory {cache_directory}")
        for item in cache_directory.iterdir():
            item.unlink()
        cache_directory.rmdir()
    else:
        logger.info(f"Cache directory {cache_directory} was already removed")
    INFO --- Cleaning cache directory tmp

Now we can demonstrate the get_page_from_url function.

In [4]:
%%time
# global needed to work around a jupyter-notebook bug where variables are only local to the cell when using %%time
global page, url
url = "https://www.example.com"
page = get_page_from_url(url, cache_directory=cache_directory)
CPU times: user 23.7 ms, sys: 15.6 ms, total: 39.4 ms
Wall time: 411 ms

As you can see, it took about 0.4 s to get all the information from the internet; depending on your internet speed this can take up to several seconds. Because a cache_to_disk decorator has been added to the get_page_from_url function, a cache file was made in the tmp directory:

In [5]:
for ii, item in enumerate(cache_directory.iterdir()):
    logger.info(f"Cache file {ii}: {item}")
    INFO --- Cache file 0: tmp/get_page_from_url_https_www_example_com_.pkl

The contents of the url were stored in page and look like this:

In [6]:
soup = BeautifulSoup(page.text, 'lxml')
logger.info(soup.body)
    INFO --- <body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>

We can run the same function again. Since we now have a cache file, it will be much faster:

In [7]:
%%time
global page2
page2 = get_page_from_url(url, cache_directory=cache_directory)
CPU times: user 2.01 ms, sys: 184 µs, total: 2.19 ms
Wall time: 1.42 ms

Indeed, the same function call now runs in about 1 ms. Now compare the results:

In [8]:
soup2 = BeautifulSoup(page2.text, 'lxml')
logger.info("Contents is equal: {}".format(soup.body == soup2.body))
    INFO --- Contents is equal: True

Using the UrlSearchStrings class

The UrlSearchStrings class can be used to recursively crawl a website and search for a list of regular expressions we want to obtain from the website. Again, the result is cached, so in case you want to run it again with different search strings it will run significantly faster.
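
Conceptually, crawling a domain means collecting the internal links of each page and visiting those that were not seen before. The sketch below shows only the link-collecting step, using the standard library. It is an illustration under stated assumptions, not how cbs_utils does it: LinkCollector and internal_links are hypothetical names and the HTML snippet is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(base_url, html):
    """Return the absolute links in html that stay on the domain of base_url."""
    collector = LinkCollector()
    collector.feed(html)
    domain = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # turn relative references into absolute urls
        absolute = urljoin(base_url, href)
        # keep only links on the same domain and skip duplicates
        if urlparse(absolute).netloc == domain and absolute not in links:
            links.append(absolute)
    return links


html = """
<a href="/over-ons/">Over ons</a>
<a href="https://www.example.com/contact.html">Contact</a>
<a href="http://www.iana.org/domains/example">More information...</a>
"""
links = internal_links("https://www.example.com/", html)
print(links)
```

A recursive crawler would call such a function on every page it retrieves and search the page text for the regular expressions along the way.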

Let's set up our first search session, trying to retrieve the postal code and kvk number from a web page. The regular expressions are obtained from the regular_expressions module of cbs_utils and are discussed below.

In [9]:
# the regular expressions are obtained from the cbs_utils.regular_expressions module
searches = dict(
    postcode=ZIP_REGEXP,
    kvknumber=KVK_REGEXP
)

url = "www.be-one.nl"

logger.info(f"Start crawling the url {url} and search for the following regular expressions:")
for key, reg_exp in searches.items():
    logger.info("{:10s}: {}".format(key, reg_exp))
logger.info("\n")
    INFO --- Start crawling the url www.be-one.nl and search for the following regular expressions:
    INFO --- postcode  : [1-9]\d{3}\s{0,1}[A-Z]{2}
    INFO --- kvknumber : ((?![-\w])|(\s|^))([\d][\.]{0,1}){7}\d((?![-\w])|(\s|^))
    INFO --- 

Some remarks on the regular expressions.

The postcode regular expression is straightforward: it matches a four-digit number (where the first digit cannot be 0) followed by two capital letters. There may be a single space between the digits and the letters. So both 1234AB and 4545 YZ match.
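
The behaviour of the postcode expression can be checked with a few test strings. In this sketch the pattern is copied from the log output above instead of importing ZIP_REGEXP, so it only reflects the expression as printed there; the test strings are made up.

```python
import re

# pattern copied from the log output; ZIP_REGEXP itself is not imported here
postcode_pattern = re.compile(r"[1-9]\d{3}\s{0,1}[A-Z]{2}")

candidates = ["1234AB", "4545 YZ", "0123AB", "1234ab", "1234  AB"]

# fullmatch requires the whole string to be a postal code
valid = [text for text in candidates if postcode_pattern.fullmatch(text)]
print(valid)
```

Only the first two candidates are accepted: the others start with a 0, use lower-case letters, or have two spaces between the digits and the letters.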

The kvk number is a bit more complicated. It is an 8-digit number which may contain dots, such as 123.456.78 or 12345678. Normally, we would put word boundaries (\b) around the 8 digits to prevent a 10-digit number from matching as well. However, the position between a hyphen (-) and a digit is a word boundary too, so a string such as M-12345678 would match. Strings of this type occur frequently in urls, but they are not kvk numbers. To keep hyphens out of the boundary, the regular expression spells out explicitly which characters may surround the number, resulting in a better match for kvk numbers.
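
The difference with plain word boundaries can be seen by comparing the two approaches on a few strings. The kvk pattern below is copied from the log output above (KVK_REGEXP is not imported), and naive_pattern is a hypothetical alternative that uses \b, added only for comparison.

```python
import re

# pattern copied from the log output; KVK_REGEXP itself is not imported here
kvk_pattern = re.compile(
    r"((?![-\w])|(\s|^))([\d][\.]{0,1}){7}\d((?![-\w])|(\s|^))"
)
# hypothetical alternative using plain word boundaries, for comparison only
naive_pattern = re.compile(r"\b\d{8}\b")

samples = ["12345678", "123.456.78", "M-12345678", "1234567890"]

kvk_hits = [text for text in samples if kvk_pattern.search(text)]
naive_hits = [text for text in samples if naive_pattern.search(text)]

print("kvk pattern  :", kvk_hits)
print("naive pattern:", naive_hits)
```

The word-boundary version accepts M-12345678 and misses the dotted form, whereas the kvk pattern accepts both ways of writing the number and rejects the hyphenated string. Both reject the 10-digit number.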

Crawling the domain

Now let's crawl the domain for the first time

In [10]:
%%time
global url_analyse
url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,                         
                               store_page_to_cache=True)
    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/no-man-s-land/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/oui/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/outlet/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/over-ons/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/raffaello-rossi-broek-candy_zand_19441.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/robert-friedman/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/rosemunde/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/rosner/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/sale/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/trvl-drss-by-zenggi/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/vhals-relax-a-wilmaveen_wit_19531.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/via-vai/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/inloggen.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/merken/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/margittes/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/japan-tky/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/bella-dahl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/brax/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/cambio/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/caroline-biss/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/collectie/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/creenstone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/fraas/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/giulia-e-tu/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ilse-jacobsen/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/marc-cain/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/cookieverklaring.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/disclaimer.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/gratis-shoppen.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/personal-shopping.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/privacy-statement.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/kyra-ko/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/lookbook/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/luisa-cerano/ with validate True
CPU times: user 5.94 s, sys: 175 ms, total: 6.12 s
Wall time: 16.3 s

Crawling the whole site took about 16 seconds in this run; depending on your internet speed it can take up to 2 minutes. The results can be viewed by just printing the url_analyse object to screen:

In [11]:
logger.info(url_analyse)
    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']

So we have found one postal code and one kvk number. Now, let's assume we would also like to have the tax number (btw in Dutch). We can run our search again, but much faster because everything has been stored in cache. We add the new search to our searches dictionary and run again:

In [12]:
%%time
# add a new search string to our dictionary
searches["btwnummer"] = BTW_REGEXP

url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False
                              )
    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/no-man-s-land/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/oui/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/outlet/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/over-ons/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/raffaello-rossi-broek-candy_zand_19441.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/robert-friedman/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/rosemunde/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/rosner/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/sale/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/trvl-drss-by-zenggi/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/vhals-relax-a-wilmaveen_wit_19531.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/via-vai/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/inloggen.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/merken/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/margittes/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/japan-tky/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/bella-dahl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/brax/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/cambio/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/caroline-biss/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/collectie/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/creenstone/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/fraas/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/giulia-e-tu/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/ilse-jacobsen/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/marc-cain/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/cookieverklaring.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/disclaimer.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/gratis-shoppen.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/personal-shopping.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/privacy-statement.html with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/kyra-ko/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/lookbook/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/luisa-cerano/ with validate True
CPU times: user 6.04 s, sys: 68.6 ms, total: 6.11 s
Wall time: 6.18 s

This time we could run our search in about 6 seconds instead of the 16 seconds measured for the first crawl. Note that we have explicitly passed the url scheme ("https") and set a flag that the urls should not be validated. This was not needed the first time we ran the code because the scheme is determined internally. But since this takes a lot of time, we switch it off and just impose the scheme.

The results can be seen by printing the object

In [13]:
logger.info(url_analyse)
    INFO --- Matches in https://www.be-one.nl/
postcode : ['9206 BE']
kvknumber : ['01066434']

Indeed, a btw-number was added to the matches this time.

In case you want to access the search results: these are stored in the matches attribute, which is just a normal dictionary.

In [14]:
for key, value in url_analyse.matches.items():
    logger.info(f"The search key {key} has the following matches: {value}")
    INFO --- The search key postcode has the following matches: ['9206 BE']
    INFO --- The search key kvknumber has the following matches: ['01066434']

Adding a search order in the domain

There is one more trick to speed up your crawl sessions. In this example we just searched the whole domain to look for a string, which still takes a lot of time in case there are many internal hyper-references. In many cases the string we are looking for is found in standard locations. Information on the company for instance is found in many cases on a page with 'contact' or 'about-us' in the hyper ref.

We can make use of this information by giving a list of hyper ref names which we want to search first, before the rest of the hyper-references are crawled. We can also stop crawling as soon as we have found a match. Let's have a look at an example.

First, we make a list of common hyper ref names where company information may be stored. The strings in the list are regular expressions, so they don't have to be exact: if a part of the hyper ref matches a string in the list, that hyper ref is searched first.

In [15]:
sort_order_hrefs=[
    "about",
    "over",
    "contact",
    "privacy",
    "algeme",
    "voorwaarden",
    "klanten",
    "customer",
]
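
To illustrate how such a list could be used to reorder links, here is a minimal sketch. It is an assumption about the general idea, not the cbs_utils implementation, so the actual crawl order of UrlSearchStrings may differ. The function sort_hrefs and the example urls are hypothetical.

```python
import re


def sort_hrefs(hrefs, sort_order_hrefs):
    """Put hrefs matching an earlier pattern first; keep the order otherwise."""

    def priority(href):
        # rank of the first pattern found in the href; no match sorts last
        for rank, pattern in enumerate(sort_order_hrefs):
            if re.search(pattern, href):
                return rank
        return len(sort_order_hrefs)

    # sorted() is stable, so hrefs with equal priority keep their original order
    return sorted(hrefs, key=priority)


sort_order = ["about", "over", "contact", "privacy",
              "algeme", "voorwaarden", "klanten", "customer"]

hrefs = [
    "https://www.example.com/products/",
    "https://www.example.com/klantenservice/algemene-voorwaarden.html",
    "https://www.example.com/over-ons/",
    "https://www.example.com/sale/",
]
ordered = sort_hrefs(hrefs, sort_order)
for href in ordered:
    print(href)
```

Because the entries are used with re.search, a partial string such as "algeme" is enough to give a page like algemene-voorwaarden.html priority.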

Now we can pass this list to the UrlSearchStrings class and crawl again. Note that we have also added 'btwnummer' to the stop_search_on_found_keys list. This argument gives a list of keys from our search_strings dictionary for which we want to stop searching as soon as we have found a match.

In [16]:
%%time
# add a new search string to our dictionary
searches["btwnummer"] = BTW_REGEXP

url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,
                               store_page_to_cache=True, schema=url_analyse.schema,
                               ssl_valid=url_analyse.ssl_valid,
                               validate_url=False, 
                               sort_order_hrefs=sort_order_hrefs,
                               stop_search_on_found_keys=['btwnummer']
                              )
    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True
    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True
    INFO --- Found a match for btwnummer at https://www.be-one.nl/klantenservice/algemene-voorwaarden.html
CPU times: user 185 ms, sys: 8.18 ms, total: 193 ms
Wall time: 197 ms

As you can see, this time we started searching in a hyper ref which we included in our sort_order_hrefs list. As a result we scraped the hyper ref klantenservice/algemene-voorwaarden.html first, which was one of the last pages we crawled without the sort list. Since we have added 'stop_search_on_found_keys' as well, we immediately stop crawling as soon as we have found a match for btwnummer. Combined with the fact that we were also obtaining the url contents from cache, this time our crawl only took about 0.2 s.
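
The early-stop behaviour can be sketched as a loop over already sorted pages that breaks once the requested keys have a match. This is an illustration only: search_pages is a hypothetical function, the pages are made-up strings, and the sketch assumes that all listed keys must have a match before stopping, which may differ from what cbs_utils does.

```python
import re


def search_pages(pages, search_strings, stop_search_on_found_keys=()):
    """Search pages in order; stop once every stop key has at least one match."""
    matches = {key: [] for key in search_strings}
    visited = []
    for url, text in pages:
        visited.append(url)
        for key, pattern in search_strings.items():
            matches[key].extend(m.group(0) for m in re.finditer(pattern, text))
        # assumption: stop only when all listed keys have been found
        if stop_search_on_found_keys and all(
            matches[key] for key in stop_search_on_found_keys
        ):
            break
    return matches, visited


# made-up page contents, listed in the order in which they would be crawled
pages = [
    ("https://www.example.com/", "Welcome"),
    ("https://www.example.com/contact", "Visit us at 1234 AB Amsterdam"),
    ("https://www.example.com/sale", "Sale at 5678 CD"),
]
# postcode pattern copied from the log output earlier in this notebook
searches_sketch = {"postcode": r"[1-9]\d{3}\s{0,1}[A-Z]{2}"}

matches, visited = search_pages(pages, searches_sketch, ["postcode"])
print(matches)
print(visited)
```

The third page is never visited, because the postcode was already found on the second one.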
