{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Web scraping with cbs_utils\n",
    "\n",
    "In this notebook some small examples are given on how to use the web scraping utilities from cbs_utils. The following utilities are discussed:\n",
    "\n",
    "1. [*get_page_from_url*](#get_page_from_url) : retrieve contents of ulr from internet or cache\n",
    "2. [*UrlSearchStrings*](#urlseachstrings) : crawl a domain and search for strings\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=get_page_from_url></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the get_page_from_url function\n",
    "\n",
    "The *get_page_from_url* function allows to obtain the contents of an url and store the results in cache. The next time you run the function again, the function is read from cache. The benefits of caching your data are:\n",
    "1. Significant speed up of processing time\n",
    "2. During development of a crawler you reduce the burden on a domain\n",
    "3. You can work off-line\n",
    "\n",
    "Here, an small example is given. First start with importing the required modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "from pathlib import Path\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from cbs_utils.misc import (create_logger, merge_loggers)\n",
    "from cbs_utils.regular_expressions import (KVK_REGEXP, ZIP_REGEXP, BTW_REGEXP)\n",
    "from cbs_utils.web_scraping import (get_page_from_url, UrlSearchStrings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*BeautifulSoup* is used to parse the contents of the web site. The *create_logger* and *merge_logger* functions are used to quickly setup the logging system. The *regular_expressions* are standard regular expression we can use to find strings such as de postal code (Dutch form), tax number, etc.\n",
    "\n",
    "Next, set up the logging module using the cbs_utils misc function *create_logger*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Logger cbs_utils.web_scraping (INFO)>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# set up logging\n",
    "log_level = logging.DEBUG  # change to DEBUG for more info\n",
    "log_format = logging.Formatter('%(levelname)8s --- %(message)s')\n",
    "logger = create_logger(console_log_level=log_level, formatter=log_format)\n",
    "merge_loggers(logger, \"cbs_utils.web_scraping\", logger_level_to_merge=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example a *tmp* directory is made in your working directory to store the cache. First make sure we clean this directory in case it still existed from the previous run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Cleaning cache directory tmp\n"
     ]
    }
   ],
   "source": [
    "# create url name and clean previous cache file\n",
    "cache_directory = Path(\"tmp\")\n",
    "clean_cache = True\n",
    "if clean_cache:\n",
    "    if cache_directory.exists():\n",
    "        logger.info(f\"Cleaning cache directory {cache_directory}\")\n",
    "        for item in cache_directory.iterdir():\n",
    "            item.unlink()\n",
    "        cache_directory.rmdir()\n",
    "    else:\n",
    "        logger.info(f\"Cache directory {cache_directory} was already removed\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can demonstrate the *get_page_from_url* function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 23.7 ms, sys: 15.6 ms, total: 39.4 ms\n",
      "Wall time: 411 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# global needed to work around a jupyter-notebook bug where variables are only local to the cell when using %%time\n",
    "global page, url\n",
    "url = \"https://www.example.com\"\n",
    "page = get_page_from_url(url, cache_directory=cache_directory)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, it took between 0.5 and 5.5 s (depending on your internet speed) to get all the information from the internet. Because we have added a *cache_to_disk* iterator to the *get_page_from_url* function, a cache file in the *tmp* directory was made:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Cache file 0: tmp/get_page_from_url_https_www_example_com_.pkl\n"
     ]
    }
   ],
   "source": [
    "for ii, item in enumerate(cache_directory.iterdir()):\n",
    "    logger.info(f\"Cache file {ii}: {item}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The contents of the url was stored in page and looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- <body>\n",
      "<div>\n",
      "<h1>Example Domain</h1>\n",
      "<p>This domain is established to be used for illustrative examples in documents. You may use this\n",
      "    domain in examples without prior coordination or asking for permission.</p>\n",
      "<p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>\n",
      "</div>\n",
      "</body>\n"
     ]
    }
   ],
   "source": [
    "soup = BeautifulSoup(page.text, 'lxml')\n",
    "logger.info(soup.body)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can run the same function again. Since we now have a cache file, it will be much faster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.01 ms, sys: 184 µs, total: 2.19 ms\n",
      "Wall time: 1.42 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "global page2\n",
    "page2 = get_page_from_url(url, cache_directory=cache_directory)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed the same function statement runs in with about 1 ms. Now compare the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Contents is equal: True\n"
     ]
    }
   ],
   "source": [
    "soup2 = BeautifulSoup(page2.text, 'lxml')\n",
    "logger.info(\"Contents is equal: {}\".format(soup.body == soup2.body))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=urlseachstrings></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using the *UrlSearchStrings* class\n",
    "\n",
    "The *UrlSearchString* class can be used to recursively crawl a website and search for a list of regular expressions we want to obtain from the website. Again, the result is cached, so in case you want to run it again with different search strings it will run significantly faster. \n",
    "\n",
    "Let's first set up our first search session, trying to retrieve the postal code and kvk number from a web page. The regular expression are obtained from the *regular_expressions* module of *cbs_utils* and are discussed below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Start crawling the url www.be-one.nl and search for the folliwing regular expressions:\n",
      "    INFO --- postcode  : [1-9]\\d{3}\\s{0,1}[A-Z]{2}\n",
      "    INFO --- kvknumber : ((?![-\\w])|(\\s|^))([\\d][\\.]{0,1}){7}\\d((?![-\\w])|(\\s|^))\n",
      "    INFO --- \n",
      "\n"
     ]
    }
   ],
   "source": [
    "# the regular expression are obtained from the cbs_utils.regular_expressions module\n",
    "searches = dict(\n",
    "    postcode=ZIP_REGEXP,\n",
    "    kvknumber=KVK_REGEXP\n",
    ")\n",
    "\n",
    "url = \"www.be-one.nl\"\n",
    "\n",
    "logger.info(f\"Start crawling the url {url} and search for the folliwing regular expressions:\")\n",
    "for key, reg_exp in searches.items():\n",
    "    logger.info(\"{:10s}: {}\".format(key, reg_exp))\n",
    "logger.info(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Some remarks on the regular expressions. \n",
    "The postcode regular expression is quite clear: it matches any four digit number (where the first digit can not be a 0), plus 2 alphanumerica characters (must be capitals). There may be a space between the digits and the characters. So the following matches 1234AB, 4545 YZ\n",
    "\n",
    "The kvk-number is a bit more complicated. The kvk-number is a 8 digit number which may have dots. Something like 123.456.78, or 12345678. Normally, we would use word boundaries (\\b) around the 8 digits to prevent a 10 digit number to match as well. However, a hyphen (-) is a word boundary too, giving a match to for instance M-12345678. It appears that this type of strings occur frequently in url's, but these are not kvk numbers. To avoid to include hyphens in the word boundary, we have explicitly given the list of characters which belong to the word boundary, resulting in a better match for kvk-numbers. \n",
    "\n",
    "#### crawling the domain\n",
    "\n",
    "Now let's crawl the domain for the first time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/no-man-s-land/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/oui/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/outlet/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/over-ons/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/raffaello-rossi-broek-candy_zand_19441.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/robert-friedman/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/rosemunde/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/rosner/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/sale/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/trvl-drss-by-zenggi/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/vhals-relax-a-wilmaveen_wit_19531.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/via-vai/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/inloggen.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/merken/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/margittes/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/japan-tky/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/bella-dahl/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/brax/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/cambio/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/caroline-biss/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/collectie/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/creenstone/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/fraas/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/giulia-e-tu/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/ilse-jacobsen/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/marc-cain/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/cookieverklaring.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/disclaimer.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/gratis-shoppen.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/personal-shopping.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/privacy-statement.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/kyra-ko/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/lookbook/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/luisa-cerano/ with validate True\n",
      "CPU times: user 5.94 s, sys: 175 ms, total: 6.12 s\n",
      "Wall time: 16.3 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "global url_analyse\n",
    "url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,                         \n",
    "                               store_page_to_cache=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It took us between 20 second to 2 minutes (depending on your internet speed) to crawl the whole site. The results can be viewed by just printing the *url_analyse* object to screen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Matches in https://www.be-one.nl/\n",
      "postcode : ['9206 BE']\n",
      "kvknumber : ['01066434']\n"
     ]
    }
   ],
   "source": [
    "logger.info(url_analyse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we have found one postal code and and kvk number. Now, let's assume we also would like to have the tax number (btw in Dutch). We can run our search again, but much faster because we have stored every thing in cache again. Now we are going to add the search to our *searches* dictionary and run again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/be-one/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/beaumont/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/verlanglijstje.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mila-sierra-suede-sneaker-plato_cognac_19541.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/milestone/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/ml-collections/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/moment-by-moment/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mos-mosh/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mouwloos-bloemen_groen_19532.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/new-arrivals/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/no-man-s-land/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/oui/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/outlet/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/over-ons/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/raffaello-rossi-broek-candy_zand_19441.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/robert-friedman/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/rosemunde/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/rosner/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/sale/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/trvl-drss-by-zenggi/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/vhals-relax-a-wilmaveen_wit_19531.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/via-vai/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/mijn-account/inloggen.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/merken/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/margittes/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/japan-tky/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/bella-dahl/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/brax/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/cambio/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/caroline-biss/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/collectie/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/creenstone/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/fraas/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/giulia-e-tu/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/ilse-jacobsen/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/marc-cain/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/cookieverklaring.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/disclaimer.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/gratis-shoppen.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/personal-shopping.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/privacy-statement.html with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/kyra-ko/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/lookbook/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/luisa-cerano/ with validate True\n",
      "CPU times: user 6.04 s, sys: 68.6 ms, total: 6.11 s\n",
      "Wall time: 6.18 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# add a new search string to our dictionary\n",
    "searches[\"btwnummer\"] = BTW_REGEXP\n",
    "\n",
    "url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,\n",
    "                               store_page_to_cache=True, schema=url_analyse.schema,\n",
    "                               ssl_valid=url_analyse.ssl_valid,\n",
    "                               validate_url=False\n",
    "                              )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time we could run our search in about 5 seconds instead of two minutes. Note that we have explicitely added the url scheme \"https\" and gave a flag that the urls should not be validated. This was not needed the first time we ran the code because the scheme is determined internally. But since this take a lot of time, we switch it off and just impose it \n",
    "\n",
    "The results can be seen by printing the object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Matches in https://www.be-one.nl/\n",
      "postcode : ['9206 BE']\n",
      "kvknumber : ['01066434']\n"
     ]
    }
   ],
   "source": [
    "logger.info(url_analyse)btwn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, a btw-number was added to the matches this time. \n",
    "\n",
    "In case you want to access the search result: this is strored in the *matches* attribute which is just a normal dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- The search key postcode has the following matches: ['9206 BE']\n",
      "    INFO --- The search key kvknumber has the following matches: ['01066434']\n"
     ]
    }
   ],
   "source": [
    "for key, value in url_analyse.matches.items():\n",
    "    logger.info(f\"The search key {key} has the following matches: {value}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adding a search order in the domain\n",
    "\n",
    "There is one more trick to speed up your crawl sessions. In this example we just searched the whole domain to look for a string, which still takes a lot of time in case there are many internal hyper-references. In many cases the string we are looking for is found in standard locations. Information on the company for instance is found in many cases on a page with 'contact' or 'about-us' in the hyper ref.\n",
    "\n",
    "We can make use of this information by giving a list of hyper ref names which we want to search first, before the rest of the hyper-references are crawled. Also we can stop with crawling as soon we have found a match. Let's have a look at an example. \n",
    "\n",
    "First, we make a list of common hyper ref names were company information may be stored. The string in the hyper refs are  regular expression so they don't have to be exact: if a part of the hyper ref contains the string in the list it will match and searched first.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "sort_order_hrefs=[\n",
    "    \"about\",\n",
    "    \"over\",\n",
    "    \"contact\",\n",
    "    \"privacy\",\n",
    "    \"algeme\",\n",
    "    \"voorwaarden\",\n",
    "    \"klanten\",\n",
    "    \"customer\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can pass this list to your UrlSearchStrings class and crawl again. Note that we have also added 'btwnumber' to the *stop_search_on_found_keys* list. This arguments give a list of keys from our *search_string* dictionary for which we want to stop searching as soon as we have found a match. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    INFO --- Get (cached) page: https://www.be-one.nl/ with validate True\n",
      "    INFO --- Get (cached) page: https://www.be-one.nl/klantenservice/algemene-voorwaarden.html with validate True\n",
      "    INFO --- Found a match for btwnummer at https://www.be-one.nl/klantenservice/algemene-voorwaarden.html\n",
      "CPU times: user 185 ms, sys: 8.18 ms, total: 193 ms\n",
      "Wall time: 197 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# add a new search string to our dictionary\n",
    "searches[\"btwnummer\"] = BTW_REGEXP\n",
    "\n",
    "url_analyse = UrlSearchStrings(url, search_strings=searches, cache_directory=cache_directory,\n",
    "                               store_page_to_cache=True, schema=url_analyse.schema,\n",
    "                               ssl_valid=url_analyse.ssl_valid,\n",
    "                               validate_url=False, \n",
    "                               sort_order_hrefs=sort_order_hrefs,\n",
    "                               stop_search_on_found_keys=['btwnummer']\n",
    "                              )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this time we started searching in a hyper ref which we included in our *sort_order_hrefs* list. As a result we scraped the hyper ref *klantenservice/algemene-voorwaarden.html* first, which was almost the last page we crawled without the sort list. Since we have added 'stop_search_on_found_keys' as well, we inmediately stop crawling as soon as we found a match for *btwnummer*. Combined with the fact we were also obtaining the url contents from cache, this time our crawl only too 167 ms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
