cbs_utils package

Submodules

cbs_utils.global_vars module

Some global variable definitions

cbs_utils.mail module

class cbs_utils.mail.CBS_SMTP_Message(sender: str, adressee: str, subject: str = '', body: str = '', mail_server: str = 'mail.cbsp.nl')[source]

Bases: cbs_utils.mail.EmailFormat

CBS SMTP based email message.

Notes

  • This message object makes it easy to send an email message to a recipient. Mails can be sent from any email account the user is authorized to use according to the central server.
send()[source]

Send the prepared email message.

class cbs_utils.mail.EmailFormat[source]

Bases: abc.ABC

add_to_recipients(to)[source]
send()[source]
cbs_utils.mail.asbase64(msg)[source]

cbs_utils.misc module

Some miscellaneous functions used throughout many cbs modules

class cbs_utils.misc.CacheInfo(file_name, directory='.', file_type=None, reset_cache=False)[source]

Bases: object

Class to store information about the cache

Parameters:
  • file_name (str) – Name of the cache file
  • directory (str, optional) – Cache directory. Default = “.”
  • file_type (str) – Type of the cache
make_file_name() → pathlib.Path[source]
set_read_from_cache_flag(reset_cache)[source]
class cbs_utils.misc.Chdir(new_path)[source]

Bases: object

Class which allows you to move to a directory, do something, and move back when done

Parameters:new_path (str) – Location where you want to do something

Notes

Used on the Gompute cluster in the batch processing script to submit a job inside a directory and then move back to the higher directory in order to move to the next case

Examples

Go to a known directory (C:/)

>>> os.chdir("C:/")
>>> os.getcwd()
'C:\\'

With the Chdir command we move to the C:/Windows directory, where we can do something.

>>> with Chdir("C:/Windows") as d:
...    # in this block we can do something in the directory Windows.
...    os.getcwd()
'C:\\Windows'

We have left the block under Chdir, so we are back at the directory where we started

>>> os.getcwd()
'C:\\'
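The move-and-return behaviour can be sketched with a small context manager. This is a minimal illustration only: the name ChdirSketch is hypothetical and the code is an assumption about how such a class could work, not the implementation of Chdir itself.

```python
import os
import tempfile


class ChdirSketch:
    """Hypothetical stand-in for Chdir: enter a directory, restore the old one on exit."""

    def __init__(self, new_path):
        self.new_path = new_path
        self.saved_path = None

    def __enter__(self):
        # Remember where we came from before moving
        self.saved_path = os.getcwd()
        os.chdir(self.new_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always move back, also when the block raised an exception
        os.chdir(self.saved_path)
        return False


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with ChdirSketch(tmp_dir):
        inside_dir = os.getcwd()
end_dir = os.getcwd()
```

Because the return to the original directory happens in __exit__, it also takes place when the code inside the block fails.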
class cbs_utils.misc.ConditionalDecorator(dec, condition)[source]

Bases: object

Add a decorator to a function only if the condition is True

Parameters:
  • dec (decorator) – The decorator which you want to add when condition is true
  • condition (bool) – Only add the decorator if this condition is True
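A minimal sketch of the idea, assuming the class is used as a decorator factory. The names ConditionalDecoratorSketch and shout are hypothetical and only serve the illustration.

```python
import functools


class ConditionalDecoratorSketch:
    """Hypothetical stand-in: apply `dec` only if `condition` is True."""

    def __init__(self, dec, condition):
        self.dec = dec
        self.condition = condition

    def __call__(self, func):
        if not self.condition:
            # Condition is False: hand back the function untouched
            return func
        return self.dec(func)


def shout(func):
    """Example decorator which upper-cases the returned string."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return wrapper


@ConditionalDecoratorSketch(shout, True)
def greet_loud():
    return "hello"


@ConditionalDecoratorSketch(shout, False)
def greet_plain():
    return "hello"
```

Here greet_loud is wrapped by shout, whereas greet_plain is left as it was defined.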
class cbs_utils.misc.PackageInfo(module_object)[source]

Bases: object

A class to analyse the version properties of this package

Parameters:module_object (Module) – reference to the module for which we want to store the properties
get_bundle_version()[source]

Get the version of the current package from the _version_frozen module which was written by the build_executable script.

get_source_version()[source]

Get the version of the current package via the versioneer approach

class cbs_utils.misc.Timer(message='Elapsed time', name='routine', verbose=True, units='ms', n_digits=0, field_width=20)[source]

Bases: object

Class to measure the time it takes to execute a section of code

Parameters:
  • message (str) – A string to use in the output line
  • name (str, optional) – The name of the routine timed.
  • verbose (bool, optional) – if True, produce output
  • units (str, optional) – time units to use. Default ‘ms’
  • n_digits (int, optional) – number of decimals to add to the timer units

Example

Use a with / as construction to enclose the section of code which needs to be timed.

Also, make sure to merge the loggers to activate the logging function of the Timer class.

>>> import logging
>>> import time
>>> from numpy import allclose
>>> from cbs_utils.misc import (Timer, merge_loggers)
>>> number_of_seconds = 1.0
>>> logger = logging.getLogger(__name__)
>>> merge_loggers(logger, "cbs_utils")
>>> with Timer(units="s", n_digits=0) as timer:
...    time.sleep(number_of_seconds)
Elapsed time         routine              :          1 s
>>> allclose(number_of_seconds, timer.secs, rtol=0.1)
True
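The timing mechanism can be sketched with a context manager around time.perf_counter. The class TimerSketch is hypothetical and prints directly instead of using the logging system; it is an assumption about the general approach, not the Timer implementation.

```python
import time


class TimerSketch:
    """Hypothetical stand-in for Timer: measure the wall time of a with block."""

    def __init__(self, message="Elapsed time", name="routine", verbose=True):
        self.message = message
        self.name = name
        self.verbose = verbose
        self.secs = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Store the elapsed time so it can be inspected after the block
        self.secs = time.perf_counter() - self.start
        if self.verbose:
            print("{:<20} {:<20} : {:10.3f} s".format(self.message, self.name, self.secs))
        return False


with TimerSketch(name="sleep") as timer:
    time.sleep(0.05)
```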
cbs_utils.misc.clean_up_name(name)[source]

Remove all troublesome characters such as [ or ]

Parameters:name (str) – String to be cleaned
Returns:Clean name
Return type:str
cbs_utils.misc.clear_argument_list(argv)[source]

Small utility to remove the ‘\r’ character from the last argument of the argv list appearing in cygwin

Parameters:argv (list) – The argument list stored in sys.argv
Returns:Cleared argument list
Return type:list
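A minimal sketch of such a clean-up, assuming only the last argument can carry the carriage return. The function name clear_argument_list_sketch is hypothetical.

```python
def clear_argument_list_sketch(argv):
    """Return a copy of argv with '\\r' removed from the last argument."""
    cleaned = list(argv)
    if cleaned:
        cleaned[-1] = cleaned[-1].replace("\r", "")
    return cleaned


example_argv = ["script.py", "--name", "value\r"]
cleaned_argv = clear_argument_list_sketch(example_argv)
```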
cbs_utils.misc.clear_path(path_name)[source]
Routine to clear spurious dots and slashes from a path name.
For example, bla/././oke becomes bla/oke.
Parameters:path_name (str) – Path name to clear
Returns:clear_path as a string
Return type:str

Examples

>>> long_path = os.path.join(".", "..", "ok", "yoo", ".", ".", "") + "/"
>>> print(long_path)
.\..\ok\yoo\.\.\/
>>> print(clear_path(long_path))
..\ok\yoo
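The standard library offers a comparable normalisation through normpath. The snippet below is only an illustration of the same clean-up with posixpath and ntpath; whether clear_path relies on these functions internally is not stated here.

```python
import ntpath
import posixpath

# POSIX style path with spurious "." components
posix_clean = posixpath.normpath("bla/././oke")

# Windows style path as in the example above, with a trailing mixed separator
windows_clean = ntpath.normpath(".\\..\\ok\\yoo\\.\\.\\/")
```

Using ntpath and posixpath explicitly keeps the result independent of the platform on which the snippet is run.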
cbs_utils.misc.compare_objects(obj1, obj2, counter=0, max_recursion_depth=4)[source]

Check whether two objects are equal

Parameters:
  • obj1 (class) – first object
  • obj2 (class) – second object
  • counter (int) – Current recursion depth. Keeps track of how many times we have recursively called this function
  • max_recursion_depth (int) – Maximum depth to which we are comparing the objects.

Notes

  • This function compares all the attributes of two objects to see if their values are the same
  • An attribute field may be another object which we also want to compare with the same attribute of the other object. This is done by recursively calling this function again.
  • Due to the recursive call mechanism we may end up in an infinite loop. To prevent this, a maximum recursion depth can be given.
  • The test function test_sequence_tool of the sequence_tool_utils module uses this function to compare two SequenceToolSummary objects
Raises:AssertionError: – In case one of the object fields is not equal
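The recursive comparison with a depth limit described in the notes can be sketched as follows. The function compare_objects_sketch and the class Node are hypothetical; the sketch assumes attributes are read from __dict__ and is not the actual implementation.

```python
def compare_objects_sketch(obj1, obj2, counter=0, max_recursion_depth=4):
    """Raise AssertionError if two objects differ, descending at most max_recursion_depth levels."""
    assert type(obj1) is type(obj2), "Types differ"
    if not hasattr(obj1, "__dict__"):
        # Plain value (int, str, None, ...): compare directly
        assert obj1 == obj2, "Values differ: {} != {}".format(obj1, obj2)
        return
    if counter >= max_recursion_depth:
        # Stop descending to avoid an infinite loop on self-referencing objects
        return
    attrs1, attrs2 = vars(obj1), vars(obj2)
    assert attrs1.keys() == attrs2.keys(), "Attribute names differ"
    for name, value1 in attrs1.items():
        compare_objects_sketch(value1, attrs2[name], counter + 1, max_recursion_depth)


class Node:
    def __init__(self, value, child=None):
        self.value = value
        self.child = child


# Equal objects pass silently
compare_objects_sketch(Node(1, Node(2)), Node(1, Node(2)))

# A difference in a nested attribute raises an AssertionError
try:
    compare_objects_sketch(Node(1, Node(2)), Node(1, Node(3)))
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```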
cbs_utils.misc.create_logger(name='root', log_file=None, console_log_level=20, console_log_format_long=False, console_log_format_clean=False, file_log_level=20, file_log_format_long=True, redirect_stderr=True, formatter=None, formatter_file=None) → logging.Logger[source]

Create a console logger

Parameters:
  • name (str, optional) – Name of the logger. Default = “root”
  • log_file (str, optional) – The name of the log file in case we want to write it to file. If it is not specified, no file is created
  • console_log_level (int, optional) – The level of the console output. Defaults to logging.INFO
  • console_log_format_long (bool) – Use a long informative format for the logging output to the console
  • console_log_format_clean (bool) – Use a very clean format for the logging output. If given together with console_log_format_long an AssertionError is raised
  • file_log_level (int, optional) – In case the log file is used, specify the log level. Can be different from the console log level. Defaults to logging.INFO
  • file_log_format_long (bool, optional) – Use a longer format for the file output. Default to True
  • redirect_stderr (bool, optional) – If True the stderr output is written to a file with .err extension instead of .out. Default = True
  • formatter (Formatter) – A formatter can also be explicitly passed
  • formatter_file (Formatter) – A formatter can also be explicitly passed
Returns:

The handle to the logger which we can use to create output to the screen using the logging module

Return type:

logging.Logger

Examples

Create a logger at the default verbosity level (logging.INFO), so no debug information is generated

>>> logger = create_logger()
>>> logger.debug("This is a debug message")

The info and warning messages are both printed

>>> logger.info("This is a information message")
  INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message

Create a logger at the debug level

>>> logger = create_logger(console_log_level=logging.DEBUG)
>>> logger.debug("This is a debug message")
 DEBUG : This is a debug message
>>> logger.info("This is a information message")
  INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message

Create a logger at the warning level. All output is suppressed, except for the warnings

>>> logger = create_logger(console_log_level=logging.WARNING)
>>> logger.debug("This is a debug message")
>>> logger.info("This is a information message")
>>> logger.warning("This is a warning message")
WARNING : This is a warning message

It is also possible to redirect the output to a file. The file name is given without an extension, as two files are created: one with the extension .out and one with the extension .err, for the normal user-generated output and the system error output respectively.

>>> data_dir = os.path.join(os.path.split(__file__)[0], "..", "..", "data")
>>> file_name = os.path.join(data_dir, "log_file")
>>> logger = create_logger(log_file=file_name,  console_log_level=logging.INFO,
... file_log_level=logging.DEBUG, file_log_format_long=False)
>>> logger.debug("This is a debug message")
>>> logger.info("This is a information message")
  INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message
>>> print("system normal message")
system normal message
>>> print("system error message", file=sys.stderr)

At this point, two files have been generated, log_file.out and log_file.err. The first contains the normal logging output whereas the second contains error messages generated by other packages which do not use the logging module. Note that the normal print statement shows up in the console but not in the file, whereas the second print statement to the stderr output does not show on the screen but is written to log_file.err.

To show the contents of the generated files we do

>>> with open(file_name+".out", "r") as fp:
...   for line in fp.readlines():
...       print(line.strip())
DEBUG : This is a debug message
INFO : This is a information message
WARNING : This is a warning message
>>> sys.stderr.flush()  # forces to flush the stderr output buffer to file
>>> with open(file_name + ".err", "r") as fp:
...   for line in fp.readlines():
...       print(line.strip())
system error message

References

https://docs.python.org/3/library/logging.html#levels

cbs_utils.misc.dataframe_clip_strings(df, max_width, include=None, exclude=None)[source]

Clip all strings in a dataframe

Parameters:
  • df (DataFrame) – Pandas data frame
  • max_width (int) – Clip strings to this width
  • include (list, optional) – Give a list of column names to clip. Exclude the rest
  • exclude (list, optional) – Give a list of column names not to clip. Include the rest
Returns:

Pandas data frame with clipped string columns

Return type:

DataFrame

cbs_utils.misc.delete_module(modname, paranoid=None)[source]

Delete a previously loaded module from memory

Parameters:
  • modname (str) – The name of the module to remove
  • paranoid (list or None) – (Default value = None)
cbs_utils.misc.get_branch(default_branch=None)[source]

Get the current git version of this questionary

Parameters:default_branch (str) – The default name given to the branch if none can be found
Returns:current branch version
Return type:str
cbs_utils.misc.get_clean_version(version) → str[source]

Turn the full version string into a clean one without the build part

Parameters:version (str) – The version string as returned by versioneer.
Returns:The clean version string
Return type:str

Notes

The version string matches the following regular expression

“([.|\d]+)([+]*)(.*)”

This function returns the clean version string given by the part “([.|\d]+)”

Examples

>>> get_clean_version("1.3")
'1.3'
>>> get_clean_version("2.5+dev.g43429")
'2.5'
>>> get_clean_version("4.3.1+dev.g43429-dirty")
'4.3.1'
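The extraction can be sketched with the re module. The sketch assumes the character class in the pattern is [.|\d], i.e. dots and digits, and the function name get_clean_version_sketch is hypothetical.

```python
import re

# Assumed pattern: leading dots and digits, optional plus signs, then the rest
VERSION_PATTERN = re.compile(r"([.|\d]+)([+]*)(.*)")


def get_clean_version_sketch(version):
    """Return the leading part of the version string consisting of digits and dots."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        # No leading digits or dots: return the input unchanged
        return version
    return match.group(1)


clean_versions = [
    get_clean_version_sketch(full_version)
    for full_version in ("1.3", "2.5+dev.g43429", "4.3.1+dev.g43429-dirty")
]
```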
cbs_utils.misc.get_dir_size(directory_name)[source]

Return the size of a directory in bytes

Parameters:directory_name (str) – Name of the directory
Returns:Size of the directory in bytes
Return type:int

Notes

  • Just a one-liner using pathlib
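Such a one-liner could look like the sketch below. It assumes that the size of all files in all subdirectories is summed; the function name get_dir_size_sketch is hypothetical.

```python
import tempfile
from pathlib import Path


def get_dir_size_sketch(directory_name):
    """Sum the sizes in bytes of all regular files below directory_name."""
    return sum(f.stat().st_size for f in Path(directory_name).rglob("*") if f.is_file())


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a file of 100 bytes and, one level deeper, a file of 50 bytes
    (Path(tmp_dir) / "a.bin").write_bytes(b"x" * 100)
    sub_dir = Path(tmp_dir) / "sub"
    sub_dir.mkdir()
    (sub_dir / "b.bin").write_bytes(b"y" * 50)
    total_size = get_dir_size_sketch(tmp_dir)
```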
cbs_utils.misc.get_logger(name) → logging.Logger[source]

Get the logger of the current level and set the level based on the main routine. Then return it

Parameters:name (str) – the name of the logger to set.
Returns:A handle of the current logger
Return type:logging.Logger

Notes

This routine is used at the top of each function to get the handle to the current logger and automatically set the verbosity level of the logger based on the main function

Examples

Assume you define a function which needs to generate logging information based on the logger created in the main program. In that case you can do

>>> def small_function():
...    logger = get_logger(__name__)
...    logger.info("Inside 'small_function' This is information to the user")
...    logger.debug("Inside 'small_function' This is some debugging stuff")
...    logger.warning("Inside 'small_function' This is a warning")
...    logger.critical("Inside 'small_function' The world is collapsing!")

The logger can be created in the main program using the create_logger routine

>>> def main(logging_level):
...     main_logger = create_logger(console_log_level=logging_level)
...     main_logger.info("Some information in the main")
...     main_logger.debug("Now we are calling the function")
...     small_function()
...     main_logger.debug("We are back in the main function")

Let’s call the main function in DEBUGGING mode

>>> main(logging.DEBUG)
  INFO : Some information in the main
 DEBUG : Now we are calling the function
  INFO : Inside 'small_function' This is information to the user
 DEBUG : Inside 'small_function' This is some debugging stuff
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!
 DEBUG : We are back in the main function

You can see that the logging level inside the small_function is obtained from the main level. Do the same but now in the normal information mode

>>> main(logging.INFO)
  INFO : Some information in the main
  INFO : Inside 'small_function' This is information to the user
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!

We can call in the silent mode, suppressing all debugging and normal info, but not Warnings

>>> main(logging.WARNING)
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!

Finally, to suppress everything except for critical warnings

>>> main(logging.CRITICAL)
CRITICAL : Inside 'small_function' The world is collapsing!
cbs_utils.misc.get_path_depth(path_name)[source]

Get the depth of a path or file name

Parameters:path_name (str) – Path name to get the depth from
Returns:depth of the path
Return type:int

Examples

>>> get_path_depth(r"C:\Anaconda")
1
>>> get_path_depth(r"C:\Anaconda\share")
2
>>> get_path_depth(r"C:\Anaconda\share\pywafo")
3
>>> get_path_depth(r".\imaginary\path\subdir\share")
4
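The depths shown above can be reproduced with pathlib by counting the path components without the drive anchor. This is a sketch under that assumption; get_path_depth_sketch is a hypothetical name, and PureWindowsPath is used so that the backslash separators are understood on any platform.

```python
from pathlib import PureWindowsPath


def get_path_depth_sketch(path_name):
    """Count the components of a Windows style path, not counting the drive anchor."""
    path = PureWindowsPath(path_name)
    # The anchor (e.g. 'C:\\') is the first entry of parts when present
    return len(path.parts) - (1 if path.anchor else 0)


depths = [
    get_path_depth_sketch(r"C:\Anaconda"),
    get_path_depth_sketch(r"C:\Anaconda\share"),
    get_path_depth_sketch(r"C:\Anaconda\share\pywafo"),
    get_path_depth_sketch(r".\imaginary\path\subdir\share"),
]
```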
cbs_utils.misc.get_python_version_number(version_info) → str[source]

Turn the version info as obtained from sys.version_info into a dotted version number string

Parameters:version_info – The version info as obtained from sys.version_info
Returns:a string with the current python version as a clear digit, e.g. 3.5.3
Return type:str

Examples

>>> version_string = get_python_version_number(sys.version_info)
cbs_utils.misc.get_regex_pattern(search_pattern)[source]

Routine to turn a string into a regular expression which can be used to match a string

Parameters:search_pattern (str) – A regular expression in the form of a string
Returns:A regular expression as returned by the re.compile function or None in case an invalid regular expression was given
Return type:None or compiled regular expression

Notes

An empty string or an invalid search_pattern will yield a None return
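The described behaviour can be sketched with re.compile and a guard for invalid input. The name get_regex_pattern_sketch is hypothetical and the code is an assumption about the approach, not the actual implementation.

```python
import re


def get_regex_pattern_sketch(search_pattern):
    """Compile search_pattern; return None for an empty or invalid pattern."""
    if not search_pattern:
        return None
    try:
        return re.compile(search_pattern)
    except re.error:
        # The string is not a valid regular expression
        return None


valid_pattern = get_regex_pattern_sketch("^wafo")
empty_pattern = get_regex_pattern_sketch("")
invalid_pattern = get_regex_pattern_sketch("(unclosed")
```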

cbs_utils.misc.get_time_stamp_from_string(string_with_date_time, yearfirst=True, dayfirst=False, timezone=None)[source]

Try to get a date/time stamp from a string

Parameters:
  • string_with_date_time (str) – The string to analyse
  • yearfirst (bool, optional) – if true put the year first. See dateutil.parser. Default = True
  • dayfirst (bool, optional) – if true put the day first. See dateutil.parser. Default = False
  • timezone (str or None, optional) – if given try to add this time zone
Returns:

Pandas date/time stamp

Return type:

DateTime

Examples

The date time in the file name ‘AMSBALDER_160929T000000’ is 29 Sep 2016 and does not have a time zone specification. The returned time stamp also does not have a time zone

>>> file_name="AMSBALDER_160929T000000"
>>> time_stamp =get_time_stamp_from_string(string_with_date_time=file_name)
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000 has time stamp 2016-09-29 00:00:00

We can also force a time zone to be added. The Etc/GMT-2 time zone is the UTC + 2 time zone, which corresponds to Central European Summer Time (CEST), i.e. the Europe/Amsterdam summer time.

>>> time_stamp =get_time_stamp_from_string(string_with_date_time=file_name,
...                                        timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000 has time stamp 2016-09-29 00:00:00+02:00

This time we assume the file name already contains a time zone, 2 hours + UTC. Since we already have a time zone, the timezone option can only convert the date time to the specified time zone.

>>> file_name="AMSBALDER_160929T000000+02"
>>> time_stamp =get_time_stamp_from_string(string_with_date_time=file_name,
...                                        timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000+02 has time stamp 2016-09-29 00:00:00+02:00

In case the time zone given by the timezone option differs from the time zone in the file name, the time zone is converted

>>> file_name="AMSBALDER_160929T000000+00"
>>> time_stamp =get_time_stamp_from_string(string_with_date_time=file_name,
...                                        timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000+00 has time stamp 2016-09-29 02:00:00+02:00
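The difference between adding a time zone to a time stamp without one and converting a time stamp which already has one can be sketched with the standard datetime module. The helper apply_time_zone_sketch is hypothetical, and a fixed UTC + 2 offset is used in place of the Etc/GMT-2 name.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset of two hours ahead of UTC, standing in for Etc/GMT-2
UTC_PLUS_2 = timezone(timedelta(hours=2))


def apply_time_zone_sketch(time_stamp, tz):
    """Attach tz to a naive time stamp, or convert an aware time stamp to tz."""
    if time_stamp.tzinfo is None:
        # No time zone yet: the clock time is kept and the zone is added
        return time_stamp.replace(tzinfo=tz)
    # Already has a time zone: the clock time is shifted to the new zone
    return time_stamp.astimezone(tz)


naive = datetime(2016, 9, 29, 0, 0, 0)
aware_utc = datetime(2016, 9, 29, 0, 0, 0, tzinfo=timezone.utc)

attached = apply_time_zone_sketch(naive, UTC_PLUS_2)
converted = apply_time_zone_sketch(aware_utc, UTC_PLUS_2)
```

The naive time stamp keeps its clock time of 00:00, whereas the UTC time stamp is shifted to 02:00 in the new time zone.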
cbs_utils.misc.get_value_magnitude(value, convert_to_base_units=True)[source]

Get the magnitude of value with Pint dimension in terms of its base units or just return a float if value does not have a dimension

Parameters:
  • value (Quantity or float or None) – A value with a Pint dimension or a normal float. In both cases, the value without dimension is returned
  • convert_to_base_units (bool, optional) – Before turning the value into a magnitude first turn the quantity into its SI base units. Default = True
Returns:

Magnitude of the value in case a Pint Quantity was added to the input or just the value itself. If convert_to_base_units was set to True the value is first converted to its SI base units

Return type:

float or None

Examples

Assume we have a value with a pint dimension

>>> velocity = Q_("2.5 m/s")
>>> print("Current velocity with dimension is: {}".format(velocity))
Current velocity with dimension is: 2.5 meter / second

We can now get the magnitude of velocity using this function as

>>> velocity_mag = get_value_magnitude(velocity)
>>> print("Velocity without dimension is: {}".format(velocity_mag))
Velocity without dimension is: 2.5

In case the input argument of the get_value_magnitude is a float and does not have a dimension, the value itself is returned

>>> velocity_mag2 = get_value_magnitude(velocity_mag)
>>> print("Velocity without dimension is: {}".format(velocity_mag2))
Velocity without dimension is: 2.5

In case we have a dimension in non-SI units, the value is by default first converted to its SI base units.

>>> velocity_knots = Q_("1 knot")
>>> velocity_mag = get_value_magnitude(velocity_knots)
>>> print("Velocity {} is converted to its magnitude in m/s: {:.2f}"
...       "".format(velocity_knots, velocity_mag))
Velocity 1 knot is converted to its magnitude in m/s: 0.51

In case that the convert_to_base_units flag is False we just get the magnitude in the same units as the input argument

>>> velocity_knots = Q_("2.5 knot")
>>> velocity_mag = get_value_magnitude(velocity_knots, convert_to_base_units=False)
>>> print("Velocity {} is converted to its magnitude in knots: {:.2f}"
... "".format(velocity_knots, velocity_mag))
Velocity 2.5 knot is converted to its magnitude in knots: 2.50

Notes

  • This function is used inside other functions in which it is not known beforehand if an input argument is passed with or without a Pint dimension and we are only interested in the magnitude of the value. Use this function to get the magnitude
cbs_utils.misc.get_version(default_version=None)[source]

Get the current git version of this questionary

Returns:current git version
Return type:str
cbs_utils.misc.is_exe(fpath)[source]

Test if a file is an executable

Parameters:fpath (str) – Path of the file to test
Returns:In case fpath is a file that can be executed return True, else False
Return type:bool

Notes

This function can only be used on Linux file systems as the which command is used to identify the location of the program.
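A portable variant of the same test can be sketched with os.access. Note that this swaps in a different technique than the which command mentioned above; the name is_exe_sketch is hypothetical.

```python
import os
import sys


def is_exe_sketch(fpath):
    """Return True if fpath is an existing file with execute permission."""
    return os.path.isfile(fpath) and os.access(fpath, os.X_OK)


# The running Python interpreter is an executable file
interpreter_is_exe = is_exe_sketch(sys.executable)

# A file which does not exist is never executable
missing_file = os.path.join(os.path.dirname(sys.executable), "no_such_program_xyz")
missing_is_exe = is_exe_sketch(missing_file)
```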

cbs_utils.misc.is_postcode(postcode)[source]

Check whether a string is a postcode

Parameters:postcode (str) – The string to check
Returns:True if it is a postcode
Return type:bool
cbs_utils.misc.make_directory(directory)[source]

Create a directory in case it does not yet exist.

Parameters:directory (Path or str) – Name of the directory to create

Notes

This function is used to create a directory without first checking whether it already exists. If the directory already exists, we can silently continue.

Example

If you want to create a directory ‘outdir’, just do:

make_directory("outdir")

The directory is created if it doesn’t exist, or, we just continue silently if it already exists

Raises:OSError – The OSError is only raised if it is not an EEXIST error. This implies that the creation of the directory failed due to another reason than the directory already being present. It could be that the file system is full or that we may not have write permission
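The EEXIST handling described above can be sketched as follows. The function make_directory_sketch is a hypothetical illustration of the pattern, not the actual implementation.

```python
import errno
import os
import tempfile


def make_directory_sketch(directory):
    """Create directory; continue silently if it is already there."""
    try:
        os.makedirs(str(directory))
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            # Another failure, such as missing write permission: pass it on
            raise


with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outdir")
    make_directory_sketch(out_dir)
    make_directory_sketch(out_dir)  # the second call continues silently
    out_dir_exists = os.path.isdir(out_dir)
```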
cbs_utils.misc.merge_loggers(main_logger, logger_name_to_merge, logger_level_to_merge=20)[source]

Add the logger of an external module to the local logger

Parameters:
  • main_logger (Logger) – reference of the main logger
  • logger_name_to_merge (str) – Name of the logger we want to merge
  • logger_level_to_merge (int) – Level of the logger to merge
Returns:

merged logger

Return type:

Logger

Examples

In case you have created a logger in your script with the create_logger function

>>> logger = create_logger()

And you have also created a module file your_module.py with its own logger

>>> module_logger = logging.getLogger(__name__)

In this case you would use the __name__ variable in ‘your_module’, so this logger is called ‘your_module’

Now in case you want to add the logger of ‘your_module’ to the local logger of your script, do

>>> merge_loggers(logger, 'your_module')

Now all the logger statements in ‘your_module’ are also added to the logger output

cbs_utils.misc.move_script_path_to_back_of_search_path(script_file, append_at_the_end=True) → list[source]

Move the name of a script to the front or the back of the search path

Parameters:
  • script_file (str) – Name of the script to move
  • append_at_the_end (bool, optional, default=True) – Append the name of the script to the end. In case this flag is false, the script file is prepended to the path
Returns:

The new system path stored in a list

Return type:

list

Notes

This function is sometimes required if the __version string conflicts with another __version string

Examples

sys.path = move_script_path_to_back_of_search_path(__file__)

cbs_utils.misc.print_banner(title, top_symbol='-', bottom_symbol=None, side_symbol=None, width=80, to_stdout=False, no_top_and_bottom=False)[source]

Create a banner for printing a bigger title above each section in the log output

Parameters:
  • title – The title to print
  • top_symbol (str) – the symbol used for the top line. Default value = “-”
  • bottom_symbol (str) – the symbol used for the bottom line. Assume same as top if None is given (Default value = None)
  • side_symbol (str) – The side symbol. Assume same as top if None is given, except if top is -, then take | (Default value = None)
  • width (int) – the width of the banner (Default value = 80)
  • no_top_and_bottom (bool) – make a simple print without the top and bottom line (Default value = False)
  • to_stdout (bool, optional) – Print the banner to the standard output of the console instead of the logging system. Defaults to False

Examples

>>> logger = create_logger(console_log_format_clean=True)
>>> print_banner("This is the start of a section")
<BLANKLINE>
--------------------------------------------------------------------------------
| This is the start of a section                                               |
--------------------------------------------------------------------------------

Notes

Unless the option ‘to_stdout’ is set to True, the banner is printed via the logging system. Therefore, a logger needs to be created first using create_logger
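The layout of the banner shown in the example can be sketched with plain string formatting. The function make_banner_sketch is hypothetical, assumes single-character symbols, and returns the banner as a string instead of sending it to the logging system.

```python
def make_banner_sketch(title, top_symbol="-", side_symbol="|", width=80):
    """Build a banner of the given width with the title between two rules."""
    rule = top_symbol * width
    # One side symbol and one space on each side leave width - 4 for the title
    middle = "{0} {1:<{2}} {0}".format(side_symbol, title, width - 4)
    return "\n".join([rule, middle, rule])


banner = make_banner_sketch("This is the start of a section")
print(banner)
```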

cbs_utils.misc.query_yes_no(question, default_answer='no')[source]

Ask a yes/no question via raw_input() and return the user’s answer.

Parameters:
  • question (str) – A question to ask the user
  • default_answer (str, optional) – A default answer that is given when only return is hit. Default to ‘no’
Returns:

“yes” or “no”, depending on the input of the user

Return type:

str

cbs_utils.misc.range1(start=None, stop=None)[source]

Return a range including the end value

Parameters:
  • start (int or None) – Start value in case both start and stop are defined. Otherwise start becomes stop
  • stop – Stop value, included in the range, in case start is defined as well.
Returns:

Range of integer values between start and stop, including the stop value

Return type:

list

cbs_utils.misc.read_settings_file(file_name) → dict[source]

Read the yaml file to get the setup information.

Parameters:file_name (str) – Name of the configuration file. Can be a full path name as well
Returns:All the settings as obtained from the yaml configuration file
Return type:dict

Notes

The file name of the yaml file is searched for in the following order

  1. The current directory where the script is executed. If a full path is given, this will be accepted too.
  2. The directory where the original script is located.

In this way, a default settings file can be put in the script directory and the user does not need to copy it unless a setting value needs to be changed

Raises:AssertionError: – In case the file can not be found
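The search order from the notes can be sketched without the yaml parsing step. The function find_settings_file_sketch and its script_dir argument are hypothetical; the sketch only locates the file and assumes the script directory is passed in explicitly.

```python
import os
import tempfile


def find_settings_file_sketch(file_name, script_dir):
    """Look for file_name in the current directory first, then in script_dir."""
    candidates = [
        file_name,  # relative to the current directory, or a full path
        os.path.join(script_dir, os.path.basename(file_name)),
    ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    raise AssertionError("Settings file {} not found".format(file_name))


with tempfile.TemporaryDirectory() as script_dir:
    # A default settings file which only exists in the script directory
    default_file = os.path.join(script_dir, "sketch_default_settings.yml")
    with open(default_file, "w") as stream:
        stream.write("verbose: true\n")
    found = find_settings_file_sketch("sketch_default_settings.yml", script_dir)
    found_in_script_dir = os.path.samefile(found, default_file)
```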
cbs_utils.misc.read_value_from_dict_if_valid(dictionary, key, default_value=None)[source]

Small routine to read a value from a dictionary. If the value is not set, just return the default value

Parameters:
  • dictionary – dictionary which is supposed to carry this key field
  • key – the name of the field to read the value from
  • default_value – default value in case we fail to read the key field (if it does not exist)
Returns:

value of the key field or the default value

Return type:

type

cbs_utils.misc.scan_base_directory(walk_dir='.', supplied_file_list=None, file_has_string_pattern='', file_has_not_string_pattern='', dir_has_string_pattern='', dir_has_not_string_pattern='', start_date_time=None, end_date_time=None, time_zone=None, time_stamp_year_first=True, time_stamp_day_first=False, extension=None, max_depth=None, sort_file_base_names=False)[source]

Recursively scan the directory walk_dir and get all files underneath obeying the search strings and/or date/time ranges

Parameters:
  • walk_dir (str, optional) – The base directory to start the import. Default = “.”
  • supplied_file_list (list, optional) – In case walk dir is not given we can explicitly pass a file list to analyse. Default = None
  • dir_has_string_pattern (str, optional) – Requires the directory name to have this pattern (Default value = “”). This selection is only made on the first directory level below the walk_dir
  • dir_has_not_string_pattern (str, optional) – Requires the directory name NOT to have this pattern (Default value = “”). This selection is only made on the first directory level below the walk_dir
  • file_has_string_pattern (str, optional) – Requires the file name to have this pattern (Default value = “”, i.e. matches all)
  • file_has_not_string_pattern (str, optional) – Requires the file name NOT to have this pattern (Default value = “”)
  • extension (str or None, optional) – Extension of the file to match. If None, also matches. Default = None
  • max_depth (int, optional) – Sets a maximum depth to which the search is carried out. Default = None, which does not limit the search depth. For deep file structures setting a limit to the search depth speeds up the search.
  • sort_file_base_names (bool, optional) – If True, sort the resulting file list alphabetically based on the file base name. Default = False
  • start_date_time (DateTime or None, optional) – If given, get the date time from the current file name and only add the files with a date/time equal to or later than the start_date_time. Default is None
  • end_date_time (DateTime or None, optional) – If given, get the date time from the current file name and only add the files with a date/time smaller than the end_date_time. Default is None
  • time_zone (str or None, optional) – If given add this time zone to the file stamp. The start and end time should also have a time zone
  • time_stamp_year_first (bool, optional) – Passed to the datetime parser. If true, the year is first in the date/time string. Default = True
  • time_stamp_day_first (bool, optional) – Passed to the datetime parser. If true, the day is first in the date/time string. Default = False
Returns:

All the file names found below the input directory walk_dir obeying all the search strings

Return type:

list

Examples

Find all the python files under the share directory in the Anaconda installation folder

>>> scan_dir = r"C:\Anaconda\share"
>>> file_list = scan_base_directory(scan_dir, extension='.py')

Find all the python files under the share directory in the Anaconda installation folder belonging to the pywafo directory

>>> file_list = scan_base_directory(scan_dir, extension='.py', dir_has_string_pattern="wafo")

Note that wafo matches on the directory ‘pywafo’, which is the first directory level below the scan directory. However, if we would match on ‘^wafo’ the returned list would be empty as the directory has to start with wafo.

In order to get all the files with “test” in the name with a directory depth smaller than 3 do

>>> file_list = scan_base_directory(scan_dir, extension='.py', dir_has_string_pattern="wafo",
...                                 file_has_string_pattern="test", max_depth=3)

Test the date/time boundaries. First create a file list from 28 Sep 2017 00:00 to 03:00 with a 30-minute interval and convert it to a string list

>>> file_names = ["AMS_{}.mdf".format(dt.strftime("%y%m%dT%H%M%S")) for dt in
...    pd.date_range("20170928T000000", "20170928T030000", freq="30min")]
>>> for file_name in file_names:
...     print(file_name)
AMS_170928T000000.mdf
AMS_170928T003000.mdf
AMS_170928T010000.mdf
AMS_170928T013000.mdf
AMS_170928T020000.mdf
AMS_170928T023000.mdf
AMS_170928T030000.mdf

Use the scan_base_directory to get the files within a specific date/time range

>>> file_selection = scan_base_directory(supplied_file_list=file_names,
...  start_date_time="20170928T010000", end_date_time="20170928T023000")
>>> for file_name in file_selection:
...     print(file_name)
AMS_170928T010000.mdf
AMS_170928T013000.mdf
AMS_170928T020000.mdf

Note that the selected range runs from 1 am until 2 am; the end_date_time of 2.30 am is not included
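
The half-open interval (start included, end excluded) can be illustrated with the standard library only. This is a sketch, not the implementation of scan_base_directory: the helper select_by_time is hypothetical and it assumes that the time stamp is the part of the name between the underscore and the extension, which holds for the example names only.

```python
from datetime import datetime


def select_by_time(file_names, start, end, stamp_format="%y%m%dT%H%M%S"):
    """Keep the names whose time stamp t obeys start <= t < end."""
    selection = []
    for name in file_names:
        # 'AMS_170928T010000.mdf' -> '170928T010000'
        stamp = name.split("_")[1].split(".")[0]
        time_stamp = datetime.strptime(stamp, stamp_format)
        if start <= time_stamp < end:
            selection.append(name)
    return selection


names = ["AMS_170928T000000.mdf", "AMS_170928T003000.mdf",
         "AMS_170928T010000.mdf", "AMS_170928T013000.mdf",
         "AMS_170928T020000.mdf", "AMS_170928T023000.mdf",
         "AMS_170928T030000.mdf"]
selected = select_by_time(names,
                          start=datetime(2017, 9, 28, 1, 0),
                          end=datetime(2017, 9, 28, 2, 30))
for name in selected:
    print(name)
```

With these inputs the files of 01:00, 01:30 and 02:00 are kept, and the file of 02:30 is dropped because the end of the range is exclusive.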

cbs_utils.misc.set_default_dimension(parse_value, default_dimension=None, force_default_units=False)[source]

Add a pint dimension to a value

Parameters:
  • parse_value (ndarray or str or float) – Value with an optional dimension written in the form of a str. Can be an array or list of strings as well
  • default_dimension (str) – Required default dimension
  • force_default_units (bool) – If true the only allowed dimension is the default dimension. Raise an error in case this is not the case. Default = False
Returns:

Value with the quantity as given by the default

Return type:

Quantity

Raises:

AssertionError – In case the dimension of the parse_value argument is not None but:

  1. Its dimensionality is not the same as the dimensionality of the default_dimension
  2. Its units is not the same as the unit of the default_dimension and the force_default_units flag is set to True

Notes

  • This function is an add-on to the pint module, a package to define, operate and manipulate physical quantities: https://pypi.python.org/pypi/Pint.
  • This function is used to add a dimension to a value which is parsed from a text file.
  • It is checked if the value given in the text file has dimension already, for example that it was given as “1.0 m/s”.
  • If a dimension was given already: check if the dimensionality (in this case: Length/Time) is the same as the dimensionality of the default_dimension input argument.
  • In case the input value does not have an explicit dimension, the dimension given by default_dimension is added to the value.
  • This function works on both scalar and list values

Examples

Assume we want to read input values from a text file as plain numbers and want to add a default dimension of meter in case a value does not have an explicit dimension yet. Just do

>>> logger = create_logger(console_log_level=logging.CRITICAL)
>>> value_without_dimension = 1.0  # this is the value as we read it from the text file
>>> value_with_dimension = set_default_dimension(value_without_dimension, "meter")
>>> print(value_with_dimension)
1.0 meter

The variable value_with_dimension is now a pint quantity which carries the dimension meter.

In case the input variable already has a dimension, we should also be able to use this function. The only requirement is that the dimensionality is the same. So this should work

>>> value_with_dimension = set_default_dimension("2.5 meter", "meter")
>>> print(value_with_dimension)
2.5 meter

This should work as well

>>> value_with_dimension = set_default_dimension("5.0 mm", "meter")
>>> print(value_with_dimension)
5.0 millimeter

But this fails as the dimensionality of the input argument is not corresponding with the dimensionality of the default dimension

>>> try:
...    value_with_dimension = set_default_dimension("5.0 mm", "second")
... except AssertionError:
...    print("This fails because the dimensionality is not the same")
This fails because the dimensionality is not the same

This function should also work for arrays and lists

>>> values_without_dimension = np.linspace(0, 1, num=5, endpoint=True)
>>> values_with_dimension = set_default_dimension(values_without_dimension, "meter/second^2")
>>> print(values_with_dimension)
[0.   0.25 0.5  0.75 1.  ] meter / second ** 2

Notes

  • Hz are not converted to rad/s as expected. Therefore do not try to use this to convert Hz -> rad/s
  • If the input argument parse_value is None, None is returned as output as well
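
The decision steps described in the notes can be sketched without pint. The unit table UNITS and the function with_default_unit below are hypothetical and only mimic the checks; in the real function the parsing and the dimensionality comparison are left to pint.

```python
# Hypothetical unit table: unit name -> (dimensionality, factor to base unit)
UNITS = {
    "meter": ("length", 1.0),
    "mm": ("length", 1.0e-3),
    "second": ("time", 1.0),
}


def with_default_unit(parse_value, default_unit, force_default_units=False):
    """Return (magnitude, unit), adding default_unit if no unit was given."""
    if parse_value is None:
        return None
    if isinstance(parse_value, str):
        # '5.0 mm' -> ('5.0', 'mm'); '5.0' -> ('5.0', '')
        magnitude, _, unit = parse_value.partition(" ")
        magnitude = float(magnitude)
    else:
        magnitude, unit = float(parse_value), ""
    if not unit:
        # plain number: attach the default unit
        return magnitude, default_unit
    # an explicit unit must have the same dimensionality as the default
    assert UNITS[unit][0] == UNITS[default_unit][0], "dimensionality differs"
    if force_default_units:
        assert unit == default_unit, "unit differs from the default unit"
    return magnitude, unit


print(with_default_unit(1.0, "meter"))
print(with_default_unit("5.0 mm", "meter"))
```

As in the examples above, a value in mm is accepted for a default of meter because both are lengths, whereas a default of second would raise an AssertionError.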
cbs_utils.misc.set_value_if_valid(value, new_value)[source]

Small routine to set a value only if the new value is not None. Otherwise the original value is kept

Parameters:
  • value – the original value which you can pre-define with a default value
  • new_value – the new value. Only used if it is not None
Returns:

New value, or the original value if new_value is None

Return type:

type
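
A typical use is overriding a pre-defined default with an optional setting. The function below is a stand-alone sketch of the behaviour described above under the hypothetical name set_if_not_none; the actual implementation may differ.

```python
def set_if_not_none(value, new_value):
    """Return new_value, unless it is None; then return the original value."""
    return value if new_value is None else new_value


n_bins = 10                             # pre-defined default
n_bins = set_if_not_none(n_bins, None)  # no override given: the default is kept
n_bins = set_if_not_none(n_bins, 25)    # override given: the new value is taken
print(n_bins)
```

Note that only None counts as "not given"; a new value of 0 or an empty string still replaces the original.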

cbs_utils.misc.standard_postcode(postcode)[source]

Convert a postcode to its standard form

Parameters:postcode (str) – Postcode string in non-standard form, such as 2613 AB, 2613ab, etc.
Returns:Postcode in standard form: 2613AB
Return type:str
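
A minimal sketch of such a normalisation, assuming that removing all white space and converting to upper case is sufficient; the name to_standard_postcode is hypothetical and the real function may do more checks.

```python
import re


def to_standard_postcode(postcode):
    """Remove all white space and convert the letters to upper case."""
    return re.sub(r"\s+", "", postcode).upper()


print(to_standard_postcode("2613 AB"))
print(to_standard_postcode("2613ab"))
```

Both calls give the standard form 2613AB.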
cbs_utils.misc.valid_date(s)[source]

Check if the supplied string s is a valid date in the format Year-Month-Day

Parameters:s (str) – A valid date in the form of YYYY-MM-DD, so first the year, then the month, then the day
Returns:Date object with the year, month, day obtained from the valid string representation
Return type:datetime
Raises:argparse.ArgumentTypeError – If s is not a valid date in the form YYYY-MM-DD

Notes

This is a helper function for the argument parser module argparse which allows you to check if the argument passed on the command line is a valid date.

Examples

This is the direct usage of valid_date to see if the date supplied is of format YYYY-MM-DD

>>> try:
...     date = valid_date("1973-11-12")
... except argparse.ArgumentTypeError:
...     print("This date is invalid")
... else:
...     print("This date is valid")
This date is valid

In case an invalid date is supplied

>>> try:
...     date = valid_date("1973-15-12")
... except argparse.ArgumentTypeError:
...     print("This date is invalid")
... else:
...     print("This date is valid")
This date is invalid

Here it is demonstrated how to add a ‘--startdate’ command line option to the argparse parser which checks if a valid date is supplied

>>> parser = argparse.ArgumentParser()
>>> p = parser.add_argument("--startdate",
...                         help="The Start Date - format YYYY-MM-DD ",
...                         required=True,
...                         type=valid_date)
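
A validator of this kind can be written in a few lines. The sketch below follows the common pattern of wrapping datetime.strptime; the name parse_iso_date is hypothetical and the code is not necessarily identical to valid_date.

```python
import argparse
from datetime import datetime


def parse_iso_date(s):
    """Convert a 'YYYY-MM-DD' string to a datetime or raise ArgumentTypeError."""
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except ValueError:
        # argparse turns this exception into a clean command line error
        raise argparse.ArgumentTypeError("Not a valid date: '{}'".format(s))


parser = argparse.ArgumentParser()
parser.add_argument("--startdate", required=True, type=parse_iso_date)
args = parser.parse_args(["--startdate", "1973-11-12"])
print(args.startdate.year, args.startdate.month, args.startdate.day)
```

Because the conversion is done by the type callable, args.startdate is already a datetime object when the parsing is finished.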

References

https://stackoverflow.com/questions/25470844/specify-format-for-input-arguments-argparse-python

cbs_utils.network module

class cbs_utils.network.ActiveDirectory(server: str = None)[source]

Bases: object

Active Directory representation. This object is intended to have all functionality that one would normally want/have from an Active Directory.

default_attrs = ['name', 'member', 'objectClass', 'adspath', 'primaryGroupToken', 'primaryGroupID']
default_groups = ['Administrators', 'Account Operators', 'Backup Operators', 'Server Operators', 'DnsAdmins', 'Domain Admins', 'Exchange Administrators', 'Exchange Services', 'DHCP Administrators']
get_group_info(name, searchRoot: str = None, category: str = 'Group', attributes: list = None) → Dict[str, str][source]

Obtain the group information for the given name

Parameters:
  • name – profile name
  • searchRoot – where to start looking
  • category – what category to look in (by default the Group)
  • attributes – additional attributes, examples are ‘name’, ‘member’, ‘objectClass’
Returns:

A Field : Value dictionary of all found information

get_group_members(strLdap: str, attributes: list = None) → List[Dict[str, object]][source]

Look up a group’s members.

Parameters:
  • strLdap – the group’s adspath attribute
  • attributes – attributes to append to the search query
Returns:

List of dictionaries; each dictionary item has a name and an indicator of whether it is a group

get_member_info(member: str, strLdap: str = None, attributes: List[str] = None) → Dict[str, str][source]

Returns user info. If there is no list of attributes given, it will use a default list (for testing purposes).

get_primary_group(token, searchRoot: str = None, header=' ')[source]

Used to look up users whose Primary Group is set to one of the groups we’re looking up. This is necessary as AD uses that attribute to calculate a group’s membership. These types of users do not show up if you query the group’s member field directly.

searchRoot is the part of the LDAP tree that you want to start searching from. token is the group’s primaryGroupToken.

cbs_utils.plotting module

Definition of CBS rgb colors. Based on the rgb color definitions from the CBS LaTeX template

class cbs_utils.plotting.CBSPlotSettings(fig_width_in_inch: float = None, fig_height_in_inch: float = None, number_of_figures_cols: int = 1, number_of_figures_rows: int = 2, text_width_in_pt: float = 392.64813, text_height_in_pt: float = 693, text_margin_bot_in_inch: float = 1.0, ratio_option='golden_ratio', plot_parameters: dict = None, color_palette: str = 'koel', font_size: int = 8)[source]

Bases: object

Class to hold the figure size for a standard document

Parameters:
  • number_of_figures_rows (int, optional) – Number of figure rows, default = 2
  • number_of_figures_cols (int, optional) – Number of figure cols, default = 1
  • text_width_in_pt (float, optional) – Width of the text in pt, default = 392.64
  • text_height_in_pt (float, optional) – Height of the text in pt: default = 693
  • text_margin_bot_in_inch (float, optional) – Space at the bottom in inch. Default = 1 inch
  • text_height_in_inch (float, optional) – Explicitly overrules the calculated text height if not None. Default = None
  • text_width_in_inch (float, optional) – Explicitly overrules the calculated text width if not None. Default = None
  • plot_parameters (dict, optional) – Dictionary with plot settings. If None (default), take the cbs defaults
  • color_palette ({"koel", "warm"}, optional) – Pick color palette for the plot. Default is “koel”
  • font_size (int, optional) – Size of all fonts. Default = 8

Notes

  • The variables are set to make sure that the figures have exactly the same size as in the document, such that we do not have to rescale them. In this way the fonts will have the same size here as in the document
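
The conversion from the text width in pt to a figure size in inch can be sketched as follows. This is an illustration under stated assumptions, not the code of CBSPlotSettings: it assumes TeX points (72.27 pt per inch) and a height that follows from the golden ratio; the function name figure_size_in_inch is hypothetical.

```python
POINTS_PER_INCH = 72.27            # assumption: TeX points, not 72 pt per inch
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # assumption: meaning of 'golden_ratio'


def figure_size_in_inch(text_width_in_pt=392.64813, number_of_figures_cols=1):
    """Return (width, height) in inch of one figure column."""
    width = text_width_in_pt / POINTS_PER_INCH / number_of_figures_cols
    height = width / GOLDEN_RATIO
    return width, height


width, height = figure_size_in_inch()
print("{:.2f} x {:.2f} inch".format(width, height))
```

With the default text width this gives a figure of about 5.43 inch wide, so a figure created at this size can be included in the document without scaling.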
cbs_utils.plotting.add_axis_label_background(fig, axes, alpha=1, margin=0.05, x0=None, y0=None, loc='east', radius_corner_in_mm=1, logo_margin_x_in_mm=1, logo_margin_y_in_mm=1, add_logo=True, aspect=None)[source]

Add a background to the axis label

Parameters:
  • fig (mpl.figure.Figure object) – The total canvas of the Figure
  • axes (mpl.axes.Axes object) – The axes of the plot to add a box
  • alpha (float, optional) – Transparency of the box. Default = 1 (not transparent)
  • margin (float, optional) – The margin between the labels and the side of the gray box
  • loc ({"east", "south"}) – Location of the background. Default = “east” (left of the y-axis). Only “east” and “south” are implemented
  • add_logo (bool, optional) – If true, add the cbs logo. Default = True
  • radius_corner_in_mm (float, optional) – Radius of the corner in mm. Default = 1
  • logo_margin_x_in_mm (float) – Distance from bottom of logo in mm. Default = 1
  • logo_margin_y_in_mm (float) – Distance from left of logo in mm. Default = 1
cbs_utils.plotting.add_cbs_logo_to_plot(fig, axes=None, margin_x_in_mm=6.0, margin_y_in_mm=6.0, x0=0, y0=0, width=None, height=None, zorder_start=1)[source]
cbs_utils.plotting.add_cbs_pnglogo_to_plot(fig, axes=None, image=None, margin_x=6, margin_y=6, loc='lower left', zorder=10, color='blauw', alpha=1.0, logo_width_in_mm=3.234, logo_height_in_mm=4.995, resample=False)[source]

Add a CBS logo to a plot

Parameters:
  • fig (mpl.pyplot.axes.Axes object) –
  • image (mpl.image or None) – To prevent reading the logo many times, you can read it once and pass the returned image as an argument in the next call
  • color ({"blauw", "wit", "grijs"}) – Color of the logo. Three colors are available: blauw (blue), wit (white) and grijs (grey). Default = “blauw”
  • margin_x, margin_y – The x/y image offset in mm.
  • alpha (None or float) – The alpha blending value.
  • loc ({"lower left", "upper left", "upper right", "lower right"} or tuple) – Location of the logo.
  • size (int) – Size of the icon in pixels
Returns:

The image of the logo

Return type:

mpl.image

cbs_utils.plotting.add_values_to_bars(axis, type='bar', position='c', format='{:.0f}', x_offset=0, y_offset=0, color='k', horizontalalignment='center', verticalalignment='center')[source]

Add the values of the bars as numbers in the center

Parameters:
  • axis (mpl.pyplot.axes.Axes object) – Axis containing the bar plot
  • position ({"c", "t", "l", "r", "b"}, optional) – Location of the numbers, where “c” is center, “t” is top, “l” is left, “r” is right and “b” is bottom. Default = “c”
  • type ({"bar", "barh"}) – Direction of the bars. Default = “bar”, meaning vertical bars. Alternatively you need to specify “barh” for horizontal bars.
  • format (str, optional) – Formatter to use for the numbers. Default = “{:.0f}” (remove digits from float)
  • x_offset (float, optional) – x offset in pt. Default = 0
  • y_offset (float, optional) – y offset in pt. Default = 0
  • color ("str", optional) – Color of the characters, Default is black
  • horizontalalignment (str, optional) – Horizontal alignment of the numbers. Default = “center”
  • verticalalignment (str, optional) – Vertical alignment of the numbers. Default = “center”
cbs_utils.plotting.clean_up_artists(axis, artist_list)[source]

Try to remove the artists stored in the artist list belonging to the ‘axis’.

Parameters:
  • axis – clean artists belonging to this axis
  • artist_list – list of artists to remove
Returns:

Nothing

cbs_utils.plotting.get_cbs_logo_points(logo_width_in_mm=3.234, logo_height_in_mm=4.995, rrcor=0.171)[source]

Create an array with the letters of the CBS logo

Parameters:
  • logo_width_in_mm (float) – Width of the logo
  • logo_height_in_mm (float) – Height of the logo
  • rrcor (float) – Radius of corners
Returns:

List with 3 Nx2 arrays

Return type:

list

cbs_utils.plotting.get_color_palette(style='koel')[source]

Set the color palette

Parameters:style ({"koel", "warm"}, optional) – Color palette to pick. Default = “koel”
Returns:cbs_color_cycle
Return type:mpl.cycler

Notes

In order to set the cbs color palette as default:

import matplotlib as mpl
from cbs_utils.plotting import get_color_palette
mpl.rcParams.update({'axes.prop_cycle': get_color_palette("warm")})
cbs_utils.plotting.report_colors()[source]

cbs_utils.readers module

cbs_utils.regular_expressions module

This module contains a collection of often used regular expressions

cbs_utils.sql_server_utils module

cbs_utils.string_measures module

Created on Tue Apr 17 12:06:13 2018

Author: Paul Keuren

Notes

  • Levenshtein distances can also be calculated using the python-levenshtein module
cbs_utils.string_measures.levenshtein_distance(s: str, t: str) → int[source]

Calculate Levenshtein distance

Parameters:
  • s (str) – First string
  • t (str) – Second string
Returns:

Distance between strings

Return type:

int

Notes

  • [NL] Bereken de Levenshtein afstand tussen strings. Deze afstandsbepaling geeft aan hoeveel wijzgingen minimaal nodig zijn om van een string de andere string te gaan. Deze implementatie gebruikt een matrix met grootte len(s)*len(t).
  • [EN] Calculates the Levenshtein distance between strings. The Levenshtein distance computes the minimal number of changes (addition/removal/substitution) required to transform one string to the other string. This specific implementation uses a matrix with size len(s)*len(t).
  • For more information on the topic see the Wikipedia article on the Levenshtein distance
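
The matrix approach mentioned in the notes can be sketched as below. This is the textbook dynamic programming scheme written for illustration under the hypothetical name levenshtein; it is not copied from cbs_utils and uses one extra row and column for the empty prefixes.

```python
def levenshtein(s, t):
    """Levenshtein distance using a (len(s) + 1) x (len(t) + 1) matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # remove all i characters of s
    for j in range(cols):
        dist[0][j] = j  # add all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # removal
                             dist[i][j - 1] + 1,         # addition
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]


print(levenshtein("kitten", "sitting"))
```

The distance between "kitten" and "sitting" is 3: two substitutions and one addition.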
cbs_utils.string_measures.optimal_string_alignment_distance(s: str, t: str) → int[source]
Parameters:
  • s (str) – First string
  • t (str) – Second string
Returns:

OSA distance

Return type:

int

Notes

  • [NL] Het Optimal String Alignment (OSA) algoritme is een beperkte schatting van de Damerau- Levenshtein (DL) afstand. Het gebruikt geen alphabet (zoals bij DL), maar is beperkt in het aantal transposities wat deze kan meenemen. DL daarentegen neemt alle transposities mee, echter is dit vaak zeer duur en is de OSA goed genoeg.
  • [EN] The optimal string alignment (OSA) algorithm allows for a quick estimation of the Damerau-Levenshtein (DL) distance. It does not require an additional alphabet, but is therefore limited in its transposition detection/completion. This makes the algorithm cheaper than the DL distance, but also less accurate.
  • For more information on the topic see the Wikipedia article on the Damerau-Levenshtein distance
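
Compared to the Levenshtein scheme, the OSA algorithm adds one extra check for two adjacent characters that are swapped. The sketch below is the textbook version under the hypothetical name osa_distance, given for illustration only.

```python
def osa_distance(s, t):
    """Optimal string alignment distance between the strings s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # removal
                             dist[i][j - 1] + 1,         # addition
                             dist[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("ab", "ba"))
print(osa_distance("ca", "abc"))
```

The first pair has a distance of 1 (one transposition). The second pair shows the limitation: OSA gives 3, whereas the full Damerau-Levenshtein distance is 2, because OSA does not edit a substring again after it has been transposed.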

cbs_utils.web_scraping module

A collection of classes and utilities to assist with web scraping

Author: Eelco van Vliet

class cbs_utils.web_scraping.HRefCheck(href, url, valid_extensions=None, max_depth=1, branch_count=None, max_branch_count=50, schema=None, ssl_valid=True, validate_url=False)[source]

Bases: object

Class to check if a hyper ref obtained from a web page is a valid internal or external hyper-reference

Parameters:
  • href (str) – hyper-reference to check as found on the domain
  • url (str) – Main domain name. Used to check if we have an internal or external hyper-reference
  • valid_extensions (list, optional) – List of strings with valid extensions. Default = [“.html”]
  • max_depth (int, optional) – Maximum search depth. Default = 1
  • branch_count (object) – collections.Counter object which keeps the current count of each branch. This is used to check how often subbranches of the domain are visited. In case the max_branch_count is exceeded we stop searching this branch
  • max_branch_count (int, optional) – Maximum number of times a branch in a domain is visited. For instance, in case we have ebay/cars/ as branch, there may be 100,000 cars under this branch which would all be visited. With the branch counter we can stop visiting this branch once the maximum is reached. Default = 50
  • schema (str, optional) – Either http or https. If not given (None), the schema is obtained by doing requests to the site; in case a ‘schema’ is given, this is skipped and the given schema is used
  • ssl_valid (bool, optional) – In case of a https schema, this flag indicates if the certificate was valid.
  • validate_url (bool) – Validate each url if it gives a 200 code.
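
The two checks described by the parameters, internal versus external and the branch counter, can be sketched with the standard library. The helpers is_internal and may_visit are hypothetical; the sketch assumes that url includes its schema and that a branch is the first level of the path.

```python
from collections import Counter
from urllib.parse import urlparse


def is_internal(href, url):
    """Return True if href stays on the domain of url."""
    href_domain = urlparse(href).netloc
    # a relative href has no domain and therefore stays on the current domain
    return href_domain == "" or href_domain == urlparse(url).netloc


def may_visit(href, branch_count, max_branch_count=50):
    """Count a visit to the branch of href and check the maximum."""
    branch = urlparse(href).path.strip("/").split("/")[0]
    branch_count[branch] += 1
    return branch_count[branch] <= max_branch_count


counter = Counter()
for href in ["/cars/1", "/cars/2", "/cars/3"]:
    print(href, may_visit(href, counter, max_branch_count=2))
```

With a maximum of 2 the third page under the cars branch is refused, which is how the large number of pages under a single branch is avoided.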
get_full_url(href)[source]

Test if this href could be a full url and if so, if it is valid

is_valid_href()[source]

Check if the current hyper-reference is valid such that we can follow it further

Returns:Flag which is True in case the hyperref is valid
Return type:bool
class cbs_utils.web_scraping.RequestUrl(url: str, session=None, timeout: float = 5.0, retries: int = 3, backoff_factor: float = 0.3, status_forcelist: list = (500, 502, 503, 504), schema=None, ssl_valid=None, validate_url=False)[source]

Bases: object

Add a protocol (https, http) if we don’t have any. Try which one fits

Parameters:
  • url (str) – Url to get the protocol from
  • session (optional) – Session object of an already open session can be passed
  • timeout (float, optional) – Time-out of the request. Default = 5 s
  • retries (int, optional) – Number of times we try to connect. Default = 3
  • backoff_factor (float, optional) – Time that we delay. Default = 0.3
  • status_forcelist (list, optional) – List of status codes which we force to stop. Default = (500, 502, 503, 504)
  • schema (str, optional) – Schema of the url (http or https). If given, this schema is used. Default = None, which means it will be obtained by the class
  • ssl_valid (bool, optional) – True in case the certificate of an https url is valid
  • validate_url (bool, optional) – Make connection to the url to validate if it exists (has 200 code). Default=False

Examples

>>> req = RequestUrl("www.google.com")

This adds https to www.google.com as this is the first address that is valid

static add_schema_to_url(url, schema='https')[source]

Create a full url link including http or https
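
What such a helper could look like is sketched below under the hypothetical name with_schema. It assumes that a url which already starts with http:// or https:// is returned unchanged; whether add_schema_to_url does the same is not stated here.

```python
def with_schema(url, schema="https"):
    """Prepend 'schema://' to url unless it already starts with http(s)://."""
    if url.startswith(("http://", "https://")):
        return url
    return "{}://{}".format(schema, url)


print(with_schema("www.google.com"))
print(with_schema("www.google.com", schema="http"))
```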

assign_protocol_to_url(url)[source]

Add http or https to a url and check if the TLS certificate is valid

make_contact_with_url(url, schema='https', verify=True)[source]

Connect to the url to see if it is valid

class cbs_utils.web_scraping.UrlSearchStrings(url, search_strings: dict, sort_order_hrefs: list = None, stop_search_on_found_keys: list = None, store_page_to_cache=False, cache_directory='cache', timeout=5.0, max_frames=10, max_hrefs=1000, max_depth=2, max_branch_count=10, max_cache_dir_size=None, scrape_url=True, timezone='Europe/Amsterdam', schema=None, ssl_valid=None, validate_url=None)[source]

Bases: object

Class to set up a recursive search of string on web pages

Parameters:
  • url (str) – Main url to start searching
  • search_strings (dict) –

    Dictionary with the searches performed per page. The form is:

    {
        "name_of_search_1": "search_string_1" ,
        "name_of_search_2": "search_string_2"
    }
    
  • store_page_to_cache (bool, optional) – Each page retrieved is also stored to cache if true. Default = False
  • timeout (float, optional) – Time in seconds to wait on a request before going to the next. Default = 5.0
  • sort_order_hrefs (list, optional) – List of names of subdomains which we want to search first
  • stop_search_on_found_keys (list) – List of search keys from the search_strings dict for which we immediately stop searching as soon as we find a match
  • cache_directory (str, optional) – Name of the cache directory. Default = “cache”
  • max_frames (int, optional) – Maximum number of frames we scrape. Default = 10
  • max_hrefs (int, optional) – Maximum number of hyper references we follow. Default = 1000
  • max_depth (int, optional) – Maximum depth we search the domain. Default = 2
  • max_branch_count (int, optional) – Maximum number of requests per branch. Default = 10
  • max_cache_dir_size (int, optional) – Maximum size of the cache directory in Mb. If None, there is no maximum. If 0, no cache is written. If a finite number, the current directory size has to be checked before each cache write, which slows down the code significantly. Default = None
  • scrape_url (bool, optional) – Flag to indicate if we want to scrape. If false, no scraping or any other internet access is done. This allows using the object without doing a scrape
  • timezone (str, optional) – Time zone of the scrape. Default = “Europe/Amsterdam”
  • schema (str, optional) – Protocol of the url, http or https. If None (default) it will be obtained
  • ssl_valid (bool, optional) – Flag to indicate if the tls encryption has a valid certificate
  • validate_url – Validate url to check if it exists
exists

Flag which is True if the url exists

Type:bool
matches

Dictionary containing the results of the searches defined by search_strings. The keys are derived from the search_strings key, the results are lists containing all the matches

Type:dict
number_of_iterations

Number of recursions

Type:int

Notes

  • This class can also handle web pages with frames. Normally, these are not analysed by beautiful soup; however, by explicitly looking up all frames and following the links defined by the ‘src’ tag, we can access all the frames in a url

Examples

Let's say we have a web site ‘www.example.com’ and want to extract all the postcodes. Also, we want to get all the words with more than 10 characters. For this, store your regular expression for both searches in a dictionary and feed it to the UrlSearchStrings class

>>> url = "www.example.com"
>>> search = dict(postcode=r"\d{4}\s{0,1}[a-zA-Z]{2}", longwords=r"\w{11,}")
>>> url_analyse = UrlSearchStrings(url, search_strings=search)

The results are stored in the ‘matches’ attribute of the class and can be reported by printing the class like:

>>> print(url_analyse)
Matches in https://www.example.com/
postcode : []
longwords : ['established', 'illustrative', 'coordination', 'information']

In our example, the list of matches for the postal codes is empty (for the example domain), and we have found 4 words with more than 10 characters

>>> postcodes = url_analyse.matches["postcode"]

Note that the keys of the matches dictionary are the same as the keys we used for the search
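
The two regular expressions can be tried on a plain string without any internet access. The sketch below assumes that each expression is applied to the page text with re.findall, which is an assumption about the class internals; the sample text is made up.

```python
import re

search = dict(postcode=r"\d{4}\s{0,1}[a-zA-Z]{2}", longwords=r"\w{11,}")

# made-up sample text standing in for the text of a web page
text = "Office: Examplestraat 1, 1234 AB Voorbeeldstad"

# same keys as the search dictionary, a list of matches per key
matches = {key: re.findall(pattern, text) for key, pattern in search.items()}
for key, found in matches.items():
    print(key, ":", found)
```

For this text the postcode search finds '1234 AB' and the longwords search finds the two words of 13 characters.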

follow_frames(soup, url)[source]

In the current soup, find all the frames and for each frame start a new pattern search

Parameters:
  • soup (BeautifulSoup.soup) – The current soup
  • url (str) – The current url
follow_hrefs(soup)[source]

In the current soup, find all the hyper references and follow them if we stay in the domain

Parameters:
  • soup (BeautifulSoup.soup) – The current soup
  • url (str) – The current url
static get_patterns(soup, regexp) → list[source]

Retrieve all the pattern matches in the soup obtained from the url with BeautifulSoup

Parameters:
  • soup (object:BeautifulSoup) – Return value of the beautiful soup of the page where we want to search
  • regexp (re.Pattern) – Compiled regular expression to find on this page
Returns:

List of matches with the regular expression

Return type:

list

make_href_df(links)[source]

Create a pandas dataframe of all the hyper reference on this page and keep track of the properties of the hrefs. At the end, a sort of the references is made

Parameters:links (list) – List of hyper references
make_soup(url)[source]

Get the beautiful soup of the page url

Search the ‘url’ for the patterns and continue if links to other pages are present

cbs_utils.web_scraping.cache_to_disk(func)[source]

Decorator which allows to cache the output of a function to disk

Parameters:
  • skip_cache (bool) – If True, always skip the cache, even if the decorator was added
  • max_cache_dir_size (int or None) – If not None, check if the size of the cache directory is not exceeding the maximum given in Mb
  • cache_directory (str) – Name of the cache file output directory

Examples

Say you have a function that reads the contents of a web page from internet:

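import requests
from cbs_utils.web_scraping import cache_to_disk

@cache_to_disk
def get_page_from_url(url, timeout=1.0):
    try:
        page = requests.get(url, timeout=timeout)
    except requests.exceptions.ConnectionError:
        # no connection could be made: return None instead of a page
        page = None
    return page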

Without the @cache_to_disk decorator, you would just read the contents of a html file with:

page = get_page_from_url("nu.nl")

However, because we have added the @cache_to_disk decorator, the data is read from the website only the first time and is then stored to a pickle file. In all subsequent runs the data is obtained from the pickle file.

The cache_to_disk decorator checks if some parameters are given. With the skip_cache flag you can prevent the cache from being used even if the decorator was added. In case max_cache_dir_size is defined, the size of the cache directory is checked first and new cache is only written if the size of the directory in MB is smaller than the defined maximum. An example of using the maximum would be:

page = get_page_from_url("nu.nl", max_cache_dir_size=0)

In this example, we do not allow new cache files to be added at all, but old cache files can still be read if present in the cache directory
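
The principle of caching to a pickle file can be sketched with the standard library. The decorator cache_to_pickle below is hypothetical and much simpler than cache_to_disk: it takes the cache directory as an argument, supports positional arguments only and has no size check.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def cache_to_pickle(cache_directory):
    """Decorator factory: store the result of a function as a pickle file."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, skip_cache=False):
            if skip_cache:
                return func(*args)
            name = "_".join([func.__name__] + [str(a) for a in args]) + ".pkl"
            cache_file = Path(cache_directory) / name
            if cache_file.exists():
                # later calls: read the stored result from the pickle file
                with open(cache_file, "rb") as stream:
                    return pickle.load(stream)
            result = func(*args)
            cache_file.parent.mkdir(parents=True, exist_ok=True)
            with open(cache_file, "wb") as stream:
                pickle.dump(result, stream)
            return result
        return wrapper
    return decorator


calls = []


@cache_to_pickle(tempfile.mkdtemp())
def slow_square(x):
    calls.append(x)  # record every real evaluation
    return x * x


print(slow_square(4), slow_square(4), len(calls))
```

The function body is evaluated once only; the second call returns the pickled result, so the number of recorded evaluations stays at 1.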

cbs_utils.web_scraping.get_clean_url(url)[source]

Get the base of a url without the relative part

cbs_utils.web_scraping.get_page_from_url(url, session=None, timeout=1.0, skip_cache=False, raise_exceptions=False, max_cache_dir_size=None, headers=None, verify=True, cache_directory=None)[source]

Get the contents of url and immediately store the result to a cache file

Parameters:
  • url (str) – String with the url to fetch
  • session (object:Session) – A session can be passed in case you want to keep it open
  • timeout (float) – Number of seconds you try to connect
  • skip_cache (bool) – If True, do not use the cache decorator and do not write new cache
  • raise_exceptions (bool) – If True, raise the exceptions of the requests
  • max_cache_dir_size (int) – Maximum size of cache in Mb. Stop writing cache as soon as the maximum has been reached. If None, this test is skipped and the cache is always written. If 0, we never write cache and therefore the check of the current directory size can be skipped, which significantly speeds up the code
  • headers (dict) – Headers to use for the request
  • verify (bool) – Forces to verify the certificate
  • cache_directory (str) – Name of the cache directory which is passed to the decorator
Returns:

The html page

Return type:

request.Page

Notes

  • The ‘cache_to_disk’ decorator takes care of caching the data to the directory cache

Examples

If you want to get the page using requests.get with caching do the following

>>> url = "https://www.example.com"
>>> page = get_page_from_url(url, cache_directory="cache_test")
>>> soup = BeautifulSoup(page.text, 'lxml')
>>> body_text = re.sub(r'\s+', ' ', soup.body.text)
>>> print(body_text)
 Example Domain This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission. More information...

At this point a directory cache_test has also been created, containing a cache file named get_page_from_url_https_www_example_com_.pkl

If you only want to read existing cache (in case it was written before) but do not want to write new cache, add the max_cache_dir_size=0 argument

>>> page = get_page_from_url(url, cache_directory="cache_test", max_cache_dir_size=0)
cbs_utils.web_scraping.is_url(url)[source]

Check if url is valid

cbs_utils.web_scraping.make_cache_file_name(function_name, args)[source]

Create a cache file name based on the function name + list of arguments

Parameters:
  • function_name (str) – name of the function to prepend
  • args (tuple) – arguments passed to the function
Returns:

Name of the cache file

Return type:

str

Notes

  • Used by cache_to_disk to make a name of a cache file based on its input arguments
  • To make sure that we get a valid file name, we remove all the special characters
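
Removing the special characters can be done with a single regular expression. The function cache_file_name below is a hypothetical sketch which assumes that every run of characters other than letters, digits and underscores is replaced by one underscore; the exact rules of make_cache_file_name may differ.

```python
import re


def cache_file_name(function_name, args):
    """Build a valid file name from the function name and its arguments."""
    raw = "_".join([function_name] + [str(arg) for arg in args])
    # replace every run of characters that is not a letter, digit or
    # underscore by a single underscore
    return re.sub(r"\W+", "_", raw) + ".pkl"


print(cache_file_name("get_page_from_url", ("https://www.example.com/",)))
```

For a url with a trailing slash this gives get_page_from_url_https_www_example_com_.pkl.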
cbs_utils.web_scraping.requests_retry_session(retries=1, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session=None)[source]

Do request with retry

Parameters:
  • retries (int) – Number of retries
  • backoff_factor
  • status_forcelist
  • session (object) –
Returns:

session link

Return type:

requests.Session
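
The idea of retrying with an increasing delay can be illustrated without any network access. The sketch below is a generic stand-alone retry loop and not the implementation of requests_retry_session; the helper call_with_retry is hypothetical and the doubling of the delay is an assumption, so the actual delay schedule may differ.

```python
import time


def call_with_retry(func, retries=1, backoff_factor=0.3, sleep=time.sleep):
    """Call func and retry up to 'retries' times after a ConnectionError."""
    for attempt in range(retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == retries:
                raise  # no retries left: pass the error to the caller
            # assumed schedule: the delay doubles after each failed attempt
            sleep(backoff_factor * 2 ** attempt)


attempts = []
delays = []


def flaky():
    """Stand-in for a request which fails twice before it succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("no connection")
    return "page"


# the delays are recorded in a list instead of really sleeping
print(call_with_retry(flaky, retries=3, sleep=delays.append), delays)
```

Here the third attempt succeeds after delays of 0.3 and 0.6 seconds. With retries=1 the same function would raise the ConnectionError after the second attempt.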

cbs_utils.web_scraping.strip_url_schema(url)[source]

Module contents