cbs_utils package¶
Submodules¶
cbs_utils.global_vars module¶
Some global variable definitions
cbs_utils.mail module¶
-
class
cbs_utils.mail.CBS_SMTP_Message(sender: str, adressee: str, subject: str = '', body: str = '', mail_server: str = 'mail.cbsp.nl')[source]¶ Bases:
cbs_utils.mail.EmailFormat
CBS SMTP based email message.
Notes
- CBS SMTP based email message. This message object makes it easy to send an email message to a recipient. Mail can be sent from any e-mail account that the central server authorises you to use.
cbs_utils.misc module¶
Some miscellaneous functions used throughout many cbs modules
-
class
cbs_utils.misc.CacheInfo(file_name, directory='.', file_type=None, reset_cache=False)[source]¶ Bases:
object
Class to store information about the cache
Parameters:
-
class
cbs_utils.misc.Chdir(new_path)[source]¶ Bases:
object
Class that lets you move to a directory, do something there, and move back when done
Parameters: new_path (str) – Location where you want to do something
Notes
Used on the Gompute cluster in the batch processing script to submit a job inside a directory and then move back to the higher directory in order to move to the next case
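The enter/leave pattern described above can be sketched with a small context manager. This is a minimal sketch and not the actual Chdir implementation: the name ChdirSketch and the temporary directory used in the demonstration are assumptions made for illustration only.

```python
import os
import tempfile


class ChdirSketch:
    """Hypothetical stand-in for Chdir: enter a directory, restore the old one on exit."""

    def __init__(self, new_path):
        self.new_path = new_path
        self.old_path = None

    def __enter__(self):
        # Remember where we started so that we can return there later.
        self.old_path = os.getcwd()
        os.chdir(self.new_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always move back, even if the block raised an exception.
        os.chdir(self.old_path)
        return False


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with ChdirSketch(tmp_dir):
        inside_dir = os.getcwd()  # we are now inside the temporary directory
end_dir = os.getcwd()  # back where we started
```

Restoring the directory in `__exit__` means the caller ends up in the starting directory even when the work inside the block fails, which is what a batch script looping over several cases needs.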
Examples
Go to a known directory (C:/)
>>> os.chdir("C:/")
>>> os.getcwd()
'C:\\'
With the Chdir context manager we move to the C:/Windows directory, where we can do something.
>>> with Chdir("C:/Windows") as d:
...     # in this block we can do something in the directory Windows.
...     os.getcwd()
'C:\\Windows'
We have left the block under Chdir, so we are back at the directory where we started
>>> os.getcwd()
'C:\\'
-
class
cbs_utils.misc.ConditionalDecorator(dec, condition)[source]¶ Bases:
object
Add a decorator to a function only if the condition is True
Parameters: - dec (decorator) – The decorator which you want to add when condition is true
- condition (bool) – Only add the decorator if this condition is True
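The two parameters above suggest a wrapper that either applies the decorator or returns the function untouched. The following is a minimal sketch under that assumption; ConditionalDecoratorSketch, shout and the two greet functions are hypothetical names and not part of cbs_utils.

```python
import functools


class ConditionalDecoratorSketch:
    """Hypothetical stand-in: apply `dec` only when `condition` is True."""

    def __init__(self, dec, condition):
        self.dec = dec
        self.condition = condition

    def __call__(self, func):
        if not self.condition:
            # Condition is False: hand back the undecorated function.
            return func
        return self.dec(func)


def shout(func):
    """Example decorator that upper-cases the return value of func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return wrapper


@ConditionalDecoratorSketch(shout, True)
def greet_loud():
    return "hello"


@ConditionalDecoratorSketch(shout, False)
def greet_plain():
    return "hello"
```

Here `greet_loud()` is wrapped by `shout`, while `greet_plain()` is left exactly as it was defined.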
-
class
cbs_utils.misc.PackageInfo(module_object)[source]¶ Bases:
object
A class to analyse the version properties of this package
Parameters: module_object (Module) – Reference to the module for which we want to store the properties
-
class
cbs_utils.misc.Timer(message='Elapsed time', name='routine', verbose=True, units='ms', n_digits=0, field_width=20)[source]¶ Bases:
object
Class to measure the time it takes to execute a section of code
Parameters:
Example
Use a with / as construction to enclose the section of code which needs to be timed
Also, make sure to merge the loggers in order to activate the logging output of the Timer class
>>> import logging
>>> import time
>>> from numpy import allclose
>>> from cbs_utils.misc import (Timer, merge_loggers)
>>> number_of_seconds = 1.0
>>> logger = logging.getLogger(__name__)
>>> merge_loggers(logger, "cbs_utils")
>>> with Timer(units="s", n_digits=0) as timer:
...     time.sleep(number_of_seconds)
Elapsed time routine : 1 s
>>> allclose(number_of_seconds, timer.secs, rtol=0.1)
True
-
cbs_utils.misc.clean_up_name(name)[source]¶ Remove all troublesome characters such as [ or ] or
Parameters: name (str) – String to be cleaned Returns: Clean name Return type: str
-
cbs_utils.misc.clear_argument_list(argv)[source]¶ Small utility to remove the ‘\r’ character from the last argument of the argv list appearing in cygwin
Parameters: argv (list) – The argument list stored in sys.argv Returns: Cleared argument list Return type: list
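A minimal sketch of the clean-up described above, assuming that only a trailing carriage return on the last argument has to be removed. The function name clear_argument_list_sketch is hypothetical; the real implementation may differ.

```python
def clear_argument_list_sketch(argv):
    """Return a copy of argv with a trailing carriage return stripped from the last item."""
    cleaned = list(argv)  # copy, so the caller's list is left untouched
    if cleaned:
        cleaned[-1] = cleaned[-1].rstrip("\r")
    return cleaned


example_argv = ["script.py", "--flag", "value\r"]
cleaned_argv = clear_argument_list_sketch(example_argv)
```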
-
cbs_utils.misc.clear_path(path_name)[source]¶ - routine to clear spurious dots and slashes from a path name
- example bla/././oke becomes bla/oke
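The normalisation in the example above can be reproduced with the standard library. This sketch uses posixpath.normpath so that it behaves the same on every platform; whether clear_path itself is built on normpath is an assumption, and clear_path_sketch is a hypothetical name.

```python
import posixpath


def clear_path_sketch(path_name):
    """Collapse spurious '.' components and repeated or trailing slashes."""
    return posixpath.normpath(path_name)


short_path = clear_path_sketch("bla/././oke")
relative_path = clear_path_sketch("./../ok/yoo/././/")
```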
Parameters: path_name (str) – Path name to clean up Returns: The cleaned path name as a string Return type: str
Examples
>>> long_path = os.path.join(".", "..", "ok", "yoo", ".", ".", "") + "/"
>>> print(long_path)
.\..\ok\yoo\.\.\/
>>> print(clear_path(long_path))
..\ok\yoo
-
cbs_utils.misc.compare_objects(obj1, obj2, counter=0, max_recursion_depth=4)[source]¶ Check whether two objects are equal
Parameters: Notes
- This function compares all the attributes of two objects to see if their values are the same
- An attribute may itself be another object, which we also want to compare with the same attribute of the other object. This is done by calling this function recursively.
- Due to the recursive call mechanism we may end up in an infinite loop. To prevent this, a maximum recursion depth can be given.
- The test function test_sequence_tool of the sequence_tool_utils module uses this function to compare two SequenceToolSummary objects
Raises: AssertionError: – In case one of the object fields is not equal
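The recursive comparison with a depth limit, as outlined in the notes above, might look like the sketch below. It is not the actual implementation: compare_objects_sketch and the Point and Box classes are hypothetical, and the sketch assumes that attributes live in the instance `__dict__`.

```python
_MISSING = object()  # sentinel for attributes that obj2 does not have


def compare_objects_sketch(obj1, obj2, counter=0, max_recursion_depth=4):
    """Assert that all attributes of obj1 equal those of obj2, recursing into sub-objects."""
    if counter > max_recursion_depth:
        # Stop descending: this guards against self-referencing objects.
        return
    for name, value1 in vars(obj1).items():
        value2 = getattr(obj2, name, _MISSING)
        assert value2 is not _MISSING, "Attribute {!r} is missing".format(name)
        if hasattr(value1, "__dict__"):
            # The attribute is itself an object: compare its fields recursively.
            compare_objects_sketch(value1, value2, counter + 1, max_recursion_depth)
        else:
            assert value1 == value2, "Attribute {!r} differs".format(name)


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Box:
    def __init__(self, corner, label):
        self.corner = corner
        self.label = label


# Equal objects pass silently; a differing field would raise an AssertionError.
compare_objects_sketch(Box(Point(1, 2), "a"), Box(Point(1, 2), "a"))
```

Passing the counter down explicitly keeps the depth limit independent of Python's own recursion limit.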
-
cbs_utils.misc.create_logger(name='root', log_file=None, console_log_level=20, console_log_format_long=False, console_log_format_clean=False, file_log_level=20, file_log_format_long=True, redirect_stderr=True, formatter=None, formatter_file=None) → logging.Logger[source]¶ Create a console logger
Parameters: - name (str, optional) – Name of the logger. Default = “root”
- log_file (str, optional) – The name of the log file in case we want to write it to file. If it is not specified, no file is created
- console_log_level (int, optional) – The level of the console output. Defaults to logging.INFO
- console_log_format_long (bool) – Use a long informative format for the logging output to the console
- console_log_format_clean (bool) – Use a very clean format for the logging output. If given together with console_log_format_long an AssertionError is raised
- file_log_level (int, optional) – In case the log file is used, specify the log level. Can be different from the console log level. Defaults to logging.INFO
- file_log_format_long (bool, optional) – Use a longer format for the file output. Default to True
- redirect_stderr (bool, optional) – If True the stderr output is written to a file with the extension .err instead of .out. Default = True
- formatter (Formatter) – A formatter can also be explicitly passed
- formatter_file (Formatter) – A formatter can also be explicitly passed
Returns: The handle to the logger which we can use to create output to the screen using the logging module
Return type: Examples
Create a logger at the default (INFO) verbosity level, so no debug information is generated
>>> logger = create_logger()
>>> logger.debug("This is a debug message")
The info and warning messages are both printed
>>> logger.info("This is a information message")
INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message
Create a logger at the debug level
>>> logger = create_logger(console_log_level=logging.DEBUG)
>>> logger.debug("This is a debug message")
DEBUG : This is a debug message
>>> logger.info("This is a information message")
INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message
Create a logger at the warning level. All output is suppressed, except for the warnings
>>> logger = create_logger(console_log_level=logging.WARNING)
>>> logger.debug("This is a debug message")
>>> logger.info("This is a information message")
>>> logger.warning("This is a warning message")
WARNING : This is a warning message
It is also possible to redirect the output to a file. The file name is given without an extension, as two files are created: one with the extension .out and one with the extension .err, for the normal user-generated output and the system error output respectively.
>>> data_dir = os.path.join(os.path.split(__file__)[0], "..", "..", "data")
>>> file_name = os.path.join(data_dir, "log_file")
>>> logger = create_logger(log_file=file_name, console_log_level=logging.INFO,
...                        file_log_level=logging.DEBUG, file_log_format_long=False)
>>> logger.debug("This is a debug message")
>>> logger.info("This is a information message")
INFO : This is a information message
>>> logger.warning("This is a warning message")
WARNING : This is a warning message
>>> print("system normal message")
system normal message
>>> print("system error message", file=sys.stderr)
At this point, two files have been generated, log_file.out and log_file.err. The first contains the normal logging output, whereas the second contains error messages generated by other packages which do not use the logging module. Note that the normal print statement shows up in the console but not in the file, whereas the second print statement to stderr does not show on the screen but is written to log_file.err
To show the contents of the generated files we do
>>> with open(file_name+".out", "r") as fp:
...     for line in fp.readlines():
...         print(line.strip())
DEBUG : This is a debug message
INFO : This is a information message
WARNING : This is a warning message
>>> sys.stderr.flush()  # forces to flush the stderr output buffer to file
>>> with open(file_name + ".err", "r") as fp:
...     for line in fp.readlines():
...         print(line.strip())
system error message
References
-
cbs_utils.misc.dataframe_clip_strings(df, max_width, include=None, exclude=None)[source]¶ Clip all strings in a dataframe
Parameters: Returns: Return type: Pandas data frame with clipped string columns
-
cbs_utils.misc.delete_module(modname, paranoid=None)[source]¶ Delete a previously loaded module from memory
Parameters:
-
cbs_utils.misc.get_branch(default_branch=None)[source]¶ Get the current git branch of this questionnaire
Parameters: default_branch (str) – The default name given to the branch if nothing can be found Returns: current branch version Return type: str
-
cbs_utils.misc.get_clean_version(version) → str[source]¶ Turn the full version string into a clean one without the build part
Parameters: version (str) – The version string as returned by versioneer. Returns: The clean version string Return type: str
Notes
The version string matches the following regular expression
“([.|\d]+)([+]*)(.*)”
This function returns the clean version string given by the part “([.|\d]+)”
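To illustrate how the first group of that regular expression isolates the clean version, here is a minimal sketch. It assumes the pattern reads `([.|\d]+)([+]*)(.*)`, with `\d` for digits, and get_clean_version_sketch is a hypothetical name rather than the library function.

```python
import re

# Assumed pattern: leading run of digits and dots, optional '+', then the build part.
VERSION_PATTERN = re.compile(r"([.|\d]+)([+]*)(.*)")


def get_clean_version_sketch(version):
    """Return the leading 'digits and dots' part of a version string."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        # Assumption: return the input unchanged when it does not start with a number.
        return version
    return match.group(1)


clean_versions = [
    get_clean_version_sketch("1.3"),
    get_clean_version_sketch("2.5+dev.g43429"),
    get_clean_version_sketch("4.3.1+dev.g43429-dirty"),
]
```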
Examples
>>> get_clean_version("1.3")
'1.3'
>>> get_clean_version("2.5+dev.g43429")
'2.5'
>>> get_clean_version("4.3.1+dev.g43429-dirty")
'4.3.1'
-
cbs_utils.misc.get_dir_size(directory_name)[source]¶ Return the size of the given directory in bytes
Parameters: directory_name (str) – Name of the directory Returns: Size of the directory in bytes Return type: int
Notes
- Just a one-liner using pathlib
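A one-liner of this kind could look like the sketch below. The exact expression used by get_dir_size is not shown in this documentation, so get_dir_size_sketch is an assumption; it sums the sizes of all regular files below the directory.

```python
import tempfile
from pathlib import Path


def get_dir_size_sketch(directory_name):
    """Sum the sizes in bytes of all files found recursively below directory_name."""
    return sum(p.stat().st_size for p in Path(directory_name).rglob("*") if p.is_file())


# Demonstration on a temporary directory holding a 10-byte and a 5-byte file.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "a.bin").write_bytes(b"x" * 10)
    sub_dir = Path(tmp_dir, "sub")
    sub_dir.mkdir()
    Path(sub_dir, "b.bin").write_bytes(b"y" * 5)
    total_size = get_dir_size_sketch(tmp_dir)
```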
-
cbs_utils.misc.get_logger(name) → logging.Logger[source]¶ Get the logger of the current level and set the level based on the main routine. Then return it
Parameters: name (str) – the name of the logger to set. Returns: log: a handle of the current logger Return type: logging.Logger
Notes
This routine is used on top of each function to get the handle to the current logger and automatically set the verbosity level of the logger based on the main function
Examples
Assume you define a function which needs to generate logging information based on the logger created in the main program. In that case you can do
>>> def small_function():
...     logger = get_logger(__name__)
...     logger.info("Inside 'small_function' This is information to the user")
...     logger.debug("Inside 'small_function' This is some debugging stuff")
...     logger.warning("Inside 'small_function' This is a warning")
...     logger.critical("Inside 'small_function' The world is collapsing!")
The logger can be created in the main program using the create_logger routine
>>> def main(logging_level):
...     main_logger = create_logger(console_log_level=logging_level)
...     main_logger.info("Some information in the main")
...     main_logger.debug("Now we are calling the function")
...     small_function()
...     main_logger.debug("We are back in the main function")
Let’s call the main function in DEBUG mode
>>> main(logging.DEBUG)
INFO : Some information in the main
DEBUG : Now we are calling the function
INFO : Inside 'small_function' This is information to the user
DEBUG : Inside 'small_function' This is some debugging stuff
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!
DEBUG : We are back in the main function
You can see that the logging level inside the small_function is obtained from the main level. Do the same but now in the normal information mode
>>> main(logging.INFO)
INFO : Some information in the main
INFO : Inside 'small_function' This is information to the user
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!
We can call in the silent mode, suppressing all debugging and normal info, but not warnings
>>> main(logging.WARNING)
WARNING : Inside 'small_function' This is a warning
CRITICAL : Inside 'small_function' The world is collapsing!
Finally, to suppress everything except for critical warnings
>>> main(logging.CRITICAL)
CRITICAL : Inside 'small_function' The world is collapsing!
-
cbs_utils.misc.get_path_depth(path_name)[source]¶ Get the depth of a path or file name
Parameters: path_name (str) – Path name to get the depth from Returns: depth of the path Return type: int
Examples
>>> get_path_depth(r"C:\Anaconda")
1
>>> get_path_depth(r"C:\Anaconda\share")
2
>>> get_path_depth(r"C:\Anaconda\share\pywafo")
3
>>> get_path_depth(r".\imaginary\path\subdir\share")
4
-
cbs_utils.misc.get_python_version_number(version_info) → str[source]¶ Turn the version info as obtained from sys.version_info into a dotted version number
Parameters: version_info – The version info as obtained from sys.version_info Returns: A string with the current Python version in dotted form, e.g. 3.5.3 Return type: str
Examples
>>> version_string = get_python_version_number(sys.version_info)
-
cbs_utils.misc.get_regex_pattern(search_pattern)[source]¶ Routine to turn a string into a regular expression which can be used to match a string
Parameters: search_pattern (str) – A regular expression in the form of a string Returns: A regular expression as returned by the re.compile function, or None in case an invalid regular expression was given Return type: None or compiled regular expression
Notes
An empty string or an invalid search_pattern will yield a None return
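The behaviour described above (None for an empty or invalid pattern, a compiled expression otherwise) can be sketched as follows. The name get_regex_pattern_sketch is hypothetical, and the real function may do additional processing.

```python
import re


def get_regex_pattern_sketch(search_pattern):
    """Compile search_pattern, returning None when it is empty or not a valid regex."""
    if not search_pattern:
        return None
    try:
        return re.compile(search_pattern)
    except re.error:
        # An invalid expression, e.g. an unbalanced parenthesis, yields None.
        return None


empty_result = get_regex_pattern_sketch("")
invalid_result = get_regex_pattern_sketch("(")
valid_result = get_regex_pattern_sketch("^wafo")
```

Returning None instead of raising lets the caller treat "no pattern given" and "unusable pattern" in the same way, for instance by skipping the filter altogether.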
-
cbs_utils.misc.get_time_stamp_from_string(string_with_date_time, yearfirst=True, dayfirst=False, timezone=None)[source]¶ Try to get a date/time stamp from a string
Parameters: - string_with_date_time (str) – The string to analyse
- yearfirst (bool, optional) – If True, put the year first. See dateutil.parser. Default = True
- dayfirst (bool, optional) – If True, put the day first. See dateutil.parser. Default = False
- timezone (str or None, optional) – If given, try to add this time zone
Returns: Pandas date/time stamp
Return type: DateTime
Examples
The date/time in the file name ‘AMSBALDER_160929T000000’ is 29 Sep 2016 and does not have a time zone specification. The returned time stamp therefore has no time zone either
>>> file_name = "AMSBALDER_160929T000000"
>>> time_stamp = get_time_stamp_from_string(string_with_date_time=file_name)
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000 has time stamp 2016-09-29 00:00:00
We can also force a time zone to be added. The Etc/GMT-2 time zone is the UTC+2 time zone, which corresponds to Central European Summer Time (CEST), i.e. the Europe/Amsterdam summer time.
>>> time_stamp = get_time_stamp_from_string(string_with_date_time=file_name,
...                                         timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000 has time stamp 2016-09-29 00:00:00+02:00
This time we assume the file name already contains a time zone, UTC + 2 hours. Since we already have a time zone, the timezone option can only convert the date/time to the specified time zone.
>>> file_name = "AMSBALDER_160929T000000+02"
>>> time_stamp = get_time_stamp_from_string(string_with_date_time=file_name,
...                                         timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000+02 has time stamp 2016-09-29 00:00:00+02:00
In case the time zone given by the timezone option differs from the time zone in the file name, the date/time is converted to the given time zone
>>> file_name = "AMSBALDER_160929T000000+00"
>>> time_stamp = get_time_stamp_from_string(string_with_date_time=file_name,
...                                         timezone="Etc/GMT-2")
>>> print("File name {} has time stamp {}".format(file_name, time_stamp))
File name AMSBALDER_160929T000000+00 has time stamp 2016-09-29 02:00:00+02:00
-
cbs_utils.misc.get_value_magnitude(value, convert_to_base_units=True)[source]¶ Get the magnitude of value with Pint dimension in terms of its base units or just return a float if value does not have a dimension
Parameters: Returns: Magnitude of the value in case a Pint Quantity was added to the input or just the value itself. If convert_to_base_units was set to True the value is first converted to its SI base units
Return type: Examples
Assume we have a value with a pint dimension
>>> velocity = Q_("2.5 m/s")
>>> print("Current velocity with dimension is: {}".format(velocity))
Current velocity with dimension is: 2.5 meter / second
We can now get the magnitude of velocity using this function as
>>> velocity_mag = get_value_magnitude(velocity)
>>> print("Velocity without dimension is: {}".format(velocity_mag))
Velocity without dimension is: 2.5
In case the input argument of get_value_magnitude is a float and does not have a dimension, the value itself is returned
>>> velocity_mag2 = get_value_magnitude(velocity_mag)
>>> print("Velocity without dimension is: {}".format(velocity_mag2))
Velocity without dimension is: 2.5
In case we have a dimension in non-SI units, the value is by default first converted to its SI base units.
>>> velocity_knots = Q_("1 knot")
>>> velocity_mag = get_value_magnitude(velocity_knots)
>>> print("Velocity {} is converted to its magnitude in m/s: {:.2f}"
...       "".format(velocity_knots, velocity_mag))
Velocity 1 knot is converted to its magnitude in m/s: 0.51
In case the convert_to_base_units flag is False we just get the magnitude in the same units as the input argument
>>> velocity_knots = Q_("2.5 knot")
>>> velocity_mag = get_value_magnitude(velocity_knots, convert_to_base_units=False)
>>> print("Velocity {} is converted to its magnitude in knots: {:.2f}"
...       "".format(velocity_knots, velocity_mag))
Velocity 2.5 knot is converted to its magnitude in knots: 2.50
Notes
- This function is used inside other functions in which it is not known beforehand whether an input argument is passed with or without a Pint dimension, and in which we are only interested in the magnitude of the value. Use this function to get the magnitude
-
cbs_utils.misc.get_version(default_version=None)[source]¶ Get the current git version of this questionnaire
Returns: current git version Return type: str
-
cbs_utils.misc.is_exe(fpath)[source]¶ Test if a file is an executable
Parameters: fpath (str) – Path of the file to test Returns: True in case fpath is a file that can be executed, else False Return type: bool
Notes
This function can only be used on Linux file systems, as the which command is used to identify the location of the program.
-
cbs_utils.misc.is_postcode(postcode)[source]¶ Check whether a string is a postcode
Parameters: postcode (str) – The string to check Returns: True if it is a postcode Return type: bool
-
cbs_utils.misc.make_directory(directory)[source]¶ Create a directory in case it does not yet exist.
Parameters: directory (Path or str) – Name of the directory to create
Notes
This function creates a directory without the caller having to check whether it already exists. If the directory already exists, we silently continue.
Example
If you want to create a directory ‘outdir’, just do:
make_directory("outdir")
The directory is created if it doesn’t exist, or, we just continue silently if it already exists
Raises: OSError – The OSError is only raised if it is not an EEXIST error. This implies that the creation of the directory failed for a reason other than the directory already being present. It could be that the file system is full or that we do not have write permission
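The EEXIST handling described above can be sketched with the standard library as follows. The name make_directory_sketch is hypothetical, and the temporary directory is only used to keep the demonstration self-contained.

```python
import errno
import os
import tempfile


def make_directory_sketch(directory):
    """Create directory; stay silent if it exists, re-raise any other OSError."""
    try:
        os.makedirs(str(directory))
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            # Disk full, no write permission, ...: let the caller know.
            raise


with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outdir")
    make_directory_sketch(out_dir)
    make_directory_sketch(out_dir)  # the second call continues silently
    created = os.path.isdir(out_dir)
```

In Python 3 a similar effect can be obtained with `os.makedirs(directory, exist_ok=True)`, which raises only when the existing path is not a directory.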
-
cbs_utils.misc.merge_loggers(main_logger, logger_name_to_merge, logger_level_to_merge=20)[source]¶ Add the logger of an external module to the local logger
Parameters: Returns: merged logger
Return type: Logger
Examples
In case you have created a logger in your script with the create_logger function
>>> logger = create_logger()
And you have also created a module file your_module.py with its own logger
>>> module_logger = logging.getLogger(__name__)
In this case you would use the __name__ variable in ‘your_module’, so this logger is called ‘your_module’
Now in case you want to add the logger of ‘your_module’ to the local logger of your script, do
>>> merge_loggers(logger, 'your_module')
Now all the logger statements in ‘your_module’ are also added to the logger output
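One way to obtain this effect with the standard logging module is to attach the handlers of the main logger to the logger of the other module. Whether merge_loggers works this way internally is an assumption; merge_loggers_sketch, ListHandler and the logger names below are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list so that the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def merge_loggers_sketch(main_logger, logger_name_to_merge, logger_level_to_merge=logging.INFO):
    """Let the named logger write to the same handlers as main_logger."""
    logger_to_merge = logging.getLogger(logger_name_to_merge)
    logger_to_merge.setLevel(logger_level_to_merge)
    for handler in main_logger.handlers:
        if handler not in logger_to_merge.handlers:
            logger_to_merge.addHandler(handler)
    return logger_to_merge


main_logger = logging.getLogger("sketch_main")
list_handler = ListHandler()
main_logger.addHandler(list_handler)

merge_loggers_sketch(main_logger, "sketch_module")
logging.getLogger("sketch_module").info("message from the module")
```

After the merge, the message logged by "sketch_module" ends up in the handler that was originally attached to the main logger only.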
-
cbs_utils.misc.move_script_path_to_back_of_search_path(script_file, append_at_the_end=True) → list[source]¶ Move the name of a script to the front or the back of the search path
Parameters: Returns: The new system path stored in a list
Return type: Notes
This function is sometimes required if the __version string conflicts with another __version string
Examples
sys.path = move_script_path_to_back_of_search_path(__file__)
Create a banner for printing a bigger title above each section in the log output
Parameters: - title – The title to print
- top_symbol (str) – the symbol used for the top line. Default value = “-“
- bottom_symbol (str) – the symbol used for the bottom line. Assume same as top if None is given (Default value = None)
- side_symbol (str) – The side symbol. Assume same as top if None is given, except if top is -, then take | (Default value = None)
- width (int) – the width of the banner (Default value = 80)
- no_top_and_bottom (bool) – make a simple print without the top and bottom line (Default value = False)
- to_stdout (bool, optional) – Print the banner to the standard output of the console instead of the logging system. Defaults to False
Examples
>>> logger = create_logger(console_log_format_clean=True)
>>> print_banner("This is the start of a section")
<BLANKLINE>
--------------------------------------------------------------------------------
| This is the start of a section |
--------------------------------------------------------------------------------
Notes
Unless the option ‘to_stdout’ is set to True, the banner is printed via the logging system. Therefore, a logger needs to be created first using create_logger
-
cbs_utils.misc.query_yes_no(question, default_answer='no')[source]¶ Ask the user a yes/no question via raw_input() and return the answer.
Parameters: Returns: “yes” or “no”, depending on the input of the user
Return type:
-
cbs_utils.misc.range1(start=None, stop=None)[source]¶ Return a range including the end value
Parameters: Returns: Range of integer values between start and stop, including the stop value
Return type:
-
cbs_utils.misc.read_settings_file(file_name) → dict[source]¶ Read the yaml file to get the setup information.
Parameters: file_name (str) – Name of the configuration file. Can be a full path name as well Returns: All the settings as obtained from the yaml configuration file Return type: dict Notes
The file name of the yaml file is searched for in the following order
- The current directory where the script is executed. If a full path is given, this will be accepted too.
- The directory where the original script is located.
In this way, a default settings file can be put in the script directory and the user does not need to copy it unless a setting value needs to be changed
Raises: AssertionError: – In case the file can not be found
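The search order described above can be sketched without any YAML parsing. This is only an illustration of the lookup: find_settings_file_sketch, its script_directory argument and the file names are assumptions, and the real function also reads the file contents.

```python
import os
import tempfile


def find_settings_file_sketch(file_name, script_directory):
    """Look for file_name as given first, then in the directory of the script."""
    candidates = [
        file_name,  # current directory, or a full path if one was given
        os.path.join(script_directory, os.path.basename(file_name)),
    ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    raise AssertionError("Settings file not found: {}".format(file_name))


# Demonstration: the file only exists in the (temporary) script directory.
with tempfile.TemporaryDirectory() as script_dir:
    default_file = os.path.join(script_dir, "sketch_default_settings_98765.yml")
    with open(default_file, "w") as stream:
        stream.write("answer: 42\n")
    found = find_settings_file_sketch("sketch_default_settings_98765.yml", script_dir)
    found_matches_default = os.path.samefile(found, default_file)
```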
-
cbs_utils.misc.read_value_from_dict_if_valid(dictionary, key, default_value=None)[source]¶ small routine to read a value from a dictionary. If the value is not set, just return the default value
Parameters: - dictionary – dictionary which is supposed to carry this key field
- key – the name of the field to read the value from
- default_value – default value in case we fail to read the key field (if it does not exist)
Returns: value of the key field or the default value
Return type:
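A minimal sketch of such a lookup with a fallback, assuming that "not set" means that the key is absent from the dictionary. The name read_value_from_dict_if_valid_sketch and the example settings are hypothetical.

```python
def read_value_from_dict_if_valid_sketch(dictionary, key, default_value=None):
    """Return dictionary[key], or default_value when the key does not exist."""
    try:
        return dictionary[key]
    except KeyError:
        return default_value


settings = {"width": 80}
width = read_value_from_dict_if_valid_sketch(settings, "width", default_value=100)
height = read_value_from_dict_if_valid_sketch(settings, "height", default_value=25)
```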
-
cbs_utils.misc.scan_base_directory(walk_dir='.', supplied_file_list=None, file_has_string_pattern='', file_has_not_string_pattern='', dir_has_string_pattern='', dir_has_not_string_pattern='', start_date_time=None, end_date_time=None, time_zone=None, time_stamp_year_first=True, time_stamp_day_first=False, extension=None, max_depth=None, sort_file_base_names=False)[source]¶ Recursively scan the directory walk_dir and get all files underneath obeying the search strings and/or date/time ranges
Parameters: - walk_dir (str, optional) – The base directory to start the import. Default = “.”
- supplied_file_list (list, optional) – In case walk dir is not given we can explicitly pass a file list to analyse. Default = None
- dir_has_string_pattern (str, optional) – Requires the directory name to have this pattern (Default value = “”). This selection is only made on the first directory level below the walk_dir
- dir_has_not_string_pattern (str, optional) – Requires the directory name NOT to have this pattern (Default value = “”). This selection is only made on the first directory level below the walk_dir
- file_has_string_pattern (str, optional) – Requires the file name to have this pattern (Default value = “”, i.e. matches all)
- file_has_not_string_pattern (str, optional) – Requires the file name NOT to have this pattern (Default value = “”)
- extension (str or None, optional) – Extension of the file to match. If None, also matches. Default = None
- max_depth (int, optional) – Sets a maximum depth to which the search is carried out. Default = None, which does not limit the search depth. For deep file structures setting a limit to the search depth speeds up the search.
- sort_file_base_names (bool, optional) – If True, sort the resulting file list alphabetically based on the file base name. Default = False
- start_date_time (DateTime or None, optional) – If given, get the date time from the current file name and only add the files with a date/time equal to or later than the start_date_time. Default is None
- end_date_time (DateTime or None, optional) – If given, get the date time from the current file name and only add the files with a date/time smaller than the end_date_time. Default is None
- time_zone (str or None, optional) – If given add this time zone to the file stamp. The start and end time should also have a time zone
- time_stamp_year_first (bool, optional) – Passed to the datetime parser. If true, the year is first in the date/time string. Default = True
- time_stamp_day_first (bool, optional) – Passed to the datetime parser. If true, the day is first in the date/time string. Default = False
Returns: All the file names found below the input directory walk_dir obeying all the search strings
Return type: Examples
Find all the Python files under the share directory in the Anaconda installation folder
>>> scan_dir = r"C:\Anaconda\share"
>>> file_list = scan_base_directory(scan_dir, extension='.py')
Find all the python files under the share directory in the Anaconda installation folder belonging to the pywafo directory
>>> file_list = scan_base_directory(scan_dir, extension='.py', dir_has_string_pattern="wafo")
Note that wafo matches on the directory ‘pywafo’, which is the first directory level below the scan directory. However, if we would match on ‘^wafo’ the returned list would be empty as the directory has to start with wafo.
In order to get all the files with “test” in the name, down to a maximum directory depth of 3, do
>>> file_list = scan_base_directory(scan_dir, extension='.py', dir_has_string_pattern="wafo",
...                                 file_has_string_pattern="test", max_depth=3)
Test the date/time boundaries. First create a list of file names from 28 Sep 2017 00:00 to 03:00 with a 30-minute interval
>>> file_names = ["AMS_{}.mdf".format(dt.strftime("%y%m%dT%H%M%S")) for dt in
...               pd.date_range("20170928T000000", "20170928T030000", freq="30min")]
>>> for file_name in file_names:
...     print(file_name)
AMS_170928T000000.mdf
AMS_170928T003000.mdf
AMS_170928T010000.mdf
AMS_170928T013000.mdf
AMS_170928T020000.mdf
AMS_170928T023000.mdf
AMS_170928T030000.mdf
Use the scan_base_directory to get the files within a specific date/time range
>>> file_selection = scan_base_directory(supplied_file_list=file_names,
...                                      start_date_time="20170928T010000", end_date_time="20170928T023000")
>>> for file_name in file_selection:
...     print(file_name)
AMS_170928T010000.mdf
AMS_170928T013000.mdf
AMS_170928T020000.mdf
Note that the selected range runs from 1:00 am until 2:00 am; the end_date_time of 2:30 am itself is not included
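The half-open interval (start included, end excluded) can be illustrated with the standard library alone. This sketch is not how scan_base_directory is implemented: select_in_time_range_sketch is a hypothetical helper that assumes file names of the form 'AMS_<stamp>.mdf'.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%y%m%dT%H%M%S"


def select_in_time_range_sketch(file_names, start_date_time, end_date_time):
    """Keep the files whose time stamp satisfies start <= stamp < end."""
    selection = []
    for file_name in file_names:
        # Assumed naming convention: 4-character prefix 'AMS_' and a 13-character stamp.
        stamp = datetime.strptime(file_name[4:17], STAMP_FORMAT)
        if start_date_time <= stamp < end_date_time:
            selection.append(file_name)
    return selection


first = datetime(2017, 9, 28, 0, 0, 0)
file_names = [
    "AMS_{}.mdf".format((first + timedelta(minutes=30 * i)).strftime(STAMP_FORMAT))
    for i in range(7)
]
file_selection = select_in_time_range_sketch(
    file_names, datetime(2017, 9, 28, 1, 0), datetime(2017, 9, 28, 2, 30)
)
```

With a half-open interval, consecutive ranges such as 01:00 to 02:30 and 02:30 to 04:00 never select the same file twice.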
-
cbs_utils.misc.set_default_dimension(parse_value, default_dimension=None, force_default_units=False)[source]¶ Add a pint dimension to a value
Parameters: - parse_value (ndarray or str or float) – Value with optional a dimension written in the form of a str. Can be an array or list of strings as well
- default_dimension (str) – Required default dimension
- force_default_units (bool) – If true the only allowed dimension is the default dimension. Raise an error in case this is not the case. Default = False
Returns: Value with the quantity as given by the default
Return type: Quantity
Raises: AssertionError – In case the parse_value argument already has a dimension, but:
- Its dimensionality is not the same as the dimensionality of the default_dimension
- Its unit is not the same as the unit of the default_dimension and the force_default_units flag is set to True
Notes
- This function is an add-on to the pint module, a package to define, operate and manipulate physical quantities: https://pypi.python.org/pypi/Pint.
- This function is used to add a dimension to a value which is parsed from a text file.
- It is checked if the value given in the text file has dimension already, for example that it was given as “1.0 m/s”.
- If a dimension was given already: check if the dimensionality (in this case: Length/Time) is the same as the dimensionality of the default_dimension input argument.
- In case the input value does not have an explicit dimension, the dimension given by default_dimension is added to the value.
- This function works on both scalar and list values
Examples
Assume we want to read input values from a text file as plain numbers and we want to add a default dimension of meter to them in case a value does not have an explicit dimension yet. Just do
>>> logger = create_logger(console_log_level=logging.CRITICAL)
>>> value_without_dimension = 1.0  # this is the value as we read it from the text file
>>> value_with_dimension = set_default_dimension(value_without_dimension, "meter")
>>> print(value_with_dimension)
1.0 meter
The variable value_with_dimension is now a pint quantity which carries the dimension meter.
In case the input variable already has a dimension, we should also be able to use this function. The only requirement is that the dimensionality is the same. So this should work
>>> value_with_dimension = set_default_dimension("2.5 meter", "meter")
>>> print(value_with_dimension)
2.5 meter
This should work as well
>>> value_with_dimension = set_default_dimension("5.0 mm", "meter")
>>> print(value_with_dimension)
5.0 millimeter
But this fails, as the dimensionality of the input argument does not correspond to the dimensionality of the default dimension
>>> try:
...     value_with_dimension = set_default_dimension("5.0 mm", "second")
... except AssertionError:
...     print("This fails because the dimensionality is not the same")
This fails because the dimensionality is not the same
This function should also work for arrays and lists
>>> values_without_dimension = np.linspace(0, 1, num=5, endpoint=True)
>>> values_with_dimension = set_default_dimension(values_without_dimension, "meter/second^2")
>>> print(values_with_dimension)
[0. 0.25 0.5 0.75 1. ] meter / second ** 2
Notes
- Hz are not converted to rad/s as expected. Therefore do not try to use this to convert Hz -> rad/s
- If the input argument parse_val is None, a None is returned as output as well
-
cbs_utils.misc.set_value_if_valid(value, new_value)[source]¶ Small routine that sets a value only if the new value is not None; otherwise the original value is kept
Parameters: - value – the original value which you can pre-define with a default value
- new_value – the new value. It is only used if it is not None
Returns: the new value, or the original value if new_value was None
Return type:
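The behaviour described above can be sketched in a single expression. This is a minimal illustration, not the library's source; the name set_value_if_valid_sketch and the timeout variable are hypothetical.

```python
def set_value_if_valid_sketch(value, new_value):
    """Return new_value unless it is None, in which case value is kept."""
    return value if new_value is None else new_value


# Hypothetical usage: a pre-defined default that may be overruled
timeout = 5.0
timeout = set_value_if_valid_sketch(timeout, None)  # None given: default is kept
timeout = set_value_if_valid_sketch(timeout, 2.0)   # valid value: default is replaced
print(timeout)
```

Note that the check is on None only, so a falsy value such as 0 still replaces the original in this sketch.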
-
cbs_utils.misc.standard_postcode(postcode)[source]¶ Convert a postcode to its standard form
Parameters: postcode (str) – Postcode string in non-standard form, such as 2613 AB, 2613ab, etc. Returns: Postcode in standard form: 2613AB Return type: str
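A minimal sketch of such a normalisation, assuming that the standard form is obtained by removing all whitespace and converting the letters to upper case. The function name standard_postcode_sketch is hypothetical and the real implementation may handle more cases.

```python
import re


def standard_postcode_sketch(postcode):
    """Remove all whitespace and convert the letters to upper case."""
    return re.sub(r"\s+", "", postcode).upper()


for raw in ("2613 AB", "2613ab", " 2613 ab "):
    print(standard_postcode_sketch(raw))
```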
-
cbs_utils.misc.valid_date(s)[source]¶ Check if supplied data s is a valid date for the format Year-Month-Day
Parameters: s (str) – A valid date in the form of YYYY-MM-DD, so first the year, then the month, then the day Returns: Date object with the year, month, day obtained from the valid string representation Return type: datetime Raises: argparse.ArgumentTypeError Notes
This is a helper function for the argument parser module argparse which allows you to check if the argument passed on the command line is a valid date.
Examples
This is the direct usage of valid_date to see if the date supplied is of format YYYY-MM-DD
>>> try:
...     date = valid_date("1973-11-12")
... except argparse.ArgumentTypeError:
...     print("This date is invalid")
... else:
...     print("This date is valid")
This date is valid
In case an invalid date is supplied
>>> try:
...     date = valid_date("1973-15-12")
... except argparse.ArgumentTypeError:
...     print("This date is invalid")
... else:
...     print("This date is valid")
This date is invalid
Here it is demonstrated how to add a ‘--startdate’ command line option to the argparse parser which checks if a valid date is supplied
>>> parser = argparse.ArgumentParser()
>>> p = parser.add_argument("--startdate",
...                         help="The Start Date - format YYYY-MM-DD ",
...                         required=True,
...                         type=valid_date)
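For readers without the package at hand, the pattern can be reproduced with the standard library only. The sketch below assumes that the check is a datetime.strptime call with the format "%Y-%m-%d"; valid_date_sketch is a hypothetical stand-in, not the library's code.

```python
import argparse
from datetime import datetime


def valid_date_sketch(s):
    """Parse a YYYY-MM-DD string or raise the error type argparse expects."""
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"Not a valid date: {s!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--startdate", required=True, type=valid_date_sketch,
                    help="The Start Date - format YYYY-MM-DD")

# Parse an explicit argument list instead of sys.argv to keep the demo self-contained
args = parser.parse_args(["--startdate", "1973-11-12"])
print(args.startdate.date())
```

Raising argparse.ArgumentTypeError inside the type callable lets argparse report the problem as a regular usage error instead of a traceback.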
References
https://stackoverflow.com/questions/25470844/specify-format-for-input-arguments-argparse-python
cbs_utils.network module¶
-
class
cbs_utils.network.ActiveDirectory(server: str = None)[source]¶ Bases:
object
Active Directory representation. This object is intended to have all functionality that one would normally want/have from an Active Directory.
-
default_attrs= ['name', 'member', 'objectClass', 'adspath', 'primaryGroupToken', 'primaryGroupID']¶
-
default_groups= ['Administrators', 'Account Operators', 'Backup Operators', 'Server Operators', 'DnsAdmins', 'Domain Admins', 'Exchange Administrators', 'Exchange Services', 'DHCP Administrators']¶
-
get_group_info(name, searchRoot: str = None, category: str = 'Group', attributes: list = None) → Dict[str, str][source]¶ Obtain the group information for the given name
Parameters: - name – profile name
- searchRoot – where to start looking
- category – what category to look in (by default the Group)
- attributes – additional attributes, examples are ‘name’, ‘member’, ‘objectClass’
Returns: A Field : Value dictionary of all found information
-
get_group_members(strLdap: str, attributes: list = None) → List[Dict[str, object]][source]¶ Look up a group’s members.
Parameters: - strLdap – the group’s adspath attribute
- attributes – attributes to append to the search query
Returns: List of dictionaries, each dictionary item has a name and an indicator of whether it is a group
-
get_member_info(member: str, strLdap: str = None, attributes: List[str] = None) → Dict[str, str][source]¶ Returns user info. If there is no list of attributes given, it will use a default list (for testing purposes).
-
get_primary_group(token, searchRoot: str = None, header=' ')[source]¶ Used to look up users whose Primary Group is set to one of the groups we’re looking up. This is necessary as AD uses that attribute to calculate a group’s membership. These types of users do not show up if you query the group’s member field directly.
searchRoot is the part of the LDAP tree that you want to start searching from. token is the groups primaryGroupToken.
-
cbs_utils.plotting module¶
Definition of CBS rgb colors. Based on the color rgb definitions from the CBS LaTeX template
-
class
cbs_utils.plotting.CBSPlotSettings(fig_width_in_inch: float = None, fig_height_in_inch: float = None, number_of_figures_cols: int = 1, number_of_figures_rows: int = 2, text_width_in_pt: float = 392.64813, text_height_in_pt: float = 693, text_margin_bot_in_inch: float = 1.0, ratio_option='golden_ratio', plot_parameters: dict = None, color_palette: str = 'koel', font_size: int = 8)[source]¶ Bases:
object
Class to hold the figure size for a standard document
Parameters: - number_of_figures_rows (int, optional) – Number of figure rows, default = 2
- number_of_figures_cols (int, optional) – Number of figure cols, default = 1
- text_width_in_pt (float, optional) – Width of the text in pt, default = 392.64813
- text_height_in_pt (float, optional) – Height of the text in pt: default = 693
- text_margin_bot_in_inch (float, optional) – Space at the bottom in inch. Default = 1 inch
- text_height_in_inch (float, optional) – Explicitly overrules the calculated text height if not None. Default = None
- text_width_in_inch (float, optional) – Explicitly overrules the calculated text width if not None. Default = None
- plot_parameters (dict, optional) – Dictionary with plot settings. If None (default), take the cbs defaults
- color_palette ({"koel", "warm"}, optional) – Pick color palette for the plot. Default is “koel”
- font_size (int, optional) – Size of all fonts. Default = 8
Notes
- The variables are set to make sure that the figures have the exact same size as in the document, such that we do not have to rescale them. In this way the fonts will have the same size here as in the document
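The idea in the note above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the class's actual code: it assumes the text width is given in TeX points (72.27 pt per inch), that the width is divided evenly over the figure columns, and that the ‘golden_ratio’ option means width / height equals the golden ratio. The function figure_size_in_inch is hypothetical.

```python
# Assumption: the text width is measured in TeX points, 72.27 pt per inch
POINTS_PER_INCH = 72.27
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2


def figure_size_in_inch(text_width_in_pt=392.64813, number_of_figures_cols=1):
    """Return (width, height) in inch for a figure spanning one column."""
    width = text_width_in_pt / POINTS_PER_INCH / number_of_figures_cols
    height = width / GOLDEN_RATIO
    return width, height


width, height = figure_size_in_inch()
print(f"{width:.3f} x {height:.3f} inch")
```

A figure created with this size can be included in the document at scale 1, so an 8 pt font in the plot stays 8 pt on the page.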
-
cbs_utils.plotting.add_axis_label_background(fig, axes, alpha=1, margin=0.05, x0=None, y0=None, loc='east', radius_corner_in_mm=1, logo_margin_x_in_mm=1, logo_margin_y_in_mm=1, add_logo=True, aspect=None)[source]¶ Add a background to the axis label
Parameters: - fig (mpl.figure.Figure object) – The total canvas of the Figure
- axes (mpl.axes.Axes object) – The axes of the plot to add a box
- alpha (float, optional) – Transparency of the box. Default = 1 (not transparent)
- margin (float, optional) – The margin between the labels and the side of the gray box
- loc ({"east", "south"}) – Location of the background. Default = “east” (left of the y-axis). Only “east” and “south” are implemented
- add_logo (bool, optional) – If true, add the cbs logo. Default = True
- radius_corner_in_mm (float, optional) – Radius of the corner in mm. Default = 1
- logo_margin_x_in_mm (float) – Distance from bottom of logo in mm. Default = 1
- logo_margin_y_in_mm (float) – Distance from left of logo in mm. Default = 1
-
cbs_utils.plotting.add_cbs_logo_to_plot(fig, axes=None, margin_x_in_mm=6.0, margin_y_in_mm=6.0, x0=0, y0=0, width=None, height=None, zorder_start=1)[source]¶
-
cbs_utils.plotting.add_cbs_pnglogo_to_plot(fig, axes=None, image=None, margin_x=6, margin_y=6, loc='lower left', zorder=10, color='blauw', alpha=1.0, logo_width_in_mm=3.234, logo_height_in_mm=4.995, resample=False)[source]¶ Add a CBS logo to a plot
Parameters: - fig (mpl.pyplot.axes.Axes object) –
- image (mpl.image or None) – To prevent reading the logo many times, you can read it once and pass the returned image as an argument in the next call
- color ({"blauw", "wit", "grijs"}) – Color of the logo. Three colors are available: blauw (blue), wit (white) and grijs (grey). Default = “blauw”
- margin_x, margin_y – The x/y image offset in mm.
- alpha (None or float) – The alpha blending value.
- loc ({"lower left", "upper left", "upper right", "lower right"} or tuple) – Location of the logo.
- size (int) – Size of the icon in pixels
Returns: The image of the logo
Return type: mpl.image
-
cbs_utils.plotting.add_values_to_bars(axis, type='bar', position='c', format='{:.0f}', x_offset=0, y_offset=0, color='k', horizontalalignment='center', verticalalignment='center')[source]¶ Add the values of the bars as numbers in the center
Parameters: - axis (mpl.pyplot.axes.Axes object) – Axis containing the bar plot
- position ({"c", "t", "l", "r", "b"}, optional) – Location of the numbers, where “c” is center, “t” is top, “l” is left, “r” is right and “b” is bottom. Default = “c”
- type ({"bar", "barh"}) – Direction of the bars. Default = “bar”, meaning vertical bars. Alternatively you need to specify “barh” for horizontal bars.
- format (str, optional) – Formatter to use for the numbers. Default = “{:.0f}” (remove digits from float)
- x_offset (float, optional) – x offset in pt. Default = 0
- y_offset (float, optional) – y offset in pt. Default = 0
- color (str, optional) – Color of the characters. Default is black
- horizontalalignment (str, optional) – Horizontal alignment of the numbers. Default = “center”
- verticalalignment (str, optional) – Vertical alignment of the numbers Default = “center”
-
cbs_utils.plotting.clean_up_artists(axis, artist_list)[source]¶ Try to remove the artists stored in the artist list belonging to the ‘axis’.
Parameters: - axis – clean artists belonging to this axis
- artist_list – list of artists to remove
Returns: nothing
-
cbs_utils.plotting.get_cbs_logo_points(logo_width_in_mm=3.234, logo_height_in_mm=4.995, rrcor=0.171)[source]¶ Create an array with the letters of the CBS logo
Parameters: Returns: List with 3 Nx2 arrays
Return type:
-
cbs_utils.plotting.get_color_palette(style='koel')[source]¶ Set the color palette
Parameters: style ({"koel", "warm"}, optional) – Color palette to pick. Default = “koel” Returns: cbs_color_cycle Return type: mpl.cycler Notes
To set the CBS color palette as the default:
import matplotlib as mpl
from cbs_utils.plotting import get_color_palette
mpl.rcParams.update({'axes.prop_cycle': get_color_palette("warm")})
cbs_utils.readers module¶
cbs_utils.regular_expressions module¶
This module contains a collection of often used regular expressions
cbs_utils.sql_server_utils module¶
cbs_utils.string_measures module¶
Created on Tue Apr 17 12:06:13 2018
Author: Paul Keuren
Notes
- Levenshtein distances can also be calculated using the python-levenshtein module
-
cbs_utils.string_measures.levenshtein_distance(s: str, t: str) → int[source]¶ Calculate Levenshtein distance
Parameters: Returns: Distance between strings
Return type: Notes
- Calculates the Levenshtein distance between strings. The Levenshtein distance is the minimal number of changes (addition/removal/substitution) required to transform one string into the other string. This specific implementation uses a matrix with size len(s)*len(t).
- For more information on the topic see the Wikipedia article on the Levenshtein distance
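The matrix-based calculation mentioned in the notes can be sketched as follows. This is a textbook dynamic-programming version, not the module's source; the name levenshtein_sketch is hypothetical, and the matrix here has one extra row and column for the empty prefix.

```python
def levenshtein_sketch(s, t):
    """Levenshtein distance using a full (len(s) + 1) x (len(t) + 1) matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # remove all i characters of the prefix of s
    for j in range(cols):
        dist[0][j] = j  # add all j characters of the prefix of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # removal
                             dist[i][j - 1] + 1,         # addition
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]


print(levenshtein_sketch("kitten", "sitting"))  # prints 3
```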
-
cbs_utils.string_measures.optimal_string_alignment_distance(s: str, t: str) → int[source]¶ Parameters: Returns: OSA distance Return type: int
Notes
- The optimal string alignment (OSA) algorithm allows for a quick estimation of the Damerau-Levenshtein (DL) distance. It does not require an additional alphabet, but is therefore limited in its transposition detection/completion. This makes the algorithm cheaper than the DL distance, but also less accurate.
- For more information on the topic see the Wikipedia article on the Damerau-Levenshtein distance
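The limitation described in the notes becomes visible in a small sketch. The code below is the textbook OSA recurrence (Levenshtein plus a swap of two adjacent characters), not the module's source; osa_sketch is a hypothetical name.

```python
def osa_sketch(s, t):
    """Optimal string alignment distance: Levenshtein plus adjacent swaps."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # removal
                             dist[i][j - 1] + 1,         # addition
                             dist[i - 1][j - 1] + cost)  # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_sketch("ab", "ba"))   # prints 1: a single swap
print(osa_sketch("ca", "abc"))  # prints 3
```

For "ca" and "abc" the full Damerau-Levenshtein distance is 2 (swap to "ac", then insert "b"), but OSA never edits a substring twice and therefore returns 3.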
cbs_utils.web_scraping module¶
A collection of classes and utilities to assist with web scraping
Author: Eelco van Vliet
-
class
cbs_utils.web_scraping.HRefCheck(href, url, valid_extensions=None, max_depth=1, branch_count=None, max_branch_count=50, schema=None, ssl_valid=True, validate_url=False)[source]¶ Bases:
object
Class to check if a hyper ref obtained from a web page is a valid internal or external hyper-reference
Parameters: - href (str) – hyper-reference to check as found on the domain
- url (str) – Main domain name. Used to check if we have an internal or external hyper-reference
- valid_extensions (list, optional) – List of string with valid extensions. Default = [“.html”]
- max_depth (int, optional) – Maximum search depth. Default = 1
- branch_count (object) – collections.Counter object which keeps the current count of each branch. This is used to check how often subbranches of the domain are visited. In case the max_branch_count is exceeded we stop searching this branch
- max_branch_count (int, optional) – Maximum number of times a branch in a domain is visited. For instance, in case we have ebay/cars/ as branch, there may be 100,000 cars under this branch which would all be visited. With the branch counter we can stop visiting this branch as soon as the maximum is reached. Default = 50
- schema (str, optional) – Either http or https. If not given (None), the schema will be obtained by doing requests to the site. In case a ‘schema’ is given, this can be skipped and the given schema is used
- ssl_valid (bool, optional) – In case of a https schema, this flag indicates if the certificate was valid.
- validate_url (bool) – Validate each url if it gives a 200 code.
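The interplay of branch_count and max_branch_count can be sketched with the standard library. This is an illustration only: it assumes that a branch is the first path segment of the hyper-reference, and the helpers branch_of and may_visit are hypothetical names that do not occur in the package.

```python
from collections import Counter
from urllib.parse import urlparse


def branch_of(href):
    """Return the first path segment of href, used here as the branch name."""
    segments = [part for part in urlparse(href).path.split("/") if part]
    return segments[0] if segments else ""


def may_visit(href, branch_count, max_branch_count=50):
    """Count a visit to the branch of href; refuse once the maximum is reached."""
    branch = branch_of(href)
    if branch_count[branch] >= max_branch_count:
        return False
    branch_count[branch] += 1
    return True


branch_count = Counter()
hrefs = [f"https://www.example.com/cars/{i}.html" for i in range(5)]
visited = [href for href in hrefs if may_visit(href, branch_count, max_branch_count=3)]
print(len(visited), branch_count["cars"])  # prints 3 3
```

Because the Counter is passed in from outside, the same counts can be shared by all hyper-references found while crawling one domain.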
-
class
cbs_utils.web_scraping.RequestUrl(url: str, session=None, timeout: float = 5.0, retries: int = 3, backoff_factor: float = 0.3, status_forcelist: list = (500, 502, 503, 504), schema=None, ssl_valid=None, validate_url=False)[source]¶ Bases:
object
Add a protocol (https, http) to the url if it does not have one yet, by trying which one fits
Parameters: - url (str) – Url to get the protocol from
- session (optional) – Session object of an already open session can be passed
- timeout (float, optional) – Time-out of the request. Default = 5 s
- retries (int, optional) – Number of times we try to connect. Default = 3
- backoff_factor (float, optional) – Time that we delay. Default = 0.3
- status_forcelist (list, optional) – List of status codes which we force to stop. Default = (500, 502, 503, 504)
- schema (str, optional) – Schema of the url (http or https). If given, this schema is used. Default = None, which means it will be obtained by the class
- ssl_valid (bool, optional) – True in case the certificate of an https url is valid
- validate_url (bool, optional) – Make connection to the url to validate if it exists (has 200 code). Default=False
Examples
>>> req = RequestUrl("www.google.com")
This adds https to www.google.com as this is the first address that is valid
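The “try which one fits” logic can be sketched without network access by injecting the reachability check. This is an assumption-laden illustration, not the class itself: add_schema_sketch and the is_reachable callable are hypothetical, and in practice the callable would issue a real request with the timeout and retry settings listed above.

```python
from urllib.parse import urlparse


def add_schema_sketch(url, is_reachable):
    """Return url with the first schema for which is_reachable(candidate) is True."""
    if urlparse(url).scheme in ("http", "https"):
        return url  # a schema is already present, nothing to add
    for schema in ("https", "http"):  # prefer the encrypted variant
        candidate = f"{schema}://{url}"
        if is_reachable(candidate):
            return candidate
    return None


def fake_probe(candidate):
    """Stand-in for a real request: pretend only https addresses respond."""
    return candidate.startswith("https://")


print(add_schema_sketch("www.google.com", fake_probe))  # prints https://www.google.com
```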
-
class
cbs_utils.web_scraping.UrlSearchStrings(url, search_strings: dict, sort_order_hrefs: list = None, stop_search_on_found_keys: list = None, store_page_to_cache=False, cache_directory='cache', timeout=5.0, max_frames=10, max_hrefs=1000, max_depth=2, max_branch_count=10, max_cache_dir_size=None, scrape_url=True, timezone='Europe/Amsterdam', schema=None, ssl_valid=None, validate_url=None)[source]¶ Bases:
object
Class to set up a recursive search of strings on web pages
Parameters: - url (str) – Main url to start searching
- search_strings (dict) –
Dictionary with the searches performed per page. The form is:
{ "name_of_search_1": "search_string_1" , "name_of_search_2": "search_string_2" }
- store_page_to_cache (bool, optional) – Each page retrieved is also stored to cache if True. Default = False
- timeout (float, optional) – Time in seconds to wait on a request before going to the next. Default = 5.0
- sort_order_hrefs (list, optional) – List of names of subdomains which we want to search first
- stop_search_on_found_keys (list) – List of search keys from the search_strings dict for which we immediately stop searching as soon as we have found a match
- cache_directory (str, optional) – Name of the cache directory. Default = “cache”
- max_frames (int, optional) – Maximum number of frames we scrape. Default = 10
- max_hrefs (int, optional) – Maximum number of hyper references we follow. Default = 1000
- max_depth (int, optional) – Maximum depth we search the domain. Default = 2
- max_branch_count (int, optional) – Maximum number of requests per branch. Default = 10
- max_cache_dir_size (int, optional) – Maximum size of the cache directory in Mb. If None, there is no maximum. If 0, no cache is written. If a finite number is given, the current directory size needs to be checked before each write to the cache, which slows down the code significantly. Default = None
- scrape_url (bool, optional) – Flag to indicate if we want to scrape. If False, no scraping or any other access to the internet is done. This allows the object to be used without doing a scrape
- timezone (str, optional) – Time zone of the scrape. Default = “Europe/Amsterdam”
- schema (str, optional) – Protocol of the url, http or https. If None (default) it will be obtained
- ssl_valid (bool, optional) – Flag to indicate if the tls encryption has a valid certificate
- validate_url – Validate url to check if it exists
-
matches¶ Dictionary containing the results of the searches defined by search_strings. The keys are derived from the search_strings key, the results are lists containing all the matches
Type: dict
Notes
- This class can also handle web pages with frames. Normally, these are not analysed by Beautiful Soup. However, by explicitly looking up all frames and following the links defined by the ‘src’ tag, we can access all the frames in a url
Examples
Let's say we have a web site ‘www.example.com’ and want to extract all the postcodes. Also, we want to get all the words with more than 10 characters. For this, store the regular expressions for both searches in a dictionary and feed it to the UrlSearchStrings class
>>> url = "www.example.com"
>>> search = dict(postcode=r"\d{4}\s{0,1}[a-zA-Z]{2}", longwords=r"\w{11,}")
>>> url_analyse = UrlSearchStrings(url, search_strings=search)
The results are stored in the ‘matches’ attribute of the class and can be reported by printing the class like:
>>> print(url_analyse)
Matches in https://www.example.com/
postcode : []
longwords : ['established', 'illustrative', 'coordination', 'information']
In our example, the list of postcode matches is empty (for the example domain), and we have found 4 words with more than 10 characters
>>> postcodes = url_analyse.matches["postcode"]
Note that the keys of the matches dictionary are the same as the keys we used for the search
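To see what the two regular expressions from this example match without scraping a site, they can be applied to a plain string with the standard re module. The page_text below is made-up sample text, and building the dictionary with re.findall is only an approximation of how the class fills its matches attribute.

```python
import re

# The same search definitions as in the example above
search_strings = dict(postcode=r"\d{4}\s{0,1}[a-zA-Z]{2}", longwords=r"\w{11,}")

# Made-up sample text standing in for the text of a scraped page
page_text = "Visit us at 2613 AB Delft for information and coordination about the Netherlands."

matches = {name: re.findall(pattern, page_text)
           for name, pattern in search_strings.items()}

for name, found in matches.items():
    print(f"{name} : {found}")
```

As in the class, the keys of the resulting dictionary are the keys of search_strings and every value is a list of all matches.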
-
follow_frames(soup, url)[source]¶ In the current soup, find all the frames and for each frame start a new pattern search
Parameters: - soup (BeautifulSoup.soup) – The current soup
- url (str) – The current url
-
follow_hrefs(soup)[source]¶ In the current soup, find all the hyper references and follow them if we stay in the domain
Parameters: - soup (BeautifulSoup.soup) – The current soup
- url (str) – The current url
-
static
get_patterns(soup, regexp) → list[source]¶ Retrieve all the pattern matches in the soup obtained from the url with BeautifulSoup
Parameters: - soup (object:BeautifulSoup) – Return value of the beautiful soup of the page where we want to search
- regexp (re.Pattern) – Compiled regular expression to find on this page
Returns: List of matches with the regular expression
Return type:
-
cbs_utils.web_scraping.cache_to_disk(func)[source]¶ Decorator which allows to cache the output of a function to disk
Parameters: Examples
Say you have a function that reads the contents of a web page from internet:
@cache_to_disk
def get_page_from_url(url, timeout=1.0):
    try:
        page = requests.get(url, timeout=timeout)
    except requests.exceptions.ConnectionError as err:
        page = None
    return page
Without the @cache_to_disk decorator, you would just read the contents of an html file with:
page = get_page_from_url("nu.nl")
However, because we have added the @cache_to_disk decorator, the data is read from the website only the first time and is then stored to a pickle file. In all the next runs the data is obtained from the pickle file.
The cache_to_disk decorator checks if some parameters are given. With the skip_cache flag you can prevent the cache from being used even if the decorator was added. In case max_cache_dir_size is defined, the size of the cache directory is checked first and new cache is only written if the size of the directory in MB is smaller than the defined maximum. An example of using the maximum would be:
page = get_page_from_url("nu.nl", max_cache_dir_size=0)
In this example, we do not allow to add new cache files at all, but old cache files can still be read if present in the cache dir
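The mechanism can be sketched with the standard library. The code below is a simplified illustration and not the package's decorator: it takes the cache directory as a decorator argument, derives the file name from a hash of the arguments, and supports only the skip_cache flag. The names cache_to_disk_sketch and slow_square are hypothetical.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_to_disk_sketch(cache_directory):
    """Decorator factory: pickle the result of a function call to cache_directory."""
    cache_dir = Path(cache_directory)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, skip_cache=False, **kwargs):
            if skip_cache:
                return func(*args, **kwargs)  # neither read nor write cache
            key = repr((args, sorted(kwargs.items()))).encode()
            digest = hashlib.sha256(key).hexdigest()[:16]
            cache_file = cache_dir / f"{func.__name__}_{digest}.pkl"
            if cache_file.exists():
                with open(cache_file, "rb") as stream:
                    return pickle.load(stream)  # later runs: read from disk
            result = func(*args, **kwargs)      # first run: do the real work
            cache_dir.mkdir(parents=True, exist_ok=True)
            with open(cache_file, "wb") as stream:
                pickle.dump(result, stream)
            return result
        return wrapper
    return decorator


calls = []  # records every time the undecorated function really runs

with tempfile.TemporaryDirectory() as tmp:
    @cache_to_disk_sketch(tmp)
    def slow_square(x):
        calls.append(x)
        return x * x

    first = slow_square(4)                    # computed and written to cache
    second = slow_square(4)                   # read from the pickle file
    third = slow_square(4, skip_cache=True)   # cache bypassed, computed again

print(first, second, third, calls)
```

Only pickle files that you wrote yourself should be loaded this way, since unpickling data from an untrusted source can execute arbitrary code.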
-
cbs_utils.web_scraping.get_page_from_url(url, session=None, timeout=1.0, skip_cache=False, raise_exceptions=False, max_cache_dir_size=None, headers=None, verify=True, cache_directory=None)[source]¶ Get the contents of url and immediately store the result to a cache file
Parameters: - url (str) – String with the url to fetch
- session (object:Session) – A session can be passed in case you want to keep it open
- timeout (float) – Number of seconds you try to connect
- skip_cache (bool) – If True, do not use the cache decorator
- skip_cache (bool) – If True, do not write new cache.
- raise_exceptions (bool) – If True, raise the exceptions of the requests
- max_cache_dir_size (int) – Maximum size of cache in Mb. Stop writing cache as soon as the maximum has been reached. If None, this test is skipped and the cache is always written. If 0, we never write cache and therefore the check of the current directory size can be skipped, which significantly speeds up the code
- headers (dict) – Headers to use for the request
- verify (bool) – Forces verification of the certificate
- cache_directory (str) – Name of the cache directory which is passed to the decorator
Returns: The html page
Return type: request.Page
Notes
- The ‘cache_to_disk’ decorator takes care of caching the data to the cache directory
Examples
If you want to get the page using requests.get with caching do the following
>>> url = "https://www.example.com"
>>> page = get_page_from_url(url, cache_directory="cache_test")
>>> soup = BeautifulSoup(page.text, 'lxml')
>>> body_text = re.sub(r'\s+', ' ', soup.body.text)
>>> print(body_text)
' Example Domain This domain is established to be used for illustrative examples in '
'documents. You may use this domain in examples without prior coordination or asking for '
'permission. More information... '
At this point a directory cache_test has also been created, containing a cache file with the name get_page_from_url_https_www_example_com_.pkl
If you only want to read existing cache (in case it was written before) but do not want to write new cache, add the max_cache_dir_size=0 argument
>>> page = get_page_from_url(url, cache_directory="cache_test", max_cache_dir_size=0)
-
cbs_utils.web_scraping.make_cache_file_name(function_name, args)[source]¶ Create a cache file name based on the function name + list of arguments
Parameters: Returns: Name of the cache file
Return type: Notes
- Used by cache_to_disk to make a name of a cache file based on its input arguments
- To make sure that we get a valid file name, we remove all the special characters
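The two notes above can be sketched as follows. The exact rules of the package are not documented here, so this is an assumption: every run of characters other than letters, digits and underscores is replaced by a single underscore and a .pkl extension is added. The name make_cache_file_name_sketch is hypothetical, and its output may differ in detail from the real file names.

```python
import re


def make_cache_file_name_sketch(function_name, args):
    """Join the function name and its arguments into a file-system safe name."""
    raw_name = "_".join([function_name] + [str(arg) for arg in args])
    # Replace every run of special characters by a single underscore
    safe_name = re.sub(r"\W+", "_", raw_name)
    return safe_name + ".pkl"


print(make_cache_file_name_sketch("get_page_from_url", ["https://www.example.com"]))
```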