===========
SR3 OPTIONS
===========

------------------------------
SR3 Configuration File Format
------------------------------

:manual section: 7
:Date: |today|
:Version: |release|
:Manual group: MetPX-Sarracenia

SYNOPSIS
========

::

  name value
  name value for use
  name value_${substitution}
  .
  .
  .

DESCRIPTION
===========

Options are placed in configuration files, one per line, in the form::

  option

For example::

  debug true
  debug

sets the *debug* option to enable more verbose logging. If no value is specified,
the value true is implicit, so the two lines above are equivalent. A second example::

  broker amqps://anonymous@dd.weather.gc.ca

In the above example, *broker* is the option keyword, and the rest of the line is the
value assigned to the setting. Configuration files are a sequence of settings, one per line.
Note:

* Files are read from top to bottom. Most importantly, the *directory*, *strip*, *mirror*,
  and *flatten* options apply to the *accept* clauses that occur after them in the file.

* The forward slash (/) is the path separator in Sarracenia configuration files on all
  operating systems. Use of the backslash character as a path separator (as used in the
  cmd shell on Windows) may not work properly. When files are read on Windows, the path
  separator is immediately converted to the forward slash, so all pattern matching in
  accept, reject, strip, etc. directives should use forward slashes when a path separator
  is needed.

* **#** is the prefix for comments: lines of non-functional description of a configuration.
  This is the same as in shell and Python scripts.

* **All options are case sensitive.** **Debug** is not the same as **debug** nor **DEBUG**.
  Those are three different options (two of which do not exist and will have no effect,
  but will generate an 'unknown option' warning).
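As an illustration only, the line format described above can be sketched in a few lines of Python. This is not Sarracenia's own parser: the names ``parse_line`` and ``parse_config`` are hypothetical, and the sketch covers only comments, the implicit true value, and case sensitivity.

```python
def parse_line(line):
    """Split one configuration line into (option, value).

    Returns None for blank lines and for comment lines starting with '#'.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split(None, 1)
    option = parts[0]  # case is preserved: 'Debug' is not 'debug'
    value = parts[1] if len(parts) > 1 else "true"  # a missing value means true
    return option, value


def parse_config(text):
    """Return the settings of a configuration text, in file order."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [setting for setting in parsed if setting is not None]


sample = """\
# enable verbose logging
debug
broker amqps://anonymous@dd.weather.gc.ca
Debug true
"""
settings = parse_config(sample)
for option, value in settings:
    print(option, "=", value)
```

The result is kept as an ordered list rather than a dictionary because, as noted above, the order of lines matters for options such as *directory* and *accept*.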
The file has an inherent order, in that it is read from top to bottom, so options
set on one line often affect later lines::

  mirror off
  directory /data/just_flat_files_here_please
  accept .*flatones.*

  mirror on
  directory /data/fully_mirrored
  accept .*

In the above snippet the *mirror* setting is off, then the directory value is set, so
files whose names include *flatones* will all be placed in the
*/data/just_flat_files_here_please* directory. Files whose names do not match are not
picked up by the first accept, so the mirror on and the new directory setting take over,
and those other files land in /data/fully_mirrored. A second example:

sequence #1::

  reject .*\.gif
  accept .*

sequence #2::

  accept .*
  reject .*\.gif

.. note::
   FIXME: does this match only files ending in 'gif' or should we add a $ to it?
   will it match something like .gif2 ? is there an assumed .* at the end?

In sequence #1, all files ending in 'gif' are rejected. In sequence #2, the
accept .* (which accepts everything) is encountered before the reject statement,
so the reject has no effect. Some options have global scope, rather than being
interpreted in order. For those cases, the last declaration overrides the ones
higher in the file.

Variables
=========

One can include substitutions in option values. They are represented by ${name}.
The name can be an ordinary environment variable, or chosen from a number of
built-in ones. For example::

  varTimeOffset -5m
  directory /mylocaldirectory/${%Y%m%d}/mydailies
  accept    .*observations.*

  rename hoho.${%o-1h%Y%m%d_%H%M%S.%f}.csv

In the example above, *varTimeOffset* modifies the evaluation of YYYYMMDD to be 5m in
the past. In the rename option, the time to be substituted is one hour in the past.
One can also specify variable substitutions to be performed on arguments to the
directory option, with the use of *${..}* notation:

* %... - a ``datetime.strftime()`` compatible date/time formatting string, augmented by
  an offset duration suffix (o- for in the past, o+ for in the future)

  * example (complex date): ${%Y/%m/%d_%Hh%M:%S.%f} --> 2022/12/04_17h36:34.014412
  * example (add offset): ${%o-1h%Y/%m/%d_%Hh%M:%S.%f} --> 2022/12/04_16h36:34.014412
  * time offset: begin a strftime pattern with %o for an offset of +-1(s/m/h/d/w) units.

* SOURCE - the amqp user that injected data (taken from the notification message.)
* BD - the base directory
* BUP - the path component of the baseUrl (or: baseUrlPath)
* BUPL - the last element of the baseUrl path. (or: baseUrlPathLast)
* PBD - the post base dir
* *var* - any environment variable.
* BROKER_USER - the user name for authenticating to the broker (e.g. anonymous)
* POST_BROKER_USER - the user name for authenticating to the post_broker (e.g. anonymous)
* PROGRAM - the name of the component (subscribe, shovel, etc...)
* CONFIG - the name of the configuration file being run.
* HOSTNAME - the hostname running the client.
* RANDID - a random id that will be consistent within a single invocation.

The %Y%m%d and %H time stamps refer to the time at which the data is processed by
the component. They are not decoded or derived from the content of the files delivered.
All date/times in Sarracenia are in UTC. Use the varTimeOffset setting to adjust
from the current time.

Refer to *sourceFromExchange* for a common example of usage. Note that any Sarracenia
built-in value takes precedence over a variable of the same name in the environment.
Note that flatten settings can be changed between directory options.
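The substitution rules above can be illustrated with a short Python sketch. It is a re-implementation for explanation only, not Sarracenia's code: the function ``expand`` and its ``builtins`` and ``environ`` parameters are hypothetical, and the sketch handles only strftime patterns (with the optional %o offset), built-in names, and environment variables. It ignores *varTimeOffset* and the Sundew fields.

```python
import os
import re
from datetime import datetime, timedelta, timezone

# Unit letters accepted after %o, mapped to timedelta keyword arguments.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
OFFSET = re.compile(r"^%o([+-])(\d+)([smhdw])")


def expand(value, now, builtins=None, environ=None):
    """Expand ${...} fields in value (illustrative sketch only)."""
    builtins = builtins or {}
    environ = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name.startswith("%"):
            when = now
            offset = OFFSET.match(name)
            if offset:
                sign, amount, unit = offset.groups()
                delta = timedelta(**{UNITS[unit]: int(amount)})
                when = now + delta if sign == "+" else now - delta
                name = name[offset.end():]  # drop the %o... prefix
            return when.strftime(name)
        if name in builtins:  # built-in values win over the environment
            return builtins[name]
        return environ.get(name, match.group(0))  # unknown names are left as-is

    return re.sub(r"\$\{([^}]+)\}", replace, value)


now = datetime(2022, 12, 4, 17, 36, 34, 14412, tzinfo=timezone.utc)
print(expand("${%Y/%m/%d_%Hh%M:%S.%f}", now))       # 2022/12/04_17h36:34.014412
print(expand("${%o-1h%Y/%m/%d_%Hh%M:%S.%f}", now))  # 2022/12/04_16h36:34.014412
```

Passing ``now`` explicitly keeps the sketch deterministic. A real component would use the current UTC time at the moment the data is processed.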
Note::

  When ${% date substitutions are present, % patterns in file names are interpreted
  by strftime, so it may be necessary to escape literal percent characters by
  doubling them: %%

Sundew Compatible Substitutions
-------------------------------

In `MetPX Sundew <../Explanation/Glossary.html#sundew>`_, there is a much more strict
file naming standard, specialised for use with World Meteorological Organization (WMO)
data. Note that the file naming convention predates, and bears no relation to, the WMO
file naming convention currently approved, but is strictly an internal format. The file
names are separated into six fields by colon characters. The first field, DESTFN, gives
the WMO (386 style) Abbreviated Header Line (AHL) with underscores replacing blanks::

  TTAAii CCCC YYGGGg BBB ...

(see WMO manuals for details) followed by numbers to render the product unique (as in
practice, though not in theory, there are a large number of products which have the
same identifiers). The fifth field is a priority, and the last field is a date/time
stamp. The other fields vary in meaning depending on context. A sample file name::

  SACN43_CWAO_012000_AAA_41613:ncp1:CWAO:SA:3.A.I.E:3:20050201200339

If a file is sent to Sarracenia and it is named according to the Sundew conventions,
then the following substitution fields are available::

  ${T1}    replace by bulletin's T1
  ${T2}    replace by bulletin's T2
  ${A1}    replace by bulletin's A1
  ${A2}    replace by bulletin's A2
  ${ii}    replace by bulletin's ii
  ${CCCC}  replace by bulletin's CCCC
  ${YY}    replace by bulletin's YY   (obs. day)
  ${GG}    replace by bulletin's GG   (obs. hour)
  ${Gg}    replace by bulletin's Gg   (obs. minute)
  ${BBB}   replace by bulletin's bbb
  ${RYYYY} replace by reception year
  ${RMM}   replace by reception month
  ${RDD}   replace by reception day
  ${RHH}   replace by reception hour
  ${RMN}   replace by reception minutes
  ${RSS}   replace by reception second

  YYYYMMDD - the current daily timestamp. (v2 compat, prefer strftime %Y%m%d )
  HH       - the current hourly timestamp. (v2 compat, prefer strftime %H )
  JJJ      - the current day-of-year timestamp. (v2 compat, prefer strftime %j )

The 'R' fields come from the sixth field, and the others come from the first one.
When data is injected into Sarracenia from Sundew, the *sundew_extension* notification
message header will provide the source for these substitutions even if the fields have
been removed from the delivered file names.

Note::

  The version 2 compatible date strings (e.g. YYYYMMDD) originate with obsolete WMO
  practices, and support will be removed at a future date. Please use strftime
  style patterns in new configurations.

SR_DEV_APPNAME
~~~~~~~~~~~~~~

The SR_DEV_APPNAME environment variable can be set so that the application configuration
and state directories are created under a different name. This is used in development
to be able to have many configurations active at once. It enables more testing than
always working with the developer's *real* configuration.

Example: after ``export SR_DEV_APPNAME=sr-hoho``, a component started on a Linux system
will look in ~/.config/sr-hoho/ for configuration files, and write state files in the
~/.cache/sr-hoho directory.

OPTION TYPES
============

sr3 options come in several types:

count
  integer count type.

duration
  a floating point number indicating a quantity of seconds (0.001 is 1 millisecond)
  modified by a unit suffix ( m-minute, h-hour, w-week )

flag
  an option that has only True or False values (aka: a boolean value)

float
  a floating point number.

list
  a list of string values, each succeeding occurrence catenates to the total.
  all v2 plugin options are declared of type list.

set
  a set of string values, each succeeding occurrence is unioned to the total.

size
  integer size. Suffixes k, m, and g for kilo, mega, and giga (base 2) multipliers.

str
  a string value

OPTIONS
=======

The actual options are listed below.
Note that they are case sensitive, and only a subset are available on the command line.
Those that are available on the command line have the same effect as when specified
in configuration files.

The options available in configuration files:

accelThreshold default: 0 (disabled.)
---------------------------------------------------

The accelThreshold indicates the minimum size of file being transferred for
which a binary downloader will be launched.

accelXxxCommand
----------------

Can specify alternate binaries for downloaders to tune for specific cases.

+-----------------------------------+--------------------------------+
| Option                            | Default value                  |
+-----------------------------------+--------------------------------+
| accelWgetCommand                  | /usr/bin/wget %s -O %d         |
+-----------------------------------+--------------------------------+
| accelScpCommand                   | /usr/bin/scp %s %d             |
+-----------------------------------+--------------------------------+
| accelCpCommand                    | /usr/bin/cp %s %d              |
+-----------------------------------+--------------------------------+
| accelFtpgetCommand                | /usr/bin/ncftpget %s %d        |
+-----------------------------------+--------------------------------+
| accelFtpputCommand                | /usr/bin/ncftpput %s %d        |
+-----------------------------------+--------------------------------+

Use %s to stand in for the name of the source file, and %d for the file being written.
An example setting to override with::

  accelCpCommand dd if=%s of=%d bs=4096k

accept, reject and acceptUnmatched
----------------------------------

- **accept (optional) []**
- **reject (optional)**
- **acceptUnmatched (default: True)**

The **accept** and **reject** options process regular expressions (regexp).
The regexp is applied to the notification message's URL for a match.

If the notification message's URL of a file matches a **reject** pattern, the
notification message is acknowledged as consumed to the broker and skipped.
One that matches an **accept** pattern is processed by the component.

In many configurations, **accept** and **reject** options are mixed with the
**directory** option. They then relate accepted notification messages to the
**directory** value they are specified under.

After all **accept** / **reject** options are processed, normally the notification
message is accepted for further processing. To override that default, set
**acceptUnmatched** to False. The **accept/reject** settings are interpreted in order.
Each option is processed in order, from top to bottom. For example:

sequence #1::

  reject .*\.gif
  accept .*

sequence #2::

  accept .*
  reject .*\.gif

In sequence #1, all files ending in 'gif' are rejected. In sequence #2, the
accept .* (which accepts everything) is encountered before the reject statement,
so the reject has no effect.

It is best practice to use server side filtering to reduce the number of notification
messages sent to the component to a small superset of what is relevant, and perform
only a fine-tuning with the client side mechanisms, saving bandwidth and processing
for all. More details on how to apply the directives follow:

The **accept** and **reject** options use regular expressions (regexp) to match URL.
These options are processed sequentially.
The URL of a file that matches a **reject** pattern is not published.
Files matching an **accept** pattern are published.
Again, a *rename* can be added to the *accept* option... matching products
for that *accept* option would get renamed as described... unless the *accept* matches
one file, the *rename* option should describe a directory into which the files
will be placed (prepending instead of replacing the file name).

The **permDefault** option allows users to specify a Linux-style numeric octal
permission mask::

  permDefault 040

means that a file will not be posted unless the group has read permission
(on an ls output that looks like: ---r-----, like a chmod 040 command).
The **permDefault** option specifies a mask, that is, the permissions must be
at least what is specified.

The **regexp pattern** can be used to set directory parts if part of the notification
message is put in parentheses. **sender** can use these parts to build the directory
name. The first parenthesized string will replace keyword **${0}** in the directory
name... the second **${1}** etc.

Example of use::

  filename NONE

  directory /this/first/target/directory

  accept .*file.*type1.*

  directory /this/target/directory

  accept .*file.*type2.*

  accept .*file.*type3.*  DESTFN=file_of_type3

  directory /this/${0}/pattern/${1}/directory

  accept .*(2016....).*(RAW.*GRIB).*

A notification message selected by the first accept would be delivered unchanged to the
first directory.

A notification message selected by the second accept would be delivered unchanged to the
second directory.

A notification message selected by the third accept would be renamed "file_of_type3" in
the second directory.

A notification message selected by the fourth accept would be delivered unchanged to a
directory named */this/20160123/pattern/RAW_MERGER_GRIB/directory* if the notification
message has a notice like:

**20150813161959.854 http://this.pump.com/ relative/path/to/20160123_product_RAW_MERGER_GRIB_from_CMC**

acceptSizeWrong: (default: False)
-------------------------------------------

When a file is downloaded and its size does not match the one advertised, it is
normally rejected as a failure. This option accepts the file even with the wrong
size. This is helpful when a file is changing frequently and there is some queueing,
so the file has changed by the time it is retrieved.

attempts (default: 3)
-----------------------------

The **attempts** option indicates how many times to attempt downloading the data
before giving up. The default of 3 should be appropriate in most cases.
When the **retry** option is false, the file is then dropped immediately.
When the **retry** option is set (default), a failure to download (or send, in a sender)
after the prescribed number of **attempts** will cause the notification message to be
added to a queue file for later retry. When there are no notification messages ready to
consume from the AMQP queue, the retry queue will be queried.

baseDir (default: /)
----------------------------

**baseDir** supplies the directory path that, when combined with the relative
one in the selected notification, gives the absolute path of the file to be sent.
The default is None, which means that the path in the notification is the absolute one.

Sometimes senders subscribe to local xpublic, which are http URLs, but the sender
needs a local file, so the local path is built by concatenating::

  baseDir + relative path in the baseUrl + relPath

When used for reception, it specifies the root of the tree that upstream files are
assumed to be from, to be replaced on download by either post_baseDir or the
*directory* setting in effect.

baseUrl_relPath (default: off)
-------------------------------------

Normally (baseUrl_relPath is False), the relative path, appended to the base directory,
for files which are downloaded will be set according to the relPath header included
in the notification message. If *baseUrl_relPath* is set, however, the notification
message's relPath will be prepended with the sub-directories from the notification
message's baseUrl field.

batch (default: 100)
----------------------------

The **batch** option is used to indicate how many files should be transferred
over a connection before it is torn down and re-established. On very low
volume transfers, where timeouts can occur between transfers, this should be
lowered to 1. For most usual situations the default is fine. For higher volume
cases, one could raise it to reduce transfer overhead. At the moment it is only used
for file transfer protocols, not HTTP ones.
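To make the effect of **batch** concrete, here is a minimal Python sketch of a transfer loop that re-establishes its connection after a fixed number of files. It is an assumption-laden illustration, not Sarracenia's transfer code: ``transfer``, ``FakeConnection``, and the ``connect`` factory are hypothetical names.

```python
class FakeConnection:
    """Stand-in for a transfer session (hypothetical, for illustration)."""

    def __init__(self):
        self.sent = []
        self.closed = False

    def send(self, name):
        self.sent.append(name)

    def close(self):
        self.closed = True


def transfer(files, connect, batch=100):
    """Send files, tearing the connection down after every `batch` transfers.

    Returns the number of connections that were opened.
    """
    opened = 0
    connection = None
    sent_on_connection = 0
    for name in files:
        if connection is None:
            connection = connect()  # (re-)establish the connection lazily
            opened += 1
        connection.send(name)
        sent_on_connection += 1
        if sent_on_connection >= batch:
            connection.close()
            connection = None
            sent_on_connection = 0
    if connection is not None:
        connection.close()
    return opened


names = [f"file{i}" for i in range(250)]
print(transfer(names, FakeConnection, batch=100))  # 3 connections
print(transfer(names[:5], FakeConnection, batch=1))  # 5 connections
```

With ``batch=1`` every file gets a fresh connection, which avoids idle timeouts on low-volume flows at the cost of more connection setup.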

blocksize default: 0 (auto)
-----------------------------------

NOTE: **EXPERIMENTAL** in sr3, expected to return in a future version.

This **blocksize** option controls the partitioning strategy used to post files.
The value should be one of::

  0 - autocompute an appropriate partitioning strategy (default)
  1 - always send entire files in a single part.
   - use a fixed partition size (example size: 1M )

Files can be announced as multiple parts. Each part has a separate checksum.
The parts and their checksums are stored in the cache. Partitions can traverse
the network separately, and in parallel. When files change, transfers are
optimized by only sending parts which have changed.

The *outlet* option allows the final output to be other than a post.
See sr3_cpump(1) for details.

broker
------

**broker [amqp|mqtt]{s}://:@[:port]/**

A URI is used to configure a connection to a notification message pump, either
an MQTT or an AMQP broker. Some Sarracenia components set a reasonable default for
that option. Provide the normal user, host, and port of connections. In most
configuration files, the password is missing. The password is normally only included
in the ``credentials.conf`` file.

Sarracenia work has not used vhosts, so **vhost** should almost always be **/**.

For more info on the AMQP URI format: ( https://www.rabbitmq.com/uri-spec.html )

The broker option, set either in default.conf or in each specific configuration file,
tells each component which broker to contact.

**broker [amqp|mqtt]{s}://:@[:port]/**

::

  (default: None and it is mandatory to set it )

Once connected to an AMQP broker, the user needs to bind a queue
to exchanges and topics to determine the notification messages of interest.

bufsize (default: 1MB)
-----------------------------

Files will be copied in *bufsize*-byte blocks. This setting is for use by transfer
protocols.
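Copying in *bufsize*-byte blocks can be pictured with the following generic Python sketch. It shows the general technique only, under the assumption that the transfer reads from and writes to file-like objects; ``copy_in_blocks`` is a hypothetical name, not a Sarracenia function.

```python
import io


def copy_in_blocks(src, dst, bufsize=1024 * 1024):
    """Copy src to dst in blocks of at most bufsize bytes; return bytes copied."""
    total = 0
    while True:
        block = src.read(bufsize)
        if not block:  # an empty read signals end of input
            break
        dst.write(block)
        total += len(block)
    return total


source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
copied = copy_in_blocks(source, target, bufsize=1024)
print(copied)  # 2500, moved as blocks of 1024, 1024 and 452 bytes
```

A larger buffer means fewer read and write calls per file, at the cost of more memory held per transfer.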

byteRateMax (default: 0)
--------------------------------

If **byteRateMax** is greater than 0, the process attempts to respect this delivery
speed in kilobytes per second... (ftp, ftps, or sftp)

**FIXME**: byteRateMax... only implemented by sender? or subscriber as well,
data only, or notification messages also?

callback
--------------------

**callback** appends a flowcallback class to the list of those to be called during
processing. Most customizable processing, or "plugin" logic, is implemented using the
flow callback class. At different points in notification message processing, flow
callback classes define entry_points that match that point in processing. For every
such point in the processing, there is a list of flow callback routines to call.

FlowCallback Reference

The *classSpec* is similar to an *import* statement from Python. It uses the Python
search path, and also includes ~/.config/sr3/plugins. There is some shorthand to make
usage shorter for common cases. For example::

  callback log

Sarracenia will first attempt to prepend *log* with *sarracenia.flowcb.log* and then
instantiate the callback instance as an item of class sarracenia.flowcb.Log. If it does
not find such a class, then it will attempt to find a class named *log*, and instantiate
an object *log.Log*.

More detail: FlowCallback load_library

callback_prepend
----------------------------

Identical to callback, but meant to specify functions to be executed early, that is,
prepended to the list of plugins to run.

dangerWillRobinson (default: omitted)
-------------------------------------

This option is only recognized as a command line option. It is specified when an
operation is expected to have irreversibly destructive or perhaps unexpected effects.
For example::

  sr3 stop

will stop running components, but not those that are being run in the foreground.
Stopping those may be surprising to the analysts that will be looking at them, so that
is not done by default::

  sr3 --dangerWillRobinson stop

stops all components, including the foreground ones. Another example would be the
*cleanup* action. That action deletes queues and exchanges related to a configuration,
which can be destructive to flows. By default, cleanup only operates on a single
configuration at a time. One can specify this option to wreak greater havoc.

declare
-------

env NAME=Value
  One can also reference environment variables in configuration files,
  using the *${ENV}* syntax. If Sarracenia routines need to make use
  of an environment variable, then it can be set in configuration files::

    declare env HTTP_PROXY=localhost

exchange exchange_name
  using the admin url, declare the exchange with *exchange_name*

subscriber
  A subscriber is a user that can only subscribe to data and return report notification
  messages. Subscribers are not permitted to inject data. Each subscriber has an xs_
  named exchange on the pump, where if a user is named *Acme*, the corresponding
  exchange will be *xs_Acme*. This exchange is where a subscribe process will send its
  report notification messages.

  By convention/default, the *anonymous* user is created on all pumps to permit
  subscription without a specific account.

source
  A user permitted to subscribe or originate data. A source does not necessarily
  represent one person or type of data, but rather an organization responsible for the
  data produced. So if an organization gathers and makes available ten kinds of data
  with a single contact email or phone number for questions about the data and its
  availability, then all of those collection activities might use a single 'source'
  account.

  Each source gets an xs_ exchange for injection of data notification messages, and,
  similar to a subscriber, to send report notification messages about processing and
  receipt of data.
  A source may also have an xl_ exchange where, as per report routing configurations,
  report notification messages of consumers will be sent.

feeder
  A user permitted to write to any exchange. Sort of an administrative flow user, meant
  to pump notification messages when no ordinary source or subscriber is appropriate to
  do so. It is to be used in preference to administrator accounts to run flows.

User credentials are placed in the ``credentials.conf`` file, and *sr3 --users declare*
will update the broker to accept what is specified in that file, as long as the admin
password is already correct.

debug
-----

Setting the debug option is identical to using **logLevel debug**

delete (default: off)
-------------------------------

When the **delete** option is set, after a download has completed successfully, the
subscriber will delete the file at the upstream source. Default is false.

discard (default: off)
--------------------------------

The **discard** option, if set to true, deletes the file once downloaded. This option
can be useful when debugging or testing a configuration.

directory (default: .)
-----------------------------

The *directory* option defines where to put the files on your server.
Combined with **accept** / **reject** options, the user can select the
files of interest and their directories of residence (see the **mirror**
option for more directory settings).

The **accept** and **reject** options use regular expressions (regexp) to match URL.
These options are processed sequentially.
The URL of a file that matches a **reject** pattern is never downloaded.
One that matches an **accept** pattern is downloaded into the directory
declared by the closest **directory** option above the matching **accept** option.
**acceptUnmatched** is used to decide what to do when no reject or accept clauses
matched.

::

  ex.   directory /mylocaldirectory/myradars
        accept    .*RADAR.*

        directory /mylocaldirectory/mygribs
        reject    .*Reg.*
        accept    .*GRIB.*

destfn_script