Flow API Example
The sarracenia.flow class provides built-in accept/reject filtering of messages, built-in downloading over several protocols, and retries on failure, and it lets you add callbacks to customize processing.
You need to provide a configuration as an argument when instantiating a subscriber. sarracenia.config.no_file_config() returns an empty configuration without consulting any of the sr3 configuration file tree.
After the needed modifications have been made to the configuration, the subscriber is instantiated and run.
[2]:
!mkdir /tmp/flow_demo
Make a directory for the files you are going to download. The root of the directory tree must exist.
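A plain `mkdir` reports an error when the directory already exists, which happens as soon as the notebook is run a second time. The following is a minimal sketch of an idempotent alternative in Python. `ensure_download_root` is a hypothetical helper name, not part of sarracenia, and the demonstration uses a throw-away temporary location rather than /tmp/flow_demo.

```python
import os
import tempfile


def ensure_download_root(path):
    """Create the download root if it is missing; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated on a temporary location so the sketch leaves nothing behind
# in /tmp/flow_demo; in the notebook the argument would be "/tmp/flow_demo".
demo_root = ensure_download_root(os.path.join(tempfile.mkdtemp(), "flow_demo"))

# Calling it again is harmless, unlike a second plain mkdir.
ensure_download_root(demo_root)
print(os.path.isdir(demo_root))
```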
[3]:
import re
import sarracenia.config
from sarracenia.flow.subscribe import Subscribe
import sarracenia.flowcb
import sarracenia.credentials
cfg = sarracenia.config.no_file_config()
cfg.broker = sarracenia.credentials.Credential('amqps://anonymous:anonymous@hpfx.collab.science.gc.ca')
cfg.topicPrefix = [ 'v02', 'post']
cfg.component = 'subscribe'
cfg.config = 'flow_demo'
cfg.action = 'start'
cfg.bindings = [ ('xpublic', ['v02', 'post'], ['*', 'WXO-DD', 'observations', 'swob-ml', '#' ]) ]
cfg.queueName='q_anonymous.subscriber_test2'
cfg.download=True
cfg.batch=1
cfg.messageCountMax=5
# set the instance number for the flow class.
cfg.no=0
# set other settings based on provided ones, so it is ready for use.
cfg.finalize()
# accept/reject patterns:
pattern=".*"
# to_match, write_to_dir, DESTFN, regex_to_match, accept=True,mirror,strip, pstrip,flatten
cfg.masks= [ ( pattern, "/tmp/flow_demo", None, re.compile(pattern), True, False, False, False, '/' ) ]
Starters.
The broker, bindings, and queueName settings are explained in the moth notebook.
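As a quick reminder of what a binding means, the sketch below joins the topic prefix and subtopic of the binding used above into the dotted key that appears in the log output (v02.post.*.WXO-DD.observations.swob-ml.#), and applies the standard AMQP topic wildcard rules: `*` stands for exactly one word and `#` for zero or more. Both function names are hypothetical illustrations; the real matching is done by the broker, not by this code.

```python
def binding_to_topic(binding):
    """Join the topic prefix and subtopic words with '.' to form the binding key."""
    exchange, topic_prefix, subtopic = binding
    return exchange, ".".join(topic_prefix + subtopic)


def topic_matches(pattern, topic):
    """AMQP topic matching: '*' is exactly one word, '#' is zero or more words."""
    def match(p, t):
        if not p:
            return not t
        if p[0] == "#":
            # '#' either absorbs nothing (drop it) or absorbs one more word.
            return match(p[1:], t) or (bool(t) and match(p, t[1:]))
        if not t:
            return False
        return (p[0] == "*" or p[0] == t[0]) and match(p[1:], t[1:])

    return match(pattern.split("."), topic.split("."))


binding = ("xpublic", ["v02", "post"],
           ["*", "WXO-DD", "observations", "swob-ml", "#"])
exchange, key = binding_to_topic(binding)
print(exchange, key)

# Topic built from the subtopic of the first message in the log output.
topic = "v02.post.20240129.WXO-DD.observations.swob-ml.20240129.CXCK"
print(topic_matches(key, topic))
```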
cfg.download
Whether the flow should download the files that the messages refer to. When True, the file corresponding to each accepted message is downloaded.
cfg.batch
Messages are processed in batches. The number of messages to retrieve per call to newMessages() is limited by the batch setting. We set it to 1 here so you can see each file being downloaded immediately after the corresponding message is received. If you leave this setting out, it defaults to 25. The best value is a matter of taste and use case.
cfg.messageCountMax
Normally we leave this setting at its default (0), which has no effect on processing. For demonstration purposes, we use it to limit the number of messages the subscriber will process: after messageCountMax messages have been received, processing stops.
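The interplay of batch and messageCountMax can be pictured with a small stand-alone loop. This is only a sketch under stated assumptions, not sarracenia's implementation: `run` and its parameters are hypothetical names, messages are plain integers, the batch is capped so the limit is never overshot, and the loop ends when the source runs dry (a real flow would keep waiting for new messages instead).

```python
from itertools import islice


def run(messages, batch=25, message_count_max=0):
    """Gather messages in batches; stop once message_count_max are received.

    A message_count_max of 0 means no limit, as with the default above.
    """
    source = iter(messages)
    received = []
    while True:
        want = batch
        if message_count_max > 0:
            # Assumption: never gather more than the remaining allowance.
            want = min(batch, message_count_max - len(received))
        chunk = list(islice(source, want))
        if not chunk:
            break  # source exhausted; a real flow would sleep and retry
        received.extend(chunk)
        if message_count_max > 0 and len(received) >= message_count_max:
            break  # limit reached, stop processing
    return received


# Same settings as the notebook: one message per batch, five in total.
print(run(range(100), batch=1, message_count_max=5))
```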
cfg.masks
Masks are a compiled form of accept/reject directives. A relPath is compared to the regex in the mask. If the regex matches and accept is True, the message is accepted for further processing. If the regex matches but accept is False, processing of the message stops (the message is rejected).
Each mask is a tuple. The meaning of the fields can be looked up in the sr3(1) man page:
pattern_string, the input regular expression string, to be compiled by re routines.
directory, where to put the files downloaded (root of the tree, when mirroring)
fn, transformation of filename to do. None is the 99% use case.
regex, compiled regex version of the pattern_string
accept(True/False), if pattern matches then accept message for further processing.
mirror(True/False), when downloading build a complete tree to mirror the source, or just dump in directory
strip(True/False), modify the relpath by stripping entries from the left.
pstrip(True/False), strip entries based on pattern
flatten(char … ‘/’ means do not flatten.)
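To make the field order and the accept/reject decision concrete, here is a minimal stand-alone sketch. It is not sarracenia's code: `make_mask` and `is_accepted` are hypothetical helpers, the rule that the first matching mask decides is an assumption about evaluation order, and the `default` argument used when no mask matches is also an assumption.

```python
import re


def make_mask(pattern, directory, accept=True):
    """Build a 9-field mask tuple in the field order listed above."""
    return (pattern, directory, None, re.compile(pattern),
            accept, False, False, False, "/")


def is_accepted(masks, rel_path, default=False):
    """Return the accept flag of the first mask whose regex matches rel_path."""
    for mask in masks:
        (pattern, directory, fn, regex,
         accept, mirror, strip, pstrip, flatten) = mask
        if regex.match(rel_path):
            return accept
    return default  # assumption: behaviour when nothing matches


# Reject one station, accept everything else (paths taken from the log output).
masks = [
    make_mask(".*/CXCK/.*", "/tmp/flow_demo", accept=False),
    make_mask(".*", "/tmp/flow_demo"),
]
prefix = "/20240129/WXO-DD/observations/swob-ml/20240129"
cxck = prefix + "/CXCK/2024-01-29-1743-CXCK-AUTO-minute-swob.xml"
czkd = prefix + "/CZKD/2024-01-29-1743-CZKD-AUTO-minute-swob.xml"
print(is_accepted(masks, cxck), is_accepted(masks, czkd))
```

Putting the more specific pattern first matters in this sketch, because the catch-all `.*` would otherwise match before the reject pattern is ever consulted.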
cfg.no, cfg.pid_filename
These settings are needed because they would ordinarily be set by the sarracenia.instance class, which is normally used to launch flows. They allow run-time paths to be set up for retry queues and state files, which remember settings between runs when needed.
[4]:
subscriber = sarracenia.flow.subscribe.Subscribe( cfg )
subscriber.run()
2024-01-29 15:00:37,351 [INFO] sarracenia.flow loadCallbacks flowCallback plugins to load: ['sarracenia.flowcb.gather.message.Message', 'sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'log']
2024-01-29 15:00:37,354 [DEBUG] sarracenia.flowcb.retry __init__ sr_retry __init__
2024-01-29 15:00:37,354 [DEBUG] sarracenia.config add_option []0 retry_driver declared as type:<class 'str'> value:disk
2024-01-29 15:00:37,355 [DEBUG] sarracenia.diskqueue __init__ work_retry_00 __init__
2024-01-29 15:00:37,357 [DEBUG] sarracenia.config add_option []0 MemoryMax declared as type:<class 'int'> value:0
2024-01-29 15:00:37,357 [DEBUG] sarracenia.config add_option []0 MemoryBaseLineFile declared as type:<class 'int'> value:100
2024-01-29 15:00:37,358 [DEBUG] sarracenia.config add_option []0 MemoryMultiplier declared as type:<class 'float'> value:3
2024-01-29 15:00:37,359 [DEBUG] sarracenia.config add_option []0 logEvents declared as type:<class 'set'> value:{'after_work', 'on_housekeeping', 'after_accept', 'after_post'}
2024-01-29 15:00:37,359 [DEBUG] sarracenia.config add_option []0 logMessageDump declared as type:<class 'bool'> value:False
2024-01-29 15:00:37,359 [INFO] sarracenia.flowcb.log __init__ subscribe initialized with: logEvents: {'after_work', 'on_housekeeping', 'after_accept', 'after_post'}, logMessageDump: False
2024-01-29 15:00:37,360 [DEBUG] sarracenia.config check_undeclared_options missing defaults: {'post_exchangeSplit', 'follow_symlinks', 'topic', 'post_exchange', 'reconnect', 'sendTo', 'pollUrl', 'logMessageDump', 'MemoryBaseLineFile', 'exchange_suffix', 'force_polling', 'exchangeSplit', 'post_topic', 'save', 'inplace', 'retry_driver', 'post_on_start', 'header', 'blocksize', 'restore', 'MemoryMultiplier', 'report_exchange', 'cluster', 'nodupe_basis', 'post_exchangeSuffix', 'identity', 'MemoryMax', 'count', 'notify_only', 'feeder', 'realpathFilter'}
2024-01-29 15:00:37,360 [INFO] sarracenia.flow run callbacks loaded: ['sarracenia.flowcb.gather.message.Message', 'sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'log']
2024-01-29 15:00:37,360 [INFO] sarracenia.flow run pid: 3567801 subscribe/flow_demo instance: 0
2024-01-29 15:00:37,448 [DEBUG] amqp _on_start Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@hpfx2.collab.science.gc.ca', 'copyright': 'Copyright (c) 2007-2022 VMware, Inc. or its affiliates.', 'information': 'Licensed under the MPL 2.0. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 24.2.1', 'product': 'RabbitMQ', 'version': '3.9.13'}, mechanisms: [b'PLAIN', b'AMQPLAIN'], locales: ['en_US']
2024-01-29 15:00:37,493 [DEBUG] amqp __init__ using channel_id: 1
2024-01-29 15:00:37,514 [DEBUG] amqp _on_open_ok Channel open
2024-01-29 15:00:37,514 [DEBUG] amqp __init__ using channel_id: 2
2024-01-29 15:00:37,535 [DEBUG] amqp _on_open_ok Channel open
2024-01-29 15:00:37,634 [DEBUG] amqp _on_start Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@hpfx2.collab.science.gc.ca', 'copyright': 'Copyright (c) 2007-2022 VMware, Inc. or its affiliates.', 'information': 'Licensed under the MPL 2.0. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 24.2.1', 'product': 'RabbitMQ', 'version': '3.9.13'}, mechanisms: [b'PLAIN', b'AMQPLAIN'], locales: ['en_US']
2024-01-29 15:00:37,681 [DEBUG] amqp __init__ using channel_id: 1
2024-01-29 15:00:37,699 [DEBUG] amqp _on_open_ok Channel open
2024-01-29 15:00:37,699 [DEBUG] amqp __init__ using channel_id: 2
2024-01-29 15:00:37,730 [DEBUG] amqp _on_open_ok Channel open
2024-01-29 15:00:37,749 [INFO] sarracenia.moth.amqp _queueDeclare queue declared q_anonymous.subscriber_test2 (as: amqps://anonymous@hpfx.collab.science.gc.ca), (messages waiting: 50000)
2024-01-29 15:00:37,749 [INFO] sarracenia.moth.amqp getSetup binding q_anonymous.subscriber_test2 with v02.post.*.WXO-DD.observations.swob-ml.# to xpublic (as: amqps://anonymous@hpfx.collab.science.gc.ca)
2024-01-29 15:00:37,765 [DEBUG] sarracenia.moth.amqp getSetup getSetup ... Done!
2024-01-29 15:00:37,786 [DEBUG] sarracenia.moth.amqp getNewMessage new msg: {'_format': 'v02', '_deleteOnPost': {'source', '_format', 'exchange', 'subtopic', 'local_offset', 'ack_id'}, 'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'ALL', 'filename': 'msg_ddsr-WXO-DD_f2884f4dfeb89a44ec2ccbcc1c154702:DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'source': 'anonymous', 'mtime': '20240129T174356.779', 'atime': '20240129T174356.779', 'pubTime': '20240129T174356.779', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20240129/WXO-DD/observations/swob-ml/20240129/CXCK/2024-01-29-1743-CXCK-AUTO-minute-swob.xml', 'subtopic': ['20240129', 'WXO-DD', 'observations', 'swob-ml', '20240129', 'CXCK'], 'identity': {'method': 'md5', 'value': 'sZvG3KgpfENZc15YSMHvbQ=='}, 'size': 9326, 'exchange': 'xpublic', 'ack_id': 1, 'local_offset': 0}
2024-01-29 15:00:37,787 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8201.01 ) https://hpfx.collab.science.gc.ca /20240129/WXO-DD/observations/swob-ml/20240129/CXCK/2024-01-29-1743-CXCK-AUTO-minute-swob.xml
2024-01-29 15:00:37,787 [INFO] sarracenia.flow run now active on vip ['AnyAddressIsFine']
2024-01-29 15:00:37,788 [INFO] sarracenia.flow do_download missing destination directories, makedirs: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CXCK
2024-01-29 15:00:37,789 [DEBUG] sarracenia.config add_option []0 accelWgetCommand declared as type:<class 'str'> value:/usr/bin/wget %s -o - -O %d
2024-01-29 15:00:37,887 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CXCK/2024-01-29-1743-CXCK-AUTO-minute-swob.xml
2024-01-29 15:00:37,912 [DEBUG] sarracenia.moth.amqp getNewMessage new msg: {'_format': 'v02', '_deleteOnPost': {'source', '_format', 'exchange', 'subtopic', 'local_offset', 'ack_id'}, 'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'ALL', 'filename': 'msg_ddsr-WXO-DD_fc8051d6b19291e9b02b8da5f6fc3d2f:DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'source': 'anonymous', 'mtime': '20240129T174356.779', 'atime': '20240129T174356.779', 'pubTime': '20240129T174356.779', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20240129/WXO-DD/observations/swob-ml/20240129/CZKD/2024-01-29-1743-CZKD-AUTO-minute-swob.xml', 'subtopic': ['20240129', 'WXO-DD', 'observations', 'swob-ml', '20240129', 'CZKD'], 'identity': {'method': 'md5', 'value': 'yU3e4yc2eVtN+qwiiohaLQ=='}, 'size': 9440, 'exchange': 'xpublic', 'ack_id': 2, 'local_offset': 0}
2024-01-29 15:00:37,912 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8201.13 ) https://hpfx.collab.science.gc.ca /20240129/WXO-DD/observations/swob-ml/20240129/CZKD/2024-01-29-1743-CZKD-AUTO-minute-swob.xml
2024-01-29 15:00:37,913 [INFO] sarracenia.flow do_download missing destination directories, makedirs: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CZKD
2024-01-29 15:00:38,000 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CZKD/2024-01-29-1743-CZKD-AUTO-minute-swob.xml
2024-01-29 15:00:38,024 [DEBUG] sarracenia.moth.amqp getNewMessage new msg: {'_format': 'v02', '_deleteOnPost': {'source', '_format', 'exchange', 'subtopic', 'local_offset', 'ack_id'}, 'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'ALL', 'filename': 'msg_ddsr-WXO-DD_fe4c49c3c2cc0493ae7473d321d25199:DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'source': 'anonymous', 'mtime': '20240129T174356.779', 'atime': '20240129T174356.779', 'pubTime': '20240129T174356.779', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20240129/WXO-DD/observations/swob-ml/20240129/CVBB/2024-01-29-1743-CVBB-AUTO-minute-swob.xml', 'subtopic': ['20240129', 'WXO-DD', 'observations', 'swob-ml', '20240129', 'CVBB'], 'identity': {'method': 'md5', 'value': 'Hwu7CE6asjaQMz7veEmUXA=='}, 'size': 8399, 'exchange': 'xpublic', 'ack_id': 3, 'local_offset': 0}
2024-01-29 15:00:38,025 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8201.25 ) https://hpfx.collab.science.gc.ca /20240129/WXO-DD/observations/swob-ml/20240129/CVBB/2024-01-29-1743-CVBB-AUTO-minute-swob.xml
2024-01-29 15:00:38,025 [INFO] sarracenia.flow do_download missing destination directories, makedirs: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CVBB
2024-01-29 15:00:38,114 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CVBB/2024-01-29-1743-CVBB-AUTO-minute-swob.xml
2024-01-29 15:00:38,139 [DEBUG] sarracenia.moth.amqp getNewMessage new msg: {'_format': 'v02', '_deleteOnPost': {'source', '_format', 'exchange', 'subtopic', 'local_offset', 'ack_id'}, 'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174356', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'ALL', 'filename': 'msg_ddsr-WXO-DD_8067f0a1a5b4711ab86e481341b26590:DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174356', 'source': 'anonymous', 'mtime': '20240129T174357.781', 'atime': '20240129T174357.781', 'pubTime': '20240129T174357.781', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20240129/WXO-DD/observations/swob-ml/20240129/CWLJ/2024-01-29-1743-CWLJ-AUTO-minute-swob.xml', 'subtopic': ['20240129', 'WXO-DD', 'observations', 'swob-ml', '20240129', 'CWLJ'], 'identity': {'method': 'md5', 'value': 'uDrzi9GLNnhEgGvSylHu9g=='}, 'size': 9428, 'exchange': 'xpublic', 'ack_id': 4, 'local_offset': 0}
2024-01-29 15:00:38,140 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8200.36 ) https://hpfx.collab.science.gc.ca /20240129/WXO-DD/observations/swob-ml/20240129/CWLJ/2024-01-29-1743-CWLJ-AUTO-minute-swob.xml
2024-01-29 15:00:38,141 [INFO] sarracenia.flow do_download missing destination directories, makedirs: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CWLJ
2024-01-29 15:00:38,242 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CWLJ/2024-01-29-1743-CWLJ-AUTO-minute-swob.xml
2024-01-29 15:00:38,262 [DEBUG] sarracenia.moth.amqp getNewMessage new msg: {'_format': 'v02', '_deleteOnPost': {'source', '_format', 'exchange', 'subtopic', 'local_offset', 'ack_id'}, 'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'ALL', 'filename': 'msg_ddsr-WXO-DD_6f203257347d4f090abc1d7557864cb7:DMS:WXO_RENAMED_SWOB2:MSC:XML::20240129174355', 'source': 'anonymous', 'mtime': '20240129T174357.267', 'atime': '20240129T174357.267', 'pubTime': '20240129T174357.267', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20240129/WXO-DD/observations/swob-ml/20240129/CAMS/2024-01-29-1743-CAMS-AUTO-minute-swob.xml', 'subtopic': ['20240129', 'WXO-DD', 'observations', 'swob-ml', '20240129', 'CAMS'], 'identity': {'method': 'md5', 'value': 'H/h4jm6MTzMSp1oCeDS1jA=='}, 'size': 9826, 'exchange': 'xpublic', 'ack_id': 5, 'local_offset': 0}
2024-01-29 15:00:38,263 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8201.00 ) https://hpfx.collab.science.gc.ca /20240129/WXO-DD/observations/swob-ml/20240129/CAMS/2024-01-29-1743-CAMS-AUTO-minute-swob.xml
2024-01-29 15:00:38,263 [INFO] sarracenia.flow do_download missing destination directories, makedirs: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CAMS
2024-01-29 15:00:38,356 [INFO] sarracenia.flowcb.log after_work downloaded ok: /tmp/flow_demo/20240129/WXO-DD/observations/swob-ml/20240129/CAMS/2024-01-29-1743-CAMS-AUTO-minute-swob.xml
2024-01-29 15:00:38,357 [INFO] sarracenia.flow please_stop ok, telling 4 callbacks about it.
2024-01-29 15:00:38,357 [INFO] sarracenia.flow run starting last pass (without gather) through loop for cleanup.
2024-01-29 15:00:38,358 [INFO] sarracenia.flow please_stop ok, telling 4 callbacks about it.
2024-01-29 15:00:38,359 [INFO] sarracenia.flow run on_housekeeping pid: 3567801 subscribe/flow_demo instance: 0
2024-01-29 15:00:38,359 [INFO] sarracenia.flowcb.gather.message on_housekeeping messages: good: 5 bad: 0 bytes: 730 Bytes average: 146 Bytes
2024-01-29 15:00:38,359 [INFO] sarracenia.flowcb.retry on_housekeeping on_housekeeping
2024-01-29 15:00:38,360 [INFO] sarracenia.diskqueue on_housekeeping work_retry_00 on_housekeeping
2024-01-29 15:00:38,361 [INFO] sarracenia.diskqueue on_housekeeping No retry in list
2024-01-29 15:00:38,361 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.000548
2024-01-29 15:00:38,361 [INFO] sarracenia.diskqueue on_housekeeping post_retry_000 on_housekeeping
2024-01-29 15:00:38,362 [INFO] sarracenia.diskqueue on_housekeeping No retry in list
2024-01-29 15:00:38,362 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.000741
2024-01-29 15:00:38,363 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current Memory cpu_times: user=0.76 system=0.17
2024-01-29 15:00:38,363 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current mem usage: 790.2 MiB, accumulating count (5 or 5/100 so far) before self-setting threshold
2024-01-29 15:00:38,364 [INFO] sarracenia.flowcb.log stats version: 3.00.51rc6, started: a second ago, last_housekeeping: 1.0 seconds ago
2024-01-29 15:00:38,364 [INFO] sarracenia.flowcb.log stats messages received: 5, accepted: 5, rejected: 0 rate accepted: 100.0% or 5.0 m/s
2024-01-29 15:00:38,364 [INFO] sarracenia.flowcb.log stats files transferred: 5 bytes: 45.3 KiB rate: 45.1 KiB/sec
2024-01-29 15:00:38,365 [INFO] sarracenia.flowcb.log stats lag: average: 8200.95, maximum: 8201.25
2024-01-29 15:00:38,366 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2024-01-29 15:00:38,366 [INFO] sarracenia.flow run clean stop from run loop
2024-01-29 15:00:38,385 [DEBUG] amqp collect Closed channel #1
2024-01-29 15:00:38,386 [DEBUG] amqp collect Closed channel #2
2024-01-29 15:00:38,386 [INFO] sarracenia.flowcb.gather.message on_stop closing
2024-01-29 15:00:38,386 [INFO] sarracenia.flow close flow/close completed cleanly pid: 3567801 subscribe/flow_demo instance: 0
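After a run like the one logged above, one way to confirm what was written is to walk the download directory. The sketch below uses only the standard library; `list_downloads` is a hypothetical helper name, and it returns an empty list when the directory does not exist.

```python
from pathlib import Path


def list_downloads(root):
    """Return (relative path, size in bytes) for every file below root."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Lists whatever the subscriber left under the directory used in this notebook.
for rel_path, size in list_downloads("/tmp/flow_demo"):
    print(f"{size:>8} {rel_path}")
```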
Conclusion:
The sarracenia.flow class supports an asynchronous method of operation, which can be customized with flowcb (flow callback) classes to introduce specific processing at specific times. Running it is just like invoking a single instance from the command line, except that all configuration is done within Python by setting cfg fields rather than by using the configuration language.
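As an illustration of the kind of processing a flow callback can introduce, the sketch below rejects announcements for files above a size limit at the after_accept stage (one of the event names visible in the log output). It is deliberately a plain class so that it runs without sarracenia or a broker. In a real plugin the class would derive from the flowcb base class and receive the configuration in its constructor, and the worklist attribute names `incoming` and `rejected` are assumptions about the callback interface that should be checked against the sr3 documentation. `SizeLimit` and `max_bytes` are hypothetical names.

```python
from types import SimpleNamespace


class SizeLimit:
    """Hypothetical callback: reject announcements for files above a limit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes

    def after_accept(self, worklist):
        """Move oversized messages from worklist.incoming to worklist.rejected."""
        keep = []
        for msg in worklist.incoming:
            if msg.get("size", 0) > self.max_bytes:
                worklist.rejected.append(msg)
            else:
                keep.append(msg)
        worklist.incoming = keep


# Stand-in worklist with two messages carrying the 'size' field seen in the log.
worklist = SimpleNamespace(
    incoming=[
        {"relPath": "/a/small.xml", "size": 8399},
        {"relPath": "/a/large.xml", "size": 9826},
    ],
    rejected=[],
)
SizeLimit(max_bytes=9000).after_accept(worklist)
print([m["relPath"] for m in worklist.incoming])
print([m["relPath"] for m in worklist.rejected])
```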
What is lost vs. using the command line tool:
ability to use the configuration language (slightly simpler than assigning values to the cfg object),
easy running of multiple instances,
co-ordinated monitoring of the instances (restarts on failure, and a programmable number of subscribers started per configuration),
log file management.
The command line tool provides those additional features.