Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CMonitor : Added inotify support to cmonitor. #49

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

satyabratabharati
Copy link
Contributor

@satyabratabharati satyabratabharati commented Jun 1, 2022

CmonitorLauncher :

It will perform following steps:

    • Watch all files below a directory and notify an event for changes.
    • Retrieves all the process and extract the process name "/proc//stat.
    • check the process name against the white-list given in the filter list.
    • Execute command to launch CMonitor if the process name matches with the filter.

@@ -0,0 +1,195 @@
# =======================================================================================================
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file (cmonitor_launcher.py) really deserve its own folder: tools/launcher.. tools/common-code is sort of "library" folder where we store Python code that is "importable" and usable as standalone library.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another cosmetic note: make sure to run "black" on this file to ensure consistent formatting

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need the python3 shebang:

#!/usr/bin/python3

as very first line

# - Execute command to launch CMonitor if the process name matches with the filter.
#
# =======================================================================================================
class CmonitorLauncher:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class contains both the logic to watch for inotify events and to process them. We must split these 2 parts into 2 different classes:

  1. CgroupWatcher : this class will contain the inotify_events() method but it should also provide more "value" by containing the white filter logic on the process name; in particular it should insert in the "queue" only when an event is matching all filtering criterias (to avoid unblocking "by mistake" the worker thread)

  2. CmonitorLauncher : this class will contain the launch_cmonitor() method. I think using os.system() may raise subtle issues: for example if cmonitor_launcher.py is killed, there is no way for this script to terminate all the children cmonitor processes as well. By using instead subprocess.popen() or maybe even with subprocess.run() it should be instead possible to propagate the SIGTERM to all children cmonitor instances.
    Anyway this is something that needs to be tested properly (maybe SIGTERM works also with os.system() not sure!!!)

filename = event.get()
print("In process events entry:", entry, filename)
# print("In process events", filename)
time.sleep(50)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this sleep?

return allFiles

def process_task_files(self, dir):
time.sleep(5)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this sleep?

@f18m
Copy link
Owner

f18m commented Aug 2, 2022

Hi @satyabratabharati ,

I think this MR still needs some work. Major points to raise are:

  1. log file: we need this utility to write a log file using Python logging module so that we have each event time-stamped and available for debugging inside a logfile (--log CLI option could be used to define logging destination)
  2. unit tests: it should be possible to write some unit tests for CgroupWatcher asking a test instance to watch under /tmp/cmonitor_unit_test folder and then manipulating the filesystem there (with os.mkdir and writing a "tasks" file)... check out examples under e.g. tools/filter/tests
  3. clear sleep policy: each thread pool (the "watchers" and the "launchers") need to have a clear and well defined sleep policy: they need to drain their input queues (either the Inotify list of pending events or the list of items enqueued on "queue" ) as fast as possible and then return to sleep when all inputs have been processed
  4. function-level documentation, using Python triple quotes embedded docs, at least on public methods of the classes (all non public methods should start with double underscores)
  5. a README / documentation about the whole utility, with usage examples; I typically like to embed usage example in the --help output of each utility, see e.g. cmonitor_filter.py parse_command_line() function
  6. packaging: we need the Makefile "install" target to be able to install this new utility so it will be part of "cmonitor-tools" RPM

Generally speaking I think the design is good anyway: I like the idea of using ThreadPoolExecutor to watch on a folder and then launch (synchronoulsy) cmonitor instances. I admit however that's I should study how ThreadPoolExecutor actually works because I have a lot of doubts:
a) the thread running CgroupWatcher in my opinion should be just 1: it's doing very small amount of work (using inotify recursively on a folder) and filtering out non-interesting events. Having more than 1 also is tricky: how do we ensure we never watch the same event (cgroup creation) twice?
b) is the "queue" object thread-safe (it should be since Python is using GIL but again I'm not sure)
c) why did you use "max_workers=5" ? How does ThreadPoolExecutor decide how many threads to spawn?
d) where/how do we set the limit to the max instances of "cmonitor" can be running simultaneously?

@satyabratabharati
Copy link
Contributor Author

satyabratabharati commented Aug 19, 2022

Hi @satyabratabharati ,

I think this MR still needs some work. Major points to raise are:

  1. log file: we need this utility to write a log file using Python logging module so that we have each event time-stamped and available for debugging inside a logfile (--log CLI option could be used to define logging destination)
  2. unit tests: it should be possible to write some unit tests for CgroupWatcher asking a test instance to watch under /tmp/cmonitor_unit_test folder and then manipulating the filesystem there (with os.mkdir and writing a "tasks" file)... check out examples under e.g. tools/filter/tests
  3. clear sleep policy: each thread pool (the "watchers" and the "launchers") need to have a clear and well defined sleep policy: they need to drain their input queues (either the Inotify list of pending events or the list of items enqueued on "queue" ) as fast as possible and then return to sleep when all inputs have been processed
  4. function-level documentation, using Python triple quotes embedded docs, at least on public methods of the classes (all non public methods should start with double underscores)
  5. a README / documentation about the whole utility, with usage examples; I typically like to embed usage example in the --help output of each utility, see e.g. cmonitor_filter.py parse_command_line() function
  6. packaging: we need the Makefile "install" target to be able to install this new utility so it will be part of "cmonitor-tools" RPM

Generally speaking I think the design is good anyway: I like the idea of using ThreadPoolExecutor to watch on a folder and then launch (synchronoulsy) cmonitor instances. I admit however that's I should study how ThreadPoolExecutor actually works because I have a lot of doubts: a) the thread running CgroupWatcher in my opinion should be just 1: it's doing very small amount of work (using inotify recursively on a folder) and filtering out non-interesting events. Having more than 1 also is tricky: how do we ensure we never watch the same event (cgroup creation) twice? b) is the "queue" object thread-safe (it should be since Python is using GIL but again I'm not sure) c) why did you use "max_workers=5" ? How does ThreadPoolExecutor decide how many threads to spawn? d) where/how do we set the limit to the max instances of "cmonitor" can be running simultaneously?

Hi @f18m ,

I tried to implement all the points what you have suggested. Except for the unit test case from the above list. I will work on it.

Following points have been incorporated :

  1. logger for the logging events.
    example.log
  2. Done.
  3. regarding sleep policy, added sleep when the Queue is empty and there is no events to process.
    The sleep parameter can be provided now from the command line as a parameter. The default is set to 20 sec.
    The 20 sec sleep also used before processing the files ( meaning before applying whilte-list filter logic) , just to allow the
    task files to be created before applying the filter logic and store it to the Queue.
  4. Added function level comments as well for the classes.
  5. Added as suggested.
  6. Done.

Thank You.


exit_flag = False
# =======================================================================================================
# CgroupWatcher : Basic inotify class
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class looks good except for one thing: the timeout / sleep logic: it should not be included in this class.. it doesn't really belong here. The caller may want to sleep for whatever reason by watching cgroup should be fast and imply no sleeps. So please remove the "timeout" field.

import logging
from datetime import datetime

exit_flag = False
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this flag should not be here... this class should not care about the whole app being requested to exit: it must just process the inotify events and fill a queue as fast as possible...that's it

self.path = path
self.filter = filter
self.timeout = timeout
self.myFileList = {}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.myFileList can be removed as variable

"""
self.path = path
self.filter = filter
self.timeout = timeout
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned before, remove self.timeout


"""
self.path = path
self.filter = filter
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the documentation it's not clear what is the format of self.filter honestly. Is it a list of strings? please document

process_name = parts[1].strip("()")
return process_name

def __get_pid_list(self, filename):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to __get_pid_list_from_cgroup_procs

list.append(line.strip())
return list

def __get_list_of_files(self, dir):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to __get_files_recursively
and remove mention of "event" from the description... this function doesn't deal with any event by itself... also remove "watched dir"... this function is generic


"""
# time.sleep(20)
time.sleep(self.timeout)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this sleep as mentioned before


return allFiles

def __process_task_files(self, dir):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to __check_new_cgroup_against_filter(self, cgroup_dir) and provide an example of "cgroup_dir" that this function expects

logging.info(f"CgroupWatcher event in Queue:{fileList}")
queue.put(fileList)
# global exit_flag
if exit_flag is True:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this "if"

i.add_watch(self.path)
try:
for event in i.event_gen():
if event is not None:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment like "if event is None, it means the Inotify() system has no more events that need to be processed... give control back to the caller"


"""
logging.info(f"CgroupWatcher calling inotify_event")
i = inotify.adapters.Inotify()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a waste of resource to allocate the Inotify() class every time the inotify_events() method is invoked. this instance "i" should be allocated in the ctor and stored into "self.inotify_instance" member.

exit(1)

finally:
i.remove_watch(path)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this remove_watch() should be moved into the dtor of this class

logging.info(f"Launch cMonitor with command: {monitor_cmd }")
# monitor_process = subprocess.Popen(monitor_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
monitor_process = subprocess.Popen(monitor_cmd, shell=True)
self.monitored_processes[process_name] = monitor_process
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding of this class CmonitorLauncher is that it requires the caller to provide a static assigment "process_name -> IP address/port used by cmonitor with Prometheus frontend" right?
What if the same process_name is detected in e.g. 10 container instances? I think here we will overwrite the self.monitored_processes[process_name] each time with a new subprocess.Popen() instance, leaking the previous instance.
I think we need a more flexible IP address/port allocation scheme (when launching a cmonitor_collector configured to stream data to Prometheus) or more flexible JSON output name allocation scheme (when launching a cmonitor_collector configured to save data to JSON -- probably not a concern in this first implementation).
Perhaps we can have a meeting to discuss this but I suggest to ask to the user (through CLI options) a range of IPs/ports that can be used. Then CMonitorLauncher should have a pooling mechanism that keeps track of free IPadress/port pairs and used ones. When a new cmonitor_collector is launched we ask for next free IPaddress/port. When a cmonitor_launcher completes, we must release his IPaddress/port back to the pool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants