HDDS-11727. Block ozone repair when a node is running #7589

Open

sarvekshayr
Contributor

What changes were proposed in this pull request?

This PR adds a validation check to ensure that Ozone repair commands are only executed when the relevant services (e.g., OM, SCM, Datanode) are stopped.

  • A new check_running_ozone_services shell function scans for PID files of running Ozone services.
  • The function exports the list of detected running services in the OZONE_RUNNING_SERVICES environment variable.
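
For readers unfamiliar with PID-file detection, the scan-and-export step can be sketched roughly as follows. This is a hypothetical illustration, not the function from this patch: the function name, the default PID directory (/tmp), the PID file naming pattern (ozone-<ident>-<service>.pid), the space-separated list format, and the liveness probe with kill -0 are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the code from this PR.
# Assumptions: PID files live in $OZONE_PID_DIR (default /tmp here), are named
# ozone-<ident>-<service>.pid, and the result is a space-separated list.

check_running_ozone_services_sketch() {
  local pid_dir="${OZONE_PID_DIR:-/tmp}"
  local running=""
  local service pid_file pid

  for service in om scm datanode; do
    for pid_file in "${pid_dir}"/ozone-*-"${service}".pid; do
      # An unmatched glob stays literal, so skip anything that is not a file.
      [ -f "${pid_file}" ] || continue
      pid="$(cat "${pid_file}" 2>/dev/null)"
      # Ignore empty or non-numeric file contents.
      case "${pid}" in
        ''|*[!0-9]*) continue ;;
      esac
      # A stale PID file does not prove a live process, so probe it.
      # Note: kill -0 also fails for processes owned by another user.
      if kill -0 "${pid}" 2>/dev/null; then
        running="${running:+${running} }${service}"
        break
      fi
    done
  done

  export OZONE_RUNNING_SERVICES="${running}"
}
```

Probing the PID instead of trusting the file alone reduces false positives from PID files left behind by a crashed service, although it cannot rule out PID reuse by an unrelated process.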

  • Subcommands under ozone repair om now validate whether OM is running on the host by reading the OZONE_RUNNING_SERVICES environment variable.
  • If OM is detected as running, the command fails and displays an error message prompting the user to stop the OM service before proceeding.
  • A --force flag lets the user override this check, which is useful for false positives such as co-located services during testing.
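
The decision logic described in the bullets above can be sketched in shell as follows. The patch performs this check inside the repair subcommands themselves, so this is only an illustration of the control flow: the function name, its arguments, and the assumption that OZONE_RUNNING_SERVICES holds a space-separated list of lowercase service names are hypothetical, and the messages are modeled on the sample output in the testing section.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the decision logic only, NOT the code from this PR.
# Assumption: OZONE_RUNNING_SERVICES is a space-separated list such as "om scm".

# Returns 0 when the repair may proceed, 1 when it must be blocked.
#   $1: service name as it appears in OZONE_RUNNING_SERVICES (e.g. "om")
#   $2: "true" when the user passed --force (defaults to "false")
repair_allowed_sketch() {
  local service="$1"
  local force="${2:-false}"

  # Pad both sides with spaces so "om" cannot match inside a longer name.
  case " ${OZONE_RUNNING_SERVICES:-} " in
    *" ${service} "*) ;;   # listed as running: fall through to the checks
    *) return 0 ;;         # not listed: nothing blocks the repair
  esac

  if [ "${force}" = "true" ]; then
    echo "Warning: --force flag used. Proceeding despite ${service} being detected as running." >&2
    return 0
  fi

  echo "Error: ${service} is currently running on this host. Stop the ${service} service before running the repair tool." >&2
  return 1
}
```

A caller would run the repair only when the function succeeds, for example `repair_allowed_sketch om "${force}" || exit 1`.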

What is the link to the Apache JIRA

HDDS-11727

How was this patch tested?

OM is detected as running

When OM is detected as running, the repair command throws an error:

sarvekshayr@Sarvekshas-MacBook-Pro ozone % docker-compose exec om bash
bash-5.1$ ozone repair om fso-tree --db /data/metadata/om.db
ATTENTION: Running as user hadoop. Make sure this is the same user used to run the Ozone process. Are you sure you want to continue (y/N)? y
Run as user: hadoop
Error: OM is currently running on this host. Stop the OM service before running the repair tool.

Using --force to override the check

When the --force flag is used, the repair command proceeds despite OM being detected as running:

bash-5.1$ ozone repair om fso-tree --db /data/metadata/om.db --force
ATTENTION: Running as user hadoop. Make sure this is the same user used to run the Ozone process. Are you sure you want to continue (y/N)? y
Run as user: hadoop
Warning: --force flag used. Proceeding despite OM being detected as running.
FSO Repair Tool is running in debug mode
Creating database of reachable directories at /data/metadata/reachable.db
Processing volume: /s3v

Reachable:
        Directories: 0
        Files: 0
        Bytes: 0
Unreachable:
        Directories: 0
        Files: 0
        Bytes: 0
Unreferenced:
        Directories: 0
        Files: 0
        Bytes: 0

@sarvekshayr
Contributor Author

@errose28 please review this patch.

@adoroszlai adoroszlai added the tools Tools that helps with debugging label Dec 18, 2024
@errose28 errose28 self-requested a review December 18, 2024 16:00
Contributor

@adoroszlai adoroszlai left a comment

Thanks @sarvekshayr for the patch.

The change in the shell script looks good (but I haven't tested it).

The check for service state should be more generic to support other types like Datanode and SCM. It should also be exposed to other repair subcommands.

In HDDS-11946 I plan to add a parent class for repair subcommands, I think that would be a good place to add this check. If you add the check here, I'll move it, along with the --force flag, to the parent class.

@sarvekshayr
Contributor Author

sarvekshayr commented Dec 19, 2024

> The check for service state should be more generic to support other types like Datanode and SCM. It should also be exposed to other repair subcommands.

Sure, we can add a generic method to check for specific running services.

> In HDDS-11946 I plan to add a parent class for repair subcommands, I think that would be a good place to add this check. If you add the check here, I'll move it, along with the --force flag, to the parent class.

That sounds good. Depending on which PR gets merged first, we can coordinate to make the necessary adjustments.
