[nvidia] Skip SAI discovery on ports #1416

Open
wants to merge 4 commits into
base: master

Conversation

@stepanblyschak
Contributor

stepanblyschak commented Aug 21, 2024

Modern systems have many ports, so SAI discovery takes a very long time, e.g. 8 sec for a 256-port system. This has a big impact on fast-boot downtime, and the discovery itself is not required for Nvidia platform fast-boot.

The same applies to Nvidia fastfast-boot (aka warm-boot), but that needs to be tested separately.

stepanblyschak force-pushed the skip-discovery-on-fast-boot branch from aa63de0 to 995d79b on August 22, 2024 22:04
Collaborator

kcudnik left a comment

This will lead to an inconsistency between ASIC_DB and what's on the device, which will later lead to a crash.

@@ -89,7 +89,8 @@ namespace syncd

 virtual void onPostPortCreate(
         _In_ sai_object_id_t port_rid,
-        _In_ sai_object_id_t port_vid) = 0;
+        _In_ sai_object_id_t port_vid,
+        _In_ bool discoverPortObjects = true) = 0;
Collaborator

This is very specific to ports. If we later decide to do something similar for other objects, this is not an optimal solution.

Contributor Author

The function is meant to be used on ports. Given the current approach, I assume there will be onPostXCreate() functions for other object types. If needed, they can accept a boolean flag in the same way. This is simple and gives the required granularity.

syncd/Syncd.cpp Outdated
Comment on lines 5304 to 5308
#ifdef SKIP_SAI_PORT_DISCOVERY_ON_FAST_BOOT
const bool discoverPortObjectsInFastBoot = false;
#else
const bool discoverPortObjectsInFastBoot = true;
#endif
Collaborator

Fast boot can be initiated after the code was compiled, in which case this check will be hardcoded.

Collaborator

Also, there are no tests for this code.

Contributor Author

stepanblyschak Nov 26, 2024

> Fast boot can be initiated after the code was compiled, in which case this check will be hardcoded.

This was the intention. For Nvidia, skip discovery on ports in fast boot. The runtime check for fast boot is done in the condition below.
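
For readers following this thread, here is a minimal sketch of how a compile-time platform flag can be combined with a runtime fast-boot check. The helper `shouldDiscoverPortObjects` and its `isFastBoot` parameter are hypothetical names used only for illustration; they are not taken from the PR, and the actual condition in Syncd.cpp may look different.

```cpp
#include <cassert>

// Build-time platform choice, mirroring the #ifdef quoted in this thread.
// The macro would be defined by the platform build, not at runtime.
#ifdef SKIP_SAI_PORT_DISCOVERY_ON_FAST_BOOT
constexpr bool discoverPortObjectsInFastBoot = false;
#else
constexpr bool discoverPortObjectsInFastBoot = true;
#endif

// Hypothetical helper (an assumption, not the PR's code): the runtime part
// is whether the current boot is a fast boot; the compile-time part is
// whether this platform wants port discovery during fast boot.
inline bool shouldDiscoverPortObjects(bool isFastBoot)
{
    if (!isFastBoot)
    {
        // Any other boot type keeps the default behaviour: discover.
        return true;
    }

    return discoverPortObjectsInFastBoot;
}
```

With this shape, a binary built without the macro behaves exactly as before, and the skip only takes effect when both the build flag and the runtime boot type agree.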

Collaborator

This should be a runtime check.

Contributor Author

@kcudnik What is the benefit of a runtime check here? Syncd is compiled per platform, and on Nvidia we do not want to run discovery. We know this at compile time.

@stepanblyschak
Contributor Author

stepanblyschak commented Nov 26, 2024

> This will lead to an inconsistency between ASIC_DB and what's on the device, which will later lead to a crash.

@kcudnik Yes, the current design leads to performance problems on devices with lots of ports (could be 512, 1024 and more, which means tens of thousands of keys to insert into ASIC_DB on init).
Could you suggest a test to cause a crash?

@kcudnik
Collaborator

kcudnik commented Nov 27, 2024

> > This will lead to an inconsistency between ASIC_DB and what's on the device, which will later lead to a crash.
>
> @kcudnik Yes, the current design leads to performance problems on devices with lots of ports (could be 512, 1024 and more, which means tens of thousands of keys to insert into ASIC_DB on init). Could you suggest a test to cause a crash?

Discovery is done only once, at switch create. How long does it take in your case? We need a full view of the device, since warm boot will later try to delete objects that were not discovered. At this stage it's hard to predict when the crash will happen, but it will be 99% on warm boot.
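
To make the concern concrete, the sketch below computes which objects exist on the device but are absent from the database view. It is a deliberately simplified model: `ObjectId` and `missingFromDb` are hypothetical names, and syncd's real view comparison logic is considerably more involved than a set difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>

// Hypothetical object identifier type used only for this illustration.
using ObjectId = std::uint64_t;

// Returns the objects that exist on the device but were never recorded in
// the database view, i.e. the gap that skipping discovery would create.
std::set<ObjectId> missingFromDb(
        const std::set<ObjectId>& onDevice,
        const std::set<ObjectId>& inAsicDb)
{
    std::set<ObjectId> missing;

    // std::set iterates in sorted order, which std::set_difference requires.
    std::set_difference(
            onDevice.begin(), onDevice.end(),
            inAsicDb.begin(), inAsicDb.end(),
            std::inserter(missing, missing.begin()));

    return missing;
}
```

Any non-empty result is state that later logic relying on the database view would not know about, which is the inconsistency being described here.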

@stepanblyschak
Contributor Author

stepanblyschak commented Nov 27, 2024

@kcudnik I got your point. Please note that Nvidia does not use the standard warm-reboot flow in syncd; instead it uses fast-fast-boot mode, which is quite similar to fast-boot. No comparison logic is involved on Nvidia, so in my testing I did not observe crashes even after consecutive fast-reboots and warm-reboots. Therefore, the change is limited to Nvidia only.

I attach the log of discovery during fast-reboot. I measured 4.8 sec dedicated to discovery on 202405, which is 16% of the time budget for fast-reboot. This is only a 256-port system, and we expect the time to increase linearly with the number of ports, which is growing with new HWSKUs.

logs.txt
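
The figures above can be turned into a quick back-of-the-envelope estimate. The sketch below only restates the numbers from this comment (4.8 sec, 256 ports, 16%) and applies the linear-scaling assumption stated there; the function names are hypothetical, and the larger port counts are extrapolations, not measurements.

```cpp
#include <cassert>
#include <cmath>

// Values quoted in the comment above.
constexpr double kMeasuredSeconds = 4.8;  // discovery time measured on 202405
constexpr int kMeasuredPorts = 256;       // port count of the measured system
constexpr double kBudgetShare = 0.16;     // share of the fast-reboot time budget

// Time budget implied by "4.8 sec is 16% of the budget".
constexpr double impliedBudgetSeconds()
{
    return kMeasuredSeconds / kBudgetShare;
}

// Extrapolated discovery time, assuming it grows linearly with port count.
constexpr double estimatedDiscoverySeconds(int ports)
{
    return kMeasuredSeconds * ports / kMeasuredPorts;
}
```

Under these assumptions the implied budget is 30 sec, and the estimate is 9.6 sec for 512 ports and 19.2 sec for 1024 ports, which would be roughly a third and two thirds of that budget.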

@kcudnik
Collaborator

kcudnik commented Nov 27, 2024

Logging is very time-consuming; please check without logging. Also, maybe there is room in your driver to optimize how quickly objects on ports are returned?

@kcudnik
Collaborator

kcudnik commented Nov 27, 2024

From the logs you pasted, it seems like you are creating those ports explicitly? Why don't those ports already exist when the switch is created?

@stepanblyschak
Contributor Author

> Logging is very time-consuming; please check without logging. Also, maybe there is room in your driver to optimize how quickly objects on ports are returned?

Logs are set to CRITICAL level during discovery, and there are not many logs from syncd itself. We optimized various SAI calls. However, that does not address the main issue, which is that on the Nvidia platform there is no reason to do discovery.

> From the logs you pasted, it seems like you are creating those ports explicitly? Why don't those ports already exist when the switch is created?

It's a matter of user configuration. SAI provides default ports or none, and the user can override or break out ports in CONFIG_DB. In the latter case, ports are added using the port bulk API. Either way, we would like to avoid discovering objects on fast/warm boot, as the number of objects and attributes grows with new platforms.

@kcudnik
Collaborator

kcudnik commented Nov 29, 2024

Then we need a deeper discussion here. @lguohan, can you jump in? If no warm boot will be performed on this particular platform, then we can add a feature to exclude this specific platform via a switch from a command-line option.

@lguohan
Contributor

lguohan commented Dec 4, 2024

I am a little bit confused. The PR title is about fast-boot, @kcudnik, while you are talking about warm reboot. What does this code impact?

@kcudnik
Collaborator

kcudnik commented Dec 5, 2024

It does not matter whether it is warm boot or fast boot; this will cause an inconsistency in the DB and later a crash.

@@ -948,46 +948,50 @@ void SaiSwitch::redisUpdatePortLaneMap(

 void SaiSwitch::onPostPortCreate(
         _In_ sai_object_id_t port_rid,
-        _In_ sai_object_id_t port_vid)
+        _In_ sai_object_id_t port_vid,
+        _In_ bool discoverPortObjects)
Collaborator

I feel like the entire change in this function is overcomplicated. It should be something like this:

if (object_type(oid) == SAI_OBJECT_TYPE_PORT && shouldSkipPorts)
  continue;

Contributor Author

@kcudnik
Regarding object_type(oid) == SAI_OBJECT_TYPE_PORT: the function is called onPostPortCreate, so unless someone calls it on an object other than a port, I don't think this check is needed.

Do you mean early return? Like:

redisUpdatePortLaneMap(port_rid);

if (!discoverPortObjects)
{
    return;
}

...

Collaborator

Ah, I thought you had also modified the discovery process, since it will also discover all objects on all ports. So I guess on cold boot you only need onPostPortCreate, but this could still crash on the next fast-boot.

Please do a couple of fast-boot to fast-boot reboots with your patch to see if this works.

Collaborator

Looking at that code, you only need to modify the SaiDiscovery process with a flag to ignore port discovery; no other code needs to be changed.

Collaborator

And if you look at master, in SaiDiscovery.cpp at line 34, you can pass a new flag (to not discover port objects) via the VendorSaiOptions class, so that you don't have to forward bool arguments all the way down to port discovery, if you want.

Collaborator

Also add a warning log message saying that port discovery was disabled on the Nvidia platform.

Contributor Author

@kcudnik I applied your suggestion and skipped discovery for any type of boot on Nvidia.
Tested the following reboot flow with T0 topology on Nvidia platform:

cold -> fast -> fast -> warm -> warm -> fast -> warm

I didn't add the log message, since it would be printed for every port.

Collaborator

OK, sounds good.

Collaborator

Is there a plan in the Nvidia SDK to improve the time it takes to query those port attributes from the SAI SDK?

Contributor Author

@kcudnik Yes, but not on get operations, since this is not required for fast/warm boot for us. Priority is on set operations.

Signed-off-by: Stepan Blyschak <[email protected]>
stepanblyschak force-pushed the skip-discovery-on-fast-boot branch from 56c4012 to 6bdf9bf on December 13, 2024 16:04
@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

stepanblyschak changed the title from "[nvidia] Skip SAI discovery on ports on fast-boot" to "[nvidia] Skip SAI discovery on ports" on Dec 13, 2024
kcudnik previously approved these changes Dec 14, 2024
@kcudnik
Collaborator

kcudnik commented Dec 14, 2024

You could add this message in syncd constructor
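
A minimal sketch of that suggestion follows: emit the warning once at construction time, so the per-port path stays silent. `LogSink` and `SyncdSketch` are hypothetical stand-ins written for this illustration; they do not reproduce the real Syncd class or the logging macros used in the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory log sink standing in for the real logging facility.
struct LogSink
{
    std::vector<std::string> warnings;

    void warn(const std::string& message) { warnings.push_back(message); }
};

// Hypothetical stand-in for the daemon class discussed in this thread.
class SyncdSketch
{
public:
    SyncdSketch(LogSink& log, bool skipPortDiscovery)
        : m_log(log), m_skipPortDiscovery(skipPortDiscovery)
    {
        // Logged exactly once per process start, not once per port.
        if (m_skipPortDiscovery)
        {
            m_log.warn("port object discovery is disabled on this platform");
        }
    }

    // Called once per created port; intentionally does not log.
    void onPostPortCreate() { ++m_portsCreated; }

    std::size_t portsCreated() const { return m_portsCreated; }

private:
    LogSink& m_log;
    bool m_skipPortDiscovery;
    std::size_t m_portsCreated = 0;
};
```

Placing the message in the constructor addresses the earlier objection that a message inside the port handler would be printed for every port.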

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@@ -60,6 +60,13 @@ void SaiDiscovery::discover(

sai_object_type_t ot = m_sai->objectTypeQuery(rid);

#ifdef SKIP_SAI_PORT_DISCOVERY
Collaborator

Just a small question: this check should be done at line 90, after putting the port object id into the discovered set.

Contributor Author

Fixed
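
To illustrate the ordering requested in this thread, here is a simplified recursive discovery over a toy object graph. The port itself is recorded in the discovered set first, and only then is the descent into its child objects skipped. `ObjectGraph`, `Object`, `ObjectType` and the `skipPortObjects` parameter are assumptions made for this sketch; the real SaiDiscovery::discover queries SAI attributes and uses a compile-time macro instead of a runtime flag.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy model of the object hierarchy (hypothetical, for illustration only).
enum class ObjectType { Switch, Port, Queue };

struct Object
{
    ObjectType type;
    std::vector<std::uint64_t> children;
};

using ObjectGraph = std::map<std::uint64_t, Object>;

void discover(
        const ObjectGraph& graph,
        std::uint64_t rid,
        bool skipPortObjects,
        std::set<std::uint64_t>& discovered)
{
    auto it = graph.find(rid);

    if (it == graph.end())
    {
        return; // unknown object id, nothing to record
    }

    // Record the object first, so a port is still part of the view.
    if (!discovered.insert(rid).second)
    {
        return; // already visited
    }

    // Only after the insert: stop before walking objects under the port.
    if (skipPortObjects && it->second.type == ObjectType::Port)
    {
        return;
    }

    for (std::uint64_t child : it->second.children)
    {
        discover(graph, child, skipPortObjects, discovered);
    }
}
```

If the check ran before the insert, the port ids themselves would be missing from the discovered set, which is a larger gap than skipping only the objects beneath each port.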

Signed-off-by: Stepan Blyschak <[email protected]>
@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

4 participants