Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IdentifierHiding: Improve performance, address some false positives/false negatives #813

Merged
merged 21 commits into from
Dec 10, 2024

Conversation

lcartey
Copy link
Collaborator

@lcartey lcartey commented Dec 9, 2024

Description

Background

IdentifierHiding.qll has been frequently reported as one of the slowest performing queries in the Coding Standards suite. In principle, the query is trying to achieve something quite simple, which is the following:

exists(UserVariable hiddenVariable, UserVariable hidingVariable, Scope parentScope, Scope childScope |
  childScope.getAnAncestor() = parentScope
  hiddenVariable = parent.getAVariable() and
  hidingVariable = childScope.getAVariable() and
  hiddenVariable.getName() = hidingVariable.getName()
)

In words: find a pair of variables, where one is declared in a nested scope of the other, and where they have the same name. However, performance of this simple approach suffers on large codebases with a lot of name duplication, and a lot of variables declared in the global scope, because there is no ideal join order: joining by name first is expensive, as is joining by variables in a nested scope (which is the current approach, implemented in getOuterScopesOfVariable_candidate).

Updated algorithm

This PR therefore rewrites the variable hiding algorithm from scratch to avoid these performance concerns, by implementing the hiding detection using a phased approach:

  • Phase 1: compute for each scope the set of string names declared in this scope or a nested scope (Scope::isNameDeclaredInThisOrNestedScope(string name), Scope::isNameDeclaredInNestedScope(string name)), and use these to determine, for each scope, a set of variables that are potentially hidden in a nested scope (UserVariable Scope::getAPotentiallyHiddenVariable(string name)).
  • Phase 2: compute for each scope the set of UserVariables that are potentially hidden by a variable declared in this or a nested scope (UserVariable Scope::getAVariableHiddenByThisOrNestedScope(string name)).
  • Phase 3: compute for each scope a candidate set of hidden/hiding variables, where the hidden variable is declared in an outer scope, and the hiding variable is declared in this or a nested scope (Scope::hidesCandidate(UserVariable hiddenVariable, UserVariable hidingVariable, string name))

Each phase in this approach remains tractable, and avoids the large joins on either variables by name or by scope.

In making this change, I have incorporated the detection of identifier hiding directly into the scope hiding calculation. This reduces duplication and removes further opportunities for poor join ordering performance.

In the process of this performance change, I've fixed two FP/FN issues:

  • Variables declared in nested loops are now consistently identified. This is due to consistency improvements in the getParentScope calculation.
  • Variable declarations in lambda expressions now consider whether the variable they hide is in the same translation unit, reducing false positives.

Review

I would recommend a commit-wise review:

  • Preparatory commits to simplify the refactoring:
    • 27efc31 - remove unused Scope import
    • 0c92f2e - switch DCL53-CPP from hides to hidesStrict (no semantic difference for this query).
    • 90d4127 - remove hides, as it's no longer used.
  • Adjustments to getParentScope to address consistency issues:
    • 6b5021a - adds testing for the getParentScope predicate.
    • b8b4131 - fixes an issues which was causing some elements to have multiple parent scopes.
    • 1f8e2ba - address parent scope issues with loop initializers, conditions and increments.
    • dbd3fe6 - a later commit, but this fixes issues with the parents of catch blocks.
  • Initial performance adjustments:
    • 63a60b1 - improve performance of excludedViaNestedNamespace by inlining it late.
    • 1bd839c - implement a simplified version of the improved algorithm.
    • f25507e and b5ff407 - some refactoring improvements.
    • 61521e0 - refactor declaration order detection for block scope defined variables to avoid requiring transitive closure on the parent scope.
  • Integrate lambda expression hiding in the scope hiding calculation:
    • 47b3fb2 - modify the hiding algorithm to expose pairs of variables in hideCandidate. This enables as-we-go filtering, which is required to apply the lambda hiding rules.
    • 9b7e129 - implement a LambdaScope class to implement the lambda hiding rules in hidesCandidate. This is not used yet in this commit, because lambda expressions are not tied into the getParentScope hierarchy.
    • 117d0fb - capture lambda expressions in the getParentScope hierarchy.
    • 411ecde - remove the special casing of lambda expressions in IdentifierHiding.qll.

Performance testing

I used https://github.com/grpc/grpc for initial testing and development as it contains both a large number of variables with the same name and a large number of variables declared at the global scope. On my machine, before this PR, it took around 750 seconds to evaluate IdentifierHiding.ql from a cold cache, with the following slow predicates:

time evals max @ iter predicate
3m30s Scope::excludedViaNestedNamespaces/2#117d2dae@a8378djc
2m9s Scope::Scope.getADeclaration/0#dispred#900017c1#fb_10#join_rhs__Scope::UserVariable#715743a1_Variab_#shared@cfddd5k1
2m2s Scope::UserVariable#715743a1_Variable::Variable.getName/0#dispred#1b4a78c5_Variable::Variable.getNa_#shared@57fbdf1e
59.3s #Scope::Scope.getStrictParent/0#dispred#23e5f1eaPlus#sinkBound#4#2#3#2#4#Scope::Scope.getStrictPar__#higher_order_body@893f42kb
58.5s Scope::Scope.getADeclaration/0#dispred#900017c1#fb_10#join_rhs__Scope::UserVariable#715743a1_Variab_#shared#169d9cfa6
27.4s #Scope::Scope.getStrictParent/0#dispred#23e5f1eaPlus#sourceBound#4#3_Declaration::Declaration.hasSp_#higher_order_body@5f38a95e

After this PR, it takes around 45 seconds from a cold cache, with the following most expensive predicates:

time evals max @ iter predicate
7.8s 27 3.7s @ 11 Scope::Scope.getAVariableHiddenByThisOrNestedScope/1#f46b1a52@d5522xlq
6.2s 2 3.8s @ 2 Element::ElementBase.toString/0#dispred#6e016f86@277e1cso
901ms Element::Element.getFile/0#dispred#536cb5f3@f25789it

I've also run against the Top 10 C/C++ codebases on MRVA, and seen a similar performance improvement.

Change request type

  • Release or process automation (GitHub workflows, internal scripts)
  • Internal documentation
  • External documentation
  • Query files (.ql, .qll, .qls or unit tests)
  • External scripts (analysis report or other code shipped as part of a release)

Rules with added or modified queries

  • No rules added
  • Queries have been added for the following rules:
    • rule number here
  • Queries have been modified for the following rules:
    • A2-10-1
    • RULE-5-3
    • DCL53-CPP

Release change checklist

A change note (development_handbook.md#change-notes) is required for any pull request which modifies:

  • The structure or layout of the release artifacts.
  • The evaluation performance (memory, execution time) of an existing query.
  • The results of an existing query in any circumstance.

If you are only adding new rule queries, a change note is not required.

Author: Is a change note required?

  • Yes
  • No

🚨🚨🚨
Reviewer: Confirm that format of shared queries (not the .qll file, the
.ql file that imports it) is valid by running them within VS Code.

  • Confirmed

Reviewer: Confirm that either a change note is not required or the change note is required and has been added.

  • Confirmed

Query development review checklist

For PRs that add new queries or modify existing queries, the following checklist should be completed by both the author and reviewer:

Author

  • Have all the relevant rule package description files been checked in?
  • Have you verified that the metadata properties of each new query is set appropriately?
  • Do all the unit tests contain both "COMPLIANT" and "NON_COMPLIANT" cases?
  • Are the alert messages properly formatted and consistent with the style guide?
  • Have you run the queries on OpenPilot and verified that the performance and results are acceptable?
    As a rule of thumb, predicates specific to the query should take no more than 1 minute, and for simple queries be under 10 seconds. If this is not the case, this should be highlighted and agreed in the code review process.
  • Does the query have an appropriate level of in-query comments/documentation?
  • Have you considered/identified possible edge cases?
  • Does the query not reinvent features in the standard library?
  • Can the query be simplified further (not golfed!)

Reviewer

  • Have all the relevant rule package description files been checked in?
  • Have you verified that the metadata properties of each new query is set appropriately?
  • Do all the unit tests contain both "COMPLIANT" and "NON_COMPLIANT" cases?
  • Are the alert messages properly formatted and consistent with the style guide?
  • Have you run the queries on OpenPilot and verified that the performance and results are acceptable?
    As a rule of thumb, predicates specific to the query should take no more than 1 minute, and for simple queries be under 10 seconds. If this is not the case, this should be highlighted and agreed in the code review process.
  • Does the query have an appropriate level of in-query comments/documentation?
  • Have you considered/identified possible edge cases?
  • Does the query not reinvent features in the standard library?
  • Can the query be simplified further (not golfed!)

Any conflicting variable will be, by definition, in a different
scope.
There are no more consumers of hides(..).

In addition, it doesn't make sense conceptually to look for
variables in the same scope with the same name, as scopes will
prohibit using the same name in the same scope. Reviewing real
world cases where this occurs, they all seem to be extractor
oddities (multiple copies of parameters for the same function etc.)
which provides further evidence that this mode is not required.
 - Expose the internal getParentScope for testing.
 - Add test cases
We adjust the parent scope explicitly for loops, if statements and
switch statements, but, due to a logic bug, we previously retained
the existing results provided by Element.getParentScope().
All direct children of a for loop should have the for loop itself
as the scope.
Add pragma_inline to ensure we consider this as a post-filtering
step.
Improves performance by:
 - Capturing for each scope the list of names defined by nested
   scopes
 - Use that to determine hidden identifiers for a scope.
 - Separately determine the hiding identifiers for a scope.

This addresses performance issues in the now deleted predicate
getOuterScopesOfVariable_candidate().
We now tie the Handler into the TryStmt, and catch-block parameters
into the Handler for a consistent AST hierarchy.
Behaviour preserving refactor to allow future filtering of invalid
pairs of variables during the traversal algorithm.

For example, whether a variable declared within a lambda variable
hides an outer scope variable depends on the type and nature of the
variable.

By exposing pairs of candidate variables, we can more easily filter
on these conditions.
Lambda expressions have special visibility rules that affect
identifier hiding, which we incorporate into the Scope hiding
calculations.

Note: Lambda expressions are not currently tied into the parent
scope hierarchy, so this change doesn't affect calculations until
getParentScope(Element e) is extended to support them.
Lambda functions are tied into the parent statement of their
declaring lambda expression, which enables Scope's hiding
predicates to calculate hiding behaviour for lambda expressions.
This removes the special handling of lambda expressions, which
was causing performance issues.

Instead, we rely on the new behviour of the Scope library, which
calculates identifier hiding for lambda expressions as part of the
main calculation.

This has one semantic change - the new code applies
`isInSameTranslationUnit`, which reduces false positives where the
identifier "hiding" in a lambda occurred with an outer variable in
a different translation unit.
@lcartey lcartey requested a review from knewbury01 December 9, 2024 13:45
Copy link
Contributor

@knewbury01 knewbury01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lcartey just one comment for now, still reviewing

cpp/common/src/codingstandards/cpp/Scope.qll Show resolved Hide resolved
Scope no longer provides a suitable predicate for determining
variables in nested scopes.

Instead, first determine the set of conflicting names, then
identify a set of variables which are conflicting, and are hidden
within a nested scope.
Copy link
Contributor

@knewbury01 knewbury01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks for all this work

@knewbury01 knewbury01 enabled auto-merge December 10, 2024 06:12
@knewbury01 knewbury01 added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit 9f106d7 Dec 10, 2024
25 checks passed
@knewbury01 knewbury01 deleted the lcartey/scope-performance-fix branch December 10, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants