IdentifierHiding: Improve performance, address some false positives/false negatives #813
Merged
Any conflicting variable will be, by definition, in a different scope.
There are no more consumers of hides(..). In addition, it doesn't make sense conceptually to look for variables in the same scope with the same name, as scopes will prohibit using the same name in the same scope. Reviewing real world cases where this occurs, they all seem to be extractor oddities (multiple copies of parameters for the same function etc.) which provides further evidence that this mode is not required.
- Expose the internal getParentScope for testing.
- Add test cases.
We adjust the parent scope explicitly for loops, if statements and switch statements, but, due to a logic bug, we previously retained the existing results provided by Element.getParentScope().
All direct children of a for loop should have the for loop itself as the scope.
Add pragma_inline to ensure we consider this as a post-filtering step.
Improves performance by:
- Capturing for each scope the list of names defined by nested scopes.
- Using that to determine hidden identifiers for a scope.
- Separately determining the hiding identifiers for a scope.
This addresses performance issues in the now deleted predicate getOuterScopesOfVariable_candidate().
We now tie the Handler into the TryStmt, and catch-block parameters into the Handler for a consistent AST hierarchy.
Behaviour preserving refactor to allow future filtering of invalid pairs of variables during the traversal algorithm. For example, whether a variable declared within a lambda variable hides an outer scope variable depends on the type and nature of the variable. By exposing pairs of candidate variables, we can more easily filter on these conditions.
Lambda expressions have special visibility rules that affect identifier hiding, which we incorporate into the Scope hiding calculations. Note: Lambda expressions are not currently tied into the parent scope hierarchy, so this change doesn't affect calculations until getParentScope(Element e) is extended to support them.
Lambda functions are tied into the parent statement of their declaring lambda expression, which enables Scope's hiding predicates to calculate hiding behaviour for lambda expressions.
This removes the special handling of lambda expressions, which was causing performance issues. Instead, we rely on the new behaviour of the Scope library, which calculates identifier hiding for lambda expressions as part of the main calculation. This has one semantic change: the new code applies `isInSameTranslationUnit`, which reduces false positives where the identifier "hiding" in a lambda occurred with an outer variable in a different translation unit.
knewbury01
requested changes
Dec 9, 2024
@lcartey just one comment for now, still reviewing
Scope no longer provides a suitable predicate for determining variables in nested scopes. Instead, first determine the set of conflicting names, then identify a set of variables which are conflicting, and are hidden within a nested scope.
fjatWbyT
reviewed
Dec 10, 2024
fjatWbyT
reviewed
Dec 10, 2024
knewbury01
approved these changes
Dec 10, 2024
lgtm! thanks for all this work
Co-authored-by: Fernando Jose <[email protected]>
Description
Background
`IdentifierHiding.qll` has been frequently reported as one of the slowest performing queries in the Coding Standards suite. In principle, the query is trying to achieve something quite simple. In words: find a pair of variables, where one is declared in a nested scope of the other, and where they have the same name. However, performance of this simple approach suffers on large codebases with a lot of name duplication, and a lot of variables declared in the global scope, because there is no ideal join order: joining by name first is expensive, as is joining by variables in a nested scope (which is the current approach, implemented in `getOuterScopesOfVariable_candidate`).
Updated algorithm
This PR therefore rewrites the variable hiding algorithm from scratch to avoid these performance concerns, by implementing the hiding detection using a phased approach:
- Phase 1: compute for each scope the set of `string` names declared in this scope or a nested scope (`Scope::isNameDeclaredInThisOrNestedScope(string name)`, `Scope::isNameDeclaredInNestedScope(string name)`), and use these to determine, for each scope, a set of variables that are potentially hidden in a nested scope (`UserVariable Scope::getAPotentiallyHiddenVariable(string name)`).
- Phase 2: compute for each scope the set of `UserVariable`s that are potentially hidden by a variable declared in this or a nested scope (`UserVariable Scope::getAVariableHiddenByThisOrNestedScope(string name)`).
- Phase 3: compute for each scope a candidate set of hidden/hiding variables, where the hidden variable is declared in an outer scope, and the hiding variable is declared in this or a nested scope (`Scope::hidesCandidate(UserVariable hiddenVariable, UserVariable hidingVariable, string name)`).
Each phase in this approach remains tractable, and avoids the large joins on either variables by name or by scope.
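The three phases can be illustrated outside of CodeQL with a small Python sketch. This is only a model of the idea, not the QL implementation: the `Scope` class, the `(scope label, name)` tuples standing in for `UserVariable`, and all function names are hypothetical, and the sketch assumes each name is declared at most once per scope.

```python
class Scope:
    """Hypothetical scope-tree node; variables are modelled as (scope label, name)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.children = []
        self.declared = []  # names declared directly in this scope
        if parent is not None:
            parent.children.append(self)

    def declare(self, name):
        self.declared.append(name)


def names_in_this_or_nested(scope, cache):
    """Phase 1: names declared in `scope` or in any scope nested inside it."""
    if scope not in cache:
        names = set(scope.declared)
        for child in scope.children:
            names |= names_in_this_or_nested(child, cache)
        cache[scope] = names
    return cache[scope]


def potentially_hidden(scope, cache):
    """Phase 1: variables of `scope` whose name is declared again in a nested scope."""
    nested = set()
    for child in scope.children:
        nested |= names_in_this_or_nested(child, cache)
    return {(scope.label, name) for name in scope.declared if name in nested}


def hides_candidates(root):
    """Phases 2 and 3: walk down the tree, carrying only outer variables that can still be hidden."""
    cache = {}
    pairs = set()

    def visit(scope, inherited):
        here = names_in_this_or_nested(scope, cache)
        # Phase 2: keep outer variables whose name occurs in this or a nested scope.
        hidden_here = {var for var in inherited if var[1] in here}
        # Phase 3: pair each surviving outer variable with a same-named local variable.
        for hidden in hidden_here:
            if hidden[1] in scope.declared:
                pairs.add((hidden, (scope.label, hidden[1])))
        passed_down = hidden_here | potentially_hidden(scope, cache)
        for child in scope.children:
            visit(child, passed_down)

    visit(root, set())
    return pairs


# Example scope tree: global -> f -> for loop, plus a sibling function g.
global_scope = Scope("global")
f = Scope("f", global_scope)
loop = Scope("f/for", f)
g = Scope("g", global_scope)
global_scope.declare("count")
global_scope.declare("limit")
f.declare("total")
loop.declare("count")  # hides the global `count`
loop.declare("total")  # hides `total` declared in f
g.declare("total")     # sibling of f, so it hides nothing

result = hides_candidates(global_scope)
for hidden, hiding in sorted(result):
    print(hidden, "is hidden by", hiding)
```

In this model, `limit` is dropped at the global scope because no nested scope declares that name, and nothing is carried into `g` because `count` is not declared there. That pruning is what keeps the per-scope sets small compared with joining all variables by name or by nesting.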
In making this change, I have incorporated the detection of identifier hiding directly into the scope hiding calculation. This reduces duplication and removes further opportunities for poor join ordering performance.
In the process of this performance change, I've fixed two FP/FN issues:
`getParentScope` calculation.
Review
I would recommend a commit-wise review:
- `hides` to `hidesStrict` (no semantic difference for this query).
- `hides`, as it's no longer used.
- `getParentScope` to address consistency issues:
- `getParentScope` predicate.
- `excludedViaNestedNamespace` by inlining it late.
- `hidesCandidate`. This enables as-we-go filtering, which is required to apply the lambda hiding rules.
- `LambdaScope` class to implement the lambda hiding rules in `hidesCandidate`. This is not used yet in this commit, because lambda expressions are not tied into the `getParentScope` hierarchy.
- `getParentScope` hierarchy.
- `IdentifierHiding.qll`.
Performance testing
I used https://github.com/grpc/grpc for initial testing and development as it contains both a large number of variables with the same name and a large number of variables declared at the global scope. On my machine, before this PR, it took around 750 seconds to evaluate `IdentifierHiding.ql` from a cold cache. After this PR, it takes around 45 seconds from a cold cache.
I've also run against the Top 10 C/C++ codebases on MRVA, and seen a similar performance improvement.
Change request type
`.ql`, `.qll`, `.qls` or unit tests)
Rules with added or modified queries
A2-10-1
RULE-5-3
DCL53-CPP
Release change checklist
A change note (development_handbook.md#change-notes) is required for any pull request which modifies:
If you are only adding new rule queries, a change note is not required.
Author: Is a change note required?
🚨🚨🚨 Reviewer: Confirm that format of shared queries (not the .qll file, the .ql file that imports it) is valid by running them within VS Code.
Reviewer: Confirm that either a change note is not required or the change note is required and has been added.
Query development review checklist
For PRs that add new queries or modify existing queries, the following checklist should be completed by both the author and reviewer:
Author
As a rule of thumb, predicates specific to the query should take no more than 1 minute, and for simple queries be under 10 seconds. If this is not the case, this should be highlighted and agreed in the code review process.
Reviewer
As a rule of thumb, predicates specific to the query should take no more than 1 minute, and for simple queries be under 10 seconds. If this is not the case, this should be highlighted and agreed in the code review process.