Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Clean up cpp/non-constant-format #15875

Merged

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Mar 11, 2024

This PR does a bunch of cleanups to the cpp/non-constant-format as a follow-up to the changes in #15516. In particular, two things is done:

  • It removes the pointer/pointee conflation that the query made use of by sprinkling [node.asExpr(), node.asIndirectExpr] everywhere.
  • It reintroduces the cannotContainString barrier that was (accidentially) removed
  • It excludes definition-less sources outside the source prefix root from the set of sources. This should hopefully mean that a bunch of unimportant functions are no longer sources, but things like parameters of DllMain continue to be sources.

With these two three changes, we're once again able to run the query on ImageMagick/ImageMagick even with the changes in #15599


I've looked at the DCA results, and here are my findings:

  • Most of the lost results (All of the Wireshark ones) are from no longer considering external functions outside the source root as sources.
  • Some of the lost results aren't actually lost: they're deduplications caused by us no longer doing pointer/pointee conflation in the query 🎉
  • The SAMATE lost results are because we no longer do pointer/pointee conflation, and thus we're no longer getting flow through std::vector.

The last one is a bit of a shame, but it should be properly fixed once we have MaD and can model precise collection flow (I'll make sure to backlink to this PR from the internal issue). And as the MaD work is coming closer to a mergeable state every day, I think we can live with this as this PR unblocks #15599 (which gives us additional new good flow)

@github-actions github-actions bot added the C++ label Mar 11, 2024
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Mar 11, 2024
@MathiasVP MathiasVP force-pushed the bring-back-type-barriers-in-non-constant-format branch from 17d5db2 to ab6e2f9 Compare March 12, 2024 15:04
@MathiasVP MathiasVP marked this pull request as ready for review March 13, 2024 11:44
@MathiasVP MathiasVP requested a review from a team as a code owner March 13, 2024 11:44
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of points to discuss, otherwise changes LGTM. I haven't yet looked at results or performance. I assume performance is expected to be better on certain projects (like ImageMagick/ImageMagick), but unchanged in most other cases?

cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql Outdated Show resolved Hide resolved
cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql Outdated Show resolved Hide resolved
@MathiasVP
Copy link
Contributor Author

Couple of points to discuss, otherwise changes LGTM. I haven't yet looked at results or performance. I assume performance is expected to be better on certain projects (like ImageMagick/ImageMagick), but unchanged in most other cases?

Correct. ImageMagick/ImageMagick can't be run on #15599 without this change.

1. It fixes a logic error in the cannotContainString predicate.
2. It reverts the changes to the `isSource` predicate that required the external
function to be within the source root.

The change to `isSource` was meant to fix the a performance problem that occurred
because of the logic error in the cannotContainString predicate. However, now that
the logic error is fixed this is no longer necessary 🎉
@MathiasVP
Copy link
Contributor Author

@geoffw0 sorry about all the back-and-forth on this PR. 61597f5 fixes the logic error you spotted, and makes all the controversial source root changes unnecessary 🎉.

I've verified that this PR still fixes the performance problems on ImageMagick even when running on #15599.

I'm running a final DCA run now to make sure I didn't introduce any problems. I expect that the DCA run will still report some lost results because of the two latter notes I posted in the PR description. I still think that these are changes we're happy with, though.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes and DCA LGTM. I'm fairly sure we could iterate to get some of those lost results back, but there actually aren't very many lost results at the moment as a proportion of the total number of results found by the query (and as you said, some of them are duplicates anyway).

not unspecified instanceof Char32Type and
// C often defines `wchar_t` as `unsigned short`
not unspecified instanceof UnsignedShort
|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In nelson, wchar_t appears to be a typedef for int (not even unsigned int). I'm not sure if we want to try and expand this to all integral types though, or cut our losses here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch that's strange. Thanks for spotting this. Let's just get this merged now and we can revisit this at some point later if necessary. I'll create an issue for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced it's worth fixing, unless it turns out to be common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 14, 2024

Changes and DCA LGTM. I'm fairly sure we could iterate to get some of those lost results back, but there actually aren't very many lost results at the moment as a proportion of the total number of results found by the query (and as you said, some of them are duplicates anyway).

Indeed, this gets rid of tons of duplication: All the SAMATE, wireshark, ImageMagick ones are result deduplication 🎉

I totally agree that we could iterate more on this, but I also don't think it's worth it.

FWIW, the lost results on vim are all because we no longer carry taint through a single char (which I think is actually a good thing): https://github.com/vim/vim/blob/a5a1ec1826c0e43d0282ba4d35c155a97bab3e27/src/message.c#L4440

So I'd say that these are FPs that are now fixed.

@MathiasVP MathiasVP merged commit 9aefdca into github:main Mar 14, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants