-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
binary ninja: optimize the order of computing LLIL to partially address #2402 #2509
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased)
section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, interested to see @xusheng6 thoughts and feedback
|
||
results: list[tuple[Any[Number, OperandNumber], Address]] = [] | ||
|
||
# TODO: try to move this out of line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these function objects get created for each instruction in the program, which may be millions of times, so this may be expensive. we should profile and see if it makes any difference to pull the inner routine into a function thats defined a single time.
|
||
f: Function | ||
for f in self.bv.functions: | ||
for caller in f.callers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function.callers
doesn't filter the references by calls, just any code reference to this function. maybe this is ok. we should investigate if this loss in precision is meaningful or not.
Thx for the PR! I will have a look at it ASAP |
@williballenthin Below are some of my recent thoughts, basically I checked the places where I need to access the IL of the functions:
This should really be just implemented in binja's core analysis. In other words, binja should detect this is a stub function and treat it as such, so that the feature extractor does not need to bother with that. (Update: I just found we already have Vector35/binaryninja-api#111) Also, from your perspective, how often do you see a debug build like this when you analyze malware? If this is often enough, maybe we should actually prioritize this internally. Even before we get to a proper fix, we could first do a pass on all of the functions and calculate and cache the result of
This is actually a known issue in Vector35/binaryninja-api#3722. However, adding the extra check in the implementation of the property does not really help much, since it just moved the calculation from in capa to binja's API. A possible workaround is to relax the restriction here, and just allow things like non-calls to be included in it. The reason is, even if another funciton does not call this function explicitly, having a xref to the start of the function still means they have very close relationship, and even potentially using it as a callback or sth. If we want this, we will need to change the unit test for binja a bit to make it pass again
I am curious if we can change that a bit to be more flexible. I know other backends probably cannot tell this at the function level, but for binja, it is very straightforward |
It's not uncommon that we see this for samples we analyze manually - I'd guess 10-20%. However, looking at larger scale analysis (e.g. from last week) it's more around 1% (using the capa |
Thx for the valuable info! Also, are there any cases where the stub/thunk function take another form besides the normal one where it is just a Also, I just realized that I can actually directly get the LLIL of a particular instruction at a given address without first retrieving the IL function it belongs to. This means I no longer need to retrieve the IL of any other functions during the analysis of one function. This could bring drastic performance! Also, the stack string detection does not have to be done using MLIL. Previously, I already had an implementation of it without using the IL (just like what other backend does). This could even make the binja extractor work without accessing the IL of any function, which could make things even faster. For this, I think we can gate it behind a setting. |
did the other changes supersede this PR? |
i will rebase and see if there's any reason to continue these optimizations. without data, i'm not yet sure. |
I do not think #2511 alone can do it, but the combination of these three can probably do:
Even with all these done, it is still meaningful to see if this PR can push the performance further. However, I recommend @williballenthin to only start working on it after I have finished all changes from my side |
Hi @williballenthin , finally got some to look at your changes! It seems to me that binja: provide llil to instruction handlers via ctx implements the idea described in #2520 and it should also hopefully fix #2517. Could you please rebase that commit on top of the latest master? You can keep 999f91b and a909d02 as well if you would like to since I feel like they will be quite useful. Besides, you may wish to create a new branch because despite I think the other 3 commits are no longer useful on top of the latest master, we would better still keep it in case we want it later |
ref #2402
This PR does two things for the Binary Ninja backend:
With these two changes, against
321338196a46b600ea330fc5d98d0699.exe_
capa analysis time drops from 232s to 191s, a savings of around 18%.@xusheng6 please review.
Checklist