Discussion: Qualitative properties and extended task definitions #994
Thank you for contacting us! I am glad to hear that you like BenchExec, and we are always happy when competitions adopt BenchExec and try to help.
Yes, from BenchExec's point of view these are mostly independent, and actually much is already possible. So I will respond below and try to clarify the overview and show what is already possible, so that you can then see where further discussion is necessary. Separate issues for the side topics would then be a good plan.

What you interpret as a dependency on SV-COMP should be only a visibility issue. BenchExec definitely shouldn't have a dependency on SV-COMP, as we want it to be a general benchmarking utility, and I believe this is already the case. It is possible to run BenchExec without any connection to SV-COMP or a similar style of tasks (for example, property files, scoring, and result classification inside BenchExec are all optional). It is just that features like score computations happen to be implemented only for the use case of SV-COMP so far. But we have, for example, tool-info modules for SMT solvers even here in the repository. You can in principle even let BenchExec do the benchmarking and handle scoring etc. outside of it (Test-Comp does this to a large degree, for example), but it would be a welcome addition to have these features inside BenchExec also for other contexts like QComp, and I would be glad to see this happen.

Another general comment: You are suggesting the possibility to add more user-defined handlers in several places. I am a little bit hesitant about going that far, because defining a stable and useful API for such things is difficult. So my generally preferred strategy would be to instead add the requested support for e.g. QComp directly to BenchExec, like we have it for SV-COMP, and not as a replaceable module. (If in the future more and more of these use cases appear and flexibility becomes more important, we can still reconsider this and then define an API based on the more extensive experience and knowledge we will have at that point.)

For distinguishing between usage in the different contexts, I suggest we continue to use the property files; then users don't need to manually specify this. This is already how BenchExec treats SV-COMP: If you provide an SV-COMP property file, then BenchExec will apply scoring as in SV-COMP, otherwise it doesn't compute scores. As you seem to also have property files in QComp, this seems like a good fit for QComp as well.
Yes, result classification and scoring for qualitative results are currently not implemented. But I don't see any blockers for adding what you describe here. At least as long as the result classification still results in something that can be mapped to correct/wrong/unknown/error plus an optional numeric score, changes should be relatively few and local. We also already have a precedent for an extension of this result classification in SV-COMP. The expected result could be represented together with the precision in the task-definition file, either as something like
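To make the discussed classification extension concrete, the following is a minimal, purely illustrative sketch (not existing BenchExec code; all names are hypothetical) of how a numeric tool result could be compared against an expected value with an allowed absolute error, yielding the usual correct/wrong/unknown categories:

```python
# Illustrative sketch only -- not part of BenchExec. It shows how a numeric
# tool result could be checked against an expected value with an allowed
# absolute error, mapping to the usual correct/wrong/unknown classification.

def classify_numeric_result(tool_output_value, expected, abs_error=0.0):
    """Return 'correct', 'wrong', or 'unknown' for a quantitative result."""
    if tool_output_value is None:
        return "unknown"  # tool did not report a value
    try:
        actual = float(tool_output_value)
    except ValueError:
        return "unknown"  # unparsable output is treated as unknown here
    return "correct" if abs(actual - expected) <= abs_error else "wrong"

# Example: expected result taken from a task definition, allowed error 1e-6
print(classify_numeric_result("0.0876782", 0.08767819037331588, abs_error=1e-6))
```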
I believe everything that you would like to do here is already possible. Additional columns, e.g., with statistics, are also possible, as long as they can be extracted from the tool's output. Their handling is just not mixed with the result handling in
Yes, BenchExec already treats all numeric columns in the same way, no matter whether they contain data created by BenchExec or not, so you immediately get statistics and plots.
Arbitrary non-numeric columns are possible; BenchExec will simply treat any column that does not only contain numbers as textual and skip it in plots and statistics. An example for all this is this table from CPAchecker's regression tests. It has arbitrary error results and further textual and non-textual columns, the latter even with and without units.
These seem to match relatively well with what the current format supports:
PS: If you happen to visit ETAPS this year, we should definitely meet there. |
I was also thinking this, but I also think right now a few things require expert knowledge / modifying the sources, which I would want to avoid. (See the point on handlers.)
Indeed! But having the benchmarking tool also directly produce plots (without needing to implement parsing on one's own) is suuuuper nice for non-expert users. So having the possibility to easily specify scoring etc. in custom code would significantly increase user-friendliness (I think).
I agree with the sentiment; I wouldn't want to go so far in one step! What I wanted to propose is that one can add several handlers only internally (i.e. with a static mapping etc.). So, if someone wants to add their own, they would modify the source and not be guaranteed to have forward compatibility. But there would be a concrete region to point them to (as in "if you want to have your own, provide a subclass of this class and put it into this lookup dictionary"). By internally disentangling things, one would already get a better view of what is required.

On a related note: I definitely agree with the visibility issue. When I wanted to use it, I was a bit deterred by this apparent alignment with SV-COMP and felt like I had to hack / mis-use things for my application. Shifting the perspective from "... and this is used for SV-COMP" to "... this is modularized and we provide the module used for SV-COMP by default" suggests much more strongly to a standard user that "I can use this for anything". Especially if the names are generic, in the sense that there wouldn't be an "svcomp" and a "qcomp" handler, but an "equality" and an "approximate" handler or something like that (which could then still be subclassed to meet specific demands). Ideally, it should eventually be possible for users to say "install standard benchexec and run it with
But they can have different formats (there is no fixed file ending) etc., so it is difficult to auto-detect them. Also, some things are specified not through property files, but directly via the command line, and how to do that differs from tool to tool. It's very messy and non-standardized :-) (which we try to tackle at a Dagstuhl meeting soon, but the mess will remain for quite some time). I would be in favor of providing a way to clearly specify it. There can be a best-effort guessing mechanism in case users do not specify it, but if I want to avoid "magic" I would like to have the option to specify what happens.
Definitely in favor of additional keys (in general, I am against magic inside strings :) ). For example, we also would want to specify relative or absolute error, or might consider statistical methods (which take two parameters). Just seems more extensible this way :)
I've seen this, but it seems a bit "magic". Migrating this to objects would feel cleaner and easier to manage/maintain I think (this is one of the cases where I would be happy to provide a draft to discuss)
Ah! I see. But this would mean parsing the tool's output n times, right? (Not that it usually matters)
Aha, the Column class had some number of digits field which led me to believe it always expects a number.
It is doable, but feels "hackish". Again, my proposal would be: Modularize the "file-to-run" parsing for now (so we can easily add new specifications internally) without presenting this as a stable API. Then it already is much easier for power-users to modify the sources, and we can gather some experience on this. (On a related note: I dislike XML and prefer YAML / JSON for such specifications, so I'd also be willing (low priority) to write a separate file->tools+runs thing, but that could happen outside of the repo.)
Unlikely, unfortunately (I didn't manage for TACAS in time), but I'm in Munich quite often :) So maybe sometime during my lecture break? (basically the last week of March + first three weeks of April) |
Sure! That's why we have it for SV-COMP.
Yes, this would come more or less automatically once we have two different result handlers. Likely we just create subclasses of
If you have any concrete suggestions in this regard, I would like to tune the documentation and description of BenchExec. But currently SV-COMP is hardly mentioned at all in the BenchExec docs, so I don't know where I could improve that.
This should be no problem. You need a separate benchmark definition for each tool anyway, and you can specify arbitrary command-line options in there for the tool.
If there is no way to auto-detect it, the property kind (or however we name it) can of course be added to the task-definition format.
I am not sure what we would gain here, but feel free to make a concrete suggestion.
Yes. In many cases, the tool outputs many values and only a few of them are to be put into the table, so I guess searching a few times for those few values is even faster than parsing everything properly and then selecting a few. But if this turns out to be a bottleneck, it would be easy to add a method that can return all values at once.
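For reference, a hedged sketch of how such a per-value lookup typically looks in a tool-info module, assuming the BaseTool2-style `get_value_from_output` hook from the tool-integration documentation; the "key: value" output labels it searches for are made up for illustration:

```python
# Sketch of a tool-info module hook that extracts one statistic per requested
# column by scanning the tool output. Assumes the BaseTool2-style API; the
# executable name and the "key: value" output format are hypothetical.
import benchexec.tools.template


class Tool(benchexec.tools.template.BaseTool2):
    def executable(self, tool_locator):
        return tool_locator.find_executable("mytool")  # hypothetical tool name

    def name(self):
        return "MyTool"

    def get_value_from_output(self, output, identifier):
        # 'identifier' is the column title requested in the benchmark definition;
        # we search the output lines for a matching "identifier: value" pair.
        prefix = identifier + ":"
        for line in output:
            if line.startswith(prefix):
                return line[len(prefix):].strip()
        return None  # column stays empty if the value was not found
```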
We want to keep the existing task-definition format, just potentially enhance it to support more use cases. So there is just one format to parse, and an internal interface for which we have only a single implementation seems to just overcomplicate things.
I guess you mean generating the benchmark definition from a more high-level description? For something like this, please have a look at https://gitlab.com/sosy-lab/benchmarking/fm-tools, which aims at collecting information about tools and competition participation. SV-COMP's benchmark definitions will at one point also be generated from the information in this repo, but the project is not SV-COMP specific.
Yes, April outside of ETAPS time would be fine and it would be great if we can make that happen. |
I just want to support Tobi's core request, i.e. make benchexec more easily usable in the probabilistic verification context. |
Sure, this is great! Thanks for advertising BenchExec! I am eager to hear what is suggested there. If any of the people there have more questions and would like to meet me in person, you can offer this for ETAPS. I will be there for the whole time, including the RRRR workshop. |
Arnd Hartmanns (Main Organizer of QComp) is co-organizing RRRR and he will be at our Dagstuhl meeting. |
Ok, let me try to summarize what I think is the situation. Then we can derive actionable items.

Point 1: Scoring

Point 2: Documentation. I didn't mean that SV-COMP is explicitly mentioned, but the workflow and design of specification files etc. has a certain "bias". This is something that will arise naturally. If I use benchexec and know that there is interest for it, I can organize my findings and eventually propose changes / extensions to the doc :)

Point 3: Generalizing verdicts from string to objects. I just think this is flexibility for free. Right now,

Point 4: Task-definition format. This is, for example, one of the things that from the outside really feels very tailored towards SV-COMP / a particular style of parametrizing executions; especially its description: it talks about systems and specifications, the expected verdict is boolean, subproperty is something with conjunction of properties, ... Most of this doesn't make sense for a lot of tools that just want to execute something. And yes, you can press other applications into this format, but as an outsider it doesn't feel nice. A concrete example: In QComp, having multiple

```yaml
- model:
    class: prism
    const:
      MAX_COUNT: 2
      T: 12
    path: models/qvbs/ctmc/embedded/embedded.prism
    properties: models/qvbs/ctmc/embedded/embedded.props
    states: 3478
    type: ctmc
  queries:
    - spec:
        expected: 0.08767819037331588
        property: actuators
        type: reach
```

(This of course is also very tailored towards my case, and it would be cleaner to even have one property key only, i.e. flatten queries.)

Some changes I would find useful (I am not saying that these should be done, this is just my personal preference for my personal use case):
I think what should be important to discuss here: Are these task-definitions supposed to be written by hand or generated by tools?

Inspired by this, one suggestion: Would it be possible to add one more task-definition format, which is very "dumb"? It effectively is a JSON that simply contains a list of all invocations that should be run, together with a) the tool's name, b) the invocation's key, and c) all relevant files that are needed inside the container (alternatively, if paths need to be re-mapped, it could contain a "templated" invocation where remapped paths are then inserted; I can elaborate on this if needed). Then, if whatever format is present is not suitable, one can simply generate such a file and use this as a base, bypassing most of the execution generation. (After all, this is the core task of benchexec: take a set of executions and run them, and this would be, I think, super nice. But I dare say that I am speaking rather as a power user here.) I do realize that this might make things tricky, because for extracting results from output a tool might need a certain amount of context, which I cannot judge right now. But I think that if that JSON file simply contains whatever the

Thinking about this, I guess I would like this a lot :) Did I forget anything? |
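To make the proposed "list of invocations" idea tangible, a hypothetical sketch written out from Python; none of these field names are an existing BenchExec format, they are invented purely for illustration:

```python
# Hypothetical sketch of the proposed "dumb" invocation-list format.
# None of these keys exist in BenchExec; they only illustrate the idea of
# a) tool name, b) invocation, c) required files, plus an optional expected result.
import json

invocations = [
    {
        "tool": "mychecker",                                          # a) tool name
        "invocation": ["mychecker", "--prop", "actuators",
                       "models/qvbs/ctmc/embedded/embedded.prism"],   # b) the run
        "files": ["models/qvbs/ctmc/embedded/embedded.prism",
                  "models/qvbs/ctmc/embedded/embedded.props"],        # c) needed files
        "expected": 0.08767819037331588,                              # optional reference result
    },
]

with open("runs.json", "w") as f:
    json.dump(invocations, f, indent=2)
```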
Ah, you want this component to be able to add other columns to the table. That can be useful, of course. And it should be easy to add. We should just do this separately and stepwise: First make the result classification depend on the
The specification (property) files are defined by SV-COMP, not BenchExec, and thus they are SV-COMP specific. But BenchExec only looks at them to decide whether it should compute SV-COMP scores, you can use property files with arbitrary content with BenchExec.
Sure, this is always welcome.
I do not yet see a convincing benefit that would result from this change. In any case, it seems pretty orthogonal to the goal of supporting the QComp use case for BenchExec, so let's please have this discussion separately. If it turns out that actually the string representation is problematic during implementation, then we can easily think about how to solve it. But right now this is hypothetical.
Absolutely, please do this. There is no need to change anything in BenchExec for this, because this is exactly what we already use for existing tools. Just have a look at this example.
If you "just want to execute something", don't use task-definition files at all, they don't provide value for that, and BenchExec doesn't require them for that. You need task-definition files if you want to tell BenchExec "here is a task that consists of some information that the tool should get as input and some information on how to interpret whether the result is correct". And of course expected verdict is currently boolean, but I have suggested already to just extend this.
Just always use one input file and property. The format doesn't force you to have more than one. Most of SV-COMP also uses only one input file, for example.
I don't understand everything in this example, but to me this looks like it would be well covered by the existing format plus an extension for quantitative expected results?
Not having this is actually on purpose, because it significantly simplifies handling of the tasks. We discussed this several years ago and had a proposal for how it could look, but the current structure was preferred in the end. For example, in the result tables you need a clear identifier for every executed run. You also would need a way to tell BenchExec which of those multiple definitions in each file it should execute, because you don't want to always execute all of them. By having only a single definition in each file, all these kinds of problems and questions are avoided.
Isn't this just the same as the existing
The property file is something that BenchExec needs to know about, so it would be treated specially anyway. I can see that what you suggest would also be possible, but currently we have the existing format and so far I haven't seen anything that wouldn't fit into the current format plus the discussed extension, so why completely redesign the format? Honestly, this part of the discussion doesn't feel 100% productive right now. I would prefer to talk about format changes with concrete use cases and examples that do not fit into the existing format.
We don't presuppose either. You can write them by hand or generate them, as you prefer.
BenchExec always needs tool-specific knowledge, e.g., for constructing the command line, and we cover this with the tool-info module. If you are suggesting something that avoids tool-info modules (I am not sure if I interpret this correctly), I don't see how this would make sense in practice. (And it would stray pretty far from the use case of BenchExec in QComp, because there you obviously need parsing of the tool output, so more tool-specific code.) If you are not suggesting to avoid tool-info modules, then we have already what you seem to want. You only need to write a benchmark definition and define the runs (input files + command-line arguments) there, no need at all for property files and task-definition files. Of course BenchExec then also cannot classify the result into correct/wrong, but everything else works. So there is no need for another format to cover the same use case. |
That's not what I meant. This was again hinting at the visibility issue. I know I can use them for something else. But at first glance this is not clear / was not clear to me. But let's skip this, I think this would be resolved by the other points.
Agree with the latter :)
It is coverable, but quite cumbersome / it does not feel correct to do it (again, the visibility point). (More below)
Isn't this a severe constraint? Why is a property file relevant? What makes it different from the system file? E.g. QComp will likely migrate to JANI, which has system and property description in one file.
Sorry if this is so undirected! I think this is partly a rubber-ducking exercise; I am only now beginning to exactly understand my issue. Also: These suggestions are motivated by my personal experience, but I am trying to anticipate what requirements others might have.
I want to be able to tell benchexec "I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result". No further information should be required.

The overall point I want to make: I do not "understand" the point of, for example, categorically differentiating between property files and input files. This seems specific to a particular use case. I get where this comes from, but it seems unnecessarily complex for the main purpose (run stuff in a controlled environment) and not canonical for most applications where I would use it. (Quite often, tools might not even speak about properties at all but still produce a result.) Yes, probably all use cases can somehow be squeezed into the current format, but it doesn't feel right to me. Maybe that's just me. But, to me, that is the reason why I got the feeling that benchexec isn't meant to be used this way.

I would expect that a standard task definition contains (i) a tool name, (ii) a set of (named) files, (iii) an optional reference result (+ how to handle it) (ideally this could be anything, i.e. also the path to a file, or a complex result like a witness strategy / certificate -- it's the job of the tool / result handler to interpret the result), and (iv) arbitrary further keys that are passed to the tool info, which is then responsible for assembling the command line (for what it's worth, point (ii) could actually be dropped and the tool info can be responsible for extracting the set of paths). (The tool info could also be required to define a unique key.) For example:

```yaml
tool: mychecker
limits:
  cpu: 1
  memory:
    amount: 1
    unit: GiB
  # etc., these are fixed
instance: # all keys below are "free form" and passed to the tool-info as is. instance is not a good name here
  system: foo.file
  properties: prop.file
  sub_property: unreach-call
  expected: false # optional, any format, to be understood by tool info
```

The tool info would then report "these two files are needed", benchexec gives the absolute paths where these files will be put, and the tool info assembles the execution, e.g.

Thinking about it, I guess all my points boil down to this. Again, I do see that this functionality is already possible with the current input format once you know how to do it. I am, however, suggesting that such a format makes benchexec easier to use. Maybe I am also just not understanding the generic |
Note that BenchExec only needs to know that the property file is the property file. It does not need to know or interpret its content. The property file can even be empty. And if the benchmarked tool doesn't accept a property file as input, BenchExec can use the property file without passing it on to the tool. So everything is possible, it is not a severe constraint. The reason for BenchExec knowing about the existence of property files is the following:
So in order to solve these requirements, we decided that task-definition files should support definitions of (model, {property -> expected verdict}). That creates the need to tell BenchExec "use this task definition and pick property X from it", so BenchExec needs a way to understand what property X is. But of course we do not want to restrict BenchExec to only certain hard-coded and known properties. This is solved by the current system: The property is defined in a file, the task definition refers to that file, and the benchmark definition (which defines which tasks are to be benchmarked) does as well. BenchExec then just needs to match the references to the property file in benchmark definition and task definition in order to select tasks. For that it needs to know what the property file is, but not more. You can think of BenchExec using the property file only as an opaque key.

Furthermore, not all tools actually support these property files, even for cases in SV-COMP where they exist and are defined by the rules. Keeping the property file separate from the rest of the input allows the tool to be benchmarked nevertheless. This is also important, for example, for taking the benchmark set from a competition with a property file and running a (potentially old) tool on it that has never heard of this property-file format. So keeping the property file separate from the model file provides more flexibility and makes more use cases possible!

Of course this use case was motivated by SV-COMP and Test-Comp with their respective sets of properties, but that doesn't mean that it is specific to them. I know some communities do not have different properties; they might use only one property all the time. But these are not excluded; the worst case is that they need a dummy property (a single file, can be empty) that is their "definition" of a property for the sake of BenchExec, and no other tool needs to know about it.
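The "opaque key" idea can be illustrated with a small conceptual sketch (this is not BenchExec's actual implementation; it only mirrors the matching described above):

```python
# Conceptual sketch (not BenchExec internals): tasks are selected by matching
# the property-file reference from the benchmark definition against the
# property files referenced in each task definition. The file's content is
# never inspected; its normalized path acts as an opaque key.
import os


def select_tasks(task_definitions, requested_property_file):
    """task_definitions: list of dicts, each with a 'properties' list of file paths."""
    wanted = os.path.normpath(requested_property_file)
    selected = []
    for task in task_definitions:
        for prop_file in task.get("properties", []):
            if os.path.normpath(prop_file) == wanted:
                selected.append(task)
                break
    return selected
```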
First,

So the main use case of

Second, as explained already, if you exclude the "I expect the following result" part of that requirement,

But how do you define the expected result if you have thousands of tasks? We could add a way to define the expected result in the benchmark definition, but that would be pretty bad. You could not use wildcards anymore; instead you would have a huge list of individual items, each with input file and expected result. But the benchmark definition is tool-specific, so it is not reusable. As soon as you would want to benchmark two tools (and this is of course extremely common), you would have to duplicate these thousands of (input file, expected result) tuples for every single tool. We do not consider this acceptable.

Furthermore, public availability of good and easy-to-use benchmark sets is good for research. And the mapping "input file -> expected result" is really part of the definition of the benchmark set, and it should be distributed together as part of the benchmark set in a way that is independent of any concrete benchmarking or benchmarked tool. So again, this mapping needs to be outside of the benchmark definition. There is an important conceptual difference between "benchmark set" and "definition of one concrete benchmarking experiment", and expected results are part of the former. (Btw. you would end up with exactly the same argument if you start thinking about how to write down the definitions of many tasks, each of which consists of several inputs, i.e., sets of sets of input files.)

So in the end, we absolutely need a way to write down many (input files, expected result) tuples somehow. This resulted in the creation of the existing task-definition format, because we are not aware of another existing format that would cover this.

Now you might say that for a use case of "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1" this creates unnecessary overhead. Yes, in principle, but compared to "(I want to run this command line, I require these files to be present) times 1" and "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1000" we expect this particular case to be relatively rare. And after all it is completely doable with BenchExec, just with a little bit more input needed.
Explained above.
Again: If you only want to "run stuff in a controlled environment", you don't even need it. You only need it if you want more than that.
No! This would lead to a huge amount of duplication of data, prevent easy tool-independent definition of benchmark sets, etc. The benchmarked tool needs to be separate from the task definition. And if you want BenchExec to check expected results, then it also needs to know about stuff like

And once you implement these two changes, you end up with almost exactly what we have with the existing format.
No! The logic for classifying results and computing scores can be quite complex; it must not be duplicated for every single tool, but implemented only once. Furthermore, many tools should be benchmarkable in more than one use case, e.g., for participation in two different competitions. Supporting all this in every tool-info module would create a quadratic number of cases. And it is really important that tool-info modules remain as small and easy to implement as possible, because even so, developers sometimes struggle with writing them (plenty of PRs in this repo are the proof). From the point of view of the competition organizer, it is also important that they do not need to trust the implementers of the tool-info modules, otherwise the organizers would need to review (and re-review on every change) all the involved tool-info modules. Right now, this is not necessary; the tool-info module cannot do anything to cheat (all the code inside it is even executed in a container).

So we are always glad to receive feedback and are open for suggestions, and we certainly want to extend BenchExec to cover new use cases and are also open for format changes. |
Your edits overlapped with my writing of the response, so my response is based on an older version of your post. I am not sure what the differences are, but I can additionally respond to the following paragraph:
Yes. Because this is really the most common case, for example in competitions or when running benchmarks with huge benchmark sets and many tools for a paper.
We expect this use case to be more rare than the above, and in particular exist only with much smaller numbers of both tasks and tools. Because if you have > 1000 tasks and > 10 tools, you would of course not think about manually defining a unique set of command-line options for every instance, right? But nevertheless, there are ways to do it with current BenchExec (for n tasks and m tools):
Of course I would recommend 2. (and it is the only practical way for a competition), but 1. and 3. are already possible. And I think that 1. and 3. are not significantly worse than the format ideas you have presented, or do you? |
Thank you for taking the time to answer! I hope my "resistance" doesn't come across in a negative way; I really appreciate the interaction, and just want to explain how I think about this / perceive it. I can follow the reasoning, but disagree with some of the points. But I have now also understood several things that weren't clear to me, which leads me to several concrete (and not so drastic) suggestions at the bottom. So bear with me :) I try to silently skip over most points that are now resolved for me. Let me start by saying that I think quite a bit of the confusion comes from the fact that how you and I interpret the meaning of a property file is subtly different (I comment on this below).
Playing devil's advocate: Well, that then really sounds like it is completely superfluous, doesn't it? "We require it, but you can always leave it empty." (I comment on that below)
Yes, but very often not in the form of a file (or a separate file). It seems weird to require that this is a file. Concrete example, it is likely that QComp will use the JANI format for inputs, then no tool will have a separate property file. All properties are written in the "model description" file. Yes, we can give every single definition the same empty dummy file, but it just seems weird then.
Yes, my use case would be to run a few thousand invocations several times each (maybe not x1000 but x5 or x10, and >20k executions in total). QComp isn't large by any means, but there are a few hundred model x property combinations and 10ish tools. We also are thinking about running the evaluation in a continuous fashion: Authors can upload a new version any time and then their new version is re-evaluated, so we also would get many repeats of that.
Yes, absolutely agree!
This is a separate point I wanted to make, sorry if this was not clear. EDIT: Ignore that, this is better solved by runexec, I now realize (also commenting on this below).
Subclassing / composition could take care of this. But it's not my main point; I don't really care if the tool definition or a separate result handler does this, it was just a side remark. (Sorry for the edits :D replying to the second point.)
I really do not think so. Command line options specific to model-property-tool-combinations often are part of the input in my experience (also see below).
Sure! I did exactly that by effectively implementing the things I discussed above. I have a few thousand instance definitions of the form roughly

EDIT: I realized this is sort of what the Python API for runexec would give me. Sorry. Feel free to ignore that part.

A side comment here: There already are a lot of domain-specific benchmark set definitions. It is much easier to export to such a low-level "front end" than it is to re-write all the definitions (which also might be much more compact and flexible than task-definitions or the benchmark.xml files; e.g. I don't see a simple way to filter out very specific execution tuples). So it might anyway be cool to have a standard format to say "please run these N executions" for runexec. So something like
Aha! I misunderstood I think we are converging on something. At least I begin to understand what is happening. I see what your point with the task definitions is. I definitely agree with the sentiment. Effectively, I have built the same thing in spirit (effectively doing Option 2). Some points I have on this:
Thinking about 2+3+4: Isn't all this

Summarizing: I propose to be able to write task-definitions like the following + allow tools to indicate that they do not support a particular query (I know that this opens some questions for how to generate tables):

```yaml
format_version: '3.0'
input_files: 'model.prism'
properties:
  - name: bounded_reach
    file: properties.props
    options:
      constants:
        T: 12 # This is a step-bound specific to this *query*, with T: 20 the result would be different
    expected_result: 0.89
  - name: mp_serve
    property:
      type: mean_payoff
      reward_structure: serve
      coalition: [ 1, 2 ]
      optimization: maximize
    expected_result: # A precise fraction
      numerator: 13254
      denominator: 12213344
  # Alternative, more flexible approach
  - name: xy # This now is the globally unique key (or unique within this file and treat (model key, property key) as globally unique)
    expected_result: 1
    definition: # After this is free form with some standardized key names
      file: properties.props
      constants: { T: 12 }
      key: property_xy # the "subproperty"
    options:
      constants:
        N: 12
      reset: true
```

I think that the intention and meaning of the keys should be clear. This would still lead to some duplication (one file for each instantiation of the model), but this is unavoidable I think. There is a subtle question, in particular for qualitative queries: Is the precision requirement (e.g. exact or approximate) part of the property or the tool configuration? I think the latter is "easier" because then the tool definition can declare a matching result handler. |
A second comment, because I find it important: Thinking more on this, I really think that runexec should be much more prominent. I daresay it appeals to a much wider audience than benchexec. Runexec is much simpler to use and (sort of) a drop-in replacement for running with

If I understand it correctly, runexec already brings most of the things I want, and benchexec is "just" some organizational stuff around it that many people probably just want to hack on their own anyway. But this really is not apparent (at least it wasn't to me). I was under the impression that in order to get the full "how to do a benchmark correctly" thing, you need benchexec. But, at least as I understand it now, this is not at all true, and benchexec is "just" runexec + a way to concisely write runs + a way to merge results into a table. The whole time, my talk about trying to disentangle things was basically an attempt to carve out what I now think runexec is already doing 🤦‍♂️

So, the above suggestions for the task format are still valid, but now have a completely different quality for me. I'm very sorry for my misunderstandings that resulted in so much confusion. |
But that is not the case. You can leave it empty only if your tool doesn't care about it or does not support it.
We know. This is why there is no requirement at all for the tool to be aware of the property file.
Consider the alternatives. BenchExec anyway needs to support the concept of property files, because there are use cases that require BenchExec to use them for picking out the expected verdict from the task-definition file and for determining which scoring schema to use. So the alternative would not be to replace property files with some "non-file property", but only to support both property files and something else. Given that not every
Even if it is weird, you certainly agree that it doesn't hurt or complicate stuff in any relevant manner, right?
No problem at all, and exactly the intended use case for
You are confirming exactly my point, even though you claim the opposite. You are not manually defining a unique set of command-line options for every single instance. You are generating the command-line options from a few specific facts for each instance. This is exactly what the tool-info module could do based on

Or in many cases you can do that just with benchmark definitions. Typically you have some command-line options that you always want to pass, some command-line options that you want to pass for a specific (set of) tasks, and some command-line options that you want to vary between (e.g., tool configurations that you want to compare or so). You can specify these three kinds of command-line options in the benchmark definition (
No, don't use If you already have benchmark definitions in your own format, then instead just generate a benchmark definition for
Of course! This is also the case in SV-COMP, for example. BenchExec never takes all properties from a task definition and runs the tool with each of them. It only takes the property that you have explicitly chosen in your benchmark definition. So if you want to benchmark a tool on tasks with two different properties, you need to specify both properties in the benchmark definition. The properties that you don't want to run the tool for should simply not appear in the benchmark definition.
There are two ways to do this already now:

Or you simply put all these property options into a file, and then simply use that as your property file in BenchExec. You are free to choose how this file looks. This is what I would recommend.

So we can discuss further whether such property options should be added to BenchExec (again, better in a separate issue for this exact question), but given the last point about keeping often-changing logic out of the tool-info module, I right now think the "property options in property file" solution is preferable.
This is possible in principle, but I am not convinced yet. Note the minimal benefit and the above discussion on "non-file properties". If you want to discuss further, please open a separate issue for this exact suggestion.
Supporting something without an explicitly mentioned property at all but with classification of results is a different question. If you want to discuss further, please open a separate issue for this exact question.
Suggestions provided above.
Given that BenchExec uses the property file as some kind of identifier, letting BenchExec know about multiple files wouldn't work well. But you don't need to. You can simply have a file with the tuple (x,y) as your property file for BenchExec.
No,
Given that the paragraph seems to be based on a misunderstanding of
Note the important distinction between those options that (a) define the query (and which should be passed to the tool) and those options that (b) define the expected verdict (and should not be passed to the tool). But yes, apart from this important distinction, there is no further relevant difference between different "property options" inside (a). This is exactly why BenchExec does not look further into the parts of (a) nor does it want to bother with parts of (a), but just lets you define all of (a) in an arbitrary file in a format you choose. By using a file here, BenchExec gives you full flexibility and freedom and is as independent as possible from the shape or contents of the property. This seems like something that you would like. :-) (b) is relevant for BenchExec's result classification, of course, and thus defined in explicitly supported ways in the task definition. As discussed above, we could consider allowing to specify parts of (a) differently than inside a file, but it seems to mostly complicate stuff.
My suggestion would be to move all parts of the property except the
Sorry, I don't understand the intention of this part at all. But my suggestion above would likely remain the same.
The question is whether the tool should get this information or not (cf. (a) and (b) above). This defines the answer. If it is (b), the tool-info module must also not get it.
Well, we hope that people use
If I understand you correctly, you really want
This is basically correct. |
Summary at the bottom
It reduces clarity. The insight that "this is required but you can just ignore it" is strange at first glance. But with the points below I am beginning to see the point. Again, it is mostly an insight thing; for me, "property file" has a very specific meaning which is different from what this
Yes and no. My point about these low-level run specifications is exactly to not manually define them, but generate them from however I generate my set of runs. In several cases this is something that I fundamentally cannot (reasonably) express in (hand written) task-definitions. (e.g. I need to filter out specific runs)
That is just a few hundred lines of Python code, no? Much less with matplotlib / seaborn. Especially if runexec outputs results in a simple format (e.g. JSON) / I use the API. I could just replace my "run this command in docker" code with "run this command with runexec" in my existing frameworks. I don't really need the table stuff of benchexec; it's a nice bonus, but I would need to do separate data analysis anyway (e.g. run this in (textual) CI or aggregate specific sets in ways that benchexec cannot possibly know about). (This is not a criticism or lack of features of benchexec, I just believe there is simply no "fits all" solution to data analysis.)

Also, I can't see how I can nicely give an ad-hoc generated set of runs to benchexec. Concrete example: I generate a base set of runs from my underlying definitions, then filter by tool support, then filter by very specific tool-model-property-constants combinations (e.g. which I know to time out and need to filter to make the evaluation feasible), then filter by certain meta-data I derive for every model-tool-property tuple, etc. (e.g. I want to say "run all of type reachability with number of states at most 10000, use these five tools and only run the combinations that are supported by all tools"). The only way I see to do that is to then dump all of these specific executions in thousands of ad-hoc generated task-definition files (I also need verification of the results!) or have these task-definitions ready and then generate the benchmark.xml. Indeed, this would be exactly what I need / want from the

(Yes, this requirement is specific and I don't ask benchexec to support it. I just think that this is more common than you might think, i.e. that people want to have full control over the set of runs instead of defining them as a "product" of tool and property definitions. Which is where runexec is probably the right thing, at least as I see it. For example, I might be interested in the effect an optimization has on the runtime and quickly run the tool on the few dozen models that relate to the change; that is where powerful filtering is helpful.)
So I need to write into the benchmark definition the precise list of all properties I want to test every time? That defeats the conciseness, i.e. reasonably writing these definitions by hand. (Concretely, that would be several thousand entries per tool; each property has a more or less globally unique name.) (Again, as soon as we are in the territory of "generate the set of benchmark definitions", I see no advantage over RunExec -- I don't mean this as a bad thing, I am happy with using it. I want to say that there are applications where I think using runexec is justified and should be advertised as such.)
I see your points and I am happy with this. It will be weird to determine which type of property file I am reading. There are different formats. But something that is doable.
Wait, I am confused. Doesn't benchexec need to know the paths of all files that are needed by the tool-property combination? I can see generating (uniquely named) "meta-files" in the style you proposed and including in there all the paths of the actually needed files. The only question now is how I can get these into the isolated container. Can a tool-info have / report additional files that need to be there? And / or could a tool-info specify files to write into the container? That would be great.
Well, in my case, whether I put the logic into tool_info.py or into a run_tool.py is basically the same code. I would prefer to keep it in tool_info simply because the tools are then as "pristine" as possible (for evaluation I write the tool_infos on my own anyway).
Ok that now really seems like a very (SV-COMP) specific thing :) Is that then sort of like a "certificate" / "witness" for a result?
So I would specify which "XY" I want from the property file by using options? But this would make properties_file a non-unique key (I have many XYs in one specification file.) I can see using meta-files for this, as mentioned above.
I see the point and agree. But I am now worried about the "unique key" thing. Should a tool info maybe be able to report a key (if needed?). (This can be solved by having a uniquely named meta-file, but again needs a way to specify additional files needed. It would be good if the tool info can report it.)
The tool must know the precision requirements. They are, in a sense, just Also, note that it is reasonable to want to run the same tool with different precision requirements on different models / properties. This would probably boil down to defining several run-definitions and then splitting up the benchmarks along that dimension, too. (This is what my CI is doing.)
... but much better than not using either because they feel it is too much work to adapt to the specific format :) I would see runexec as a way to get "hooked" because it has a (comparatively) very low entry barrier and would drastically improve the quality of results already. (Presentation is something else, but you very likely wouldn't copy-paste the html tables from benchexec into a paper anyway.) Summary: So I think we are aligning (/ you are convincing me / showing me how to do things) Points:
|
If you don't need hardware allocation, then yes.
Yes, if you want all this including classification of results, that would currently be required. But the task definitions would not need to be ad-hoc. You could for example generate all of them once and then only generate the benchmark definition that selects a subset of the task definitions. And depending on how you structure your task definitions (with systematic file names and directories) and your property files, a lot of the selection and filtering could actually be done by BenchExec without the generator needing to list all tasks individually in the benchmark definition. For example, if you represent some of your important criteria in the file name or directory structure, you can use path patterns with wildcards to include a specific subset of tasks. And if you group these patterns nicely inside That leaves the question of adding a way to define tasks with classification of results but without task-definition files. If it is still desired, e.g., because the set of task-definition files is too huge to generate, we can discuss it further. We actually had this in the past, but the expected verdict was encoded in the file name of the input file to the tool, and this got really ugly once we had more than one property and expected verdict per input file. This is the reason why we created task-definition files and we certainly don't want to go back there. But we might decide to have a way to set the expected verdict in the benchmark definition. It seems, however, that this specific question of task definitions with expected verdicts but without task-definition files is separate enough from the rest that a continuation of the discussion should be in a separate issue if desired.
If there is a 1:1 correspondence between task definition and property file in your case, you can solve this by encoding this 1:1 correspondence in the file names and the benchmark definition (think along the line of having If not, we could add a feature of allowing wildcards resolving to an arbitrary number of property files for a given task, such that you could say "take
First, everything from your host is visible in the container unless you explicitly hide it (with the exception of
As long as you do your own evaluation, there is little difference, yes. It only becomes important if you share tool-info modules. Thus it seems relevant for competitions.
Should be resolved by the explanation above.
No. It is an addition to the expected verdict that is needed to determine whether the tool output is correct or wrong, just like the allowed error interval in #996.
No, each XY is represented as one file.
The problem of
Again, this is the argument for keeping everything inside property files, such that their name is necessarily unique.
Then it seems like it is conceptually part of the property: "give me the value of X with precision Y".
The result verifier always needs to decide based on the property.
If you want to benchmark different tools, there needs to be one tool-independent way for writing down these inputs/options and a mapping layer that translates it into a tool-specific command-line argument (or whatever). BenchExec would use the tool-independent value and the mapping would either be in the tool-info module or a wrapper script (or you could in theory define the property options independently in the property file and in a tool-specific way in the benchmark definition, but I would recommend that only if you generate everything).
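A hedged sketch of what such a mapping layer could look like inside a tool-info module, assuming the BaseTool2-style cmdline hook; the task attribute names follow the tool-integration documentation as far as I recall it, and the option keys ("constants", "property") and tool flags are assumptions for illustration:

```python
# Sketch of a tool-info module translating tool-independent task options into
# tool-specific command-line arguments. Assumes the BaseTool2-style API; the
# option keys and the tool's --const/--prop flags are hypothetical.
import benchexec.tools.template


class Tool(benchexec.tools.template.BaseTool2):
    def executable(self, tool_locator):
        return tool_locator.find_executable("mychecker")  # hypothetical tool

    def name(self):
        return "MyChecker"

    def cmdline(self, executable, options, task, rlimits):
        cmd = [executable] + options  # options from the benchmark definition
        # 'options' key of the task definition (assumed to be a mapping here)
        task_options = task.options if isinstance(task.options, dict) else {}
        # Map model constants to the tool's (hypothetical) --const flag.
        for name, value in task_options.get("constants", {}).items():
            cmd += ["--const", f"{name}={value}"]
        # Map the chosen property name to the tool's (hypothetical) --prop flag.
        if "property" in task_options:
            cmd += ["--prop", str(task_options["property"])]
        return cmd + list(task.input_files_or_identifier)
```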
These precision requirements (if relevant for result classification) are simply part of the property from the point of view of BenchExec, together with that you would call the property. You don't even need several run definitions for that.
Sure.
No, please don't do that, and neither for the plots. But it is really useful during research to open the HTML table and interactively click around, filter, define plots, etc. And, because we of course all care about artifact evaluation, it is extremely cool if in your artifact readme you have a link to an HTML table that directly opens a specific view of a plot that is the same plot as in the paper, such that reviewers can see that the plot in the paper is re-creatable from the raw data, and that with the same link reviewers can re-create the same plot after re-running the experiment to see if their data matches. We have this for example in this artifact, it is the links to figures and tables in the TL;DR section of the readme. |
Sorry, I didn't mean it like that! What I meant is that writing just the result aggregation needed for one specific evaluation is not too much work, I daresay.
Oh! Yes, I definitely need that (and practically any evaluation should do this IMO). But the doc says there is

I have the luxury of not needing parallel runs, since we have few enough that I don't need this level of management. I see that allocating resources for parallel executions is completely different and even more involved. I "just" want single-core + memory limits, clean isolation (e.g. "reserving" one physical core and pinning the process to one logical core), and resource usage like peak memory.
Absolutely, exactly this is the (sole) reason why I want to use BenchExec. Getting resource isolation right is super hard.
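For exactly this "single core + memory limit + peak memory" use case, a minimal sketch of driving runexec from Python, assuming the documented RunExecutor interface (the tool command line and the chosen core are placeholders):

```python
# Minimal sketch of the runexec Python API for a single run with one pinned
# core and a memory limit. Assumes the documented RunExecutor interface;
# the command line and core number are placeholders.
from benchexec.runexecutor import RunExecutor

executor = RunExecutor()
result = executor.execute_run(
    args=["mychecker", "--prop", "actuators", "model.prism"],  # hypothetical tool call
    output_filename="run.log",          # tool output is written here
    hardtimelimit=900,                  # seconds
    memlimit=1 * 1024 * 1024 * 1024,    # bytes (1 GiB)
    cores=[0],                          # pin the run to one logical core
)

# The result dict contains measurements such as walltime, cputime, and memory
# (peak memory usage in bytes), plus a terminationreason if a limit was hit.
print(result.get("walltime"), result.get("cputime"), result.get("memory"))
```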
Being able to add a list of "tags" to the properties would be super cool and non-intrusive for this! This would not be visible by anything except the filtering mechanism. This would allow effectively arbitrary filtering on the concrete instances (by putting whatever property you are interested in as a tag) while being in a canonical place. (Proposal) Filtering can then be done either in tasks definition or additionally on the command line (with
I think it is fine, given the above. It also really makes sense to have expected verdicts in a tool-independent location. I see now the two different perspectives generic tasks and task-definition have (i.e. one is not meant as a replacement for the other, they serve different purposes).
Oh I was assuming it is opt-in, sorry! Also makes more sense, given that the "container" doesn't come with system libraries etc. Then forget these points. One neat point would still be to be able to write files specifically for the execution, but this then can be done by a wrapper script.
Aha, I understand. This is where a lot of confusion came from. For me
Wouldn't it be cleaner to then say

It could even be

```yaml
expected_verdict:
  value: False
  sub_property: xy
```

or something like that, if proper types are desired.
Agree, but this feels mainly useful for the intersection of "users that are technically adept enough to set up benchexec" and "users that are not fluent in data analysis" (which is a not-so-small set!). That's why I did not consider it important for me personally. (I personally usually follow the process of having a script that reads raw results and converts them to CSVs, and then these CSVs are rendered to plots by TikZ / pgfplots. Then even the visuals match precisely with the paper.) I think it is good if I track what I currently feel is still open, as above:
|
Memory limit is possible, and restricting to a given set of cores as well (this is what
If you want a single core, you can pick one manually and pass it to Also note that neither
Would be possible. Still a lot remains unclear, e.g., how to attach labels to property files, or whether this should be a more generic task filter or restricted to properties, and this would also be a good case for a separate issue. Precedent is #529, which got resolved by a filter on the expected verdict.
Are you aware how funny this is? ;-) Yes,
Sure, for the plots in the paper we do the same (and everyone should, we even provide scripts and templates). But writing pgfplots code for every plot that you would want to get a quick look at even before you have started to think about which plots should be in the paper is something that I definitively don't want to do. We also use the interactive tables in meetings with coauthors and even in internal presentations of ongoing work, and the audience often asks "can you please show XY" during the talk or the questions phase because everyone recognizes BenchExec tables and knows what they can do with a few clicks. :-)
I don't know what you consider resolved. For me #996 and #997 are open. The current issue is too "full" already and can serve for discussion, not as an action item for implementation. Every concrete request needs to get its own issue.
Easy, but I am unsure whether it makes sense conceptually. The tool-info module is not intended to be configuration of the benchmarking experiment, its goal is to be a (reusable and sharable) description of how to run the tool.
Cf. above, doing this purely for style would just be hassle for users.
Cf. above. |
yes, I just wanted to ask whether it is a sensible thing to talk about. Details go into a separate issue.
yes and no, these points are subtly different :)
Agree, but since the format of task definitions has to be changed anyway if one wants to support approximate verdicts, I see no reason why this shouldn't be "cleaned up" / unified in that same step. I don't want to change this solely for this matter. I find it better, style wise, to have everything related to defining the expected verdict in one place :)
Users don't need to change anything, since

But I don't have a too strong opinion here. I just think: if the format is touched anyway, why not make it nice.
I'm not arguing that the plots of benchexec aren't useful!
Things where I realize I was wrong or we agree that this is something concrete to be discussed.
And, I would argue, a description of the tool's capabilities! Example: The models I work with can have many different classes (DTMC, CTMC, MDP, CTMDP, SMG, ...) and properties can be of even more types. Some tools only support properties which are written in a specific format (even if it is of the same type) or with a particular configuration. It is possible but annoying to encode this in the filename (since every demand of every tool has to be anticipated when naming the files). Just for an example, this is a literal excerpt of the filtering code I used:

```python
def is_supported(self, model: Model, query: Query) -> bool:
    if self.engine == "expl":  # Tool specific option
        return isinstance(query, Reachability) and model.model_type in {
            ModelType.MDP,
            ModelType.DTMC,
        }
    if self.engine == "hybrid":
        return isinstance(query, Reachability)
    if isinstance(query, MeanPayoff):
        # For this tool, reward reference must be a string
        return isinstance(query.rewards, str) and model.model_type in {
            ModelType.MDP,
            ModelType.DTMC,
        }
    if isinstance(query, Reachability):
        return model.model_type != ModelType.SMG
    return False
```

This filters out more than half of the executions I would run. Example: Some tools are only run on ~12 out of >100 instances. Basically, what I want is equivalent to running the tool and then giving the result

EDIT: Also, checking if the execution actually terminates or fails with an error is not really helpful. Some tools are extensions of another for new property / model types, so to evaluate such a tool we only want to run it on the new stuff (even if it supports the old ones by delegating to the base tool's codebase). One could solve it by tagging (e.g. "toolx-supported", "tooly-supported"), but it would IMO not be nice to have tool-specific information in the task definitions. |
I understand the use case. And we do have something like this already, but more related to the "shape" of the tasks instead of the content: for example, tool-info modules for tools that do not support more than one input file will throw an exception.

For cases where "just" the property is unsupported (or the input language or whatever), I recommend not to hard-code this in the tool-info module, because it makes it more difficult to experiment with future versions of the tool that might support more features. Consider for example a developer working on a fork aiming at adding one more supported property format. They would be glad to be able to use BenchExec with the existing tool-info module.

Another argument to consider here is that we always have to be careful when we just skip something in unwanted situations, instead of making BenchExec fail (the latter is what happens right now). So, in principle, we could add something along these lines.
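As a rough sketch of that existing mechanism (assuming the `BaseTool2` API and `UnsupportedFeatureException` from `benchexec.tools.template`; the tool name and the concrete check are invented, and exact attribute names should be verified against the current documentation), a tool-info module can refuse a task whose shape it cannot handle:

```python
from benchexec.tools import template


class Tool(template.BaseTool2):
    """Sketch of a tool-info module for a hypothetical tool 'mytool'."""

    def executable(self, tool_locator):
        return tool_locator.find_executable("mytool")

    def name(self):
        return "MyTool"

    def cmdline(self, executable, options, task, rlimits):
        input_files = list(task.input_files_or_identifier)
        if len(input_files) > 1:
            # "Shape" restriction: this tool handles only a single input file.
            raise template.UnsupportedFeatureException(
                "MyTool supports only a single input file"
            )
        return [executable, *options, *input_files]
```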
But such a case is something you would certainly have to resolve by structuring your benchmark definitions and tasks appropriately. Refusing some tasks in the tool-info module although the tool would support them is certainly not recommended.
Yes.
Yes, and this might already be the case even if the property is supported but "just" in the wrong format. This needs to be hard-coded in some way, because I need to translate to the tool-specific way of asking for a property anyway. If someone were to extend the tool to support the new property, they would fundamentally need to modify the existing tool-info module. (I would argue there is, conceptually, not much difference between "shape" and "type".)
I see the point, but modifying the existing tool-info module would be reasonable in this case. I see the overall point, too, but I think that for usability either the tool-info module has to be able to filter, or a powerful filtering mechanism is needed in cases where the "compatibility matrix" is really involved. Filtering by file name / wildcard is not really nice, since, as I understand it, the idea would be to have all properties related to a system in one file (not one system-property combination per task-definition file), but each tool might only support a subset of the properties. I can see adding all the filtering-relevant properties as a set of tags and writing benchmark definitions that say something like "everything in this folder that has tags x y z". This, however, shifts the burden onto writing the task-definition files (whenever a new tool comes around that adds a new compatibility dimension, every task definition needs to be rewritten). There is one argument for explicitly treating this differently from filtering: if I can say "run every tool on every benchmark" and tools can explicitly report that they do not support something, I can see in the result tables how many things each tool did support (as opposed to "I filtered these out beforehand").
In the case where the tool-info module cannot know what command line it should produce, it simply needs to throw an appropriate exception.
I don't think so. Suppose you are the maintainer of tool X and some other developer Y is working on feature Z. Would you be in favor of the tool-info module in BenchExec being changed according to the wishes/needs of Y for Z, even if one still doesn't know whether the code from Y will get merged into X? So a tool-info module should not impose restrictions that are not technically necessary. For the rest of your comment, it seems like you are talking mostly about filtering? Let's please keep that discussion in #998.
Sure, that sounds like a good thing, right? And tools of course can already explicitly report that they do not support something - they just have to output a proper error message (which they should do anyway for the sake of their users), and the tool-info module has to transform that into a useful status value. What we discuss is just optimizing away those trivial runs. So it would actually be nice to have numbers: a concrete use case, with X runs in total, out of which one would want to filter out Y upfront, and where those Y tasks cost Z amount of resources to execute.
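As a minimal sketch of that mapping (again assuming the `BaseTool2` API; the matched message and the "(unsupported)" suffix are illustrative, and attribute names and `result` constants should be checked against the current documentation), the tool-info module could translate the tool's own error message into a telling status:

```python
from benchexec import result
from benchexec.tools import template


class Tool(template.BaseTool2):
    # executable(), name(), and cmdline() omitted for brevity.

    def determine_result(self, run):
        # Scan the tool's output for a known error message and turn it into a
        # descriptive status instead of a generic error; the text is made up.
        for line in run.output:
            if "unsupported property type" in line:
                return result.RESULT_ERROR + " (unsupported)"
        if run.exit_code.value == 0:
            return result.RESULT_DONE
        return result.RESULT_ERROR
```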
Fair point!
Oh, no, absolutely not, but I would also guess that most tool-info modules won't get merged into the benchexec distribution. I would even say they shouldn't be for most artifact evaluations. I think it is sensible if tools provide their tool-info modules in their own repositories and only the tool-info modules of major, stable tools live in benchexec. I am mainly seeing this from an artifact-evaluation perspective, not that of a recurring competition. But anyway, I do see the argument.
That would be nice, yes :D
Exactly! These can be very substantial (potentially close to 80% of the executions in particular setups). However, I see the argument in the context of wrong results. I think that concludes the broad discussion. Thanks a lot for taking your time!
Oh, one more point I just realized: I'm mainly curious about the reasoning why this is the case. It feels unnatural.
It is explicitly a feature of BenchExec to work with tool-info modules that are defined outside of its own code. But we do want to encourage people to submit tool-info modules to BenchExec for inclusion, and the bar for this is low. The reason for this is that we are just as much in favor of people using BenchExec for all kinds of experiments, and the easier it is, the more people will do it. We also have tool-info modules written by third parties (not the original tool developers), so if you perform a comparison against the state of the art for some paper, we invite you to submit your tool-info modules, as long as they do not hard-code specifics of your experiment. Competitions are of course a strong argument for having a public tool-info module, and btw. we like it very much if competitions make having a merged tool-info module a participation requirement. We also strive to help competitions in this regard by timely reviews of PRs with such modules (which often catch errors that would otherwise only have been found during pre-runs) and by coordinating releases with the competition schedule (such that a competition can point to an official release of BenchExec and declare that this is the state with which the competition was executed).
You are welcome! We are always glad when more people adopt BenchExec, and we try to help. Especially for competitions we are interested in them using BenchExec, because they are a great way to spread this best practice.
We consider one "task" to be the tuple "(input files, property, expected verdict)" (the latter two parts are optional). For the combination of one task and one tool, one "run" is created, i.e., one concrete execution of the tool. One task-definition file can thus define several tasks. I realized that our glossary did not precisely define this, but I added it now.
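For illustration, a schematic task-definition file in the current format could define two tasks at once; the file names below are invented, and the keys follow the documented format (please double-check against the current documentation):

```yaml
# tasks.yml (schematic example with invented file names)
format_version: "2.0"
input_files: "model.prism"
properties:
  - property_file: reach.prp
    expected_verdict: true
  - property_file: safety.prp
    expected_verdict: false
```

With the definitions above, this single file describes two tasks (the same input file combined with each of the two properties); benchmarking three tools against it would then result in six runs.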
EDIT: I changed the issue's title since in the discussion it became apparent that some of the points raised below are not valid or have moved to a separate issue.
This might be wishing for (too) much, but I think benchexec might profit quite a lot if it were less entangled with the specifics of SV-COMP.
For example, `result.py` makes strong assumptions about the nature of reference results and output constants: the result categories are effectively hard-coded, and expected results assume that there is a (small) finite list of possible results. This is fine for qualitative problems (yes/no questions), but makes it hard to use for quantitative tools, especially with approximation. Concretely, the "correct" result might be a float between 0 and 1, and any result within 10^-6 of the exact value is considered correct.

An abstract idea would be to provide, just as with tool-info, a "result-handler" definition which handles the "interpretation" of results. These could then be referenced by name in the benchmark definitions, so something like `<handler><name>svcomp</name></handler>` or `<handler><name>approximate</name><precision>1e-6</precision></handler>` (all non-`name` fields would be passed as kwargs to the handler's `__init__`). Same as with tool info, users should then be able to register their own handlers, and a set of standard handlers could be provided. For backwards compatibility, if that field is not specified, the standard `svcomp` handler is used. (I think it is fine to require that a correct / incorrect / unknown verdict can be given in the end.)

On a related note, I think that the result handler should also specify what the result looks like. It could be a single value, but also a dictionary of multiple values plus internal statistics. I believe that in general `determine_result` should not return a single string but a `ToolResult` object, which by default contains the status string, but potentially some further information (e.g. a detailed reason for an error, or statistics). (Note: backwards compatibility is easy to ensure here.) This would connect quite nicely to table generation: result handlers could also provide additional columns for the result table. As a concrete example, I am dealing with finite-state models, and a useful number is how many states have been constructed and/or how many iterations have been applied etc. It feels likely that it should even be possible to automatically let these custom columns flow into the plots, depending on their type. (Here, it might be good to allow for non-numeric columns; e.g., it might be nice to see any warnings that a tool produced in the table.)

I find benchexec a very, very useful project; however, using it for my cases felt like a hack because of these underlying assumptions. I want to push benchexec as the default for upcoming iterations of QComp (part of TOOLympics :-) ), and this is a major point (together with #993, but that one is less important).
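To make the proposed "result handler" a bit more tangible, here is a purely hypothetical sketch; none of these classes or hooks exist in BenchExec today, and all names (`ToolResult`, `ApproximateHandler`, `is_correct`) are invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    """Hypothetical richer result object as proposed above."""

    status: str  # e.g. "done" or "ERROR (something)"
    values: dict = field(default_factory=dict)  # e.g. {"probability": 0.4999993}
    statistics: dict = field(default_factory=dict)  # e.g. {"states": 123456}


class ApproximateHandler:
    """Hypothetical handler: a numeric result counts as correct if it lies
    within `precision` of the expected value from the task definition."""

    def __init__(self, precision=1e-6):
        self.precision = float(precision)

    def is_correct(self, expected: float, result: ToolResult) -> bool:
        actual = result.values.get("probability")
        return actual is not None and abs(actual - float(expected)) <= self.precision
```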
I realize that this is a lot of (vague) ideas packed into one issue. However, I think these points are somewhat interlinked, so it makes sense to first view them as one block. It should be possible to break this up into several smaller pieces once the picture becomes clearer. I am very happy to discuss details and/or contribute code once we have settled on something!
EDIT: One more thing (which is mostly independent, but I think it belongs to this general discussion): the task-definition files also introduce some assumptions on the structure of benchmarks. For example, PRISM input definitions are (i) a model file, (ii) a list of string key-value pairs as "model constants", (iii) a property file, and (iv) a property reference. I think it should be reasonably easy to also register a "run builder" or similar, i.e. a way to construct the list of runs from a given file.
Concretely, before these lines in `benchexec/model.py` (lines 619 to 640 in a0718af), it should be possible to add a `type` tag to the definition (e.g. `sourcefilesTag.get("type")`), look in a registry whether a handler of this type is known, and then call `create_run_from_definition(...)`. `TaskDefinitionHandler` would then be one of the instances of such a handler (backwards compatibility can again be ensured).
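And, purely as a hypothetical sketch of the proposed "run builder" hook (the handler class, its method signature, and the registry are invented; nothing like this exists in BenchExec):

```python
class PrismTaskHandler:
    """Hypothetical handler for <tasks type="prism"> entries: turns a PRISM-style
    definition (model file, model constants, property file, property reference)
    into the ingredients of a run."""

    def create_run_from_definition(self, definition: dict) -> dict:
        model = definition["model_file"]
        constants = definition.get("constants", {})  # e.g. {"N": "3", "K": "2"}
        prop_file = definition["property_file"]
        prop_name = definition["property"]
        # In the proposal this would return whatever BenchExec needs to create a
        # run; here we only collect the pieces for illustration.
        return {
            "input_files": [model, prop_file],
            "constants": constants,
            "property": prop_name,
        }


# Hypothetical registry keyed by the proposed "type" attribute:
TASK_HANDLERS = {"prism": PrismTaskHandler()}
```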