Dear stakeholders,
The notes from the special-purpose art stakeholders meeting held a few weeks ago are attached below. Because the meeting was discussion-oriented and often conversational, it is difficult to present the notes in a linear fashion. In practical terms, these are the salient points:
– an analysis-level product is desired that is not tied to the concept of event, subrun, or run.
– this product should be accumulatable so that input files with said product can be queried and an aggregate of all said products can be stored in one and only one output file.
– an accumulation function can be provided by the user.
Note that the artists’ and stakeholders’ thoughts have evolved since this meeting, and these notes serve as a representation of the thoughts expressed at that time.
Regards,
Kyle (for the artists)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Attendees:
Chris Backhouse, Kurt Biery, Martin Frank, John Freeman, Andrei Gaponenko, Steve Jones, Tom Junk, Rob Kutschke, Adam Lyon, Brian Rebel
artists: Chris Green, Kyle Knoepfel, Jim Kowalkowski, Marc Paterno
Agenda – one item: analysis support within art
Brian said he wants to get away from TTrees, as they provide no provenance tracking and no record of how people are using them. He would like some persistable data product that resides outside of event/subrun/run. This would give one the ability to see how outputs are configured differently. There is currently no place to put a full aggregation of everything needed for analysis support.
Marc wondered whether information needed to be saved at an entry-by-entry level. Since every event carries information about how it was constructed, that can be a lot of information. Brian didn’t think such granularity was necessary. Adam, however, stated that the propagation of provenance is also important. Marc said the storage of the metadata can be altered to make the memory problem less dramatic, but he still fears it will be a very large amount of undesired data.
There was a statement from NOvA proponents that the analysis product must be written/accessible for each file.
There was discussion about the lack of clarity regarding a “run” object. The framework does not enforce anything regarding subruns/runs. There’s no point where the framework knows that a run or subrun is done. One run can have its output written to many files. When you start combining data, you may start writing objects that represent the same run but have different run information, each of which corresponds to a “run fragment”. This is an idea that has been considered for some time, but no effort has been expended to implement it…it would have far-reaching effects conceptually.
There was a suggestion that each output module not be permitted to make a stream of files, but only one file.
Rob echoed the ideas proposed so far: suppose you make a file with one [analysis] data product in it. You create file1, put in the data product, and do the same thing for file2. Now, when reading them to produce file3, if the data product has an accumulate method, then you can call that and file3 will contain the accumulation. In addition, you can add new products if you like. The accumulated data is strictly what the accumulation operator produces from the predecessors.
Looking in file3, in the case where data products have an accumulate function, I will see the accumulated data products and then I may or may not see additional data products that my job may or may not have chosen to add to it.
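The file1/file2/file3 scenario above can be sketched as follows. This is only an illustration of the idea discussed, not art code: the product type CutFlowSummary, its members, and the helper accumulateAll are all hypothetical names invented for this example, and the only thing taken from the notes is that the product exposes an accumulate method.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical analysis-level product, not tied to event, subrun, or run.
struct CutFlowSummary {
  std::size_t total = 0;                      // entries examined
  std::map<std::string, std::size_t> passed;  // cut name -> entries passing

  // User-provided accumulation: merge another instance into this one.
  void accumulate(CutFlowSummary const& other) {
    total += other.total;
    for (auto const& entry : other.passed) {
      passed[entry.first] += entry.second;
    }
  }
};

// Stand-in for what a framework might do when writing file3:
// fold the products read from the predecessor files into one product.
template <typename Product>
Product accumulateAll(std::vector<Product> const& inputs) {
  Product result{};
  for (auto const& p : inputs) {
    result.accumulate(p);
  }
  return result;
}
```

Under these assumptions, the product written to file3 would be accumulateAll applied to the products found in file1 and file2; any additional products the job chooses to add would sit alongside it.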
Marc then asked: when an art file contains an analysis-level data product, at what point is that product read? There was a suggestion to introduce new interfaces, beginFile and endFile:
– beginFile means a new input file has been opened and read
– endFile means the one and only output file I’m allowed to write to is being prepared for closing
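A minimal sketch of how the two suggested callbacks might look is given below. Only the names beginFile and endFile come from the discussion; the class names, the signatures, and the toy driver runJob are assumptions made for illustration and do not describe an existing art interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical interface; the signatures are assumptions.
class FileBoundaryObserver {
public:
  virtual ~FileBoundaryObserver() = default;
  // Called after a new input file has been opened and read.
  virtual void beginFile(std::string const& inputFileName) = 0;
  // Called when the one and only output file is being prepared for closing.
  virtual void endFile() = 0;
};

// Toy observer that records which inputs were seen.
class InputTally : public FileBoundaryObserver {
public:
  void beginFile(std::string const& inputFileName) override {
    inputsSeen_.push_back(inputFileName);
  }
  void endFile() override { outputClosing_ = true; }

  std::size_t inputCount() const { return inputsSeen_.size(); }
  bool outputClosing() const { return outputClosing_; }

private:
  std::vector<std::string> inputsSeen_;
  bool outputClosing_ = false;
};

// Toy driver standing in for the framework: many inputs, one output.
void runJob(FileBoundaryObserver& observer,
            std::vector<std::string> const& inputFiles) {
  for (auto const& name : inputFiles) {
    observer.beginFile(name);
  }
  observer.endFile();
}
```

In this picture, beginFile is the natural place to read and accumulate an incoming analysis product, and endFile is the place to write the aggregate to the single output file.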
Experiments decide what things are accumulatable.
Suggestion from Marc that collaborations hold internal discussions to nail down further thoughts and use cases.
Brian requested a list of points on which the artists may still be unclear (regarding user thoughts).
Chris G. said he suspected a bifurcation, since MicroBooNE prefers having provenance with ntuples (i.e. inside .root files produced by TFileService). This suggestion was not desired by the proponents at the meeting.