
UserCodeCreation

Attila Krasznahorkay Jr.

Introduction

The idea behind basically any framework is that users can take advantage of the features it offers without having to touch the framework's code itself. The scheme is the same with SFrame. While editing the files under SFrame/user is fine for first tests, it is not the model for developing a full-blown analysis.

This page lists the "official" suggestions for creating a user analysis package, creating the code for it, and using some of the basic features of SFrame. The page only serves as a tutorial; detailed explanations of the listed features may be found on other pages. For more advanced features, have a look at page [AdvancedFeatures].

Code layout

Let's assume that you checked out the SFrame code into a directory called $(HOME)/Analysis/SFrame/. This means that the compiled libraries and executables from the package ended up in the $(HOME)/Analysis/SFrame/lib and $(HOME)/Analysis/SFrame/bin directories, respectively.

Now you should create your own "SFrame package" to hold your analysis code. To make it easier, a script called sframe_new_package.sh is provided. To use it, go to the $(HOME)/Analysis/ directory, and type:

sframe_new_package.sh MyAnalysis

This will create a new directory called $(HOME)/Analysis/MyAnalysis/, and fill it with the directories/files that will serve as the skeleton for the new package. Notice that the new package has the same layout as the SFrame/user directory. To easily add a new skeleton cycle to the package, the script sframe_create_cycle.py is provided. Go to the directory $(HOME)/Analysis/MyAnalysis/ and execute the following:

sframe_create_cycle.py -n MyCycle -l include/MyAnalysis_LinkDef.h

This will create the files $(HOME)/Analysis/MyAnalysis/include/MyCycle.h and $(HOME)/Analysis/MyAnalysis/src/MyCycle.cxx. It also adds a line to $(HOME)/Analysis/MyAnalysis/include/MyAnalysis_LinkDef.h, instructing CINT to generate a dictionary for this cycle.

At this point the package is basically ready for compilation. Go to the main directory of the package ($(HOME)/Analysis/MyAnalysis/) and simply execute:

make

The created library will be put alongside the other SFrame libraries, in the directory $(HOME)/Analysis/SFrame/lib/. This might seem counter-intuitive at first, but this organization has served us well so far. Notice that you are not restricted to using only one package. A full-grown analysis code area could for instance look something like this:

$(HOME)/Analysis/SFrame/
$(HOME)/Analysis/CommonTools/
$(HOME)/Analysis/SelectionCycles/
$(HOME)/Analysis/AnalysisCycles/

The packages are even allowed to use code from each other. The only thing to keep in mind in this case is to properly load all the needed libraries/packages in the SFrame jobs. This usually also means loading the libraries/packages in the correct order. (If Pkg2 uses Pkg1, then Pkg1 has to be declared first in the configuration XML.)
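As an illustration of the loading order, assuming two hypothetical packages Pkg1 and Pkg2 (with Pkg2 using code from Pkg1), the declarations in the configuration XML could look like this; the library and package names are purely illustrative:

```xml
<!-- Pkg1 must be declared before Pkg2, since Pkg2 uses code from Pkg1: -->
<Library Name="libPkg1" />
<Library Name="libPkg2" />
<!-- The corresponding PAR packages for PROOF mode, in the same order: -->
<Package Name="Pkg1.par" />
<Package Name="Pkg2.par" />
```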

Implementation of a cycle

Here we explain briefly how to add code to the cycle MyCycle created in the last step. All analysis cycles have to implement the ISCycleBase interface. The easiest way to do this is to make the user cycle inherit from SCycleBase. Notice that since SFrame only requires the user cycle to implement the ISCycleBase interface and not the SCycleBase one, it's possible to extend the functionality of SFrame by introducing new base classes that add more features. For an example of this, have a look at the [SFrameARA] page.

The helper script created the user cycle so that it inherits from SCycleBase. First off, let's review what the different virtual functions of the SCycleBase class do:

  • virtual void BeginCycle() throw( SError ): Function called once at the beginning of executing the cycle, before the first InputData block is "opened". You can use it to perform an initial configuration of the cycle. For instance, if the cycle needs to read some local file for information (good data ranges for example), that is best done here. The function is always executed in the sframe_main process, even when running in PROOF mode.
  • virtual void EndCycle() throw( SError ): Function called once at the end of the cycle execution. Any finalisation steps should be done here. (Closure of some helper files opened by the user code for instance.) This function is again called in the sframe_main process.
  • virtual void BeginMasterInputData( const SInputData& ) throw( SError ): Function called before processing each InputData block, on the master PROOF node. For more information about the PROOF functionality, have a look at the page [SFramePROOF].
  • virtual void EndMasterInputData( const SInputData& ) throw( SError ): Function called after the processing of one InputData block has finished, on the master PROOF node. Notice that the PROOF master node receives the full statistics information from the InputData at this point, so this is a good place to print summaries, do final calculations on the created histograms (for instance fitting them), etc. For more information about the PROOF functionality, have a look at the page [SFramePROOF].
  • virtual void BeginInputData( const SInputData& ) throw( SError ): Function called on the PROOF worker nodes once before processing each of the input data types. SFrame creates one output file per input data type. If you need to initialise output objects (histograms, etc.) before the event-by-event execution, you should do that here. The declaration of the output variables also has to be done here.
  • virtual void EndInputData( const SInputData& ) throw( SError ): Function called last on the PROOF worker nodes before the processing of the input data type is finished. Notice that in this function the code can only access the statistics processed by the one worker node, so most post-processing of the output objects is better placed in the EndMasterInputData(...) function.
  • virtual void BeginInputFile( const SInputData& ) throw( SError ): Function called for each new input file. Your input variables have to be connected to the input branches in this function. (More on this later.)
  • virtual void ExecuteEvent( const SInputData&, Double_t ) throw( SError ): This is the main analysis function, called for each event. It receives the weight of the event, as calculated by the framework from the luminosities and generator cuts defined in the XML configuration.
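As a rough sketch of how these hooks fit together in a user cycle, consider the outline below. The stand-in base class exists only so the snippet is self-contained outside of SFrame; in a real package MyCycle inherits from SFrame's SCycleBase, and the real signatures carry the throw( SError ) specifications listed above:

```cpp
// Minimal stand-ins for the SFrame types, only so this sketch compiles on
// its own. In a real package these come from the SFrame headers.
class SInputData {};
class SCycleBaseSketch {
public:
   virtual ~SCycleBaseSketch() {}
   virtual void BeginCycle() {}
   virtual void EndCycle() {}
   virtual void BeginInputData( const SInputData& ) {}
   virtual void EndInputData( const SInputData& ) {}
   virtual void BeginInputFile( const SInputData& ) {}
   virtual void ExecuteEvent( const SInputData&, double ) {}
};

// Outline of a user cycle, showing where each piece of logic belongs:
class MyCycle : public SCycleBaseSketch {
public:
   MyCycle() : m_events( 0 ) {}

   void BeginCycle() override { /* read auxiliary files, one-time setup */ }
   void BeginInputData( const SInputData& ) override {
      // Book histograms and declare output variables here.
   }
   void BeginInputFile( const SInputData& ) override {
      // Connect the input variables to the input tree branches here.
   }
   void ExecuteEvent( const SInputData&, double weight ) override {
      // Per-event analysis; 'weight' is the luminosity weight of the event.
      ( void ) weight;
      ++m_events;
   }
   void EndInputData( const SInputData& ) override { /* per-worker cleanup */ }
   void EndCycle() override { /* final cleanup */ }

   int ProcessedEvents() const { return m_events; }

private:
   int m_events; // illustrative counter, not part of the SFrame interface
};
```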

NTuple reading and writing

The branches of the input and output ntuples are handled individually. This means that for each input branch that you want to use in your analysis, you have to declare a variable (a "simple" variable in case of primitives like Int_t or Double_t, or a pointer in case of STL containers) and connect this variable to the appropriate branch in the BeginInputFile(...) function. The function for connecting variables to an input branch is:

template< typename T >
void ConnectVariable( const char* treeName, const char* branchName, T& variable ) throw( SError )

Let's say you have a branch in your input tree which is of type std::vector< double >. You can use this branch in your analysis by creating a member variable in your cycle with a pointer to such an object, and connecting it to the branch like this:

  • In the header:

std::vector< double >* m_variable;

  • In BeginInputFile(...):

ConnectVariable( "Reco0", "vec_var", m_variable );

Output variables are handled similarly. For each output primitive or object you have to create the object as a member of your cycle class, then you can declare it to be written to the output ntuple, with the function:

template< typename T >
TBranch* DeclareVariable( T& obj, const char* name, const char* treeName = 0 ) throw( SError )

To write out a simple Double_t variable to the output TTree, you have to do the following:

  • In the header:

Double_t m_out_var;

  • In BeginInputData(...):

DeclareVariable( m_out_var, "out_var" );

Note that if you only declared one output TTree in your XML, you don't have to specify the tree name for the function.

Note: For all the data types that you want to read or write from/to a TTree, you have to load the appropriate dictionary. For the basic STL classes (std::vector< double >, std::vector< int >, ...) ROOT has a built-in dictionary. But if you want to write out a custom object for instance, you have to create a dictionary for this object, and load it in your SFrame job.
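For a package created with sframe_new_package.sh, one way to request such a dictionary is through the package's LinkDef header, using ROOT's standard pragma lines. The class name MyObject below is purely illustrative:

```cpp
// In include/MyAnalysis_LinkDef.h (the class name is hypothetical):
#ifdef __CINT__
#pragma link C++ class MyObject+;
#pragma link C++ class std::vector< MyObject >+;
#endif // __CINT__
```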

More detailed documentation on these functions can be found in the Doxygen pages.

Histogram handling

You can put basically any kind of ROOT object (inheriting from TObject) into the output ROOT file. There are three functions that you can use to put ROOT objects into the output file:

template< typename T >
T* Book( const T& obj, const char* directory = 0 ) throw( SError )

template< typename T >
T* Retrieve( const char* name, const char* directory = 0 ) throw( SError )

TH1* Hist( const char* name, const char* directory = 0 ) throw( SError )

You can use the first one in the following way to declare a one-dimensional output histogram:

TH1* hist = Book( TH1D( "hist", "Histogram", 100, 0.0, 100.0 ) );

To access this histogram somewhere else in your code, you could do:

TH1* hist = Retrieve< TH1 >( "hist" );

or for 1-dimensional histograms it's much better to use:

TH1* hist = Hist( "hist" );

Note: The Book(...) and Retrieve(...) functions (because of the underlying ROOT implementations) are quite slow. So it's good practice to store the pointers to the output histograms in your cycle, and possibly never use Retrieve(...). The Hist(...) function is quicker, since it caches the pointers to the histograms for itself. If you run in PROOF mode, you should make sure you understand where each SCycleBase function is called, otherwise you might end up trying to access histogram pointers which have not been initialized in a specific cycle instance yet. For more details have a look at page [SFramePROOF].
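Putting the pieces together, a typical pattern is to book the histogram once per input data type and fill it through the cached lookup. This is only a sketch of the API described above, with an illustrative histogram name and variables (pt, weight) assumed to exist in the cycle:

```cpp
// In BeginInputData(...): book the histogram once per input data type.
Book( TH1D( "pt", "Transverse momentum", 100, 0.0, 100.0 ) );

// In ExecuteEvent(...): look up the cached pointer and fill it,
// using the event weight passed to the function.
Hist( "pt" )->Fill( pt, weight );
```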

The functions can be used to put non-ROOT-native objects in the output as well. A good example for this is the SH1 class. ([AdvancedFeatures])

More detailed documentation on these functions can be found in the Doxygen pages.

The structure of the configuration XML files

This section lists all the configuration options available in the XML files. The example file (FirstCycle_config.xml) gives a fair amount of documentation about most of the features, so it can in principle serve as a template for any other configuration. The basic layout of the file is demonstrated in that example; only the meaning of the configuration options is explained here.

Defining the analysis cycle

Each analysis cycle can be defined just by its name, thanks to ROOT's dictionary generation capability. This means that users can implement their own analysis cycles in their own shared libraries, and sframe_main will be able to load these libraries and find the cycle implementation just from a string name. The following properties can be specified for each <Cycle ...> block:

  • Name: The name of the class that should be run. Cycle classes can also be defined in namespaces, in which case the name of the class should be of the form "My::AnalysisCycle".
  • TargetLumi: The luminosity value that all InputData sets should be normalized to. Since the data events are never weighted, it makes most sense to specify here the same value as the total luminosity specified for the data input files. The unit of the luminosity is not actually specified by SFrame. As long as all values are specified consistently (in pb-1 units for instance), the weighting code will work.
  • OutputDirectory: If the output files should be placed somewhere else than the run directory, that can be specified in this variable. By default all output files are created in the directory where sframe_main is started.
  • PostFix: Specifies the string that should be added to the output file name for this cycle. This can be very useful when the same cycle has to be run with different configurations. The outputs of these cycles can then be separated using these postfixes.

The remainder of the options specify how/where the cycle should be run:

  • RunMode: Can have two values, "LOCAL" or "PROOF".
    • In LOCAL mode the cycle is executed on the input files using TChain::Process(...), which runs the code on a single processor core.
    • In PROOF mode the cycle is run using either a proper PROOF server, or using PROOF-Lite. When this mode is enabled, the rest of the configuration options should be set carefully as well.
  • ProofServer: The full URL of the PROOF server. In case of using PROOF-Lite, this can be either "lite://" or "". When using a proper PROOF server, this usually has the format "username@machine.institute.org". When using PoD (PROOF on Demand), you can also put "pod://" here.
  • ProofWorkDir: This is a tricky option. When the analysis cycle writes a TTree as output, each worker node creates a local file with the events that it processed. At the end of the query the PROOF master collects these file fragments from the workers, and merges them into a single file. The master node has to put this merged file (containing the output TTree(s)) into a location that's both writable by the master node, and at least readable by the client machine (the machine running sframe_main). It is usually some sort of network drive. An example could be "root://username@machine.institute.org//workdir/". In PROOF-Lite mode it should just be left as an empty string.
  • ProofNodes: Number of worker nodes that should be requested for the analysis. By default it is set to "-1", which means to request as many worker nodes as the PROOF server is willing to give. But it can be a good idea to restrict the number of worker nodes for a number of reasons. In some of the latest ROOT versions specifying "-1" doesn't work. In these ROOT versions one has to give an explicit, positive number here.
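A <Cycle> block using the options above could look like the following; all values are illustrative, and the PROOF options only matter when RunMode is set to "PROOF":

```xml
<Cycle Name="MyCycle"
       TargetLumi="1000."
       OutputDirectory="./results/"
       PostFix=""
       RunMode="LOCAL"
       ProofServer="lite://"
       ProofWorkDir=""
       ProofNodes="-1">
   <!-- InputData and UserConfig blocks go here -->
</Cycle>
```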

Defining the input data

An InputData is regarded as a homogeneous set of events, which have to be handled in the same way by the analysis code. First and foremost, events belonging to the same InputData will be added to the result with the same weight. This means for instance that if your Monte Carlo data is composed of different datasets, each with different generator level cuts, then you will have to process these datasets in separate InputData definitions.

The following properties can/must be specified for the InputData object:

  • Type: A free-form string describing the "type" of the data. The one thing to note here is that the specified string will be part of the output file name, so it makes sense not to put spaces in it. The type "data" is handled in a special way. For InputData blocks that have "data" as Type, no weighting is used.
  • Version: A free-form string further describing the data. Two InputData definitions are regarded as coming from the same source, if both their type and version are the same. In this case SFrame will automatically add up the results of the separate InputData definitions into the same output file. This feature is needed for some unconventional MC samples.
  • Lumi: It is possible to specify the summed luminosity of the entire InputData. If it is not 0.0, it takes precedence over the values specified for the individual input files. It is most useful with real data, where calculating the luminosity of individual files/datasets may not be meaningful.
  • NEventsMax: Number of events that should be processed from the input data. Its main purpose is to allow debugging on a restricted set of events. Note that the event weights are calculated such that the results will still be normalized to the specified total luminosity.
  • NEventsSkip: Number of events that should be skipped from the beginning of the input data. Purely for debugging purposes. (Checking why a given event "misbehaved"...)
  • Cacheable: In order to make it possible to weight the events correctly even when only running on a subset of them, the code has to know how many events are stored in each input file. This information is usually retrieved from the files before starting the execution of the user's code. However, with large datasets (>1000 input files) this check can take a long time. Since in most cases this information is static, setting this flag to "True" makes the code cache the needed information about the input files in a local "cache file". On subsequent executions this cache file is queried for the needed information instead of the input files. This change alone can reduce the initialization time of a large job from several minutes to a few seconds. Since in some cases caching should be avoided (for instance for input files which are often re-created), this feature is off by default. If you need to recreate the cache once in a while, you can do so by deleting the cache file from the run directory. The cache files all have names like .sframe.[Type].[Version].idcache.root, where [Type] and [Version] stand for the type and version specified for the InputData object. Notice that you can change the composition of the InputData (add/remove files) as long as the files themselves are unchanged. (They are all identified uniquely by their full path names.) When a change in the InputData composition is detected, only the previously unknown files are investigated, and the cache is updated with the new information for the subsequent executions.
  • SkipValid: This flag controls whether the files/datasets defined in this InputData should be validated at the beginning of the job. By default SFrame checks each input file/dataset for how many events it contains (so that the event weighting can be correct when running on a subset of the events), and whether all of them have the same trees in them. When this flag is set to "True", the code assumes that all the defined files/datasets are available, and that they contain all the requested information. Note that this feature only works when all the events from the InputData are to be processed. (NEventsMax="-1", NEventsSkip="0") Otherwise the job prints a few lines of warning messages, and executes the validation nevertheless.

The following objects can be defined within an InputData definition:

  • In: Defines a new input file that should be processed.
  • DataSet: A PROOF dataset that should be processed. A more detailed explanation of this feature can be found in [AdvancedFeatures].
  • InputTree: Defines a new TTree that is used by the user's cycle.
  • OutputTree: Defines a new TTree that should be created in the cycle's output.
  • GeneratorCut: Defines a generator cut that describes the MC dataset in this InputData. It is useful in some rare cases when MC datasets having different generator cuts have to be merged. In this case SFrame takes care of merging the events with the correct weighting. (Only to be used for "overlapping" generator cuts.)
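An InputData block combining these objects and properties could look like this; the file names, tree names and luminosity values are purely illustrative:

```xml
<InputData Type="MC" Version="ttbar"
           Lumi="0.0" NEventsMax="-1" NEventsSkip="0"
           Cacheable="False" SkipValid="False">
   <In FileName="./data/ttbar_1.root" Lumi="10.0" />
   <In FileName="./data/ttbar_2.root" Lumi="10.0" />
   <InputTree Name="Reco0" />
   <OutputTree Name="MyOutput" />
</InputData>
```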

Defining the cycle properties

Properties for the cycle can be defined in the <UserConfig> block. (Again, see the example XML.) The format is very simple. Each property can be configured with one line like:

<Item Name="PropertyName" Value="PropertyValue" />

Properties can be declared to the SCycleBase base class with the function DeclareProperty(...). For more information on the supported types of configurable properties, have a look at the Doxygen documentation.
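As a sketch of the user-side pattern: properties are typically declared in the cycle's constructor, and the framework then fills the referenced member variables from the <Item> lines before the cycle starts running. The stand-in DeclareProperty(...) below only makes the snippet self-contained; in a real package the function is inherited from SCycleBase, and the property names used here are hypothetical:

```cpp
#include <string>

// Stand-in for the inherited SFrame function, only so this sketch compiles
// on its own. The real implementation registers 'value' so that the
// framework can fill it from the configuration XML.
class SCycleBaseSketch {
protected:
   template< typename T >
   void DeclareProperty( const std::string& /*name*/, T& /*value*/ ) {}
};

class MyCycle : public SCycleBaseSketch {
public:
   MyCycle() : m_recoTreeName( "Reco0" ), m_ptCut( 20.0 ) {
      // The defaults above are kept if the XML doesn't set the property:
      DeclareProperty( "RecoTreeString", m_recoTreeName );
      DeclareProperty( "PtCut", m_ptCut ); // hypothetical property names
   }
   const std::string& RecoTreeName() const { return m_recoTreeName; }
   double PtCut() const { return m_ptCut; }
private:
   std::string m_recoTreeName;
   double m_ptCut;
};
```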