SFrame - A ROOT data analysis framework Wiki

Brought to you by: davidberge, johanneshaller, krasznaa

SFrameScripts

Introduction
sframe_input.py
sframe_new_package.sh
sframe_create_cycle.py
sframe_parMaker.py
PROOF manipulation scripts

Introduction

This page lists the scripts provided by SFrame and explains their function and usage. All of these scripts are in the SFrame/bin directory.

sframe_input.py

This executable script can be used to create the <In ... /> XML nodes for a set of Monte Carlo ROOT files. SFrame needs the integrated luminosity of the input files specified either file-by-file, or for a complete InputData. The script calculates the luminosity of each input file separately, and creates the <In ... /> nodes accordingly. After setting up your environment for compiling/running SFrame, it can be called like:

sframe_input.py -x 23.45 -o test_input.xml *.root

This command opens all the files ending in .root in the current directory and calculates their integrated luminosity from the 23.45 pb cross section given on the command line. In this example the results are written to the test_input.xml file.
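The luminosity calculation described above amounts to dividing the number of events in a file by the cross-section. The following is a minimal sketch of that arithmetic, not the script's actual code: the function names and the event counts are hypothetical, and the attribute names used for the <In ... /> node (FileName, Lumi) are an assumption that should be checked against a working configuration file.

```python
from xml.sax.saxutils import quoteattr


def integrated_luminosity(n_events: int, xsection: float) -> float:
    """Return N / sigma; the unit is the inverse of the cross-section unit."""
    if xsection <= 0:
        raise ValueError("cross-section must be positive")
    return n_events / xsection


def in_node(file_name: str, n_events: int, xsection: float) -> str:
    """Format one <In ... /> node (attribute names are assumed, not confirmed)."""
    lumi_text = f"{integrated_luminosity(n_events, xsection):.4f}"
    return f"<In FileName={quoteattr(file_name)} Lumi={quoteattr(lumi_text)} />"


# Hypothetical event counts; the real script reads them from the ROOT files.
for name, n_events in [("file1.root", 2345), ("file2.root", 4690)]:
    print(in_node(name, n_events, xsection=23.45))
```

With a cross-section given in pb, the luminosities in this sketch come out in inverse pb.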

The script prints a short help message when called as sframe_input.py -h.

The command line options in detail:

--xsection: The cross-section of the simulated process for Monte Carlo files. Can be omitted for data files. The x-section is usually given in pb, but you can specify the value in any units as long as you use them consistently throughout the configuration. (For instance, the luminosity that the results should be scaled to has to be given in the corresponding units, e.g. inverse pb when the x-section is in pb.)
--data: Specify this flag if the input files should be treated as data files. In this case you don't have to specify an x-section, but you will be expected to give the integrated luminosity for the InputData definition itself.
--output: Name of the output XML file
--tree: Name of the main TTree in the input files, holding the event level data. In case of CBNTAA or EventView ntuples, this is usually "CollectionTree", but can be anything else in case of custom ntuple files. For D3PD files it is usually called "physics".
--prefix: A string prefix that should be put before the absolute path names of the input files. This is a useful feature if the absolute path names of the files are not the same under which you want the SFrame cycle to access them. This can be the case for instance when the files are served by an XRootD server on a given machine. If for instance the XRootD file server is called "mymachine.cern.ch", and it serves the files under its local "/data/" directory, you can specify --prefix root://mymachine.cern.ch/, and get file names such as "root://mymachine.cern.ch//data/file1.root" in the created XML file.
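The effect of --prefix can be illustrated with a short sketch. It only reproduces the string handling described above, using the machine name and path from the example; the function name is hypothetical and this is not the script's own code.

```python
def prefixed_name(prefix: str, absolute_path: str) -> str:
    """Put the prefix directly in front of the absolute path, unchanged."""
    return prefix + absolute_path


# The absolute path keeps its leading slash, hence the double slash.
print(prefixed_name("root://mymachine.cern.ch/", "/data/file1.root"))
```

The double slash after the host name is expected: the prefix ends with a slash and the absolute path starts with one.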

sframe_new_package.sh

This shell script can be used to create a new package skeleton. To use it, go to the directory where you would like to create the new package, and type

sframe_new_package.sh PackageName

This will create the directory PackageName, and create a number of sub-directories and files inside of it.

sframe_create_cycle.py

This script is a "revival" of the python code that was developed in Hamburg a while ago for a previous version of SFrame. It is able to create a template for a new user analysis cycle in a simple-to-use manner. For instance if the user would like to add a new cycle called MyAnalysis to an SFrame package, (s)he just has to go in the package's directory and execute:

sframe_create_cycle.py -n MyAnalysis

After this the script creates a header file under include/MyAnalysis.h, a source file under src/MyAnalysis.cxx and an XML configuration file under config/MyAnalysis_config.xml. It also extends the already existing include/<packagename>_LinkDef.h file with a line about the new cycle.

Since I like to put my code into namespaces, the script even supports creating an analysis cycle in a namespace. For instance you can write:

sframe_create_cycle.py -n Ana::MyNamespacedAnalysis

This creates all the files with the MyNamespacedAnalysis prefix (removing Ana::), but puts the C++ code into the correct namespace.
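The naming rule can be sketched as follows. This is an illustration of the behaviour described on this page, not the script's implementation: the function names are hypothetical, and whether the real script accepts nested namespaces (A::B::Name) is not stated here.

```python
def split_cycle_name(full_name: str) -> tuple:
    """Split 'Ana::MyCycle' into ('Ana', 'MyCycle'); no namespace gives ''."""
    namespace, _, class_name = full_name.rpartition("::")
    return namespace, class_name


def cycle_files(full_name: str) -> list:
    """File names are built from the class name only, without the namespace."""
    _, class_name = split_cycle_name(full_name)
    return [
        f"include/{class_name}.h",
        f"src/{class_name}.cxx",
        f"config/{class_name}_config.xml",
    ]


for name in ("MyAnalysis", "Ana::MyNamespacedAnalysis"):
    print(split_cycle_name(name), cycle_files(name))
```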

Notes:

The created XML file is very poorly done at the moment. It's probably a better idea to forget about it for now, and write the configuration from scratch... (By taking FirstCycle_config.xml as a starting point.)
Like the other scripts, this one prints a help message when called as sframe_create_cycle.py -h.

sframe_parMaker.py

This script is used by the compilation system to create the .par files from the SFrame packages. Since the user is not supposed to use this script by hand, no detailed explanation is given about it here. Of course it can still give some information about its numerous command line options with sframe_parMaker.py -h.

PROOF manipulation scripts

The following scripts can be used to interact with a PROOF server in a simple manner. Much of the functionality of these tools is also available from the PQ2 tools, so the SFrame tools are mostly useful when using an older (<5.26) version of ROOT.

sframe_dset_maker.py

This script can be used to create PROOF datasets on the server. PROOF datasets can be used to reference a number of ROOT files available on the PROOF cluster. The concept can be used very nicely for giving a large number of files to SFrame to process, by just specifying one string name. (Instead of listing all the files in the XML configuration.) You can find more information about using datasets on the [AdvancedFeatures] page. If possible, you should try using pq2-put instead of this script.

The script accepts the following options:

--server: Name of the PROOF server. It is the same as the name specified in the XML configuration. (e.g. "username@machine.institute.org")
--dset: Name of the dataset to give to the listed files. Note that the full dataset name given by PROOF will be slightly different from the name specified here. In a default configuration the command will create datasets with the full name /<group>/<username>/<dataset name>, where <group> is a group name defined in the PROOF configuration that the user is assigned to, <username> is the name of the user creating the dataset, and <dataset name> is the name specified in this argument. To check the full name of the created dataset after executing the command, use the sframe_dset_ls.py script.
--prefix: This option can be used in the same way as sframe_input.py's option with the same name. Basically the same notes apply here as well.
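The full dataset name pattern described for --dset can be written out as a small sketch. The group, user and dataset names below are hypothetical, and the pattern only holds for a default PROOF configuration, so the result should still be checked with sframe_dset_ls.py.

```python
def full_dataset_name(group: str, username: str, dataset: str) -> str:
    """Build /<group>/<username>/<dataset name> as used by default in PROOF."""
    return f"/{group}/{username}/{dataset}"


# Hypothetical values, for illustration only.
print(full_dataset_name("default", "jdoe", "ttbar_mc"))
```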

sframe_dset_ls.py

A simple script that can be used either to list all the datasets on a given PROOF cluster, or to get more detailed information on a given dataset. If possible, you should try using pq2-ls, pq2-ls-files or pq2-ls-files-server instead of this script.

The script accepts the following options:

--server: The name of the PROOF server.
--dset: If specified, then the script prints more detailed information about the specific dataset. If left blank, then the names of all available datasets are printed.

sframe_clear_server.py

This script has no counterpart on the PQ2 side, as it doesn't work with datasets. SFrame basically leaves it up to PROOF to know when to update and recompile the uploaded packages. This makes it possible not to have to recompile unmodified packages with every new job. For instance if you only change your user package, and don't touch the SFrame source code, then the SFrame packages will not get recompiled on the master and worker nodes, only your user package.

Unfortunately this method is not completely foolproof. When PROOF detects that a package has been modified on the client, it uploads the new version to all the nodes, and extracts the files from the package on top of the files that were there before. This is usually fine, but if you removed some files in your work area, those files will still be on the server. (Because even though the new .par file doesn't contain these files anymore, they are still kept from the old .par file on the server.) This script can be used to reset the server when it starts behaving strangely with the uploaded packages. It removes all the uploaded packages (of the given user) from the master and worker nodes, so the next SFrame job can start fresh. (By compiling everything from scratch on the servers.)
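The stale-file problem can be reproduced with a toy model in which a package is a mapping from file name to content, and extracting a package overwrites existing entries without deleting anything. This is only a sketch of the behaviour described above; the file names and the helper function are hypothetical and no PROOF code is involved.

```python
def extract_on_top(server_files: dict, par_files: dict) -> dict:
    """Overwrite or add the package's files; never delete existing ones."""
    merged = dict(server_files)
    merged.update(par_files)
    return merged


# First upload contains two source files, the second one only the first.
old_par = {"src/A.cxx": "v1", "src/Old.cxx": "v1"}
new_par = {"src/A.cxx": "v2"}

server = extract_on_top({}, old_par)
server = extract_on_top(server, new_par)

# Files still on the server although the new package no longer ships them.
stale = sorted(set(server) - set(new_par))
print(stale)
```

Clearing the package on the server before the next upload corresponds to starting again from an empty mapping, which leaves no stale entries.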

The script accepts the following options:

--server: The name of the PROOF server.
--package: When specified, the script will only remove the specified package from the PROOF cluster. If no specific package name is given, all packages belonging to the user are cleared from the cluster.

Wiki: AdvancedFeatures