r2 - 05 Mar 2007 - 21:29:57 - StephaneJEZEQUEL

DDM Tutorial for end-users at LYON


Welcome to the Distributed Data Management (DDM) tutorial for end-user analysis.

This tutorial was conceived at CERN (link) and adapted to LYON in March 2007.

Its goal is to provide an introduction to the distributed data management system. We start with a (very!) short introduction to the DDM concepts. We then introduce its current implementation, called DQ2, focusing on the end-user client tools. A set of short exercises covering the various aspects of the end-user tools follows.

Basic concepts

  • The DDM system is based on datasets
  • A dataset is a collection of files
  • A file may belong to more than one dataset (e.g. a file may be part of the original dataset it was produced in, as well as part of a user dataset)
  • A file is always part of at least one dataset (files only exist in the context of datasets)

Storage technology in LYON

The storage technology used in LYON is called dCache (developed by FNAL and DESY). It is also used at other T1s (PIC, SARA, BNL, RAL, FZK).

Set up your environment

For site-specific configuration, please take a look at this page.

  1. Log in to ccali.in2p3.fr
  2. Set up Grid environment for ccali: lcg_env
  3. Get a VOMS-enabled Grid proxy
    • voms-proxy-init --voms=atlas
  4. Set up DQ2 environment:
    • For sh: source /afs/usatlas.bnl.gov/Grid/Don-Quijote/dq2_user_client/setup.sh.CCIN2P3
    • For csh: source /afs/usatlas.bnl.gov/Grid/Don-Quijote/dq2_user_client/setup.csh.CCIN2P3

You should see a message similar to the following:

#voms-proxy-init --voms=atlas
Cannot find file or dir: /afs/in2p3.fr/home/j/jezequel/.glite/vomses
Your identity: /O=GRID-FR/C=FR/O=CNRS/OU=LAPP/CN=Stephane Jezequel
Enter GRID pass phrase:
Creating temporary proxy .......................................................... Done
Contacting  lcg-voms.cern.ch:15001 [/C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch] "atlas" Done
Creating proxy .................................. Done
Your proxy is valid until Sat Mar  3 04:27:41 2007

You are all set to use DQ2 end-user tools!
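Before moving on, it can help to confirm that the setup steps above took effect. The following is a minimal sketch for sh, assuming the DQ2 setup script puts dq2_ls and dq2_get on your PATH; the helper name check_dq2_env is hypothetical and not part of DQ2.

```shell
#!/bin/sh
# Hypothetical helper: report which of the tools used in this tutorial
# are reachable from the current shell. It only inspects PATH and does
# not contact any Grid service.
check_dq2_env() {
    missing=0
    for tool in voms-proxy-init dq2_ls dq2_get; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every tool was found.
    [ "$missing" -eq 0 ]
}

check_dq2_env || echo "Re-run the setup steps above before continuing."
```

If a tool is reported as missing, the most likely cause is that the setup script was sourced in a different shell session.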

If you want to copy files from the Grid, make a local copy of setup.(c)sh.CCIN2P3 and, in that copy, change the value of DQ2_COPY_COMMAND to 'lcg-cp -v --vo atlas' (see dq2 setup).
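The change to DQ2_COPY_COMMAND can be scripted once you have a local copy of the sh setup file. This is only a sketch: it assumes the setup script assigns the variable on a single line of the form "export DQ2_COPY_COMMAND=...", which may not match the real file, and the helper name set_copy_command is hypothetical. The demonstration runs on a stand-in file.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the DQ2_COPY_COMMAND assignment in a local
# copy of the sh setup script.
# Assumption: the script sets the variable on one line of the form
#   export DQ2_COPY_COMMAND=...
set_copy_command() {
    setup_file=$1
    new_cmd=$2
    tmp_file="${setup_file}.tmp"
    # Replace only the matching line; every other line is left untouched.
    sed "s|^export DQ2_COPY_COMMAND=.*|export DQ2_COPY_COMMAND='${new_cmd}'|" \
        "$setup_file" > "$tmp_file" && mv "$tmp_file" "$setup_file"
}

# Demonstration on a stand-in file, not the real setup script.
demo_file=$(mktemp)
printf '%s\n' 'export SOME_OTHER_SETTING=1' \
    'export DQ2_COPY_COMMAND=placeholder' > "$demo_file"
set_copy_command "$demo_file" 'lcg-cp -v --vo atlas'
cat "$demo_file"
rm -f "$demo_file"
```

Check the result by eye before sourcing the modified copy; if the real file uses a different syntax, edit the line by hand instead.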

DQ2 end-user tools

Searching datasets

We start with an example of how to search for datasets using a wildcard:

  • dq2_ls mc11.*recon.*

As output, you see the dataset names. You may pick one of the datasets and check its content, that is, the files inside the dataset. To run this query, use the option -g:

  • dq2_ls -g mc11.003014.M1_minbias.recon.ESD.v11000202

In the output you will see the dataset above contains 23 files, and you can see each of its individual file names.

Exercise 1: solution

  • Find a mc11 dataset (or your favorite dataset for that matter) using dq2_ls
  • Attempt to list all the files in that dataset.

It is possible to refine a search by requiring that datasets contain more or fewer than 'N' files. To do so, add e.g. "Total>20" to the end of the query:

  • dq2_ls -g mc11.003014.M1_minbias.recon.ESD.v11000202 "Total>20"

In the example above you should get the same results, since the dataset indeed has more than 20 files (it has 23).

If you re-run with:

  • dq2_ls -g mc11.003014.M1_minbias.recon.ESD.v11000202 "Total>50"
... you should get no results, since the dataset has only 23 files.
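The double quotes around "Total>20" matter. Without them, the shell interprets > as output redirection, so the filter never reaches the command and a file named 20 is created instead. The sketch below makes this visible with a hypothetical stand-in function, show_args, used in place of dq2_ls; the dataset name is a placeholder.

```shell
#!/bin/sh
# Hypothetical stand-in for dq2_ls: prints each argument it receives
# on its own line, wrapped in brackets.
show_args() {
    printf '[%s]\n' "$@"
}

workdir=$(mktemp -d)
(
    cd "$workdir" || exit 1

    echo "quoted:"
    # The filter reaches the command as a single third argument.
    show_args -g some.dataset.name "Total>20"

    echo "unquoted (contents of the file named 20):"
    # The shell strips '>20' and redirects the output to a file named 20,
    # so the command only sees the word 'Total'.
    show_args -g some.dataset.name Total>20
    cat 20
)
rm -rf "$workdir"
```

Single quotes work equally well for this purpose.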

Exercise 2: solution

  • Find a dataset with more than 200 files. Try to trim down your queries based on dataset name patterns or the query will be VERY slow.
    • TIP: You may still use wildcards with "Total>XXX" (where XXX is a number).
    • TIP: You must use -g !
    • TIP: You should avoid slow queries, like large wildcards (e.g. 'csc*').
  • Find a dataset with fewer than 70 files.
    • TIP: The production system 'evgen' (event generation) datasets usually have ~50 files (e.g. why not try a csc11 dataset?).

The dq2_ls tool also allows you to search for files present at your site. Since in this tutorial we are running the tool from ccali.in2p3.fr, you can search for files present in LYON only.

To search for files which are part of a dataset, we have used the option -g. To list files which are part of a dataset and are locally available, you should use the option -f.

Exercise 3: solution

  • Let's take another dataset, such as mc12.005325.HerwigVBFH170wwlh.evgen.EVNT.v12003105_tid004218 (this is just a randomly chosen dataset for the exercise)
  • Repeating what you have done before, list all files that are part of that dataset
    • TIP: This one is similar to the previous exercises.
  • Now for the new part: try to list those files in the dataset which are available locally
    • TIP: For this one, you should use the new -f option.

Remember that many of the datasets are not available in LYON - or in any other single site for that matter. At the moment, the end-user tools provide limited functionality to have a quick overview of locally available data (there is ongoing work to improve this!).
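One way to see which files of a dataset are not available locally is to compare the two listings. The sketch below assumes you have saved each listing to a text file with one file name per line; the real output of dq2_ls -g and dq2_ls -f may contain extra header lines that need trimming first. The helper name missing_locally and the file names in the demonstration are made up.

```shell
#!/bin/sh
# Hypothetical helper: print the names that appear in the first list
# but not in the second. Both inputs are text files with one file name
# per line (an assumed format, not guaranteed dq2_ls output).
missing_locally() {
    all_sorted=$(mktemp)
    local_sorted=$(mktemp)
    sort -u "$1" > "$all_sorted"
    sort -u "$2" > "$local_sorted"
    # comm -23 keeps only the lines unique to the first sorted input.
    comm -23 "$all_sorted" "$local_sorted"
    rm -f "$all_sorted" "$local_sorted"
}

# Demonstration with made-up file names.
demo=$(mktemp -d)
printf '%s\n' fileA.pool.root fileB.pool.root fileC.pool.root > "$demo/all.txt"
printf '%s\n' fileA.pool.root fileC.pool.root > "$demo/local.txt"
missing_locally "$demo/all.txt" "$demo/local.txt"   # prints fileB.pool.root
rm -rf "$demo"
```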

Browsing for datasets: the Dataset Browser

Click here to go to the DQ2 Dataset Browser (right click on the link to open it in a new browser window)

(the page may take a few seconds to load)

The DQ2 dataset browser gives an overview of available datasets, by category, current location, etc.

Exercise 4:

  • The LYON areas are known as 'LYONDISK' or 'LYONTAPE'
  • Choose the 'csc' dataset category and try to filter by datasets present on 'LYONDISK'
  • Now scroll down the page and you will be able to see an overview of available 'csc' datasets on 'LYONDISK'
  • Continue exploring the dataset browser, choosing a single dataset
  • Also explore a site from your home T2

Do spend some time exploring the DQ2 dataset browser and feel free to ask questions about it...

Getting a few files locally

It is time to get your hands on the data!

Remember the dataset you used during the AMI tutorial?

  • misalg_csc11.005300.PythiaH130zz4l.recon.AOD.v12003104

Exercise 5: solution

  • As before, try to list the files in that dataset.
  • This will give you an idea of the size of the dataset.

As you noticed, the dataset is not very big (20 files).

DQ2 end-user tools allow you to retrieve data from any storage. In the ideal situation, dq2_get will only retrieve data from your local storage to your local disk for analysis, or provide you with direct access to the data via a POSIX-like protocol. For those familiar with CASTOR at CERN, this means dq2_get will either try to use rfcp to transfer data from CASTOR to your local disk, or give you an rfio URL so you can access it directly.

(Currently, there are various limitations in the POSIX-like interfaces that storage systems provide, as well as in their interactions with ROOT. Although this is not directly a DQ2 issue, it means DQ2 often cannot use or find the optimal access protocol for your local site storage. When dq2_get is customized for a site, your local storage admin should provide the necessary information and the optimal configuration.)

Exercise 6: solution

  • Try to get all files from dataset misalg_csc11.005300.PythiaH130zz4l.recon.AOD.v12003104
    • TIP: We are retrieving all files that are part of that dataset as an example, so it is better to put them in a temporary area, since you may not have sufficient space in your AFS account:
      • mkdir -p /scratch/$USER/part1/
      • cd /scratch/$USER/part1/
      • and then run dq2_get from there... (note: since you are transferring data, it may take a while!)
      • you can add the -v option to see the commands
      • Let it run and go for a coffee: it will take a few minutes

dq2_get is 'smart' enough to warn you about files that are not available at your storage (in our case, on dCache). We will get that file later.

Sometimes you do not want to get all the files in the dataset, but only some of them. In this case, you should specify each file name (as listed by dq2_ls -g) after the dataset name.

Exercise 7: solution

  • Try again, this time getting only two files from the dataset:
    • TIP: Why not create a new directory and get them there, so you do not overwrite the files you already downloaded before?
      • mkdir -p /scratch/$USER/part2/
      • cd /scratch/$USER/part2/
      • and then run dq2_get from there, specifying any two files from that dataset you want to retrieve...
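The steps of this exercise can be wrapped in a small function. This is a sketch only: the wrapper name fetch_files and the DRY_RUN switch are hypothetical, and it assumes that dq2_get accepts file names after the dataset name, as described above. By default it just prints the command it would run, so you can check it before transferring anything.

```shell
#!/bin/sh
# Hypothetical wrapper around dq2_get for fetching selected files.
# Assumption: file names are passed after the dataset name.
# DRY_RUN defaults to 1, so nothing is transferred unless DRY_RUN=0 is set.
fetch_files() {
    target_dir=$1
    dataset=$2
    shift 2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Only show what would be run.
        echo "cd $target_dir && dq2_get $dataset $*"
    else
        mkdir -p "$target_dir" || return 1
        ( cd "$target_dir" && dq2_get "$dataset" "$@" )
    fi
}

# FIRST_FILE and SECOND_FILE are placeholders for names listed by dq2_ls -g.
fetch_files "/scratch/${USER:-user}/part2" \
    misalg_csc11.005300.PythiaH130zz4l.recon.AOD.v12003104 \
    FIRST_FILE SECOND_FILE
```

Running in a separate directory per attempt, as the exercise suggests, keeps earlier downloads from being overwritten.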

To copy files over the Grid, you should use the option -r.

Exercise 8: solution

  • Try to get the one file not available in LYON from dataset misalg_csc11.005300.PythiaH130zz4l.recon.AOD.v12003104
    • TIP: The missing file was misalg_csc11.005300.PythiaH130zz4l.recon.AOD.v12003104_tid004174._00007.pool.root.13
    • TIP: Go back to /scratch/$USER/part1/ and retrieve the missing file
    • Important note: Sometimes files are reported to be at a site but have gone missing. When you run dq2_get it may happen that the file is indeed not at the site... Try again choosing another site.

The -r option retrieves files directly over the Grid: that is, those files from your dataset that are not already replicated in your local storage, such as the one dq2_get warned about when you first ran it.

Generating a PoolFileCatalog.xml or Athena options directly

At many sites you can run directly over the data without copying it to your local disk.

DQ2 provides a command (dq2_poolFCjobO) to generate either the PoolFileCatalog XML file (option -p) or an Athena job-option file (option -j) directly.

Exercise 9: solution

  • Create the PoolFileCatalog.xml file and the Athena job-option file for the same dataset as before

What can go wrong when retrieving data?

Quite a few things: the destination or source storage can be unavailable (this happens quite often nowadays, as the sites ramp up their hardware and operational support for data taking in late 2007). There are two important tips in case of problems:

  • Check if your problem is listed in this page
  • Submit a Savannah ticket for DDM Operations here

Creating your own datasets

Every user can create their own datasets on DQ2. This makes your data available on the Grid (for Grid jobs or even for other users).

There are 2 parts to making your data available on DQ2 (and the Grid).

First step: copying files into your local Grid storage

Second step: registering these files onto DQ2 (and the Grid)

This part is not working in LYON. A demonstration at CERN is available here.

A more advanced overview of DQ2

A little bit of theory on names

As we've seen earlier, datasets are identified by a name and contain a collection of files.

A file is identified by a name and typically has a POOL GUID associated with it.

Each file may have multiple physical replicas (one replica at CERN, another in Lancaster, etc.).

The name of each physical replica (such as the CASTOR file name) may be slightly different. Nonetheless, you need not worry about physical file names.

A little bit of theory on dataset states

In the examples above, we saw the simplest usage of datasets: that is, creating and adding files to a dataset.

Underneath, datasets are a bit more complicated:

  • datasets actually have versions: whenever a dataset is created, the first version is created automatically
  • you don't really add files to a dataset, but only to its latest version
  • the contents of a dataset can change between versions (with more or fewer files across versions)
  • if you do not specify a particular dataset version (which you can't do with the end-user tools for now), you always get the latest version, which typically contains all files except those proven bad (e.g. due to failures during the Grid simulation)

Internally, each version can be open or closed, and the dataset itself can be open or frozen. That is:

  • if a dataset is open, it is possible to add more versions to the dataset
  • if a dataset is frozen, it is not possible to add any more versions to the dataset. Also, the latest version (and all previous versions) of the dataset are closed.
  • if a dataset version is open, it is possible to add or remove files from that version
  • if a dataset version is closed, it is not possible to add or remove files from that version

Therefore, a closed dataset version is immutable. A frozen dataset is also immutable.
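The rules above can be written down as two small predicates. This is an illustrative sketch, not DQ2 code: the function names and the plain-word state values are hypothetical, chosen only to mirror the list.

```shell
#!/bin/sh
# Hypothetical helpers encoding the rules listed above.
# Dataset state is "open" or "frozen"; version state is "open" or "closed".

# New versions can be added only while the dataset is open.
can_add_version() {
    [ "$1" = "open" ]
}

# Files can be added or removed only in an open version. A frozen dataset
# has all of its versions closed, so the dataset must be open as well.
can_change_files() {
    [ "$1" = "open" ] && [ "$2" = "open" ]
}

for dataset_state in open frozen; do
    for version_state in open closed; do
        # Skip the combination the rules exclude: frozen dataset, open version.
        if [ "$dataset_state" = "frozen" ] && [ "$version_state" = "open" ]; then
            continue
        fi
        if can_change_files "$dataset_state" "$version_state"; then
            answer=yes
        else
            answer=no
        fi
        echo "dataset=$dataset_state version=$version_state change_files=$answer"
    done
done
```

Only an open version of an open dataset accepts changes; every other allowed combination is immutable.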

Should you care about dataset versions? Not at this point in time, considering the end-user tools are still at their first prototype. It is nonetheless useful to know the concepts in case you follow 'threads' about dataset productions on the Grid.


A significant part of DQ2, subscriptions, is not covered in this tutorial.

Subscriptions are used to move massive amounts of data between sites as scheduled data movements. They are typically used to ship data from data taking from the Tier-0 to the Tier-1s and Tier-2s, and many of the discussions on DQ2 are about subscriptions. The end-user tools do not currently allow users to subscribe datasets to sites: subscriptions are currently human-managed requests inserted into the system by a set of Production tools. In the very near future we want to give end-users the possibility to use subscriptions to move data between sites. dq2_get will continue to be the tool for getting local access to the data; subscriptions will be used to put data at your site.

Major updates:
-- MiguelBranco - 29 Jan 2007

-- StephaneJEZEQUEL - 02 Mar 2007
