Recommend DiracFile over MassStorageFile #162

Open
betatim opened this issue Jan 5, 2016 · 23 comments

Comments

@betatim
Member

betatim commented Jan 5, 2016

Related to a discussion on failed large file downloads from the grid when using the MassStorageFile setup from 11-eos-storage.md

For a MassStorageFile ganga will download the file to the local machine and then copy it to EOS. This is quite fragile, especially as the error messages you get don't really help in diagnosing what went wrong.

We should switch to recommending DiracFiles instead. These can be created directly by the worker node, without a round trip through the machine where ganga is running.

@alexpearce
Member

I think Chris Jones's response was pretty good:

People need to get over this idea... As far as the CERN site is concerned the 'grid' is EOS, so having your data available on the grid, with a replica at CERN is identical to having it on EOS.

I would suggest you update the starter kit to use Ulrik's suggestion, and just save the file as a DiracFile. If for some reason (and actually I think this is also redundant, see next) people want a replica at CERN (aka on EOS) then either replicate your data to the CERN site after the fact, or require this as part of your job description to start with.

Lastly, people should also remember it's just as easy to run over your data, interactively, with your data on any site using the XRootD protocol, which is available for all sites, and which you can get for any LFN using the Dirac dirac-dms-lfn-accessURL method. So really it's not actually necessary at all to demand your data is all at CERN on EOS to start with...

I think there are two main reasons why people want their files on EOS:

  1. You can organise everything into a nice directory structure and the paths are (somewhat) memorable;
  2. It would make sense if EOS was faster than accessing a file in Spain when you're at CERN

For 1, I think as long as there's an obvious way to get the list of file paths you need, it's not too bad. I think you'd need a little loop in Ganga for this.

For 2, I don't know if having the files at CERN actually is any faster, maybe the internet connection to the grid sites is so 🔥 fast that there's some other bottleneck. If people want, they could replicate the file to CERN anyway.

And: if you can easily get a list of files, you could also xrdcp everything from whatever grid site to EOS without too much pain (though you could just ask for replication, as that's what you'd be doing).
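As a rough sketch of that copy step, assuming you already have a list of XRootD source URLs: the helper below only builds the xrdcp argument lists and does not run them. The function name and the destination prefix are hypothetical, not something Ganga or Dirac provides.

```python
import posixpath


def build_xrdcp_commands(source_urls, dest_prefix):
    """Return one xrdcp argument list per source URL.

    dest_prefix is a hypothetical destination directory URL chosen by the
    user, e.g. "root://eoslhcb.cern.ch//eos/lhcb/user/a/another/ntuples".
    """
    commands = []
    for index, url in enumerate(source_urls):
        # Drop any "?opaque" suffix before taking the file name.
        path = url.split("?", 1)[0]
        name = posixpath.basename(path)
        # Prefix with the index so identically named outputs from
        # different subjobs do not overwrite each other.
        dest = "{0}/{1}_{2}".format(dest_prefix.rstrip("/"), index, name)
        commands.append(["xrdcp", url, dest])
    return commands
```

Each returned list could then be handed to subprocess.check_call, given a valid grid proxy and write access to the destination.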

I dunno. I think the most important thing is that you want the most automation possible; you don't want to be handing out scripts or functions for .ganga.py. Maybe we just need to change people's opinions that you need the file ‘locally’.

@betatim
Member Author

betatim commented Jan 5, 2016

For MassStorageFiles the path to the output is easy to predict, and the output can be accessed either via xrootd or the eosmount eos trick.

For DiracFiles it seems they get stored at a location like: ~/eos/lhcb/grid/user/lhcb/user/a/another with subdirectories for each ganga job and subjob ID. ls'ing a random dir there:

~/eos/lhcb/grid/user/lhcb/user/t/thead/452.7
$ ls -R
.:
2014_05

./2014_05:
76328

./2014_05/76328:
76328968

./2014_05/76328/76328968:
HLT.xdst

Do we know what those numbers correspond to? You can look them up and generate a file list, but I fear the self-made paths from MassStorageFile are still nicer to use.

@alexpearce
Member

You can look them up and generate a file list, but I fear the self-made paths from MassStorageFile are still nicer to use.

For sure 😞

I think the number is the ID of the grid job, which should be unique across All Grid Jobs Ever. It might be stored in the job's backend object.
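To make the layout in the listing above concrete, here is a small sketch that splits the tail of such a path into its pieces. It assumes the last four components are always YYYY_MM, a shorter number, the longer number, and the file name, as in that listing. Reading the longer number as the grid job ID is only the guess made above, and the function name is made up for illustration.

```python
def parse_lfn_tail(lfn):
    """Split the trailing components of an LFN-style path.

    Assumed layout (taken from the listing in this thread):
    .../YYYY_MM/<shorter number>/<longer number>/<file name>
    """
    parts = lfn.strip("/").split("/")
    if len(parts) < 4:
        raise ValueError("unexpected LFN layout: {0!r}".format(lfn))
    date, prefix, job_id, filename = parts[-4:]
    year, month = date.split("_")
    return {
        "year": int(year),
        "month": int(month),
        # Guess from the discussion: this may be the grid job ID.
        "job_id": int(job_id),
        # In the listing, the shorter number is a prefix of the longer one.
        "prefix_matches": job_id.startswith(prefix),
        "filename": filename,
    }
```

For the listing above, the path ending in 2014_05/76328/76328968/HLT.xdst gives year 2014, month 5, and a candidate job ID of 76328968.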

If you have a job and want the list of LFNs, this should do it:

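# Run inside a Ganga session, where `jobs` and `DiracFile` are predefined.
job = jobs(job_id)  # job_id is the Ganga job number
for sj in job.subjobs:
    # Pick out only the DiracFile outputs of each subjob
    for df in sj.outputfiles.get(DiracFile):
        print(df.lfn)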

If the file is replicated at CERN-USER, the LFN can be mapped to an XRootD path I think (I don't have an example job to play around with).
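A minimal sketch of what that mapping might look like, untested against a real job. It assumes that user LFNs sit under /eos/lhcb/grid/user on EOS (inferred from the directory listing earlier in this thread) and that eoslhcb.cern.ch is the XRootD host. The function name is hypothetical, and the result is only meaningful if a replica actually exists at CERN-USER.

```python
EOS_HOST = "eoslhcb.cern.ch"
# Assumption: inferred from the ~/eos/lhcb/grid/user/lhcb/user/... listing.
EOS_GRID_USER_PREFIX = "/eos/lhcb/grid/user"


def lfn_to_eos_url(lfn):
    """Map a user LFN to a candidate XRootD URL on EOS.

    This does not check that a replica exists at CERN-USER.
    """
    if not lfn.startswith("/lhcb/user/"):
        raise ValueError("not a user LFN: {0!r}".format(lfn))
    # The double slash after the host marks an absolute path in XRootD URLs.
    return "root://{0}/{1}{2}".format(EOS_HOST, EOS_GRID_USER_PREFIX, lfn)
```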

@saschastahl
Contributor

Yeah, for me it basically boils down to the fact that I want to be able to look into my ntuple within seconds. And I really cannot be bothered to find out some obscure LFNs every time :(.
That said, I have never used MassStorageFile, as I was aware of the several fragile steps it involves, and always copy to EOS by hand.

@alexpearce
Member

Yeah, people want to be able to do root <file>; new TBrowser.

With a file on the CERN grid site/EOS you can do

$ root
root [0] TFile *f = TFile::Open("root://eoslhcb.cern.ch//eos/…")
root [1] TBrowser tb

which isn't awful.

@saschastahl
Contributor

No, it is ok. But you have to remember the path :-).

@egede

egede commented Jan 5, 2016

Two comments.

$ root
root [0] TFile *f = TFile::Open("root://eoslhcb.cern.ch//eos/…")
root [1] TBrowser tb

The above syntax also works if the file is not at CERN. The name to use can be obtained either from the dirac-dms-lfn-accessURL command-line tool or from the accessURL method on a DiracFile object in Ganga.

The second comment is that there is a long standing request to allow the user to decide on the directory structure of DiracFile objects. It is pending a change on the Dirac side as far as I understand.

@betatim
Member Author

betatim commented Feb 16, 2016

Does someone know the state of the ganga issue related to this?

@egede

egede commented Feb 16, 2016

You mean letting the user decide the directory that Dirac stores the file in? There is a missing feature in the LHCbDirac API. Until that is available there is nothing that can be done from the Ganga side.

@betatim
Member Author

betatim commented Feb 16, 2016

Jupp. Do the dirac guys use github to track the progress on this or is there an issue in the ganga repository we can track to keep informed?

@egede

egede commented Feb 16, 2016

Tracing this further, the ball is in the Ganga camp now (where it has been forgotten). I have created a new issue on GitHub to follow this, ganga-devs/ganga#201.

@alexpearce
Member

There is now a method on DiracFile for getting the full URL to a file, no matter what Grid site it's on:

Ganga In [7]: df.accessURL?
Type:       function
String Form:<function accessURL at 0x7f68f7183320>
File:       /afs/cern.ch/lhcb/software/releases/GANGA/GANGA_v602r2/install/ganga/python/Ganga/GPIDev/Base/Proxy.py
Definition: df.accessURL(*args, **kwargs)
Docstring:
Attempt to find an accessURL which corresponds to the specified SE. If no SE is specified then
return a random one from all the replicas.

For example;

Ganga In [8]: df.accessURL()
['root://clhcbdlf.ads.rl.ac.uk//castor/ads.rl.ac.uk/prod/lhcb/user/a/apearce/2016_09/139512/139512234/DVntuple.root?svcClass=lhcbUser']

I think we should advise people to use this, rather than copying everything to 'CERN-USER' and only using eoslhcb.cern.ch URLs. Does that make sense @egede?
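As a sketch of how this could be turned into a file list: the helper below takes any objects that have an accessURL() method returning a list of URL strings, as in the output above, and keeps the first URL of each. The helper and the stand-in class are hypothetical and exist only so the snippet runs outside Ganga.

```python
def collect_access_urls(files):
    """Return the first access URL of each file-like object.

    Each item only needs an accessURL() method that returns a list of
    URL strings; items with no URL are skipped.
    """
    urls = []
    for f in files:
        candidates = f.accessURL()
        if candidates:
            urls.append(candidates[0])
    return urls


class FakeDiracFile(object):
    """Stand-in used to exercise the helper outside a Ganga session."""

    def __init__(self, urls):
        self._urls = urls

    def accessURL(self):
        return list(self._urls)
```

Inside Ganga, the input would presumably be the DiracFile objects gathered by looping over job.subjobs and sj.outputfiles.get(DiracFile), and a valid grid proxy is likely needed for the lookup.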

@egede

egede commented Oct 13, 2016

@alexpearce Yes, I think that is a good idea - at least if performance is not harmed. The most important thing is to get rid of the recommendation to use MassStorageFile.

@alexpearce
Member

OK, thanks. We'll push to get this done before the next workshop.

@egede

egede commented Oct 13, 2016

Should some comment be made that this in effect makes the analysis chain less "CERN-centric"? If you no longer copy files to CERN, there is nothing special about lxplus.

@alexpearce
Member

Indeed. We should keep in mind that AFS will soon be 💀. I don't know if there's a known replacement for the interactive environment, so if users are already able to do things without lxplus (on their local cluster, or on their laptops) that will make the transition easier.

@saschastahl
Contributor

I have been playing around with this workflow over the last few days and it is sometimes a bit tricky. One problem I encountered is that you need a valid grid proxy to use these files. This makes it complicated to use on your own PC or in a batch job.

@egede

egede commented Oct 24, 2016

@saschastahl Your comment that it takes a bit more effort to get read access to these files is a valid one. However, you can obtain a long life proxy that should more or less get rid of that. Placing the files on EOS will not make your life easier unless you do subsequent analysis inside CERN which I thought we were in general discouraging.

@alexpearce
Member

We do indeed want to discourage location-specific analysis. People should be able to do their work on their own machines wherever they are.

Do you know how to generate this "long life proxy" @egede? That might make things slightly easier, although there are still hoops to jump through when running on the batch system or on a local machine that doesn't have the usual Grid machinery on it.

Perhaps we could also provide instructions on how to access these non-CERN Grid files 'locally'? (I could access my Grid files at some sites without a proxy, but not at others. The access policy doesn't seem to be uniform, so we need a solution that works everywhere.)

@saschastahl
Contributor

Yes, I was specifically referring to jobs on a batch system. It involved several steps to transfer the grid proxy to my jobs. I can provide the instructions I found on a twiki page, but it is a bit cumbersome.

@egede

egede commented Oct 25, 2016

With Ganga 6.3, each object that needs a GridProxy will contain information about it. This does not solve the problem in itself, but paves the way for a job sent to Batch to automatically forward the proxy (and fail to submit if no valid proxy is available).

@renaudin
Contributor

In order to update the lesson, should we delete the MassStorageFile setup and usage completely, or just leave a side note on it?

@cmarinbe
Contributor

This should have been solved by #209. Any comments?
