Paleonet: A reply to an Open Letter in Support of Digital Data Archiving

Ross Mounce rcpm20 at bath.ac.uk
Mon Mar 28 14:16:14 GMT 2011


Many thanks Jon for making your views publicly known. I recognize it is a brave thing to oppose a document that many of your colleagues at the University of Bristol have endorsed.

I strongly believe that it is only by frank and open discussion that we can get everyone to see the strengths of this proposal. 

Respectfully, I have to say that in this instance I disagree with many of your points and for reasons of time I cannot address absolutely everything you mention.
[and apologies in advance, my reply is also necessarily lengthy]

1.) "the release of raw data during publication"
Can I make it clear that we explicitly proposed to make data available only post-publication, when the research paper comes out, NOT 'during' the publication process (e.g. submission and review), as the above quote might suggest.

2.) "lowering citations for everyone" 

How so? Data is without doubt meaningless without context: explicit reference to where and whom it comes from is key to its identity and validity. So if one did re-use a dataset, one would have to cite the data repository from which one obtained the data AND the original authors of the dataset used. This already occurs and is accepted practice. We argue that by making data more available, more discoverable (data is hard to use if you don't even know it exists!) and more usable, we may provide original data creators with MORE citations than they would otherwise get. Furthermore, I can provide evidence for this assertion with reference to multiple papers, e.g. Piwowar et al. 2007 (http://dx.doi.org/10.1371/journal.pone.0000308).

3.) "I think it is a real concern that this will empower the richer and more established institutions at the expense of those with less access to funding"

I fail to see the logic behind your reasoning with this point.

Provided that institutions (and/or private individuals) have access to the Internet, and therefore to the freely accessible databases that we suggested, data deposition would, if anything, level the playing field, as ALL (who have access to the internet) would be able to access, re-purpose, re-analyse, and synthesise data. How does this advantage richer, more established institutions? Perhaps they have faster internet connections and so can obtain data a few milliseconds faster than those of us with slower internet connections? Not a major issue, in my opinion. As for research institutions that do not have *any* access to the internet, how many are there? Sadly, this is a socio-economic problem outside the scope of our aims.

4.) "This could actually slow the rate of publication"

Let us suppose that one discovered a fossil specimen of a new and distinct dinosaur. Would you really wait to publish this discovery until you had done *every* conceivable analysis possible on the fossil? Descriptive taxonomy, nomenclature & systematics, CT-scans, phylogenetic analysis, morphospace analysis, stratigraphic congruence analysis, taphonomic analysis, body-size estimates, ecological niche estimation, finite element analysis, gait and speed of movement analysis, dental microwear analysis, etc. ad infinitum...
It might take decades. Science progresses incrementally. Publish when you have enough to publish, do some analyses, and accept that it is not possible to do all. There is a universe of near-infinite possibilities, permutations and combinations for re-purposing original data that need not intrude nor compete with the future research interests of the original authors. By delaying publication one only loses out on potential citations (from potential re-use analyses).

5.) "most of the data is already available and published"

In the most technical of senses you are correct. BUT the thrust of our points is that data is often currently published in inappropriate, awkward, malformatted and inextractable ways.
For example: An unfortunately common practice with phylogenetic data is to only publish the codings for new taxa e.g. Wu et al 2011 http://dx.doi.org/10.1080/02724634.2011.546724 .
One is referred to the previous paper, Holmes et al 2008, for the rest of the matrix. Except that this paper doesn't have *all* of the matrix in it either http://dx.doi.org/10.1671/0272-4634(2008)28%5B76:NIOTSO%5D2.0.CO;2 . It in turn refers back to Rieppel et al 2002, which STILL does not contain the full matrix; one has to find Rieppel, 1999 http://dx.doi.org/10.2307/4523993 for the full original matrix (and this paper is so old that it is an image scan, not a text-extractable pdf, so the matrix cannot be readily obtained from here either). How is anyone going to validate the findings of Wu et al 2011 without access to the full data matrix? Sure, one could email the authors, but as I and many others e.g. (http://dx.doi.org/10.1037/0003-066X.61.7.726 ) have discovered, the 'email the author for the data' system DOES NOT WORK in practice, even if it sounds okay in theory.

6.) "bias in favour of the re-analyser over the data generator"

Not everyone has the funding or training to go out and do field palaeontology, myself included. Despite this, non-field palaeontologists can and do contribute significantly to our knowledge. If anything, those that have the means and access to primary data sources e.g. in the field, or museum collections, have 'first pick' on what analyses to publish on. It is the 're-analysers' that rely on primary data from field or museum palaeontologists that are disadvantaged.

7.) "the repeated generation of data is hugely positive..."

I agree. The repeated generation of NEW data is beneficial. Spending copious amounts of time copying and reformatting, or emailing and waiting for *exactly* the same data from a previous paper is NOT beneficial to anyone, funding bodies especially.

8.) "archiving space"

As we both agree (I assume), phylogenetic data, measurement data, photographs, and many other data are trivially small and can be easily archived. It is only the archiving of large (multi-GB) data such as CAT scans that needs further discussion. One way to achieve dissemination is by spreading the (bandwidth) load using a peer-to-peer download system e.g. BioTorrents. Alternatively, as I understand it, Russell Garwood and Mark Sutton (and others) are hard at work right now making the dissemination of large scan data an achievable reality.

9.) "funding"

I too agree this issue needs further debate. In our brief Open Letter we discussed the principles and necessarily (for reasons of length) less of the practicalities. I fail to see where we suggested that palaeontologists are "wasting public money". All we suggested is that tax-payers should be able to see the results of what they ultimately funded, in order to avoid "Climategate"-like misunderstandings occurring again and breeding mistrust. We have nothing to fear from greater transparency; it may even gain us *more* funding if the public can see we are doing good and interesting work, but need money to do more...

10.) "the access to fossil collections"

Agreed. This is an interesting issue that has been debated for years, but tbh, it's largely irrelevant to the scope of our Open Letter: digital data. It would be great if everyone could have access to real specimens, but there are many real and difficult factors that prevent this.
For digital data (data that has been digitised during the course of research {N.B. we would not ask researchers to share data that they did not obtain/digitise in the first place!}) there is no such barrier to public access. Upload it to a repository and everyone can see it. Simple and transparent.


My sincerest apologies that I have selectively addressed only a few of your many points, but these are the ones I feel most salient to discuss at this time.


Kind regards,

Ross Mounce

References:

Piwowar, H. A., Day, R. S., and Fridsma, D. B. 2007. Sharing detailed research data is associated with increased citation rate. PLoS ONE 2(3):e308.


On Mon, 28 Mar 2011 10:59:35 +0000
jonathan antcliffe <jonathanantcliffe at hotmail.com> wrote:

> 
> Dear all
> 
> 
> 
> I very much endorse the ethic behind this campaign. I think that there are many issues
> that must be carefully addressed if this campaign is to achieve the good it is
> setting out to do. I feel very strongly that we should not rush into this as a
> community, as tempting as the lure of data always is, but take time to reflect
> on the wider implications that this proposal would have to our science. 
> 
>  
> 
> Firstly, the release of raw data during publication is always a risk to the scientist
> who publishes. It is a risk because that data has potentially taken a lot of time
> and money to produce, and it is almost certain that the authors do not intend
> to use it solely in this one publication. Thus releasing data always
> jeopardises future research plans as others now have access to the data. If
> data was made more usable than is already the case then this risk would
> increase. I fear that this would utterly alter the desire of scientists to
> produce interim reports or multiple papers (lowering citations for everyone
> making it harder for us all to compete for funding), not publishing until much
> more complete statements are ready. This could actually slow the rate of
> publication and the ready availability of data. I think it is a real concern
> that this will empower the richer and more established institutions at the
> expense of those with less access to funding. To those with more funding the
> loss of data to a competitor half way through a project is less of a worry than
> to those with less funding. It similarly could empower those in established
> positions at the expense of those beginning their careers, particularly new research
> students.  
> 
>  
> 
> Secondly, most of the data is already available and published. It is just not published
> in the way that some argue is most convenient to them. Data acquisition is a
> hugely time consuming process and I see no compelling argument why those who
> spend their time generating detailed data by analysing material all over the
> world should alter how they present data so that those doing meta-analyses who
> have not produced any original data themselves can produce papers far more
> rapidly and thereby outcompete their peers for jobs. This motion as it
> currently stands could mandate an institutional bias in favour of the re-analyser
> over the data generator. 
> 
>  
> 
> Thirdly, the repeated generation of data is hugely positive and essential to the
> scientific process, it is not a waste of government research funds. The open
> letter makes much of the need for reanalysis with which I agree wholeheartedly.
> However regeneration of data is even more important. There is no point
> reanalysing bad data, and we don’t know if it is bad unless it can be
> regenerated. Thus reanalysis should proceed from, not preclude, regenerating
> data.  
> 
>  
> 
> Fourthly, there is a serious issue regarding the compulsion and how it relates to data
> archiving space. Ultimately there is a finite amount of digital archiving space
> available due to cost. Phylogenetic data tables are small, photographs not as small,
> CAT scan data bigger, Synchrotron data enormous. It cannot be the position of a
> journal that you must archive images online (and will not be allowed to publish
> until you do) unless your machine produced a data set that is too large for the
> servers to cope with. So we want all your photographs but if you work on
> synchrotrons then don’t worry about it. Again this amounts potentially to an
> institutional bias, this time in favour of scientists with large research
> grants who can afford to use large expensive machines against those with less
> research funding who do photography and drawing but would then be compelled to
> hand it all over... 
> 
>  
> 
> Fifthly, I agree with Jere Lipps that the arguments regarding funding have not been
> properly explored and further to his remarks that there are also strong ethical
> implications here. Is data archiving something that we will have to cost into
> grants or is it always just going to be paid for centrally by the government
> for all research produced in their country? We need to admit that palaeontology
> is not swimming in funding in comparison to other sciences. So I must strongly
> condemn sentiments put forward when a signatory of the open letter made the
> argument at a recent conference, stating that research councils should not have
> to repeatedly fund the same work. I challenge anyone to find two research
> projects funded within five years of each other in palaeontology in the UK with
> the express same aims, outcomes, and, critically, also a complete lack of
> mutual illumination. Otherwise such funding is in the best scientific practice
> of testability and data regeneration. In comparison to most sciences we spend
> very little public money and I will not endorse anything that implies that we
> need even less public money or that we are wasting public money. At a time of
> such sweeping funding cuts it seems that such statements amount to us
> voluntarily putting our head on the block. If we were to examine the number of
> citations per pound of public money spent then I am sure that palaeontology
> would rank very favourably against other sciences. Further much research is
> done from private funding, whilst living off small teaching salaries or no
> salary at all. This raises a serious question of the compulsion of archiving
> for those who have funded data acquisition out of their own money. Should we
> then be compelled to pay the government/journals to take ownership of data that
> we have privately funded and produced whilst being unable to get hold of the
> small amounts of public money available?
> 
>  
> 
> Sixthly, there is no mention of the legion of data already published. But vast
> amounts are still not easily available, even in pdf. This should be our primary
> focus in terms of making use of centralised government money for archiving. How
> much more useful would a pdf archive of published field guides be, or an effort
> to translate works published outside of the English speaking world to make use of
> this enormous resource of knowledge? If such government money was available for
> archiving we need to think very carefully how we could gain maximum benefit
> from it as a community.
> 
>  
> 
> Finally, no reference is made to the elephant in the room, the access to fossil
> collections or restricted field sites, though the vast majority are protected
> for very good reasons related to conservation. This remains the real problem in
> the availability of palaeontological data.
> 
>  
> 
> Kind regards
> 
>  
> 
> Jon
> 
>  
> 
> ------------------------------------------------------------------------------------
> 
> Dr. Jonathan Antcliffe
> Royal Commission Research Fellow
> Department of Earth Sciences
> University of Bristol
> UK
> 
> 

-- 
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Ross Mounce
PhD Student
Fossils, Phylogeny and Macroevolution Research Group
University of Bath
4 South Building, Lab 1.07
http://bit.ly/rossmounce
http://www.citeulike.org/user/rossmounce
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-



