[I was asked to move this discussion from the "Design" group. I've also added a few words in response to a comment.]
I want to argue that data management, analysis, etc., have become sufficiently complex tasks that we need to be handing them off to expert third parties--much as many of us hand off our email to Google, rather than running lab email servers as used to be common practice. I suggest that this perspective has significant implications for how we conceive of, and implement, an eventual EarthCube architecture.
As supporting material, I attach a white paper prepared for a recent research data lifecycle management workshop, plus a recent paper on the Globus Online system that we are developing as a first foray into this space. The following are the first three paragraphs of the white paper:
Big increases in data generated within research laboratories and demands for more careful data management lead to increased pressure on investigators. Researchers need not data storage, but full-service data lifecycle management processes, encompassing data collection, storage, sharing, metadata, search, archiving, provenance, assignment of DOIs, security, etc. Establishing such processes would demand substantial time and resources that most researchers do not have, and cannot easily acquire.
We believe that the solution to this problem is not simply to define “best practices,” nor to provide researchers with software. Once defined, best practices must still be implemented; software must still be installed, operated, and maintained. Those implementation, installation, and operations steps are precisely where many investigators run into problems.
Instead, we should aim to outsource the entire lifecycle management process to a third-party Research Data Lifecycle Management service. Ideally, this service will encompass discipline-specific practices and methods, so that individual researchers can connect their lab and then have many of their problems taken care of, much as many outsource their email to Google today.
In response to a comment on the previous version of this post: I don't mean to imply that we should expect such hosted services to organize people's metadata--but they can operate metadata services, and the processes needed to publish data and populate catalogs, on people's behalf.
Regards -- Ian
This comment is more of an affirmation than a response. In short, I believe that Ian is right on point here. I can relate from experience: I have spent weeks and months myself, and more recently have had students spend too much time, just downloading, managing, formatting and processing data -- in our case primarily global climate model simulation outputs and remote sensing data -- and installing and maintaining the necessary software. These tasks are only going to grow in importance and magnitude as datasets increase in size and complexity, e.g., higher spatial and temporal resolution of simulation models as we move from IPCC-AR4 to AR5, or increased resolution and fidelity of remotely sensed data.
The Gmail analogy is a fitting one, because outsourcing these activities would allow us to focus on our strengths, which happen to be in the development of data mining and machine learning techniques for the climate and environmental sciences. At the same time, there are also research challenges in creating this type of infrastructure, which someone else may find of great interest. I hope that EarthCube will initially provide a medium for bringing these (sometimes disparate) people together and ultimately result in a sustainable solution for scientific data lifecycle management.