Short descriptions of the four projects defined by this Task Force.
A tool to push DICOM / NIfTI data to a database hosted by the INCF (http://xnat.incf.org) initially, with the option of pushing to others (HID or IDA) later. Once the data is on the central server, a quality control check would be launched and the report sent to the researcher (see below). The raw data, the available metadata, and the QC results would be stored in the database. The idea is to give something back to those who upload data, and later to have distributions of QC measures across a very large database, so that in the long term we could have reference numbers for different scanners/sequences/populations (see the UK initiative that resembles this, or SME). (A DICOM push to an XNAT database already exists at http://nrg.wustl.edu/projects/DICOM/DicomBrowser.jsp; XCEDE and IDA have uploaders, etc.)
The representation of the metadata could be in a modified XCEDE format (see Project 2). The QC would be derived from BIRN and would be optional (an added benefit for users, not a judgement of their data quality). This will involve a review of the current BIRN Quality Control procedures to identify anything that is missing (though they seem very comprehensive).
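As a rough illustration of the push step, the sketch below (Python) sends a zipped DICOM session to an XNAT import service over REST. The server URL, project label, and credentials are placeholders, and it assumes the /data/services/import endpoint is enabled on the target XNAT instance; this is a sketch of the idea, not the actual one-click client.

    # Minimal sketch: push a zipped DICOM session to an XNAT server.
    # URL, project and credentials below are placeholders, not the real setup.
    import requests

    XNAT_URL = "http://xnat.incf.org"      # placeholder server
    PROJECT = "ONE_CLICK_UPLOADS"          # hypothetical project label

    with open("session_dicom.zip", "rb") as f:
        response = requests.post(
            XNAT_URL + "/data/services/import",
            params={"dest": "/archive/projects/" + PROJECT, "inbody": "true"},
            data=f,
            auth=("username", "password"),  # placeholder credentials
        )
    response.raise_for_status()
    print("Import request accepted:", response.text)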
See a survey of existing QA tools
There is now a One Click Prototype.
The one-click infrastructure made its public debut at SfN 2011.
There was a pre-release demo and discussion at the task force meeting before SfN. Discussion revolved around potential next steps:
- Support for NIfTI data
- The ability to reference uploaded data by simple, persistent URIs, DOIs, or both
- Need for initial data now that the prototype is in place
- The ability for better attribution of data (e.g. crediting a lab)
- The ability for community comments on/annotation of data
- A "report this data" button to flag abuse
- An indication of how long data will be stored (data lifetime expectancy)
- Augment the processing aspects: light client to do further data processing, e.g., spatial normalization, re-alignments, etc.
- Support upload of processed data by standard tools
Legal issues were also discussed. Data release itself must be allowed by the owner of the data and is governed by IRBs. Privacy issues remained a major concern; the server currently rejects DICOM data that isn't anonymized to a certain (arbitrary but documented) level, and (client- or server-side) defacing was discussed. The general consensus was that we can't anticipate or defend against every eventuality, but we can do our due diligence and document it. To that end, we should provide boilerplate for investigators to provide to their IRBs, document our position on legal aspects, and have a user-side declaration before upload that the data may be shared.
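To make the client-side side of this concrete, here is a minimal anonymization sketch using pydicom. The list of tags cleared is illustrative only; the actual required anonymization level is the one documented for the server, and defacing of the image data itself is a separate step not shown here.

    import pydicom

    # Illustrative identifying tags; the server's documented anonymization
    # level, not this list, defines what must actually be removed.
    IDENTIFYING_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress",
                        "PatientID", "OtherPatientIDs", "ReferringPhysicianName"]

    def anonymize(in_path, out_path):
        ds = pydicom.dcmread(in_path)
        for tag in IDENTIFYING_TAGS:
            if hasattr(ds, tag):
                setattr(ds, tag, "")       # blank out the identifying value
        ds.remove_private_tags()           # drop vendor-specific private elements
        ds.save_as(out_path)

    anonymize("raw/IM0001.dcm", "anon/IM0001.dcm")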
Licensing was also discussed, including the possibility of selecting a license on upload. Because data licensing issues in general are still being developed and because of the complexity involved in international sharing, it was decided to declare our intentions and the behavior of the server as well as we can without formalizing it in licensing language.
January, 2012 Notes
Discussion continued on the next steps for the one-click tool.
Uploads are currently grouped into projects by uploading user, then by subject and session. Uploads require subject and session labels, but it should be made clearer how these labels map onto the storage structure, to give the user some guidance when creating them.
The issue of duplicate data was raised: how do we deal with data that is uploaded a second time? How do we even identify this data?
Client-side tools need better distribution. Packaging of the push script is essential to reach users who don't want to deal with dependencies (which currently require some prior knowledge to install). A native Mac or PC client tool would be ideal.
The possibility of handling preprocessed data was discussed. Consider uploading a FreeSurfer reconstruction: how would this data be sent, and how would the server react to the upload (how is it structured on the server, and is additional processing done)? There are open questions, but this would be very useful to support.
Other QA options were discussed: BRAINS and DTIPrep have QA tools; a fully automated structural QA tool would also be useful but does not yet exist. The ability to compare specific QA metrics for uploaded data against those already in the database will be extremely useful (see the sketch below).
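A minimal sketch of that comparison, assuming temporal SNR as the metric and a reference distribution drawn from previously uploaded scans with the same scanner/sequence (the reference values below are made up):

    import nibabel as nib
    import numpy as np

    def temporal_snr(path):
        """Mean voxelwise temporal SNR of a 4D fMRI run."""
        data = nib.load(path).get_fdata()
        mean = data.mean(axis=-1)
        std = data.std(axis=-1)
        tsnr = np.where(std > 0, mean / std, 0)
        return float(tsnr[mean > 0].mean())

    # Hypothetical reference values that would come from comparable
    # scans already stored in the database.
    reference = np.array([38.2, 41.5, 44.0, 46.3, 50.1])
    value = temporal_snr("uploaded_run.nii.gz")
    percentile = 100.0 * (reference < value).mean()
    print("tSNR = %.1f (percentile %d vs. reference scans)" % (value, percentile))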
We reaffirmed the need for documenting legal aspects (what is expected of the uploader, such as permission to share; how the data is shared; download agreements) and some form of client-side agreement before upload. If possible, the INCF will arrange for a legal review.
The next concrete steps for the project are:
- Documenting legal aspects
- Additional QA tools
- Displaying QA metrics in the context of other scans
April, 2012 Notes
From the Nodes Workshop
- In the list of scans available for download, it might be useful to enable commenting, e.g. so a re-user could state how they used the data (e.g. re-analysed xx, published in xx) or whether they had any problems with re-use (e.g. scan 6 has artifacts)
- Ability to click to email the data originator (keeping their email address anonymous) to ask questions or report problems with the data
- Be able to see download counts for data sets (perhaps just for the data owner)
- Assign DOIs
- Better integration of QA results into XNAT
- Need better instructions/notes in the XNAT interface (this has come from our first user!)
The goal of Project 2 is to have a first germ of a standard description of neuroimaging data and metadata to facilitate communication between databases. A number of efforts have already made progress towards that goal (XCEDE is probably the best known). This standard description would be used to mediate between databases with different data models. Eventually, this could be linked to a set of ontologies to allow for semantic searches and reasoning. A mediation tool between HID and XNAT has been developed by Naveen (see the recent paper).
- Review XCEDE and see how we could use it to specify a common data representation: strengths and weaknesses; how will we use it? how does it become a standard? Discuss the need to link this to lexicons such as RadLex and others.
- Using the same API, retrieve data such as "T1-weighted from males aged > 60" from both the XNAT and HID databases: which query language? How to push data to XNAT and HID with the same API?
- Study and use of what was developed for the BIRN mediator
One possible further idea for the future is to base a standard API, which could be used to interrogate any database, on this description. Review the XNAT REST API for this purpose.
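As an illustration of how such a query might be expressed today against the XNAT REST API (before a mediator makes the same constraint runnable against HID), here is a hedged Python sketch. Listing experiments by xsiType and JSON output are part of the XNAT REST API; the per-row fields used for filtering ('gender', 'age', 'scan_type') are assumptions about what the common representation would expose.

    import requests

    XNAT_URL = "http://xnat.incf.org"          # placeholder server
    session = requests.Session()
    session.auth = ("username", "password")    # placeholder credentials

    # List MR sessions as JSON.
    resp = session.get(XNAT_URL + "/data/experiments",
                       params={"xsiType": "xnat:mrSessionData", "format": "json"})
    resp.raise_for_status()
    rows = resp.json()["ResultSet"]["Result"]

    # Field names below are assumed, not guaranteed by the XNAT schema;
    # a HID query would need the mediator to map the same constraint.
    matches = [r for r in rows
               if r.get("gender") == "male"
               and float(r.get("age") or 0) > 60
               and "T1" in r.get("scan_type", "")]
    print("%d matching sessions" % len(matches))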
- Spend the next 3 months assessing the overlap between the database schemas (HID, XNAT, and IDA), and use the overlap to export the common parameters in XCEDE or an XCEDE-like format. Use this wiki for community participation.
- In the next 3 months, deliver querying over the 'mapped' fields across the three databases through the 'mediator'.
- Find a common (minimal) metadata set. Is this set sufficient for meaningful data sharing? Formalize the definitions of these concepts, and attach them to an appropriate ontology system.
- Make the HID / XNAT mapping available: it needs a little documentation but could be done quickly
- Naveen to look at OpenII
- Pull data down directly into the LONI Pipeline from HID; others?
- Readers for R / MATLAB / Python?
- Use the common representation to query: standard SQL/SPARQL/REST services? (see the sketch after this list)
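To give a flavour of the SPARQL option, here is a hedged sketch in Python using SPARQLWrapper. The endpoint URL and the 'ni:' predicate names are purely illustrative stand-ins for whatever ontology the common representation eventually adopts.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Endpoint and predicate names are illustrative only.
    sparql = SPARQLWrapper("http://example.org/neuroimaging/sparql")
    sparql.setQuery("""
        PREFIX ni: <http://example.org/neuroimaging#>
        SELECT ?scan WHERE {
            ?scan ni:modality   "T1-weighted" .
            ?scan ni:subjectSex "male" .
            ?scan ni:subjectAge ?age .
            FILTER (?age > 60)
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["scan"]["value"])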
The data format subgroup identified a number of problems:
- The inadequacy of current data formats to handle current fields of interest.
- The inadequacy of current data formats to handle new fields of interest.
- The fact that the set of metadata of interest is a moving target.
- The difficulty of getting a wide range of people to agree on how to express fields of interest.
- The difficulty of getting software developers to adopt a standardized way of storing data and metadata.
Perhaps the most important problem we identified was that the larger problem itself is still somewhat ill-defined. "We need a new data format" was asserted early, but we need to make sure that we really understand the motivations behind that statement. To get a handle on these problems, we decided to address a simple use case. One of the limitations of current data formats is the inability to natively handle multi-echo data; FreeSurfer, for instance, uses multiple .mgz files and users typically use file naming to keep track of them. We will attempt to use the Connectome File Format to handle this data. The goals of this exercise are:
- To think about how extra fields can be represented in data formats (both this limited set and as yet unspecified fields).
- To begin evaluating the Connectome File Format as a candidate for a common format.
- To begin to understand the technical challenges of proposing a new data format (e.g., mapping this to FreeSurfer's current model).
- To begin to understand the human challenges in the adoption of a new data format (finding a way to socially implement the standard so that, for instance, FreeSurfer users can seamlessly share data with non-FreeSurfer users; possible solutions are wrappers that lower the barrier of use enough that users will adopt the standard, and technical or social factors that will tip the cost/benefit balance for developers to adopt it).
This is a fairly simple use case which provides a rich set of issues to be considered. The lessons learned from this exercise will be more important than the technical solutions that come from it, so when we evaluate the results, we should be prepared to abandon the technical work in favor of another technical approach. But in any case, this exercise will put us closer to a workable technical solution.
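As a concrete starting point for the multi-echo use case, the sketch below shows the current workaround: separate per-echo .mgz files tracked by file naming, loaded with nibabel and stacked, with the echo times carried alongside by hand. The file names and echo times are hypothetical; the point is that this implicit metadata is exactly what a richer container (e.g. the Connectome File Format) would need to store explicitly.

    import nibabel as nib
    import numpy as np

    # Hypothetical file naming convention for the per-echo volumes;
    # in practice each site invents its own, which is the problem.
    echo_files = ["orig_echo1.mgz", "orig_echo2.mgz", "orig_echo3.mgz"]
    echo_times_ms = [1.64, 3.5, 5.36]    # illustrative values only

    volumes = [nib.load(f) for f in echo_files]
    data = np.stack([v.get_fdata() for v in volumes], axis=-1)  # x, y, z, echo

    # The echo dimension and its timing currently live only in the filenames
    # and in the user's head; a common format would carry them natively.
    print(data.shape, echo_times_ms)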
XNAT - NIPyPE Workflow
This project involves taking the output of standard analyses and pushing the results to an XNAT database to start with, but later using the common API and the XCEDE / standard schema to feed processed data and metadata back to the database. Input data would also be extracted using the standard API. Another direction for standardization is the description of a workflow; see, for instance, the current XML description of LONI pipelines. A minimal sketch of the intended direction follows.
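The sketch below (Python) is one way this could look under stated assumptions: a small Nipype workflow runs a standard analysis step and a final node pushes the result file to an XNAT resource over REST. The server URL, project/subject/session labels, resource path, and credentials are placeholders; Nipype's own XNATSource/XNATSink interfaces could replace the hand-rolled upload node.

    from nipype import Node, Workflow
    from nipype.interfaces.fsl import BET
    from nipype.interfaces.utility import Function

    def upload_to_xnat(in_file):
        """Push one result file to an XNAT resource via REST.
        Server, labels and credentials below are placeholders."""
        import os, requests
        url = ("http://xnat.incf.org/data/projects/DEMO/subjects/SUBJ01"
               "/experiments/SESS01/resources/DERIVED/files/" +
               os.path.basename(in_file))
        with open(in_file, "rb") as f:
            requests.put(url, data=f, params={"inbody": "true"},
                         auth=("username", "password")).raise_for_status()
        return in_file

    skullstrip = Node(BET(in_file="T1.nii.gz"), name="skullstrip")
    upload = Node(Function(input_names=["in_file"], output_names=["in_file"],
                           function=upload_to_xnat), name="upload")

    wf = Workflow(name="process_and_push", base_dir="work")
    wf.connect(skullstrip, "out_file", upload, "in_file")
    wf.run()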