2007/04/19: DigCCurr 2007: Identifying Digital Curation Services and Functional Requirements
I attended "Data Sets, Metadata, and Management". Speakers: Jane Greenberg, Gail Steinhart, James Tuttle
Jane Greenberg. DRIADE project. Overview: introduction, consensus building, functional requirements.
DRIADE = Digital repository of Information and Data for Evolution.
Big science initiatives already have web interfaces to their data and lots of services built on top. The internet also affects "small science" projects such as Knowledge Network for Biocomplexity (KNB) and Marine Metadata Initiative (MMI).
Evolutionary biology has some requirements for data deposit (GenBank, TreeBase). Some publications, such as Molecular Biology and Evolution, also require supplementary data. There is no single one-stop shop for evolutionary biologists. DRIADE was developed in response to that need.
Goals of DRIADE
* develop one-stop shop for scientific data objects supporting published research
* support data acquisition, preservation, resource discovery, data sharing, and data reuse of heterogeneous digital datasets.
* balance a need for low barriers, with a higher-level data synthesis.
Consensus building
Had a stakeholders workshop where they invited reps from the major journals, organizations/societies, and scientists.
Outcomes: found there was unanimous support for the project; participants felt it was necessary to advance science in this field. It was agreed that they would have a central data repository. Journal representatives felt they had a moral obligation and moral authority to initiate this and get people on board. A data repository will help to verify authenticity of data and provide a bit of "policing" in terms of data security. It can also help with interoperability with GenBank and TreeBase, working on "handshake activity."
Challenges: scope, representation, quality control, security, cultural change, sustainability.
Scope: should it be restricted to data supporting publications? Should other data be included? One participant advocated for doctoral theses. Who will contribute? Who will create the metadata? At what stage in the publication life cycle should data be contributed? How to coordinate with the journals?
Rights are a very important issue, particularly the rights of authors to their datasets. Do journals have rights to publish data collected with public funds? The general consensus from the workshop leaned towards a model like Creative Commons. At what stage in reuse do re-users credit the creator of the data?
Representation has many associated challenges: standardization and strict adherence to standards vs. keywords. A combination of both is a practical way to approach it. They are looking to generate as much metadata as possible automatically, and there have been some experiments in drawing keywords from the published article.
Quality control of the data input: how to maintain it? People at the workshop were very against the idea of a data curator initially, but by the end of the workshop they were thinking a data curator (or more than one) is necessary for quality control.
Security is an issue because evolutionary biology is a controversial subject (think creationists). The data needs to be protected.
Sustainability: how do you encourage scientists to deposit data? DRIADE is fortunate in that there is buy-in from the journals. There was some discussion of mandatory deposit policies leading to passive-aggressive behavior (like mislabeling tables, omitting data, etc.). There will be a need to "flag" incorrect items. Finally, there is an issue with the ongoing funding model. Should it be subscription based? Grant funded?
Priorities and next steps: preservation, access, synthesis (Maslow's hierarchy of needs), cultural change via editorials, publicizing at conferences, requirements. Right now they consider the preservation itself to be the highest priority because there is a lot of data being lost.
Functional requirements: compared other small science projects.
Support: computer-aided metadata generation, specialized modules linking data submission to work flow.
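To make "computer-aided metadata generation" a little more concrete, here is a minimal sketch of suggesting candidate keywords from article text by simple term frequency. This is not DRIADE's method; the function name, the stop-word list, and the sample abstract are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "on",
              "with", "we", "that", "are", "from"}

def suggest_keywords(text, limit=5):
    """Return the most frequent longer, non-stop-word terms as candidate keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 3)
    return [word for word, _ in counts.most_common(limit)]

# Invented sample text standing in for a published abstract.
abstract = (
    "We analyze beak morphology in finches. Beak depth varies with seed size, "
    "and finches with deeper beak measurements survive drought."
)
print(suggest_keywords(abstract, limit=2))  # ['beak', 'finches']
```

Suggestions like these would still need review by the depositor or a curator before becoming part of the record.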
Functional model based on OAIS.
Metadata framework.
Level 1 - initial repository implementation - preservation, access, and basic usage of data.
Level 2 - ??
Level 3 - ??
Application profiles: data elements drawn from one or more name space schemas combined together by implementors and optimized for a particular local application. Single existing schemes are often not sufficient.
Level 1+ bibliographic citation for the journal article, data object metadata (incl. PREMIS)
Level 3 brainstorming: thinking of web 2.0 technologies, personalization, macros, tagging etc.
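The application profile idea above can be sketched in a few lines: local fields are mapped to elements borrowed from more than one existing schema, and the profile decides which are required. The particular fields chosen here are assumptions for illustration, not DRIADE's actual profile; the `PROFILE`, `missing_required`, and `qualify` names are hypothetical.

```python
# Hypothetical profile: local field -> (schema prefix, element name, required?).
# Elements are borrowed from Dublin Core (dc, dcterms) and PREMIS.
PROFILE = {
    "title": ("dc", "title", True),
    "creator": ("dc", "creator", True),
    "article_citation": ("dcterms", "bibliographicCitation", True),
    "checksum": ("premis", "messageDigest", False),
}

def missing_required(record):
    """Return the required profile fields that are absent or empty in a record."""
    return [field for field, (_, _, required) in PROFILE.items()
            if required and not record.get(field)]

def qualify(record):
    """Rewrite local field names as prefix:element names from the source schemas."""
    return {f"{PROFILE[field][0]}:{PROFILE[field][1]}": value
            for field, value in record.items() if field in PROFILE}

# Invented sample record.
record = {"title": "Finch beak measurements", "creator": "A. Researcher"}
print(missing_required(record))  # ['article_citation']
print(qualify(record))
```

Keeping the mapping in one table makes it easy to see which schema each element comes from and to add elements as the repository moves from Level 1 towards later levels.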
Conclusions: Teamwork required. The stakeholders meeting was critical. They are benefiting from prior work. Next steps will be to survey and do use-case and life-cycle studies. Metadata application profile experiment with evolutionary biologists actually using it.
Implications for education: students participate in the project, service learning is invaluable. Curriculum needs to address the whole picture including digital resource life cycle, metadata life cycle, IA components, human factors. Language barriers and communication skills ...different vocabularies in different domains. Conferences like these.
Gail Steinhart described her work with a research group at Cornell.
Overview: motivation, strategy, and what we have learned so far.
Motivation: don't need to spend too much time with this audience explaining why digital curation is necessary. There are digital preservation issues and there is information entropy (loss of information about data over time which lessens its usefulness). Mentioned NSF Cyber-infrastructure Vision for the 21st century. Digital curators need to stay plugged into initiatives like that. She attended the funders' perspective concurrent session this morning and it is more and more evident that funders are going to require data curation.
Curation definition from DCC. She does not think that academic libraries are going to do all aspects of curation as per that definition (DCC definition includes processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge). Libraries should try to fill in gaps in infrastructure and support but this is not without costs. Engaging in this area will require libraries to develop new partnerships.
The Cornell project involves multiple departments and units (animal science, biological and environmental engineering, ecology, horticulture, Mann Library, etc.)
Types of data collected: observational data, experimental data, simulation models. The group also has 30 years of historical observational data that they would like to preserve and share.
They want to share (a) among themselves to create the simulation models, (b) as a public good with policy makers and the general public, and (c) because the PIs have committed to being a model for shared collaboration. They appreciate the utility of their data for others.
Strategy:
Didn't make sense to develop infrastructure for this one small group. Needed local support of data, metadata.
Project participants share data and metadata via a staging repository provided by the library. When they are ready to make it public they have some choices: Cornell's institutional repository vs. discipline repositories.
Staging repository
*Use discipline-specific metadata standards and tools (Ecological Metadata Language [EML], Morpho)
*Provide a place to share pre-publication data within the group.
*Provide training and recommendations on metadata
EML
Morpho is a metadata editor that makes it pretty easy for PIs to create metadata. Helps non-librarians to create a metadata record. Metacat?
"Publication" of data options
*DSpace/Cornell institutional repository... but this doesn't add to the science infrastructure, so they also encourage the next option
*Submit metadata (and possibly data) to a discipline-specific repository (KNB, other?)
Test case: Historical
Observational data from past 30 years
Original format Quattro Pro workbooks with multiple pages.
Converted to Excel for further review and clean-up
Various errors (apparent duplicate records, misaligned columns, out of range values)
Missing or ambiguous information (methods, units, geographic locations)
Extensible model??
Lots of work between the data owner and her. It raises the question of whether this level of service can be provided by libraries.
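Some of the clean-up problems listed in this test case (apparent duplicate records, out-of-range values, values that landed in the wrong column) can be flagged automatically before a person reviews them. The sketch below is only an illustration; the column names, the plausible range, and the sample rows are invented, and it is not the procedure used at Cornell.

```python
def find_duplicate_rows(rows):
    """Return indexes of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates

def find_out_of_range(rows, column, low, high):
    """Return indexes of rows whose value is missing, non-numeric, or outside [low, high]."""
    flagged = []
    for index, row in enumerate(rows):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            flagged.append(index)  # missing or non-numeric: possibly a misaligned column
            continue
        if not low <= value <= high:
            flagged.append(index)
    return flagged

# Invented sample rows, as they might look after export from a spreadsheet.
rows = [
    {"site": "A", "temp_c": "12.5"},
    {"site": "A", "temp_c": "12.5"},         # apparent duplicate
    {"site": "B", "temp_c": "125"},          # out of range
    {"site": "C", "temp_c": "north field"},  # text in a numeric column
]
print(find_duplicate_rows(rows))                    # [1]
print(find_out_of_range(rows, "temp_c", -40, 50))   # [2, 3]
```

Checks like these only flag candidates. Deciding whether a duplicate is real, or what the missing units and methods were, still takes the back-and-forth with the data owner described above.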
Summary: curation skills
*traditional library and archiving skills (metadata, preservation, interoperability, appraisal, and selection)
*understanding content area ... need to understand how researchers in a discipline really do research.
*awareness of standards and tools related to data.
*productive partnerships with researchers.
Question: how do contributors work with controlled vocabulary? Answer: haven't had to face it yet. It's too new for the researchers. Had to begin with teaching them what metadata is, what EML is etc. EML doesn't require controlled vocab but can accommodate it. She suspects that the reality is that they're going to do what they are going to do and we have to accept it because it is better than not getting the data sets at all.
James Tuttle, NCSU: Curation and preservation of complex data, NC Geospatial Data Archiving Project
In his experience geospatial researchers value the newest data and ignore older data when newer data becomes available.
Geospatial data types are complicated. Vector data, for example, is very difficult to preserve. Aerial imagery is a little simpler. Spatial databases are incredibly difficult to preserve. Can do one-to-many export of images and vector data but trying to manage the relationships between them is difficult.
Repository pre-ingest workflow
Data receipt - format processing - metadata processing - ingest processes. The process is as much social as technical.
Data receipt: includes acquisition, reorganization, validation, threat analysis, inventory. Try to automate wherever possible. They have no demands on contributors, so files come to them "as is."
Using the JHOVE tool to harvest, even though it was not designed for geospatial data.
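As an illustration of the kind of data-receipt step that can be automated, here is a minimal sketch that inventories whatever files arrive "as is", recording relative path, size, and a checksum. The use of SHA-256 checksums and the function names are assumptions for this example; the talk did not describe how NCGDAP builds its inventory.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Compute a file's SHA-256 digest, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_inventory(root):
    """List every file under root with its relative path, size in bytes, and checksum."""
    root = Path(root)
    inventory = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            inventory.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return inventory
```

Calling `build_inventory` on the directory where a delivery was unpacked returns one entry per file, which can later be compared against a fresh run to detect files that changed or went missing.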
Format processing: geospatial data traditionally has not migrated well between formats. Processing includes conversion, compound formats. Typically data is gathered without metadata. Some metadata can be generated from the GIS software. When it exists it typically requires remediation to be used in basic retrieval.
Ingest Processes: metadata conversion, SIP creation.
Extended Curation: feedback loop with contributors, constant improvement on metadata being received. Also work with industry and standards organization (geospatial consortium)
http://www.lib.ncsu.edu/ncgdap/
Question: Have you met with resistance to depositing from contributors?
Answer: Some agencies in N.C. had strict data sharing requirements which impeded our ability to use the data. A lot of local agencies have liability concerns with older data which may be obsolete, superseded, etc. Need to have clear disclaimers. Have to work with providers to reassure them.
General questions:
Comment: general theme of complex/compound objects in the papers, but the presentations bring home the need for people skills. How to talk to people who don't use our language re: metadata? How do you develop that skill set in students so they can speak to researchers about their data sets? All three of the presentations mentioned an education component where they had to explain to researchers about metadata and its function. How can curators be good educators?
Question: In terms of preparing digital curators what level of expertise do we expect them to have in different content areas?
Jane Greenberg: It's a very interesting question. They have a discipline post-doc working on their project. She recalls a time as a cataloger when colleagues had graduate degrees.
Audience member: when it's a general description of a blob object then it's not as important to have the discipline knowledge, but it's more important to know what's actually in the data.
Jane Greenberg - important to teach curators to know when they need a domain expert so they don't get themselves into situations where they aren't qualified.
Labels: DigCCurr 2007
