2007/04/19: DigCCurr 2007: Afternoon plenary: What is digital curation?
Speakers: Peter Buneman from the U.K. Digital Curation Centre & William Lefurgy from NDIIPP.
Peter Buneman: Databases & Digital Curation
Databases in science and scholarship:
*Nearly all branches of science depend on database technology for storage and retrieval of data.
*This has changed the scientific method (Mike Lesk)
A curated database is:
*A reference work
*Value lies in the organization and annotation of data
*Commonly constructed by copying parts of other (curated) databases
*Replacing traditional dictionaries, gazetteers, encyclopedias
*Rapidly increasing in scientific research (>800 in molecular biology)
*Catalogs and archival metadata are usually curated.
*Constantly checked/verified. Data quality and timeliness are important.
*Often a group effort. Produced by a dedicated organization or as a collaboration.
*Labor intensive
*Increasingly seen as "publications" by scientists.
Compare traditional libraries vs. databases
Storage in libraries is redundant, persistent, distributed, and readable by people, with clear standards for citation, a historical record, and a well-understood legal framework. Databases? Not.
Example of the CIA World Factbook and plotting the population of Liechtenstein in 1990. Are you better off online or in a library?
Research on Provenance:
A very difficult, long-term problem.
Database preservation:
How do you preserve something that evolves (both in content and structure)?
Snapshots, if frequent, are time consuming; if infrequent, you miss something.
Snapshots are immediate, and longitudinal/temporal queries are easy.
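The snapshot trade-off can be pictured with a toy archive that stores each value once, tagged with the range of versions in which it held, instead of keeping full copies. This is only a sketch under assumptions: the class name, method names, and sample values below are hypothetical and illustrative, not the speaker's system or real figures.

```python
class VersionedArchive:
    """Toy archive: each value is stored once per run of consecutive versions."""

    def __init__(self):
        self.labels = []   # version labels, in the order they were added
        self._runs = {}    # key -> list of [value, first_idx, last_idx]

    def add_version(self, label, snapshot):
        """Merge a full snapshot (a dict) into the archive as a new version."""
        idx = len(self.labels)
        self.labels.append(label)
        for key, value in snapshot.items():
            runs = self._runs.setdefault(key, [])
            if runs and runs[-1][0] == value and runs[-1][2] == idx - 1:
                runs[-1][2] = idx               # unchanged: extend the current run
            else:
                runs.append([value, idx, idx])  # new or changed: start a new run

    def snapshot(self, label):
        """Reconstruct the database as it stood at a given version."""
        idx = self.labels.index(label)
        result = {}
        for key, runs in self._runs.items():
            for value, first, last in runs:
                if first <= idx <= last:
                    result[key] = value
        return result

    def history(self, key):
        """Longitudinal query: every value a key has held, with its version range."""
        return [(self.labels[first], self.labels[last], value)
                for value, first, last in self._runs.get(key, [])]


# Illustrative values only (hypothetical, not real statistics).
archive = VersionedArchive()
archive.add_version("1990", {"population": 100, "capital": "A"})
archive.add_version("1991", {"population": 100, "capital": "A"})
archive.add_version("1992", {"population": 105, "capital": "A"})

print(archive.history("population"))
print(archive.snapshot("1991"))
```

In this sketch, any past version can be rebuilt on demand, and a question like "how did this field change over time?" becomes a single lookup rather than a comparison across many separate snapshot files.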
How do you cite something in a database? Many scientific databases ask you to cite them, but they
*don't tell you how, or
*they tell you to give the URL, or
*they tell you to cite a paper about the database.
What is a citation? Location and descriptive information.
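As a rough illustration of "location plus descriptive information", a database citation could be modeled as a small record that pins down where an entry lives (database, version, path) and who is responsible for it. The field names, the location format, and the sample database below are all assumptions made for the example, not an established citation standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseCitation:
    # Location: which database, which version, and where inside it.
    database: str
    version: str
    path: tuple
    # Descriptive information: who curated it and when it was retrieved.
    curators: tuple
    retrieved: str

    def location(self):
        """Machine-oriented locator for the cited entry (hypothetical format)."""
        return f"{self.database}@{self.version}:/" + "/".join(self.path)

    def render(self):
        """Human-readable citation combining description and location."""
        names = ", ".join(self.curators)
        return (f"{names}. {self.database}, version {self.version}. "
                f"Entry /{'/'.join(self.path)}. Retrieved {self.retrieved}.")


# Hypothetical database and curator names, for illustration only.
cite = DatabaseCitation(
    database="ExampleDB",
    version="2007-04",
    path=("countries", "Liechtenstein", "population"),
    curators=("A. Curator",),
    retrieved="2007-04-19",
)

print(cite.location())
print(cite.render())
```

Including the version in the location matters for an evolving database: a bare URL only says where the entry is today, not which state of the data was actually cited.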
Getting a canonical version of the database: borrowed a data set from a nearby scientist. The first task was to convert the database into a hierarchical structure. Preserve all versions of the data and generate static web pages (less software, more efficient).
Able to clean up the data.
Created a book from the database.
What is digital curation? A unified approach to
Preserving - the process of preserving digital data for future use, once it has been created
Creating/maintaining - missed last part of this second bullet...
Impertinent thoughts on a DC curriculum. Should they involve databases?
Yes, but they need not be intrusive or hugely time consuming. Teach data formats first and use that as an introduction. No need to teach internals or optimization. Teach design through data import/export technology and through schema mappings. Provide a short course in semistructured data and ontologies.
Do internships with data publishers (they need help!).
Other things to think about
*legal aspects (copying)
*security and confidentiality in databases, timed embargoes
*economics of long-term database maintenance, open access
*Combine with, or borrow from, other curricula (e.g. NSF data integration)
William Lefurgy from NDIIPP. Digital Curation and Sustainability
Will focus on economic sustainability because he thinks curators need to pay more attention to that aspect of digital curation.
Aspects of Sustainability
*to keep in existence, to maintain or prolong
*meeting the needs of the present without compromising the ability of future generations to meet their own needs.
*resources, broadly defined, for keeping digital materials available and accessible over time (technology, staff, cash)
*concept application at different levels (national, consortial, local, etc.).
Wants to focus on the third bullet - resources. Brian Lavoie has pointed out that we need an economic infrastructure as well as a technological infrastructure. Example of LC's assumption that the money they had available would remain available. Not so. Their funding was revoked when Congress changed.
Sustainability closely linked with other key issues: collecting content, developing technology, and outlining public policy.
Preparing for substantial discussion of sustainability in the final report of NDIIPP.
Many projects are recognizing the issue. ARL "To stand the test of time" etc.
The need is clear
*expanding digital stewardship requirements
*infrastructure, capabilities still largely geared to analog
*digital funding is largely project based
*broad range of work necessary to effectively manage content across life cycle
*rapid change means regular migration of data, systems
Open questions
*how to preserve, make available
*how to transform existing stewardship organizations and practices
*what are the costs?
*who pays?
*why it matters
See http://www.arl.org/bm-doc/econ_models.ppt for a "mind-map" of the economic issues
Basic action items
*make content value explicit
*probe business case elements
*explore business models
Content value/Why should I care?
*values of digital materials are typically intangible, as is the material itself
*funders need clear, concrete evidence for importance of digital content
*must clarify the demand side: what values accrue as a result of preservation? What are the deficits if the content goes away?
Content value clearly explained
*need frameworks to consider dimensions of value for various digital materials, e.g. value for institutional users, value for institutional reputation, prestige, value for posterity
see British Library "contingent valuation"
Business case: risks, fixes, costs
*compelling story about risk
*incentives/barriers
*plan for addressing risk
*some estimate of cost
*value added by the curatorial practices
The how and how much
*needed level of service, e.g. bag and tag, transformation, disaggregation, rich metadata
*prospective work flow
*credible cost estimates (see City College of London??)
Business models
*how to put preservation into operation
*provide for resources on an ongoing basis
*leverage incentives, remove barriers
*emergent models, but experimental at this point; modeling and testing appropriate
*collaboration is key
see LOCKSS/CLOCKSS and Portico for emerging models
Working within a network
*no one institution, community or sector can develop the best solution; collaboration is essential
*networks build shared infrastructure, reduce costs
*repositories will vary, but all can draw from a shared suite of tools, services, and best practices.
Self-interest and the public good
*institutions work together in pursuit of individual net positive value
*key to sustainability: members get value from networks, but benefits accrue to all from exchange of knowledge
Summary of collective needs
*work to illuminate content value for decision makers
*make the case for specific curatorial actions with supporting cost data
*implement and test models
Labels: DigCCurr 2007
