I'm in the process of doing spring cleaning in my office(s). I pulled out a bunch of 3.5" floppies and zip disks. A big bunch. I'm cursing myself for ignoring the situation. I should have reviewed these files and migrated them years ago. Fortunately I still have access to both types of disk drive.
I found papers dating back to my undergraduate years -- that's 1990-94. Most of my electronic files have updated easily to current formats. The older ones were a bit trickier. They were in WordPerfect 5.1 for DOS. They ended up having some garbage in the text when I converted, which was a pain. Now I'm just enduring the tedium of viewing the contents of the disks, deciding what to keep, then wiping the disks so somebody else can use them (anybody want a pile of free disks?).
I'm going to move my files to dedicated server space rather than keeping them on fixed media. I think this will help with future refreshing and/or migrations. I definitely will visit my files a bit more frequently than every 10 years. I suppose I should do it each year when the time changes from standard to daylight savings and vice versa. Just like changing batteries in your smoke detectors.
I'm lucky. I didn't lose anything I value -- except perhaps my time. This is probably better done more frequently with fewer files.
Labels: digital preservation, electronic files, migration, refreshing
I've fixed the spelling and rendering issues in those blog entries which were raw dumps of the notes I took during the DigCCurr sessions. Apologies for their initial roughness -- when I said they were raw, I meant it. It didn't help that I was using Appleworks to draft the notes. Blogger did NOT like the text encoding Appleworks used and therefore rendered all diacritics in a very messy fashion.
I don't think I'll cover any more conferences in this manner. I don't think it's useful for those who aren't there. The conference presentations will be put on the web at the DigCCurr web site at some point if one is interested in seeing the full content/context. I think it's far more useful to get some summary and analysis of the proceedings.
All in all I thought it was a fabulous conference.
The good?
*Reconnecting with colleagues, especially my former coworkers from SDSC and the UC.
*Getting up-to-date with the European digital curation projects
*Hearing Liz Bishoff speak -- she is incredibly dynamic.
*Realizing that everybody is struggling to define the problem space of digital curation
The bad?
*The food at the Friday Center was both stale and in some cases rotten (wilted and rotting salad on both days). As a vegetarian, my entree choices were limited to starch, starch, and more starch.
*Location - U.N.C. is gorgeous, but I mentioned feeling isolated in suburbia. Don't try to be a pedestrian near the Friday Center unless you have a fair amount of courage.
My take-aways?
Nobody has a good definition of digital curation or digital preservation. Partnerships and people management are key to the success of any digital curation project. Librarians and archivists have to realize that we are not the only profession which has been radically transformed in the past 10 years. No single institution or profession can provide all of the necessary skills. In addition, no single institution can provide the economic sustainability needed to ensure longevity of digital objects. Digital curators need to pay more attention to developing business cases for archival repositories. Most projects are currently funded as projects and we need to move from soft-funding to production level services. In order to make the business case we should pay better attention to the demand side of the equation. What do the content providers need? For what community is the content being archived? Focusing on fulfilling those needs can assist in figuring out (a) what to do and (b) how much it's realistically going to cost.
I'm looking forward to reading the published papers from the conference. All in all, I'm glad I went.
Labels: DigCCurr 2007
Cal Lee and Cliff Lynch shared the results of the survey and lessons from individual feedback.
Cal Lee went first. He gave a big caveat that the data is as close to research as "The Situation Room" is to journalism.
Survey Questions
1. What do you see as the biggest digital curation challenges in your institution?
*A few high level categories: need to change/influence beliefs/perceptions of people outside; skilled IT staff that understand the issues; expectation management (don't promise the impossible without expanded staff and/or time line); organizational commitment; insufficient buy-in from the top;
*lack of IT support either internal or outsourced
*essential technological components are lacking
*money or funding
*ownership of the problem-space (other disciplines besides LIS/AS)
*how to identify roles and responsibilities
*planning, quality control
*skills of digital curators -- lack of widespread competencies
*volume of data, long term preservation and access, metadata
*define what digital curation is, what it takes to do it
Discussion: very difficult to disambiguate the challenges listed above since they all touch upon each other.
2. What are the most important topics to cover in digital curation education?
*High level conceptual orientation, being aware of information/archival theory, OAIS model, risk management
*Functions and tasks like cost modeling, systems analysis, cataloging, web design
*Artifacts
*Standards
*Current landscape: economic models, major players, trends, users and services
3. What would you look for if you were hiring a digital curator for your institution?
*Communication skills were mentioned far more frequently than any other skill
*Programming
*Leadership
*Project management
*Metadata
*Service orientation, history of profession
4. What are the skills required that are currently lacking?
*technology and IT
*programming
*server administration
*knowing alternative technologies (not just picking one but how to evaluate between options)
*a good BS radar
One comment in the survey about distributing these skills over many people so one resignation doesn't hold up production
*people with fundamental respect for research process
5. Other comments?
*Allow for retooling for professional development for current LIS/Archivists but also for CS and systems people to remediate in the LIS stuff
*Metadata
*comment on being comfortable with digital curation but less so with digital curators
*don't underestimate the importance of management skills
*strengthen still useful skills from LIS/AS
*think about who you want to recruit -- do they need the discipline degree?
*eventually this will be "the" curriculum, we should teach it within the context of everything else
*seems to be more suited to a specialization on top of other degrees
Cal's comment on "vacancy in the professions": this is distributed among a bunch of different professionals. Who will be responsible?
Cliff Lynch - giving a mix of his opinion, reaction, based on many conversations with people here and in the past year. Some study of what was and wasn't covered in the program at this symposium.
Suggest that language does matter. We moved from the phrase "data curation" which comes out of developments in the sciences such as Chris Greer's long life data report to the "truly frightening" term digital curation--which we may want to consider getting rid of. We start from the perspective of archives and records management which are traditionally marginalized although recognized as important.
Better to recognize that we're not the only ones being impacted by large scale computing and networking. Everything has changed radically in the last couple of years. CL just came from NSF/JISC workshop -- it started about digital repositories, then morphed into talking about "data driven science" and then finally became about the entirely new ways of doing science/research. The new types of roles emerging.
"Research facilitators" start looking around S&E facilities of large grant receiving institutions -- they are finding ways to squirrel these people into their organizations (i.e. "staff technologist")
Humanities "critical editions" used to be important -- are they still?
There is a set of activities around data curation that we need to define, specifically regarding management and preservation of scholarly data. One set of activities for long term memory organizations. One set of activities for the creation of the data. Just how much specific scholarly expertise do you need? Critical question.
Difference in use of word "curatorial" in bio-informatics/biology data sets. It's more of a critical editorial role.
Digital curation vs. data curation. Curate. Recognize that once upon a time libraries had curators who built and managed collections, then this role got sliced and diced into all the various types of librarians you've got now (bibliographers, catalogers, etc.). The return to curation as a term reflects the changes that need to happen in the way we think about acquisitions in libraries.
Other comments: stunning how differently the participants view curation. We all have different opinions of what skills somebody with a certificate in digital curation would have.
We need people who can sort through social, organizational, economic issues around sharing, destruction, anonymization of information resources across time. Not just a narrow view of records management. Broader view needed of social policy and impact in order to make the necessary case for stewardship.
We have a lot of case law that gives primacy to individual ownership rather than social commons.
Economics - besides business and cost models for individual organizations, it is a social good and needs that type of funding rather than a bunch of organizations recharging each other in a circle.
Risk management - very important but difficult to do as it's hard to quantify value of irreplaceable objects. CL says we haven't yet discussed an acceptable loss rate. Keeping all bits for perpetuity is not doable from engineering standpoint. Must get explicit about that.
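To make the "acceptable loss rate" point concrete, here is a minimal back-of-the-envelope sketch. It is my own illustration, not anything presented in the session, and it assumes (unrealistically) that every object is lost independently with the same probability each year. The function name and the sample rates are made up for the example.

```python
def surviving_fraction(annual_loss_rate: float, years: int) -> float:
    """Expected fraction of objects still intact after `years`.

    Assumes each object is lost independently with the same
    probability every year (an illustrative simplification).
    """
    if not 0.0 <= annual_loss_rate <= 1.0:
        raise ValueError("annual_loss_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must not be negative")
    return (1.0 - annual_loss_rate) ** years


if __name__ == "__main__":
    # Hypothetical rates, chosen only to show how losses compound.
    for rate in (0.001, 0.01):
        for years in (10, 100):
            kept = surviving_fraction(rate, years)
            print(f"{rate:.1%} loss per year over {years} years: "
                  f"{kept:.1%} expected to survive")
```

Even a small yearly loss compounds over a long horizon, which is one way to see why an explicit target is needed rather than a promise of "forever."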
If your bits don't make it to next week all discussion is moot. We're in an environment where our information is exposed to all kinds of complex threats all the time.
How do you teach the next generation how to do something we don't know what to do? Go back to original principles underlying sound curation as place to start. Foundational principles are a good place to frame analysis.
Labels: DigCCurr 2007
DigCCurr 2007: Concurrent session: Digital Curation in Practice
I attended "Science and Biomedical Data"
Speakers: Milton Corn, Don Sawyer, Tyler Waters
Milton Corn "Archiving the Phenome"
Phenome - total mass of physical and mental facts known about you. It's coordinating genome info with patient info (date of birth, hair color, cholesterol, etc.).
[...] (e.g. OSHA). State laws vary. Need to maintain a paper record for preservation purposes is under debate. Text can substitute for non-textual information (e.g. x-ray report).
2. Well-being of the patient. Diagnosis and prescription of new illness can be influenced by past history. Implies records needed for life-time of patient. NOT a legal requirement. Hard to assemble from distributed sources, argument for personal health record or "super" repositories.
3. Well-being of family/nation. Patient's health record in genomic era of value to family, and to the entire population. Secondary use of health records of value to health services research, public health. Implies preservation "forever."
How to archive?
Same problems as for all digital archives plus:
*multiple content owners per patient
*variation in software, hardware, data formats, ontologies etc.
*privacy issues -- HUGE issues
*ownership of data not always clear
*multiple media, text, graphics, images all included
*Not seen as a problem in the U.S. by AMIA, AMA, NARA, MLA, AHIMA, DHHS, AHA (but modest discussion in U.K., Belgium, India, Australia)
Corn surveying current practice, results so far:
*DHHS: no response
*Large HMO: no response
*Hospitals and offices -- no archiving policy "we plan to keep forever", privacy safeguards for daily use, definitive record is mix of paper and electronic and may not include images or graphics, how to manage old data when EHR system is changed remains a problem. N.B. one practitioner said he erases colonoscopy videos after reading to prevent second guessing later by lawyers.
Summary: curation of clinical data
*not a problem now, at least it's not recognized yet
*will become a problem as soon as size, migration costs escalate esp. with imaging
*preservation by CIO may, in fact, work for solvent enterprises (hospitals, pharmacies, etc.) i.e. the public pays
*situation for office practices uncertain
*Can health care system conglomerate all health data for an individual? Unlikely unless patient is the custodian.
Don Sawyer "Digital Curation at the National Space Science Data Center"
Overview: NSSDC requirements and digital curation, NSSDC holdings and archival services,
NSSDC requirements:
*functions as the space science permanent data/metadata repository
*provides the space science community with data stewardship guidance and support. Data made available to the research community by various repositories should be well documented in order to support independent usability via, for example, virtual observatory access
*NSSDC as a repository making unique data/metadata available must participate in Virtual Observatory development efforts to assist in the practical evolution of these concepts
NSSDC uses OAIS concepts
Data providers:
*NASA's Space Science Active Archives typically under written agreements (MOUs)
*Space Science Space Flight Projects
Users:
NASA Space Science Archives
Space Science Projects
Individual researchers
General public
NASA headquarters
Digital holdings: acquiring data for 40+ years, currently 47 TB, reaching 270 TB by 2010, 1300+ experiments from 375 US and international spacecraft, over 4400 data collections (typically each with a large number of files)
NSSDC Archival information services
*permanent archive: long-term curation, uses AIP implementation, data may be repackaged and/or transformed to maintain accessibility and usability
*Second archive: data also held in another archive, NSSDC holdings may be in AIP form, data may be repackaged and/or reversibly transformed
*Third archive...
Administration activities
External: MOUs with various active archives, respond to NASA HQ requests, monitor progress of SAMPEX resident archive (home after project ends)
Internal: Oversee maintenance and modernization of infrastructure including systems administration (e.g. low cost Linux), manage personnel and physical space, oversee refreshing of tapes in archive every 6 yrs or less, oversee migration of legacy data from 9trk/3480 tape archive into current media
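The "refresh tapes every 6 yrs or less" rule is easy to picture as a simple schedule check. What follows is a minimal sketch of my own, not NSSDC's actual tooling: the tape IDs, the inventory layout, and the function names are all hypothetical, and only the six-year interval comes from the talk.

```python
from datetime import date

REFRESH_INTERVAL_YEARS = 6  # from the notes: "every 6 yrs or less"


def refresh_due(last_written: date, today: date,
                interval_years: int = REFRESH_INTERVAL_YEARS) -> bool:
    """Return True once a tape's age has reached the refresh interval."""
    target_year = last_written.year + interval_years
    try:
        deadline = last_written.replace(year=target_year)
    except ValueError:
        # last_written was Feb 29 and the target year is not a leap year
        deadline = last_written.replace(year=target_year, day=28)
    return today >= deadline


def tapes_to_refresh(inventory: dict[str, date], today: date) -> list[str]:
    """List the IDs of tapes that are due, in sorted order."""
    return sorted(tape_id for tape_id, written in inventory.items()
                  if refresh_due(written, today))


if __name__ == "__main__":
    # Hypothetical inventory: tape ID -> date the tape was last written.
    inventory = {"T-0001": date(2000, 5, 1), "T-0002": date(2004, 3, 15)}
    print(tapes_to_refresh(inventory, today=date(2007, 4, 20)))
```

A real archive would presumably schedule the copy well before the deadline and verify the new tape afterwards; this only shows the date arithmetic.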
Ingest activities
*Development: develop, maintain and enhance new AIP ingest software, enhance remote submission information package and AIP creation software (MPGA) to support non-Linux platforms, large SIPs and reliable electronic delivery of SIPs
*Operations: identify current/expected missions, collections, research and organize information, populate data management database
Archival storage
*development: develop upgrades to AIP storage manager, develop provenance management system, develop integrated document management preservation system
*operations: manage media and AIPs for 3 service levels
Data management
*maintain descriptive information database to include photo searching & support automated ingest, revise database to normalize and streamline infrastructure, design and implement XML mark-up of metadata producing systems to enhance finding aids
*participate in appropriate registries in Space sciences (e.g. heliophysics virtual observatories)
*provide general request and access support
Preservation Planning Activities
*External: continue participation/leadership in standards activities, monitor technology trends, sponsor NASA-wide workshop on archiving and metadata standards, provide curation guidance regarding documentation, database reports etc.
Key staff roles and skills
*Curation scientists: PhD in space science discipline, extensive handling and analysis experience
*Information architect
*Systems engineers
*Database administrator
*Operations manager
*Archive Head: PhD in space science discipline
Conclusions:
*Need science discipline experts with curation training (curation scientists) for interacting with data providers, data users
*Need computer professionals with curation training, working with curation scientists, for development and operation of internal systems and to interact with similar personnel at data provider sites
*Desire data providers with 'preservation understanding' to assist with ingest.
Tyler Waters "To Stand the Test of Time" Report on workshop of the same name
(**ed. note presenter went incredibly fast and it was quite difficult to keep up, pardon the brevity of these raw notes in advance)
Workshop findings
*The ecology of digital data reflects a distributed array of stakeholders, institutional arrangements, and repositories with a variety of policies and practices
*The scale of the challenge regarding the stewardship of digital data requires that responsibilities be distributed across multiple entities and partnerships that engage institutions, disciplines and interdisciplinary domains
*Historically universities have played a leadership role in advancement of knowledge and shouldered substantial responsibility for the long term preservation of knowledge ... an expanded role for some research and academic libraries and universities along with other partners, in digital data stewardship
*data is distributed, heterogeneous
*stewardship involves both preservation and curation and should be throughout the research life cycle.
Workshop recommendations
*NSF should facilitate the establishment of a sustainable institutional framework for long-term stewardship of data. This framework should involve multiple stakeholders by:
*supporting the research and development required to understand, model,
*supporting training and educational programs to develop a new workforce in data science both within NSF and in cooperation with other agencies, and...
*developing, supporting, and promoting education efforts to effect ...??
Also
1. Fund projects that address issues concerning ingest, archiving, and reuse of data by multiple communities
2. Foster the training and development of a new workforce in data science
3. Support the development of usable and useful tools
4. ??
5. Include data management plans in the proposal submission process
6. NSF should encourage the development of data sharing policies for programs involving community data
URL for full report "To Stand the Test of Time - Long-term stewardship of digital data sets in science and engineering"
http://www.arl.org/bm~doc/digdatarpt.pdf
Question re: NSF funding models for data curation centers
*want proposals in domain science areas, usually funding for 5 years and can be renewed for another 5 years
Labels: DigCCurr 2007
DigCCurr 2007: Concurrent session: Building Capabilities for Digital Curation
I attended "Defining Capabilities" Speakers: Liz Bishoff, Nancy McGovern, Oya Rieger
Liz Bishoff: Digital Preservation Assessment: Readying Cultural Heritage Institutions for Digital Preservation
Benchmark paper 1996 "Preserving Digital Information" by ARL defined the issues and possible solutions to digital preservation. Cultural heritage includes scientific as well as arts.
The number of projects doing digital preservation is minuscule compared to the number of cultural heritage institutions that are digitizing, or having born-digital items. The publishing of the ARL white paper indicated that digitization can be preservation. More funding agencies have preservation solutions in their requirements. There are emerging state, national, and international initiatives for digital preservation. Are all these projects ready for managed digital preservation?
Bishoff surveyed institutions on their readiness. Also Anne Kenney at Cornell studied institutions participating in NEH funding 2003-5. Cornell study results are available in RLG DigiNews. Cornell found that 90% were still using CD/DVD for digital storage. That number is now about 70%. Only 50% had policies and only 30% have implemented those policies.
2005 NEDCC Survey - 66% of institutions had no one responsible for digital preservation, paralleled Kenney's findings. Also a fair number of institutions indicated that they had backed up only once (or not at all!!!)
Findings from Bishoff's survey: Issue of digital preservation is just now coming to the forefront of discussion and action. Many institutions are still at the project stage and have not yet gotten to the ongoing program stage. Written policies and documented digital preservation practices are lacking. Preservation/conservation staff are generally NOT directly involved in many of the digital initiatives.
They also found: Few have a coordinated institutional approach to their digital initiative, especially in the areas of standards (imaging, metadata), quality control, access, promotion, digital preservation. A big lack of understanding of when an institution has "born digital" material. CD/DVD is the major storage media but moving to networked servers. Refreshing data on CD/DVD with lengthy periods between refreshing. Quality control of master images is inconsistent at best. Education is important before doing a digital preservation project. Ability to advocate for digital preservation is lacking at many institutions. Funding is primarily through local funds and grants.
Areas of policy which support digital content: mission and goals, collection development, emergency preparedness, exhibitions, preservation, strategic planning, public services, rights and licensing.
If the institution was outsourcing, did they follow the elements of a trusted digital repository (TRAC?)? Financial viability of the company you choose is very important.
Types of recommendations: improved documentation (continuity planning, work flow processes, etc.), review digital preservation activities including refreshing schedules, quality control, etc., review system back-up procedures and implement off site storage.
So what does it mean? focus of long term preservation has been on the technology and standards, certification, etc. to build the infrastructure. To make it reality we now need to
*expand advocates for long term preservation
*expand the knowledge base of practitioners
*move from digital project to digital program
*integrate preservation into all aspects of digital life-cycle
*develop best practices
*make policy examples available
Education needs to be moved to the state and regional level. Also we need both professional AND continuing education. That needs to focus on technology and standards, policies and tactical strategies (development and implementation), work flow and documentation, business planning and all that it involves such as market research, financial analysis and planning.
Conclusion: progress is being made, need to increase awareness of importance, most institutions which are doing digitization, however, are not doing the basic preservation activities.
Question - what is the definition of digital preservation? Response from ALCTS Preservation co-chair -- they are sending out a definition to various email lists in the next couple of weeks.
Nancy McGovern: Canary in a Coal Mine: A digital preservation response to technological change
How to deal with open-ended change? We've lacked specificity and scope about how to respond to that change. How do we detect things that might have adverse implications for digital preservation? How do you go about doing the assessment?
Outline: technology response requirements, common response, scope of interest, priorities for digital preservation, timing response to technology.
Technology response: the call for responding came in the 1996 seminal paper. The specification is most explicit in OAIS.
OAIS monitor technology: objective: track emerging technologies, information standards, computing, platforms. purpose: avoid obsolescence.
Examples of technology watch: DPC, DCC, DigiCult, LITA, PRONOM
Characteristics: range in services provided by technology watch services reflects absence of definition. Providers select topics not community, lack access to accumulated data, defined levels of service is rare (detailed synopsis? headline?).
Community formalization: Digital preservation for museums CHIN 2004 service requirements, LIFE project UCL/BL, 2006. Strategic priorities of SAA 2006-7 calls for leadership and training on how to respond to technological change.
Scope of interest: macro taxonomy
Object - file formats, media, metadata
Collection - relationships, metadata
Repository - software, tools, modules
Platform - protocols, security, software, hardware
Scope: micro taxonomy
35 technology types enable OAIS
Examples: communication (ability to convey message), logs, policy enforcement
Priorities for digital preservation: Contact, interaction, exploitation, risk management, automation.
Contact: requires direct contact with digital content
Interaction: must respond to, not just be made aware of, changes in digital content
Exploitation: potential to contribute to digital preservation strategies by exploiting opportunities
Risk management: participates in the avoidance of risks to integrity, longevity, or authenticity
Automation: potential to perform more effectively
Timing response is important
Identify potential new technology, monitor, assess, respond, act to avoid obsolescence of existing technologies.
Technology responsiveness - Community objectives:
*accumulate current and historical information
*develop competencies and tools
*incorporate community developments
*build a network of contributors and users
*ensure sustainability
Question: is there a need to have some kind of peer review process/consensus building on deciding whether a technology matters or not or how to go about implementing collaborative technology watch?
Answer: Assessment is key. Organizations should be able to pick the right size thing for them which fits their requirements. People must be able to picture themselves in the results.
Digital preservation is research and development although we do it in a production environment we need to keep questioning and monitoring and assessing. William Gibson "The future is already here, it's just not evenly distributed yet."
Oya Rieger: Select for Success: Key principles in assessing Repository Models
Within a life-cycle framework, digital curation involves a series of technical, intellectual, and managerial activities in support of stewardship for digitized or born-digital information assets.
What is a repository system? A system to capture, store, index, manage, preserve, and deliver digital objects.
Factors in choosing a repository model
*development characteristics
*financial sustainability
*digital library infrastructure
*interoperability and support for standards
*institutional policies and practices
*support of archival business requirements
*content type characteristics
*preservation functionality
*usability (staff and end-user)
*search, browse, access features
See Art Libraries Society of North America "Digital Image Database Standards Checklist"
RLG/OCLC "Trustworthy repository certification"
Key principles in selecting a repository
1. Identify key stakeholders (users, programmers, subject experts, etc.). This builds awareness and trust, gathers feedback, gets support, expands resources, and helps you understand risks
2. Conduct needs assessment to characterize your environment. Include documents (document type, condition, metadata attributes, selection criteria, usage restrictions, relation to other collections), users, and resources (available staff, money)
3. Explore resource requirements. Institutional repositories SPEC Kit: start-up costs range from $8,000 to $1,800,000 (mean = $182,550), with an average ongoing operating cost of $113,500. There are many hidden costs. No common metrics yet to determine what information points to include
4. Understanding the existing and evolving human landscape. Work culture and practices, relevant social groups, interpretive flexibility, appropriation (how technology fits into the workplace and how it supports your culture).
Quoted Judson King's report from Institution for Studies in Higher Education, 2006 "Scholarly communication: academic values and sustainable models" -- "Approaches that try to move faculty and their deeply embedded value systems directly toward new forms of archival systems are destined to fail"
Conclusions:
*flexible and scalable repositories - Choudary and Martino 2005 "At Johns Hopkins, we are promoting the idea that applications should access repositories through an abstract, repository agnostic layer, rather than through custom application to repository integrations" see Cornell, as pioneers they ended up with many different repository systems (Greenstone,
*web services/service oriented architecture models - ex. file format migration, file obsolescence service, social tagging, citation analysis, text annotation, plagiarism detection (added to arXiv recently)
*repurposing - ex. Cornell making their digitized books available via Amazon print-on-demand
*new information chain - see Van de Sompel's D-Lib article on how the information chain is expanding
Labels: DigCCurr 2007
DigCCurr 2007: Migraine kills the plenary
I missed this morning's plenary because I awoke with a throbbing right eye which quickly evolved into a killer headache. More sleep and a few ibuprofen have lessened the impact but I'm still not quite up to a long day of sessions. I shall persevere and keep my fingers crossed that I feel better after food and caffeine.
The organizers of the conference are surveying attendees on the types of skills they look for when hiring digital curators. It made me think about the myriad areas in which a repository manager must be conversant. First and foremost, I think repository rats need people skills. So often our problems are not technical but political. Developing teams, collaborating across multiple institutions, convincing contributors to contribute to collections, and raising money all require schmoozing.
Second, I think repository rats need an appreciation of archival theory. I didn't glean anything about archives/archiving from my MLIS. I ended up going to UCLA for a CAS to get that specialization (*ed. note - I'm one credit shy of completing that certificate. I won't speak of why I quit UCLA, but if you truly want to know, I'll tell you in person). While working on the CAS, I found the learning I did about authenticity and evidence in digital record keeping to be incredibly useful for explaining to people why archival control is necessary for some types of repositories and/or digital collections. Third come all of the technical skills -- which are quite numerous. Running servers and databases, metadata creation and interoperability, and creating and testing websites only skim the surface. Finally comes financial and business acumen. Most repository projects have been financed with project funding. As repositories evolve from projects to production, this type of funding is not sustainable. Just ask the folks at NDIIPP what happened when Congress changed. An understanding of the business case for your repository is crucial not only for pitching funding agencies but also for evaluating the success of your repository. The business case provides the measures of assessment.
I'm off in search of a decent coffee. I'm staying at the Marriott near the Friday Conference Center. It feels like I'm in the middle of nowhere. You can't even cross the street due to a lack of sidewalks and pedestrian signals. Nothing is really visible except a golf course and a housing development. Unfortunately, hotel room coffee is a bit weak for my espresso habit. There is a business park across the street and I've heard rumors of a bakery. Bakery = potential latte. Wish me and my pounding head luck.
Labels: DigCCurr 2007
***ed. note -- it's 4ish in the afternoon, my brain is dead tired, and the speakers in this session are incredibly quiet and mumble-y. these notes may be more raw than the others. ***
I attended "Designing & Implementing Repositories Across Institutional Boundaries"
Speakers: Mike Smorul, Bill Underwood, Richard Marciano
PAWN project
Michael Smorul
http://umiacs.umd.edu/research/adapt or Google ADAPT UMIACS
Problems facing ingestion
*reliable data transfer
*each producer/archive interaction is unique
*how the archive deals with each collection is unique as well
Distributed ingestion with PAWN
*multiple producing sites with different requirements
*separation of administrative responsibility
Components - showed network architecture diagram
Package work flow overview
1. create producer-archive agreement
2. client package template
3. create package based on template
4. once approved, packages can be archived
5. rejected packages can be held until rectified or deleted for resubmission
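The five steps above read like a small state machine, so here is a minimal Python sketch of that reading. This is my own illustration, not PAWN's actual data model: the state names, the `Package` class, and the allowed transitions are all assumptions inferred from the list.

```python
from enum import Enum


class PackageState(Enum):
    DRAFT = "draft"        # created from a template, awaiting review
    APPROVED = "approved"
    REJECTED = "rejected"  # held until rectified, or deleted
    ARCHIVED = "archived"
    DELETED = "deleted"


# Allowed transitions, read off the five steps in the notes.
# This table is an assumption, not PAWN's documented behavior.
TRANSITIONS = {
    PackageState.DRAFT: {PackageState.APPROVED, PackageState.REJECTED},
    PackageState.APPROVED: {PackageState.ARCHIVED},
    PackageState.REJECTED: {PackageState.DRAFT, PackageState.DELETED},
    PackageState.ARCHIVED: set(),
    PackageState.DELETED: set(),
}


class Package:
    """Hypothetical package that tracks where it is in the work flow."""

    def __init__(self, name, template):
        self.name = name
        self.template = template
        self.state = PackageState.DRAFT

    def move_to(self, new_state):
        # Refuse any step the work flow does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state


pkg = Package("example-batch-001", template="example-template")
pkg.move_to(PackageState.REJECTED)  # archive finds a problem
pkg.move_to(PackageState.DRAFT)     # producer rectifies and resubmits
pkg.move_to(PackageState.APPROVED)
pkg.move_to(PackageState.ARCHIVED)
```

Keeping the transitions in one table makes the producer-archive agreement easy to inspect: anything not listed (archiving a rejected package, say) simply raises an error.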
Custom roles
*actions in PAWN can be grouped together to create roles (modify items in a package, create users, etc.)
*default roles
**producer
**records manager
**archive manager
**global administrator
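To make the "roles are groups of actions" idea concrete, here is a rough Python sketch. The action names and the mapping of actions to the default roles are hypothetical; the talk only gave "modify items in a package" and "create users" as examples of actions.

```python
# Hypothetical actions bundled into the four default roles from the notes.
# The specific assignments are assumptions for illustration only.
ROLES = {
    "producer": {"create_package", "modify_package_items"},
    "records_manager": {"modify_package_items", "approve_package", "reject_package"},
    "archive_manager": {"approve_package", "reject_package", "archive_package"},
    "global_administrator": {"create_users", "define_roles"},
}


def define_role(name, actions):
    """Register a custom role as a named bundle of actions."""
    ROLES[name] = set(actions)


def is_allowed(role, action):
    """Return True if the role's bundle includes the action."""
    return action in ROLES.get(role, set())


# A site with its own division of labor can add a role of its own.
define_role("reviewer", {"approve_package", "reject_package"})
```

With this shape, a site-specific role is just another named set, which is presumably what lets each producing site keep its own administrative arrangements.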
PAWN utilizes SRB from SDSC
Case study 15,000 CD-ROMs of LANDSAT data
Case study from SLAC @ Stanford, created specialized roles (records creator, records liaison officer, records manager)
William Underwood. PERPOS (Presidential Electronic Records Pilot System)
*initial objective, R&D project, develop tools to support archivists in gaining intellectual and physical control of PC records from the administration of George H.W. Bush
*contents of 500+ hard drives
*included operating system and software applications as well as user-created files
*DOS and Windows 3.1
PERPOS
*developed a prototype system to support accession, arrangement, preservation, review and description of e-record series
*evolutionary prototyping
*system has been pilot tested by archivists at the Bush Presidential Library
*several record series have been systematically processed
*FOIA processing currently being pilot tested
***found viruses in legacy data *** important to use virus checkers
Summary of research results and benefits
*supports both systematic and FOIA processing of presidential e-records
*provides an environment for experimental application of advanced information technologies to archival process
*document type identifier speeds up processing
*automatic description of items, file units, and record series enables earlier intellectual control of e-records.
*prototype access restriction checker
*knowledge acquisition reduces work required to apply access restriction checker to records of subsequent administrations
Richard Marciano, SDSC/UCSD
The perspectives of digital curators on building distributed repositories
Collaboration between digital curators and IT folks looking at how to make cost effective distributed repositories.
PAT = persistent archives testbed
2 yr NHPRC project, extended for 1 year
Project summary:
*participants were digital curators from libraries, archives, historical societies, scientific data environments, museums and IT researchers and staff
*main goal: design a distributed repository for electronic records management, demonstrate the management of various types of records with a common software infrastructure
*approach: each site chose an archival collection, set up access control and update permissions for their preservation environment independently of the other participants.
Presentation goals:
*comment: David Giaretta says "no repository is an island" ... PAT fits the archipelago model
*examine: lessons learned and skills needed by digital curators to automate archival functions (appraisal, accessioning, arrangement, description, preservation, and access of records), benefits achieved by using common infrastructure
PAT Community Grid
Local storage resources
||||
SDSC Archive
||||
MCAT Metadata catalog (Oracle), Shared preservation environment, Storage resource broker (SRB)
Unique contributions of digital curators to the infrastructure:
*Windows based SRB clients/servers
*Development of a Perl for Windows client library
*Bulk operations were developed, tested, and refined (registration, accessioning, metadata extraction from the records, metadata loading, validation of data movement into/out of the system/within the system)
*End-to-end work flows were developed (accessioning, replication)
*SRB bugs revealed: better reliability
*MCAT ported to MySQL (Oracle, DB2, Sybase, Informix)
*Development of a wiki for documentation
*Registration of filenames with unusual characters discovered and fixed
*Suggestions on ways to simplify governance issues tied to particular types of data management:
**need to express such policies as rules to be applied to the data management system.
**development of the next generation of data grid technology: iRODS (integrated Rule-Oriented Data System)
**Each preservation process is expressed as a set of micro-services (operations that can be performed using a remote storage system)
What Digital Curators Liked
*leverage common software and hardware
*use commodity storage hardware
*lower the cost of participation
*reduce the level of expertise required at each site
*focus on management of the archival collections and outsource the details of the archival repository
*automate the manipulation of collections to minimize the level of effort
Conclusions
*PAT suggests that sustainability is probably beyond the capability of most archival repositories (costs of tracking new types of technology, expertise to manage, costs of storage systems and databases)
*outsourcing of the management of records is feasible through use of data grid technology
*preservation environments can be assembled by creating regional community archival partnerships with university data centers (yes, there are still many political barriers)
*independence can be maintained
*service agreements for storage and preservation of archival e-records are needed
Labels: DigCCurr 2007
DigCCurr 2007: Afternoon plenary: What is digital curation?
Speakers: Peter Buneman from U.K. Digital Curation Center & William Lefurgy from NDIIPP.
Peter Buneman: Databases & Digital Curation
Databases in science and scholarship:
*Nearly all branches of science depend on database technology for storage and retrieval of data.
*This has changed the scientific method (Mike Lesk)
A curated database is:
*A reference work
*Value lies in the organization and annotation of data
*Commonly constructed by copying parts of other (curated) databases
*Replacing traditional dictionaries, gazetteers, encyclopedias
*Rapidly increasing in scientific research (>800 in molecular biology)
*Catalogs and archival metadata are usually curated.
*Constantly checked/verified. Data quality and timeliness are important.
*Often a group effort. Produced by a dedicated organization or as a collaboration.
*Labor intensive
*Increasingly seen as "publications" by scientists.
Compare traditional libraries vs. databases
Storage in libraries is: redundant, persistent, distributed, readable by people, clear standards for citation, historical record, well understood legal. Databases? Not.
Example of the CIA World Factbook and plotting the population of Liechtenstein in 1990. Are you better off online or in the library?
Research on Provenance:
Very difficult and long term problem
Database preservation:
How do you preserve something that evolves (both in content and structure)
Snapshots, if frequent, are time consuming; if infrequent, you miss something.
Snapshots are immediate, and longitudinal/temporal queries are easy.
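One way to picture the trade-off: instead of keeping every snapshot whole, merge them into a single structure that records which versions each value was valid for. The Python sketch below is my own toy illustration of that idea, not the speaker's system; the class name, method names, and the sample keys and values are all made up.

```python
class VersionedArchive:
    """Keeps every value a key has held, tagged with the versions it held it in."""

    def __init__(self):
        self._runs = {}      # key -> list of [first_version, last_version, value]
        self._versions = []  # snapshot versions seen so far, in order

    def add_snapshot(self, version, records):
        if self._versions and version <= self._versions[-1]:
            raise ValueError("versions must be added in increasing order")
        previous = self._versions[-1] if self._versions else None
        self._versions.append(version)
        for key, value in records.items():
            runs = self._runs.setdefault(key, [])
            # Extend the last run if the value is unchanged since the
            # previous snapshot; otherwise start a new run.
            if runs and runs[-1][1] == previous and runs[-1][2] == value:
                runs[-1][1] = version
            else:
                runs.append([version, version, value])

    def value_at(self, key, version):
        """Value of key at a version, or None if no run covers that version."""
        for first, last, value in self._runs.get(key, []):
            if first <= version <= last:
                return value
        return None

    def history(self, key):
        """All (first_version, last_version, value) runs for a key."""
        return [tuple(run) for run in self._runs.get(key, [])]


# Made-up records: "a" changes in version 3, "b" changes in 2 and is gone by 3.
archive = VersionedArchive()
archive.add_snapshot(1, {"a": 100, "b": 5})
archive.add_snapshot(2, {"a": 100, "b": 6})
archive.add_snapshot(3, {"a": 120})
```

Unchanged values are stored once no matter how many snapshots are taken, and a temporal query such as `archive.history("a")` is a lookup rather than a comparison across separate dumps.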
How do you cite something in a database? Many scientific databases ask you to cite them, but they
*don't tell you how, or
*they tell you to give the URL, or
*they tell you to cite a paper about the database.
What is a citation? Location and descriptive information.
Getting a canonical version of the database, borrowed data set from nearby scientist. The first task was to convert the db into a hierarchical structure. Preserve all versions of the data, generate stage web pages (less software, more efficient).
Able to clean-up the data.
Created a book from the database.
What is digital curation? A unified approach to
Preserving - the process of preserving digital data for future use, once it has been created
Creating/maintaining - missed last part of this second bullet...
Impertinent thoughts on a DC curriculum. Should they involve databases?
Yes, but they need not be intrusive or hugely time consuming. Teach data formats first and use that as an introduction. No need to teach internals or optimization. Teach design through data import/export technology and through schema mappings. Provide a short course in semistructured data and ontologies.
Do internships with data publishers (they need help!).
Other things to think about
*legal aspects (copying)
*security and confidentiality in databases, timed embargoes
*economics of long-term database maintenance, open access
*Combine with or borrow from, other curricula (ex. NSF data integration)
William Lefurgy from NDIIPP. Digital Curation and Sustainability
Will focus on economic sustainability because he thinks more curators need to pay more attention to that aspect of digital curation.
Aspects of Sustainability
*to keep in existence, to maintain or prolong
*meeting the needs of the present without compromising the ability of future generations to meet their own needs.
*resources, broadly defined, for keeping digital materials available and accessible over time (technology, staff, cash)
*concept application at different levels (national, consortial, local, etc.).
Wants to focus on the third bullet - resources. Brian Lavoie has pointed out that we need an economic infrastructure as well as a technological infrastructure. Example of LC's assumption that the money they had available would remain available. Not so. Their funding was revoked when Congress changed.
Sustainability closely linked with other key issues: collecting content, developing technology, and outlining public policy.
Preparing for substantial discussion of sustainability in the final report of NDIIPP.
Many projects are recognizing the issue. ARL "To stand the test of time" etc.
The need is clear
*expanding digital stewardship requirements
*infrastructure, capabilities still largely geared to analog
*digital funding is largely project based
*broad range of work necessary to effectively manage content across life cycle
*rapid change means regular migration of data, systems
Open questions
*how to preserve, make available
*how to transform existing stewardship organizations and practices
*what are the costs?
*who pays?
*why it matters
See
http://www.arl.org/bm-doc/econ_models.ppt for a "mind-map" of the economic issues
Basic action items
*make content value explicit
*probe business case elements
*explore business models
Content value/Why should I care?
*values of digital materials are typically intangible, as is the material itself
*funders need clear, concrete evidence for importance of digital content
*must clarify the demand side. what values accrue as a result of preservation? what are the deficits if the content goes away?
Content value clearly explained
*need frameworks to consider dimensions of value for various digital materials, e.g. value for institutional users, value for institutional reputation, prestige, value for posterity
see British Library "contingent valuation"
Business case: risks, fixes, costs
*compelling story about risk
*incentives/barriers
*plan for addressing risk
*some estimate of cost
*value added by the curatorial practices
The how and how much
*needed level of service, e.g. bag and tag, transformation, disaggregation, rich metadata
*prospective work flow
*credible cost estimates (see City College of London??)
Business models
*how to put preservation into operation
*provide for resources on an ongoing basis
*leverage incentives, remove barriers
*emergent models, but experimental at this point; modeling and testing appropriate
*collaboration is key
see LOCKSS/CLOCKSS and Portico for emerging models
Working within a network
*no one institution, community or sector can develop the best solution, collaboration is essential
*networks build shared infrastructure, reduce costs
*repositories will vary but all can draw from shared suite of tools, services, best practices.
Self-interest and the public good
*institutions work together in pursuit of individual net positive value
*key to sustainability: members get value from networks, but benefits accrue to all from exchange of knowledge
Summary of collective needs
*work to illuminate content value for decision makers
*make the case for specific curatorial actions with supporting cost data
*implement and test models
Labels: DigCCurr 2007
DigCCurr 2007: Identifying Digital Curation Services and Functional Requirements
I attended "Data Sets, Metadata, and Management"
Speakers: Jane Greenberg, Gail Steinhart, James Tuttle
Jane Greenberg. DRIADE project. Overview: introduction, consensus building, functional requirements,
DRIADE = Digital Repository of Information and Data for Evolution.
Big science initiatives already with webbed interfaces to the data and lots of services built on top. The internet also impacts "small science" such as Knowledge Network for Biocomplexity (KNB), Marine Metadata Initiative (MMI).
Evolutionary biology has some requirements for data deposit (GenBank, TreeBase). Some publications also require supplementary data such as Molecular Biology and Evolution. No single one-stop shop for evolutionary biologists. DRIADE was developed in response to that need.
Goals of DRIADE
* develop one-stop shop for scientific data objects supporting published research
* support data acquisition, preservation, resource discovery, data sharing, and data reuse of heterogeneous digital datasets.
* balance a need for low barriers, with a higher-level data synthesis.
Consensus building
Had a stakeholders workshop where they invited reps from the major journals, organizations/societies, and scientists.
Outcomes: found there was unanimous support for the project, participants felt it was necessary to advance science in this field. It was agreed upon that they would have a central data repository. Journal representatives felt they had a moral obligation and moral authority to initiate this and get people on board. A data repository will help to verify authenticity of data and provide a bit of "policing" in terms of data security. It can also help with interoperability with GenBank and TreeBase, working on "handshake activity."
Challenges: scope, representation, quality control, security, cultural change, sustainability.
Scope: should it be restricted to data supporting publications? Should other data be included? One participant advocated for doctoral theses. Who will contribute? Who will create the metadata? At what stage in the publication life cycle should data be contributed? How to coordinate with the journals?
Rights are a very important issue. Rights of authors to their datasets. Do journals have rights to publish data collected with public funds? General consensus from the workshop is leaning towards a model like Creative Commons. At what stage in reuse do re-users credit the creator of the data?
Representation has many associated challenges. Standardization and strict adherence to standards vs. keywords. A combination of both is a practical way to approach it. Looking to generate as much metadata as possible automatically. Some experiments in drawing keywords from the published article.
Quality control of the data input: how to maintain it? People at the workshop were very against the idea of a data curator initially, but by the end of the workshop they were thinking a data curator (or more than one) is necessary for quality control.
Security is an issue because evolutionary biology is a controversial subject (think creationists). Need to protect.
Sustainability? How do you encourage scientists to deposit data? DRIADE is fortunate in that there is buy-in from the journals. Some discussion of mandatory deposit policies leading to passive-aggressive behavior (like mislabeling tables, omitting data, etc.). There will be a need to "flag" incorrect items. Finally there is an issue with the ongoing funding model. Should it be subscription based? Grant funded?
Priorities and next steps: preservation, access, synthesis (Maslow's hierarchy of needs), cultural change via editorials, publicizing at conferences, requirements. Right now they consider the preservation itself to be the highest priority because there is a lot of data being lost.
Functional requirements: compared other small science projects.
Support: computer-aided metadata generation, specialized modules linking data submission to work flow.
Functional model based on OAIS.
Metadata framework.
Level 1 - initial repository implementation - preservation, access, and basic usage of data.
Level 2 - ??
Level 3 - ??
Application profiles: data elements drawn from one or more name space schemas combined together by implementors and optimized for a particular local application. Single existing schemes are often not sufficient.
Level 1+ bibliographic citation for the journal article, data object metadata (incl. PREMIS)
Level 3 brainstorming: thinking of web 2.0 technologies, personalization, macros, tagging etc.
Conclusions: Team work required. Stakeholders meeting was critical. They are benefiting from prior work. Next steps will be to survey and do use-case and life-cycle studies. Metadata application profile experiment with evolutionary biologists actually using it.
Implications for education: students participate in the project, service learning is invaluable. Curriculum needs to address the whole picture including digital resource life cycle, metadata life cycle, IA components, human factors. Language barriers and communication skills ...different vocabularies in different domains. Conferences like these.
Gail Steinhart described her work with a research group at Cornell.
Overview: motivation, strategy, and what we have learned so far.
Motivation: don't need to spend too much time with this audience explaining why digital curation is necessary. There are digital preservation issues and there is information entropy (loss of information about data over time which lessens its usefulness). Mentioned NSF Cyber-infrastructure Vision for the 21st century. Digital curators need to stay plugged into those. She attended the funders perspective concurrent session this morning and it is more and more evident that funders are going to require data curation.
Curation definition from DCC. She does not think that academic libraries are going to do all aspects of curation as per that definition (DCC definition includes processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge). Libraries should try to fill in gaps in infrastructure and support but this is not without costs. Engaging in this area will require libraries to develop new partnerships.
In the Cornell project, multiple departments and units are involved (animal science, biological and environmental engineering, ecology, horticulture, Mann Library, etc.)
Types of data collected: observational data, experimental data, simulation models. The group also has 30 years of historical observational data that they would like to preserve and share.
They want to share (a) among themselves to create the simulation models, (b) as a public good with policy makers and the general public, and (c) because the PIs have committed to being a model for shared collaboration. They appreciate the utility of their data for others.
Strategy:
Didn't make sense to develop infrastructure for this one small group. Needed local support of data, metadata.
Project participants share data and metadata via a staging repository provided by the library. When they are ready to make it public they have some choices: Cornell's institutional repository vs. discipline repositories.
Staging repository
*use discipline specific metadata standards and tools (Ecological Metadata Language [EML]), Morpho
*Provide a place to share pre-publication data within the group.
*Provide training and recommendations on metadata
EML
Morpho is a metadata editor that makes it pretty easy for PIs to create metadata. Helps non-librarians to create a metadata record. Metacat?
"Publication" of data options
*DSpace/Cornell institutional repository... but this doesn't add to the science infrastructure so they encourage
*submit metadata (and possibly data) to discipline specific repository (KNB, other?)
Test case: Historical
Observational data from past 30 years
Original format Quattro Pro workbooks with multiple pages.
Converted to Excel for further review and clean-up
Various errors (apparent duplicate records, misaligned columns, out of range values)
Missing or ambiguous information (methods, units, geographic locations)
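Some of these clean-up checks could be scripted once the workbooks are exported to rows. Below is a small Python sketch of two of them (duplicate records and out-of-range or missing values). It is only an illustration: the field names, the valid range, and the sample rows are made up, and this is not the process the speaker described using.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each row that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def find_out_of_range(rows, field, low, high):
    """Return (row index, raw value) for values missing, non-numeric, or outside [low, high]."""
    flagged = []
    for index, row in enumerate(rows):
        raw = row.get(field, "")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            flagged.append((index, raw))  # missing or not a number
            continue
        if not low <= value <= high:
            flagged.append((index, raw))
    return flagged


# Made-up observations; "temp_c" and its plausible range are assumptions.
rows = [
    {"site": "A", "date": "1985-06-01", "temp_c": "18.5"},
    {"site": "A", "date": "1985-06-01", "temp_c": "18.5"},  # apparent duplicate
    {"site": "B", "date": "1985-06-01", "temp_c": "185"},   # likely a slipped decimal
    {"site": "B", "date": "1985-06-02", "temp_c": ""},      # missing value
]

duplicates = find_duplicates(rows)
suspect = find_out_of_range(rows, "temp_c", low=-5, high=40)
```

Checks like these only flag candidates. Deciding whether a flagged row is a real error still takes the back-and-forth with the data owner.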
Extensible model??
Lots of work between the data owner and her. It raises the question of whether this level of service can be provided by libraries.
Summary: curation skills
*traditional library and archiving skills (metadata, preservation, interoperability, appraisal, and selection)
*understanding content area ... need to understand how researchers in a discipline really do research.
*awareness of standards and tools related to data.
*productive partnerships with researchers.
Question: how do contributors work with controlled vocabulary? Answer: haven't had to face it yet. It's too new for the researchers. Had to begin with teaching them what metadata is, what EML is etc. EML doesn't require controlled vocab but can accommodate it. She suspects that the reality is that they're going to do what they are going to do and we have to accept it because it is better than not getting the data sets at all.
James Tuttle, NCSU, Curation and preservation of complex data NC Geospatial Data Archiving Project
In his experience geospatial researchers value the newest data and ignore older data when newer data becomes available.
Geospatial data types are complicated. Vector data, for example. It's highly difficult to preserve. Aerial imagery is a little simpler. Spatial databases are incredibly difficult to preserve. Can do one-many export of images and vector data but trying to manage the relationships between them is difficult.
Repository pre-ingest work flow
Data receipt - format processing - metadata processing - ingest processes. The process is as much social as technical.
Data receipt: includes acquisition, reorganization, validation, threat analysis, inventory. Try to automate wherever possible. They have no demands on contributors, so files come to them "as is."
Using JHOVE tool to harvest, even though not designed for geospatial data.
Format processing: geospatial data traditionally has not migrated well between formats. Processing includes conversion, compound formats. Typically data is gathered without metadata. Some metadata can be generated from the GIS software. When it exists it typically requires remediation to be used in basic retrieval.
Ingest Processes: metadata conversion, SIP creation.
Extended Curation: feedback loop with contributors, constant improvement on metadata being received. Also work with industry and standards organization (geospatial consortium)
http://www.lib.ncsu.edu/ncgdap/
Question: Have you met with resistance to depositing from contributors?
Answer: Some agencies in N.C. had strict data sharing requirements which impeded our ability to use the data. A lot of local agencies have liability concerns with older data which may be obsolete, superseded, etc. Need to have clear disclaimers. Have to work with providers to reassure them.
General questions:
Comment: general theme of complex/compound objects in the papers, but the presentations bring home the need for people skills. How to talk to people who don't use our language re: metadata? How do you develop that skill set in students so they can speak to researchers about their data sets? All three of the presentations mentioned an education component where they had to explain to researchers about metadata and its function. How can curators be good educators?
Question: In terms of preparing digital curators what level of expertise do we expect them to have in different content areas?
Jane Greenberg: It's a very interesting question. They have a discipline post-doc working on their project. She recalls a time as a cataloger when colleagues had graduate degrees.
Audience member: when it's a general description of blob object then it's not as important to have the discipline knowledge, but it's more important to know what's actually in the data.
Jane Greenberg - important to teach curators to know when they need a domain expert so they don't get themselves into situations where they aren't qualified.
Labels: DigCCurr 2007
DigCCurr 2007: What do digital curators need to know?
Each concurrent session time has a different theme. The theme of session one is "What do digital curators do and what do they need to know?" I went to the "Research Perspectives" session. I was a bit disappointed because I was expecting to hear about hot research questions but the discussion was interesting nonetheless.
Speakers: Hans Hoffmann, Phil Eppard, David Giaretta.
The speakers described their associated projects. Hoffmann is involved with the Planets project, Eppard described the work of InterPARES 1 and 2. Giaretta, manager of the CASPAR project, spoke more of what digital curators need to know.
Most of the detail that Hoffmann and Eppard discussed I'm familiar with from reading about the projects over the years.
Giaretta was incredibly amusing. What follows are my "raw dump" notes. I'll have to summarize and comment at some point, but fwiw, here are my notes.
Concurrent Session - Funder's perspective
Hans Hoffmann described the PLANETS project, a European initiative.
Components: planning services, characterization services.
Interdependencies between all of the components.
Preservation planning. Come up with a process to identify what should be done with the digital objects for which you are responsible. Criteria for preservation based upon organizational policies, collection profile, provenance of digital objects (authenticity).
What is the best available preservation action given the criteria? Develop a plan. Ideal is to make it an automated process. Requirements should be proactive rather than reactive.
Preservation policy, content profile, usage profile, and actions inform the plan.
Plan will be executed on the content of your repository.
Characterization of objects can take two approaches. Intellectual approach, building objectives trees based upon utility analysis and extraction of intrinsic file (format) information.
They are trying to develop a description language to match the two approaches.
TNA PRONOM file-format identification used to define a characteristic language, define an extraction language, define a pluggable interpreter.
Preservation actions: two approaches: transform content/objects and transform environments (migration, emulation). Content objects: wrap third party transformation tools, ...preserve relational databases.
Testbed environment will help them determine what works. Developing a corpus of objects. Performing experiments on it.
The testbed consists of: data storage, hardware, PLANETS software, testbed software...
Interoperability framework.
What do digital curators need to know? Preservation planning, how to identify what criteria should inform decisions, how to apply those criteria to digital objects, how to test and evaluate available preservation strategies with respect to a given type of object. How to do it in an effective and efficient way.
Training programs based on Planets results. Coming up with a modular approach to bring together course materials already in existence and building on the work of ERPAnet?
http://www.planets-project.eu
Questions: More about the criteria for judging which approach is best? You need to know what your collection is about, what are the characteristics of the digital objects? When you use migration, for instance, will it be the best solution? Emulation? The authenticity requirements for the document or record are needed and they are based on the business requirements of the collection and the context of creation.
Could you envision a situation where the requirements would come from the context of use? Hoffmann: Yes, that's why we're doing user studies. How are they using the digital objects? What will it tell us about the need to preserve digital objects?
If you have multiple users going after the objects in different ways would there be different criteria for different needs? Hans Hoffmann -- you have to deal with the object how you receive it. The object and how you use it are two different things. Add services to the repository based on the use. You try and evaluate and revise.
Phil Eppard on InterPARES 2 Project.
Many InterPARES researchers here in the room. Investigating the complex issues in the preservation of digital materials. InterPARES has a very long history. PE provided an overview of history and scope and some of the InterPARES products.
Started at UBC with authenticity of records project 1994-97, concerned with creation and maintenance of records in their active phase. Product of that research was DoD electronic records standard.
1999-2001 InterPARES 1. 13 countries, 4 continents, 60 researchers. Included practitioners and experts in c.s., law, and policy studies. Focus was on records as defined by archival science. Theoretical principles based on archival theory and diplomatics (the study of creating and identifying authentic records).
Used case studies. Through the case studies used a template for analysis developed via diplomatics. Key product was two sets of activity models for the selection and preservation functions and a framework for assessing and maintaining authenticity. Benchmark requirements supporting the presumption of authenticity and baseline requirements supporting the production of authentic copies of electronic records.
Not preserving the records themselves so much but the ability to reproduce the records in an authentic form.
Benchmark requirements: maintain expression of record attributes relating to identity and integrity, control access privileges, protective procedures to prevent loss or corruption of records, procedures to prevent media degradation, procedures for maintaining documentation.
A preserver looking to take over a set of electronic records would test them against the benchmark and this may influence an appraisal decision.
Baseline requirements: maintain controls over records transfer, maintenance and reproduction, retain documentation of reproduction process and its effects, capture ...?
2002-2006 InterPARES 2
Expanded interdisciplinary team adding researchers from various sectors of the arts and sciences to the team of archivists, preservationists, etc.
Focused on newer types of electronic records: dynamic, interactive, experiential.
Develop understanding of their creation, maintenance, and preservation.
Research domains: records creation & maintenance, authenticity, accuracy and reliability & methods of appraisal and preservation.
Focus areas: arts activities, scientific research activities, and e-government.
Cross-domain research groups: description (metadata), modeling, policy, and terminology.
Created a dictionary of terminology available to the public as a database.
Key products: manage chain of preservation model (preserver centered), business driven record keeping model (records creators, business centered), principles for records creators and preservers (for policy development rather than principles of preservation), guidelines for digital records preservation (operationalizing process for practitioners), guidelines for individuals, Metadata and Archival Description Registry and Analysis System (MADRAS), terminology database.
MADRAS is a key product. It is a web-based tool for developing, registering, and evaluating metadata schemas and archival description standards. It allows people to compare schemas as to how well they meet international standards and guidelines (such as the benchmark requirements).
InterPARES and Digital Curation: training new researchers and educators, case study methodology and examples, integrating preservation with other processes, metadata schema and analysis, policy recommendations.
Question: will you offer counseling to universities who want to use your methodologies?
PE: InterPARES 3 is an effort to work directly with selected repositories to test and implement some of the products of previous InterPARES work.
David Giaretta, CASPAR Project manager
CASPAR = Cultural, Artistic, and Scientific knowledge for Preservation, Access, and Retrieval.
What digital curators do: struggle with funders (reluctant to provide long-term commitment; cost control, cost estimates), information providers (unwilling to provide what is needed, ways to capture required info), and users (increasingly demanding).
CASPAR is a large consortium: http://www.casparpreserves.eu
What do digital curators need to know? They do preservation and publication/access but do not confuse them.
Needs of access: responsive, sophisticated search techniques, users often familiar with the material.
Needs of preservation: ensure the information trapped in the bits is authentic and understandable -- to the designated community (this also implies making it fit for the purpose, adding the info).
Disincentives for preservation: Cost, Time.
Can sell preservation as benefiting access. Cyber-infrastructure allows users to find and try to use data from many sources. Some of these will be familiar but most will be unfamiliar. How can one be sure that the unfamiliar data is used correctly?
Need understanding: garbage in, garbage out.
Digital preservation is terribly easy to do.... as long as you can provide money forever. Easy to test claims about tools...as long as you live a long time.
Know what is being preserved: the great data/document divide. Need to preserve information & knowledge -- not just "the bits." Documents, videos are rendered -- simple? Data must be processed in new ways ... this is harder.
Information is the important thing. What information? documents, data. Original bits? Look and feel? Behavior? Performance? Explicit/Implicit/Tacit.
Things change/disappear -- how can we ensure that the information trapped in the "bits" remains understandable despite all these changes? Example of Google changing a style sheet and messing up the RSS. The network links to related information may be important.
Time is short. Neither you nor your institution will last forever. The chain of preservation is only as strong as its weakest link. Need to be prepared to hand over responsibility for the preservation.
No repository is an island. Your organization cannot do everything. Must tap into other resources -- how can we find and evaluate those resources?
We cannot foretell the future. Need to manage knowledge to keep archives alive through time. Preservation is a process, not a one-time event. Preservation is expensive.
OAIS. Know more than the functional model diagram. The information model is key. With data especially you need to know the semantics (context).
Authenticity - evidence, evidence, evidence.
Support infrastructure: registries of representation information, representation information gap manager, orchestration manager, toolkits (representation information; preservation description information).
CASPAR aims to produce tools and techniques to support digital preservation and make it easier to share the cost. Must be relatively easy to use, must have a low "buy-in" in terms of effort required for adoption, must avoid requiring wholesale change of everyone else's systems, must be decentralized and reproducible so that it can live.
How can you tell who is selling preservation snake oil?
How to decide? Validation: demonstrate theoretical basis. Accelerated lifetime tests (changes in hardware, environment, and changes in designated community). Demonstrate increased trustworthiness, measured using Certification process as/when available.
http://wiki.digitalrepositoryauditandcertification.org (NARA work toward developing an ISO standard)
Question:
One problem with OAIS is defining the designated community. What do you do when your archive, under law, has to serve everybody? Answer: State assumptions of what your community should already know in order to use it.
Anne Gilliland asked what type of skills they expect people taking these positions to have. Eppard: Management skills and people skills. Gilliland: it goes back to developing the curriculum.
Marchionini (from his notes) - people need to know about the different models of preservation, need to know about their communities and how to monitor changes within them, know about appraisal as a continuous process rather than a discrete event, know about the decision-making process, and they need to know about fund raising.
Hoffmann - It's context related. I work in archives. Libraries may require a different understanding. Tools are applied differently in different contexts.
Labels: DigCCurr 2007
DigCCurr 2007: Opening plenary
Helen Tibbo informed us of the proper pronunciation of the conference name. It's DIGH-seek-er. That's a relief. Now I won't embarrass myself because I had no idea how the cool kids were saying it. Attendance here has exceeded all expectations. There are 280 attendees and the organizers were only expecting 100 or so initially. The plenary room is packed and we have an overflow room. There is a conference wiki where the conference can be live blogged and chatted.
www.ibiblio.org/jewel/digccurr2007/pmwiki/
Labels: DigCCurr 2007
I need a new motherboard for my laptop. I use a Dell Latitude D610 and it's...a Dell...
I think universities get good contracts for Dells. I've used one for the past ten years and I've never been particularly impressed by the performance.
I've had this baby for 2 or 3 years now. It has never docked properly into its desktop station. The screen configuration settings are supposed to switch between the desktop station monitor and the screen on the laptop. Never.worked.
We tried many things and even called Dell, but for the past two years I've been manually switching the configuration settings every time I docked. Annoying, but live-able. Now the laptop won't dock at all. It won't recognize the power supply when it's input through the docking station.
Dell says it's cause I need a new motherboard.
Bad timing.
I leave for N.C. and DigCCurr tomorrow and I was intending to blog the sessions I attend. I have to leave the laptop at work so the Dell guy can arrive "either today or tomorrow" (shall I hold my breath?). I will probably take M's PowerBook so I won't be computer-less.
I am decidedly NOT a Mac person. I know, I know. All the cool kids love their Macs. I'm one of those weird people who have a hard time switching back and forth. I have problems telling my left from my right so it does mess me up to have the windows buttons on the opposite side.
I will still be blogging, it just may be slower than I intended.
Labels: DigCCurr 2007
I'm thrilled to announce that I have accepted the Metadata Services Manager position at the California Institute of Technology.
There are many reasons to be excited. First, Caltech Library is innovative. Second, I will be reporting to Eric Van de Velde. I greatly respect Eric and feel we'll be able to successfully tackle the challenges facing library systems and technical services.
Third, Caltech was one of the first libraries in the country to create repositories and they have a very successful and active repository program. I will not be directly involved with the repositories, at least initially. My job will be to reposition the cataloging department into a Metadata Services department. It's an open question as to how a Metadata Services department evolves and develops to best fulfill the Library's mission. Metadata Services are integral to repositories so I can foresee some involvement in the future.
Labels: MPOW
As I am reading the new draft ch.3 of RDA, all I can think is, "how the $*%& am I going to train people how to use this thing?"
And I love metadata. Picture how it would read to somebody who doesn't thrill to the notion of cataloging. Picture how reading it will feel to a new hire in a formerly-known-as-cataloging department or a library-school student. I have to confess, I skimmed the AACR2 when I was in Gloria Leckie's kick-ass cataloging class back in my lib-school days. It really is meant to be a reference book digested in wee pieces. I'm not suggesting that newbies read it wholesale like the current reviewers are doing with RDA. Yet one needs a mental model of what the whole "book" is about in order to understand how to use it. At least for me. I'm a visual thinker.
If ever a text required a visualization, AACR2 AACR3 RDA does. It's difficult for me to digest the many and varied connections between RDA and other standards. I'm constantly flipping back and forth between FRBR, ISBD, FRAR, FRAD, etc. I'm glad I can print them out at work and I don't have to pay for the printer ink on my own dime. And don't even get me started about carrying them to ALA for the CC:DA meetings.
I do have thoughts on what I've read of the rev. ch. 3 so far. Oh yes indeedy do. I need to clean them up and clarify a few things for myself before I comment publicly. Mostly I want to get caught up with NGC4LIB and RDA-L and make sure I add value to the discourse.
Labels: metadata, RDA
An early release of Sophie is available. Shout outs to Karen at Free Range Librarian for bringing it to my awareness. Now I have to take action on my book rant. I shall post an invitation for IR managers to play with Sophie along with me once I've got it installed and running and networked somewhere.
I haven't been reading feeds for the past few days (life trumps blogging..LTB). I've probably got a dozen announcements in my aggregator but hey, I read Karen first. If one has to prioritize feeds, Karen's is the creme de la creme.
I will attempt installation on both Windows and Linux by this time next week (If I blog it, may I hold myself to it. Beats reading RDA...)
I still haven't played with Archivists' Toolkit either. I've got many good reasons for this lack of free time for library tech playing but I can't yet divulge them. Rest assured it's all interesting and good.
Labels: book, sophie
b.o.o.k. & RDA
I've ranted about the notion of a book on Institutional Repositories. Since writing that rant, I've had some publishers contact me to elucidate the advantages of using a professional publisher, namely: a close read and suggestions for revisions, publicity, and experience with distribution.
Of course I'm not dissing publishers and editors. I recognize the value they bring to the publication process. The point of the rant is my opinion that literate people will need to radically reconceptualize our collective notion of the book in order to make full use of books of the future. For librarians, this should go hand in hand with our use of FRBR and RDA.
I've been procrastinating about reading the recent release of its draft chapter number three.
Even though I'm a trained cataloger, I still struggle with catalogerese. And it's not the most scintillating of reads after a long day's work. I'm purposefully avoiding the RDA discussion list and the NextGenCatalog space, just so that I can form my own opinions while I read it.
I'm also beginning task force work for CC:DA on internal and external communication. It should be interesting in this time of flux to take another look at that.
For what it's worth, I'm firmly in the Coyle/Hillmann/Weiss train of thought when it comes to all things RDA. They state the issues far more eloquently than I could. Once I've got the draft chapter under my belt, I'll write out my thoughts.
Labels: book, metadata, RDA
I extend a hearty congratulations to my former colleagues at UCSB's Map and Imaging Library on being named in the top 10 Models of Technology Innovation according to a survey done by the ACRLog bloggers.
It was an honor and a privilege working with you on the ADEPT educational adaptation of the ADL. Larry Carver, visionary; Mary Larsgaard, map cataloging guru; Dave Valentine, programmer; Linda Hill, geo-spatial indexing specialist; Greg Janee, programmer; and of course Terry Smith, Jim Frew, Chris Borgman, and all of the research PIs. I'm sure I'm forgetting others. All of the staff of ADL are very deserving of this recognition.
If you haven't had a chance to play with the Alexandria Digital Library project, I highly encourage you to take a look. It's beyond super keen-o.
http://diva.sfsu.edu/help/about
I suppose it's natural that I like something named diva. Deja vu or something.
From the about page:
DIVA is the Digital Information Virtual Archive, a web-based file management solution for use by faculty and researchers in higher education. DIVA provides always-on storage, organization, recovery, version-tracking, and sharing/dissemination capabilities for campuses across the CSU.
Way cool. It really warms the cockles of my heart because I used to work directly on stuff like this.
Labels: DIVA, MPOW