The following proposed enhancements follow the numbering scheme outlined in "Merritt, EZID, and the Web Archiving Service Enhancements and Development Activities" (Rev. 0.1 – 2012-02-02).
3.1. Exposing Content
The following items focus on enhancing the discovery of relevant curated data by consumers. Several of the development activities defined here and in other sections focus on expanding accessibility to content in Merritt, moving it from a “dark” to a “brighter” archive.
Exposure of content for search engine indexing
Currently, only supplied Dublin Kernel (DK) metadata fields are visible through the Merritt UI. (Arbitrary metadata may be supplied as a component of a submitted data set. It is stored and is retrievable, although not currently visible.) Merritt should deploy additional schema parsers to recognize and surface metadata elements found as components of submitted content. Merritt users will be able to submit metadata in various formats, e.g. MODS. Merritt will be able to process these metadata and allow for search and display of metadata in their native formats.
Exposure of content for search engine indexing
Research in search engine optimization (SEO) strongly suggests that an ever-increasing number of researchers find primary data through the major internet search engines. (Google, Yahoo!, and Bing collectively account for over 95% of search activity.) Merritt and EZID should expose designated data and metadata for harvesting by these search engines. This will require the generation and registration of appropriate sitemaps. Curators will be able to indicate whether they wish their metadata to be crawled by search engines.
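As an illustration only, the sitemap generation described above could look like the following minimal sketch. The record fields (`url`, `lastmod`, `crawlable`) and the function name `build_sitemap` are hypothetical and do not reflect Merritt's or EZID's actual data model. The XML namespace is the one defined by the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML for the records a curator has marked crawlable.

    `records` is an iterable of dicts with hypothetical keys:
    'url' (landing page), 'lastmod' (optional date string),
    'crawlable' (the curator's opt-in flag).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for rec in records:
        # Respect the curator's choice: skip anything not opted in.
        if not rec.get("crawlable", False):
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = rec["url"]
        if rec.get("lastmod"):
            ET.SubElement(url, "lastmod").text = rec["lastmod"]
    return ET.tostring(urlset, encoding="unicode")
```

Defaulting `crawlable` to false means an object is exposed to search engines only when its curator has explicitly opted in.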
Mediated communication between consumers and providers
The simplest way to address this need in the short term is to support the visibility of provider email addresses as part of dataset metadata to facilitate out-of-band communication. Supporting a more robust threaded discussion capability would require a much more substantial period of investigation and development.
Provide a way for providers to brand deposited content at the collection level, presumably in the form of supplied header/footer, descriptive text, logo, links to further information, etc. Organizations will be able to add their logo to a collection.
DataONE Member Node
We are working with DataONE to become a DataONE member node. Merritt clients will have the option to include metadata about datasets meeting the DataONE collection criteria in the DataONE union catalog.
Support for multiple metadata schemas
Currently, the Merritt submission interface only provides for the input of Dublin Kernel metadata. All of the additional metadata elements/schemas identified in the "Exposure of content for search engine indexing" enhancement should be supported through the submission interface, i.e., have provision for supplying metadata at the point of deposit. This effort may be informed by activities of the NSF DataONE Preservation and Metadata working group. Merritt submitters will have a wider range of options for including descriptive metadata in their native formats, such as MODS, without the need to derive Dublin Kernel metadata.
Simplified submission workflows for single objects
Anonymous public access to designated collections and objects
This has previously been identified as the top priority for Merritt development. The designation of collections and/or objects to be exposed publicly is performed by providers, based on local policy decisions. Merritt curators will be able to designate their collections publicly accessible, and users will have direct access to materials stored in Merritt.
Click-through Data Use Agreements
Distinct access rules for metadata and content
Currently, the granularity of access control is at the object level; if designated for read access, all components of the object are accessible. There are use cases in which it is desirable for a meaningful distinction to be made between object metadata, which generally should be open for the widest access, and data, which may be subject to more restrictions. Merritt should support expression of access control rules at a finer granularity supportive of a metadata/data distinction, and these need to carry through to EZID as appropriate so that any indexing of EZID metadata respects those designations. Curators will have the option of allowing the metadata for their objects in Merritt to be accessible by the public, while restricting access to the objects’ associated files. What this means from EZID’s perspective is that researchers can control how their data/resources are indexed and exposed.
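One way to picture the metadata/data distinction is the sketch below. It is a simplified model under stated assumptions, not Merritt's access control design: the `AccessPolicy` class, its field names, and the two helper functions are all hypothetical, and an "authorized" user is assumed to have full read access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical per-object policy with separate flags for metadata and content."""
    metadata_public: bool = True
    content_public: bool = False


def can_read(policy, component, authorized):
    """Decide whether a request for 'metadata' or 'content' is allowed.

    `authorized` stands in for whatever check grants a logged-in user
    read access to the object (an assumption for this sketch).
    """
    if component not in ("metadata", "content"):
        raise ValueError("component must be 'metadata' or 'content'")
    if authorized:
        return True
    if component == "metadata":
        return policy.metadata_public
    return policy.content_public


def is_indexable(policy):
    """Metadata may be passed on for indexing only if it is public."""
    return policy.metadata_public
```

With the defaults shown, an anonymous user can read an object's metadata but not its files, which matches the curator option described above.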
Limited time embargoes
As an extension to controlling access to content in Merritt at a finer granularity, users will also be able to add a time-based condition governing when materials can be more broadly exposed. For example, a user can submit content, such as a dataset, to Merritt in an intermediate or working state that is not ready for use by others, and specify the date on which it can be made available.
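The embargo check itself can be very small. The sketch below assumes a hypothetical `embargo_until` date stored with each object, where `None` means no embargo. It illustrates the idea and is not a description of how Merritt will implement it.

```python
from datetime import date


def is_released(embargo_until, today):
    """Return True once the embargo date has been reached.

    `embargo_until` is a datetime.date, or None if no embargo was set.
    """
    return embargo_until is None or today >= embargo_until
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test.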
Self-service user account registration
Merritt account registration is currently an off-line operation that requires communication with the Merritt service manager. Merritt should support a self-service model for new account registration.
Merritt storage and legacy infrastructure
SDSC Cloud storage
We want to take advantage of the cloud storage offered by the San Diego Supercomputer Center (SDSC), allowing for further cost savings and extending the replication of content stored in Merritt. This requires some changes to the architecture of our storage micro-service.
Migrate DPR collections to Merritt
We want to migrate the content from the Digital Preservation Repository (DPR) to Merritt. We know that some clients do not wish us to migrate their content for them, but would rather submit it themselves to Merritt. We will contact our DPR clients to determine the most appropriate actions for migrating their content.
Integrate Merritt with DataONE repositories
Merritt will be deployed as a DataONE member node, which will allow users who deposit earth science data in Merritt to expose them in DataONE.
EZID has a new client relationship called the “EZID Service Provider” (ESP). This is an EZID Institution that offers EZID as a paid service to others.
4.4.1 Branding and user administration
To support this type of “super” client, EZID needs to provide a lightly branded UI and also some user administration capabilities. Support for this service level benefits the Libraries, because it is part of EZID’s business plan for cost recovery, necessary for the subsidy of UC Libraries.
4.4.2 Automate assignment of NAAN and DOI prefixes
In addition to the administration of users, the ESP should be able to assign NAANs and DOIs for the ESP’s clients. Automating these tasks will also make UC3’s work more streamlined so that our client service stays responsive even as we grow.
5. Enhance identifier persistence
The following items are focused on identifier persistence by increasing EZID’s resilience and extending its service offerings.
5.1 Link Checking
In 2011, EZID introduced the Tombstone Page feature. Tombstone pages are web pages that appear automatically in place of a broken link. By default, EZID will provide “last known” metadata, including the original owner, but it will also display a configurable “reason code” selected by the owner. A second tool, automatic link checking, will make Tombstone Pages a very powerful means of promoting good resource stewardship.
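A link checker of this kind might, in outline, request each identifier's target URL and report the ones that fail. The sketch below is an assumption about how that could work, not EZID's design: the function names, the `targets` mapping, and the rule that any status of 400 or above counts as broken are all hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def find_broken(targets, fetch_status=http_status):
    """Map each identifier whose target looks broken to its status.

    `targets` maps identifier strings to target URLs. `fetch_status`
    can be swapped out so the logic is testable without network access.
    """
    broken = {}
    for identifier, url in targets.items():
        status = fetch_status(url)
        if status is None or status >= 400:
            broken[identifier] = status
    return broken
```

The identifiers returned by `find_broken` would be the candidates for owner notification or for display of a tombstone page.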
5.2 Protect EZID Infrastructure from malicious attack
Although EZID has implemented numerous levels of security, the N2T infrastructure upon which it depends has a particular security vulnerability. Without exposing it further, suffice it to say that this must be closed off.
5.3 Seek replication partners
Of the identifier types that EZID supports, DOIs come with what might be thought of as “built-in” redundancy for our clients: the identifiers and their accompanying metadata are stored here in Oakland, California, and also in Hannover, Germany, where the DataCite Managing Agent is headquartered. Additionally, DOI resolver services are part of the international Handle system, so they are not dependent upon any one institution’s uptime.
For ARKs and the other identifiers that EZID supports, however, we are currently in talks to establish replication sites at 2 other locations, one in southern California and one in the United Kingdom. We expect these talks to conclude successfully in the near term. Our goal is to establish 4 to 6 replicas of the N2T resolver (the ARK resolver) and 3 replicas of the EZID management interface. This will give Libraries added confidence in choosing a wide range of identifiers for their data management needs.
6. Community Building
EZID exists within a context. The following items focus on the communities of users.
6.1 ARK Community Building
The Archival Resource Key (ARK) specification and code have been publicly available long enough that major institutions on more than one continent have adopted and installed the solution. It has become clear that a community-based governance scheme is necessary. As a starting point, CDL will launch a listserv.
6.2 DataCite US
The 3 DataCite U.S. partners (including CDL) are developing a DataCite US Alliance. The Alliance will be a community of stakeholders who share common concerns and interests and can pursue common directions as appropriate. DataCite US will offer a community of shared concerns and practice, sponsor US-based events, and open up an opportunity to work collaboratively at a national level to build robust, innovative, and cost-effective solutions to counteract inevitable disruptive change.
7. Policy Development
7.1 DataCite Policy changes to Service Guidelines
The DataCite Consortium is in the process of finalizing a number of policies that we will be incorporating into the EZID Service Agreement. These include:
- Requirement of submission of metadata with DOI request;
- Acceptance of Creative Commons license for metadata;
- Requirement of a landing page if the dataset (or other resource) itself is not publicly accessible.