Building linked open data web applications from the outside in—lessons learned from building the Getty’s Research Collections Viewer
Adam Brin, J. Paul Getty Trust, USA; Gregg Garcia, J. Paul Getty Trust, USA; Ben O'Steen, J. Paul Getty Trust, USA; Charles Webb, J. Paul Getty Trust, USA; Pam Lam, J. Paul Getty Trust, USA; Selina Chang-Yi Zawacki, J. Paul Getty Trust, USA
Abstract: Getty recently launched the Research Collections Viewer (RCV), a new interface for exploring the Getty’s archival collections, built on linked open data and the Linked.art data model. We outline some of the challenges and the potential of linked open data for this application, as well as the architecture it required. We selected a microservice architecture for RCV and built the core application APIs using the Linked.art model. This choice shaped the design of our backend infrastructure, which extracts and transforms data from ArchivesSpace, Rosetta, and Arches. The frontend was similarly shaped by both the architecture and the state of modern web development. RCV highlights some of the issues and possibilities of building applications on linked open data.
Keywords: Linked Open Data, Linked.art, IIIF, Archives
The J. Paul Getty Trust’s Research Collections Viewer (RCV, https://www.getty.edu/research/collections/) is a unified interface for the Trust and Research Institute’s archival collections. RCV provides a new way to navigate collection finding aids and digitized materials by presenting users with a highly interactive and intuitive experience. It does so by exposing the intrinsically hierarchical nature of the data and allowing easy access to related information including digitized images. The data that drives RCV originates from three separate backend systems which publish the underlying data:
- ArchivesSpace – which contains the metadata about the archives including its structure and descriptive information.
- Rosetta – which contains the technical metadata about the digital files stored in the archive.
- Arches – which contains additional metadata about the items in Rosetta.
Getty Digital is committed to providing open access to its information using community standards. These efforts include embracing standards such as Linked.art and IIIF, as well as publishing the Getty Vocabs and (more recently) the museum’s collection data as linked open data for others to use. RCV is the latest demonstration of these efforts. This paper discusses three main aspects of RCV that make the application and the infrastructure around it unique. First, we discuss the architecture we built to synthesize and transform data from ArchivesSpace, Rosetta, and Arches so that it conforms to the Linked.art data model. Second, we address challenges around building a modern web application on linked open data. Lastly, we outline the potential for future work utilizing this infrastructure.
The concept of linked open data that we’ve built upon has existed since the mid-2000s, with the goal that data should be easily accessible, understandable, shareable and discoverable (Berners-Lee, 2006). Doing so requires work, through mapping and modeling, but produces significant benefit. In this model, the relationships in the data can be as specific as needed since those relationships are distinctly mapped. Furthermore, the precision of the data and the nature of Linked Data in the form of RDF or JSON-LD mean that these relationships may not only exist in the JSON or XML documents we traditionally use, but can be decomposed into smaller pieces without losing meaning (Hausenblas, 2015). One of the most powerful ways this can be accomplished is using a Graph Store like GraphDB or Amazon’s Neptune, which allow the decomposition, joining, and querying of the data.
Getty is not alone in publishing our data as linked open data. A number of museums have begun publishing their records including the O’Keeffe museum (Neely, 2019) and the Smithsonian American Art Museum (Smithsonian, n.d.). Outside the museum community, research institutions such as Nomisma.org (Gruber, 2020), the Library of Congress, and OCLC have done so as well. Additionally, publishing platforms such as Arches (Arches, 2020) have built linked open data capabilities into the software—something that should ease greater community adoption. Like many technologies built on networks, the larger the network, the greater the potential for what can be accomplished. The more institutions publishing linked open data, the more interesting questions that can be asked.
The Promise of Linked Open Data
For an institution like the Getty, the promise of linked open data can be viewed as a set of concentric goals. Using the Getty and the artist Edward Ruscha as an example, the centermost ring is the interface that can display everything RCV knows about Edward Ruscha from the Getty Research Institute (GRI), which includes works he created and his relationships with people and other collections held in the GRI. When such a dataset from one system is in linked open data, there may be little difference compared to a more traditional system like a relational database or set of XML records.
But if we move a step outward, we could begin to integrate other systems at the Getty that are separate from the GRI. For example, the Museum (a separate entity) has its own biographical statement for him, and there is additional information in the Getty’s Union List of Artist Names (ULAN). At the outermost ring, non-Getty sources might also be included, such as the Library of Congress or the Smithsonian American Art Museum, thus enabling a user to seamlessly navigate data dynamically gathered from multiple institutions in one application. While each of these goals is achievable without linked open data, being able to query and aggregate across systems using standard approaches and protocols such as CIDOC-CRM, Linked.art, and SPARQL makes them easier to achieve. With RCV, we are squarely focused on that first step, while actively laying the groundwork for the future.
For linked open data to succeed, it must be easy to implement, use, and understand. In 2019, a community formed to address challenges associated with integrating linked open data into the cultural heritage and museum world. One resource this initiative has developed is Linked.art, a data model with documentation and examples for working with linked open data. Although the community is working on an official 1.0 release, the current version was mature enough for us to implement. The data model continues to evolve to meet the needs of the museum and broader cultural heritage community. Documentation and data models are both critical to the adoption of standards, but practical implementations are also useful in showcasing to both users and other institutions how these standards impact the accessibility of information. Examples like the O’Keeffe museum or Nomisma demonstrate the use of complete Linked.art records and how they might be published as part of existing systems, as well as some of the challenges involved in implementation (Linked Art, n.d.).
By building Linked Data into the core of RCV, we are taking a different approach. Instead of transforming an existing data model into Linked.art for download or re-use, we use Linked.art as the data model for the entire system. Publishing the same linked open data APIs that RCV itself relies on is intended to build confidence in their use. It also means that the API is continually maintained and tested: if the API breaks, RCV breaks, and any issues with the API will likely be found in the regular development and testing of RCV.
One of our primary goals when designing RCV was for the application to use the Linked.art data model (and thus linked open data) throughout the application. The plan was for the web browser to request, process and use the same objects, people, groups and other entity types we publish via our JSON-LD APIs. Keeping this goal in mind helped ensure that our own APIs were not only functional but might also be usable and useful to others outside of the Getty. This constraint also affected both the creation of the underlying backend infrastructure and the frontend application.
Initially, our backend simply included our three native systems—ArchivesSpace, Rosetta, and Arches. Since the final goal was serving up Linked.art flavored JSON-LD, there needed to be new infrastructure to connect everything. This meant adding tasks like extraction, transformation, and loading (ETL) from the native systems into the new infrastructure which would serve the JSON-LD. It also meant reconciling the data in the three systems to create a unified data model.
We would need a system to create and maintain links between the three existing systems and the new Linked.art records to ensure proper change management. To this end we added three pieces of software to our infrastructure:
- A Linked Open Data Gateway (LOD Gateway) that is responsible for serving JSON-LD, providing a SPARQL endpoint for querying the data, and publishing changes via W3C standard ActivityStream (Snel, 2017). The goal of this software is to provide a cache of the JSON-LD data we share, along with an ActivityStream that allows consumers to know what data has changed and when the change occurred. In addition to storing Linked Data, it also provides a proxy for graph-based SPARQL queries of the data.
- An ID Manager that’s responsible for linking between identifiers among the systems using the W3C standard for Web Annotations (Sanderson, 2017). The goal of this software is to maintain an independent source for identifiers and links between records.
- A Celery based task management system for performing and managing the ETL tasks.
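As a rough illustration of how a consumer might use the ActivityStream described above, the following Python sketch reduces a list of activities to the most recent change per record. The activity layout (`type`, `endTime`, `object.id`) follows common Activity Streams 2.0 usage, but the sample URLs and the function name `latest_changes` are assumptions made for this example and do not describe the LOD Gateway’s actual output.

```python
from datetime import datetime


def latest_changes(activities, since):
    """Return {record_id: activity_type} for activities that ended after `since`.

    Activities are applied oldest first, so a record that was updated and
    later deleted is reported once, with its most recent activity type.
    """
    cutoff = datetime.fromisoformat(since)
    ordered = sorted(activities, key=lambda a: datetime.fromisoformat(a["endTime"]))
    changes = {}
    for activity in ordered:
        if datetime.fromisoformat(activity["endTime"]) > cutoff:
            changes[activity["object"]["id"]] = activity["type"]
    return changes


# Hypothetical activities, shaped like the items of an ActivityStream page.
SAMPLE_ACTIVITIES = [
    {"type": "Create", "endTime": "2021-01-10T08:00:00+00:00",
     "object": {"id": "https://example.org/object/1"}},
    {"type": "Update", "endTime": "2021-01-11T09:30:00+00:00",
     "object": {"id": "https://example.org/object/1"}},
    {"type": "Delete", "endTime": "2021-01-12T10:00:00+00:00",
     "object": {"id": "https://example.org/object/2"}},
]

changes = latest_changes(SAMPLE_ACTIVITIES, "2021-01-10T12:00:00+00:00")
```

A consumer that remembers the time of its last run can pass it as `since` and then act only on the records that appear in the result.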
The LOD Gateway
The LOD Gateway was the first piece of software that we created. Its primary goal was to serve the Linked.art JSON-LD to the frontend. The relatively simple job of the LOD Gateway, however, hides the complexity of transforming the data from our source systems. The data available in each system often required reconciliation with other data in that system or another system to produce a single record for a work of art, archival folder, or collection. Although each of these systems maintains APIs, these APIs were not as performant or robust as was required to run this process regularly. The LOD Gateway also allows these native systems to undergo maintenance without affecting the production environment. It equally isolates them from the impact of data transformation, indexing, or other uses that might cause significant load.
In order to better manage the complexity of the ETL tasks, we developed intermediary LOD Gateways. With this additional infrastructure, each step is targeted to a specific task. First, data is extracted from the native system as METS, JSON or JSON-LD. The data is then transformed as little as possible into JSON, thereby maintaining a 1:1 relationship between the JSON records and the native system. Then, the data is loaded into a LOD Gateway for that system with an identifier that relates to the native system’s identifier for that record. Each LOD Gateway at this level is referred to as a Level 1 Gateway (L1), and all the L1 Gateways share the above conventions. Having all of our data in JSON simplifies the transformation and merging process: it becomes easier to reconcile information between Arches and ArchivesSpace or identifiers between Rosetta and the other systems.
The LOD Gateway that holds the transformed JSON-LD versions is referred to as the Level 2 Gateway (L2), to distinguish it from the L1s. The L1 Gateways are intended to reflect the data in the native systems, and the L2 holds a transformed and remodeled version of data. Records are taken from the L1 Gateways, reconciled or combined with other records and stored in the L2 Gateway. The new records are JSON-LD formatted Linked.art records representing objects, archival components, people, groups, places or other document types.
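The two levels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Getty’s transform code: the field names `title` and `description`, the identifier pattern, and both function names are hypothetical, and the resulting document is a heavily simplified Linked.art-style record.

```python
LINKED_ART_CONTEXT = "https://linked.art/ns/v1/linked-art.json"


def to_l1_record(system, native_id, payload):
    """Wrap a native record for an L1 Gateway with as little change as possible.

    The L1 identifier is derived from the native identifier, which keeps a
    1:1 relationship between the two.
    """
    return {"id": f"{system}/{native_id}", "source": payload}


def build_l2_record(l2_id, archival_l1, enrichment_l1=None):
    """Combine L1 records into one simplified Linked.art-style JSON-LD document."""
    title = archival_l1["source"]["title"]
    record = {
        "@context": LINKED_ART_CONTEXT,
        "id": l2_id,
        "type": "HumanMadeObject",
        "_label": title,
        "identified_by": [{"type": "Name", "content": title}],
    }
    # Merge in a description from a second system when one is available.
    description = (enrichment_l1 or {}).get("source", {}).get("description")
    if description:
        record["referred_to_by"] = [
            {"type": "LinguisticObject", "content": description}
        ]
    return record


# Hypothetical records from two native systems.
archival = to_l1_record("archivesspace", "42", {"title": "Box 1"})
enrichment = to_l1_record("arches", "abc", {"description": "Photographs"})
l2_record = build_l2_record("https://example.org/object/1", archival, enrichment)
```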
Although Linked.art provides general guidelines about how the data is modeled, one of the particular challenges was making decisions about the boundaries between records—that is, when to maintain information in separate JSON documents and how those records relate to each other. Keeping unique data separate makes it easier to maintain the documents but may make it harder to work with them later as you may need to request additional documents.
Links between documents also become a challenge, as circular relationships (parent to child and back, for example) can be technically difficult to manage. To address this, we generally attempted to maintain relationships in one direction (i.e. child to parent). This decision did cause some challenges for the frontend implementation, which are addressed later in the discussion of the frontend.
The SPARQL Endpoint
Loading the data from the LOD Gateway into the graph store behind the SPARQL endpoint, in addition to storing it as documents in the LOD Gateways, means that we can use a standard query language to query and extract data from the JSON-LD. The process of loading data into the graph store decomposes the records into the underlying triples represented in the JSON-LD. The ability to query data at this level, as triples, means that queries can be narrowly focused or can traverse the entire graph. It also sidesteps the issue, introduced when modeling the data, of determining where boundaries should be placed between records.
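To make the idea of decomposition concrete, the Python sketch below flattens nested, JSON-LD-like documents into (subject, predicate, object) tuples and then answers a question that spans two documents. It is deliberately simplified: a real JSON-LD processor also resolves contexts, blank nodes, and typed literals, and a graph store would be queried with SPARQL. The record ids and the `to_triples` function are assumptions made for this example.

```python
def to_triples(doc):
    """Flatten a nested, JSON-LD-like dict into (subject, predicate, object) tuples.

    Embedded nodes that carry an "id" become links to that id and are
    decomposed in turn; every other value is kept as a literal.
    """
    triples = []
    subject = doc["id"]
    for key, value in doc.items():
        if key in ("id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict) and "id" in item:
                triples.append((subject, key, item["id"]))
                triples.extend(to_triples(item))
            else:
                triples.append((subject, key, item))
    return triples


# Two hypothetical documents that would be served separately.
box = {"id": "box/1", "type": "Set", "_label": "Box 1"}
item = {
    "id": "item/7",
    "type": "HumanMadeObject",
    "member_of": [{"id": "box/1", "type": "Set"}],
}

graph = set(to_triples(box) + to_triples(item))

# Once everything is a triple, document boundaries no longer matter:
members_of_box = sorted(s for s, p, o in graph if p == "member_of" and o == "box/1")
```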
The Identity Manager
An Identity Manager (IDM) service was constructed to store connections between identified resources. This was built using the Web Annotation Model, with each connection equivalent to an annotation. The IDM service associates identifiers for the described people, places, objects, and digital resources with typed relationships or ‘motivations’. For example, an annotation can connect a IIIF Manifest to the archival description of the object it shows, or connect a persistent URI to the current, impermanent system identifier for that object. The motivation encodes the type of connection. Thus far, the IDM has been used to link between systems at the Getty; however, it also facilitates linking between the Getty and external systems.
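A connection of this kind can be sketched as a Web Annotation whose body and target are the two identifiers being linked. In the Python example below, the choice of which identifier is the body and which is the target, the sample URLs, and the helper names are assumptions for illustration; `linking` and `identifying` are motivations defined by the Web Annotation vocabulary, but the motivations the IDM actually uses are not stated here.

```python
ANNOTATION_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def make_link(body, target, motivation):
    """Build a minimal Web Annotation connecting two identifiers."""
    return {
        "@context": ANNOTATION_CONTEXT,
        "type": "Annotation",
        "motivation": motivation,
        "body": body,
        "target": target,
    }


def targets_for(annotations, body, motivation):
    """Find everything linked from `body` with the given motivation."""
    return [
        a["target"]
        for a in annotations
        if a["body"] == body and a["motivation"] == motivation
    ]


# Hypothetical links: a persistent URI to its IIIF manifest and to the
# current native-system identifier.
annotations = [
    make_link("https://example.org/object/1",
              "https://example.org/iiif/1/manifest.json", "linking"),
    make_link("https://example.org/object/1",
              "archivesspace/42", "identifying"),
]

manifests = targets_for(annotations, "https://example.org/object/1", "linking")
```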
One of the challenges of working with the three source systems was that the data had not been combined in an automated manner before. This raised a number of issues around the permanence of system identifiers and ultimately changes in convention and local practices, such as the bulk export and editing of records. Persistent identifiers are crucial, so it was important to add additional logic to the ETL tasks to attempt to keep the same identifier for a JSON-LD representation if the subject of the record is the same, even if the native system identifier has changed.
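The identifier-retention logic might look something like the following Python sketch. It assumes that each record can be reduced to some stable `subject_key` that survives a change of native identifier; how such a key is derived is not described here, so both the key and the `IdentifierRegistry` class are hypothetical.

```python
import uuid


class IdentifierRegistry:
    """Keep one persistent identifier per subject, even if native ids change."""

    def __init__(self):
        self._persistent = {}  # subject key -> persistent identifier
        self._native = {}      # subject key -> current native identifier

    def resolve(self, subject_key, native_id):
        """Return the persistent id for a subject, minting one on first sight."""
        if subject_key not in self._persistent:
            self._persistent[subject_key] = str(uuid.uuid4())
        # Always record the latest native id so links can be followed back.
        self._native[subject_key] = native_id
        return self._persistent[subject_key]

    def current_native_id(self, subject_key):
        """Return the most recently seen native id for a subject."""
        return self._native[subject_key]


registry = IdentifierRegistry()
first = registry.resolve("series-1/box-3", "native-100")
# The record is later re-exported and receives a new native identifier.
second = registry.resolve("series-1/box-3", "native-245")
```

Because `first` and `second` are the same value, the JSON-LD representation keeps its identifier across the change.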
The Task Manager
The task manager is the glue between all of the systems. The transform workflow adds, updates and deletes the data held in the IDM, along with the records stored in the LOD Gateways to create the JSON-LD versions. Similarly, all of the ETL tasks described above are managed as part of this system.
Each process is triggered by a change in the system it depends on: the L1s are loaded based on changes in their native systems, and the L2 is updated based on changes published to the L1s’ ActivityStreams. The speed of the extraction processes from the native systems to the L1 Gateway is a compromise between how fast the native systems can be accessed and the impact on the performance and reliability of these business-critical systems. On the other hand, the LOD Gateways themselves are managed separately from the business-critical native systems. Thus, the resource-intensive transforms do not affect the native systems, and the LOD Gateways can be scaled up or down as needed.
With a solid foundation of linked open data and APIs to build the application upon, we can turn our focus toward the RCV frontend application itself. One of the traditional approaches for building these sorts of applications would have been to transform the data into a simplified form and load that into a database or data-store and index it. From there, the data could be utilized for search, or rendered into HTML. There are arguably benefits to this monolithic architecture (Fowler, 2014), such as fewer dependencies on external data sources, less complexity in the data and data model, and less reliance on the web browser to perform more complex tasks. The approach we’ve taken instead is to utilize a microservices-based architecture, resulting in many smaller services, each performing a single task. If done right, this should mean we end up with less code to write, debug, and maintain over time. It also opens up the potential for our APIs and services to be reused across applications in new and creative ways. Lastly, the move to VueJS enables a richer and more interactive user experience.
Background: the State of Web Development
A persistent challenge in web development has been the differences between browsers. For nearly a decade, programming shims and frameworks like jQuery were necessary to unify the particularities of different browsers like Internet Explorer, Firefox, Chrome, or Safari. The guiding assumption was to start with HTML and build interactivity on top of it. These frameworks provided developers a single method for accomplishing tasks and obviated the need to track the particularities of each browser or browser version. But now, many browsers have added auto-updates as well as support for web standards and standard APIs.
By using new frameworks like Vue, we’re able to build complex, dynamic applications that can be more responsive to the user’s interaction. The application is only downloaded by the browser once. This may be a larger initial download, but means that as a user browses between records, only the data necessary to load those pages is requested from the server. With less to load, users perceive the application to be faster and more responsive. These improvements are built on standard browser APIs and enable a different type of application to be developed—one that is far more interactive and dynamic.
Building a Linked Data RCV
Our approach to the frontend was similar to that of the backend—we started with the goal and worked backwards to break down and identify dependencies and requirements. Specifically, the aim of RCV is to provide an interface that allows users to navigate the archival collections of the Research Institute and find items related to their research questions. With the infrastructure available to us—namely, the APIs of the LOD Gateway, the ID Manager, the Getty’s IIIF Infrastructure and modern frameworks such as VueJS—we were able to identify challenges and solutions to building such an interface. Like most collection-based applications, the primary goals focus on search, discovery, and use. As the LOD Gateway provides all of the data in JSON-LD, the remaining work for the frontend was to render pages and perform search.
Accessing the JSON-LD on the client side introduced new challenges to page construction. Constructing a single page in RCV might require between 5 and 50 separate documents to be loaded from the LOD Gateway, including people, organizations, archival components, and object information, among many others. For example, to render an object page, JSON-LD for the following might be required: the object itself, the collection it is in, each of the object’s parents in the archival tree up to the collection, the creator(s) of the object, and a IIIF manifest. Once a JSON-LD document is loaded, specific fields are extracted and rendered on the page.
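The list of documents behind a single object page can be sketched as a walk up the child-to-parent links plus a lookup of the creators. The Python example below is illustrative only (RCV’s client is written with VueJS): it assumes parents are recorded under `member_of` and creators under `produced_by.carried_out_by`, as in Linked.art, uses hypothetical record ids, and leaves out the IIIF manifest.

```python
def documents_for_object_page(object_id, fetch):
    """List the ids of the documents needed to render one object page."""
    record = fetch(object_id)
    needed = [object_id]

    # Walk child-to-parent links up to the collection at the top of the tree.
    current = record
    while current.get("member_of"):
        parent_id = current["member_of"][0]["id"]
        if parent_id in needed:  # guard against accidental cycles
            break
        needed.append(parent_id)
        current = fetch(parent_id)

    # Add the creator(s) recorded on the object's production event.
    for agent in record.get("produced_by", {}).get("carried_out_by", []):
        if agent["id"] not in needed:
            needed.append(agent["id"])
    return needed


# A tiny in-memory stand-in for the LOD Gateway.
STORE = {
    "object/1": {
        "id": "object/1",
        "member_of": [{"id": "box/1"}],
        "produced_by": {"carried_out_by": [{"id": "person/1"}]},
    },
    "box/1": {"id": "box/1", "member_of": [{"id": "collection/1"}]},
    "collection/1": {"id": "collection/1"},
}

page_documents = documents_for_object_page("object/1", STORE.__getitem__)
```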
This is where the reactive model of VueJS shines, as it allows us to knit these records together as they’re loaded and render their contents on screen. Furthermore, as a page changes, only new information needs to be retrieved from the server. For example, when a user browses from one item in a box to the next item, the information about the collection will not change, and thus will not be requested again; only the information specific to the new item will be requested.
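The effect of requesting only new information can be shown with a small document cache. This Python sketch is an illustration of the idea rather than RCV’s VueJS implementation; the `DocumentCache` class, the `fetch` callable, and the record ids are assumptions.

```python
class DocumentCache:
    """Load documents by id, asking the server only for ones not yet seen."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._docs = {}
        self.requests = []  # ids actually requested, in order

    def load(self, doc_ids):
        for doc_id in doc_ids:
            if doc_id not in self._docs:
                self.requests.append(doc_id)
                self._docs[doc_id] = self._fetch(doc_id)
        return [self._docs[doc_id] for doc_id in doc_ids]


def fake_fetch(doc_id):
    """Stand-in for an HTTP request to the LOD Gateway."""
    return {"id": doc_id}


cache = DocumentCache(fake_fetch)
# First page: an item, its box, and its collection.
cache.load(["item/1", "box/1", "collection/1"])
# The user moves to the next item in the same box.
second_page = cache.load(["item/2", "box/1", "collection/1"])
```

After both page loads, `cache.requests` contains four ids rather than six: the box and the collection were fetched once.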
Although the LOD Gateway’s SPARQL endpoint could provide some search functionality, it was not performant enough to meet user expectations. Instead, we decided upon Elasticsearch. In order to leverage a dedicated search backend, we created a simple server application that could translate search requests into queries for Elasticsearch and return the results as JSON to our application. This allowed us to perform searches; however, the question of how to load the data into Elasticsearch and maintain it remained.
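The translation step can be sketched as a function that turns request parameters into an Elasticsearch query body. The Python example below builds the body with standard query DSL constructs (`bool`, `multi_match`, `term`, `from`, `size`); the parameter names and the index fields `title`, `description`, and `collection` are hypothetical and do not describe RCV’s actual index.

```python
def build_search_body(params, page_size=20):
    """Translate simple search parameters into an Elasticsearch query body."""
    text = params.get("q", "").strip()
    if text:
        must = [{"multi_match": {"query": text, "fields": ["title", "description"]}}]
    else:
        must = [{"match_all": {}}]

    # Exact-match filters, for example restricting results to one collection.
    filters = [
        {"term": {field: value}}
        for field, value in sorted(params.get("filters", {}).items())
    ]

    page = max(int(params.get("page", 1)), 1)
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        "from": (page - 1) * page_size,
        "size": page_size,
    }


body = build_search_body(
    {"q": "photographs", "filters": {"collection": "example-collection"}, "page": 2}
)
```

The resulting dictionary can be serialized to JSON and sent to the search endpoint of an index.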
Data Transformation & Indexing
Loading data into Elasticsearch required the creation of an indexing service that could read the ActivityStream of the LOD Gateway and update the Elasticsearch index accordingly. Each indexing task is quite similar to how a page is loaded in the client. The indexer loads the data into Elasticsearch while the VueJS client loads the data into the browser—the first constructs the object to index while the second renders the object as part of the page. As these two methods are relatively similar, one of our early experiments was to evaluate whether we could reuse code between VueJS and an indexer. NodeJS proved to be a successful platform for running the complex transform logic from JSON-LD on the server.
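An indexing pass of this kind can be sketched as a loop over activities that either re-indexes or removes each affected record. In the Python example below, a plain dictionary stands in for the Elasticsearch index, and the `fetch` and `transform` callables, the activity layout, and the record ids are assumptions for illustration; the indexer described here runs on NodeJS.

```python
def apply_activities(activities, fetch, transform, index):
    """Bring a search index up to date from a list of stream activities."""
    for activity in activities:
        record_id = activity["object"]["id"]
        if activity["type"] == "Delete":
            index.pop(record_id, None)
        elif activity["type"] in ("Create", "Update"):
            # Fetch the current JSON-LD and build the document to index.
            index[record_id] = transform(fetch(record_id))
    return index


# Hypothetical gateway contents and activities.
GATEWAY = {
    "object/1": {"id": "object/1", "_label": "Letter"},
    "object/2": {"id": "object/2", "_label": "Photograph"},
}
ACTIVITIES = [
    {"type": "Create", "object": {"id": "object/1"}},
    {"type": "Create", "object": {"id": "object/2"}},
    {"type": "Delete", "object": {"id": "object/1"}},
]


def to_index_document(record):
    return {"id": record["id"], "title": record["_label"]}


search_index = apply_activities(ACTIVITIES, GATEWAY.__getitem__, to_index_document, {})
```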
Being able to share both logic and a single language between the client and the indexer had a number of benefits:
- It required less code to be written, as more code could be reused between the indexer and the VueJS frontend.
- It meant developers could easily move between the indexer and frontend because the logic was familiar.
- It reduced the potential for bugs.
Using the same logic for the frontend and indexer greatly reduced the potential for introducing regressions and bugs. Data from the LOD Gateway is not necessarily optimized for display: the title of an archival component, for example, may be constructed from multiple fields (title and the first creation date). If that logic is managed in two places, the developer must update it twice, first in the indexing code and second in the display code. In addition to reducing the risk of introducing bugs, managing this logic in one place also simplifies the development process, as updates to the display logic are immediately reflected in the indexer and vice versa.
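The benefit of a single shared function can be sketched as follows. RCV shares JavaScript between the browser and NodeJS; Python is used here only for illustration. The title format ("title, year"), the use of `begin_of_the_begin` for the first creation date, and all function names are assumptions for this example.

```python
import html


def display_title(record):
    """Build a display title from the first name and the first creation year."""
    names = [
        n["content"]
        for n in record.get("identified_by", [])
        if n.get("type") == "Name"
    ]
    title = names[0] if names else "Untitled"
    begin = record.get("produced_by", {}).get("timespan", {}).get("begin_of_the_begin", "")
    year = begin[:4]
    return f"{title}, {year}" if year else title


def to_index_document(record):
    """Used by the indexer."""
    return {"id": record["id"], "title": display_title(record)}


def render_heading(record):
    """Used by the page renderer."""
    return f"<h1>{html.escape(display_title(record))}</h1>"


record = {
    "id": "object/1",
    "identified_by": [{"type": "Name", "content": "Letters & notes"}],
    "produced_by": {"timespan": {"begin_of_the_begin": "1966-01-01T00:00:00Z"}},
}
```

A change to `display_title` is picked up by both the index document and the rendered heading, so the two cannot drift apart.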
To manage the shared code, we created a series of helper libraries to store common functions that transform linked open data. The helper libraries in RCV can be split into two categories. The first is a set of functions that are specific to RCV’s internal logic, such as how the curators want titles to be constructed, dates to be formatted, or the extraction of specific fields. The second is a class of functions that are more general and deal with the structure of the data being served from the LOD Gateway. These were packaged up as a reusable Node Package Manager (NPM) module that could be used across applications. Having a separate module for these functions allowed us to abstract, or simplify, the application-specific logic. As a result, this module has been and will continue to be useful in the development of new applications at the Getty that utilize Linked.art. Maintaining the library and module separate from the application made it easier to test the functionality, manage updates to the data in a single location, and fix bugs centrally.
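A helper in the second, more general category might look like the Python sketch below, which pulls values out of a Linked.art-style record by their classification. The function name and the classification URI are hypothetical, and the published NPM module’s actual API is not described here.

```python
def values_classified_as(record, field, classification_id):
    """Return the `content` of entries in `field` that carry a classification."""
    values = []
    for entry in record.get(field, []):
        class_ids = {c.get("id") for c in entry.get("classified_as", [])}
        if classification_id in class_ids and "content" in entry:
            values.append(entry["content"])
    return values


# A hypothetical classification URI used only for this example.
PRIMARY_NAME = "https://example.org/vocab/primary-name"

person = {
    "id": "person/1",
    "type": "Person",
    "identified_by": [
        {"type": "Name", "content": "Sample Name",
         "classified_as": [{"id": PRIMARY_NAME, "type": "Type"}]},
        {"type": "Name", "content": "Alternate Name"},
    ],
}

primary_names = values_classified_as(person, "identified_by", PRIMARY_NAME)
```

Application-specific functions, such as title construction, can then be written in terms of helpers like this one without repeating the traversal logic.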
One of the challenges of building research applications is the inability to develop an interface that meets the needs of all users. There is always a more exact, advanced query that the interface doesn’t facilitate. Publishing our data as linked open data provides an alternate means to query the data—RCV does not need to be the sole source of answers for researchers. Instead, it allows the interface to provide an initial method to query and introspect the data that hopefully meets the needs of 80-90% of the users. The remaining users—those with unusual or unique research interests—have the ability to take advantage of the underlying APIs, the SPARQL query engine and the ability to download the JSON-LD and query it directly.
One type of navigation we aimed to facilitate was the ability to easily move among sibling records, that is, to digitally traverse objects located in the same box or archival collection. However, because the data served from the LOD Gateways is split into separate JSON-LD documents, these document boundaries make it difficult to implement seemingly simple features such as navigating to the next or previous box or record in an archival collection. Additionally, it was difficult to show the total number of digitized items in a collection or archival box. We were able to take advantage of the LOD Gateway’s SPARQL endpoint to address both these and similar issues. By querying across the documents, it was possible to get an aggregate count of digitized items that could be stored in Elasticsearch for display. Similarly, because the data loaded into the SPARQL endpoint could be traversed in both directions, one could easily use a query to find the closest siblings to a box or object.
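The underlying idea, inverting child-to-parent links so that siblings and counts become easy to compute, can be sketched in Python as follows. This stands in for what a SPARQL query would do against the graph; the `member_of` links follow Linked.art, while the record ids, the use of input order as sibling order, and the set of digitized ids are assumptions for illustration.

```python
from collections import defaultdict


def children_by_parent(records):
    """Invert child-to-parent links into a parent -> [children] index."""
    index = defaultdict(list)
    for record in records:
        for parent in record.get("member_of", []):
            index[parent["id"]].append(record["id"])
    return index


def neighbours(record_id, parent_id, index):
    """Return the (previous, next) siblings of a record, or None at the ends."""
    siblings = index[parent_id]
    position = siblings.index(record_id)
    previous = siblings[position - 1] if position > 0 else None
    following = siblings[position + 1] if position + 1 < len(siblings) else None
    return previous, following


def count_digitized(parent_id, index, digitized_ids):
    """Count the children of a parent that have digitized material."""
    return sum(1 for child in index[parent_id] if child in digitized_ids)


records = [
    {"id": "item/1", "member_of": [{"id": "box/1"}]},
    {"id": "item/2", "member_of": [{"id": "box/1"}]},
    {"id": "item/3", "member_of": [{"id": "box/1"}]},
]
index = children_by_parent(records)
digitized_in_box = count_digitized("box/1", index, {"item/1", "item/3"})
```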
For Getty Digital, building the Research Collections Viewer on open standards with linked open data and Linked.art at its core was a logical step. It meets our commitment to both open access and community standards. Designing and developing the application to be sustainable and extensible is equally important. The microservice based architecture and Linked.art APIs at its core provide a strong foundation for RCV to grow both in terms of collections and future functionality.
The initial release of RCV included four of the Research Institute’s archival collections including over 100,000 images, with the plan of adding additional collections and content over time. The tool represents the first phase of what we hope to accomplish with this novel infrastructure and software. As we add more data into this infrastructure, we’re excited by the possibility of creating new connections among objects, people, and collections both within systems at the Getty and hopefully beyond.
Building a complex application such as RCV is a community effort; one that is built on the hard work of archivists, catalogers, curators, systems administrators, application administrators, project managers, management, and developers. Without this vital collaboration, RCV would not have been a success.
Berners-Lee, T. (2006). Linked Data—Design Issues. Available https://www.w3.org/DesignIssues/LinkedData.html
CIDOC. (n.d.) “CIDOC-CRM: Conceptual Reference Model.” Accessed December 4, 2020. Available at: http://www.cidoc-crm.org/
Fowler, M. & E. Lewis. (2014) Microservices, a definition of this new architectural term. Available https://martinfowler.com/articles/microservices.html
Gruber, E. (2020). numishare: First pass at mapping Nomisma.org to Linked Art JSON-LD. Numishare. http://numishare.blogspot.com/2020/08/first-pass-at-mapping-nomismaorg-to.html
Hausenblas, M. (2015). “5 Star Open Data.” Accessed December 30, 2020. Available https://5stardata.info/en/
“Linked Art Community.” (n.d.). Linked Art. Accessed December 30, 2020. Available https://linked.art/community
Linked Open Data at SAAM | Smithsonian American Art Museum. (n.d.). Retrieved January 12, 2021, from https://americanart.si.edu/about/lod
Neely, L., A. Luther, & C. Weinard. (2019). Cultural Collections as Data: Aiming for Digital Data Literacy and Tool Development – MW19 | Boston. Available https://mw19.mwconf.org/paper/cultural-collections-as-data-aiming-for-digital-data-literacy-and-tool-development/
Roadmap and Releases | Arches Project. (2020). Available https://www.archesproject.org/roadmap/
Sanderson, R., P. Ciccarese, & B. Young. (2017). Web Annotation Data Model. Available https://www.w3.org/TR/annotation-model/
Snel, J. & E. Prodromou. (2017). Activity Streams 2.0. Available https://www.w3.org/TR/activitystreams-core/
Swan, C. (2020). InfoQ. 55th Anniversary of Moore’s Law. Available https://www.infoq.com/news/2020/04/Moores-law-55/
Szekely P. et al. (2013) “Connecting the Smithsonian American Art Museum to the Linked Data Cloud.” Semantic Web: Semantics and Big Data. ESWC 2013. Lecture Notes in Computer Science, vol 7882. https://doi.org/10.1007/978-3-642-38288-8_40
Brin, Adam, Garcia, Gregg, O'Steen, Ben, Webb, Charles, Lam, Pam and Zawacki, Selina Chang-Yi. "Building linked open data web applications from the outside in—lessons learned from building the Getty’s Research Collections Viewer." MW21: MW 2021. Published January 15, 2021.