Getting Your Data to Your Users: A Nerdy Deep Dive into APIs, ETLs, and Aggregated Databases
Abstract: What does it really take to provide access to a museum’s data? The Getty, the Dallas Museum of Art, and MoMA shared similar challenges: they all wanted to integrate collection and archive data into multiple visitor-facing platforms, planning for both current use cases and future iterations. But they took different approaches: APIs, ETLs, middleware aggregators, or a combination of all three. Each institution made different choices based on its internal needs and the specific audiences it wanted to emphasize. Every audience has its own set of values, and the tradeoffs between data completeness, ease of use, flexibility, and timeliness had to be weighed to arrive at the right solution for each institution. Nina Callaway of Art Processors will facilitate a conversation with key members of each project: David Newbury of Getty, Rik Vanmechelen of MoMA, and Shyam Oberoi of Royal Ontario Museum (formerly of Dallas Museum of Art) about how these applications have been implemented to reduce museum staff effort, support creative projects, and increase content capacity. They’ll analyze why they made the choices they did, and discuss the pros and cons of each and their plans for the future.
Keywords: data, data integration, TMS, API, ETL, middleware
In 2021, there is no longer a question about whether or not museums are digital organizations, just as there is no longer a question about whether or not digital technology is part of our culture. The work of capturing, presenting, and managing information is mediated through technology, and the technologists within the institution are often responsible for the work of translating the very human work of the museum into abstractions that can work for computers—while ensuring that those abstractions still enable humans to do their work! Application Programming Interfaces, or APIs, are a common technique for managing those abstractions, and in this paper we examine how those abstractions and APIs function—not just as technologies, but as part of data systems that bridge between humans and computers. APIs are, most importantly, social contracts and communication tools that allow humans to build shared models and establish the trust needed to work together to enable technology that helps our organizations fulfill their mission.
This paper is the result of a series of conversations about APIs between David Newbury, Head of Software at Getty; Rik Vanmechelen, Manager of Enterprise Applications at The Museum of Modern Art; and Shyam Oberoi, Chief Digital Officer of the Royal Ontario Museum, facilitated by Nina Callaway, Producer at Art Processors. All of us have extensive experience building and consuming cultural heritage APIs—and all of us have strong opinions about them. Shyam, Rik, and David are all in the process of planning or implementing yet another API—we’ve all had experience at institutions of various sizes, and we’ve learned a lot about what it means to take on the responsibility of creating an API for a museum.
Throughout our conversations, we identified seven points that we agreed on and felt were critical for anyone who finds themselves in our unfortunate position of being responsible for an API.
Point #1: Museums are knowledge organizations.
The current ICOM definition of a museum is:
“A museum is a non-profit, permanent institution in the service of society and its development, open to the public, which acquires, conserves, researches, communicates and exhibits the tangible and intangible heritage of humanity and its environment for the purposes of education, study and enjoyment.”
To simplify for the purposes of this paper: Museums collect things, and the generation of knowledge, the management of data, and the presentation of information about those things within the context of a broader human experience is part of the core work and mission of museums.
The generation of knowledge is a fundamentally human task. Understanding how things work in the real world—how information functions within society, and how to take advantage of that information—is work that, at least for now, cannot happen without humans. The work required is that of judgement, not mere reckoning. Computers cannot and should not replace that work. However, once knowledge has been created, there is often a desire to capture that knowledge in a fixed form.
Point #2: Data Entry and Data Presentation are different aspects of this use of technology.
Museums have always captured knowledge, through internal files as well as publications, research reports, and other practices. Beginning as early as 1967, there has been interest in the use of computers to aid in that work of data management; today, this is embodied in the Collection Management System, or CMS. CMSes create a human/computer interaction point, or interface, that allow a human to record data within a digital system in a managed way. We call that human work data entry. Someone develops a digital tool and underlying data system that allows a human to collaborate with the computer in the encoding of that information in a machine-readable way. Usually that takes the form of words and numbers—human affordances, turned many levels below into the bits and bloops of binary information. We also do this with pictures through digital imaging—turning colors and shapes into those same bits to be stored and managed by computers.
There is no intrinsic benefit to encoding human knowledge into ones and zeros: it’s both annoying and boring. We suffer through this work because we assume that the effort will result in the information that we enter being used for some future purpose, one that will affect other humans down the line. In the corporate world, we might call this the “business value provided,” and the importance of inventory management and conservation to museum practice should not be overlooked. However, within cultural heritage, our motivation to do this work is often as a way to enable sharing that knowledge with our communities: we call this access.
Traditionally, access meant letting people in the doors to see the things, artfully arranged. Technology enables a much broader range of possibilities for that access—the traditional Museum Collection, online; but also audio guides, interpretive materials, exhibition documentation, and innumerable other digital experiences. Each of these are presentations or visualizations of information within a given context to meet a user need and support our larger institutional mission. Some interfaces, such as MoMA’s wall label caption generator, meet internal user needs; others, such as the Getty’s Animal Crossing experiment, meet broad public desires. We develop these interfaces on top of the data, but the context of these presentations are not one-to-one representations of the underlying information as entered into the CMS. The information has to be transformed, filtered, and reconstituted back into a new form that accommodates the human experience. That’s the work of the technologist, who—amongst other tasks—invents and encodes complex rules regarding how to move information from one place to another and from one form into another.
Point #3: Technologists create abstractions to help computers cope with the messy “real world” of humanity.
This work is not something that humans are innately good at—we’re intuitive and brilliant and our core strength is that we can function at all without enough information and still manage to make generally good decisions. If something isn’t working correctly, we work around it and invent a solution that will help us do the things we need to do.
Computers are not brilliant. They follow rules precisely; they never guess. It’s their strength, but it means that someone has to be brilliant for them. The real world does not help here, either—it’s also messy and squishy and full of exceptions—and those exceptions and complexities leak into our technology, particularly at the point of data entry.
Within cultural heritage, this messiness tends to be managed at the point of data entry into the Collections Management System. Currently, each of our three institutions uses TMS (The Museum System), but we’ve all also worked with other systems and management layers as we attempt to shoehorn the real world—or worse, the pretend world of Art History—into a series of grey boxes with strange labels.
Every single organization we’ve seen—and we’ve seen lots—abuses their CMS. There is a critical bit of data in the real world that does not fit into the boxes available, and some brilliant, squishy human works around the system, puts that data in the “wrong” box, and tells the other four people in the world who care about the critical data that the field marked “constituent role type” actually now holds “profession.” And they all roll their eyes at the system, and compliment their colleague on the workaround, and write a Post-It note to stick above their desk to remind them. Five years later, it’s just “the way things work.”
This works amazingly for the data entry people. The poor computers, though, are not good at reading Post-It notes on walls. And more importantly, the engineer who created the “constituent role type” field had a set of assumptions around the types of data that would go in that field, and probably had to work around something else under the hood, so that field is actually, within the database, called “REV_CONST_TYPE”. It’s wrongness all the way down.
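The fix for this wrongness usually lives in the transformation layer: the rules on the Post-It note get encoded as explicit, documented code. A minimal sketch of what that looks like follows; the field names, values, and the "profession" workaround itself are the hypothetical example from above, not any real institution's schema.

```python
# Hypothetical sketch: an ETL step that undoes a CMS data-entry workaround.
# All field names and values here are invented for illustration.

# In the database, the column is REV_CONST_TYPE; the UI labels it
# "constituent role type"; the data entry team actually stores professions in it.
RAW_ROW = {"CONSTITUENT_ID": 1234, "REV_CONST_TYPE": "Architect"}

def normalize_constituent(row: dict) -> dict:
    """Translate the data-entry schema into the presentation schema,
    turning Post-It-note knowledge into an explicit, reviewable rule."""
    return {
        "constituent_id": row["CONSTITUENT_ID"],
        # Per the data entry team, this field has held professions for years,
        # so the abstraction exposes it under its real meaning.
        "profession": row.get("REV_CONST_TYPE"),
    }

print(normalize_constituent(RAW_ROW))
# → {'constituent_id': 1234, 'profession': 'Architect'}
```

The value is not the three lines of code; it is that the workaround is now written down where the next developer (and the computer) can find it.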
To add complexity to this, museums as digital institutions have matured. We also now have to deal with the fact that the CMS is no longer the only place this data is managed—a mature institution like the Getty has two Digital Asset Management systems, four Collection Management systems, three Content Management systems, Customer Management Systems, and a collection of home-grown interfaces. At the Dallas Museum of Art, the data sources include a similar array of collections data, exhibition archives, and digital asset management systems, but also a treasure trove of not-yet-digitized scholarly and other contextual information that needed a consistent and persistent way to be organized and managed.
Each of these tools provides an affordance for input of information—they’re all fit for purpose, assuming the purpose is to help the poor soul doing data entry lose as little of their life as possible shoving the complexities of the real world into said little grey boxes. But, in order to provide access to our institutional knowledge in a form that makes sense, we need to pull data from several of these sources—taking into account not only the difference between the schema-of-data-entry and the schema-of-interface-presentation, but also all of the underlying technological cruft needed to get data out of those systems—keys, network latency, SOAP APIs, and the ramifications of the bad technology decisions some other programmer made late on a Friday evening eight years ago.
To do this, every technologist keeps within their heads an enormously complex set of rules and constraints about the real world, alongside a mental model of all of the compromises and constraints of our ability to express those rules and understand them. Even the best technologist cannot do this perfectly—remember, they’re squishy humans as well, just ones trained in a practice of suppressing their natural instincts in service of the needs of a computer. Instead, developers use a practice called “abstraction”: they take a big chunk of the messiness, make up some human-understandable rules about it, and use those rules to “pretend” that it’s something simple that they can reason about as a human.
Point #4: Sometimes, we expose our abstractions through formal patterns called APIs.
Every interface has one of these abstractions, sitting between the data store used by our data entry teams and the interface that we present to enable access. It must—no sane person reads raw database tables, and more importantly, the needs of the data entry team are almost certainly different from those of users who want to access institutional knowledge. Sometimes, the abstraction is ad-hoc, hidden within the interface code; sometimes we call it out as a specific part of the work. It’s a point of professional pride for software developers that they generate “elegant” solutions for this abstraction layer—that they’ve hidden the messy, uncomfortable needs of those squishy humans away well enough that they don’t need to think about them to solve the current problem. They understand the use case of the interface and don’t just hide the data management layer; they formally translate it from the affordances of data entry into the requirements for interface development. Yes, the abstractions are always lossy, and yes, the underlying chaos tends to leak out, but in general they’re good enough to let us do the other parts of our job, which involve actually coaxing the computer to display said interface.
This work of conceptualizing, developing, and creating an elegant abstraction is one of the most difficult and satisfying things a technologist can do. That said, once we’ve done the hard work of generating an abstraction, we want to reuse it: to pull it out of the underlying code and make it a standalone thing—something that can be used again and again. When we choose to take the abstraction and formalize it—we tend to call it an API, or Application Programming Interface.
One good definition of an API is “a well-defined interface that allows one software component to access programmatically another component and is normally supported by the constructs of programming languages.” It’s another interface, though not for humans who want to understand artwork, but for applications—by which we really mean it’s for technologists who build applications. Sometimes our APIs are software libraries, but in modern usage an API is almost always a network-available application that returns machine-readable data for use by other applications. Once you have done the extraction and translation work to generate such an abstraction, the additional engineering needed to create a REST-based JSON API for it is trivial—the tools and patterns to do so effectively are easily available and widely known. That is to say, we can then easily create an API architecture designed to increase scalability, portability, and reliability.
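To make that "trivial" step concrete, here is a minimal sketch of the routing-and-serialization layer that a REST-based JSON API adds on top of an existing abstraction. The object records and URL scheme are invented for illustration, and real implementations would use a web framework rather than a bare function; the point is how little is left to do once the abstraction exists.

```python
# Minimal sketch of the abstraction-to-API step, standard library only.
# Object data, field names, and the URL scheme are invented for illustration.
import json

# Pretend this is the abstraction layer's output: presentation-ready records.
OBJECTS = {
    "obj-1": {"id": "obj-1", "title": "Untitled", "artist": "Unknown"},
}

def handle_get(path: str) -> tuple:
    """Route a GET request path to (status, JSON body): the thin REST layer
    that sits on top of an already-built abstraction."""
    parts = path.strip("/").split("/")
    if parts == ["objects"]:
        # Collection endpoint: list every presentation-ready record.
        return 200, json.dumps(list(OBJECTS.values()))
    if len(parts) == 2 and parts[0] == "objects" and parts[1] in OBJECTS:
        # Item endpoint: one record by identifier.
        return 200, json.dumps(OBJECTS[parts[1]])
    return 404, json.dumps({"error": "not found"})

status, body = handle_get("/objects/obj-1")
```

Everything difficult here—deciding what `title` and `artist` mean, and which records exist—happened before this code was written.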
Point #5: The primary goal of an API is not to generate a data abstraction, but to communicate a guarantee to others around the stability of that abstraction.
The question for museums, and in particular for the software engineering teams within museums, is not if we can, but if we should do the work of generating this API—and then if we should expose that API to the general public. We are understandably proud of this work, and the desire to make visible the underlying labor that goes into the difficult work of abstraction can become the primary motivation for generating an API. It can also become an easy way to justify the work and to connect the often abstract work of software engineering back to that broader mission of the organization: to provide access to knowledge to our communities.
Perhaps the single most important lesson the three of us have learned in our work in cultural heritage is that we must resist the temptation to use this as a justification for the creation of an API. The primary goal of a public API is not to publish data abstraction, but to communicate a guarantee to others around the stability of that abstraction. Beyond providing access to the data and abstraction, the value of an API is that it provides a fixed point of reference between teams so those teams can independently iterate without having to coordinate on all areas of their work, allowing information hiding for internal details.
When we three started working with museums, the use case we were addressing with our APIs was to generate an abstraction on top of our Collection Management System to create “The Collection, Online.” Our APIs were there to allow us to merge in the inevitable data that was NOT in the CMS, manage the inconsistencies of our users’ data entry practices, and prevent the expected massive traffic from impacting the often brittle data access mechanisms provided by the CMS. Regularly, these APIs were internal-only, created as part of ETL processes, and were usually understood only by the sole software developer responsible for the API, the Collection Online, and rebooting the iPads in the gallery when they crashed.
Over the years, as digital technologies have become even more pervasive, as museums have expanded their digital outreach, and as the conversation around Open Access has gone from a wild dream to the expected status quo, we’ve discovered that our Collection Online is not the only interface needed to provide access to our communities. We’ve begun working with vendors who need access to our data (but not access to our back-of-house systems). We’ve begun providing our collections information to researchers and scholars for analysis. We’ve built audio guides. We’ve integrated our collections into online stores. We’ve shared our data with Google, Wikidata, and others. We’ve integrated structured markup within our pages to help our robot friends crawl them for improved SEO.
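That structured markup for crawlers is itself just another presentation of the same abstraction. As a hedged sketch, here is how a collection record might be rendered as schema.org JSON-LD for embedding in an object page; the record values are invented, and a real mapping would carry many more fields.

```python
# Illustrative sketch: rendering a collection record as schema.org JSON-LD,
# the kind of structured markup search engine crawlers consume.
# The object values here are invented.
import json

def object_jsonld(obj: dict) -> str:
    """Render a presentation-ready record as a schema.org VisualArtwork
    block, ready to embed in a <script type="application/ld+json"> tag."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "VisualArtwork",
        "name": obj["title"],
        "creator": {"@type": "Person", "name": obj["artist"]},
    }, indent=2)

print(object_jsonld({"title": "Untitled", "artist": "Unknown"}))
```

Note that this consumer never sees the CMS at all; it sees only the abstraction, which is exactly the point.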
To meet these needs, the Dallas Museum of Art built a full collections and media API which provided object and media data to different endpoints: website, mobile app, and in-gallery digital interactives. This API was layered onto a middleware layer that served as a content aggregation engine. This middleware, built on MongoDB and Django, harvested data from our internal systems of record (TMS, DAM, etc.), as well as from Evernote, which we used as the platform to record and structure scholarly materials that had previously only been available in object records, exhibition catalogs, and other analog sources. One advantage of this approach is that it allowed us to use readily available open source platforms as well as well-developed and documented commercial products (e.g., Evernote), which meant that we could focus more attention on putting these pieces together in a flexible way that would work for non-technical staff without spending a lot of time and effort on custom application development.
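The aggregation pattern at the heart of such middleware can be sketched in a few lines. This is a toy illustration, not the DMA's implementation: the source names, fields, and provenance scheme are invented, and the real system harvested from TMS, a DAM, and Evernote into MongoDB.

```python
# Illustrative sketch of a content aggregation engine: merge records from
# several systems of record into one presentation document per object.
# Source names and fields are invented for illustration.

def aggregate(object_id, sources):
    """Build one document by layering each source's fields over the last,
    recording provenance so consumers know where each field came from."""
    doc = {"id": object_id, "_provenance": {}}
    for name, fetch in sources.items():
        record = fetch(object_id) or {}
        for field, value in record.items():
            doc[field] = value
            doc["_provenance"][field] = name
    return doc

# Toy stand-ins for the harvesters:
sources = {
    "cms": lambda oid: {"title": "Untitled", "medium": "Oil on canvas"},
    "dam": lambda oid: {"image_url": "https://example.org/img/1.jpg"},
    "notes": lambda oid: {"essay": "A short scholarly note."},
}
doc = aggregate("obj-1", sources)
```

Later sources win field conflicts here; a real aggregator would make that precedence rule explicit and configurable per field.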
This level of reuse has shown the value of the API, but it has also exposed the difficulty of managing the public presentation of this information. APIs facilitate the reuse of information across multiple systems, but each of these systems has their own specific needs. A robust API should meet all of those needs, but with each new use case we discover ways that our abstraction was not quite abstract enough, or that data wasn’t quite as clean as we’d thought.
We’ve also realized that technology has a shockingly short lifespan, and that maintaining an API requires dedicated attention as underlying systems change, as platforms upgrade, and as our needs and community practices change. Ensuring that there are resources available to sustain this maintenance can be deeply challenging, since a good API is an invisible technology, and the audience for it is a handful of software developers who can find it challenging to explain exactly why they need to spend a month working to update the APIs to accommodate W3C changes to Cross-Origin Resource Sharing in Chrome.
We’ve also had the experience of leaving organizations for whom we originally built these systems. In doing so, we’ve realized that one of the primary metrics of success of our work is not “did it work,” but “does it still work now that I’m gone?” The sustainability of these APIs is directly tied to the number of people who understand the abstraction, and who are capable of then building and extending that abstraction to meet new needs.
Within the context of a smaller institution, “Our API” is primarily a tool we’ve built to help ourselves, and many museum APIs never move beyond this point. There’s nothing wrong with that: as we’ve discussed at length, abstractions are valuable tools! However, when you are your own best user, it doesn’t require much in the way of coordination. If, in the course of using the tool, you discover that your abstraction is “leaky” (as they all are) and you know how to “plug” it (as you almost certainly do), you just go ahead and do that. And (this is the important bit) you also know all of the upstream systems that are affected by that change, and can thus make the gut decision that the benefit of improving the API outweighs the cost (borne by yourself) of fixing everything upstream.
Single-user APIs, or even APIs that support a small team of people working closely together, are typically poorly documented, rapidly changing, and highly effective. The API remains part of a large set of shared “tribal knowledge,” and formal communication is rare, since everyone tends to have been part of the needed change. While the API may be an application or service outside of any interface, it can be thought of as an internal interface, owned and managed within a single team, not an external interface relied upon by other teams.
At the scale of an institution like MoMA, Dallas, or the Getty, APIs are essential for supporting the desired level of digital work. These are large institutions that have the resources (both human and financial) to take on this project, and can justify sustained effort to support them. There is enough interface work going on at all times to be able to justify team members who can devote the needed mind-space to solving these real-but-tiny problems. The downside of larger institutions is that there are many more projects, and so the communication overhead is larger. One developer can no longer keep the entire abstraction within their head, so they have to get it out of their head and explain it to other people or, worse: write it down.
At this scale, it becomes clearer that the core value of APIs is that they move an abstraction from “I understand it” to “we all understand it.” It also reveals that the cost of that abstraction is not the initial technology development, or even the conceptual work of generating and extending the abstraction, but instead that of change management. When “what we all understand” is not true anymore, we now have to communicate that change to everyone. We also have to manage the impact of those changes across all of the users of the API.
Point #6: When deciding to create an API, who you’re communicating with is more important than the data abstraction.
As an organization grows to support several developers working on separate projects, or multiple teams of developers, you can no longer “just know” how the API has been used, and the communication overhead grows. At this point, there needs to be some mechanism of communication about how to use the API, which often takes the form of ad-hoc, code-level documentation. The need also arises to communicate about changes in the API—usually through emails, code reviews, or internal meetings—or just by changing things and seeing what happens.
The first real complication comes when you begin to open access to the API to third parties. APIs open up opportunities for working with external vendors more effectively, since you only have to communicate to them about the abstraction you’ve built, and not the internal processes within the data management layers. The abstractions also provide security improvements: registrars are always nervous about providing direct access to their databases to vendors, no matter how glowing the recommendations might be.
However, once you’ve opened your API up to those vendors, documentation and change management of the API becomes a real concern. You can no longer just “change” things—if that change is a breaking change, you will need to communicate that change to the vendor, who will almost certainly charge you for the privilege. They’re also not typically aware of the other systems and why the change is happening, so the documentation will need to be comprehensive enough to communicate both intent and the magnitude of that change.
The friction of doing this is very real, and can be difficult for the original developer to understand. The abstraction they’ve built is primarily a mental model of the problem, not the code itself; it is not only clear, but obvious why the improvement needs to be made, and it can be almost painful to not make that change.
The communication overhead only grows when the API is made publicly available. Documentation becomes critical, because there is neither a personal relationship nor the opportunity for direct communication with the user. What’s more, the documentation now has to help provide not only the technical details of “how,” but also express the “why”—communicate the limitations of the abstraction out to the user. Changing the API is also more complicated—at least with vendors, you know that they exist. Once the data is in the world, you almost certainly cannot know who is using the data, and thus the communication about changes can no longer be targeted. Instead, you need to broadcast the changes at least as widely as you broadcast the availability of the API, and you will need to provide significant advance notice of these changes. And even when you do so, you will end up breaking somebody else’s project, because they didn’t know.
Point #7: APIs are not the only (or even best) solution for access.
The cost of this communication overhead is primarily because APIs are designed to facilitate direct, technical integration between unrelated bits of software—this code calls that API, and it expects that API to be there and to be the same as it was last time. That level of coordination is essential for applications that require up-to-date information—Getty needed an API to ensure that their visitor app, built by Art Processors, has the same, up-to-date collections information as their internally-developed website, and that changes are reflected in both systems at the same time.
However, the mission of our organizations is to provide access to information to our communities, not to enable syncing between applications. Often, all that’s needed is to give someone access to your data—for educational purposes, for research, to make art, or other reasons. To meet that need, static data dumps can be sufficient. Rather than create a REST API, MoMA decided to release their collection and exhibition data as large dumps in CSV and JSON format. The data is hosted on GitHub and is updated monthly, with every release standing on its own. Doing so allows MoMA to change the format of the data dumps with minimal impact on consumers of the data. Another advantage of plain text data dumps is that they can be consumed in whatever way the end user desires: from a simple text editor to full data analytics suites such as R, SPSS, or Tableau. The simplicity of this technique means that the communication overhead is minimal, and the impact of change can be managed by the end users of the data at their convenience, rather than when the institution needs it to happen.
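Part of the appeal of this approach is how little tooling a consumer needs. As a sketch, here is a CSV dump being consumed with nothing but the standard library; the sample rows and column names are invented stand-ins, not the actual schema of MoMA's published files.

```python
# Sketch of consuming a plain-text collection dump with only the standard
# library. The sample rows and column names are invented for illustration.
import csv
import io

SAMPLE_DUMP = """Title,Artist,Date
Untitled,Unknown,1967
Composition II,Jane Doe,1930
"""

# DictReader gives each row as a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(SAMPLE_DUMP)))

# A trivial analysis: group titles by artist.
by_artist = {}
for row in rows:
    by_artist.setdefault(row["Artist"], []).append(row["Title"])
```

No API keys, no rate limits, no versioned endpoints: if the format changes next month, the consumer updates their ten lines whenever it suits them.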
MoMA also has internal APIs that they use to coordinate internal tools, but they’ve chosen not to make those public—which minimizes the communication overhead involved when they need to change. Getty’s public APIs enable others to build on top of their data—but changing a field takes multiple years, and the documentation work to both explain the API and manage expectations around upcoming changes is significant and ongoing.
Conclusion: APIs are for people who need a tool to help multiple people understand and manage changing data abstractions.
Museums are in the information business, and technology is increasingly the way museums (and the rest of the world) both manage and provide access to information. As a field, we’ve built the skills needed to work within that space, and there are increasing numbers of institutions and practitioners with the capacity to build software to take advantage of technology in interesting and mission-driven ways, and to develop useful abstractions to bridge the gaps that exist between our internal and external requirements.
As museums become more sophisticated in their use of technology, the need to bridge between internal systems is clear, and formalizing these bridges in software as APIs also provides clear benefits to our organization—particularly as we also realize that we need to provide access to information across multiple platforms and audiences.
These APIs are useful for internal teams, and they can be useful as we augment our internal capacity by working with vendors and outside agencies. They can also be a way to meet institutional missions around providing public access to information—but as we increase the potential audience for these APIs, we need to remember that doing so increases the cost of change and communication, and slows down our ability to rapidly respond to new requests. It also means that those APIs become critical infrastructure for organizations and individuals outside of the walls of the museum, and the museum must have a long-term sustainability and maintenance plan for public APIs and their documentation—which means staff time dedicated to the invisible work of communicating and supporting a technology that has no visible audience benefit. Other techniques such as data dumps exist to provide public access to information without also taking on the social and technical obligations that come with a formal API.
We’re not trying to convince you not to have an API. Really, we’re not. APIs are hard because they’re not really for computers—they’re for people. Like every social commitment, they require work and a shared understanding, but properly understood and supported, they have the potential to allow others to build on our work in ways we might never expect. However, if your primary goal is to enable access for your communities, you should consider if you need that social commitment, or if you can better embody your mission by providing for that access in ways that the people involved can sustain.
Callaway, Nina, Newbury, David, Vanmechelen, Rik and Oberoi, Shyam. "Getting Your Data to Your Users: A Nerdy Deep Dive into APIs, ETLs, and Aggregated Databases." MW21: MW 2021. Published April 20, 2021.