EcoLens: Integration and interactive visualization of ecological datasets

Cynthia Sims Parr1, Bongshin Lee1,2* and Benjamin B. Bederson1,2

1Human-Computer Interaction Lab, UMIACS, University of Maryland, College Park, USA and 2Department of Computer Science, University of Maryland, College Park, USA

 

Direct correspondence to: Cynthia Sims Parr, csparr@umd.edu

 

 

ABSTRACT

Complex multi-dimensional datasets are now pervasive in science and elsewhere in society. Better interactive tools are needed for visual data exploration so that patterns in such data may be easily discovered, data can be proofread, and subsets of data can be chosen for algorithmic analysis. In particular, synthetic research such as ecological interaction research demands effective ways to examine multiple datasets. This paper describes our integration of hundreds of food web datasets into a common platform, and the visualization software, EcoLens, we developed for exploring this information. This publicly available application and integrated dataset have been useful for our research predicting large complex food webs, and EcoLens has been favorably reviewed by other researchers. Many habitats are not well represented in our large database. We confirm earlier results about the small size and lack of taxonomic resolution in early food webs but find that they and a non-food-web source provide trophic information about a large number of taxa absent from more modern studies. Corroboration of Tuesday Lake trophic links across studies is usually possible, but lack of links among congeners may have several explanations. While EcoLens does not provide all kinds of analytical support, its label- and item-based approach is effective at addressing concerns about the comparability and taxonomic resolution of food web data.

 

Keywords: food webs, visualization, data integration, taxonomy

INTRODUCTION

Ecologists are performing more meta-analyses and many researchers are integrating their datasets to achieve analyses that are unparalleled in geographic, historic, and topical scope (e.g., review by Storch and Gaston, 2004, publications from NCEAS http://nceas.ucsb.edu). Integrating large numbers of datasets is fraught with pitfalls, and problems are difficult to catch: have like elements been properly merged, have all inconsistencies among datasets been recognized and handled, are appropriate metadata available for further assessment of dataset quality? Furthermore, exploration of these multiple datasets for trends or testable patterns prior to statistical analysis is tedious, as it relies on complex SQL queries, spreadsheet macros, or specialized applications.

Data sharing itself poses an additional set of problems. A corpus of data is typically chosen by investigators for a particular purpose, and then made available to others, perhaps in a data clearinghouse. However, other investigators may have different criteria for their own purposes and may need only a subset, or they may choose an intersection of two or more corpora, with some datasets appearing in more than one corpus. These are problems faced by any domain where multiple data sources must be explored and selected for further analysis.

In this paper, we are particularly interested in addressing these problems for ecological interaction analysis. The food web research community has a long history of data sharing and integrated analyses (reviewed in Pimm, 2002). Here the datasets typically involve networks of organisms and their trophic relationships, as well as associated population characteristics, flows, and organismal attributes. While there are clearly trophic relationships, or links, within datasets, many organisms have been studied in multiple places by multiple researchers in the same or different habitats, and so there are relationships across datasets as well.

Though graph visualizations are often used in network analyses (Lima, 2006), these have typically focused on visualizing one network at a time. They emphasize the nature of the linkages among nodes in a particular network. Most food web visualizations use a node-link diagram, laid out in 2-D or 3-D space (e.g., Christian and Luczkovich, 1999; Dunne et al., 2006). Primary tasks they support are identifying clusters and the distribution of node or link attributes across these clusters. Where multiple webs are available for visualizing, they must be viewed one at a time with no support for choosing which web to view.

PaperLens (Lee et al., 2005), winner of the 2004 InfoVis contest (Fekete et al., 2004), and its successor NetLens (Kang et al., in press) illustrate an alternative approach. PaperLens was designed to allow analysis of trends in research publications and exploration of topics, authors and other publication metadata. It provides easily sorted and scrolled tables, whose items are coupled to relevant items in related tables. Linkages are revealed primarily by interaction with these tables, but also by a Degrees of Separation Links diagram. Information is summarized into bar charts, and also linked to the items in tables that generate the bars. PaperLens was designed for exploration of a digital library, not for visualizing scientific data.

In this paper, we describe the selection and integration of datasets. Then, we describe EcoLens, an enhanced version of PaperLens that provides effective filtering, querying, and visualizing of multiple ecological interaction datasets. We give examples of results gained by using it on the integrated database and the results of a qualitative evaluation of EcoLens. Finally, we summarize lessons learned and propose new directions for tool development.

METHODS

1 Database development

1.1 Data Requirements

The goal of our theoretical ecology research is to develop effective algorithms for predicting trophic links in a system where they are unknown, or where conditions and therefore the known trophic linkages will change. At present we focus on presence or absence of links, i.e. the basic network topology critical for more sophisticated network analysis and modeling projects such as Christian and Luczkovich (1999). Our approach (Parr, in prep) is to take advantage of large numbers of known trophic links and use similarity in attributes or evolutionary relationships to predict whether links exist among organisms whose trophic links have not been studied. Thus, the datasets to integrate include studies of food webs throughout the world, including the names of organisms, their links, and metadata about each of these studies such as when, where, and in what habitat it was conducted. Furthermore, our algorithms require information about the evolutionary relationships among organisms, and the attributes of those organisms. The database should be maintained online in order to better integrate with SPIRE forecasting tools (http://spire.umbc.edu, e.g., Parr et al., in press), and potentially for integration with our other project datasets and tools.

1.2 Data Sources and Integration

Below we describe the original data sources and our process of modifying them to integrate into our database. A current MySQL database schema is available at http://www.cs.umd.edu/hcil/biodiversity.

Taxonomy and evolutionary information

We followed the integrated classifications as in Parr et al. (2004) for animals and ITIS (2006) for plants and other organisms. These compilations of multiple sources provide an internally consistent source of names and allow us to use phylogenetic or taxonomic relationships among food web nodes in our other analyses (in prep). Information from these sources did not need to be modified in order to be integrated. We made some effort to identify and replace synonyms in the food web data with current names using these sources. Typographical errors in food web node names were also fixed wherever they were identified.

Ecological interaction data

We obtained ecological interaction data from online repositories in the most machine-readable format we could obtain (usually ASCII files, but occasionally MS Excel spreadsheets or PDFs). We focused initially on integrating large multi-dataset sources which have already been subject to multiple analyses, such as Cohen's EcoWEB (Cohen, 1989), the trophic webs at the NCEAS Interaction Web Database (Vazquez, 2005), and the Webs on the Webs corpus (Dunne et al., 2006). We also included two webs specifically to see how EcoLens can handle taxon list comparisons (Jonsson et al., 2005). We emphasize that these are merely a starting point and not intended to be wholly representative of the data available. Integration of webs was achieved primarily on two dimensions, habitat and organism names. Habitat categorization was determined by manual inspection of original data files or published studies. It followed the moderately rich biome categories used by the Animal Diversity Web ontology (Parr et al., 2005).

Mapping food web entities (nodes) to the scientific names in our evolutionary data involved 1) moving modifiers such as age and size classes and other descriptors to other database fields; 2) searching taxonomic databases for exact or approximate matches to scientific names; and 3) determining the most appropriate scientific name or names for a common name. Step 1 was accomplished largely by scripts written in Java (available upon request). Steps 2 and 3 were handled first by scripts accessing our own database information; remaining typos and synonyms were then handled by manual inspection and searching of external sources such as FishBase.org or Google.com.
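As an illustration only, steps 1 and 2 might look like the following Python sketch. It is not the Java code mentioned above: the modifier vocabulary, the three-name `KNOWN_NAMES` list (a stand-in for a taxonomic database lookup), and the 0.85 similarity cutoff are all assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical modifier vocabulary; real data would need a longer list.
MODIFIERS = ("juvenile", "adult", "larvae", "larval", "eggs", "small", "large")

# Stand-in for the taxonomic name table; a real lookup would query a database.
KNOWN_NAMES = ["Scomber japonicus", "Engraulis japonicus", "Balanus balanoides"]


def split_modifiers(node_name):
    """Step 1: separate descriptors (age, size class) from the name itself."""
    words = re.split(r"[\s,()]+", node_name.strip())
    modifiers = [w for w in words if w.lower() in MODIFIERS]
    name = " ".join(w for w in words if w and w.lower() not in MODIFIERS)
    return name, modifiers


def match_name(name, cutoff=0.85):
    """Step 2: try an exact match, then the closest approximate match."""
    if name in KNOWN_NAMES:
        return name
    close = difflib.get_close_matches(name, KNOWN_NAMES, n=1, cutoff=cutoff)
    return close[0] if close else None  # None means: leave for manual inspection


name, modifiers = split_modifiers("Scomber japonicus (juvenile)")
matched = match_name("Scomber japonicas")  # misspelled on purpose
```

Names that return `None` correspond to the remainder that would be resolved by hand, as described above.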

In some cases, a node name needed to be mapped to multiple taxonomic names. For example, "birds of prey" becomes Falconiformes and Strigiformes; "foxes" becomes the individual fox species or genera known to occur in the relevant geographic location. This process is in effect the opposite of constructing trophospecies, where a name is found for a group of taxa that all share the same trophic link. Trophospecies can be sufficient for understanding trophic relationships and a way to avoid problems with taxonomic resolution (reviewed in Dunne, 2005), but de-aggregating trophospecies into taxa is critical for integrating data across multiple food web datasets. We mapped to the narrowest scientific name possible to include all the likely instances in the food web.

For those nodes where a taxonomic name was not possible to assign, we mapped to a controlled vocabulary. For example, "Dissolved Organic Matter" in one food web and "DOM" in another were both mapped to the same name, but DOM and POM (Particulate Organic Matter) were mapped to different names. We will refer to these also as taxon names, though of course these are not evolutionary units or groupings.
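A controlled vocabulary of this kind can be sketched as a simple lookup table. The Python below is illustrative only; the canonical spellings and the function name `to_controlled_term` are assumptions, not the contents of the actual database.

```python
# Hypothetical controlled vocabulary: each variant maps to one canonical term.
CONTROLLED_VOCABULARY = {
    "dissolved organic matter": "dissolved organic matter",
    "dom": "dissolved organic matter",
    "particulate organic matter": "particulate organic matter",
    "pom": "particulate organic matter",
}


def to_controlled_term(node_name):
    """Return the canonical term for a non-taxonomic node, or None if unknown."""
    key = " ".join(node_name.lower().split())  # normalize case and whitespace
    return CONTROLLED_VOCABULARY.get(key)
```

With this table, `to_controlled_term("DOM")` and `to_controlled_term("Dissolved Organic Matter")` return the same term, while `to_controlled_term("POM")` returns a different one.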

With de-aggregated taxa, it is necessary to de-aggregate links. When a trophic link was reported between two nodes, and one or both of the nodes maps to more than one taxon name, we assumed that there was a link among all the resulting taxa. Webs from Jonsson et al. (2005) were already provided in both aggregated and de-aggregated forms.
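The link de-aggregation rule amounts to taking the cross product of the taxa on each side of a reported link. A minimal Python sketch follows, assuming links are stored as (consumer, resource) pairs; the prey node "Rodentia" is a hypothetical example, not a link taken from the database.

```python
from itertools import product


def deaggregate_links(links, node_to_taxa):
    """Expand each (consumer, resource) link to every pair of mapped taxa."""
    expanded = set()
    for consumer, resource in links:
        # Nodes without a multi-taxon mapping stand for themselves.
        consumers = node_to_taxa.get(consumer, [consumer])
        resources = node_to_taxa.get(resource, [resource])
        expanded.update(product(consumers, resources))
    return expanded


node_to_taxa = {"birds of prey": ["Falconiformes", "Strigiformes"]}
reported_links = [("birds of prey", "Rodentia")]  # hypothetical reported link
expanded_links = deaggregate_links(reported_links, node_to_taxa)
```

Here one reported link becomes two, one for each order of birds of prey. When both ends of a link map to several taxa, the number of resulting links is the product of the two counts, which is why de-aggregation can raise link counts sharply.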

Non-web trophic information

Most food web research involves trophic links reported in the context of source or sink or population webs. Source webs include a basal organism and all the organisms that eat it and the relationships among them. Sink webs include a top predator and all the organisms it eats. Population webs include a community of organisms and all the links among them. We used categorizations from the original sources to indicate each web's type. In addition, our schema allows evidence of trophic links that do not come from web studies at all, but from sources that report only lists of prey for a given predator or lists of predators for a given prey or general food habits. This type of information is readily available in online encyclopedias and greatly increases the scale of available knowledge in terms of taxa, habitat, and geographic coverage. It augments the more comprehensive food web studies.

Food web attributes

Food web researchers often compare overall web characteristics, such as number of species or nodes (S), number of links (L), and connectance (L/S^2). We included the published or original values for the webs, and others we calculated based on our new de-aggregated taxa and links. Though other quantitative attributes of webs are possible to calculate (Pimm, 2002), we did not attempt to do so because of known lack of comparability of these measures across datasets collected under diverse conditions. Quantitative link strength or flow measures were not possible to compute because we currently use only presence/absence data, which are more widely available. Given our newly mapped taxon names, we were able to compute the percentage of each web's taxa that were species (or subspecies), above species, or unknown (either the entity is not truly taxonomic or its level could not be determined).
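These web-level attributes are straightforward to compute from a link list. The Python sketch below is a hedged illustration: it assumes links are (consumer, resource) pairs, that every taxon appears in at least one link, and that a hypothetical `rank_of` dictionary supplies the taxonomic rank of each name (with missing entries treated as unknown).

```python
def web_statistics(links, rank_of):
    """Compute S, L, connectance (L/S^2) and rank percentages for one web."""
    unique_links = set(links)
    taxa = {taxon for link in unique_links for taxon in link}
    s = len(taxa)
    l = len(unique_links)
    connectance = l / s**2 if s else 0.0

    counts = {"species": 0, "above species": 0, "unknown": 0}
    for taxon in taxa:
        rank = rank_of.get(taxon)
        if rank in ("species", "subspecies"):
            counts["species"] += 1
        elif rank is None:
            counts["unknown"] += 1  # not taxonomic, or rank undetermined
        else:
            counts["above species"] += 1
    percent = {k: (100.0 * v / s if s else 0.0) for k, v in counts.items()}
    return {"S": s, "L": l, "connectance": connectance, "percent": percent}


# Hypothetical three-taxon web.
example_links = [("taxon A", "taxon B"), ("taxon B", "taxon C"), ("taxon A", "taxon C")]
example_ranks = {"taxon A": "species", "taxon B": "genus"}
stats = web_statistics(example_links, example_ranks)
```

For this example S = 3 and L = 3, so connectance is 3/9, and each rank category holds one third of the taxa.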

Taxon attributes

The common name and rank of each organism were obtained from the taxonomic sources described above. To demonstrate how natural history characteristics could be integrated, we downloaded maximum mass from Animal Diversity Web (Myers et al., 2006) using their advanced inquiry search. Finally, we determined from our data tables the number of food web studies in which each organism appears.

2 EcoLens Software Development

2.1 Visualization Requirements

The goal of EcoLens is to allow biologists to explore a collection of food webs, find webs of interest, and then visualize an individual food web. As described above, our data consists of several elements such as food web study details, taxa, and habitats. Inspired by our successful experience with PaperLens, we tightly coupled multiple views to show relationships among these data elements. Within each food web, trophic relationships among taxa are important as well. Therefore, we wanted our design to combine the PaperLens overview technique with a guiding metaphor proposed in TreePlus (Lee et al., in press): "Plant a seed and watch it grow." Using this philosophy, users start with a specific node and incrementally explore the network, avoiding complexity until it is necessary. Through the overview, users can easily find not only interesting trends and patterns in the dataset but also particular webs of interest. Once they find desired webs to look at, they can investigate each food web. We consider labels essential for both overviews and details, while the need to see every single item in a single overview is not important. While EcoLens is implemented with a food web dataset, we aimed to support a variety of general tasks having to do with understanding multiple datasets and their integration.

2.2 Evaluation Methods

For internal evaluation, we constructed a list of questions that EcoLens might help a biologist answer. We then tried using EcoLens to answer these questions and reported the results in the Dataset characteristics section below.

For external evaluation, we asked ten ecologists to use EcoLens several times and fill out a survey. Four responded. These ecologists had contributed data and were asked to evaluate the mapping of the food web nodes to taxonomic names. We also asked them specific questions about the interface. The survey included both Likert-scale questions and open-ended questions. This kind of formative qualitative evaluation is not expected to demonstrate clear advantages over existing systems but to provide insight into advantages worth quantitative study.


Figure 1. EcoLens provides easy exploration of relational data tables by sorting and selecting in tabular form from complete lists to selected lists (for both web list and taxon list views, the list at the bottom is the complete list and the one at the top is the selected list), coupled with graphical representations in a bar chart (1), degrees of separation view (4), and network visualization (5).

 

2.3 System Description

As shown in Fig. 1, EcoLens consists of five main views: 1) web habitats; 2) web list; 3) taxon list; 4) degrees of separation links (DOSL); and 5) TreePlus. The web habitats view shows the list of habitats with the number of food webs in each one. Users can sort the view either by habitat name or by the number of webs. The bottom of the web list view shows all the webs in the database. When some webs are selected either by users or by the system, they are shown in the Selected Webs list at the top of the view. Similarly, the bottom of the taxon list view shows all the taxa in the database and the currently selected taxa are shown in the Selected Taxa list. When users double click on a taxon in the Selected Taxa list, EcoLens opens a dialog box to show the list of studies that contain the selected taxon (Fig. 2). The TreePlus view visualizes the food web as a node-link, tree-like diagram and the DOSL view shows one of the food chains from one taxon to the other in the web currently being visualized by the TreePlus view.

Figure 2. When users double click on a taxon, Balanus balanoides, in the Selected Taxa list, EcoLens opens a dialog box to show the list of studies that contain the selected taxon.

 

These views are tightly coupled. When users select a habitat in the web habitats view, all the webs from the selected habitat are highlighted in the Webs list. Furthermore, they are displayed in the Selected Webs list for easy access. In addition, all the taxa in these webs are highlighted in the Taxa list and displayed in the Selected Taxa list. For these three views (web habitats, web list, and taxon list), user interactions are symmetric. For example, users can select webs from the Webs list to see habitats for particular food webs or get lists of taxa. Habitats of the selected webs are highlighted in the web habitats view and taxa in the selected webs are shown in the taxon list view. Users can also copy reference information of the selected food webs to the clipboard by selecting the "Copy Reference" menu option after right clicking on the selected food webs.
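The coupling among these three views can be thought of as two symmetric lookups over the same records. The Python sketch below only illustrates that logic; it is not the C# implementation, and the record layout, web identifiers, habitats, and taxon names are all hypothetical.

```python
# Hypothetical records; in EcoLens these would come from the database.
WEBS = [
    {"id": "web 1", "habitat": "lake", "taxa": {"taxon A", "taxon B"}},
    {"id": "web 2", "habitat": "lake", "taxa": {"taxon B", "taxon C"}},
    {"id": "web 3", "habitat": "desert", "taxa": {"taxon D"}},
]


def select_habitat(habitat, webs=WEBS):
    """Selecting a habitat yields the webs in it and all taxa in those webs."""
    chosen = [w for w in webs if w["habitat"] == habitat]
    web_ids = [w["id"] for w in chosen]
    taxa = set().union(*(w["taxa"] for w in chosen))
    return web_ids, taxa


def select_webs(web_ids, webs=WEBS):
    """Selecting webs yields their habitats and all taxa they contain."""
    chosen = [w for w in webs if w["id"] in web_ids]
    habitats = {w["habitat"] for w in chosen}
    taxa = set().union(*(w["taxa"] for w in chosen))
    return habitats, taxa
```

In this sketch, `select_habitat("lake")` returns webs 1 and 2 with taxa A, B, and C, and `select_webs({"web 3"})` returns the desert habitat with taxon D.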

Users may visualize an individual food web in the TreePlus view to see trophic links among taxa by double clicking on a web in the Selected Webs list or in the Webs list. They can also press the "Graph It" button after clicking on a web in the Selected Webs list. EcoLens then builds a food web from the database and visualizes it using TreePlus. Since it uses the default root selection mechanism in TreePlus, the taxon with the most connections to others is chosen to be the root. EcoLens also adds all taxa in the web to the "From" combo box in the degrees of separation links view. Once a taxon is selected from the "From" combo box, EcoLens displays all the taxa reachable from the selected taxon in the "To" combo box with the corresponding degrees of separation. When a taxon is selected from the "To" combo box, EcoLens displays one of the shortest food chains between the two taxa. When users click on a node in the degrees of separation links view, EcoLens opens the selected taxon within TreePlus. Similarly, when users click on a node in the TreePlus view, EcoLens highlights the selected taxon within the degrees of separation links view if it is already displayed.
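The root selection and degrees-of-separation behavior described above can be approximated with a breadth-first search. The Python sketch below is an illustration under stated assumptions, not the EcoLens code: it treats trophic links as undirected when measuring separation, breaks ties alphabetically, and uses hypothetical taxon names.

```python
from collections import deque


def build_adjacency(links):
    """Undirected adjacency sets from a list of (taxon, taxon) links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def default_root(adjacency):
    """Taxon with the most connections (ties broken alphabetically)."""
    return max(sorted(adjacency), key=lambda taxon: len(adjacency[taxon]))


def degrees_of_separation(adjacency, start):
    """Breadth-first search: distance and parent for every reachable taxon."""
    distance, parent = {start: 0}, {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                parent[neighbor] = node
                queue.append(neighbor)
    return distance, parent


def shortest_chain(parent, target):
    """One shortest chain from the search start to target, or [] if unreachable."""
    if target not in parent:
        return []
    chain = []
    while target is not None:
        chain.append(target)
        target = parent[target]
    return chain[::-1]


adjacency = build_adjacency([("A", "B"), ("B", "C"), ("B", "D")])
distance, parent = degrees_of_separation(adjacency, "A")
chain = shortest_chain(parent, "C")
```

Breadth-first search returns one shortest chain even when several exist, which matches the "one of the shortest food chains" behavior. Here `default_root(adjacency)` is "B", and the chain from A to C passes through B with two degrees of separation.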

Based on requests from users, an Export to Excel feature was implemented for the web and taxon list views. This launches Microsoft Excel with the table data contained in the list where the button was used. For web lists, averages and standard deviations are automatically calculated for web statistics.

EcoLens was implemented in C# and runs on any standard Windows PC. To visualize each food web, it uses TreePlus (Lee et al., in press), a reusable component we developed to visualize networks. The web habitats and degrees of separation links views are implemented with Piccolo.NET, a shared source toolkit that supports scalable structured 2D graphics (Bederson et al., 2004). EcoLens accesses the MySQL server using a MySQL data provider, which links a data source and .NET code. To make each window dockable, it uses the DockPanel Suite (Luo, 2006), an open source docking library. EcoLens is now available for download at http://www.cs.umd.edu/biodiversity/#EcoLens.

2.4 TreePlus

TreePlus is an interactive graph visualization component based on a tree layout approach. It transforms a graph into a tree plus cross links (i.e., the additional links that are not represented by the spanning tree) and visualizes the tree instead of the graph. TreePlus uses the guiding metaphor "Plant a seed and watch it grow." This allows users to start with a node and expand the graph as needed.
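The decomposition of a graph into a spanning tree plus cross links can be sketched as follows. This Python example assumes a breadth-first spanning tree rooted at a chosen node and undirected links; how TreePlus itself builds its tree is not specified here, so treat this as one possible instance of the idea, with hypothetical node names.

```python
from collections import deque


def tree_plus_cross_links(links, root):
    """Split a graph into a BFS spanning tree (as parent pointers) and cross links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node  # this link becomes a tree edge
                queue.append(neighbor)

    tree_edges = {frozenset((child, p)) for child, p in parent.items() if p is not None}
    # Every link not used by the tree is a cross link (this also includes
    # links in parts of the graph that are unreachable from the root).
    cross_links = {frozenset(link) for link in links} - tree_edges
    return parent, cross_links


graph_links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
parent, cross_links = tree_plus_cross_links(graph_links, "A")
```

In this example the tree contains A-B, A-C, and C-D, and the single cross link is B-C, which a tree layout would need to reveal through highlighting or interaction rather than through its own drawn edge.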

TreePlus reveals the missing graph structure with visualization and interaction techniques while preserving good label readability. It highlights and previews adjacent nodes when a node is focused by a single click (Fig. 3). TreePlus updates the tree structure when a node is opened by a double click. TreePlus carefully animates the transitions[1] so that users can follow changes. The color of the node background and arrows indicates the link direction relative to the focus node. TreePlus uses the color blue for outgoing links, red for incoming links, and purple for bidirectional links. For example, in Fig. 3, the red node (Homo sapiens) eats Todarodes pacificus while Todarodes pacificus eats blue nodes (e.g., Sergia lucens). TreePlus also provides users with the option to show preview bars representing how fruitful it would be to go down a path. Color bar graphs placed below the nodes represent how many organisms are reachable in each direction.


Figure 3. Homo sapiens was set as the root, and users selected Scomber japonicus which added all its adjacent nodes to the tree. A single click on Engraulis japonicus gives it the focus and shows a preview of its adjacent nodes in the preview panel on the right. Red or blue color indicates the direction of the link.

 

3 Results

3.1 Using EcoLens to Explore Food Web Data

These results about the database were obtained using EcoLens and simple spreadsheet manipulations. Trends identified here could subsequently be analyzed statistically but that is beyond the scope of the present exercise.

Example 1: Obtaining an overview of the integrated data

Using EcoLens, it is possible to quickly get an overview of the integrated dataset (Table 1). The most frequently studied habitat in our database is plant substrates, including data from 13 countries involving 543 distinct taxa. Of the 4594 distinct taxa, 28% are found in more than one study. Studies have clearly included more taxa and links over time, and early webs appear to need more de-aggregation than later webs (Table 1). Across webs, a wide variety of taxa have been studied. Detritus is the most often included taxon in food webs (N=140). Humans appear in 20 different food webs, and two of these are prehistoric food webs.