EcoLens: Integration and interactive visualization of ecological datasets
Cynthia Sims Parr1,
Bongshin Lee1,2*
and Benjamin B. Bederson1,2
1Human-Computer Interaction Lab, UMIACS, University of Maryland, College Park, USA and 2Department of Computer Science, University of Maryland, College Park, USA
Direct correspondence to: Cynthia Sims Parr, csparr@umd.edu
Complex multi-dimensional datasets are now pervasive in
science and elsewhere in society. Better interactive tools are needed for
visual data exploration so that patterns in such data may be easily discovered,
data can be proofread, and subsets of data can be chosen for algorithmic
analysis. In particular, synthetic research such as ecological interaction research
demands effective ways to examine multiple datasets. This paper describes our integration
of hundreds of food web datasets into a common platform, and the visualization
software, EcoLens, we developed for exploring this information. This publicly-available
application and integrated dataset have been useful for our research predicting
large complex food webs, and EcoLens has been favorably reviewed by other
researchers. Many habitats are not well represented in our large database. We
confirm earlier results about the small size and lack of taxonomic resolution
in early food webs but find that they and a non-food-web source provide trophic
information about a large number of taxa absent from more modern studies.
Corroboration of
Keywords: food webs, visualization, data integration, taxonomy
INTRODUCTION
Ecologists are performing more meta-analyses and many researchers are integrating their datasets to achieve
analyses that are unparalleled in geographic, historic, and topical scope (e.g.,
review by Storch and Gaston, 2004; publications from NCEAS,
http://nceas.ucsb.edu). Integrating large numbers of datasets is fraught with
pitfalls, and problems are difficult to catch: have like elements been properly
merged, have all inconsistencies among datasets been recognized and handled,
are appropriate metadata available for further assessment of dataset quality? Furthermore,
exploration of these multiple datasets for trends or testable patterns prior to
statistical analysis is tedious, as it relies on complex SQL queries,
spreadsheet macros, or specialized applications.
Data
sharing itself poses an additional set of problems. A corpus of data is
typically chosen by investigators for a particular purpose, and then made
available to others, perhaps in a data clearinghouse. However, other
investigators may have different criteria for their own purposes and may need
only a subset, or they may choose an intersection of two or more corpuses, with
some datasets appearing in more than one corpus. These are problems faced by
any domain where multiple data sources must be explored and selected for
further analysis.
In
this paper, we are particularly interested in addressing these
problems for ecological interaction analysis. The food web research community
has a long history of data sharing and integrated analyses (reviewed in Pimm, 2002). Here the datasets typically
involve networks of organisms and their trophic relationships, as well as
associated population characteristics, flows, and organismal attributes. While
there are clearly trophic relationships, or links, within datasets, many
organisms have been studied in multiple places by multiple researchers in the
same or different habitats, and so there are relationships across datasets as
well.
Though
graph visualizations are often used in network analyses (Lima,
2006), these have typically focused on visualizing one network at a time.
They emphasize the nature of the linkages among nodes in a particular network.
Most food web visualizations use a node-link diagram, laid out in 2-D or 3-D
space (e.g., Christian and Luczkovich, 1999; Dunne et al.,
2006). Primary tasks they support are identifying clusters
and the distribution of node or link attributes across these clusters. Where
multiple webs are available for visualizing, they must be viewed one at a time
with no support for choosing which web to view.
PaperLens
(Lee et al., 2005), winner of the 2004 InfoVis contest (Fekete
et al., 2004), and its
successor NetLens (Kang et al., in press)
illustrate an alternative approach. PaperLens was designed to allow analysis of
trends in research publications and exploration of topics, authors and other
publication metadata. It provides easily sorted and scrolled tables, whose
items are coupled to relevant items in related tables. Linkages are revealed
primarily by interaction with these tables, but also by a Degrees of Separation Links diagram.
Information is summarized into bar charts, and also linked to the items in
tables that generate the bars. PaperLens was designed for exploration of a digital
library, not for visualizing scientific data.
In
this paper, we describe the selection and integration of
datasets. Then, we describe EcoLens, an enhanced version of PaperLens that
provides effective filtering, querying, and visualizing of multiple ecological interaction
datasets. We give examples of results gained by using it on the integrated
database and the results of a qualitative evaluation of EcoLens. Finally, we
summarize lessons learned and propose new directions for tool development.
METHODS
The goal of our theoretical ecology research is to develop effective
algorithms for predicting trophic links in a system where they are unknown, or
where conditions and therefore the known trophic linkages will change. At
present we focus on presence or absence of links, i.e. the basic network
topology critical for more sophisticated network analysis and modeling projects
such as those of Christian
and Luczkovich (1999). Our approach (Parr, in prep) is to take advantage of
large numbers of known trophic links and use similarity in attributes or
evolutionary relationships to predict whether links exist among organisms whose
trophic links have not been studied. Thus, the datasets to integrate include
studies of food webs throughout the world, including the names of organisms,
their links, and metadata about each of these studies such as when, where, and
in what habitat it was conducted. Furthermore, our
algorithms require information about the evolutionary relationships among
organisms, and the attributes of those organisms. The database should be
maintained online in order to better integrate with SPIRE forecasting tools
(http://spire.umbc.edu, e.g., Parr et
al., in press), and
potentially for integration with our other project datasets and tools.
Below we describe the original data sources and our process of
modifying them to integrate into our database. A current MySQL database schema
is available at http://www.cs.umd.edu/hcil/biodiversity.
Taxonomy and evolutionary information
We followed the integrated classifications as in Parr et al. (2004) for animals and ITIS (2006) for plants and other organisms. These compilations of
multiple sources provide an internally consistent source of names and allow us
to use phylogenetic or taxonomic relationships among food web nodes in our
other analyses (in prep). Information from these sources did not need to be
modified in order to be integrated. We made some effort to identify and replace
synonyms in the food web data with current names using these sources.
Typographical errors in food web node names were also fixed wherever they were
identified.
Ecological interaction data
We obtained ecological interaction data from online repositories in the
most machine-readable format we could obtain (usually ASCII files, but
occasionally MS Excel spreadsheets or PDFs). We focused initially on integrating
large multi-dataset sources which have already been subject to multiple analyses,
such as Cohen's EcoWEB (Cohen, 1989), the trophic webs at the NCEAS Interaction Web
Database (Vazquez, 2005), and the Webs on the Webs corpus (Dunne et al., 2006). We also included two webs specifically to see how
EcoLens can handle taxon list comparisons (Jonsson et al., 2005). We emphasize that these are merely a starting point
and not intended to be wholly representative of the data available. Integration
of webs was achieved primarily on two dimensions, habitat and organism names.
Habitat categorization was determined by manual inspection of original data
files or published studies. It followed the moderately rich biome categories
used by the Animal Diversity Web ontology (Parr et al., 2005).
Mapping
food web entities (nodes) to the scientific names in our evolutionary data
involved 1) moving modifiers such as age and size classes and
other descriptors to other database fields; 2) searching taxonomic databases for exact or
approximate matches to scientific names; and 3) determining the most appropriate
scientific name or names for a common name. Step 1 was accomplished largely by
scripts written in Java (available upon request); steps 2 and 3 were handled first by scripts accessing our own database
information, then remaining typos and synonyms were handled by manual
inspection and searching of external sources such as FishBase.org or Google.com.
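Steps 1 and 2 of this mapping can be illustrated with a minimal sketch. It is not the Java code used for the project: the modifier list, the function names, and the similarity cutoff are all assumptions chosen for illustration, and approximate matching here relies on Python's standard difflib module rather than on any taxonomic service.

```python
import difflib
import re

# Hypothetical modifier words; the actual scripts may recognize a different set.
MODIFIERS = {"juvenile", "adult", "larval", "larvae", "small", "large"}

def split_modifiers(node_name):
    """Step 1: separate descriptors (age or size class) from the name proper."""
    words = re.findall(r"[A-Za-z]+", node_name)
    modifiers = [w for w in words if w.lower() in MODIFIERS]
    name = " ".join(w for w in words if w.lower() not in MODIFIERS)
    return name, modifiers

def match_scientific_name(name, known_names, cutoff=0.85):
    """Step 2: try an exact match, then fall back to the closest approximate match."""
    if name in known_names:
        return name
    close = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return close[0] if close else None  # None means manual inspection is needed

# Toy lookup list; the node name below contains a deliberate typo.
known = ["Balanus balanoides", "Scomber japonicus", "Engraulis japonicus"]
name, modifiers = split_modifiers("juvenile Scomber japonicas")
matched = match_scientific_name(name, known)
print(name, modifiers, matched)
```

Names for which no candidate clears the cutoff are returned as None, which corresponds to the cases left for manual inspection.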
In
some cases, a node name needed to be mapped to multiple taxonomic names. For
example, "birds of prey" becomes Falconiformes and Strigiformes; "foxes" becomes
the individual fox species or genera known to occur in the geographic location of the study.
This process is in effect the opposite of constructing trophospecies, where a
name is found for a group of taxa that all share the same trophic link. Trophospecies
can be sufficient for understanding trophic relationships and a way to avoid
problems with taxonomic resolution (reviewed in Dunne, 2005), but de-aggregating trophospecies into taxa is
critical for integrating data across multiple food web datasets. We mapped to
the narrowest scientific name possible to include all the likely instances in
the food web.
For
those nodes where a taxonomic name was not possible to assign, we mapped to a
controlled vocabulary. For example, Dissolved Organic Matter in one food web
and DOM in another were both mapped to the same name, but DOM and POM (Particulate
Organic Matter) were mapped to different names. We will refer to these also as
taxon names though of course these are not evolutionary units or groupings.
With
de-aggregated taxa, it is necessary to de-aggregate links. When a trophic link
was reported between two nodes, and one or both of the nodes map to more than
one taxon name, we assumed a link between every taxon mapped from one node and every taxon mapped from the other. Webs
from Jonsson et al. (2005) were already provided in both aggregated and
de-aggregated forms.
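The link de-aggregation rule amounts to a Cartesian product over the taxa mapped from each end of a reported link. The sketch below assumes links are stored as (consumer, resource) pairs and that unmapped nodes stand for themselves; the function name and the "small rodents" mapping are hypothetical and serve only as an example.

```python
from itertools import product

def deaggregate_links(links, node_to_taxa):
    """Expand node-level links into taxon-level links (every consumer x resource pair)."""
    taxon_links = set()
    for consumer, resource in links:
        # A node with no mapping is assumed to already be a single taxon.
        consumers = node_to_taxa.get(consumer, [consumer])
        resources = node_to_taxa.get(resource, [resource])
        taxon_links.update(product(consumers, resources))
    return taxon_links

node_to_taxa = {
    "birds of prey": ["Falconiformes", "Strigiformes"],
    "small rodents": ["Microtus", "Peromyscus"],  # hypothetical mapping
}
links = [("birds of prey", "small rodents")]
taxon_links = deaggregate_links(links, node_to_taxa)
print(sorted(taxon_links))
```

One reported link between two aggregated nodes thus yields four taxon-level links, which is why de-aggregated webs can have many more links than the published versions.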
Non-web trophic information
Most food web research involves trophic links reported in the context
of source or sink or population webs. Source webs include a basal organism and
all the organisms that eat it and the relationships among them. Sink webs
include a top predator and all the organisms it eats. Population webs include a
community of organisms and all the links among them. We used categorizations
from the original sources to indicate each web's type. In addition, our schema
allows evidence of trophic links that do not come from web studies at all, but
from sources that report only lists of prey for a given predator or lists of
predators for a given prey or general food habits. This type of
information is readily available in online encyclopedias and greatly increases
the scale of available knowledge in terms of taxa, habitat, and geographic coverage.
It augments the more comprehensive food web studies.
Food web attributes
Food web researchers often compare overall web characteristics, such as
number of species or nodes (S), number of links (L), and connectance (L/S^2).
We included the published or original values for the webs, and others we calculated
based on our new de-aggregated taxa and links. Though other quantitative
attributes of webs are possible to calculate (Pimm,
2002), we did not attempt to do so because of known lack of comparability of
these measures across datasets collected under diverse conditions. Quantitative
link strength or flow measures were not possible to compute because we
currently use only presence/absence data, which are more widely available. Given
our newly mapped taxon names, we were able to compute the percentage of each
web's taxa which were species (or subspecies), above species, or unknown
(either the entity is not truly taxonomic or its level could not be determined).
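As a worked illustration of these attributes, the following sketch computes S, L, connectance, and the rank percentages from a set of taxon-level links. The function names, the rank labels, and the toy web are assumptions; taxa without a recorded rank are counted as unknown.

```python
SPECIES_RANKS = {"species", "subspecies"}

def web_statistics(taxon_links):
    """Return S (taxa), L (distinct links), and connectance L/S^2 for one web."""
    links = set(taxon_links)
    taxa = {taxon for link in links for taxon in link}
    s, l = len(taxa), len(links)
    connectance = l / s**2 if s else 0.0
    return s, l, connectance

def rank_percentages(taxa, taxon_rank):
    """Percentage of taxa at species level, above species, or of unknown rank."""
    counts = {"species": 0, "above species": 0, "unknown": 0}
    for taxon in taxa:
        rank = taxon_rank.get(taxon)
        if rank is None:
            counts["unknown"] += 1
        elif rank in SPECIES_RANKS:
            counts["species"] += 1
        else:
            counts["above species"] += 1
    total = len(taxa)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}

# Toy web with placeholder taxa A-D; D has no taxonomic rank (e.g. detritus).
toy_links = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
toy_ranks = {"A": "species", "B": "subspecies", "C": "order"}
s, l, c = web_statistics(toy_links)
pct = rank_percentages({"A", "B", "C", "D"}, toy_ranks)
print(s, l, c, pct)
```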
Taxon attributes
The common name and rank of each organism were obtained from the
taxonomic sources described above. To demonstrate how natural history
characteristics could be integrated, we downloaded
maximum mass from Animal Diversity Web (Myers et al., 2006) using their advanced
inquiry search. Finally, we determined from our data tables the number of food
web studies in which each organism appears.
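Counting the studies in which each taxon appears is a simple tally over the web-to-taxa table. The sketch below assumes that table is available as a dictionary from web identifier to a list of taxon names; the identifiers and contents are toy values, not records from the database.

```python
from collections import Counter

def studies_per_taxon(web_taxa):
    """Count, for each taxon, the number of webs in which it appears."""
    counts = Counter()
    for taxa in web_taxa.values():
        counts.update(set(taxa))  # a taxon counts once per web
    return counts

def percent_in_multiple_studies(counts):
    """Percentage of distinct taxa that appear in more than one web."""
    shared = sum(1 for n in counts.values() if n > 1)
    return 100.0 * shared / len(counts)

web_taxa = {
    "web1": ["Detritus", "Homo sapiens", "Scomber japonicus"],
    "web2": ["Detritus", "Balanus balanoides"],
}
counts = studies_per_taxon(web_taxa)
shared_pct = percent_in_multiple_studies(counts)
print(counts.most_common(1), shared_pct)
```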
The goal of EcoLens is to allow biologists to explore a collection of
food webs, find webs of interest, and then visualize an individual food web. As
described above, our data consists of several
elements such as food web study details, taxa, and habitats. Inspired by our
successful experience with PaperLens, we tightly coupled multiple views to show
relationships among these data elements. Within each food web, trophic
relationships among taxa are important as well. Therefore, we wanted our design
to combine the PaperLens overview technique with a guiding metaphor proposed in
TreePlus (Lee et
al., in press): Plant a seed and watch it grow. Using this philosophy, users start
with a specific node and incrementally explore the network, avoiding complexity
until it is necessary. Through the overview, users can
easily find not only interesting trends and patterns in the dataset but also particular
webs of interest. Once they find desired webs to look at, they can investigate
each food web. We consider labels essential for both overviews and details, while the need to see
every single item in a single overview is not important. While EcoLens is
implemented with a food web dataset, we aimed to support a variety of general tasks
having to do with understanding multiple datasets and their integration.
For internal evaluation, we constructed a list of questions that
EcoLens might help a biologist answer. We then tried using EcoLens to answer
these questions and reported the results in the Dataset characteristics section
below.
For external evaluation, we asked ten ecologists to use
EcoLens several times and fill out a survey. Four responded. These ecologists
had contributed data and were asked to evaluate the mapping of the food web nodes
to taxonomic names. We also asked them specific questions
about the interface. The survey included both Likert-scale questions and
open-ended questions. This kind of formative
qualitative evaluation is not expected to demonstrate clear advantages
over existing systems but to provide insight into advantages worth quantitative
study.

Figure 1. EcoLens provides easy exploration of relational
data tables by sorting and selecting in tabular form from complete lists to selected
lists (for
both web list and taxon list views, the list at the bottom is the complete list and
the one at the top is the selected list),
coupled with graphical representations in a bar chart (1), degrees of separation view (4), and network visualization (5).
As shown in Fig. 1, EcoLens consists of five main views: 1) web habitats; 2) web list; 3) taxon list; 4) degrees of
separation links (DOSL); and 5) TreePlus. The web habitats view
shows the list of habitats with the number of food webs in
each one. Users can sort the view either by habitat name or by the number of webs. The bottom of the web list view shows
all the webs in the database. When some webs are selected either by users or by
the system, they are shown in the Selected Webs list at the top of the view.
Similarly, the
bottom of the taxon list view shows all the taxa in the database and the currently
selected taxa are shown in the Selected Taxa list. When users double click on a
taxon in the Selected Taxa list, EcoLens opens a dialog box to show the list of studies that contain the selected taxon (Fig. 2). The
TreePlus view visualizes the food web as a node-link, tree-like diagram and the
DOSL view shows one of the food chains from one taxon to the other in the web currently
being visualized by the TreePlus view.

Figure 2. When users double click
on a taxon, Balanus balanoides, in the Selected Taxa list, EcoLens opens a
dialog box to show the list of studies that
contain the selected taxon.
These views are tightly coupled. When users select a
habitat in the web habitats view, all the webs from the
selected habitat are highlighted in the Webs list. Furthermore, they are
displayed in the Selected Webs list for easy access. In addition, all the taxa
in these webs are highlighted in the Taxa list and
displayed in the Selected Taxa list. For these three views (web habitats, web
list, and taxon list), user interactions are symmetric. For example, users can
select webs from the Webs list to see habitats for particular food webs or get
lists of taxa. Habitats of the selected webs are highlighted in the web habitats
view and taxa in the selected webs are shown in the taxon list view. Users can
also copy reference information of the selected food webs to the clipboard by
selecting the Copy Reference menu option after right clicking on the selected
food webs.
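The coupling among these three views can be thought of as propagating a selection through two lookup tables. The following sketch is not the C# implementation; it assumes one habitat per web and uses hypothetical function names and toy data to show the symmetric propagation in both directions.

```python
def select_habitat(habitat, web_habitat, web_taxa):
    """Selecting a habitat yields the webs in it and the taxa in those webs."""
    webs = sorted(w for w, h in web_habitat.items() if h == habitat)
    taxa = sorted({t for w in webs for t in web_taxa[w]})
    return webs, taxa

def select_webs(webs, web_habitat, web_taxa):
    """Selecting webs yields their habitats and the taxa they contain."""
    habitats = sorted({web_habitat[w] for w in webs})
    taxa = sorted({t for w in webs for t in web_taxa[w]})
    return habitats, taxa

# Toy tables standing in for the database.
web_habitat = {"web1": "marine", "web2": "marine", "web3": "desert"}
web_taxa = {
    "web1": ["Detritus", "Balanus balanoides"],
    "web2": ["Detritus", "Scomber japonicus"],
    "web3": ["Detritus", "Homo sapiens"],
}
marine_webs, marine_taxa = select_habitat("marine", web_habitat, web_taxa)
habitats, taxa = select_webs(["web3"], web_habitat, web_taxa)
print(marine_webs, marine_taxa, habitats, taxa)
```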
Users
may visualize an individual food web in the TreePlus view to see trophic links among taxa by double clicking on
a web in the Selected Webs list or in the Webs list. They can also press the
Graph It button after clicking on a web in the Selected Webs list. EcoLens then
builds a food web from the database and visualizes it using TreePlus. Since it uses
the default root selection mechanism in TreePlus, the taxon with the most
connections to others is chosen to be the root. EcoLens also adds all
taxa in the web to the From combo box in the degrees of separation links
view. Once a taxon is selected from the From combo box, EcoLens displays all
the taxa reachable from the selected taxon in the To combo box with the
corresponding degrees of separation. When a taxon is selected from the To
combo box, EcoLens displays one of the shortest food chains between the two taxa. When
users click on a node in the degrees of separation links view, EcoLens opens
the selected taxon within TreePlus. Similarly, when users click on a node in
the TreePlus view, EcoLens highlights the selected taxon within the degrees of
separation links view if it is already displayed.
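Degrees of separation and one shortest chain can both be obtained from a single breadth-first search. The sketch below is an illustration under stated assumptions, not the EcoLens code: it follows links in one direction only (from consumer to resource), and the adjacency table uses placeholder taxa.

```python
from collections import deque

def degrees_of_separation(adjacency, start):
    """Breadth-first search: distance and predecessor for each reachable taxon."""
    dist, parent = {start: 0}, {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parent[nxt] = node
                queue.append(nxt)
    return dist, parent

def shortest_chain(parent, target):
    """Walk predecessors back from target to recover one shortest chain."""
    chain = []
    while target is not None:
        chain.append(target)
        target = parent[target]
    return chain[::-1]

# Placeholder web: each key eats the taxa listed for it.
adjacency = {
    "predator": ["squid", "fish"],
    "squid": ["shrimp"],
    "fish": ["shrimp"],
    "shrimp": ["plankton"],
}
dist, parent = degrees_of_separation(adjacency, "predator")
chain = shortest_chain(parent, "plankton")
print(dist, chain)
```

Here two equally short chains lead to "plankton"; the search returns whichever it reaches first, which matches the behavior of displaying one of the shortest chains.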
Based
on requests from users, an Export to Excel feature was implemented for the
web and taxon list views. This launches Microsoft Excel with the table data
contained in the list where the button was used. For web lists, averages and
standard deviations are automatically calculated for web statistics.
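The summary rows added on export can be sketched as follows. This is only an illustration: the column names and values are toy data, and the use of the sample (rather than population) standard deviation is an assumption, since the exported spreadsheet's formula is not specified here.

```python
import statistics

def summarize_webs(rows, columns=("S", "L")):
    """Return {column: (mean, sample standard deviation)} over the listed webs."""
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        mean = statistics.mean(values)
        # stdev needs at least two values; report None for a single web.
        sd = statistics.stdev(values) if len(values) > 1 else None
        summary[col] = (mean, sd)
    return summary

# Toy web list: S is the number of taxa, L the number of links.
rows = [
    {"web": "web A", "S": 10, "L": 15},
    {"web": "web B", "S": 20, "L": 40},
    {"web": "web C", "S": 30, "L": 65},
]
summary = summarize_webs(rows)
print(summary)
```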
EcoLens
was implemented in C# and runs on any standard Windows PC. To visualize each
food web, it uses TreePlus (Lee et
al., in press), a reusable component we developed to visualize networks. The web
habitats and degrees of separation links views are implemented with Piccolo.NET,
a shared source toolkit that supports scalable structured 2D graphics (Bederson
et al., 2004). EcoLens accesses the MySQL server using a MySQL data provider, which
links a data source and .NET code. To make each window dockable, it uses the
DockPanel Suite (Luo,
2006), an open source docking library. EcoLens is now available for download
at http://www.cs.umd.edu/biodiversity/#EcoLens.
TreePlus is an interactive graph visualization component based
on a tree layout approach. It transforms a graph into a tree plus cross links
(i.e. the additional links that are not represented by the spanning tree) and
visualizes the tree instead of the graph. TreePlus uses a guiding metaphor of
Plant a seed and watch it grow. This allows users to start with a node and
expand the graph as needed.
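The decomposition of a graph into a tree plus cross links can be sketched with a breadth-first spanning tree. This is an assumption made for illustration: the text does not state how TreePlus builds its tree, and TreePlus expands the tree incrementally rather than all at once. The function names and the toy edge list are hypothetical; the root is chosen as the most connected node, as in the default root selection.

```python
from collections import Counter, deque

def default_root(edges):
    """Pick the node with the most connections (incoming plus outgoing)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree, key=degree.get)

def tree_plus_cross_links(edges, root):
    """Split edges into a BFS spanning tree (as a parent map) and cross links."""
    neighbors = {}
    for a, b in edges:  # direction is ignored when building the tree
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    parent, tree_edges = {root: None}, set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                tree_edges.add(frozenset((node, nxt)))
                queue.append(nxt)
    cross = [e for e in edges if frozenset(e) not in tree_edges]
    return parent, cross

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
root = default_root(edges)
parent, cross = tree_plus_cross_links(edges, root)
print(root, parent, cross)
```

The cross links are exactly the edges that the tree layout cannot show directly, which is the structure that highlighting and previews are meant to reveal.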
TreePlus
reveals the missing graph structure with visualization and interaction
techniques while preserving good label readability. It highlights and previews
adjacent nodes when a node is focused by a single click (Fig. 3). TreePlus updates the tree structure when a node is
opened by a double click. TreePlus carefully animates the transitions[1] so
that users can follow changes. The color of the node background and arrows
indicates the link
direction relative to the focus node. TreePlus
uses the color blue for outgoing links, red for incoming links, and purple for
bidirectional links. For example, in Fig. 3, the red node
(Homo sapiens) eats Todarodes pacificus while Todarodes pacificus eats blue nodes
(e.g., Sergia
lucens). TreePlus also provides users with the option to show preview bars representing how
fruitful it would be to go down a path. Color bar graphs placed below the nodes
represent how many organisms are reachable in each direction.
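The color rule and the counts behind the preview bars can be expressed compactly. The sketch below assumes links are stored as (consumer, resource) pairs, so a link is outgoing from the focus node when the focus node is the consumer; the function names and the toy web are hypothetical.

```python
def link_color(focus, other, links):
    """Color of a neighbor relative to the focus node, or None if unlinked."""
    outgoing = (focus, other) in links
    incoming = (other, focus) in links
    if outgoing and incoming:
        return "purple"
    if outgoing:
        return "blue"
    if incoming:
        return "red"
    return None

def reachable_count(start, links, direction="out"):
    """Number of other nodes reachable from start along one link direction."""
    step = {}
    for src, dst in links:
        a, b = (src, dst) if direction == "out" else (dst, src)
        step.setdefault(a, set()).add(b)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in step.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

# Placeholder web around a focus node, including one bidirectional link.
links = {("predator", "focus"), ("focus", "prey"), ("prey", "plankton"),
         ("focus", "mutual"), ("mutual", "focus")}
colors = {n: link_color("focus", n, links) for n in ("predator", "prey", "mutual")}
out_count = reachable_count("focus", links, "out")
in_count = reachable_count("focus", links, "in")
print(colors, out_count, in_count)
```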

Figure 3. Homo sapiens was set as
the root, and users selected Scomber japonicus which added all its adjacent nodes to the tree. A single click on Engraulis japonicus gives it the focus and shows a preview of
its adjacent nodes in the preview panel on the right. Red or blue color
indicates the direction of the link.
These results about the database were obtained using EcoLens and simple
spreadsheet manipulations. Trends identified here could subsequently be
analyzed statistically but that is beyond the scope of the present exercise.
Example 1: Obtaining an overview of the integrated
data
Using EcoLens, it is possible to quickly get an overview of the
integrated dataset (Table 1). The most frequently studied habitat in our database
is plant substrates, including data from 13 countries involving 543 distinct
taxa. Of the 4594 distinct taxa, 28% are found in more than one study. Studies
have increasingly included more taxa and links over time, and
early webs appear to need more de-aggregation
than later webs (Table 1). Across webs, a wide variety of taxa have been
studied. Detritus is the most often included taxon in food webs (N=140). Humans
appear in 20 different food webs, and two of these are prehistoric food webs.