The following paper is copyright IEEE (Institute of Electrical and Electronics Engineers) and cannot be reproduced without the express written permission of the IEEE.

 

 

A Prototype Metadata Database for Online Analytical Processing of Environmental Data

 

Harold Geller

CEOSR, George Mason University

hgeller@science.gmu.edu

John Ertlschweiger

BDM Corporation

jertlsch@erols.com

Sarah Conger

Hughes/STX

conger@mustang.nrl.navy.mil

August Ryberg

PRC, Inc.

aryberg@science.gmu.edu

 

 

Abstract

We present preliminary results on the development of a prototype database system that demonstrates the utility of integrating environmental metadata within an online analytical processing environment. We used existing data derived from CD-ROMs of the National Snow and Ice Data Center (NSIDC), the Consortium for International Earth Science Information Network (CIESIN) and the U.S. Geological Survey (USGS). We populated a prototype metadata database whose architecture facilitates scientific and statistical investigation of geophysical parameters associated with the polar regions and allows data fusion with other regions and earth science disciplines, thereby supporting interdisciplinary studies. The user can combine information from two disparate sources of geophysical data in a single query that yields a useful product. Furthermore, we demonstrate the utility of providing access to this database via the World Wide Web through an interface to the underlying Oracle database management system. Figure 1 summarizes the overall approach.

 

 

 

Figure 1. Polar Ice and Aerosol Data System Diagram

 

 

 

 

1. Background and Technical Approach

 

Scientists who wish to analyze geophysical parameters in the polar regions often obtain data from central depositories such as the National Snow and Ice Data Center (NSIDC). The data often require reformatting for scientific visualization and statistical analysis. Our system architecture allows such data to be queried and accessed while giving the researcher a means of inspecting the data and determining which data should be studied further. The architecture is loosely analogous to a relational online analytical processing (ROLAP) architecture in that it is based on a relational database management system (RDBMS), includes an SQL generator optimized for the target database, is client-server based, and is transparent to multiple users. Questions such as the extent of sea ice coverage over a period approaching a solar cycle can be addressed quickly and efficiently.

Our approach was to use current metadata standards in developing our database schema so that new datasets could be easily integrated into our system in the future, or so that our datasets could be integrated into larger collections [3]. Our prototype was based upon work done at the Consortium for International Earth Science Information Network (CIESIN) [2,4]. We sought to exploit the accessibility of the World Wide Web (Web) for the exploration of such data. Initially we formulated a series of questions that the user would be able to answer using the prototype. This required the development of statistical metadata [6]. Figure 2 summarizes the work required for this type of prototype effort as well as its connection to work previously completed. This prototype development effort is specifically depicted on the right half of Figure 2.

 

Figure 2. Metadata Database Workflow Summary

 

2. Design and Implementation Discussion

 

We chose to develop a prototype that would demonstrate simultaneous access to both sea ice concentration data and selected aerosol concentration data. Each dataset has its own metadata and data structure. For the purposes of this prototype effort we gathered both observed aerosol data and sea ice concentration data, established metadata descriptions of these data, and determined appropriate schemas for the underlying metadata database. Our prototype Web-based query generator therefore queries two metadata databases, and in this sense is heterogeneous.

The polar ice data consist of processed image data from January 1985 through December 1990 for geophysical parameters including ice velocity, sea surface temperature, sea ice concentration, sea surface wind speed, and cloud coverage, represented by monthly averages. Each parameter is divided into subclasses, or bins, each defining a specific range of data values. Each data bin contains the number of pixels that satisfy the subclass characteristics, and each pixel represents a 25 by 25 kilometer square on the surface of the Earth. For our prototype, only sea ice concentration was decomposed into concentration subclasses. Bin ranges may differ from parameter to parameter; in every case, however, a bin holds the number of pixels whose values fall within its range.

Ice concentration is measured in percent coverage of the surface within the footprint area. The sea ice concentration parameter has eleven data bins. The aerosol species data were organized in a manner significantly different from that of the polar ice data. However, the metadata schema developed for the aerosol database was intended to support queries addressing both sea ice concentration data and aerosol data. Our prototype was implemented using Oracle, which allows new parameters to be added without adversely affecting the existing data or tables in the relational database. Additionally, triggers were implemented in the prototype to update metadata parameters automatically.
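
The automatic updating of metadata by triggers can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than Oracle, purely so that the example is self-contained. The table names (ice_histogram, parameter_metadata), the column names, and the choice of summary fields are hypothetical; they are not taken from the prototype schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a histogram table, a metadata
    table, and a trigger that keeps the metadata current on every insert."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical histogram table: one row per (year, month, bin).
        CREATE TABLE ice_histogram (
            year        INTEGER NOT NULL,
            month       INTEGER NOT NULL,
            bin         INTEGER NOT NULL,
            pixel_count INTEGER NOT NULL
        );
        -- Hypothetical summary metadata maintained by the trigger.
        CREATE TABLE parameter_metadata (
            parameter    TEXT PRIMARY KEY,
            record_count INTEGER NOT NULL,
            total_pixels INTEGER NOT NULL
        );
        INSERT INTO parameter_metadata
            VALUES ('sea_ice_concentration', 0, 0);
        -- Fires once per inserted row and updates the summary in place.
        CREATE TRIGGER ice_histogram_after_insert
        AFTER INSERT ON ice_histogram
        BEGIN
            UPDATE parameter_metadata
               SET record_count = record_count + 1,
                   total_pixels = total_pixels + NEW.pixel_count
             WHERE parameter = 'sea_ice_concentration';
        END;
    """)
    return conn

conn = build_demo_db()
conn.executemany(
    "INSERT INTO ice_histogram VALUES (?, ?, ?, ?)",
    [(1985, 1, 0, 120), (1985, 1, 10, 80)],
)
summary = conn.execute(
    "SELECT record_count, total_pixels FROM parameter_metadata"
).fetchone()
print(summary)
```

Because the trigger runs inside the database, any client that inserts histogram rows leaves the metadata consistent without extra application code.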

Our prototype interface was developed with the following criteria in mind: portability; ease of use; advanced visualization capability; the ability to do initial analysis on-line; and easy access to the data. We first present the user with an HTML page with links to several query forms. The user selects one of the forms, then enters a query by selecting parameters (e.g. cloud cover) and a date or time period. This query is submitted to the query engine, a common gateway interface (CGI) script, for processing. The script then parses the HTML form query; translates the query into SQL; passes the query to the Oracle database; formats the database response; and presents the results to the user as an HTML page created on-the-fly with embedded Java applets and inline JPEG images. This Web-based query interface implementation is depicted in Figure 3.
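
The processing steps of the query engine (parse the form, translate to SQL, query the database, format the response) can be sketched as below. This is an illustrative sketch only, not the prototype's CGI script: it is written in Python with an in-memory SQLite database standing in for Oracle, the form field names (parameter, year, month), the table name histogram, and the parameter names are all assumptions, and the output is a plain HTML table without the Java applets and JPEG images of the actual interface.

```python
import html
import sqlite3
from urllib.parse import parse_qs

# Hypothetical parameter names accepted by the form.
ALLOWED_PARAMETERS = {"cloud_cover", "sea_ice_concentration"}

def parse_form(query_string):
    """Step 1: parse the submitted form into validated Python values."""
    fields = parse_qs(query_string)
    parameter = fields["parameter"][0]
    if parameter not in ALLOWED_PARAMETERS:
        raise ValueError(f"unknown parameter: {parameter}")
    return parameter, int(fields["year"][0]), int(fields["month"][0])

def to_sql(parameter, year, month):
    """Step 2: translate the request into a parameterized SQL statement."""
    sql = ("SELECT bin, pixel_count FROM histogram "
           "WHERE parameter = ? AND year = ? AND month = ? ORDER BY bin")
    return sql, (parameter, year, month)

def to_html(rows):
    """Step 4: format the database response as an HTML table."""
    cells = "".join(
        f"<tr><td>{html.escape(str(b))}</td><td>{html.escape(str(n))}</td></tr>"
        for b, n in rows
    )
    return f"<table><tr><th>bin</th><th>pixels</th></tr>{cells}</table>"

def handle_request(conn, query_string):
    """Run the whole pipeline; step 3 is the call to conn.execute."""
    sql, args = to_sql(*parse_form(query_string))
    return to_html(conn.execute(sql, args).fetchall())

# Small demonstration with made-up pixel counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE histogram (parameter TEXT, year INTEGER, "
             "month INTEGER, bin INTEGER, pixel_count INTEGER)")
conn.executemany("INSERT INTO histogram VALUES (?, ?, ?, ?, ?)",
                 [("cloud_cover", 1985, 1, 0, 120),
                  ("cloud_cover", 1985, 1, 10, 80)])
page = handle_request(conn, "parameter=cloud_cover&year=1985&month=1")
print(page)
```

Keeping the user-supplied values out of the SQL text and passing them as bound parameters is a design choice of this sketch; the original script's handling of input is not described in the paper.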

Image data were initially analyzed using commercially available data analysis software, from which we developed frequency distributions of pixels with specified values (i.e. histograms) of sea ice concentration or another parameter. The histograms serve as metadata that can be queried using the SQL*Plus module of Oracle. Each record in the database contains a value which is best interpreted as an area of the polar region (i.e. a number of 25 by 25 kilometer square regions) sharing the same value of the geophysical parameter. Thus, each region covers 625 square kilometers and represents the geophysical parameter averaged by month.
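
The relationship between pixel counts and surface area can be made concrete with a short sketch. The paper states that sea ice concentration has eleven bins but does not give their edges; the layout assumed here (0-9, 10-19, ..., 90-99, and exactly 100 percent) is a guess made only for illustration, as are the function names and the sample pixel values.

```python
from collections import Counter

PIXEL_AREA_KM2 = 25 * 25  # each pixel covers a 25 km by 25 km square

def bin_label(concentration_percent):
    """Map a concentration (0-100 percent) to the lower edge of its bin.
    Assumed layout: 0-9 -> 0, 10-19 -> 10, ..., 90-99 -> 90, 100 -> 100."""
    return min(int(concentration_percent // 10), 10) * 10

def histogram(pixels):
    """Count the pixels that fall in each bin."""
    return Counter(bin_label(p) for p in pixels)

def area_by_bin(hist):
    """Convert pixel counts to areas in square kilometers."""
    return {b: n * PIXEL_AREA_KM2 for b, n in sorted(hist.items())}

# Made-up pixel values for one monthly image.
pixels = [0, 5, 12, 95, 100, 100]
areas = area_by_bin(histogram(pixels))
print(areas)
```

A histogram record of n pixels therefore corresponds to n times 625 square kilometers of surface sharing that bin of the parameter.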

 

Figure 3. Query Interface Implementation

 

A representative forms-based Web page of the browse imagery is depicted in Figure 4. In the lower portion of the page, the user chooses the month, year and parameter for which an image is available. The user then clicks the submit button, and the image is displayed in the upper portion of the screen. These choices are identical on both halves of the screen, allowing the user to display and compare the images for the requested months and years. The user can bring up a histogram of each image in a separate window or download a TIFF version of the image. The actual images are not stored within the Oracle database; the database maintains pointers to files located on the Web server.

 

Figure 4. Forms-Based Web Interface

 

The Web-based analytical interface is depicted in Figure 5. Here the user chooses a parameter followed by a qualifier (equals, less than, or greater than) and a value for the bin or bins of interest. The user also chooses a start date (month and year) and an end date (month and year). The user can then submit a request, which is sent as a query to the Oracle database management system. The results are displayed in the upper half of the Web page, originally as a bar chart. This allows users to perform a first-order analysis of the data and determine its usefulness for their studies.
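
The qualifier and date-range selection described above can be sketched as follows. This is a hypothetical illustration in Python against an in-memory SQLite database, not the prototype's Oracle query: the table and column names, the encoding of months as year * 12 + month, the sample pixel counts, and the text-based bar chart are all assumptions.

```python
import sqlite3

# The three qualifiers offered by the form, mapped to SQL operators.
OPERATORS = {"equals": "=", "less than": "<", "greater than": ">"}

def build_query(qualifier, bin_value, start, end):
    """Build a parameterized query for bins matching the qualifier within
    an inclusive (year, month) range. Raises KeyError on an unknown
    qualifier, so only the whitelisted operators reach the SQL text."""
    op = OPERATORS[qualifier]
    sql = ("SELECT year, month, SUM(pixel_count) FROM ice_histogram "
           f"WHERE bin {op} ? AND year * 12 + month BETWEEN ? AND ? "
           "GROUP BY year, month ORDER BY year, month")
    (start_year, start_month), (end_year, end_month) = start, end
    return sql, (bin_value,
                 start_year * 12 + start_month,
                 end_year * 12 + end_month)

def text_bar_chart(rows, pixels_per_mark=50):
    """Render one line per month, with one '#' per pixels_per_mark pixels."""
    return "\n".join(
        f"{year}-{month:02d} " + "#" * (total // pixels_per_mark)
        for year, month, total in rows
    )

# Made-up histogram records: (year, month, bin, pixel_count).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ice_histogram (year INTEGER, month INTEGER, "
             "bin INTEGER, pixel_count INTEGER)")
conn.executemany("INSERT INTO ice_histogram VALUES (?, ?, ?, ?)",
                 [(1985, 1, 80, 100), (1985, 1, 90, 150),
                  (1985, 2, 90, 200), (1985, 3, 50, 300)])

sql, args = build_query("greater than", 70, (1985, 1), (1985, 2))
rows = conn.execute(sql, args).fetchall()
print(text_bar_chart(rows))
```

Summing the pixel counts of the matching bins per month gives one bar per month, which is the kind of first-order view the bar chart display provides.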

 

Figure 5. Web-Based Analytical Interface

Our approach may be viewed as a customized online analytical processing (OLAP) approach integrating a Web user interface. While this does not meet Codd's twelve rules for OLAP [1], the approach was taken due to the available resources. It is analogous to two-tier data warehousing architectures, consisting of a standard relational database management system on a mainframe and a customized query generator on a local system. This raises the question of the applicability of the two major types of OLAP tools available today, that is the MOLAP (multidimensional online analytical processing) and ROLAP.

Modern MOLAP and ROLAP tools are geared to the business community. However, with the coming data deluge in the science community, vendors may incorporate the functional and statistical requirements of that community. Which OLAP technology is best applied in the science community is not a subject for this paper; comparative treatments of this subject are available [5]. The science community appears to be approaching OLAP technology through customized efforts such as the one ongoing within George Mason University's Center for Earth Observing and Space Research (CEOSR) and its prototype development effort of a Virtual Domain Applications Data Center (VDADC).

 

3. Conclusions

 

We have demonstrated one approach to the development of a scientific database which allows for some statistical analyses. This approach was based on a standard relational database management system (Oracle) and a Web-based front end with CGI scripts and Java applets for query construction and display of results. The most obvious advantage of this type of architecture is that it is portable to any platform and is relatively user friendly. This approach allows the user interface to grow and incorporate new technologies as they become available. We believe that Java's portability, which allows analysis tools to be transparently downloaded to the researcher's computer, is a feature that future developments should incorporate. Such enhancements will improve the ability of interdisciplinary researchers to develop and test hypotheses.

Our approach to initial access and examination of scientific data sets is one alternative being examined for implementation within the archive and analysis system at the International Arctic Research Center (IARC). This will support researchers examining both satellite and in situ data archived at the center. IARC is a joint undertaking of the governments of Japan and the United States with facilities located at the University of Alaska Fairbanks.

We did not examine the use of commercially available OLAP tools to address the needs of this or similar scientific database analysis systems. However, we believe that such commercial systems, if enhanced with the assistance of the earth and remote sensing science communities, may become a source of off-the-shelf solutions for future researchers. We can only hope that the commercial vendors view this market with enough vigor to advance this application of their technologies.

 

4. Acknowledgements

 

This work was initiated as partial fulfillment of course requirements at George Mason University, CSI 810 (INFT 864), with Larry Kerschberg and George Michaels. A portion of this work was undertaken as partial fulfillment of course requirements for CSI 996 with Menas Kafatos. We thank Tom Sanders for assistance in preparation of Figure 4. We thank Yannis Ioannidis for numerous comments and suggestions regarding this manuscript. Access to this paper and the Web interface is being made available at the following URL: http://www.site.gmu.edu/~jertlsch/INFT864.

 

 

 

5. References

 

[1] E.F. Codd, "Twelve Rules for On Line Analytical Processing", Computerworld, April 13, 1995.

[2] P. Colvin, F. Tanis, C. Chiesa and H. Geller, "Design and Development of an Arctic Geographic Information System for Global Change Research", Eos, American Geophysical Union, Volume 74, No. 43, 1993, p. 87.

[3] FGDC, The Federal Geographic Data Committee, Content Standards for Digital Geospatial Metadata (June 8), Federal Geographic Data Committee, Washington, D.C., 1994.

[4] H. Geller and P. Colvin, "Utilization of Model and Empirical Data in an Arctic GIS for Geophysical Model Refinement", Eos, American Geophysical Union, Volume 75, No. 44, 1994, p. 88.

[5] N. Raden, "Choosing the Right OLAP Technology", in Planning and Designing the Data Warehouse, R.C. Barquin and H.A. Edelstein, eds., Prentice Hall PTR, Upper Saddle River, New Jersey, 1997, pp. 199-224.

[6] E.P. Shelley and B.D. Johnson, "Metadata: Concepts and Models", in Proceedings of the Third National Conference on the Management of Geoscience Information and Data, organized by the Australian Mineral Foundation, Adelaide, Australia, 18-20 July 1995, pp. 4.1-5.