GENESIS Database Support: SOAR, TSA-Tree and ProPolyne

(Final Technical Report on JPL Contract Nr. 961518)

Cyrus Shahabi

shahabi@usc.edu

Tel.: 213-740-8162

Dept. of Computer Science

University of Southern California

Los Angeles, CA 90089-0781

 

In the past four years, as part of the GENESIS ESIP project, we have built a 3-tier data storage and access architecture, termed SOAR, at the Information Laboratory (http://infolab.usc.edu) of the University of Southern California (USC).  Due to the general-purpose design of SOAR (e.g., using an OR-DBMS) and its standard web protocols (e.g., using web-services), SOAR supports both traditional and spatio-temporal queries over the GENESIS data products.  Also under the GENESIS ESIP project, we have introduced two novel data-mining algorithms, called TSA-Tree and ProPolyne, to expedite the scientific analysis on very large multidimensional data sets.  Our recent results reported in prestigious database conferences such as ACM PODS [Schmidt and Shahabi, 2002b] prove TSA-Tree and ProPolyne to be effective and promising techniques. We now explain the SOAR architecture and the main idea behind TSA-Tree and ProPolyne. 

SOAR:

The current implementation of SOAR provides flexible access to Level 1, 2a and 2b products of GPS occultation data from all three NASA receivers on CHAMP, SAC-C and GRACE.  The three-tier architecture consists of a database server, a middleware and several web-accessible user interfaces. 

At the top tier, there are the three alternative web-accessible user interfaces.  These interfaces can form database queries not only on metadata but also on detailed <lat, long, altitude, time> measurements across occultation files. The results of the queries can be displayed as profile graphs as well as saved into XML files for later data exchanges. Demonstrations of all three user-interfaces are available at: http://infolab.usc.edu/projects/genesis/oldindex.html. Two of the interfaces are simple form-based html pages that can be run from any Internet browser even if Java is not enabled. The other interface, which is more graphical and allows interactions with world-map and profile graphs, is implemented as a Java application and must be downloaded to the user machine before execution.  The download process is automatic and transparent from the user.  The interfaces talk to the middle-tier through the http protocol.

SOAR’s 2nd tier middleware has been implemented as a set of web-services such that the database queries can be activated directly by calling these web-services rather than going through the user interfaces.  This would allow the immediate integration of SOAR into other frameworks. The web-services talk to the application server running JSP codes which are connected in turn to the lowest tier, the database server, through JDBC.

The 3rd tier is an object-relational database management system (OR-DBMS) that is currently Informix Universal Server IUS v9.2. We are proud to have proposed this design in the original GENESIS proposal more than six years ago when no other ESIP had decided to use OR-DBMS’s at the time. The ESRI and Geodetic spatial datablades are installed within the OR-DBMS in order to facilitate spatio-temporal queries based on <lat, long, altitude, time>. A spatial index structure, R-tree, has been used to index these four dimensions to expedite queries.  The upload of occultation files (as they become available) from a server at JPL to SOAR database server at USC’s InfoLab is fully automatic. The XML format has been used for all the intermediate files as well as stored query results, which would facilitate data exchanges with other systems such as ESIP2NET.

Finally, we have demonstrated the integration of an outside radiosondes data source into the SOAR application by wrapping the site and querying it on demand. This operation can be simplified if the remote source follows the Web services framework so that instead of wrapping the site, SOAR could simply call the corresponding web-services. 

TSA-Tree and ProPolyne:

As we were deeply involved in the design, development and testing of SOAR, we realized the very slow response time for a certain type of queries, termed range-aggregate queries. An example of such queries is “Find the average temperatures for each grid cell for a given granularity of lat, long, altitude and time grid”. Although a user can form such a query very easily by utilizing our graphical user interface and drawing a 4-D grid, and the corresponding SQL queries are easy to generate, the database server needs a lot of time to compute the aggregation function (in this example “average”) on the measure attribute (in this example “temperature”) for all the grid cells.  Moreover, the system cannot rely on pre-computations of such queries because in the general case, many parameters are unknown.  That is, the resolution at each dimension (e.g., hourly, daily, monthly on time dimension), the aggregate function (e.g., sum, average, variance, or covariance) and the measure attribute (e.g., temperature, pressure, or altitude) can vary from one query to the other. 

Therefore, two years in the GENESIS project we proposed a wavelet-based index structure, termed TSA-tree, to address this problem [Shahabi et al., 2000; Shahabi et al., 2001a; Shahabi et al., 2001b].  As we learned more about the problem, we realized some of the shortcomings of TSA-tree and hence we generalized our designs to a (still wavelet-based) technique known as ProPolyne:  Progressive Polynomial range-aggregate query evaluation [Schmidt and Shahabi, 2002a].  We extended ProPolyne to support grid queries, an example of which was mentioned above, in a subsequent publication using NASA atmospheric observation data [Schmidt and Shahabi, 2002b].  A demonstration of ProPolyne implemented as web-services using Microsoft .NET framework on top of the NASA atmospheric data sets is available at: http://mahour.usc.edu/proda/.

ProPolyne Details:

ProPolyne can support any polynomial range-aggregate query (up to a degree specified when the database is populated) using a single set of pre-computed aggregates. This extra power comes with little extra cost: the query, update, and storage costs are comparable to the best-known MOLAP (Multidimensional OnLine Analytical Processing) techniques (see [Schmidt and Shahabi, 2002a]). We achieve this by observing that polynomial range-aggregates can be translated and evaluated in the wavelet domain. When the wavelet filter is chosen to satisfy an appropriate moment condition, most of the query wavelet coefficients vanish making the query evaluation faster. We made this observation practical by introducing the lazy wavelet transform, an algorithm that translates polynomial range-aggregates to the wavelet domain in poly-logarithmic time.

Wavelets are often thought of as a data approximation tool, and have been used this way for approximate range query answering [Vitter, Wang and Iyer, 98; Shanmugasundaram et al., 1999; Chakrabarti et al. 2000; Gilbert et al., 2001].

The efficacy of this approach is highly data dependent; it only works when the data have a concise wavelet approximation. Furthermore the wavelet approximation is difficult to maintain. To avoid these problems, we use wavelets to approximate incoming queries rather than the underlying data. Note that the data set is still transformed using wavelet; however, it is not approximated since we keep all the coefficients. By using our exact polynomial range-sum technique, but using the largest query wavelet coefficients first, we are able to obtain accurate, data-independent query approximations after a small number of I/Os. This approach naturally leads to a progressive algorithm. We brought these ideas together by introducing ProPolyne (Progressive Polynomial Range-Aggregate Evaluator), a polynomial range-aggregate evaluation method which:

·         Treats all dimensions, including measure dimensions, symmetrically and supports range-sum queries where the measure is any polynomial in the data dimensions (not only COUNT, SUM and AVERAGE, but also VARIANCE, COVARIANCE and more). All computations are performed entirely in the wavelet domain.

·         Uses the lazy wavelet transform to achieve query and update cost comparable to the best-known exact techniques.

·         By using the most important query wavelet coefficients first, provides excellent approximate results and guaranteed error bounds with very little I/O and computational overhead, reaching low relative error far more quickly than analogous data compression methods.

Our experimental results on NASA atmospheric datasets showed that the approximate results produced by ProPolyne are very accurate long before the exact query evaluation is complete. These experiments also showed that the performance of wavelet based data approximation methods varies wildly with the dataset, while query approximation based ProPolyne delivers consistent, and consistently better, results.

References

C. Shahabi, X. Tian and W. Zhao, TSA-tree: A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Surprise and Trend Queries on Time-Series Data, The 12th International Conference on Scientific and Statistical Database Management (SSDBM 2000), Berlin, Germany, July 2000.

C. Shahabi, S. Chung, M. Safar and G.Hajj, 2D TSA-tree: A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Spatial Data Mining, The 13th International Conference on Scientific and Statistical Database Management (SSDBM 2001), Fairfax, Virginia, July 2001.

C. Shahabi, S. Chung and M. Safar, A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Surprise Mining, PAKDD International Workshop on Mining Spatial and Temporal Data 2001, Hong Kong, May 2001.

R. R. Schmidt and C. Shahabi, Wavelet Based Density Estimators for Modeling Multidimensional Data Sets, SIAM Workshop on Mining Scientific Data Set, Chicago, IL, April 2001.

R. R. Schmidt and C. Shahabi, PROPOLYNE: A Fast Wavelet-based Algorithm For Progressive Evaluation of Polynomial Range-Sum Queries, Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002), Prague, Czech Republic, March 2002.

R. R. Schmidt and C. Shahabi, How to Evaluate Multiple Range-Sum Queries Progressively,  Proceedings of the International Symposium on Principle of Database Systems (PODS’2002), Madison, Wisconsin, June 2002.

J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. In Proc. of the 7th Int'l Conf. on Information and Knowledge Management, pages 96-104, 1998.

A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Optimal and approximate computation of summary statistics for range aggregates. In PODS 2001, Proceedings of the Twenteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 228-237, 2001.

J. Shanmugasundaram, U. Fayyad, and P. Bradley. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 1999.

K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, pages 111-122, 2000.