Modern data analysis systems must answer complex statistical queries over very large multidimensional datasets, so they need to support a range of multivariate statistical methods. In addition, the required accuracy varies by application, user, and dataset, and it can often be traded for faster response time. These characteristics lead us to believe that the wavelet transform, with its inherent multiresolution property, is a promising tool for future database query processing. We are building a general system that uses the wavelet decomposition of multidimensional data not only to speed up aggregate queries but also to support data-mining functionality.
The design of such a system faces challenges from diverse areas of the database community. Because most of these problems have either not been addressed before or have been addressed only in a different context, several interesting research topics emerge. They fall into three main categories.
Data Management
- Efficiently transforming very large multidimensional datasets
Currently, our system (ProDA) operates on a dataset that has been transformed offline and prepared for the system, and it does not support incremental updates of that dataset. For our large datasets, the transformation sometimes takes days to complete. To integrate ProDA into SciFlow, its web services should be extended with transformation and update web services. This, however, is more than a simple implementation effort. Current wavelet transformation techniques assume that the data fit in main memory during the transformation, so large datasets require an I/O-efficient transformation technique. We intend to build a buffer management technique that breaks the intermediate wavelet arrays into chunks that can be read and written in an order that minimizes I/O during the transformation. Similarly, as new data become available, the update web service should be able to update the coefficients without a full re-transformation.
- Creating data structures to handle extremely sparse data cubes
As data cubes grow in size and sparsity, efficient query processing becomes increasingly difficult. We are looking for efficient techniques that handle both dense and sparse data. To that end, we are exploring new ways, first, to recognize the dense and sparse regions of a multidimensional data cube and, second, to index them efficiently.
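To make the incremental-update goal under the first item more concrete, the following is a minimal one-dimensional sketch, not ProDA's implementation. It assumes an orthonormal Haar wavelet and a power-of-two array length; the function names `haar_transform` and `point_update` are hypothetical. It shows that changing a single data value touches only log2(N) + 1 coefficients, so the transformed array can be patched in place instead of being recomputed.

```python
import math

def haar_transform(values):
    """Orthonormal 1-D Haar transform (illustrative; length must be a power of two)."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:n] = avg + diff  # averages first, then details of this level
        n = half
    return coeffs

def point_update(coeffs, pos, delta):
    """Apply data[pos] += delta directly to the transformed array.

    Only one detail coefficient per level plus the overall average change,
    i.e. log2(N) + 1 coefficients in total.
    """
    n = len(coeffs)
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        sign = 1.0 if pos % 2 == 0 else -1.0
        delta /= s                      # contribution carried to this level
        coeffs[half + pos // 2] += sign * delta
        pos //= 2                       # index of the parent average
        n = half
    coeffs[0] += delta

data = [3, 1, 4, 1, 5, 9, 2, 6]
coeffs = haar_transform(data)
point_update(coeffs, 5, 2.5)            # same effect as data[5] += 2.5
```

The same idea extends to multidimensional cubes by applying the update along each dimension, but batching updates and laying the touched coefficients out on disk so that they share blocks is exactly the part that this sketch leaves open.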
Query Processing
- Providing fast answers that progressively become exact
Data analysis systems must answer range-aggregate queries over
large multidimensional datasets. ProDA uses the wavelet transformation of both the query and
the data hypercubes to return fast answers whose accuracy increases progressively
until they become exact. While prior work focused on the ordering of
either the query or the data coefficients, we propose a class of
hybrid ordering techniques that exploits both query and data
wavelets in answering queries progressively.
- Cutting down retrieval cost by exploiting query workload statistics
The wavelet transformation has been extensively used in applications where approximate, progressive
or even fast exact answers to queries are required. However, for general database queries, the
complete wavelet transform is not always the optimal form in which to store the data. We explore
and exploit the properties of the full tree of the wavelet decomposition, in order to find a
representation for the dataset that minimizes the retrieval cost for a set of queries.
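The progressive answering described under the first item can be illustrated with a small one-dimensional sketch. It is not ProDA's hybrid ordering: it assumes an orthonormal Haar wavelet, a power-of-two array length, and the simpler strategy of ordering by query-coefficient magnitude only. The names `haar_transform` and `progressive_range_sum` are hypothetical. Because the transform is orthonormal, a range sum equals the inner product of the query's and the data's wavelet coefficients, and a range query has only a few nonzero coefficients.

```python
import math

def haar_transform(values):
    """Orthonormal 1-D Haar transform (illustrative; length must be a power of two)."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:n] = avg + diff
        n = half
    return coeffs

def progressive_range_sum(data, lo, hi):
    """Yield successively refined estimates of sum(data[lo..hi]), inclusive."""
    n = len(data)
    data_w = haar_transform(data)        # in a real system this is precomputed
    query = [1.0 if lo <= i <= hi else 0.0 for i in range(n)]
    query_w = haar_transform(query)
    # Assumed ordering: largest query coefficients first.
    order = sorted(
        (k for k in range(n) if abs(query_w[k]) > 1e-12),
        key=lambda k: -abs(query_w[k]),
    )
    estimate = 0.0
    for k in order:
        estimate += query_w[k] * data_w[k]  # one coefficient retrieval per step
        yield estimate

data = [3, 1, 4, 1, 5, 9, 2, 6]
estimates = list(progressive_range_sum(data, 2, 5))
# The last estimate is the exact range sum 4 + 1 + 5 + 9.
```

Each yielded value corresponds to one more data coefficient retrieved, so the number of steps is a rough stand-in for retrieval cost. A hybrid ordering would also take the magnitudes of the data coefficients into account when choosing which coefficient to fetch next.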
Data Fusion
- Progressively fusing large multidimensional cubes