OLAP is an elegant approach to efficiently support analytical
queries on massive multidimensional datasets. Several fundamental
classes of OLAP queries, such as aggregate queries, slice-and-dice
queries, or roll-up queries, have been addressed by numerous
researchers for the last couple of years. Most of these methods
share the disadvantage of either providing only approximate answers
by compressing the data or sacrificing the update cost for better
query processing performance. However, we propose to employ wavelet
transform to efficiently process such queries with exact results and
without significant increases on update performance. Leveraging from
multi-resolution property of wavelets, we also incorporate
approximate query processing in case of space or time limitation as
our approach is fundamentally progressive. Furthermore, we prepare
and maintain wavelet data by providing a framework to efficiently
create, insert, and update data. By developing a real system with
its deployment in practice, we address all the essential steps
toward realization of our proposed work. We are currently extending our
framework by supporting a new class of analytical queries, plot
queries, as the most frequently performed queries in scientific
applications for extraction of useful information.
Cyrus Shahabi, Dimitris Sacharidis, and Mehrdad Jahangiri,Wavelets for Querying Multidimensional Datasets, Encyclopedia of Data Warehousing and Mining, ISBN 1-59140-557-2, Vol. 1, Idea Group Reference, April 2005
Subspace Method (SM) is a process to find a lower-dimensional representation
of high-dimensional data with particularly useful mathematic properties. It
is therefore useful when dealing with multidimensional data sets that are
commonly and inevitably suffering from the curse of dimensionality. In
Linear Subspace Method (LSM), the lower-dimensional subspace is spanned by a
small number of new components (or directions) that are formed as a linear
function of the original variables. Therefore, the process is simply to find
the linear coefficients of these linear functions by minimizing or
maximizing a certain criterion.
LSM have been profitable exploited in many disciplines such as dimension
reduction, feature selection/extraction, data compression/reconstruction,
data visualization, etc, typically based on or extended from Principal
Component Analysis (PCA), Linear Discriminant Analysis (LDA),
Multidimensional Scaling (MS), and more. PCA is to find a subspace that
maximally preserves the variability in data, LDA is to find a
discriminative subspace with maximum separation between classes, and
MS is to find a two-dimensional subspace that maximally preserves the
inter-point distances between the data items.
At InfoLab, LSM was exploited for feature subset selection of Multivariate
Time Series (MTS) data set. The linear coefficients obtained from LSM are
often interpreted as the contributions of the corresponding original
variables, which was exploited to find an optimal subset of original
variables least redundant each other and potentially effective to the
subsequent data mining task as a whole. In specific, a subspace that is
common to all MTS data items in the set and most effectively preserves the
variability in each MTS item was first extracted, then the original
variables were clustered based on their coefficients/contributions on
forming this common subspace, and one representative variable from each
cluster was finally selected as the subset. More recently, LSM has been
exploited for the visualization of high-dimensional data. That is, a new
linear subspace method, name Progression-Preserving Projection, was proposed
to find a two- or three-dimensional subspace that most effectively preserves
the class separation and the within-class progression structure so as to
visually discover the most useful and informative structures presented in
the original multidimensional data, which is particularly useful in the
application of motor rehabilitation after stroke. Now, we are exploiting LSM
approaches for the predictive modeling of multidimensional data.
Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi,Feature Subset Selection and Feature Ranking for Multivariate Time Series, IEEE Transactions on Knowledge and Data Engineering (TKDE) - Special Issue on Intelligent Data Preparation, Volume: 17, Issue: 9, pp. 1186- 1198, ISSN: 1041-4347, September 2005
An important class of spatial queries consists of nearest-neighbor (NN) query and its variations. These queries search for data objects that minimize a distance-based function with reference to one or more query objects (e.g., points). In location-based services, a group of mobile users want to find the location of their K closest objects to their current location (KNN). One obvious requirement with KNN queries is that the location of the query point(s) needs to be known in order to perform the query. However, in many applications such as in location-based services, a user may not want to reveal its location in order to preserve his/her privacy. The idea behind our approach is to utilize one-way transformations to map the space of all static and dynamic objects to another space and resolve the query blindly in the transformed space. However, in order to become a viable approach, the transformation used should be able to resolve NN queries in the transformed space accurately and more importantly prevent malicious use of transformed data by untrusted entities. Such a transformation can be viewed as a space encryption scheme. We use Hilbert curves as efficient one-way transformations and design algorithms to evaluate a KNN query in the Hilbert transformed space.
A visibility query in a geospatial environment would be to check if a given query point can see a given data point, considering the terrain elevation. This class of spatial queries can be more challenging, when instead of a data point, a path is needed to be checked for the visibility. A general application to this problem can be a grand prix example, where there exist a specified race track, and some seats available. Each seat's ticket price is dependent on how much of the track can be visible from that seat. If the entire area around the track have potentiality of be chosen as seats, the problem needs an expensive visibility computation. The idea behind our approach is to utilize approximation techniques in order to reduce the large amount of computation.
Our research looks at practical and efficient ways to use images taken on a campaign setting using handheld devices for the purpose of creating textures to be used on virtualized locations. The motivation for this work stems by our intent to device a rapid and accurate modeling pipeline able to produce quality models on a large scales. In addition, existing models commonly obtained using LiDar imagery have poor textures and we expect to be able to find a pipeline that allow for their re-use.
Our approach attemps to blend techniques from computer vision and computer graphics and use all available pre-existing information to constrain the solution space. Available constraints include existing geometric models and textures of buildings associated with geospatial data, as well as information readily available from the mobile devices such as their GPS position at the time of acquisition.
Finally we plan to explore how those textures can be linked to information in a geographical information system to facilitate decision making type applications.
People Involved:
Luciano Nocera
Balki Ranganathan
Frederick Zyda
With current advances in mobile technology, generating the texture
information from the images taken by users from an urban environment
is becoming increasingly feasible. We envision GeoSIM (Geo Social
Image Mapping) as a system with which a group of individuals with
camera equipped mobile phones participate in collaborative/social
mapping of such texture information in a virtualized urban area. Such
virtualized goelocations might already include partial texture
information from various sources such as Google Earth. An important issue in this system is to generate participation plans for the users for fast, efficient and comprehensive
social image mapping. The participation plans should satisfy
various user constraints (e.g., a specific movement direction) while
maximizing the exposure of uncovered sight to users. For each user,
the participation plan consists of a traversal path with
multiple stop points at which the user takes photos based on some instructions for image acquisition (e.g. how to set the zoom factor and lighting). Finding such
participation plans involves issuing a set of novel spatiotemporal queries for
multi-destination path planning with constraints (i.e., variants of
Traveling Salesman Problem (TSP)) as well as complex
visibility/line-of-sight queries for exposure analysis in the urban
area. Clearly visibility analysis is not possible for every point of the area.
In this research we are looking on how to efficiently issue both category of queries to generate users participation plans.
The wide spread usage of GPS devices and latest developments in wireless technologies has led to extensive usage of location-based services as well as emerging global positioning services. With rapidly growing interest in GPS enabled handheld devices, end users are able to report their positions hence becoming a point of interests and utilize various location-based services from the providers. One main class of such services is k-nearest neighbor ((k-NN) queries over moving objects on road networks. A k-nearest neighbor query finds the k point of interests (ie: restaurant, taxis) in a dataset that lie closest to a given query point. In the past, many efficient algorithms have been proposed to address k-NN query processing in Euclidian space. In real-world scenarios, however, the query points move in road networks, and the distance between it’s point of interest is defined as the length of the shortest path connecting them. When processing k-NN queries over moving objects, the main challenges arise from a) the mobility of query points and/or point of interests b) fast road network distance calculation c) management of location updates and indexes. To cope with challenges, we propose a novel approach which initially retrieves the k-NN from a grid structure and then utilizes threshold-based algorithms to refine the result set. In addition, our approach takes into account changes in the network edges because of varying road conditions such as traffic, road construction, accidents. All these algorithms are designed for update streams, where the central server manages the location updates from a particular set of queries and objects.
A very important class of queries in Geo-spatial database is k Nearest
Neighbor (kNN) query: Given a query point q, a kNN query searches in the
object data set P and returns the k nearest neighbors of q. Various studies
have been taken out on the problems of kNN, reverse kNN, continue kNN and
network kNN. Moreover, real world applications often result in the
formulation of kNN queries on land surface. Surface kNN query (skNN) is also
a kind of constraint based query. In such queries the objects can only move
on the terrain. A skNN query actually involves two tasks, it needs to select
out the k nearest neighbors to q, just as a traditional kNN query. What is
more, the users are not satisfied by only telling which one is nearer, they
also want to know how to get there. Thus the system has to point out the
path from q to such points on the surface.
We propose a novel approach to process skNN query with two spatial indexes,
namely Loose Surface Index (LSI) and Tight Surface Index (TSI). With those
two indexes, we bring down the computation cost (both I/O and CPU time) for
surface distance computation. Our algorithm finds out skNN in an "expanding
strategy": firstly we locate the query point, and search in the around area
for top 1 nearest neighbor, then we expand the search area with certain
manner to report k nearest neighbors incrementally.
A very important class of queries in GIS applications is the class of K-Nearest Neighbor queries, where a set of K closest points of interest (e.g., hospitals or restaurants) to a query point is requested. Moreover, new real-world applications often result in the formulation of new variations of the KNN queries each demanding new a solution in presence of certain constraints. The challenge in all of these queries is to process the query as fast as possible when: 1) The dataset is usually large, and hence, conventional techniques usually lead to very long query processing time. 2) The distance function is usually computationally complex, which again usage of conventional approaches (e.g., Dijkstra's algorithm to find the shortest path between two points on a road network) will be impractical. Different approaches to address these challenges have been proposed. These approaches are usually based on approximation of the actual distance function with a simpler one (e.g., using Euclidean distance instead of shortest path) and utilization of spatial index structures. The problem with these approaches is that the simple distance function may not (and often does not) provide an acceptable approximation for the distance function. At USC InfoLab, we focus on nearest neighbor problem for road networks, where the distance function between two points is usually their shortest path. We address this problem from different angels. First, we propose a more accurate (as compare to Euclidean distance) approximation of the shortest path distance function by transforming the original network to a higher dimensional space. This approach is appropriate when some level of inaccuracy can be tolerated. Second, we propose an approach based on pre-calculating and indexing the solution space. This approach, based on Voronoi diagram, provides 100% accuracy and can also be extended to address constrained nearest neighbor queries and to find the distance between two arbitrary points. Finally, we studied an unexploited and novel form of KNN queries named Optimal Sequenced Route (OSR) query in both vector and metric spaces. OSR strives to find a route of minimum length starting from a given source location and passing through a number of typed locations in a specific sequence imposed on the types of the locations. Our experiments verified that our algorithms significantly outperform the naive approach in terms of processing time (up to two orders of magnitude) and required workspace (up to 90% reduction on average).
Mehdi Sharifzadeh, Mohammad Kolahdouzan, and Cyrus Shahabi,The Optimal Sequenced Route Query, University of Southern California, Technical Report 05-840, Computer Science Department, January 2005
There is a vast amount of geospatial data available, including image, map, vector, point, and elevation data. The users of these data products often want these different data sources displayed in some integrated fashion. However; accurately and automatically integrating diverse geo-spatial datasets is a difficult task, since there are often certain spatial inconsistencies between geo-spatial datasets. They are multiple reasons why different data products may not align: they may have been collected at different resolutions, they may use different projections of the earth, they may have been corrected in different ways, etc. Conflation is often a term used to describe the integration or alignment of different geospatial data sets. However, the difficulty with current conflation techniques is that they require the manual identification of a set of control points to properly conflate two data sets. We are developing techniques to automatically conflating different sources of geospatial data. The critical challenge in automatic conflation is to automatically identify an appropriate set of control points on the two sources to be integrated. We propose to do this by combining different sources of information from each of the sources to be integrated. For example, in integrating vector data with imagery, we can exploit what is known or can be inferred about each of the data sources. For the imagery, we can exploit points that have been identified in the imagery or perform imager processing on the image. For vector data of roads, we can exploit meta-data about the vectors, such as address ranges, road names, or even the number of lanes and type of road surface. In addition , we can only analyze the road network to determine the location of intersections. For other types of vector data, there are similar types of meta data and inferences that can be exploited. This information can then be augmented with other sources of data. For example, points on the imagery can be determined by looking up the points in an online telephone book. Furthermore, we propose an information integration approach, which utilizes common vector datasets as "glue" to automatically conflate imagery with street maps. We have also demonstrated the utility of our approach using multiple evaluation methods.
Craig A. Knoblock and Cyrus Shahabi,Geospatial Data Integration, The Handbook of Geographic Information Science, ISBN: 9781405107969, Pages 185-196, Editors: John P. Wilson , A. Stewart Fotheringham; Publisher: Blackwell, 2008
Cyrus Shahabi,Geospatial Decision Making and Spatial Skyline Queries, Peking University (PKU), Beijing, China, July 2007
Cyrus Shahabi,Geospatial Decision Making and Spatial Skyline Queries, Chinese Academy of Sciences (CAS), Beijing, China, July 2007
Cyrus Shahabi,Geospatial Information Extraction and Integration, Lawrence Livermore National Laboratory, Livermore, CA, February 2007
A spatial data stream is a data stream consisting of geometric objects such as points, lines, polygons or intervals associated with a set of non-spatial attributes. These data items are
streamed to a central processing module from a stream source such as a sensor node within a sensor network, a GPS-equipped truck adventuring in a desert or a tracker device attached to a user's body in an immersive environment. The emerging research area of
algorithms for spatial data streams is the development of novel one-pass computational geometry algorithms which use small (sub-linear) amount of memory space and per-item processing time. An example of such algorithms is to find and maintain an approximate convex hull of a stream of 2-d points over a specified time window. The algorithm can be applied to a network of moving sensors to monitor the extent of the area they cover. Such geometric algorithms are building blocks of spatial queries on spatial data streams. They are also capable of processing non-spatial streams. For example, flows in an IP network may be modeled as intervals ([start time, end time]) in a 1-d space, or user document downloads on a peer-to-peer network can be mapped into a high dimensional space generating a spatial data stream.
At USC infolab, we intend to focus on addressing different query processing issues on spatial data streams. We study the practicality of our algorithms on sensor networks and immersive environments as our running applications.
Immersive data streams or immersidata consist of multidimensional sensor data streams acquired from a user's interactions with an immersive environment. One example of this type of data is the set of data streams received from the sensors of a Cyber Glove (a virtual-reality glove). The Cyber Glove data streams can then be used for hand-gesture recognition. While traditional gesture recognition techniques often suffer from being highly device-dependent and not extendible for recognizing new signs, we propose a novel multi-layer framework for representing immersive sensor data streams that addresses both shortcomings. Our framework achieves both device independence and extendibility by separating the process from data within its multiple layers.
This framework consumes the streams of data and constructs a multi-level, spatio temporal representation of the data for further manipulation and processing. The lowest layer, termed 'raw data layer' continuously acquires sensor data streams. The next layer contains a set of predicates that describe general posture of the hand and its movement in space. The top layer contains a set of templates that corresponds to different hand gestures. The focus of this research is on the 'temporal' aspect of the data streams to extract hand movement trajectories. In particular we plan to:
Define a set of temporal predicates that can describe the movements of the hand,
Detect the trajectory of the hand movement from the sensor data streams,
Design an algorithm that can extract the temporal predicates based on these trajectories, and
Recognize the gestures by combining the extracted temporal templates with the spatial templates.
The Multivariate Time Series (MTS) dataset is a common data type in
various scientific, medical and financial fields, where more than one
variable (e.g., sensor) is used for monitoring the environment and its
status changes over time. An MTS is usually very high dimensional with its
main distinguishing characteristic being the inter-correlations and/or
interdependencies among its variables. Consequently, MTS may not be easily
broken into multiple univariate time series and should be treated as a
whole.
There are several research directions for Multivariate Time Series
Analysis (MTSA). One is to find an appropriate similarity measure for
comparing two MTS's. The similarity measure should take the
characteristics of MTS into consideration, and must be efficiently
computable. Another is to devise indexing structures for MTS data. The
traditional SAM (Spatial Access Method) techniques may not work here due
to the high dimensionality of the MTS datasets. Currently, we are
extending SVD-based similarity measures for MTS datasets by representing
an MTS as a matrix. For more general cases, an n-way SVD/PCA technique
should be employed to deal with MTS's represented as tensors. So far, we
have focused on k Nearest Neighbor search for MTS data, and we are trying
to extend this to more general analysis on MTS streams.
Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi,Feature Subset Selection and Feature Ranking for Multivariate Time Series, IEEE Transactions on Knowledge and Data Engineering (TKDE) - Special Issue on Intelligent Data Preparation, Volume: 17, Issue: 9, pp. 1186- 1198, ISSN: 1041-4347, September 2005
Cyrus Shahabi,Wavelets and Web-Services (W2) for Online Scientific Data Analysis, , Lawrence Livermore National Laboratory, Livermore, CA, August 2005
M. McLaughlin, G. Sukhatme, J. Hespanha, C. Shahabi, A. Ortega, and G. Medioni,The Haptic Museum, Proc. EVA 2000 Conference on Electronic Imaging and the Visual Arts, Florence, Italy, 2000
The
next generation of distributed database systems is a
family of massive self-organizing data networks; we
term Querical Data Networks (QDNs). QDNs consist
of a dynamic and large set of peer autonomous nodes,
a dynamic dataset that is naturally distributed among
the nodes in extra fine-grain, and a transient-form
communication infrastructure that interconnects the
nodes. Peer-to-peer and sensor networks are well-known
examples of QDN. Distributed query processing within
a QDN is far more challenging as compared to that of
the traditional distributed database systems. Our approach
is to categorize QDNs as instances of "Complex
Systems" and apply the Complex System Theory (CST)
to study QDNs. As our focus problem, we study CST-based
mechanisms for efficient data location (search) within
QDNs. The search primitive is one of the most fundamental
functionalities vital to QDN query processing.
Some
QDNs are structurable/indexible whereas some others,
due to the extreme dynamism of the QDN topology and
the extreme autonomy of the QDN nodes, render any attempt
to even semi-structurize the network to an index-like
structure inefficient and/or impossible. For structurable
QDNs, we study self-organizing mechanisms to structurize
the topology of the QDN to a search-efficient topology.
Our approach is inspired by the "small-world"
models that try to explain efficient communication in
a social network, which is a semi-structured complex
system. For unstructurable QDNs, we study efficient
query flooding mechanisms. Efficient flooding can be
applied as a method to disseminate broadcast search
queries as well as other unicast and multicast queries.
We employ percolation theory, an analytical tool borrowed
from CST, to formalize and analyze these flooding mechanisms.
The Web Services infrastructure is a distributed computing environment for service-sharing. In this environment, resource discovery is required as a primitive functionality for the service requesters to be able to locate the services, the shared resources. A discovery service with centralized architecture, such as UDDI (Universal Discovery Description and Integration), restricts the scalability of this environment as it grows to the scales comparable with the size of the web itself. In addition, current extensively used web service standards (e.g. UDDI, WSDL), do not support discovery at a semantic level.
Hence, we propose a fully decentralized and interoperable discovery service with semantic-level search capability, termed p2p-WSD for peer-to-peer web-service discovery. The first decentralization characteristic of p2p-WSD is in synch with autonomy and distribution design objectives and its semantic-level matching capability is based on our knowledge representation design principles.
In sum, we believe a decentralized architecture of the semantic-enabled p2p-WSD not only satisfies the design requirements for efficient and accurate discovery in distributed environments, but also is compatible with the nature of the Web Services environment as a self-organized federation of peer service-providers without any particular sponsor.
As the demand for digital information and the amount of data
dramatically increase, the task of acquiring accurate information has
become more and more essential in various domains. In most of
information-locating systems, providing a customized result set based
upon features of data and user preferences is the ultimate objective.
The main challenge is that both the data features and the user
preferences are represented with several dimensions. Hence,
aggregation, similarity matching, clustering, and classification of
multidimensional data sets need to be studied.
We study these issues in the context of three application domains: 1)
a content-based image retrieval system, 2) an e-commerce
recommendation system, and 3) a neuroscience knowledge management
system. We propose a customized query model, named Yoda, which can
uniformly incorporate physical characteristics (data features),
semantic meanings (attributes), and users' preferences during the
query process. Moreover, Yoda also utilizes the users' relevance
feedback to improve query results by using genetic algorithms (GA).
Yi-Shin Chen and Cyrus Shahabi,Improving User Profiles for E-Commerce by Genetic Algorithms, E-Commerce and Intelligent Methods Studies in Fuzziness and Soft Computing, Vol. 105, VIII, J. Segovia, P.S. Szczepaniak, M. Niedzwiedzinski (Eds.), 2002
Continuous media (CM) servers are unique in that they must provide continuous,
real-time delivery of media with low start-up latency to clients. In addition,
CM servers must be able to support multiple concurrent clients.
We have designed and developed Yima, a multi-node CM server that is
built with commodity components and is fully compatible with industry
standards. Yima distinguishes itself from other CM servers in the
following: 1) elimination of single point of failure with its multi-node
shared nothing architecture, 2) frame/sample accurate
synchronization of several streams, 3) independence from the media format,
4) network-failure tolerance with
selective re-transmission, and 5) efficient flow control with
multi-threshold buffering to support VBR traffic.
Currently we are equipping Yima with a disk scaling
component, SCADDAR, that allows on-line addition and/or removal of disks
with zero down-time to scale up/down the storage capacity adaptively as
required. The random block-placement mechanism employed by SCADDAR ensures
uniform distribution of blocks among disks (for balanced load), while the
number of re-distributed blocks is also minimal.
We developed a new shape-based object retrieval technique based on minimum bounding circles (MBCs). Our MBC-based method can be used for efficient retrieval of 2D objects by utilizing three different index structures on features that are extracted from the object’s MBC. Besides similarity retrieval, we also investigated how to support spatial queries (i.e. topological and direction queries) based on a subset of features extracted from the objects MBC. First, we focused on the support of topological relations as defined by the 9-intersection model. We proposed some filtering techniques based on special cases for MBC in order to reduce the number of false hits. Next, we proposed an extra filtering step to expedite direction relations between objects utilizing their MBCs based on cone-directions method. We evaluated and compared the performance of our method to different boundary based methods for shape representation and retrieval. The results showed that the similarity retrieval accuracy of our MBC-based method was better than other methods, while it has the lowest computation cost to generate the shape signatures of the objects. Moreover, it has low storage requirement, and a comparable computation cost to compute the similarity between two shape signatures. In addition, MBC-based method requires no normalization of the objects. Further more, our experiments showed the robustness of our method to the introduction of noise to the database and the query objects.