USC
 

 

 




WOLAP

MDA


OLAP is an elegant approach to efficiently support analytical queries on massive multidimensional datasets. Several fundamental classes of OLAP queries, such as aggregate queries, slice-and-dice queries, or roll-up queries, have been addressed by numerous researchers for the last couple of years. Most of these methods share the disadvantage of either providing only approximate answers by compressing the data or sacrificing the update cost for better query processing performance. However, we propose to employ wavelet transform to efficiently process such queries with exact results and without significant increases on update performance. Leveraging from multi-resolution property of wavelets, we also incorporate approximate query processing in case of space or time limitation as our approach is fundamentally progressive. Furthermore, we prepare and maintain wavelet data by providing a framework to efficiently create, insert, and update data. By developing a real system with its deployment in practice, we address all the essential steps toward realization of our proposed work. We are currently extending our framework by supporting a new class of analytical queries, plot queries, as the most frequently performed queries in scientific applications for extraction of useful information.

 
 

People Involved:



 

Publications:




SUBSPACE

SM


Subspace Method (SM) is a process to find a lower-dimensional representation of high-dimensional data with particularly useful mathematic properties. It is therefore useful when dealing with multidimensional data sets that are commonly and inevitably suffering from the curse of dimensionality. In Linear Subspace Method (LSM), the lower-dimensional subspace is spanned by a small number of new components (or directions) that are formed as a linear function of the original variables. Therefore, the process is simply to find the linear coefficients of these linear functions by minimizing or maximizing a certain criterion.

LSM have been profitable exploited in many disciplines such as dimension reduction, feature selection/extraction, data compression/reconstruction, data visualization, etc, typically based on or extended from Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Multidimensional Scaling (MS), and more. PCA is to find a subspace that maximally preserves the variability in data, LDA is to find a discriminative subspace with maximum separation between classes, and MS is to find a two-dimensional subspace that maximally preserves the inter-point distances between the data items.

At InfoLab, LSM was exploited for feature subset selection of Multivariate Time Series (MTS) data set. The linear coefficients obtained from LSM are often interpreted as the contributions of the corresponding original variables, which was exploited to find an optimal subset of original variables least redundant each other and potentially effective to the subsequent data mining task as a whole. In specific, a subspace that is common to all MTS data items in the set and most effectively preserves the variability in each MTS item was first extracted, then the original variables were clustered based on their coefficients/contributions on forming this common subspace, and one representative variable from each cluster was finally selected as the subset. More recently, LSM has been exploited for the visualization of high-dimensional data. That is, a new linear subspace method, name Progression-Preserving Projection, was proposed to find a two- or three-dimensional subspace that most effectively preserves the class separation and the within-class progression structure so as to visually discover the most useful and informative structures presented in the original multidimensional data, which is particularly useful in the application of motor rehabilitation after stroke. Now, we are exploiting LSM approaches for the predictive modeling of multidimensional data.

 
 

People Involved:



 

Publications:






Geospatial Information Management

Blind Evaluation of Spatial Queries Using Hilbert Space-Filling Curves

vsknn


An important class of spatial queries consists of nearest-neighbor (NN) query and its variations. These queries search for data objects that minimize a distance-based function with reference to one or more query objects (e.g., points). In location-based services, a group of mobile users want to find the location of their K closest objects to their current location (KNN). One obvious requirement with KNN queries is that the location of the query point(s) needs to be known in order to perform the query. However, in many applications such as in location-based services, a user may not want to reveal its location in order to preserve his/her privacy. The idea behind our approach is to utilize one-way transformations to map the space of all static and dynamic objects to another space and resolve the query blindly in the transformed space. However, in order to become a viable approach, the transformation used should be able to resolve NN queries in the transformed space accurately and more importantly prevent malicious use of transformed data by untrusted entities. Such a transformation can be viewed as a space encryption scheme. We use Hilbert curves as efficient one-way transformations and design algorithms to evaluate a KNN query in the Hilbert transformed space.

 
 

People Involved:




Visibility Computation for Trajectories

Trajectory Coverage


A visibility query in a geospatial environment would be to check if a given query point can see a given data point, considering the terrain elevation. This class of spatial queries can be more challenging, when instead of a data point, a path is needed to be checked for the visibility. A general application to this problem can be a grand prix example, where there exist a specified race track, and some seats available. Each seat's ticket price is dependent on how much of the track can be visible from that seat. If the entire area around the track have potentiality of be chosen as seats, the problem needs an expensive visibility computation. The idea behind our approach is to utilize approximation techniques in order to reduce the large amount of computation.

 
 

People Involved:




Texture Generation

Texture Generation


Our research looks at practical and efficient ways to use images taken on a campaign setting using handheld devices for the purpose of creating textures to be used on virtualized locations. The motivation for this work stems by our intent to device a rapid and accurate modeling pipeline able to produce quality models on a large scales. In addition, existing models commonly obtained using LiDar imagery have poor textures and we expect to be able to find a pipeline that allow for their re-use.

Our approach attemps to blend techniques from computer vision and computer graphics and use all available pre-existing information to constrain the solution space. Available constraints include existing geometric models and textures of buildings associated with geospatial data, as well as information readily available from the mobile devices such as their GPS position at the time of acquisition.

Finally we plan to explore how those textures can be linked to information in a geographical information system to facilitate decision making type applications.

 
 

People Involved:


  • Luciano Nocera
  • Balki Ranganathan
  • Frederick Zyda


Image Acquisition Planning

Image Acquisition Planning


With current advances in mobile technology, generating the texture information from the images taken by users from an urban environment is becoming increasingly feasible. We envision GeoSIM (Geo Social Image Mapping) as a system with which a group of individuals with camera equipped mobile phones participate in collaborative/social mapping of such texture information in a virtualized urban area. Such virtualized goelocations might already include partial texture information from various sources such as Google Earth. An important issue in this system is to generate participation plans for the users for fast, efficient and comprehensive social image mapping. The participation plans should satisfy various user constraints (e.g., a specific movement direction) while maximizing the exposure of uncovered sight to users. For each user, the participation plan consists of a traversal path with multiple stop points at which the user takes photos based on some instructions for image acquisition (e.g. how to set the zoom factor and lighting). Finding such participation plans involves issuing a set of novel spatiotemporal queries for multi-destination path planning with constraints (i.e., variants of Traveling Salesman Problem (TSP)) as well as complex visibility/line-of-sight queries for exposure analysis in the urban area. Clearly visibility analysis is not possible for every point of the area. In this research we are looking on how to efficiently issue both category of queries to generate users participation plans.

 
 

People Involved:




k-Nearest Neighbor Queries Over Moving Objects on Road Networks

Moving Object Example


The wide spread usage of GPS devices and latest developments in wireless technologies has led to extensive usage of location-based services as well as emerging global positioning services. With rapidly growing interest in GPS enabled handheld devices, end users are able to report their positions hence becoming a point of interests and utilize various location-based services from the providers. One main class of such services is k-nearest neighbor ((k-NN) queries over moving objects on road networks. A k-nearest neighbor query finds the k point of interests (ie: restaurant, taxis) in a dataset that lie closest to a given query point. In the past, many efficient algorithms have been proposed to address k-NN query processing in Euclidian space. In real-world scenarios, however, the query points move in road networks, and the distance between it’s point of interest is defined as the length of the shortest path connecting them. When processing k-NN queries over moving objects, the main challenges arise from a) the mobility of query points and/or point of interests b) fast road network distance calculation c) management of location updates and indexes. To cope with challenges, we propose a novel approach which initially retrieves the k-NN from a grid structure and then utilizes threshold-based algorithms to refine the result set. In addition, our approach takes into account changes in the network edges because of varying road conditions such as traffic, road construction, accidents. All these algorithms are designed for update streams, where the central server manages the location updates from a particular set of queries and objects.

 
 

People Involved:




Surface KNN queries

Surface


A very important class of queries in Geo-spatial database is k Nearest Neighbor (kNN) query: Given a query point q, a kNN query searches in the object data set P and returns the k nearest neighbors of q. Various studies have been taken out on the problems of kNN, reverse kNN, continue kNN and network kNN. Moreover, real world applications often result in the formulation of kNN queries on land surface. Surface kNN query (skNN) is also a kind of constraint based query. In such queries the objects can only move on the terrain. A skNN query actually involves two tasks, it needs to select out the k nearest neighbors to q, just as a traditional kNN query. What is more, the users are not satisfied by only telling which one is nearer, they also want to know how to get there. Thus the system has to point out the path from q to such points on the surface. We propose a novel approach to process skNN query with two spatial indexes, namely Loose Surface Index (LSI) and Tight Surface Index (TSI). With those two indexes, we bring down the computation cost (both I/O and CPU time) for surface distance computation. Our algorithm finds out skNN in an "expanding strategy": firstly we locate the query point, and search in the around area for top 1 nearest neighbor, then we expand the search area with certain manner to report k nearest neighbors incrementally.

 
 

People Involved:







MDA

A very important class of queries in GIS applications is the class of K-Nearest Neighbor queries, where a set of K closest points of interest (e.g., hospitals or restaurants) to a query point is requested. Moreover, new real-world applications often result in the formulation of new variations of the KNN queries each demanding new a solution in presence of certain constraints. The challenge in all of these queries is to process the query as fast as possible when: 1) The dataset is usually large, and hence, conventional techniques usually lead to very long query processing time. 2) The distance function is usually computationally complex, which again usage of conventional approaches (e.g., Dijkstra's algorithm to find the shortest path between two points on a road network) will be impractical. Different approaches to address these challenges have been proposed. These approaches are usually based on approximation of the actual distance function with a simpler one (e.g., using Euclidean distance instead of shortest path) and utilization of spatial index structures. The problem with these approaches is that the simple distance function may not (and often does not) provide an acceptable approximation for the distance function. At USC InfoLab, we focus on nearest neighbor problem for road networks, where the distance function between two points is usually their shortest path. We address this problem from different angels. First, we propose a more accurate (as compare to Euclidean distance) approximation of the shortest path distance function by transforming the original network to a higher dimensional space. This approach is appropriate when some level of inaccuracy can be tolerated. Second, we propose an approach based on pre-calculating and indexing the solution space. This approach, based on Voronoi diagram, provides 100% accuracy and can also be extended to address constrained nearest neighbor queries and to find the distance between two arbitrary points. Finally, we studied an unexploited and novel form of KNN queries named Optimal Sequenced Route (OSR) query in both vector and metric spaces. OSR strives to find a route of minimum length starting from a given source location and passing through a number of typed locations in a specific sequence imposed on the types of the locations. Our experiments verified that our algorithms significantly outperform the naive approach in terms of processing time (up to two orders of magnitude) and required workspace (up to 90% reduction on average).

 
 

People Involved:



 

Publications:




MDA

There is a vast amount of geospatial data available, including image, map, vector, point, and elevation data. The users of these data products often want these different data sources displayed in some integrated fashion. However; accurately and automatically integrating diverse geo-spatial datasets is a difficult task, since there are often certain spatial inconsistencies between geo-spatial datasets. They are multiple reasons why different data products may not align: they may have been collected at different resolutions, they may use different projections of the earth, they may have been corrected in different ways, etc. Conflation is often a term used to describe the integration or alignment of different geospatial data sets. However, the difficulty with current conflation techniques is that they require the manual identification of a set of control points to properly conflate two data sets. We are developing techniques to automatically conflating different sources of geospatial data. The critical challenge in automatic conflation is to automatically identify an appropriate set of control points on the two sources to be integrated. We propose to do this by combining different sources of information from each of the sources to be integrated. For example, in integrating vector data with imagery, we can exploit what is known or can be inferred about each of the data sources. For the imagery, we can exploit points that have been identified in the imagery or perform imager processing on the image. For vector data of roads, we can exploit meta-data about the vectors, such as address ranges, road names, or even the number of lanes and type of road surface. In addition , we can only analyze the road network to determine the location of intersections. For other types of vector data, there are similar types of meta data and inferences that can be exploited. This information can then be augmented with other sources of data. For example, points on the imagery can be determined by looking up the points in an online telephone book. Furthermore, we propose an information integration approach, which utilizes common vector datasets as "glue" to automatically conflate imagery with street maps. We have also demonstrated the utility of our approach using multiple evaluation methods.

 
 

People Involved:



 

Publications:





DS


A spatial data stream is a data stream consisting of geometric objects such as points, lines, polygons or intervals associated with a set of non-spatial attributes. These data items are streamed to a central processing module from a stream source such as a sensor node within a sensor network, a GPS-equipped truck adventuring in a desert or a tracker device attached to a user's body in an immersive environment. The emerging research area of algorithms for spatial data streams is the development of novel one-pass computational geometry algorithms which use small (sub-linear) amount of memory space and per-item processing time. An example of such algorithms is to find and maintain an approximate convex hull of a stream of 2-d points over a specified time window. The algorithm can be applied to a network of moving sensors to monitor the extent of the area they cover. Such geometric algorithms are building blocks of spatial queries on spatial data streams. They are also capable of processing non-spatial streams. For example, flows in an IP network may be modeled as intervals ([start time, end time]) in a 1-d space, or user document downloads on a peer-to-peer network can be mapped into a high dimensional space generating a spatial data stream.
At USC infolab, we intend to focus on addressing different query processing issues on spatial data streams. We study the practicality of our algorithms on sensor networks and immersive environments as our running applications.

 
 

People Involved:



 

Publications:



 

TAI

Immersive data streams or immersidata consist of multidimensional sensor data streams acquired from a user's interactions with an immersive environment. One example of this type of data is the set of data streams received from the sensors of a Cyber Glove (a virtual-reality glove). The Cyber Glove data streams can then be used for hand-gesture recognition. While traditional gesture recognition techniques often suffer from being highly device-dependent and not extendible for recognizing new signs, we propose a novel multi-layer framework for representing immersive sensor data streams that addresses both shortcomings. Our framework achieves both device independence and extendibility by separating the process from data within its multiple layers.
This framework consumes the streams of data and constructs a multi-level, spatio temporal representation of the data for further manipulation and processing. The lowest layer, termed 'raw data layer' continuously acquires sensor data streams. The next layer contains a set of predicates that describe general posture of the hand and its movement in space. The top layer contains a set of templates that corresponds to different hand gestures. The focus of this research is on the 'temporal' aspect of the data streams to extract hand movement trajectories. In particular we plan to:

  • Define a set of temporal predicates that can describe the movements of the hand,
  • Detect the trajectory of the hand movement from the sensor data streams,
  • Design an algorithm that can extract the temporal predicates based on these trajectories, and
  • Recognize the gestures by combining the extracted temporal templates with the spatial templates.

 
 

People Involved:



 

Publications:



 


MTSA

The Multivariate Time Series (MTS) dataset is a common data type in various scientific, medical and financial fields, where more than one variable (e.g., sensor) is used for monitoring the environment and its status changes over time. An MTS is usually very high dimensional with its main distinguishing characteristic being the inter-correlations and/or interdependencies among its variables. Consequently, MTS may not be easily broken into multiple univariate time series and should be treated as a whole.
There are several research directions for Multivariate Time Series Analysis (MTSA). One is to find an appropriate similarity measure for comparing two MTS's. The similarity measure should take the characteristics of MTS into consideration, and must be efficiently computable. Another is to devise indexing structures for MTS data. The traditional SAM (Spatial Access Method) techniques may not work here due to the high dimensionality of the MTS datasets. Currently, we are extending SVD-based similarity measures for MTS datasets by representing an MTS as a matrix. For more general cases, an n-way SVD/PCA technique should be employed to deal with MTS's represented as tensors. So far, we have focused on k Nearest Neighbor search for MTS data, and we are trying to extend this to more general analysis on MTS streams.


 
 

People Involved:



 

Publications:




The next generation of distributed database systems is a family of massive self-organizing data networks; we term Querical Data Networks (QDNs). QDNs consist of a dynamic and large set of peer autonomous nodes, a dynamic dataset that is naturally distributed among the nodes in extra fine-grain, and a transient-form communication infrastructure that interconnects the nodes. Peer-to-peer and sensor networks are well-known examples of QDN. Distributed query processing within a QDN is far more challenging as compared to that of the traditional distributed database systems. Our approach is to categorize QDNs as instances of "Complex Systems" and apply the Complex System Theory (CST) to study QDNs. As our focus problem, we study CST-based mechanisms for efficient data location (search) within QDNs. The search primitive is one of the most fundamental functionalities vital to QDN query processing.

Some QDNs are structurable/indexible whereas some others, due to the extreme dynamism of the QDN topology and the extreme autonomy of the QDN nodes, render any attempt to even semi-structurize the network to an index-like structure inefficient and/or impossible. For structurable QDNs, we study self-organizing mechanisms to structurize the topology of the QDN to a search-efficient topology. Our approach is inspired by the "small-world" models that try to explain efficient communication in a social network, which is a semi-structured complex system. For unstructurable QDNs, we study efficient query flooding mechanisms. Efficient flooding can be applied as a method to disseminate broadcast search queries as well as other unicast and multicast queries. We employ percolation theory, an analytical tool borrowed from CST, to formalize and analyze these flooding mechanisms.

 
 

People Involved:



 

Publications:





WSD


The Web Services infrastructure is a distributed computing environment for service-sharing. In this environment, resource discovery is required as a primitive functionality for the service requesters to be able to locate the services, the shared resources. A discovery service with centralized architecture, such as UDDI (Universal Discovery Description and Integration), restricts the scalability of this environment as it grows to the scales comparable with the size of the web itself. In addition, current extensively used web service standards (e.g. UDDI, WSDL), do not support discovery at a semantic level.
Hence, we propose a fully decentralized and interoperable discovery service with semantic-level search capability, termed p2p-WSD for peer-to-peer web-service discovery. The first decentralization characteristic of p2p-WSD is in synch with autonomy and distribution design objectives and its semantic-level matching capability is based on our knowledge representation design principles.
In sum, we believe a decentralized architecture of the semantic-enabled p2p-WSD not only satisfies the design requirements for efficient and accurate discovery in distributed environments, but also is compatible with the nature of the Web Services environment as a self-organized federation of peer service-providers without any particular sponsor.

 
 

People Involved:



 

Publications:



 


YODA

As the demand for digital information and the amount of data dramatically increase, the task of acquiring accurate information has become more and more essential in various domains. In most of information-locating systems, providing a customized result set based upon features of data and user preferences is the ultimate objective. The main challenge is that both the data features and the user preferences are represented with several dimensions. Hence, aggregation, similarity matching, clustering, and classification of multidimensional data sets need to be studied.
We study these issues in the context of three application domains: 1) a content-based image retrieval system, 2) an e-commerce recommendation system, and 3) a neuroscience knowledge management system. We propose a customized query model, named Yoda, which can uniformly incorporate physical characteristics (data features), semantic meanings (attributes), and users' preferences during the query process. Moreover, Yoda also utilizes the users' relevance feedback to improve query results by using genetic algorithms (GA).

 
 

People Involved:



 

Publications:




YIMA

Continuous media (CM) servers are unique in that they must provide continuous, real-time delivery of media with low start-up latency to clients. In addition, CM servers must be able to support multiple concurrent clients. We have designed and developed Yima, a multi-node CM server that is built with commodity components and is fully compatible with industry standards. Yima distinguishes itself from other CM servers in the following: 1) elimination of single point of failure with its multi-node shared nothing architecture, 2) frame/sample accurate synchronization of several streams, 3) independence from the media format, 4) network-failure tolerance with selective re-transmission, and 5) efficient flow control with multi-threshold buffering to support VBR traffic.
Currently we are equipping Yima with a disk scaling component, SCADDAR, that allows on-line addition and/or removal of disks with zero down-time to scale up/down the storage capacity adaptively as required. The random block-placement mechanism employed by SCADDAR ensures uniform distribution of blocks among disks (for balanced load), while the number of re-distributed blocks is also minimal.

 
 

People Involved:



 

Publications:




SR

We developed a new shape-based object retrieval technique based on minimum bounding circles (MBCs). Our MBC-based method can be used for efficient retrieval of 2D objects by utilizing three different index structures on features that are extracted from the object’s MBC. Besides similarity retrieval, we also investigated how to support spatial queries (i.e. topological and direction queries) based on a subset of features extracted from the objects MBC. First, we focused on the support of topological relations as defined by the 9-intersection model. We proposed some filtering techniques based on special cases for MBC in order to reduce the number of false hits. Next, we proposed an extra filtering step to expedite direction relations between objects utilizing their MBCs based on cone-directions method. We evaluated and compared the performance of our method to different boundary based methods for shape representation and retrieval. The results showed that the similarity retrieval accuracy of our MBC-based method was better than other methods, while it has the lowest computation cost to generate the shape signatures of the objects. Moreover, it has low storage requirement, and a comparable computation cost to compute the similarity between two shape signatures. In addition, MBC-based method requires no normalization of the objects. Further more, our experiments showed the robustness of our method to the introduction of noise to the database and the query objects.

 
 

People Involved:



 

Publications: