The data archive (.zip) accompanying the instructions contains useful materials to help you submit your technical report.
In short: you will first contact the technical report maintainer for a technical report number. You then create the technical report according to the instructions above, plus a BibTeX entry (again, see the instructions) and an optional zip file containing any other data you wish to accompany the technical report. Submit these three items to the maintainer and we'll get your report posted.
GMU-CS-TR-2015-10
Obfuscation-Resilient, Efficient, and Accurate Detection and Family Identification of Android Malware
Joshua Garcia and Mahmoud Hammad and Bahman Pedrood and Ali Bagheri-Khaligh and Sam Malek
The number of Android malware apps is increasing rapidly. Simply detecting and removing malware apps is insufficient, since they can damage or alter other files, data, or settings; install additional applications; etc. To determine such behavior, a security engineer can benefit significantly from identifying the specific family to which an Android malware app belongs. Techniques for detecting Android malware and determining their families lack the ability to deal with obfuscations (i.e., transformations of applications to thwart detection). Moreover, some of the prior techniques are highly inefficient, making them inapplicable for real-time detection of threats. To address these limitations, we present a novel machine learning-based Android malware detection and family identification approach, RevealDroid, that provides selectable features. We assess RevealDroid to determine a selection of features that enable obfuscation resiliency, efficiency, and accuracy for detection and family identification. We assess RevealDroid's accuracy and obfuscation resilience on an updated dataset of malware from a diverse set of families, including malware obfuscated using various transformations, and compare RevealDroid against an existing Android malware-family identification approach and another Android malware detection approach.
GMU-CS-TR-2015-11
Discerning Machine Learning Degradation via Ensemble Classifier Mutual Agreement Analysis
Charles Smutz and Angelos Stavrou
Machine learning classifiers are a crucial component of modern malware and intrusion detection systems. However, past studies have shown that classifier-based detection systems are susceptible to evasion attacks in practice. Improving the evasion resistance of learning based systems is an open problem. In this paper, we analyze the effects of mimicry attacks on real-world classifiers. To counter such attacks, we introduce a novel approach that not only exposes attempts to evade the classifier, but also evaluates the ability of the system to operate reliably. We advance mutual agreement analysis for ensemble classifiers: during the detection operation, when a sufficient number of votes from individual classifiers disagree, the ensemble classifier prediction is shown to be unreliable. This method allows the detection of classifier evasion without additional external information. To evaluate our approach, we used thousands of both malicious and benign PDF documents. Applying our method to two recent evasion attack studies, we show that most evasion scenarios are detected. Furthermore, we show that our approach can be generalized to weaken the effectiveness of the Gradient Descent and Kernel Density Estimation attacks against Support Vector Machines (SVM) by creating an ensemble classifier using many base SVMs. We discovered that feature bagging is the most important property for enabling mutual agreement based evasion detection.
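As a rough illustration of the mutual-agreement idea only (not the classifiers, features, or thresholds used in the report), the Python sketch below flags an ensemble prediction as unreliable when too few base classifiers (e.g., SVMs trained on random feature subsets) agree with the majority vote; the agreement threshold is a placeholder.

```python
import numpy as np

def ensemble_predict_with_agreement(vote_matrix, agreement_threshold=0.75):
    """Majority-vote prediction plus a reliability flag based on mutual agreement.

    vote_matrix: (n_samples, n_classifiers) array of 0/1 votes from the base
    classifiers.  A prediction is flagged unreliable when the winning class
    receives less than `agreement_threshold` of the votes (placeholder value).
    """
    votes_for_positive = vote_matrix.mean(axis=1)          # fraction voting "malicious"
    prediction = (votes_for_positive >= 0.5).astype(int)   # simple majority vote
    # Agreement = share of base classifiers that side with the majority.
    agreement = np.where(prediction == 1, votes_for_positive, 1.0 - votes_for_positive)
    reliable = agreement >= agreement_threshold
    return prediction, agreement, reliable

# Toy usage: 5 documents scored by 10 base classifiers (random votes for illustration).
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(5, 10))
pred, agree, ok = ensemble_predict_with_agreement(votes)
print(pred, agree, ok)
```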
GMU-CS-TR-2015-12
Near-optimal Allocation of VMs from IaaS Providers by SaaS Providers
Arwa Aldhalaan and Daniel A. Menascé
Software as a Service (SaaS) allows companies and individuals to use software, hosted and managed by a SaaS provider, on a pay-per-use basis instead of paying the entire upfront, maintenance, and upgrade cost. SaaS providers can lease computing infrastructure from Infrastructure as a Service (IaaS) providers, also on a pay-per-use basis, to instantiate VMs that run their software services. SaaS customers can subscribe to and unsubscribe from a software service at any time. Thus, the SaaS cloud provider should dynamically determine the number of VMs needed to run software services as a function of demand, in a way that minimizes the SaaS cost of using VMs from an IaaS provider while guaranteeing an agreed-upon Quality of Service (QoS) to the SaaS customers. This report presents two heuristic techniques that can be used by SaaS providers to determine the type and quantity of VMs to be leased in order to best satisfy customer demands for software services. Our experiments showed that the number of states visited by the proposed method is very low (on the order of 10^-4 of the entire space) while the solution obtained is 2% more expensive in most cases, 13% more expensive in a few cases, and 31% more expensive in only one case, when compared with the optimal solution.
GMU-CS-TR-2015-13
A Taxonomy of Job Scheduling on Distributed Computing Systems
Raquel Lopes and Daniel Menascé
Hundreds of papers on job scheduling for distributed systems are published every year, and it is becoming increasingly difficult to classify them. Our analysis revealed that half of these papers are barely cited. This technical report presents a general taxonomy for scheduling problems and solutions in distributed systems. This taxonomy was used to classify, and make publicly available the classification of, 100 scheduling problems and their solutions. These 100 problems were further clustered into eight groups based on the features of the taxonomy. The proposed taxonomy will help researchers build on prior art, increase the visibility of new research, and minimize redundant effort.
GMU-CS-TR-2015-14
Decision Guidance Approach to Power Network Analysis: Developing an Optimization Technology Solution to Support Management and Operations of Electric Power Components
Roberto Levy and Alexander Brodsky and Juan Luo
This paper focuses on developing an approach and technology for actionable recommendations on the operation of components of an electric power network. The overall direction of this research is to model the major components of a Hybrid Renewable Energy System (HRES), including power generation, transmission/distribution, power storage, energy markets, and end customer demand (residential and commercial). First, we propose a conceptual diagram notation for power network topology, to allow the representation of an arbitrary complex power system. Second, we develop a formal mathematical model that describes the HRES optimization framework, consisting of the different network components, their respective costs, and associated constraints. Third, we implement the HRES optimization problem solution through a mixed-integer linear programming (MILP) model by leveraging IBM Optimization Programming Language (OPL) CPLEX Studio. Lastly, we demonstrate the model through an example of a simulated network, showing also the ability to support sensitivity / what-if analysis, to determine the behavior of the network under different configurations.
GMU-CS-TR-2014-1
Epochs: Trace-Driven Analytical Modeling of Job Execution Times
Daniel A. Menascé and Shouvik Bhardan
Queuing theory has extensively studied the problem of estimating job execution times in steady state conditions both in the case of single queues and queuing networks. This paper discusses the use of closed Queuing Network (QN) models during finite time intervals to estimate the execution time of jobs submitted to a computer system. More specifically, the paper presents the Epochs algorithm that allows job traces and randomly generated job inter-arrival times to be used as input to analytic models of computer systems. In other words, the paper combines methods used in discrete event simulation for the characterization of job arrivals with efficient analytic methods to model contention for resources in a computer system. The Epochs algorithm was validated against experimental results using jobs derived from a micro-benchmark and real jobs. The validation shows that the absolute relative error between measurements and execution time predictions obtained with the Epochs algorithm is below 10% in most cases and is at most 15%.
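The closed queuing network building block mentioned above can be solved with standard exact Mean Value Analysis (MVA); the hedged Python sketch below shows that textbook recursion for a single job class. It is not the Epochs algorithm itself, and the service demands in the example are made up.

```python
def exact_mva(service_demands, n_jobs, think_time=0.0):
    """Exact Mean Value Analysis for a single-class closed queuing network.

    service_demands: per-visit service demands D_k (seconds) at each queuing
    station; n_jobs: multiprogramming level N; think_time: delay (think) time Z.
    Returns system throughput X(N) and per-station residence times R_k(N).
    """
    K = len(service_demands)
    queue_len = [0.0] * K                                   # Q_k(0) = 0
    for n in range(1, n_jobs + 1):
        # Residence time at each station with n jobs in the system.
        residence = [service_demands[k] * (1.0 + queue_len[k]) for k in range(K)]
        throughput = n / (think_time + sum(residence))      # X(n)
        queue_len = [throughput * residence[k] for k in range(K)]  # Little's law
    return throughput, residence

# Made-up example: CPU and disk demands of 0.05 s and 0.08 s, 20 jobs, 1 s think time.
X, R = exact_mva([0.05, 0.08], n_jobs=20, think_time=1.0)
print(f"throughput={X:.3f} jobs/s, residence times={R}")
```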
GMU-CS-TR-2014-2
Trace Driven Analytic Modeling for Evaluating Schedulers for Clusters of Computers
Shouvik Bhardan and Daniel A. Menascé
Large enterprises use clusters of computers with varying computing power to process workloads that are heterogenous in terms of the type of jobs and the nature of their arrival processes. The scheduling of jobs from a workload has a significant impact on their execution times. This paper presents a trace-driven analytic model (TDAM) method that can be used to assess the impact of different schedulers on job execution times. The TDAM approach uses an implementation of the scheduler to generate job traces that are fed into analytic models of the computers in the cluster. These analytic models use closed queuing network methods to estimate congestion at the various nodes of the cluster. The paper demonstrates the usefulness of the TDAM method by showing how four different types of schedulers affect the execution times of jobs derived from well-known benchmarks. The paper also demonstrates how the method can be applied to computer clusters such as the ones used to run MapReduce jobs.
GMU-CS-TR-2014-3
Temporal Manufacturing Query Language (tMQL) for Domain Specific Composition, What-if Analysis, and Optimization of Manufacturing Processes With Inventories
Mohan Krishnamoorthy and Alexander Brodsky and Daniel A. Menascé
Smart manufacturing requires streamlining operations and optimizing processes at both a global and a local level. This paper considers manufacturing processes that involve physical or virtual inventories of products, parts, and materials that move from machine to machine. The inventory levels vary with time and are a function of the configuration settings of the machines involved in the process. These environments require analysis, e.g., answering what-if questions, and optimization to determine optimal operating settings for the entire process. The modeling complexities in performing these tasks are not always within the grasp of production engineers. To address this problem, the paper proposes the temporal Manufacturing Query Language (tMQL), which allows the composition of modular process models for what-if analysis and decision optimization queries. tMQL supports an extensible and reusable model knowledge base against which declarative queries can be posed. Additionally, the paper describes the steps to translate the components of a tMQL model into input data files used by a commercial optimization solver.
GMU-CS-TR-2014-4
Toward Reusable Models: System Development for Optimization Analytics Language (OAL)
Abdullah Alrazgan and Alexander Brodsky
Decision optimization has been broadly used in many areas including economics, finance, manufacturing, logistics, and engineering. However, optimization modeling used for decision optimization presents two major challenges. First, it requires expertise in operations research and mathematical programming, which most users, including software developers, engineers, business analysts, and end-users, typically do not have. Second, optimization models are usually highly customized, not modular, and not extensible. The Sustainable Process Analytics Formalism (SPAF) has recently been proposed at the National Institute of Standards and Technology (NIST) to address these challenges by transitioning from the redevelopment of optimization solutions from scratch toward maintaining an extensible library of analytical models against which declarative optimization queries can be posed. However, the development of practical algorithms and a system for SPAF presents a number of technical challenges, which have not been addressed. In this paper we (1) propose an object-oriented extension of SPAF, named the Optimization Analytics Language (OAL); (2) present OAL system development based on methods to reduce OAL models and queries into a formal mathematical programming formulation; (3) showcase OAL through a simple case study in the domain of manufacturing; and (4) conduct a preliminary experimental study to assess the overhead introduced by OAL.
GMU-CS-TR-2014-5
Building a Hierarchical Finite-State Automaton: A Primer
Sean Luke
This primer introduces concepts in hierarchical finite-state automata (HFA) and their use, gives a rough formalism, and then builds an entire HFA library, with all the trimmings, in Lua. The primer is intended to educate the reader on how HFA can be used, how they are programmed, and how HFA libraries are built.
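The report itself builds its HFA library in Lua; purely as a hedged illustration of the core idea (a state whose behavior is itself an automaton), here is a minimal Python sketch with invented state names that is not the report's API.

```python
class State:
    """A flat (leaf) state; it handles no symbols itself and defers to its parent."""
    def __init__(self, name):
        self.name = name
    def step(self, symbol):
        return None          # leaf states delegate transitions to the enclosing automaton
    def reset(self):
        pass

class Automaton(State):
    """A hierarchical state: behaves like a state but contains a sub-automaton."""
    def __init__(self, name, start, transitions):
        super().__init__(name)
        self.start = start                  # initial sub-state
        self.transitions = transitions      # {(sub-state name, symbol): next sub-state}
        self.current = start
    def reset(self):
        self.current = self.start
        self.current.reset()
    def step(self, symbol):
        # First let the current sub-state (possibly itself an Automaton) try the symbol.
        if self.current.step(symbol) is not None:
            return self                     # handled internally; stay in this super-state
        nxt = self.transitions.get((self.current.name, symbol))
        if nxt is not None:
            self.current = nxt
            self.current.reset()
            return self
        return None                         # not handled; bubble up to the parent

# Usage: a "patrol" super-state containing "go-left"/"go-right", nested in a top machine.
left, right, idle = State("go-left"), State("go-right"), State("idle")
patrol = Automaton("patrol", left, {("go-left", "bump"): right, ("go-right", "bump"): left})
top = Automaton("top", idle, {("idle", "start"): patrol, ("patrol", "stop"): idle})
for sym in ["start", "bump", "bump", "stop"]:
    top.step(sym)
    print(sym, "->", top.current.name)
```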
GMU-CS-TR-2014-6
Decision Guidance Analytics Language (DGAL): Toward Reusable Knowledge Base Centric Modeling
Alexander Brodsky and Juan Luo
Decision guidance systems are a class of decision support systems that are geared toward producing actionable recommendations, typically based on formal analytical models and techniques. This paper proposes the Decision Guidance Analytics Language (DGAL) for easy iterative development of decision guidance systems. DGAL allows the creation of modular, reusable and composable models that are stored in the analytical knowledge base independently of the tasks and tools that use them. Based on these unified models, DGAL supports declarative queries of (1) data manipulation, (2) what-if prediction analysis, (3) decision optimization based on mathematical and constraint programming, and (4) machine learning, all through formal reduction to specialized models and tools, and in the presence of uncertainty.
GMU-CS-TR-2014-7
Document Content Layout Based Exploit Protections
Charles Smutz and Angelos Stavrou
Malware-laden documents are a common exploit vector, especially in targeted attacks. Most current approaches seek to detect the malicious attributes of documents, whether through signature matching, dynamic analysis, or machine learning. We take a different approach: we perform transformations on documents that render exploits inoperable while leaving the visual interpretation of the document intact. Our exploit mitigation techniques are similar in effect to address space layout randomization, but we implement them through permutations of the document layout.
Using document content layout randomization, we randomize the data block layout of Microsoft OLE files in a manner similar to the inverse of a file system defragmentation tool. This relocates malicious payloads both in the original document file and in the memory of the reader program. We demonstrate that our approach indeed subdues in-the-wild exploits, by manually validating both Office 2003 and Office 2007 malicious documents, while the transformed documents continue to render benign content properly. The document transformation can be performed offline and requires only a single document scan, while the user-perceived delay when opening the transformed document is negligible. We also show that it is possible to thwart malicious heap sprays by injecting benign content into documents. This approach, however, comes at the expense of computational and memory resources.
GMU-CS-TR-2014-8
Convex Multi-task Relationship Learning using Hinge Loss
Anveshi Charuvaka and Huzefa Rangwala
Multi-task learning improves generalization performance by learning several related tasks jointly. Several methods have been proposed for multi-task learning in recent years. Many methods make strong assumptions about symmetric task relationships, while some are able to utilize externally provided task relationships. However, in many real-world tasks the degree of relatedness among tasks is not known a priori. Methods that can extract task relationships and exploit them while simultaneously learning models with good generalization performance can address this limitation. In the current work, we have extended a recently proposed method for learning task relationships, which uses the smooth squared loss for regression, to classification problems using the non-smooth hinge loss, due to the demonstrated effectiveness of the SVM classifier in single-task classification settings. We have also developed an efficient optimization procedure using bundle methods for the proposed multi-task learning formulation. We have validated our method on one simulated and two real-world datasets and have compared its performance to competitive baseline single-task and multi-task methods.
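For readers unfamiliar with the setup being extended, the objective below sketches a convex multi-task relationship learning formulation with the squared loss replaced by the hinge loss; the exact constraints and regularization constants used in the report may differ.

```latex
\min_{W,\,b,\,\Omega}\;
  \sum_{t=1}^{T}\sum_{i=1}^{n_t}
    \max\bigl(0,\; 1 - y_{ti}\,(\mathbf{w}_t^{\top}\mathbf{x}_{ti} + b_t)\bigr)
  \;+\; \frac{\lambda_1}{2}\,\mathrm{tr}\!\left(W W^{\top}\right)
  \;+\; \frac{\lambda_2}{2}\,\mathrm{tr}\!\left(W \Omega^{-1} W^{\top}\right)
\qquad \text{s.t.}\quad \Omega \succeq 0,\;\; \mathrm{tr}(\Omega) = 1
```

Here W = [w_1, ..., w_T] stacks the per-task weight vectors and Ω is the learned task-relationship (covariance) matrix; the bundle method mentioned in the abstract handles the non-smooth hinge-loss terms.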
GMU-CS-TR-2013-1
Parallel Model Listing and #3SAT Through Ordering Variables
Michael Connor and Fei Li
We consider listing and counting all the solutions M to a given Boolean 3SAT formula with n variables. This problem is known to be NP-hard (Garey and Johnson, 1979). A straightforward enumeration scheme to find M takes worst-case running time O(2^n). In this paper, we minimize the total running time and memory requirement by ordering the parallel assignments of values to the literals. Our objective is to establish theoretical bounds on an exponential running time O(a^b) such that a < 2 and b < n for certain Boolean formulas. We also conduct experiments to support our theoretical analysis.
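As a baseline for the kind of search being bounded (not the paper's algorithm), the sketch below enumerates all models of a 3-CNF formula under a chosen variable ordering, pruning a branch as soon as some clause is fully assigned and false; the formula and ordering are made up.

```python
def enumerate_models(clauses, order):
    """Enumerate all satisfying assignments of a CNF formula.

    clauses: list of clauses, each a list of signed integers (e.g. [1, -2, 3]
    means x1 OR NOT x2 OR x3).  order: the sequence in which variables are
    assigned; a good ordering lets falsified clauses prune branches early, so
    far fewer than 2^n partial assignments need to be explored.
    """
    models = []

    def consistent(assignment):
        for clause in clauses:
            decided = [lit for lit in clause if abs(lit) in assignment]
            satisfied = any(assignment[abs(lit)] == (lit > 0) for lit in decided)
            if len(decided) == len(clause) and not satisfied:
                return False                      # clause fully assigned and false: prune
        return True

    def recurse(i, assignment):
        if not consistent(assignment):
            return
        if i == len(order):
            models.append(dict(assignment))
            return
        var = order[i]
        for value in (False, True):
            assignment[var] = value
            recurse(i + 1, assignment)
            del assignment[var]

    recurse(0, {})
    return models

# Made-up formula: (x1 OR x2 OR x3) AND (NOT x1 OR NOT x2 OR x3), counted with order x3, x1, x2.
formula = [[1, 2, 3], [-1, -2, 3]]
print(len(enumerate_models(formula, order=[3, 1, 2])))   # number of models (#3SAT) -> 6
```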
GMU-CS-TR-2013-2
Lenience towards Teammates Helps in Cooperative Multiagent Learning
Liviu Panait and Keith Sullivan and Sean Luke
Concurrent learning is a form of cooperative multiagent learning in which each agent has an independent learning process and little or no control over its teammates' actions. In such learning algorithms, an agent's perception of the joint search space depends on the reward received by both agents, which in turn depends on the actions currently chosen by the other agents. The agents will tend to converge towards certain areas of the space because of their learning processes. As a result, an agent's perception of the search space may benefit if computed over multiple rewards at early stages of learning, but additional rewards have little impact towards the end. We thus suggest that agents should be lenient with their teammates: ignore many of the low rewards initially, and fewer rewards as learning progresses. We demonstrate the benefit of lenience in a cooperative co-evolution algorithm and in a new reinforcement learning algorithm.
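The following is a schematic sketch of the leniency idea only, assuming a simple value-update setting rather than the co-evolutionary and reinforcement learning algorithms evaluated in the report: low rewards are usually forgiven early on, and less so as a temperature decays.

```python
import random

def lenient_update(q, action, reward, temperature, learning_rate=0.1):
    """Leniently update the value estimate of one action.

    Early on (high temperature) the agent usually ignores rewards below its
    current estimate, attributing them to teammates still exploring; as the
    temperature decays it updates on more and more of the low rewards.
    """
    if reward < q[action] and random.random() < temperature:
        return                                    # forgive the low reward for now
    q[action] += learning_rate * (reward - q[action])

# Toy usage: one agent's estimates over 1000 plays of a 2-action coordination game.
q = [0.0, 0.0]
temperature = 1.0
for step in range(1000):
    action = random.randrange(2)
    teammate = random.randrange(2)
    reward = 1.0 if action == teammate else 0.0   # stand-in joint reward
    lenient_update(q, action, reward, temperature)
    temperature *= 0.995                          # become less lenient over time
print(q)
```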
GMU-CS-TR-2013-3
EPIC: Efficient Path-Independent Capabilities in Mobile Ad Hoc Networks --- Design, Performance, and Security
Eric Swankoski and Sanjeev Setia
In this paper, we investigate the issues that arise in the use of network capabilities to facilitate DoS prevention and mitigation in mobile ad hoc networks. Most proposals for capability-based protocols in wired networks depend upon routers on the path between the sender and receiver maintaining state that enables them to verify a capability. Frequent route changes make it necessary for any capability-based protocol for MANETs to re-establish this state as efficiently as possible. We propose EPIC, a method based on reverse-disclosure hash chains to support the establishment of path-independent capabilities and show how they can be efficiently maintained in a high mobility environment. Simulation results show that EPIC provides a 19.3% increase in performance and a 37.8% increase in efficiency over route-dependent capabilities.
GMU-CS-TR-2013-4
Conservative Collision Prediction Among Polygons with Unknown Motion
Yanyan Lu, Zhonghua Xi and Jyh-Ming Lien
Collision detection is a fundamental geometric tool for sampling-based motion planners. In contrast, collision prediction for scenarios in which the obstacles' motion is unknown is still in its infancy. This paper proposes a new approach to predicting collisions by assuming that obstacles are adversarial. Our new tool advances collision prediction beyond translational and disc robots; arbitrary polygons with rotation can be used to better represent obstacles and provide a tighter bound on the predicted collision time. Compared to an online motion planner that replans periodically at a fixed time interval, our experimental results provide strong evidence that the new method significantly reduces the number of replans while maintaining a higher success rate of finding a valid path.
GMU-CS-TR-2013-5
A Modular Approach to Model-Based Testing of Concurrent Programs
Richard Carver and Yu Lei
This paper presents a modular approach to testing concurrent programs that are modeled using labeled transition systems. Correctness is defined in terms of an implementation relation that is expected to hold between a model of the system and its implementation. The novelty of our approach is that the correctness of a concurrent software system is determined by testing the individual threads separately, without testing the system as a whole. We define a modular implementation relation on individual threads and show that modular relations can be tested separately in order to verify a (non-modular) implementation relation for the complete system. Empirical results indicate that our approach can significantly reduce the number of test sequences that are generated and executed during model-based testing.
GMU-CS-TR-2013-6
Semantic Segmentation of Urban Environments into Object and Background Categories
Cesar Cadena, Jana Kosecka
Advancements in robotic navigation, object search, and exploration rest to a large extent on robust, efficient, and more advanced semantic understanding of the surrounding environment. Since the choice of the most relevant semantic information depends on the task, it is desirable to develop approaches which can be adapted to different tasks at hand and which separate the aspects related to the surroundings from object entities. In the proposed work we present an efficient approach for detecting generic objects in urban environments from videos acquired by a moving vehicle by means of semantic segmentation. Compared to traditional approaches for semantic labeling, we strive to detect a variety of objects while avoiding the need for the large amounts of training data required for recognizing individual object categories and the visual variability within and across categories. In the proposed approach we exploit features providing evidence about widely available non-object categories (such as sky, road, and buildings) and use informative features that are indicative of the presence of object boundaries to gather evidence about objects. We formulate the object/non-object semantic segmentation problem in the Conditional Random Field framework, where the structure of the graph is induced by a minimum spanning tree computed over the 3D reconstruction, yielding an efficient algorithm for exact inference. We carry out extensive experiments on videos of urban environments acquired by a moving vehicle and compare our approach to existing alternatives.
GMU-CS-TR-2013-7
A Test Automation Language for Behavioral Models
Nan Li and Jeff Offutt
Model-based testers design tests in terms of models, such as paths in graphs. Abstract tests cannot be run directly because they use names and events that exist in the model, but not in the implementation. Testers usually solve this mapping problem by hand. Model elements often appear in many abstract tests, so testers write the same redundant code many times. This is time-consuming, labor-intensive, and error-prone. This paper presents a language to automate the creation of mappings from abstract tests to concrete tests. Three issues are addressed: (1) creating mappings and generating test values, (2) transforming graphs and using coverage criteria to generate test paths, and (3) solving constraints and generating concrete tests. The paper also presents results from an empirical comparison of testers using the mapping language against manual mapping on 17 open source and example programs. We found that the automated test generation method took a fraction of the time the manual method took, and that the manual tests contained numerous errors in which concrete tests did not match their abstract tests.
GMU-CS-TR-2013-8
Plagiarism
Sean Luke
Students are often confused about what constitutes plagiarism in academia and the seriousness with which the offense is treated. The objective of this essay is to explain the issue, provide examples of what plagiarism is, and hopefully serve as a useful guide.
GMU-CS-TR-2013-9
Future MASON Directions: Community Recommendations (Report of the 2013 MASON NSF Workshop)
Nicolas Payette and Marius Bujorianu and Glen Ropella and Ken Cline and Jeffrey Schank and Matt Miller and Sara Jonsson and Laszlo Gulyas and Richard Legendi and Olaf Bochmann and Luís de Sousa and Vlasios Voudouris and Daniil Kiose and Przemyslaw Szufel and Steve Saul and John McManus and Vittorio Scarano and Gennaro Cordasco and Chris Hollander and Paul Wiegand and Vera Kazakova and Brian Hrolenok and J. Daniel Rogers and Michael Schader and Sean Luke and Kenneth De Jong and Mark Coletti and Paul Schopf and Claudio Cioffi-Revilla and Keith Sullivan and Khaled Talukder and Ahmed Elmolla and Ermo Wei
MASON is an open source multiagent simulation library geared towards simulating very large numbers of relatively lightweight interacting agents. On June 15 and 16, 2013, approximately two dozen invitees convened at George Mason University to discuss future directions for MASON and needs of the MASON community. This meeting formed the 2013 MASON NSF Workshop. During the Workshop, participants identified nine areas of interest to the community where they felt MASON should be improved or extended. This document details the reports of those working groups.
GMU-CS-TR-2012-1
Dynamic RGB-D Mapping
Michael Paton and Jana Kosecka
Localization and mapping has been an area of great importance and interest to the robotics and computer vision community. It has traditionally been accomplished with range sensors such as lasers and sonars. Recent improvements in processing power coupled with advancements in image matching and motion estimation have allowed the development of vision-based localization techniques. Despite much progress, there are disadvantages to both range sensing and vision techniques, making localization and mapping that is inexpensive and robust hard to attain. With the advent of RGB-D cameras, which provide synchronized range and video data, localization and mapping can now exploit both range data and RGB features. This thesis exploits the strengths of vision and range-sensing localization and mapping strategies and proposes novel algorithms using RGB-D cameras. We show how to combine existing strategies and present a thorough evaluation of the resulting algorithms against a dataset of RGB-D benchmarks. Lastly, we demonstrate the proposed algorithm on a challenging indoor dataset and demonstrate improvements where either pure range sensing or vision techniques perform poorly.
GMU-CS-TR-2012-2
Mining the Execution History of a Software System to Infer the Best Time for its Adaptation
Kyle R. Canavera, Naeem Esfahani, and Sam Malek
An important challenge in dynamic adaptation of a software system is to prevent inconsistencies (failures) and disruptions in its operations during and after change. Several prior techniques have solved this problem with various tradeoffs. All of them, however, assume the availability of detailed component dependency models. This paper presents a complementary technique that solves this problem in settings where such models are either not available, difficult to build, or outdated due to the evolution of the software. Our approach first mines the execution history of a software system to infer a "stochastic component dependency model", representing the probabilistic sequence of interactions among the system's components. We then demonstrate how this model could be used at runtime to infer the "best time" for adaptation of the system's components. We have thoroughly evaluated this research on a multi-user real world software system and under varying conditions.
GMU-CS-TR-2012-3
Multiobjective Optimization of Co-Clustering Ensembles
Francesco Gullo and AKM Khaled Ahsan Talukder and Sean Luke and Carlotta Domeniconi and Andrea Tagarelli
Co-clustering is a machine learning task where the goal is to simultaneously develop clusters of the data and of their respective features. We address the use of co-clustering ensembles to establish a consensus co-clustering over the data. As is obvious from its name, co-clustering is naturally multiobjective. Previous work tackled the problem using both rudimentary multiobjective optimization and expectation maximization, then later a gradient ascent approach which outperformed both of them. In this paper we develop a new preference-based multiobjective optimization algorithm to compete with the gradient ascent approach. Unlike this gradient ascent algorithm, our approach once again tackles the co-clustering problem with multiple heuristics, but also applies the gradient ascent algorithm's joint heuristic as a preference selection procedure. As a result, we are able to significantly outperform the gradient ascent algorithm on feature clustering and on problems with smaller datasets.
GMU-CS-TR-2012-4
Adaptive Search for Interesting Things
William Squires and Sean Luke
The results of a parameter sweep over a multidimensional parameter space are often used to gain an understanding of the space beyond simply identifying optima. But sweeps are costly, and so it is highly desirable to adaptively sample the space in such a way as to concentrate precious samples on the more "interesting" areas of the space. In this paper we analyze and expand on a previous work which defined such areas as those in which the slope of some quality function was high. We develop a performance metric in terms of generalizability of the resulting sampled model, examine the existing method in terms of scalability and accuracy, and then propose and examine two population-based approaches which address some of its shortcomings.
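A minimal one-dimensional restatement of the slope-based idea the paper builds on (not the population-based methods it proposes): estimate the slope of the quality function on a coarse sweep, then draw extra samples in proportion to it. The test function and parameter values below are made up.

```python
import numpy as np

def adaptive_sample(f, low, high, n_coarse=21, n_extra=200, rng=None):
    """Concentrate 1-D samples where the slope of f appears largest.

    A coarse uniform sweep estimates |df/dx| by finite differences; extra
    samples are then drawn from the intervals with probability proportional
    to that slope estimate, so flat regions receive few additional points.
    """
    rng = rng or np.random.default_rng()
    xs = np.linspace(low, high, n_coarse)
    ys = np.array([f(x) for x in xs])
    slope = np.abs(np.diff(ys))                           # per-interval slope magnitude
    if slope.sum() > 0:
        weights = slope / slope.sum()
    else:
        weights = np.full(len(slope), 1.0 / len(slope))   # flat function: sample uniformly
    intervals = rng.choice(len(slope), size=n_extra, p=weights)
    extra = xs[intervals] + rng.random(n_extra) * (xs[1] - xs[0])
    return np.concatenate([xs, extra])

# Usage with a made-up function that is flat except for a steep transition near x = 0.7.
samples = adaptive_sample(lambda x: np.tanh(50 * (x - 0.7)), 0.0, 1.0)
print(np.histogram(samples, bins=10, range=(0.0, 1.0))[0])   # counts pile up near 0.7
```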
GMU-CS-TR-2012-5
Malicious PDF Detection Using Metadata and Structural Features
Charles Smutz and Angelos Stavrou
Owing to their versatile functionality and size, PDF documents have become a popular avenue for exploitation, ranging from broadcast phishing attacks to targeted attacks. In this paper, we present a method for efficiently detecting malicious documents based on feature extraction of document metadata and structure. We demonstrate that by carefully extracting a wide range of feature sets we can offer a robust malware classifier. Indeed, using multiple datasets containing an aggregate of 5,000 malicious documents and 100,000 benign ones, our classification rates are well above 99% while maintaining low false negatives (0.01% or less) for different classification parameters and scenarios. Surprisingly, we also discovered that by artificially reducing the influence of the top features we can achieve robust detection even in an adversarial setting where the attacker is aware of the top features and our normality model. Therefore, due to the large number of extracted features, a defense posture that employs detectors trained on random sets of features is robust even against mimicry attacks with intimate knowledge of the training set.
GMU-CS-TR-2012-6
Elucidating Activity-related Physico-chemical Features in Antimicrobial Peptides
Daniel Veltri and Amarda Shehu
The rise of drug-resistant bacteria has brought attention to antimicrobial peptides (AMPs) as targets for novel antibacterial drug research. Many machine learning methods aim to improve recognition of AMPs. Sequence-derived features are often employed in the context of supervised learning through Support Vector Machines (SVMs). This can be useful for expediently screening databases for AMP-like peptides. However, AMPs are characterized by great sequence diversity. Moreover, biological studies focusing on AMP modification and de novo design stand to benefit from computational methods capable of exposing underlying features important for activity at the level of individual amino-acid positions. We take the first steps in this direction by considering an extensive list of amino-acid physico-chemical features. We gradually narrow this list down to features relevant in the context of SVM classification. We focus on a specific AMP class, cathelicidins, due to the abundance of documented sequences, to improve their recognition over carefully designed decoy sequences. Analysis of the features important for the classification reveals interesting physico-chemical properties to preserve when modifying or designing novel AMPs in the wet laboratory.
GMU-CS-TR-2011-0
How to Make a Technical Report (Rev. 3)
Sean Luke
The following document describes the preferred formatting of technical reports and submission guidelines. This document itself is formatted according to the preferred formatting rules.
GMU-CS-TR-2011-1
GPU-Euler: Sequence Assembly using GPGPU
Syed Faraz Mahmood and Huzefa Rangwala
Advances in sequencing technologies have revolutionized the field of genomics by providing cost effective and high throughput solutions. In this paper, we develop a parallel sequence assembler implemented on general purpose graphic processor units (GPUs). Our work was largely motivated by a growing need in the genomic community for sequence assemblers and increasing use of GPUs for general purpose computing applications. We investigated the implementation challenges, and possible solutions for a data parallel approach for sequence assembly. We implemented an Eulerian-based sequence assembler (GPU-Euler) on the nVidia GPUs using the CUDA programming interface. GPU-Euler was benchmarked on three bacterial genomes using input reads representing the new generation of sequencing approaches. Our empirical evaluation showed that GPU-Euler produced lower run times, and showed comparable performance in terms of contig length statistics to other serial assemblers. We were able to demonstrate the promise of using GPUs for genome assembly, a computationally intensive task.
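GPU-Euler parallelizes graph construction and traversal on the GPU; the serial Python sketch below only illustrates the underlying de Bruijn graph construction that Eulerian assemblers start from, using made-up toy reads.

```python
from collections import defaultdict

def build_de_bruijn_graph(reads, k):
    """Build the de Bruijn graph used by Eulerian-based assemblers.

    Each read is cut into overlapping k-mers; an edge connects the (k-1)-mer
    prefix of a k-mer to its (k-1)-mer suffix.  Assembly then amounts to
    finding an Eulerian path through this graph.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy usage with made-up 7-bp "reads" and k = 4.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
for prefix, suffixes in build_de_bruijn_graph(reads, k=4).items():
    print(prefix, "->", suffixes)
```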
GMU-CS-TR-2011-2
Large Scale Empirical Analysis of Cooperative Coevolution
Sean Luke and Keith Sullivan and Faisal Abidi
We present a study of cooperative coevolution applied to moderately complex optimization problems in large-population environments. The study asks three questions. First: what collaboration methods perform best, and when? Second: how many subpopulations are desirable? Third: is it worthwhile to do more than one trial per fitness evaluation? We discovered that parallel methods tended to work better than sequential ones, that “shuffling” (a collaboration method) predominated in performance in more complex problems, that more subpopulations generally did better, and that more trials performed marginally better.
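Below is a hedged sketch of the "shuffling" collaboration method referred to above (not the experimental framework used in the study): each trial shuffles every subpopulation and evaluates same-index individuals as a team, crediting each member with the best joint reward it has seen.

```python
import random

def evaluate_with_shuffling(subpopulations, joint_fitness, trials=1):
    """Evaluate every individual by pairing it with shuffled collaborators.

    Each subpopulation evolves one component of the solution.  For each trial,
    every subpopulation is shuffled and individuals at the same index form a
    team; an individual's fitness is the best joint reward it has seen, so one
    evaluation per team covers all subpopulations at once.
    """
    size = len(subpopulations[0])
    fitness = [[float("-inf")] * size for _ in subpopulations]
    for _ in range(trials):
        orders = [random.sample(range(size), size) for _ in subpopulations]
        for slot in range(size):
            team = [pop[order[slot]] for pop, order in zip(subpopulations, orders)]
            reward = joint_fitness(team)
            for s, order in enumerate(orders):
                fitness[s][order[slot]] = max(fitness[s][order[slot]], reward)
    return fitness

# Toy usage: two subpopulations of scalars; the joint fitness rewards components summing to 10.
pops = [[random.uniform(0, 10) for _ in range(5)] for _ in range(2)]
scores = evaluate_with_shuffling(pops, lambda team: -abs(sum(team) - 10.0), trials=3)
print(scores)
```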
GMU-CS-TR-2011-3
Guided Exploration of the Architectural Solution Space in the Face of Uncertainty
Naeem Esfahani and Sam Malek
A system's early architectural decisions impact its properties (e.g., scalability, dependability) as well as stakeholder concerns (e.g., cost, time to delivery). Choices made early on are both difficult and costly to change, and thus it is paramount that the engineer gets them 'right'. This leads to a paradox, as in early design, the engineer is often forced to make these decisions under uncertainty, i.e., not knowing the precise impact of those decisions on the various concerns. How could the engineer make the 'right' choices in such circumstances? This is precisely the question we have tackled in this paper. We present GuideArch, a framework aimed at guiding the exploration of the architectural solution space under uncertainty. It provides techniques and tools that help the engineer to make informed decisions. The approach has been thoroughly evaluated in a project aimed at reengineering a mobile software system.
GMU-CS-TR-2011-4
Fast and Robust Generation of City-Scale Seamless 3D Urban Models
Yanyan Lu and Evan Behar and Stephen Donnelly and Jyh-Ming Lien and Fernando Camelli and David Wong
Since the introduction of the concept of “Digital Earth”, almost every major international city has been re-constructed in the virtual world. A large volume of geometric models describing urban objects has become freely available in the public domain via software like ArcGlobe and Google Earth. Although mostly created for visualization, these urban models can benefit many applications beyond visualization including city scale evacuation planning and earth phenomenon simulations. However, these models are mostly loosely structured and implicitly defined and require tedious manual preparation that usually takes weeks if not months before they can be used. Designing algorithms that can robustly and efficiently handle unstructured urban models at the city scale becomes a main technical challenge. In this paper, we present a framework that generates seamless 3D architectural models from 2D ground plans with elevation and height information. These overlapping ground plans are commonly used in the current GIS software such as ESRI ArcGIS and urban model synthesis methods to depict various components of buildings. Due to measurement and manual errors, these ground plans usually contain small, sharp, and various (nearly) degenerate artifacts. In this paper, we show both theoretically and empirically that our framework is efficient and numerically stable. Based on our review of the related work, we believe this is the first work that attempts to automatically create 3D architectural meshes for simulation at the city level. With the goal of providing greater benefit beyond visualization from this large volume of urban models, our initial results are encouraging.
GMU-CS-TR-2011-5
Hierarchical Multi-Robot Learning from Demonstration
Keith Sullivan and Sean Luke
Developing robot behaviors is a tedious task requiring multiple coding, trial, and debugging cycles. This makes attractive the notion of learning from demonstration, whereby a robot learns behaviors in real time from the examples of a demonstrator. Learning from demonstration can be problematic, however, because of the number of trials necessary to gather sufficient samples to learn correctly. The problem is compounded in a multi-robot setting due to the potentially much larger design space arising from the number of and interactions between the robots. In this paper, we propose a learning from demonstration system capable of rapidly training multiple robots to perform a collaborative task. Our supervised learning method applies user domain knowledge to decompose complex behaviors into a hierarchy of simpler behaviors, which are easier to train and learn, and require many fewer samples to do so. The system further reduces the state space by only considering environmental features and actions pertinent to each decomposed simple behavior. Decomposition occurs not only within individual robot behaviors but also at the hierarchical group behavior level. Experiments using Pioneer robots in a patrol scenario illustrate our system.
GMU-CS-TR-2011-6
Netgator: Malware Detection Through Program Interactive Proofs
Brian Schulte, Rhandi Martin, Haris Andrianakis, Angelos Stavrou
Exfiltration of data using internet-borne attacks has become a credible threat for organizations and enterprises. History has shown that crafted, targeted attacks and zero-day malware are capable of penetrating even the most sophisticated defenses. To make matters worse, intrusion detection systems that perform analysis of network traffic are dependent on the timely information provided by blacklisting, signature schemes, or anomaly patterns. This is especially true for heavily used communication protocols, where blocking decisions affect the everyday operations of the organization. Even when the intrusion is detected in a timely manner, valuable data might have already been stolen. In this paper, we propose a new approach to distinguish legitimate browser software from malware that generates identical network traffic signatures. Our system consists of two parts: a signature-based passive detection module, and a module that issues Program Interactive Challenges (PICs) to client applications. Initially, we perform passive detection utilizing content inspection techniques that analyze network header element order. Of course this is not enough: we demonstrate that passive detection can be easily subverted. To amend that, we introduce an interactive detection module, in the form of an inline network proxy, that challenges the detected browser, forcing it to exercise known, specific internal functionality. For browsers, we can leverage HTML, Flash, Javascript, and other common browser components to form challenges that malware will not be able to respond to due to lack of functionality. In contrast to other interactive challenges, our challenges are transparent to the user, allowing for a seamless browsing experience. We demonstrate the effectiveness of active challenges on thousands of real-world malware samples. Our results show that, depending on the deployment scenario, PICs incur no or minimal latency (0.35 seconds on average) per inspected object.
GMU-CS-TR-2011-7
SecureSwitch: BIOS-Assisted Isolation and Switch between Trusted and Untrusted Commodity OSes
Kun Sun and Jiang Wang and Fengwei Zhang and Angelos Stavrou
Protecting commodity desktop systems that run commercial operating systems (OS) without adversely impacting performance or usability remains an open problem. To make matters worse, the overall system security depends on desktop applications with complex code bases that perform multiple and inter-dependent tasks, often dictated by Internet-borne code. Recent research has indicated the need for context-dependent trustworthy environments where the user can segregate different activities in an effort to lower risk and safeguard private information. In this paper, we introduce a new BIOS-assisted mechanism for the secure generation and management of trusted execution environments, tailored to separate security-sensitive activities from untrusted ones. A key characteristic of our system is usability: the capability to quickly and securely switch between operating environments on a single physical machine without the need for any specialized hardware or extensive code modifications. Our goal was to eliminate any mutable, non-BIOS code sharing while reusing existing processor capabilities. We demonstrate that even if the untrusted OS becomes compromised, it cannot perform an exfiltration or inference attack on data or applications in the trusted OS. To avoid sophisticated attacks that fake a trusted environment, we provide a visible indication to the user about the current environment. Moreover, to alternate between environments, we require the user to physically press a button, an action that cannot be reproduced by software. Using our prototype, we measured the switching process to be approximately six seconds. This short switching time empowers the user to frequently and seamlessly switch between trusted and untrusted environments.
GMU-CS-TR-2011-8
An Analysis of System Management Mode (SMM)-based Integrity Checking Systems and Evasion Attacks
Jiang Wang and Kun Sun and Angelos Stavrou
System Management Mode (SMM) is an x86 processor feature designed to assist debugging for hardware manufacturers. Recent research has shown that SMM can also be used to protect the run-time integrity of software by invoking SMM to periodically check current system state and compare it with known pristine or trusted software states. Researchers and practitioners have claimed that any unauthorized state modification can be detected with an SMM-based system integrity-checking mechanism. In this paper, we demonstrate that all hardware-based, periodic integrity mechanisms can be defeated by a new class of attacks, which we refer to as “evasion attacks.” Such attacks use a compromised software stack to remove any attack traces before the integrity checks begin and to continue the execution of the malicious code after the integrity checks are completed. We detail two categories of evasion attacks: directly-intercepting System Management Interrupt (SMI) and indirectly-deriving SMI invocations. Finally, we measure the performance impact of our proof-of-concept prototypes for all of the attacks and present countermeasures for these attacks.
GMU-CS-TR-2011-9
Finding Critical Changes in Dynamic Configuration Spaces
Yanyan Lu and Jyh-Ming Lien
Given a motion planning problem in a dynamic but fully known environment, we propose the first roadmap-based method, called the critical roadmap, that has the ability to identify and exploit the critical topological changes of the free configuration space. Compared to existing methods that either ignore temporal coherence or only repair their roadmaps at fixed times, our method provides not only a more complete representation of the free (configuration-time) space but also a significant efficiency improvement. Our experimental results show that the critical roadmap method has a higher chance of finding solutions, and it is at least one order of magnitude faster than some well-known planners.
GMU-CS-TR-2010-1
Security Policy Cognizant Module Composition
Paul Seymer, Angelos Stavrou, Duminda Wijesekera, and Sushil Jajodia
Component-based software development and deployment is based on developing individual software modules that are composed on an as-needed basis. Such modules expose the computations they provide and their dependencies for providing these computations, resulting in the well-known requires-provides specification for modules. This paper provides a framework to combine modules that specify their requires-provides interfaces in a policy-dependent way. Our framework specifies policies as combinations of Constraint Logic Programming (CLP) based rules, and our policies can cover multiple aspects of compositions, such as security and quality of service. We apply our framework to specify Quality of Protection (QoP) and Quality of Service (QoS) policies. An example shows the applicability of our policy language to a teleconferencing application with multiple security and resource usage policies.
GMU-CS-TR-2010-2
Multiple Kernel Learning for Fold Recognition
Huzefa Rangwala
Fold recognition is a key problem in computational biology that involves classifying proteins sharing structural similarities into classes commonly known as folds. Recently, researchers have developed several efficient kernel-based discriminatory methods for fold classification using sequence information. These methods train one-versus-rest binary classifiers using well-optimized kernels from different data sources and techniques.
Integrating this vast amount of data in the form of kernel matrices is an interesting and challenging problem. The positive semidefinite property of the various kernel matrices makes it attractive to cast the task of learning an optimal weighting of several kernel matrices as a semidefinite programming optimization problem. We experiment with a previously introduced quadratically constrained quadratic optimization problem for kernel integration using 1-norm and 2-norm support vector machines. We integrate state-of-the-art profile-based direct kernels to learn an optimal kernel matrix. Our experimental results show a small but significant improvement in classification accuracy across the different fold classes. Our analysis illustrates the strength of two dominating kernels across different fold classes, which suggests the redundant nature of the kernel matrices being combined.
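Schematically, the kernel integration described above amounts to learning a convex combination of the base kernels (the precise optimization solved in the report is the quadratically constrained quadratic program cited there):

```latex
k_{\mu}(x, x') \;=\; \sum_{j=1}^{m} \mu_j\, k_j(x, x'),
\qquad \mu_j \ge 0, \qquad \sum_{j=1}^{m} \mu_j = 1
```

Learning the weights μ_j jointly with the SVM margin objective is what gives rise to the semidefinite or quadratically constrained formulations mentioned above.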
GMU-CS-TR-2010-3
Following a Large Unpredictable Group of Targets Among Obstacles
Christopher Vo and Jyh-Ming Lien
Camera control is essential in both virtual and real-world environments. Our work focuses on an instance of camera control called target following, and offers an algorithm, based on the ideas of monotonic tracking regions and ghost targets, for following a large coherent group of targets with unknown trajectories, among known obstacles. In multiple-target following, the camera's primary objective is to follow and maximize visibility of multiple moving targets. For example, in video games, a third-person view camera may be controlled to follow a group of characters through complicated virtual environments. In robotics, a camera attached to robotic manipulators could also be controlled to observe live performers in a concert, monitor assembly of a mechanical system, or maintain task visibility during teleoperated surgical procedures. To the best of our knowledge, this work is the first attempting to address this particular instance of camera control.
GMU-CS-TR-2010-4
Scalable and Robust Shepherding via Deformable Shapes
Joseph F. Harrison, Christopher Vo and Jyh-Ming Lien
In this paper, we present a new motion planning strategy for shepherding in environments containing obstacles. This instance of the group motion control problem is applicable to a wide variety of real life scenarios, such as animal herding, civil crowd control, and oil-spill cleanup. However, the problem is challenging in terms of scalability and robustness because it is dynamic, highly underactuated, and involves multi-agent coordination. Our previous work showed that high-level probabilistic motion planning algorithms combined with simple shepherding behaviors can be beneficial in situations where low-level behaviors alone are insufficient. However, inconsistent results suggested a need for a method that performs well across a wider range of environments. In this paper, we present a new method, called Deform, in which shepherds view the flock as an abstracted deformable shape. We show that our method is more robust than our previous approach and that it scales more effectively to larger teams of shepherds and larger flocks. We also show Deform to be surprisingly robust despite increasing randomness in the motion of the flock.
GMU-CS-TR-2010-5
HyperCheck: A Hardware-Assisted Integrity Monitor
Jiang Wang, Angelos Stavrou, and Anup Ghosh
Over the past few years, virtualization has been employed in environments ranging from densely populated cloud computing clusters to home desktop computers. Security researchers embraced virtual machine monitors (VMMs) as a new mechanism to guarantee deep isolation of untrusted software components. Unfortunately, their widespread adoption promoted VMMs as a prime target for attackers. In this paper, we present HyperCheck, a hardware-assisted tampering detection framework designed to protect the integrity of VMMs and, for some classes of attacks, the underlying operating system (OS). HyperCheck leverages the CPU System Management Mode (SMM), present in x86 systems, to securely generate and transmit the full state of the protected machine to an external server. Using HyperCheck, we were able to ferret out rootkits that targeted the integrity of both the Xen hypervisor and traditional OSes. Moreover, HyperCheck is robust against attacks that aim to disable or block its operation. Our experimental results show that HyperCheck can produce and communicate a scan of the state of the protected software in less than 40ms.
GMU-CS-TR-2010-6
Extra Buffer Resources Improving Competitiveness in Minimizing Energy Consumption
Zhi Zhang and Fei Li
In this paper, we consider energy management algorithms for scheduling jobs in power-scarce scenarios such as embedded computer systems and sensor networks. We focus on investigating the impact of buffer resources in minimizing the total energy cost in an online setting. The online algorithms do not make any assumptions about job arrivals; their worst-case performance is measured in terms of the competitive ratio, comparing them with optimal clairvoyant algorithms. We prove that with appropriate extra buffer space, an online algorithm can beat a weak optimal offline algorithm in terms of the total energy required. Our result helps to quantitatively estimate the optimal on-chip buffer resources to allocate in real-time systems with power constraints. We also present a lower bound on the competitive ratio achievable by any deterministic online algorithm.
GMU-CS-TR-2010-7
An Information Theoretic Approach for the Analysis of RNA and DNA Binding Sites
Sheng Li and Huzefa Rangwala
Proteins perform several critical biological processes by interacting with other macromolecules (DNA, RNA) and small molecules. Several computational approaches have been developed to determine protein interaction sites using sequence and structure features. Instead of building another ad hoc prediction algorithm, the purpose of this study is to understand the contribution of a residue in an RNA-binding event and compare it with the DNA-binding process. We evaluate several sequence- and structure-based features using mutual information theory. We show that solvent accessibility and profile-based features can be used for developing good protein-RNA binding site determination algorithms. We also recommend features that could discriminate between RNA and DNA binding sites. This work can be extended to understand protein-protein and protein-ligand interactions as well.
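A minimal sketch of the mutual-information scoring used to rank features, assuming a feature already discretized into bins; the feature values and labels in the example are invented.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Mutual information I(F; L) between a discretized feature and binding labels.

    Estimated from empirical joint and marginal frequencies:
    I(F;L) = sum over (f,l) of p(f,l) * log2( p(f,l) / (p(f) * p(l)) ).
    """
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    p_f = Counter(feature_values)
    p_l = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        p_fl = count / n
        mi += p_fl * math.log2(p_fl / ((p_f[f] / n) * (p_l[l] / n)))
    return mi

# Toy usage: an invented binned solvent-accessibility feature vs. binding labels.
feature = ["buried", "exposed", "exposed", "buried", "exposed", "buried"]
label   = [0,        1,         1,         0,        1,         0       ]
print(mutual_information(feature, label))   # 1.0 bit: the feature fully determines the label
```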
GMU-CS-TR-2010-8
Energy Management for Time-Critical Harvesting Wireless Sensor Networks
Bo Zhang, Robert Simon and Hakan Aydin
As Cyber-Physical Systems (CPSs) evolve, they will be increasingly relied upon to support time-critical monitoring and control activities. Further, many CPSs that utilize Wireless Sensor Networking (WSN) technologies will require the use of energy harvesting methods to extend their lifetimes. For this application class, there are currently few effective models that allow the simulation and analysis of new algorithms or system performance. To address this problem, we define a general-purpose WSN model to support a time-critical CPS system. We then present a set of Harvesting Aware Speed Selection (HASS) algorithms. Our technique maximizes the minimum energy reserve over all the nodes in the network, thus ensuring highly resilient performance under emergency or fault-driven situations. We present an optimal centralized solution, along with an efficient, fully distributed solution. We propose a CPS-specific experimental methodology, enabling us to evaluate our approach. Our experiments show that our algorithms yield significantly higher energy reserves than baseline methods.
GMU-CS-TR-2010-9
Evaluation of Short Read Metagenomic Assembly
Anveshi Charuvaka and Huzefa Rangwala
Advances in sequencing technologies have equipped researchers with the ability to sequence the collective genome of an entire microbial community, a practice commonly referred to as metagenomics. These microbes are omnipresent within the human body and in environments across the world. As such, characterizing and understanding their roles is crucial for improving human health and the environment. The problem of using short reads obtained from current next-generation sequencing technologies to assemble the genomes within a community sample is challenging for several reasons. In this study, we assess the performance of a state-of-the-art Eulerian-based graph assembler on a series of simulated datasets with varying complexity. We evaluate the feasibility of metagenomic assembly with reads restricted to 36 base pairs, as obtained from the Solexa/Illumina platform. We developed a pipeline to evaluate the quality of assembly based on contig length statistics and accuracy. We studied the effect of overlap parameters used for the metagenomic assembly and developed a clustering solution to pool the contigs obtained from different runs of the assembly algorithm, which allowed us to obtain longer contigs. We also computed an entropy/impurity metric to assess how mixed the assembled contigs were; ideally, a contig should be assembled from reads obtained from the same organism. We also compared the metagenomic assemblies to the best possible solution that could be obtained by assembling individual source genomes. Our results show that accuracy was better than expected for metagenomic samples with a few dominant organisms and was especially poor for samples containing many closely related strains.
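One of the standard contig length statistics referred to above is N50; a short sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """N50 contig statistic: the length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered * 2 >= total:
            return length

# Toy usage: 2,600 bp total; half is covered once the 600 bp contig is included.
print(n50([1000, 600, 500, 300, 200]))   # -> 600
```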
GMU-CS-TR-2010-10
Taming Uncertainty in Self-Adaptation through Possibilistic Analysis
Naeem Esfahani and Ehsan Kouroshfar and Sam Malek
Self-adaptation endows a software system with the ability to satisfy certain objectives by automatically modifying its behavior. While many promising approaches for the construction of self-adaptive software systems have been developed, the majority of them ignore the uncertainty underlying the adaptation decisions. This has been one of the key inhibitors to widespread adoption of self-adaptation techniques in risk-averse real-world settings. In this paper, we describe an approach, called POssIbilistic SElf-aDaptation (POISED), for tackling the challenge posed by uncertainty in making adaptation decisions. POISED builds on possibility theory to assess the positive and negative consequences of uncertainty. It makes adaptation decisions that result in the best range of potential behavior. We demonstrate POISED's application to the problem of improving a software system's quality attributes via runtime reconfiguration of its customizable software components. We have extensively evaluated POISED using a prototype of a robotic software system.
GMU-CS-TR-2010-11
A New Method for Mapping the Configuration Space Obstacles of Polygons
Evan Behar and Jyh-Ming Lien
Configuration space (C-space) plays an important role not only in motion planning but also in geometric modeling, shape and kinematic reasoning, and is fundamental to several basic geometric operations, such as continuous collision detection and generalized penetration depth estimation, that also find applications in motion planning, animation, and simulation. In this paper, we develop a new method for constructing the boundary of the C-space obstacles (C-obst) of polygons. This method is simpler to implement and often more efficient than the existing techniques. These advantages are provided by a new algorithm that allows us to extract the Minkowski sum from the reduced convolution of the input polygons. We also develop a method for estimating the generalized penetration depth by computing the distance between the query point and the C-obst surface.
GMU-CS-TR-2010-12
Dynamic Minkowski Sum of Convex Shapes
Evan Behar and Jyh-Ming Lien
Computing the Minkowski sums of rotating objects has always been done naively by re-computing every Minkowski sum from scratch. The correspondences between the Minkowski sums are typically completely ignored. We propose a method, called DYMSUM, that can efficiently update the Minkowski sums of rotating convex polyhedra. We show that DYMSUM is significantly more efficient than the traditional approach, in particular when the input polyhedra are large and when the rotation between frames is small. From our experimental results, we show that the computation time of the proposed method grows slowly with respect to the size of the input compared to the naive approach.
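As a point of reference, the following minimal Python sketch implements only the naive baseline mentioned above, re-computing a convex Minkowski sum from scratch as the convex hull of all pairwise vertex sums (shown in 2D with hypothetical example polygons); it does not reflect DYMSUM's incremental update.

import numpy as np
from scipy.spatial import ConvexHull

def naive_minkowski_sum(P, Q):
    # P, Q: (n, 2) and (m, 2) arrays of convex-polygon vertices
    sums = (P[:, None, :] + Q[None, :, :]).reshape(-1, 2)  # all n*m vertex sums
    hull = ConvexHull(sums)
    return sums[hull.vertices]  # hull vertices in counter-clockwise order

P = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]])              # a triangle
Q = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # a unit square
print(naive_minkowski_sum(P, Q))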
GMU-CS-TR-2010-13
Using Ontological Information to Enhance Responder Availability in Emergency Response
Paul Ngo and Duminda Wijesekera
Ensuring effective communications during emergencies is an important issue for any functional government. One way to address this issue is to ensure the availability of the key personnel capable of making the appropriate decisions and taking timely actions with sufficient resources. Many XML-based languages such as the Emergency Data Exchange Language (EDXL) and associated Common Alert Protocol (CAP) have been designed to provide a basis for such communications. To ensure that messages are delivered in a timely manner, we propose some role and task based ontological enhancements for these languages. We show by example how the ontological enhancements can be used to improve the availability of emergency personnel when the need arises.
GMU-CS-TR-2010-14
Fixed Length Representations of Sequence Data using a Variant of the Hidden Markov Model
Sam Blasiak and Huzefa Rangwala
Sequence classification is central to many practical problems within machine learning. Classification algorithms often center around the notion of a distance metric between examples. Unlike sequences, the Euclidean distance metric between vectors often has an intuitive meaning which transfers naturally to a meaning in the classification domain. Distance metrics between arbitrary pairs of sequences, however, can be harder to define because sequences can vary in length, and the information contained in the order of sequence elements is lost when standard distance metrics are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs Sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed length representation produced by these inference methods is useful for classifying proteins by structural taxonomy.
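For intuition only, the sketch below shows one simple way an HMM can map variable-length sequences to fixed-length vectors: run the standard scaled forward-backward recursions and use the average posterior state occupancy as the representation. The model parameters and sequences are hypothetical, and this is not the report's HMM variant or any of its three inference algorithms.

import numpy as np

def state_occupancy_vector(obs, pi, A, B):
    # Average posterior state occupancy of a discrete observation sequence.
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                      # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # scaled backward pass
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # posterior over states at each step
    return gamma.mean(axis=0)                  # one K-dimensional vector per sequence

pi = np.array([0.6, 0.4])                      # hypothetical 2-state, 3-symbol model
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(state_occupancy_vector([0, 1, 2, 2], pi, A, B))
print(state_occupancy_vector([2, 2, 0], pi, A, B))   # different lengths, same-size vectors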
GMU-CS-TR-2010-15
Detection of Communities and Bridges in Weighted Networks
Tanwistha Saha and Carlotta Domeniconi and Huzefa Rangwala
Traditional graph-based clustering methods group vertices into discrete non-intersecting clusters under the assumption that each vertex can belong to only a single cluster. On the other hand, recent research on graph-based clustering methods, applied to real-world networks (e.g., protein-protein interaction networks and social networks), shows overlapping patterns among the underlying clusters. For example, in social networks, an individual is expected to belong to multiple clusters (or communities), rather than strictly confining himself/herself to just one. As such, overlapping clusters enable better models of real-life phenomena. Soft clustering (e.g., fuzzy c-means) has been used with success for non-graph data, when the objects are allowed to belong to multiple clusters with a certain degree of membership. In this paper, we propose a fuzzy clustering based approach for community detection in a weighted graphical representation of social and biological networks, for which the ground truth associated with the nodes is available. We compare our results with a baseline method for both multi-labeled and single-labeled datasets.
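To make the soft-membership idea concrete, here is a minimal fuzzy c-means sketch on plain feature vectors with hypothetical data; the report itself works on weighted graph representations of networks, so this shows only the generic membership and center updates that such approaches build on.

import numpy as np

def fuzzy_cmeans(X, k, m=2.0, iters=100, seed=0):
    # Returns an (n, k) matrix of soft memberships; each row sums to one.
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))            # random initial soft assignment
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]            # membership-weighted centers
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / (D ** (2.0 / (m - 1.0)))                # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])  # two hypothetical groups
print(fuzzy_cmeans(X, k=2).round(2)[:3])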
GMU-CS-TR-2010-16
GeoMason: Geospatial Support for MASON
Keith Sullivan and Mark Coletti and Sean Luke
MASON is a free, open-source Java-based discrete event multi-agent simulation toolkit that has been used to model network intrusions, unmanned aerial vehicles, nomadic migrations, and farmer/herder conflicts, among others. Many multi-agent models use georeferenced data, which represent such things as road networks, rivers, vegetation coverage, population, and topology. However, MASON does not directly support georeferenced data. Therefore, practitioners using MASON must hand-craft such support, which may be difficult and error-prone. In this paper we describe newly added geospatial functionality in MASON that addresses this problem. We discuss the design of this functionality, called GeoMASON, and its use and limitations. Finally, we give examples of how to import and manipulate georeferenced data.
GMU-CS-TR-2010-17
Which Should We Try First? Ranking Information Resources through Query Classification
Joshua Church and Amihai Motro
Users seeking information in distributed environments of large numbers of disparate information resources are often burdened with the task of repeating their queries for each and every resource. Invariably, some of the searched resources are more productive (yield more useful documents) than others, and it would undoubtedly be useful to try these resources first. If the environment is federated and a single search tool is used to process the query against all the disparate resources, then a similar issue arises: which information resources should be searched first, to guarantee that useful answers are streamed to users in a timely fashion? In this paper we propose a solution that incorporates techniques from text classification, machine learning, and information retrieval. Given a set of pre-classified information resources and a keyword query, our system suggests a relevance ordering of the resources. The approach has been implemented in prototype form, and initial experimentation has given promising results.
GMU-CS-TR-2010-18
Predicting Network Response Times Using Social Information
Chen Liang and Sharath Hiremagalore and Angelos Stavrou and Huzefa Rangwala
Social networks and discussion boards have become a significant outlet where people communicate and express their opinion freely. Although the social networks themselves are usually well-provisioned, the participating users frequently point to external links to substantiate their discussions. Unfortunately, the sudden heavy traffic load imposed on the external, linked web sites causes them to become unresponsive, leading to what people call the “Flash Crowds” effect. Flash Crowds present a real challenge, as it is not possible to predict their intensity and occurrence time. Moreover, although increasingly capable, most present-day web hosting servers and caching systems are designed to handle a nominal load of requests before they become unresponsive. This can happen either due to the limited bandwidth or processing power allocated to the hosting site. In this paper, we quantify the prevalence of flash crowd events for a popular social discussion board (Digg). Using PlanetLab, we measured the response times of 1289 unique popular websites. We were able to verify that 89% of the popular URLs suffered variations in their response times. In an effort to identify flash crowds ahead of time, we evaluate and compare traffic forecasting mechanisms. We show that predicting network traffic using network measurements has very limited success and cannot be used for large-scale prediction. However, by analyzing the content and structure of the social discussions, we were able to classify 86% of the popular web sites within 5 minutes of their submission and 95% of the sites when more (5 hours) of social content became available. Our work indicates that we can effectively leverage social activity to forecast network events that would otherwise be infeasible to anticipate.
GMU-CS-TR-2010-19
TAC-ELM: Metagenomic Taxonomic Classification with Extreme Learning Machines
Zeehasham Rasheed and Huzefa Rangwala
Next-generation technologies have allowed researchers to determine the collective genomes of all organisms within specific environments or communities. Varying species abundance, lengths, and complexities within different communities, coupled with the discovery of new species, make the problem of taxonomic assignment to short DNA sequence reads extremely challenging. We have developed a new sequence composition-based taxonomic classifier, TAC-ELM, for metagenomic analysis. TAC-ELM uses the framework of extreme learning machines to quickly and accurately learn the weights for a neural network model, with input features consisting of GC content and oligonucleotides. TAC-ELM is evaluated on two standard metagenomic benchmarks with sequence read lengths reflecting the traditional and current technologies. Our empirical results indicate the strength of the developed approach, which outperforms state-of-the-art taxonomic classifiers in terms of accuracy, training time, and implementation complexity. We also perform experiments that evaluate the pervasive case within metagenome analysis, where a species may not have been previously sequenced or discovered and will not exist in the reference genome databases.
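For readers unfamiliar with extreme learning machines, the sketch below shows the basic training step that makes them fast: hidden-layer weights are random and fixed, and only the output weights are solved in closed form. The feature matrix and labels are hypothetical placeholders, not TAC-ELM's GC-content and oligonucleotide features.

import numpy as np

def train_elm(X, Y, hidden=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))   # random input-to-hidden weights, never trained
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)                      # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y                # least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(1)
X = rng.random((100, 5))                        # hypothetical sequence features
Y = np.eye(3)[rng.integers(0, 3, 100)]          # hypothetical one-hot taxonomic labels
W, b, beta = train_elm(X, Y)
print(predict_elm(X[:3], W, b, beta).argmax(axis=1))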
GMU-CS-TR-2010-20
Ruminate: A Scalable Architecture for Deep Network Analysis
Charles Smutz and Angelos Stavrou
Traditionally, Network Intrusion Detection Systems (NIDS) inspect packet header and payload data for malicious content. While each system is different, most NIDS perform limited analysis on network streams and network protocols. Unfortunately, current NIDS are typically susceptible to evasion through network protocol encoding, such as base64 encoding of SMTP/MIME or gzip compression of HTTP. In addition, malicious desktop application payloads (e.g., PDF documents, Flash multimedia files) are beyond the inspection capabilities of popular NIDS. To address these limitations, we introduce Ruminate, a scalable object-centric traffic inspection and analysis architecture. Ruminate provides a distributed platform for deep analysis of network payload content. This includes full decoding of network protocols and recursive extraction of client application objects transferred over the network. While traditional NIDS utilize static packet load balancing to provide scalability, Ruminate employs dynamic load distribution of reassembled network streams and embedded objects, outsourcing the heavy processing to other processors or connected hosts. Therefore, high latency or computationally expensive analysis can be performed on commodity servers. Furthermore, our approach empowers system administrators to provision resources and preferentially treat traffic not only depending on the packet header but also on the data objects it carries. To achieve this, each object inspection algorithm is implemented as a separate component or service offered through a highly scalable producer-consumer architecture. We demonstrate using real-world traffic that our load balancing is far superior to existing techniques. This is because its granularity depends on the reconstructed objects rather than packet or simple stream analysis. Unlike existing systems, Ruminate can prevent NIDS evasion that leverages encoding or compression of malicious objects in network protocols, desktop application file formats, or encapsulation within other objects.
GMU-CS-TR-2009-1
Scheduling Weighted Packets with Deadlines Over a Fading Channel
Fei Li and Zhi Zhang
We consider scheduling weighted packets with time constraints over a fading channel. Packets arrive at the transmitter in an online manner. Each packet has a value and a hard deadline by which it should be sent. The fade state of the channel determines the throughput obtained per unit of time and the channel's quality may change over time. In this paper, we design both offline and online algorithms to maximize weighted throughput, which is defined as the total value of the packets sent by their respective deadlines. We first present polynomial-time exact offline algorithms for this problem. Then, we present online algorithms and their competitive analysis as well as the lower bounds of competitive ratios. Our work is the first one addressing weighted throughput for this important problem in the areas of information theory and communications.
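As background for the problem setup, the following toy sketch implements a plain greedy rule, ignoring the fading channel entirely: in each unit time slot, transmit the most valuable pending packet whose deadline has not passed. The instance is hypothetical, and the report's offline and online algorithms additionally account for the channel state.

def greedy_schedule(packets, horizon):
    # packets: list of (arrival, deadline, value); one packet may be sent per time slot
    pending, total, sent = [], 0.0, []
    for t in range(horizon):
        pending += [p for p in packets if p[0] == t]   # new arrivals at time t
        pending = [p for p in pending if p[1] >= t]    # drop packets whose deadline has passed
        if pending:
            best = max(pending, key=lambda p: p[2])    # highest-value pending packet
            pending.remove(best)
            total += best[2]
            sent.append((t, best))
    return total, sent

pkts = [(0, 1, 3.0), (0, 2, 5.0), (1, 1, 4.0), (2, 3, 1.0)]   # (arrival, deadline, value)
print(greedy_schedule(pkts, horizon=4))    # weighted throughput 10.0 in this toy instance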
GMU-CS-TR-2009-2
Geometry-based Computation of Symmetric Homo-oligomeric Protein Complexes
Christopher Miles and Brian Olson and Amarda Shehu
The need to engineer novel therapeutics and functional materials is driving the in-silico design of molecular complexes. This paper proposes a method to compute symmetric homo-oligomeric protein complexes when the structure of the replicated protein monomer is known and rigid. The relationship between the structure of a protein and its biological function brings the in-silico determination of protein structures associated with functional states to the forefront of computational biology. While protein complexes, arising from associations of protein monomers, are pervasive in every genome, determination of their structures remains challenging. Given the difficulty in computing structures of a protein monomer, computing arrangements of monomers in a complex is mainly limited to dimers. A growth in the number of protein complexes studied in wet labs is allowing classification of their structures. A recent database shows that most naturally-occurring protein complexes are symmetric homo-oligomers. The method presented here exploits this database to propose structures of symmetric homo-oligomers that can accommodate spatial replications of a given protein monomer. The method searches the database for documented structures of symmetric homo-oligomers where the replicated monomer has a geometrically-similar structure to that of the input protein monomer. The proposed method is a first step towards the in-silico design of novel protein complexes that upon further refinement and characterization can serve as molecular machines or fundamental units in therapeutics or functional materials.
GMU-CS-TR-2009-3
Towards Session-aware RBAC Administration and Enforcement with XACML
Min Xu and Duminda Wijesekera and Xinwen Zhang and Deshan Cooray
An administrative role-based access control (ARBAC) model specifies administrative policies over a role-based access control (RBAC) system, where an administrative permission may change an RBAC policy by updating permissions assigned to roles, or assigning/revoking users to/from roles. Consequently, enforcing ARBAC policies over an active access controller while some users are using protected resources would result in conflicts: a policy may be in effect in the RBAC system while being updated by an ARBAC operation. Towards solving this concurrency problem, we propose a session-aware administrative model for RBAC. We show how the concurrency problem can be resolved by enhancing the eXtensible Access Control Markup Language (XACML) reference implementation. In order to do so, we develop an XACML-ARBAC profile to specify ARBAC policies, and enforce these polices by building an ARBAC enforcement module and a session administrative module. The former synchronizes with the evaluation of access control requests. The latter revokes conflicting ongoing user sessions immediately prior to enforcing administrative operations. Experiments show reasonable performance characteristics of our initial enhancement to Sun's reference implementation.
GMU-CS-TR-2009-4
A Control Architecture for Autonomous Mobile Robotics
Randall Steck
The field of mobile robotics is on the forefront of robotics research around the world. Control architectures for complex autonomous mobile robots have largely settled on hybrid architectures for their suitability at dealing with the opposing forces of planning and reactivity. We present a general, heterogeneous 3-Tier hybrid architecture for control of an autonomous mobile robot and discuss an implementation in the domain of campus navigation. The architecture features a useful organization structure for high-level skills and offers flexible construction options for low-level behavior hierarchies.
GMU-CS-TR-2009-5
Behavior-Based Motion Planning for Group Control
Christopher Vo and Joseph F. Harrison and Jyh-Ming Lien
Despite the large body of work in both motion planning and multi-agent simulation, little work has focused on the problem of planning motion for groups of robots using external "controller" agents. We call this problem the group control problem. This problem is complex because it is highly underactuated, dynamic, and requires multi-agent cooperation. In this paper, we present a variety of new motion planning algorithms based on EST, RRT, and PRM methods for shepherds to guide flocks of robots through obstacle-filled environments. We show using simulation on several environments that under certain circumstances, motion planning can find paths that are too complicated for naive "simulation only" approaches. However, inconsistent results indicate that this problem is still in need of additional study.
GMU-CS-TR-2009-6
Towards a Universal Text Classifier: Transfer Learning using Encyclopedic Knowledge
Pu Wang and Carlotta Domeniconi and Jian Hu
Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available, and often too expensive to obtain. In this work, we propose a universal text classifier, which does not require any labeled training document. Our approach simulates the capability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, under the guidance of a few words, which we call discriminant words, describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. Wikipedia's articles related to the specific problem domain at hand are selected, and used during the learning process for predicting labels of test documents. The universal text classifier can also be used to perform document retrieval, in which the pool of test documents may or may not be relevant to the topics of interest for the user. In our experiments with real data we test the feasibility of our approach for both the classification and retrieval tasks. The results demonstrate the advantage of incorporating background knowledge through Wikipedia, and the effectiveness of modeling such knowledge via probabilistic topic modeling. The accuracy achieved by the universal text classifier is comparable to that of a supervised learning technique for transfer learning.
GMU-CS-TR-2009-7
Digging Digg: Comment Mining, Popularity Prediction, and Social Network Analysis
Salman Jamali and Huzefa Rangwala
Using comment information available from Digg we define a co-participation network between users. We focus on the analysis of this implicit network, and study the behavioral characteristics of users. Using an entropy measure, we infer that users at Digg are not highly focused and participate across a wide range of topics. We also use the comment data and social network derived features to predict the popularity of online content linked at Digg using a classification and regression framework. We show promising results for predicting the popularity scores even after limiting our feature extraction to the first few hours of comment activity that follows a Digg submission.
GMU-CS-TR-2009-8
Towards Generalized Trust in Smart Spaces
Dalal A. Al-Arayed and João P. Sousa
Existing work in trust management does not satisfactorily provide a combination of functionality and generality. A model that sufficiently addresses the aspects of trust management and is general enough to be applicable in multiple problem scenarios is needed. This paper presents a base trust model that can be used in multiple problem scenarios (content management, service provision, and routing). In addition, smart spaces introduce new issues that are not addressed sufficiently by the available trust management models. The base trust model is designed to be simple to facilitate its enrichment to accommodate the additional requirements of Smart Spaces in the future.
GMU-CS-TR-2008-0
How to Make a Technical Report (Rev. 2)
Sean Luke
This document has been superseded by GMU CS Department Technical Report GMU-CS-TR-2011-0.
The following document describes the preferred formatting of technical reports and submission guidelines. This document itself is formatted according to the preferred formatting rules.
GMU-CS-TR-2008-1
Algorithms for Leximin-Optimal Fair Policies in Repeated Games
Gabriel Balan and Sean Luke and Dana Richards
Solutions to non-cooperative multiagent systems often require achieving a joint policy which is as fair to all parties as possible. There are a variety of methods for determining the fairest such joint policy. One approach, min fairness, finds the policy which maximizes the minimum average reward given to any agent. We focus on an extension, leximin fairness, which breaks ties among candidate policies by choosing the one which maximizes the second-to-minimum average reward, then the third-to-minimum average reward, and so on. This method has a number of advantages over others in the literature, but has so far been little-used because of the computational cost in employing it to find the fairest policy. In this paper we propose a linear programming based algorithm for computing leximin fairness in repeated games which has a polynomial time complexity given certain reasonable assumptions.
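To illustrate the linear-programming flavor of the approach, here is a sketch of only the first, min-fairness stage: choose a distribution over joint actions that maximizes the minimum per-agent expected reward. Full leximin fairness would then iterate, fixing tight constraints and optimizing the next-worst-off agent; the reward matrix below is hypothetical.

import numpy as np
from scipy.optimize import linprog

def maxmin_policy(R):
    # R: (agents x joint actions) matrix of average rewards.
    n_agents, n_actions = R.shape
    c = np.zeros(n_actions + 1); c[-1] = -1.0             # variables [x, t]; maximize t
    A_ub = np.hstack([-R, np.ones((n_agents, 1))])        # t - R_i @ x <= 0 for every agent i
    b_ub = np.zeros(n_agents)
    A_eq = np.zeros((1, n_actions + 1)); A_eq[0, :n_actions] = 1.0   # x is a distribution
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_actions + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_actions], res.x[-1]

R = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.6]])                            # hypothetical 2-agent game
x, worst = maxmin_policy(R)
print(x.round(3), round(worst, 3))                         # the worst-off agent gets 0.6 here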
GMU-CS-TR-2008-2
Long-term Fairness with Bounded Worst-Case Losses
Gabriel Balan and Dana Richards and Sean Luke
How does one repeatedly choose actions so as to be fairest to the multiple beneficiaries of those actions? We examine approaches to discovering sequences of actions for which the worst-off beneficiaries are treated maximally well, then secondarily the second-worst-off, and so on. We formulate the problem for the situation where the sequence of action choices continues forever; this problem may be reduced to a set of linear programs. We then extend the problem to situations where the game ends at some unknown finite time in the future. We demonstrate that computing an optimal solution is NP-hard, and present two good approximation algorithms.
GMU-CS-TR-2008-3
Life After Self-Healing: Assessing Post-Repair Program Behavior
Michael E. Locasto and Angelos Stavrou and Gabriela F. Cretu
One promising technique for defending software systems against vulnerabilities involves the use of self-healing. Such efforts, however, carry a great deal of risk because they largely bypass the cycle of human-driven patching and testing used to vet both vendor and internally developed patches. In particular, it is difficult to predict if a repair will keep the behavior of the system consistent with "normal" behavior. Assuring that post-repair behavior does not deviate from normal behavior is a major challenge to which no satisfactory solutions exist. We investigate the feasibility of automatically measuring behavioral deviations in software that has undergone a self-healing repair. We provide a first examination of the problem of assessing a repair's impact on execution behavior, and we define a model for representing post-repair behavior using machine-level intraprocedural control flow. In essence, we advocate performing anomaly detection on an application's execution after it has been attacked and subsequently repaired. Our goal, however, is not to detect an attack, but rather to provide a tool for assisting system administrators in performing vulnerability triage. Our system can help them discover the relative impact of a repair so that they can begin to track down and analyze the cause of post-repair execution anomalies.
GMU-CS-TR-2008-4
A Simple Proof of the 3-Competitiveness of the Greedy Algorithm for Scheduling Weighted Packets
Fei Li
We provide a simplified proof of the 3-competitiveness of the greedy algorithm for scheduling weighted packets in the multi-FIFO buffer model. Azar and Richter provided a proof using the zero-one principle (Azar and Richter, STOC 2004). We use a different approach and we hope our approach can lead to an improved FIFO buffering algorithm.
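For concreteness, the greedy rule analyzed here can be sketched as follows (online arrivals, buffer capacities, and the competitive analysis itself are omitted): at each transmission step, send the packet at the head of the FIFO buffer whose head packet has the largest weight.

from collections import deque

def greedy_multi_fifo(buffers):
    # buffers: lists of packet weights, each kept in FIFO order
    queues = [deque(b) for b in buffers]
    order = []
    while any(queues):
        q = max((q for q in queues if q), key=lambda q: q[0])   # queue with the heaviest head packet
        order.append(q.popleft())
    return order

print(greedy_multi_fifo([[5, 1], [3, 9], [4]]))   # transmits weights in the order [5, 4, 3, 9, 1]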
GMU-CS-TR-2007-1
Modeling Diffusion in a Discrete Environment
Adrian Grajdeanu
The present report details a model for implementing diffusion in a discrete space. Formulated in 2004 in support of artificial development experiments, the model is mathematically justified starting from the diffusion equation. It is physically plausible and handles differently shaped and sized diffusion neighborhoods well, in the sense that it achieves isotropic diffusion in spite of the bias introduced by the discretizing grid. It provides one formulation that encapsulates the diffusion neighborhood details, rendering it applicable to linear, planar, spatial, and even n-dimensional constructs.
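For orientation, the following sketch shows the textbook explicit diffusion update on a 2D grid with a 4-neighbor Laplacian and periodic boundaries; it is not the report's neighborhood-general formulation that corrects for grid-induced anisotropy, and the coefficient and grid size are arbitrary.

import numpy as np

def diffuse(u, D=0.2, steps=100):
    # Explicit 5-point-stencil diffusion; stable for D <= 0.25, mass-conserving with periodic boundaries.
    for _ in range(steps):
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
        u = u + D * lap
    return u

u0 = np.zeros((32, 32)); u0[16, 16] = 1.0   # a point source
u = diffuse(u0)
print(round(float(u.sum()), 6), round(float(u.max()), 6))   # total mass stays 1.0 while the peak spreads out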
ISE-TR-07-01
Graphical Composition of State-Dependent Use Case Behavioral Models
Jon Whittle and Ana Moreira and João Araújo and Rasheed Rabbi
Maintaining a clear separation of concerns throughout the software lifecycle has long been a goal of the software community. Concerns that are separated, however, must be composed at some point. This paper presents a technique for keeping state-dependent use cases separate throughout the software modeling process and a method for composing state-dependent use cases. The composition method is based on the graph transformations formalism. This provides a rich yet user-friendly way of composing state-dependent use cases that is built on solid foundations. To evaluate our approach, it has been applied to seven student design solutions. Each solution was originally developed using a use case-driven methodology and was reengineered to evaluate whether our technique could have been applied. The findings are that it is possible to maintain the separation of state-dependent use cases during software design and, furthermore, that expressive model composition methods are necessary to do this in practice.
ISE-TR-07-02
Evolving VirtuE
Alessandro D'Atri and Amihai Motro
One of the most attractive aspects of virtual databases is their agility: the inherent ability to adapt and evolve in response to changing market conditions. Evolving VirtuE is a formal framework within which such agility can be realized. Through the concepts of enterprise time, activity logging, and log mining, the recent behavior and performance of an enterprise may be studied, and corresponding evolutionary steps can be induced. These steps may be intended to benefit the operation of individual enterprise members, as well as the enterprise as a whole. In addition, we also examine enterprise creation, a period of rapid evolution that concludes when the enterprise reaches stability and begins transacting its business activities.
ISE-TR-07-03
XACML Policies for Exclusive Resource Usage
Vijayant Dhankhar and Saket Kaushik and Duminda Wijesekera
The eXtensible Access Control Markup Language (XACML) is the standard access control policy specification language of the World Wide Web. XACML does not provide exclusive access to global resources; we add this capability by enhancing the policy execution framework with locking.
ISE-TR-07-04
Grid-VirtuE: A Layered Architecture for Grid Virtual Enterprises
Alfredo Cuzzocrea and Alessandro D'Atri and Andrea Gualtieri and Amihai Motro and Domenico Saccà
A grid virtual enterprise is a community of independent enterprises concerned with a particular sector of the economy. Its members (nodes) are small or medium size enterprises (SME) engaged in bilateral transactions. An important principle of a grid virtual enterprise is the lack of any global "guiding force" with each member of the community making its own independent decisions. In this paper we describe Grid-VirtuE, a three-layer architecture for grid virtual enterprises. The top layer of the architecture, representing its ultimate purpose, is an environment in which grid virtual enterprises can be modeled and implemented. This layer is supported by middleware infrastructure for grids, providing a host of grid services, such as node-to-node communication, bilateral transactions, and data collection. The bottom layer is essentially a distributed data warehouse for storing, sharing and analyzing the large amounts of data generated by the grid. Among other functionalities, the warehouse handles the dissemination of data among the members of the grid; it confronts issues of data magnitude with an aging mechanism that aggregates old data at a lower level of detail; and it incorporates privacy-preserving features that retain the confidentiality of individual members. Warehouse information is also used for data and process mining, aimed at analyzing the behavior of the enterprise, and subsequently inducing evolutionary changes that will improve its performance.
ISE-TR-07-05
Fuzzy Clustering of Patterns Represented by Pairwise Dissimilarities
Maurizio Filippone
Clustering is the problem of grouping objects on the basis of a similarity measure between them. This paper considers the approaches belonging to the K-means family, in particular those based on fuzzy memberships. When patterns are represented by means of non-metric pairwise dissimilarities, these methods cannot be directly applied, since they are not guaranteed to converge. Symmetrization and shift operations have been proposed to transform the dissimilarities between patterns from non-metric to metric. It has been shown that they modify the K-means objective function by a constant that does not influence the optimization procedure. Some fuzzy clustering algorithms have been extended in order to handle patterns described by means of pairwise dissimilarities. The literature, however, lacks an explicit analysis of what happens to K-means-style fuzzy clustering algorithms when the dissimilarities are transformed to become metric. This paper shows how the objective functions of four clustering algorithms based on fuzzy memberships change due to dissimilarity transformations. The experimental analysis conducted on a synthetic and a real data set shows the effect of the dissimilarity transformations for four clustering algorithms based on fuzzy memberships.
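The sketch below shows one common form of the symmetrization and shift operations referred to above, following the constant-shift-embedding idea of adding twice the magnitude of the most negative eigenvalue of the centered similarity matrix to all off-diagonal dissimilarities. The dissimilarity matrix is hypothetical, and the paper's contribution is the analysis of how such transformations change the fuzzy objective functions, not the transformation itself.

import numpy as np

def symmetrize_and_shift(D):
    D = 0.5 * (D + D.T)                                   # symmetrization
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    S = -0.5 * J @ D @ J                                  # centered similarity matrix
    shift = max(0.0, -np.linalg.eigvalsh(S).min())        # magnitude of the most negative eigenvalue
    return D + 2.0 * shift * (1.0 - np.eye(n))            # shift off-diagonal dissimilarities only

D = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0],
              [3.0, 2.0, 0.0]])                           # hypothetical non-metric dissimilarities
print(symmetrize_and_shift(D).round(3))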
ISE-TR-07-06
Weighted Cluster Ensembles: Methods and Analysis
Carlotta Domeniconi and Muna Al-Razagan
Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature. Cluster ensembles can provide robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out emergent spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this paper, we address the problem of combining multiple weighted clusters which belong to different subspaces of the input space. We leverage the diversity of the input clusterings in order to generate a consensus partition that is superior to the participating ones. Since we are dealing with weighted clusters, our consensus functions make use of the weight vectors associated with the clusters. We demonstrate the effectiveness of our techniques by running experiments with several real datasets, including high dimensional text data. Furthermore, we investigate in depth the issue of diversity and accuracy for our ensemble methods. Our analysis and experimental results show that the proposed techniques are capable of producing a partition that is as good as or better than the best individual clustering.
ISE-TR-07-07
Query Consolidation: Interpreting a Set of Independent Queries Using a Multidatabase Architecture in the Reverse Direction
Aybar C. Acar and Amihai Motro
We introduce the problem of query consolidation, which seeks to interpret a set of disparate queries submitted to independent databases with a single "global" query. This problem has multiple applications, from improving database design to protecting information from a seemingly innocuous set of apparently unrelated queries. The problem exhibits attractive duality with the much-researched problem of query decomposition, which has been addressed intensively in the context of multidatabase environments: How to decompose a query submitted to a virtual database into a set of local queries that are evaluated in individual databases. We set the new problem in the architecture of a canonical multidatabase system, using it in the reverse direction. The process incorporates two steps where multiplicity of solutions must be considered: At one point the system must infer the most likely set of equi-joins for a set of relations; at another point it must discover the most likely selection constraints that would be applied to a relation. In each case we develop a procedure that ranks solutions according to their perceived likelihood. The final result is therefore a ranked list of suggested consolidations.
ISE-TR-06-01
Detecting Outliers Using Transduction and Statistical Significance Testing
Daniel Barbará and Carlotta Domeniconi and James P. Rogers
Finding points that are outliers with respect to a set of other points is an important task in data mining. Outlier detection can uncover important anomalies in fields like intrusion detection and fraud analysis. In data streaming, the presence of a large number of outliers indicates that the underlying process that is generating the data is undergoing significant changes and the models that attempt to characterize it need to be updated. Although there has been a significant amount of work in outlier detection, most of the algorithms in the literature resort to a particular definition of what an outlier is (e.g., density-based), and use thresholds to detect them. In this paper we present a novel technique to detect outliers that does not impose any particular definition for them. The test we propose aims to diagnose whether a given point is an outlier with respect to an existing clustering model (i.e., a set of points partitioned in groups). However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. This test is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point is fit to be in each of the clusters of the model. We demonstrate, experimentally, that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. We also show that the test can be successfully applied to identify outliers present inside a data set for which no other information is available, thereby providing the user with a clean data set to identify future outliers. Our experiments also show that even if the data set used to identify further outliers is contaminated with some outliers, the test can perform successfully.
ISE-TR-06-02
Specifying Precise Use Cases
Jon Whittle
Despite attempts to formalize the semantics of use cases, they remain an informal notation. The informality of use cases is both a blessing and a curse. Whilst it admits an easy learning curve and enables communication between software stakeholders, it is also a barrier to the application of automated methods for test case generation, validation or simulation. This paper presents a precise way of specifying use cases based on a three-level modeling paradigm strongly influenced by UML. The formal syntax and semantics of use case charts are given, along with an example that illustrates how they can be used in practice.
ISE-TR-06-03
On the Testing Maturity of Software Producing Organizations: Detailed Data
Mats Grindal and Jeff Offutt and Jonas Mellin
This paper presents data from a study of the current state of practice of software testing. Test managers from twelve different software organizations were interviewed. The interviews focused on the amount of resources spent on testing, how the testing is conducted, and the knowledge of the personnel in the test organizations. The data indicate that the overall test maturity is low. Test managers are aware of this but have trouble improving. One problem is that the organizations are commercially successful, suggesting that products must already be “good enough.” Also, the current lack of structured testing in practice makes it difficult to quantify the current level of maturity and thereby articulate the potential gain from increasing testing maturity to upper management and developers.
ISE-TR-06-04
Locally Adaptive Metrics for Clustering High Dimensional Data
Carlotta Domeniconi and Dimitrios Gunopulos and Sheng Ma and Dimitris Papadopoulos and Bojun Yan
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves with respect to competitive methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in gene expression data, and clustering of very high-dimensional data such as text data.
ISE-TR-06-05
Policy Transformations for Preventing Leakage of Sensitive Information in Email Systems
Saket Kaushik and William Winsborough and Duminda Wijesekera and Paul Ammann
In this paper we identify an undesirable side-effect of combining different email-control mechanisms for protection from unwanted messages, namely, leakage of recipients' private information to message senders. This is because some email-control mechanisms like bonds, graph-Turing tests, etc., inherently leak information, and without discontinuing their use, leakage channels cannot be closed. We formalize the capabilities of an attacker and show how she can launch guessing attacks on a recipient's mail acceptance policy that utilizes a leaky mechanism in its defense against unwanted mail. As opposed to the classical Dolev-Yao attacker and its extensions, the attacker in our model guesses the contents of a recipient's private information. The use of leaky mechanisms allows the sender to verify her guess. We assume a constraint logic programming based policy language for specification and evaluation of mail acceptance criteria and present two different program transformations that can prevent guessing attacks while allowing recipients to utilize any email-control mechanism in their policies.
ISE-TR-06-06
Categorization of Unlabeled Documents driven by Word Weighting
Ning Kang and Carlotta Domeniconi and Daniel Barbará
In text mining we often have to handle large document collections. The labeling of such large corpora of documents is too expensive and impractical. Thus, there is a need to develop (unsupervised) clustering techniques for text data, where the distributions of words can vary significantly from one category to another. The vector space model of documents easily leads to 30,000 or more dimensions. In such high dimensionality, the effectiveness of any distance function that equally uses all input features is severely compromised. Furthermore, it is expected that different words may have different degrees of relevance for a given category of documents, and a single word may have a different importance across different categories. In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights on how the spam problem differs from the general classification case.
ISE-TR-06-07
An Algebra for Composing Ontologies
Saket Kaushik and Csilla Farkas and Duminda Wijesekera and Paul Ammann
Ontologies are used as a means of expressing agreement on a vocabulary shared by a community in a coherent and consistent manner. As happens on the Internet, ontologies are created by community members in a decentralized manner, requiring that they be merged before being used by the community. We develop an algebra to do so in the Resource Description Framework (RDF). To provide formal semantics for the proposed algebraic operators, we type a fragment of the RDF syntax.
ISE-TR-06-08
BPEL Orchestration of Secure WebMail
Saket Kaushik and Duminda Wijesekera and Paul Ammann
Web Services offer an excellent opportunity to redesign and replace old and insecure applications with more flexible and robust ones. WSEmail is one such application that replaces conventional message delivery systems with a family of Web Services that achieve the same goal. In this paper we analyze the existing WSEmail specification against the standard set of use cases (and misuse cases) supported (resp. prevented) by SMTP implementations --- the current default message delivery infrastructure --- and augment it with several missing pieces. In addition, we show how the WSEmail family of Web Services, specified in WSDL, can be orchestrated using BPEL. Finally, we provide a synchronization analysis of our WSEmail orchestration and show its correctness.
ISE-TR-06-09
Kernel Optimization using Pairwise Constraints for Semi-Supervised Clustering
Bojun Yan and Carlotta Domeniconi
A critical problem related to kernel-based methods is the selection of an optimal kernel for the problem at hand. The kernel function in use must conform with the learning target in order to obtain meaningful results. While solutions to estimate optimal kernel functions and their parameters have been proposed in a supervised setting, the problem presents open challenges when no labeled data are provided, and all we have available is a set of pairwise must-link and cannot-link constraints. In this paper we address the problem of optimizing the kernel function using pairwise constraints for semi-supervised clustering. To this end we derive a new optimization criterion to automatically estimate the optimal parameters of composite Gaussian kernels, directly from the data and the given constraints. We combine the optimal kernel function computed by our technique with a recently introduced semi-supervised kernel-based algorithm to demonstrate experimentally the effectiveness of our approach. The results show that our method enables the practical utilization of powerful kernel-based semi-supervised clustering approaches by providing a mechanism to automatically set the involved critical parameters.
ISE-TR-06-10
Trust-based Secure Positive Train Control (PTC) Interoperation
Mark Hartong and Rajini Goel and Duminda Wijesekera
Positive Train Control (PTC) is a wireless control system ensuring railroad safety by enforcing train separation, speed limits, roadway worker protection, and other safety functions. Due to shared track rights in North America, company A's trains must be safely operated by company B's crew on company C's tracks, requiring different PTC systems to securely interoperate with each other. To provide a security framework that ensures this, we propose using a trust management system with certificates and over-the-air re-keying (OTAR). Back-of-the-envelope calculations show that our solution meets the timing needs of PTC.
GMU-CS-TR-2005-1
Reachability Testing of Concurrent Programs
Jeff Lei and Richard Carver
One approach to testing concurrent programs, called reachability testing, generates synchronization sequences automatically, and on-the-fly, without constructing any static models. In this paper, we present a general execution model for concurrent programs that allows reachability testing to be applied to several commonly used synchronization constructs. We also present a new method for performing reachability testing. This new method guarantees that every partially-ordered synchronization sequence will be exercised exactly once without having to save any sequences that have already been exercised. We describe a prototype reachability testing tool called RichTest and report some empirical results, including a comparison between RichTest and a partial order reduction based tool called VeriSoft. RichTest performed significantly better for the programs in our study.
GMU-CS-TR-2005-2
Probabilistic Location Recognition using Reduced Feature Set
Fayin Li and Jana Kosecka
The localization capability is central to basic navigation tasks and motivates the development of various visual navigation systems. These systems can be used both as navigational aids for the visually impaired and in the context of autonomous mobile systems. In this paper we describe a two-stage approach for localization in indoor environments. In the first stage, the environment is partitioned into several locations, each characterized by a set of scale-invariant keypoints and their associated descriptors. In the second stage, the keypoints of the query view are integrated probabilistically, yielding an estimate of the most likely location. The emphasis of our approach in the environment model acquisition stage is on the selection of discriminative features best suited for characterizing individual locations. The high recognition rate is maintained with only 10% of the originally detected features, yielding a substantial speedup in recognition. The ambiguities due to self-similarity and dynamic changes in the environment are resolved by exploiting spatial relationships between locations captured by a Hidden Markov Model. Once the most likely location is determined, the relative pose of the camera with respect to the reference view can be computed.
GMU-CS-TR-2005-3
Can Good Learners Always Compensate for Poor Learners?
Keith Sullivan and Liviu Panait and Gabriel Balan and Sean Luke
Can a good learner compensate for a poor learner when paired in a coordination game? Previous work has given an example where a special learning algorithm (FMQ) is capable of doing just that when paired with a specific capable algorithm even in games which stump the poorer algorithm when paired with itself. In this paper, we argue that this result is not general. We give a straightforward extension to the coordination game in which FMQ cannot compensate for the lesser algorithm. We also provide other problematic pairings, and argue that another high-quality algorithm cannot do so either.
GMU-CS-TR-2005-4
Lenience Towards Teammates Helps in Cooperative Multiagent Learning
Liviu Panait and Keith Sullivan and Sean Luke
Concurrent learning is a form of cooperative multiagent learning in which each agent has an independent learning process and little or no control over its teammates' actions. In such learning algorithms, an agent's perception of the joint search space depends on the reward received by both agents, which in turn depends on the actions currently chosen by the other agents. The agents will tend to converge towards certain areas of the space because of their learning processes. As a result, an agent's perception of the search space may benefit if computed over multiple rewards at early stages of learning, but additional rewards have little impact towards the end. We thus suggest that agents should be lenient with their teammates: ignore many of the low rewards initially, and fewer rewards as learning progresses. We demonstrate the benefit of lenience in a cooperative co-evolution algorithm and in a new reinforcement learning algorithm.
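As a toy illustration of lenience only (not the paper's cooperative co-evolution or reinforcement learning algorithms), the sketch below values each action by the maximum over its first few observed joint rewards rather than the mean, so low rewards caused by a still-exploring teammate are largely ignored.

import numpy as np

def lenient_estimate(rewards_per_action, kappa=5):
    # Optimistic (lenient) value estimate: the best of the first kappa joint rewards per action.
    return np.array([np.max(r[:kappa]) for r in rewards_per_action])

samples = [np.array([10.0, 0.0, 0.0, 2.0, 0.0, 10.0, 1.0]),    # good action, often "spoiled" by the teammate
           np.array([7.0, 6.0, 7.0, 6.0, 7.0, 7.0, 6.0])]      # safe but suboptimal action
print(lenient_estimate(samples))     # lenient view prefers action 0; averaging would prefer action 1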
GMU-CS-TR-2005-5
Selecting Informative Actions Improves Cooperative Multiagent Learning
Liviu Panait and Sean Luke
In concurrent cooperative multiagent learning, each agent in a team simultaneously learns to improve the overall performance of the team, with no direct control over the actions chosen by its teammates. An agent's action selection directly influences the rewards received by all the agents; this results in a co-adaptation among the concurrent learning processes. Co-adaptation can drive the team towards suboptimal solutions because agents tend to select those actions that are rewarded better, without any consideration for how such actions may affect the search of their teammates. We argue that to counter this tendency, agents should also prefer actions that inform their teammates about the structure of the joint search space in order to help them choose from among various action options. We analyze this approach in a cooperative coevolutionary framework, and we propose a new algorithm, oCCEA, that highlights the advantages of selecting informative actions. We show that oCCEA generally outperforms other cooperative coevolution algorithms on our test problems.
GMU-CS-TR-2005-6
Nonparametric Estimation of Multiple Structures with Outliers
Wei Zhang and Jana Kosecka
In many computer vision problems, several instances of a particular model need to be recovered from the noisy data. In such cases one is faced with the problem of simultaneous estimation of the number of models and their parameters. This problem becomes difficult as the measurement noise in the data increases and the data are further corrupted by outliers. This is especially the case in a variety of motion estimation problems, where the displacement between the views is large and the process of establishing correspondences is difficult. In this paper we propose a novel nonparametric sampling based method for solving this problem. The main novelty of the proposed method lies in the analysis of the distribution of residuals of individual data points with respect to the set of hypotheses, generated by a RANSAC-like sampling process. We will show that the modes of the residual distributions directly reveal the presence of multiple models and facilitate the recovery of the individual models, without making any assumptions about the distribution of the outliers or the noise process. The proposed approach is capable of handling data with a large fraction of outliers. Experiments with both synthetic and real data are presented to demonstrate the effectiveness of the proposed approach.
GMU-CS-TR-2005-7
A New Inlier Identification Scheme for Robust Estimation Problems
Wei Zhang and Jana Kosecka
The data in vision problems are often heavily contaminated by outliers, calling for efficient robust techniques that identify inliers for correct estimation. The RANSAC algorithm is a frequently used robust estimator for computer vision problems. In the traditional RANSAC scheme, when the data contain a significant fraction of outliers, a large number of samples is needed in order to obtain at least one outlier-free sample. In addition, each hypothesis generated from the samples is typically evaluated using all data points, which further lowers the efficiency. In this paper, we propose a novel hypothesis evaluation scheme, which enables efficient classification of the data points as outliers and inliers while requiring a small fixed number of samples. The method is based on the observation that, for each data point, the properties of the distribution of the residuals with respect to the generated hypotheses reveal whether the point is an outlier or an inlier. The problem of inlier/outlier identification can then be formulated as a classification problem. We demonstrate the proposed method on motion estimation problems with a large fraction of outliers on both synthetic and real data.
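The sketch below only prepares the raw material such an analysis operates on: for a 2D line-fitting problem with hypothetical data, it draws RANSAC-style minimal samples and collects the points-by-hypotheses matrix of residuals, whose per-point distributions tend to separate inliers from outliers; the report's actual classification scheme is not reproduced here.

import numpy as np

def residual_matrix(points, n_hypotheses=200, seed=0):
    # Rows: data points; columns: line hypotheses from random 2-point minimal samples.
    rng = np.random.default_rng(seed)
    R = np.empty((len(points), n_hypotheses))
    for h in range(n_hypotheses):
        p, q = points[rng.choice(len(points), size=2, replace=False)]
        d = q - p
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # unit normal of the line through p and q
        R[:, h] = np.abs((points - p) @ n)                # distance of every point to that line
    return R

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 80)
inliers = np.c_[x, 2 * x + 1 + rng.normal(0, 0.1, 80)]    # points near a line
outliers = rng.uniform(0, 25, size=(20, 2))               # gross outliers
R = residual_matrix(np.vstack([inliers, outliers]))
print(np.median(R[:80]), np.median(R[80:]))               # inlier residuals concentrate near zero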
GMU-CS-TR-2005-8
Evolving Families of Designs Using L-Systems
Elena Popovici
Evolutionary computation has proven its utility in automating the process of engineering design. However, little attention has been paid to the scalability of generated designs, which is an important issue. This paper addresses this issue and demonstrates the viability of evolving families of designs using parameterized L-Systems as a representation. The rest of the paper is organized as follows: first, an introduction as to why scalability is important and difficult; second, a review of existing work on evolving L-Systems; the third section contains a description of the application domain used for this feasibility study, and details on the L-Systems and the EA used; section four presents the experiments conducted and their results; the paper ends with a discussion, drawing conclusions and setting goals for future work.
ISE-TR-05-01
Intensional Encapsulations of Database Subsets via Genetic Programming
Aybar C. Acar and Amihai Motro
Finding intensional encapsulations of database subsets is the inverse of query evaluation. Whereas query evaluation transforms an intensional expression (the query) to its extension (a set of data values), intensional encapsulation assigns an intensional expression to a given set of data values. We describe a method for deriving intensional representations of subsets of records in large database tables. Our method is based on the paradigm of genetic programming. It is shown to achieve high accuracy and maintain compact expression size, while requiring cost that is acceptable to all applications except those that require instantaneous results. Intensional encapsulation has a broad range of applications including cooperative answering, information integration, security, and data mining.
ISE-TR-05-02
Blind Custodians: A Database Service Architecture that Supports Privacy without Encryption
Amihai Motro and Francesco Parisi-Presicce
We describe an architecture for a database service that does not assume that the service provider can be trusted. Unlike other architectures that address this problem, this architecture, which we call blind custodians, does not rely on encryption. Instead, it offers confidentiality by means of information dissociation: The server only stores “fragments” of information that are considered safe (i.e., each fragment does not violate privacy), while the client stores the associations between the fragments that are necessary to reconstruct the information. We argue that this architecture allows satisfactory confidentiality, while offering two important advantages: (1) It does not restrict the types of queries that can be submitted by clients (as encryption-based methods invariably do), and (2) it requires only light processing at the client, assigning the bulk of the processing to the server (as befits a true service). Moreover, the architecture permits flexible control over the level of confidentiality that should be maintained (at the cost of additional overhead).
ISE-TR-05-03
Optimal Manufacturing Trees
Amihai Motro and Alessandro D'Atri
We consider the process of manufacturing a product from a given number of elementary components. By assembling intermediate products, the target product can be manufactured in a variety of processes, each modeled by a tree. We are interested in manufacturing turnaround: the time between receiving an order at the root and its completion. We express the turnaround time of each manufacturing process (tree) with a formula that incorporates three parameters: the time required to create elementary components, the time required to assemble a product from its components and the time required to deliver the product to its procurer (another manufacturer). We show that this turnaround formula is optimized in a manufacturing process that corresponds to a perfect (or nearly perfect) tree. The degree of the optimal tree (i.e., the ideal number of components in each sub-assembly) is shown to be independent of the number of elementary components, suggesting that in each manufacturing environment there is an ideal assembly size, which is optimal for the manufacturing of products of any scale.
ISE-TR-05-04
A Comparative Evaluation of Tests Generated from Different UML Diagrams: Diagrams and Data
Aynur Abdurazik and Jeff Offutt and Andrea Baldini
This paper presents a single project experiment on the fault revealing capabilities of model-based test sets. The tests are generated from UML statecharts and UML sequence diagrams. This experiment found that the statechart test sets did better at revealing unit level faults than the sequence diagram test sets, and the sequence diagram test sets did better at revealing integration level faults than the statechart test sets. The statecharts also resulted in more test cases than the sequence diagrams. The experiment showed that model-based testing can be used to systematically generate test data and indicates that different UML models can play different roles in testing.
ISE-TR-05-05
VirtuE: a Formal Model of Virtual Enterprises for Information Markets
Alessandro D'Atri and Amihai Motro
A vital part of a modern economy is an information market. In this market, information products are being traded in countless ways. Information is bought, modified, integrated, incorporated into other products, and then sold again. Often, the manufacturing of an information product requires the collaboration of several participants. A virtual enterprise is a community of business entities that collaborate on the manufacturing of complex products. This collaboration is often ad hoc, for a specific product only, after which the virtual enterprise may dismantle. The virtual enterprise paradigm is particularly appealing for modeling collaborations for manufacturing information products, and in this paper we present a new model, called VirtuE, for modeling such activities. VirtuE has three principal components. First, it defines a distributed infrastructure with concepts such as members, products, inventories, and production plans. Second, it defines transactions among members, to enable collaborative production of complex products. Finally, it provides means for the instrumentation of enterprises, to measure their performance and to govern their behavior.
ISE-TR-05-06
CoJava: A Unified Language for Simulation and Optimization
Alexander Brodsky and Hadon Nash
We have proposed and implemented the language CoJava, which offers both the advantages of simulation-like process modeling in Java, and the capabilities of true decision optimization. By design, the syntax of CoJava is identical to the programming language Java, extended with special constructs to (1) make a non-deterministic choice of a numeric value, (2) assert a constraint, and (3) designate a program variable as the objective to be optimized. The semantics of CoJava interprets a program as an optimal nondeterministic execution path, namely, a path that (1) satisfies the range conditions in the choice statements, (2) satisfies the assert-constraint statements, and (3) produces the optimal value in a designated program variable, among all execution paths that satisfy (1) and (2). Running a CoJava program amounts to first finding an optimal execution path, and then procedurally executing it. We have developed a CoJava constraint compiler based on a reduction of the problem of finding an optimal execution trace to a standard symbolic formulation, reinterpreting Java code as a symbolic constraint construction, and solving the resulting optimization problem on an external solver. To demonstrate the power of CoJava, we have implemented a realistic problem in the area of robot arm control in CoJava. The robot arm is constructed using self-contained components, implemented as CoJava classes, that model the robot arm's movements based on Newton's laws.
GMU-CS-TR-2004-1
Tunably Decentralized Algorithms for Cooperative Target Observation
Sean Luke and Keith Sullivan and Gabriel Catalin Balan and Liviu Panait
Multi-agent problem domains may require distributed algorithms for a variety of reasons: local sensors, limitations of communication, and availability of distributed computational resources. In the absence of these constraints, centralized algorithms are often more efficient, simply because they are able to take advantage of more information. We introduce a variant of the cooperative target observation domain which is free of such constraints. We propose two algorithms, inspired by K-means clustering and hill-climbing respectively, which are scalable in degree of decentralization. Neither algorithm consistently outperforms the other across all problem domain settings. Surprisingly, we find that hill-climbing is sensitive to degree of decentralization, while K-means is not. We also experiment with a combination of the two algorithms which draws strength from each.
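For flavor, here is a minimal, fully centralized sketch of the K-means-inspired assignment: target positions are clustered into one cluster per observer, and the cluster centers become observer destinations. The report's algorithms additionally tune how much of this computation each observer performs locally (the degree of decentralization), which is omitted here; parameter names and defaults are assumptions.

```python
import numpy as np

def kmeans_observer_destinations(targets, n_observers, iterations=10, rng=None):
    """Cluster target positions into one cluster per observer and return the
    cluster centers as observer destinations (centralized illustration only)."""
    rng = rng or np.random.default_rng(0)
    targets = np.asarray(targets, dtype=float)
    centers = targets[rng.choice(len(targets), n_observers, replace=False)].copy()
    for _ in range(iterations):
        # Assign each target to its nearest tentative observer destination.
        dists = np.linalg.norm(targets[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each destination to the centroid of its assigned targets.
        for k in range(n_observers):
            assigned = targets[labels == k]
            if len(assigned):
                centers[k] = assigned.mean(axis=0)
    return centers
```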
GMU-CS-TR-2004-2
Location Recognition, Global Localization and Relative Positioning Based on Scale-Invariant Keypoints
Jana Kosecka and Xialong Yang
The localization capability of a mobile robot is central to basic navigation and map building tasks. We describe a probabilistic environment model which facilitates a global localization scheme by means of location recognition. In the exploration stage the environment is partitioned into a class of locations, each characterized by a set of scale-invariant keypoints. The descriptors associated with these keypoints can be robustly matched despite changes in contrast, scale and viewpoint. We demonstrate the efficacy of these features for location recognition, where, given a new view, the most likely location from which this view came is determined. The misclassifications due to dynamic changes in the environment or inherent appearance ambiguities are overcome by exploiting neighborhood relationships captured by a Hidden Markov Model. We report the recognition performance of this approach on an indoor environment consisting of eighteen locations and discuss the suitability of this approach for a more general class of recognition problems. Once the most likely location has been determined, we show how to compute the relative pose between the representative view and the current view.
GMU-CS-TR-2004-3
Experiments in Building Recognition
Wei Zhang and Jana Kosecka
In this report we study the problem of building recognition. Given a database of building images, we are interested in classifying a novel view of a building by finding the closest match from the database. We examine the efficacy of local scale-invariant features and their associated descriptors as representations of building appearance and use a simple voting scheme in the recognition phase.
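A minimal sketch of such a voting scheme is given below: each query descriptor is matched to its nearest database descriptor (with a ratio test), and every accepted match votes for the building that descriptor came from. The data layout and threshold are assumptions rather than the report's exact procedure.

```python
import numpy as np

def recognize_building(query_descriptors, db_descriptors, db_labels, ratio=0.8):
    """Nearest-neighbour descriptor matching with a ratio test; every accepted
    match votes for the building its database descriptor belongs to.
    (Illustrative sketch; the threshold and layout are assumptions.)"""
    db = np.asarray(db_descriptors, dtype=float)
    votes = {}
    for q in np.asarray(query_descriptors, dtype=float):
        dists = np.linalg.norm(db - q, axis=1)
        best, second = np.argsort(dists)[:2]
        if dists[best] < ratio * dists[second]:  # Lowe-style ratio test
            label = db_labels[best]
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None
```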
GMU-CS-TR-2004-4
A New Algorithm for Reachability Testing of Concurrent Programs
Jeff Lei and Richard Carver
This document has been superseded by GMU CS Department Technical Report GMU-CS-TR-2005-1
Concurrent programs exhibit non-deterministic behavior, which makes them difficult to test. One approach to testing concurrent programs, called reachability testing, generates test sequences automatically, and on-the-fly, without constructing any static models. This approach guarantees that every partially-ordered synchronization sequence of a program with a given input will be exercised exactly once. Unfortunately, in order to make this guarantee all existing reachability testing algorithms need to save and search through the history of test sequences that have already been exercised, which is impractical for many applications. In this paper, we present a new reachability testing algorithm that does not save the history of test sequences. This new algorithm guarantees that every partially-ordered synchronization sequence is exercised at least once for an arbitrary program and exactly once for a program that satisfies certain conditions. We also describe a reachability testing tool called RichTest. Our empirical studies with RichTest indicate that our new algorithm exercises every synchronization sequence exactly once for many applications.
ISE-TR-04-01
Modeling and Testing of Dynamic Aspects of Web Applications
Ye Wu and Jeff Offutt and Xiaochen Du
Web software applications have become complex, sophisticated programs that are based on novel computing technologies. Although powerful, these technologies bring new challenges to developers and testers. Checking static HTML links is no longer sufficient; Web applications must be evaluated as complex software products. This paper focuses on two unique aspects of Web applications: an extremely loose form of coupling that features dynamic integration, and the ability that users have to directly change the potential flow of execution. Taken together, these allow the control flow to vary with each execution, and mean that the possible control flows cannot be determined statically. Thus we cannot apply standard analysis techniques that are fundamental to many software engineering activities. This paper presents a new theoretical model of these new couplings and the dynamic flow of control of Web applications. This model is based on atomic sections, which allow tools to build the equivalent of a control flow graph for Web applications. The model is used to propose new test criteria for Web applications, and results are shown from a case study on a moderate-size application.
ISE-TR-04-02
A Framework for Dynamic Semantic Web Services Management
Randy Howard and Larry Kerschberg
The concept of Web services as a means of dynamically discovering, negotiating, composing, executing and managing services to materialize enterprise-scale workflow is an active research topic. However, its realization has thus far been elusive. Existing approaches involve many disparate concepts, frameworks and technologies. What is needed is a comprehensive and overarching framework that handles the processing and workflow requirements of Virtual Enterprises, maps them to a collection of service-oriented tasks, dynamically configures these tasks from available services, and manages the choreography and execution of these services. The goal is to add semantics to Web services to endow them with capabilities currently lacking in the literature, but necessary for their successful deployment in future systems. This paper introduces such a framework, called the Knowledge-based Dynamic Semantic Web Services (KDSWS) Framework, that addresses, in an integrated end-to-end manner, the life-cycle of activities involved in preparing, publishing, requesting, discovering, selecting, configuring, deploying, and delivering Semantic Web Services. In particular, the following issues are addressed: 1) semantic specification of service capabilities including quality of service, trust, and security; 2) transaction control and workflow management; and 3) resource management, interoperation and evolution of the Virtual Enterprise. Two models and their associated languages are introduced to specify the features for these enhanced services: 1) the KDSWS Meta-Model, and 2) the KDSWS Process Language. Both are based on the Knowledge/Data Model. This integrated and comprehensive framework provides a unified end-to-end approach to enable dynamically-composable, semantically-rich, service-oriented systems of Web services.
ISE-TR-04-03
A Controlled Experimental Evaluation of Test Cases Generated from UML Diagrams
Aynur Abdurazik and Jeff Offutt and Andrea Baldini
This paper presents a single project experiment on the fault revealing capabilities of test sets that are generated from UML statecharts and sequence diagrams. The results of this experiment show that the statechart test sets do better at revealing unit level faults than the sequence diagram test sets, and the sequence diagram test sets do better at revealing integration level faults than the statechart test sets. The experimental data also show that the statecharts result in more test cases than the sequence diagrams. The experiment showed that UML diagrams can be used in generating test data systematically, and different UML diagrams can play different roles in testing.
ISE-TR-04-04
Integration Testing of Object-oriented Components Using FSMs: Theory and Experimental Details
Leonard Gallagher and Jeff Offutt
In object-oriented terms, one of the goals of integration testing is to ensure that messages from objects in one class or component are sent and received in the proper order and have the intended effect on the state of external objects that receive the messages. This research extends an existing single-class testing technique to integration testing. The previous method models the behavior of a single class as a finite state machine, transforms that representation into a data flow graph that explicitly identifies the definitions and uses of each state variable of the class, and then applies conventional data flow testing to produce test case specifications that can be used to test the class. This paper extends those ideas to inter-class testing by developing flow graphs and tests for an arbitrary number of classes and components. It introduces flexible representations for message sending and receiving among objects and allows concurrency among any or all classes and components. A second major result is the introduction of a novel approach to performing data flow analysis. Data flow graphs are stored in a relational database, and database queries are used to gather def-use information. This approach is conceptually simple, mathematically precise, quite powerful, and general enough to be used for traditional data flow analysis. This testing approach relies on finite state machines, database modeling and processing techniques, and algorithms for analysis and traversal of directed graphs. A proof-of-concept implementation is used to illustrate how the approach works on an extended example.
ISE-TR-04-05
Combination Testing Strategies: A Survey
Mats Grindal and Jeff Offutt and Sten F. Andler
Combination strategies are test-case selection methods where test cases are identified by combining values of the different test object input parameters based on some combinatorial strategy. This survey presents 15 different combination strategies, and covers more than 30 papers that focus on one or several combination strategies. We believe this collection represents most of the existing work performed on combination strategies. This survey describes the basic algorithms used by the combination strategies. Some properties of combination strategies, including coverage criteria and theoretical bounds on the size of test suites, are also included in this description. This survey also includes a subsumption hierarchy that attempts to relate the various coverage criteria associated with the identified combination strategies. Finally, this survey contains short summaries of all the papers that cover combination strategies.
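As a small illustration of the kind of method surveyed, the sketch below implements the simple "each choice" strategy, in which every value of every input parameter appears in at least one test case; the more powerful strategies in the survey (e.g., pairwise coverage) require more elaborate generation. Parameter names and values are invented for the example.

```python
from itertools import zip_longest

def each_choice_tests(parameters):
    """Build a small test suite in which every value of every input parameter
    appears in at least one test case (the 'each choice' combination strategy).
    parameters: dict mapping parameter name -> list of values."""
    names = list(parameters)
    columns = [parameters[n] for n in names]
    suite = []
    for row in zip_longest(*columns):
        # Pad shorter value lists by reusing their first value.
        test = {n: (v if v is not None else parameters[n][0])
                for n, v in zip(names, row)}
        suite.append(test)
    return suite

# Example: parameters with 2, 3, and 2 values yield a 3-test suite.
print(each_choice_tests({"os": ["linux", "win"],
                         "browser": ["ff", "chrome", "safari"],
                         "net": ["wifi", "lan"]}))
```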
GMU-CS-TR-2003-0
How to Make a Technical Report
Sean Luke and Jana Kosecka
This document has been superseded by GMU CS Department Technical Report GMU-CS-TR-2011-0
The following document describes the preferred formatting of technical reports and submission guidelines. This document itself is formatted according to the preferred formatting rules.
GMU-CS-TR-2003-1
Cooperative Multi-Agent Learning: The State of the Art
Liviu Panait and Sean Luke
Problems in cooperative multi-agent systems are ones in which several agents attempt, through their interaction, to jointly solve tasks or to maximize their utility. Due to the interactions among the agents, multi-agent problem complexity can rise rapidly with the number of agents or their behavioral sophistication. The challenge this presents to the task of programming solutions to such problems has spawned increasing interest in machine learning (ML) techniques to automate the search and optimization process. We provide a broad survey of the cooperative multi-agent learning literature. Previous surveys of this area have largely focused on issues common to specific subareas (for example, reinforcement learning or robotics). In this survey we attempt to draw from multi-agent learning work in a spectrum of areas, including reinforcement learning, evolutionary computation, game theory, complex systems, agent modeling, and robotics. We find that this broad view leads to a division of the work into two categories, each with its own special issues: applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). Additionally, we discuss two important topics independent of these categories: problem decomposition and communication. We conclude with a presentation of multi-agent learning problem domains, a discussion of certain challenge topics (scalability and adaptive dynamics), and a list of multi-agent learning resources.
ISE-TR-03-01
Establishing Pair-wise Keys For Secure Communication in Ad Hoc Networks: A Probabilistic Approach
Sencun Zhu and Shouhuai Xu and Sanjeev Setia and Sushil Jajodia
We present a scalable distributed protocol that enables two mobile nodes in an ad hoc network to establish a pairwise shared key on the fly, without requiring the use of an on-line key distribution center. The design of our protocol is based on a novel combination of two techniques --- probabilistic key sharing and threshold secret sharing. Our protocol is scalable since every node only needs to possess a small number of keys, independent of the network size, and it is computationally efficient because it only relies on symmetric key operations. We show that a pairwise key established between two nodes using our protocol is secure against a collusive attack by a certain number of compromised nodes, and that our protocol can be parameterized to meet the appropriate levels of performance, security and storage for the application under consideration.
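The probabilistic key-sharing half of the idea can be sketched as follows: each node is preloaded with a small random subset of a large key pool, and two nodes derive a pairwise key from whatever pool keys they share. The pool and ring sizes are illustrative, and the threshold-secret-sharing component and the actual protocol messages are omitted.

```python
import hashlib
import random

POOL_SIZE = 10_000   # size of the global key pool (assumed)
KEYS_PER_NODE = 100  # keys preloaded on each node (assumed)

def preload(rng):
    """Assign a node a random subset of key indices from the pool."""
    return set(rng.sample(range(POOL_SIZE), KEYS_PER_NODE))

def pairwise_key(ring_a, ring_b, pool):
    """Derive a shared key from the intersection of two nodes' key rings."""
    shared = sorted(ring_a & ring_b)
    if not shared:
        return None  # the full protocol falls back to path keys / secret shares
    material = b"".join(pool[i] for i in shared)
    return hashlib.sha256(material).hexdigest()

rng = random.Random(42)
pool = [rng.randbytes(16) for _ in range(POOL_SIZE)]  # requires Python 3.9+
ring_a, ring_b = preload(rng), preload(rng)
print(pairwise_key(ring_a, ring_b, pool))
```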
ISE-TR-03-02
Adding Reliable and Self-Healing Key Distribution to the Subset Difference Group Rekeying Method for Secure Multicast
Sencun Zhu and Sanjeev Setia and Sushil Jajodia
The Subset Difference Rekeying (SDR) method is the most efficient stateless group rekeying method proposed in the literature. We study two important issues related to the SDR method. First, we address the issue of reliable rekey transport for SDR. We present a key distribution scheme, called FEC-BKR, that enables members to receive the current group key in a reliable and timely fashion despite packet losses in the network. Through simulation, we show that in most scenarios, FEC-BKR outperforms previously proposed schemes for reliable rekey transport. Second, we address the issue of self-healing key distribution for SDR. We present a group key recovery scheme that adds the self-healing property to SDR, i.e., our scheme enables a member that has missed up to a certain number m of previous rekey operations to recover the missing group keys without asking the key server for retransmission. The additional communication overhead imposed by our key recovery scheme is quite small (less than 3m additional keys).
ISE-TR-03-03
Precisely Answering Multi-dimensional Range Queries Without Privacy Breaches
Lingyu Wang and Yingjiu Li and Duminda Wijesekera and Sushil Jajodia
This paper investigates the privacy breaches caused by multi-dimensional range (MDR) sum queries in OLAP systems. We show that existing inference control methods are generally ineffective or infeasible for MDR queries. We then consider restricting users to even MDR queries (that is, MDR queries involving an even number of data values). We show that the collection of such even MDR queries is safe if and only if a special set of sum-two queries (that is, queries involving exactly two values) is safe. On the basis of this result, we give an efficient method to decide the safety of even MDR queries. Besides identifying safe even MDR queries, we show that any odd MDR query is unsafe. Moreover, any such odd MDR query differs from the union of some even MDR queries by only one tuple. We also extend those results to the safe subsets of unsafe even MDR queries.
ISE-TR-03-04
FlexFlow: A Flexible Flow Control Policy Specification Framework
Shiping Chen and Duminda Wijesekera and Sushil Jajodia
Flow control policies are important in data-flow, work-flow, transaction systems and software design. Previous work in this area concentrates either on modelling security aspects of information flow control or on applying flow control policies in some specific application domain. These models permit either permissions or prohibitions for flows and are normally based on a specific meta-policy (usually the closed policy). We propose FlexFlow, a logic-based flexible flow control framework to specify policies for data-flow, work-flow and transaction systems that go beyond point-to-point flows. Both permissions and prohibitions are specifiable in FlexFlow, and meta-policies such as permissions-take-precedence can themselves be specified on top of the meta-policy-neutral policy specification environment of FlexFlow. We further show how to specify and prevent inter-flow conflicts such as those arising in role-based work-flow policies.
ISE-TR-03-05
OLAP Means On-Line Anti-Privacy
Lingyu Wang and Duminda Wijesekera and Sushil Jajodia
In this paper we investigate the privacy breaches caused by multi-dimensional range sum queries in OLAP (On-Line Analytical Processing) systems. Our results show that the sensitive information stored in the underlying data warehouses can be easily compromised by malicious users with legitimate range queries. This compromise is possible even when users are restricted to some special class of range queries. We present algorithms that compromise individual tuples with only range queries. We study the conditions under which the compromises are possible. We also analyze the number of queries required for each compromise, as well as the complexity and completeness of those algorithms. Our study reveals the seriousness of the privacy issue in OLAP systems and provides a better understanding of the problem for further research on control methods.
ISE-TR-03-06
Fusionplex: Resolution of Data Inconsistencies in the Integration of Heterogeneous Information Sources
Philipp Anokhin and Amihai Motro
Fusionplex is a system for integrating multiple heterogeneous and autonomous information sources that uses data fusion to resolve factual inconsistencies among the individual sources. To accomplish this, the system relies on source features, which are meta-data on the quality of each information source; for example, the recentness of the data, its accuracy, its availability, or its cost. The fusion process is controlled with several parameters: (1) with a vector of feature weights, each user defines an individual notion of data utility; (2) with thresholds of acceptance, users ensure minimal performance of their data, excluding from the fusion process data that are too old, too costly, or lacking in authority, or data that are too high, too low, or obvious outliers; and, ultimately, (3) by naming a particular fusion function to be used for each attribute (for example, average, maximum, or simply any), users implement their own interpretation of fusion. Several simple extensions to SQL are all that is needed to allow users to state these resolution parameters, thus ensuring that the system is easy to use. Altogether, Fusionplex provides its users with powerful and flexible, yet simple, control over the fusion process. In addition, Fusionplex supports other critical integration requirements, such as information source heterogeneity, dynamic evolution of the information environment, quick ad-hoc integration, and intermittent source availability. The methods described in this paper were implemented in a prototype system that provides complete Web-based integration services for remote clients.
ISE-TR-03-07
Utility-based Resolution of Data Inconsistencies
Amihai Motro, Philipp Anokhin and Aybar C. Acar
A virtual database system is software that provides unified access to multiple information sources. If the sources are overlapping in their contents and independently maintained, then the likelihood of inconsistent answers is high. Solutions are often based on ranking (which sorts the different answers according to recurrence) and on fusion (which synthesizes a new value from the different alternatives according to a specific formula). In this paper we argue that both methods are flawed, and we offer alternative solutions that are based on knowledge about the performance of the source data, including features such as recentness, availability, accuracy and cost. These features are combined in a flexible utility function that expresses the overall value of a data item to the user. Utility allows us to (1) define a meaningful ranking on the inconsistent set of answers, and offer the top-ranked answer as a preferred answer; (2) determine whether a fusion value is indeed better than the initial values, by calculating its utility and comparing it to the utilities of the initial values; and (3) discover the best fusion: the fusion formula that optimizes the utility. The advantages of such performance-based and utility-driven ranking and fusion are considerable.
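A minimal sketch of the utility-driven ranking step follows, assuming each candidate answer is annotated with normalized performance features and the user supplies feature weights; the feature names, weights, and values are illustrative, not those of the paper.

```python
def utility(features, weights):
    """Overall value of a candidate answer as a weighted sum of its
    (normalized) performance features; feature names are illustrative."""
    return sum(weights[f] * features.get(f, 0.0) for f in weights)

def rank_answers(candidates, weights):
    """Sort inconsistent candidate answers by decreasing utility; the first
    element can be offered as the preferred answer."""
    scored = [(utility(c["features"], weights), c["value"]) for c in candidates]
    return sorted(scored, reverse=True)

weights = {"recentness": 0.4, "accuracy": 0.4, "availability": 0.1, "cost": 0.1}
candidates = [
    {"value": 42_000, "features": {"recentness": 0.9, "accuracy": 0.7,
                                   "availability": 1.0, "cost": 0.5}},
    {"value": 40_500, "features": {"recentness": 0.4, "accuracy": 0.95,
                                   "availability": 0.8, "cost": 0.9}},
]
print(rank_answers(candidates, weights))
```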
ISE-TR-03-08
Providing Voice Privacy as a Service over the Public Telephone Network
Mohamad Sharif and Duminda Wijesekera
The existing design of the public telephone network is susceptible to eavesdropping on its voice stream. Consequently, any wiretapper can listen to supposedly private conversations. In order to prevent such eavesdropping, we propose an architecture and its implementation for end-to-end voice encryption as a service available to interested subscribers. Our proposal consists of appropriately placed servers that authenticate telephone equipment and subscribers of the service, and certificate authorities that can cross-certify across telecommunication service providers. We show how these entities and the necessary signaling mechanisms between them can be implemented using the transaction capabilities application part (TCAP) of the Signaling System 7 (SS7) protocol and the D channel of the digital subscriber line (DSL) connecting telephone equipment to the SS7 grid. Using published specifications, we show that initiating an encrypted telephone call takes about 19 seconds, including user authentication, under average load conditions. That makes our design and protocols comparable to existing intelligent services provided over public telephones, and only about four times longer than initiating a normal telephone call.
ISE-TR-03-09
Why Is this User Asking so Many Questions? Explaining Sequences of Queries
Aybar C. Acar and Amihai Motro
A sequence of queries submitted by a database user within a short period of time may have a single, illuminating explanation. In this paper we consider sequences of single-record queries, and attempt to guess what information their authors may be trying to accumulate. Query sequences may reflect clandestine intentions, where users attempt to avoid direct queries which may disclose their true interests, preferring instead to obtain the same information by means of sequences of smaller, less conspicuous queries. Sequences of queries may also reflect attempts to circumvent retrieval restrictions, where users attempt to approximate information which is inaccessible with sequences of legitimate requests (in the latter case, our explanations may lead database owners either to tighten access or, conversely, to reorganize their interfaces to facilitate access). Because the true objective of a sequence may be clouded by the retrieval of spurious records, our approach considers all the possible aggregates that a user may accumulate with a sequence, and ranks them, search-engine style, according to their plausibility as retrieval objectives. Our method is probabilistic in nature and postulates that the likelihood that a set of records is the true objective of the user is inversely proportional to the likelihood that this set results from random selection. Our method is shown to have good performance even in the presence of noise (spurious records) as high as 40-50%.
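The plausibility postulate can be sketched with a simple combinatorial score: a candidate objective (a subset of the retrieved records that share some property) is scored by how unlikely it is that a uniformly random retrieval of the same size would consist entirely of records with that property, with plausibility taken as the reciprocal of that probability. This is only an illustration of the postulate, not the paper's full ranking model; the counts in the example are invented.

```python
from math import comb

def random_selection_probability(db_size, property_count, k):
    """Probability that k records drawn uniformly at random (without
    replacement) from a database of db_size records all hold a property
    shared by property_count records."""
    if k > property_count:
        return 0.0
    return comb(property_count, k) / comb(db_size, k)

def plausibility(db_size, property_count, k):
    """Score a candidate objective: the less likely it is to arise from random
    retrieval, the more plausible it is as the user's true objective."""
    p = random_selection_probability(db_size, property_count, k)
    return float("inf") if p == 0.0 else 1.0 / p

# Six retrieved records all from a 20-person department (out of 10,000
# employees) are far more plausible as an objective than six records that
# merely share a very common property (held by 6,000 employees).
print(plausibility(10_000, 20, 6), plausibility(10_000, 6_000, 6))
```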
ISE-TR-03-10
UML Specification of Access Control Policies and their Formal Verification
Manuel Koch and Francesco Parisi-Presicce
Security requirements have become an integral part of most modern software systems. In order to produce secure systems, it is necessary to provide software engineers with the appropriate systematic support. We propose a methodology to integrate the specification of access control policies into UML and provide a graph-based formal semantics for the UML access control specification, which permits reasoning about the coherence of the access control specification. The main concepts in the UML access control specification are illustrated with an example access control model for distributed object systems.
ISE-TR-03-11
Specifying Coherent Refactoring of Software Artefacts with Distributed Graph Transformations
Paolo Bottoni and Francesco Parisi-Presicce and Gabriele Taentzer
Refactoring is an important form of software transformation that changes the internal structure of a software system while preserving its behavior. Even though the input/output view of a system's behavior does not change, refactoring can have several consequences for the computing process, as expressed for instance by the sequence of method calls or by state changes of an object or an activity. Such modifications must be reflected in the system model, generally expressed through UML diagrams. We propose a formal approach, based on distributed graph transformation, to the coordinated evolution of code and model as an effect of refactorings. The approach can be integrated into existing refactoring tools. Due to its formal basis, it makes it possible to reason about the behavior preservation of each specified refactoring.
ISE-TR-02-01
Provisions and Obligations in Policy Management and Security Applications
Claudio Bettini and Sushil Jajodia and X. Sean Wang and Duminda Wijesekera
Policies are widely used in many different systems and applications. Recently, it has been recognized that a “yes/no” response to every scenario is just not enough for many modern systems and applications. Many policies require certain conditions to be satisfied and actions to be performed before or after a decision is made in accordance with the policy. To address this need, this paper introduces the notions of provisions and obligations. Provisions are those conditions that need to be satisfied or actions that must be performed before a decision is rendered, while obligations are those conditions or actions that must be fulfilled by either the users or the system after the decision. This paper formalizes a rule-based policy framework that includes provisions and obligations, and investigates a reasoning mechanism within this framework. A policy decision may be supported by more than one derivation, each associated with a potentially different set of provisions and obligations (called a global PO set). The reasoning mechanism can derive all the global PO sets for each specific policy decision, and facilitates the selection of the best one based on numerical weights assigned to provisions and obligations as well as on semantic relationships among them. The paper also shows the use of the proposed policy framework in a security application and discusses through an example various aspects of how the system may compensate unfulfilled obligations.
ISE-TR-02-02
A Scalable and Reliable Key Distribution Protocol for Multicast Group Rekeying
Sanjeev Setia and Sencun Zhu and Sushil Jajodia
Scalable group rekeying is one of the important problems that needs to be addressed in order to support secure communications for large and dynamic groups. One of the challenging issues that arises in scalable group rekeying is the problem of delivering the updated keys to the members of the group in a reliable and timely manner. In this paper, we present a new scalable and reliable key distribution protocol for group key management schemes that use logical key hierarchies for scalable group rekeying. Our protocol, called WKA-BKR, is based upon two key ideas, weighted key assignment and batched key retransmission, both of which exploit the special properties of logical key hierarchies to reduce the overhead and increase the reliability of the key delivery protocol. We have evaluated the performance of our approach using detailed simulations. Our results show that for most network loss scenarios, the bandwidth used by our protocol is lower than that of previously proposed key delivery protocols.
ISE-TR-02-03
TupleRank: Ranking Discovered Content in Virtual Databases
Jacob Berlin and Amihai Motro
Recently, the problem of data integration has been newly addressed by methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated in a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a specific value that denotes assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. In this paper we argue that rows in discovered virtual databases should be ranked, and we describe a ranking method, called TupleRank, for calculating such a ranking order. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
ISE-TR-02-04
Cardinality-based Inference Control in Sum-only Data Cubes (Extended Version)
Lingyu Wang and Duminda Wijesekera and Sushil Jajodia
This paper deals with the inference problems in data warehouses and decision support systems such as on-line analytical processing (OLAP) systems. Even though OLAP systems restrict user accesses to predefined aggregations, the possibility of inappropriate disclosure of sensitive attribute values still exists. Based on a definition of non-compromiseability that requires every member of a set of variables satisfying a given set of their aggregates to have more than one possible value, we derive sufficient conditions for non-compromiseability in sum-only data cubes. Specifically, (1) the non-compromiseability of multi-dimensional aggregates can be reduced to that of one-dimensional aggregates, (2) full or dense core cuboids are non-compromiseable, and (3) there is a tight lower bound on the cardinality of a core cuboid for it to remain non-compromiseable. Based on those conditions, and on a three-tiered model for controlling inferences, we provide a divide-and-conquer algorithm that uniformly divides data sets into chunks and builds a data cube on each such chunk. The union of those data cubes is then used to provide users with inference-free OLAP queries.
ISE-TR-02-05
A Comparative Performance Analysis of Reliable Group Rekey Transport Protocols for Secure Multicast
Sanjeev Setia and Sencun Zhu and Sushil Jajodia
In this paper, we present a new scalable and reliable key distribution protocol for group key management schemes that use logical key hierarchies for scalable group rekeying. Our protocol, called WKA-BKR, is based upon two ideas --- weighted key assignment and batched key retransmission --- both of which exploit the special properties of logical key hierarchies and the group rekey transport payload to reduce the bandwidth overhead of the reliable key delivery protocol. Using both analytic modeling and simulation, we compare the performance of WKA-BKR with that of other rekey transport protocols, including a recently proposed protocol based on proactive FEC. Our results show that for most network loss scenarios, the bandwidth used by WKA-BKR is lower than that of the other protocols.
ISE-TR-02-06
Performance Optimizations for Group Key Management Schemes for Secure Multicast
Sencun Zhu and Sanjeev Setia and Sushil Jajodia
Scalable group rekeying is one of the biggest challenges that need to be addressed to support secure communications for large and dynamic groups. In recent years, many group key management approaches based on the use of logical key trees have been proposed to address this issue. Using logical key trees reduces the complexity of the group rekeying operation from O(N) to O(log N), where N is the group size. In this paper, we present two optimizations for logical key tree organizations that utilize information about the characteristics of group members to further reduce the overhead of group rekeying. First, we propose a partitioned key tree organization that exploits the temporal patterns of group member joins and departures to reduce the overhead of rekeying. Using an analytic model, we show that our optimization can achieve up to a 31.4% reduction in key server bandwidth overhead over the un-optimized scheme. Second, we propose an approach under which the key tree is organized based on the loss probabilities of group members. Our analysis shows that this optimization can reduce the rekeying overhead by up to 12.1%.
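The O(log N) behavior that both optimizations build on comes from the fact that a member departure only invalidates the keys on the path from that member's leaf to the root of the logical key tree; the small sketch below just computes that path length for an assumed full tree of a given degree.

```python
import math

def rekey_path_length(group_size, degree=2):
    """Number of keys that must be replaced when one member leaves a group
    keyed with a full logical key tree of the given degree: the keys on the
    path from the member's leaf to the root, i.e. O(log N) rather than O(N)."""
    return max(1, math.ceil(math.log(group_size, degree)))

# Illustration for a group of 4096 members: a flat keying scheme would touch
# on the order of 4096 keys, while a binary key tree updates only about 12.
print(rekey_path_length(4096, degree=2))
```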
ISE-TR-02-07
Maintaining Evolving Component-Based Software with UML
Ye Wu and Jeff Offutt
Component-based software engineering has been increasingly adopted for software development. This approach relies on using reusable components as the building blocks for constructing software. On the one hand, this helps improve software quality and productivity; on the other hand, it necessitates frequent maintenance activities, such as upgrading third party components or adding new features. The cost of maintenance for conventional software can account for as much as two-thirds of the total cost, and it is likely to be more for component-based software. Thus, an effective maintenance technique for component-based software is strongly desired. This paper presents a UML-based technique that attempts to help resolve difficulties introduced by the implementation transparent characteristics of component-based software systems. This technique can also be useful for other maintenance activities. For corrective maintenance activities, the technique starts with UML diagrams that represent changes to a component, and uses them to support regression testing. To accommodate this approach for perfective maintenance activities, more challenges are encountered. We provide a UML-based framework to evaluate the similarities of the old and new components, and corresponding retesting strategies are provided.
ISE-TR-02-08
Modeling and Testing Web-based Applications
Ye Wu and Jeff Offutt
The Internet is quietly becoming the body of the business world, with web applications as the brains. This means that software faults in web applications have potentially disastrous consequences. Most work on web applications has been on making them more powerful, but relatively little has been done to ensure their quality. Important quality attributes for web applications include reliability, availability, interoperability and security. Web applications share some characteristics of client-server, distributed, and traditional programs; however, there are a number of novel aspects of web applications. These include the fact that web applications are “dynamic”, due to factors such as frequent changes of the application requirements as well as dramatic changes in web technologies; the fact that the roles of the clients and servers change dynamically; the heterogeneity of the hardware and software components; the extremely loose coupling and dynamic integration; and the ability of the user to directly affect the control of execution. In this paper, we first analyze the distinct features of web-based applications, and then define a generic analysis model that characterizes the typical behaviors of web-based applications independently of particular technologies. Based on this analysis, a family of testing techniques is discussed to ensure different levels of quality control of web applications under various situations.
ISE-TR-02-09
LHAP: A Lightweight Hop-by-Hop Authentication Protocol For Ad-Hoc Networks
Sencun Zhu and Shouhuai Xu and Sanjeev Setia and Sushil Jajodia
Most ad hoc networks do not implement any network access control, leaving these networks vulnerable to resource consumption attacks where a malicious node injects packets into the network with the goal of depleting the resources of the nodes relaying the packets. To thwart or prevent such attacks, it is necessary to employ authentication mechanisms that ensure that only authorized nodes can inject traffic into the network. In this paper, we present LHAP, a scalable and light-weight authentication protocol for ad hoc networks. LHAP is based on two techniques: (i) hop-by-hop authentication for verifying the authenticity of all the packets transmitted in the network and (ii) one-way key chain and TESLA for packet authentication and for reducing the overhead for establishing trust among nodes. We analyze the security properties of LHAP. Furthermore, our performance analysis shows that LHAP is a very lightweight security protocol.
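The one-way key chain ingredient can be sketched as a simple hash chain: keys are generated by repeated hashing and disclosed in the reverse order, so a receiver holding an earlier commitment can verify any later key by re-hashing it. The hash function, chain length, and verification bound below are assumptions, not LHAP's exact parameters.

```python
import hashlib

def build_key_chain(seed: bytes, length: int):
    """Build a one-way hash chain: each key is the hash of the next one, so
    keys are disclosed in reverse order of generation. The returned list is in
    disclosure order, with the commitment (released first) at index 0."""
    chain = [hashlib.sha256(seed).digest()]
    for _ in range(length - 1):
        chain.append(hashlib.sha256(chain[-1]).digest())
    return list(reversed(chain))

def verify(disclosed_key: bytes, commitment: bytes, max_steps: int) -> bool:
    """Check that repeatedly hashing a disclosed key reaches the commitment."""
    k = disclosed_key
    for _ in range(max_steps):
        if k == commitment:
            return True
        k = hashlib.sha256(k).digest()
    return k == commitment

chain = build_key_chain(b"node-secret", 10)
print(verify(chain[5], chain[0], max_steps=10))  # True
```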
ISE-TR-01-01
Large Constraint Joins and Disjoint Decompositions
Alexander Brodsky and Victor E. Segal
This paper is devoted to the problem of evaluation of a conjunction of N disjunctions of linear constraints in a multidimensional space (constraint join), which is a common constraint database operation. Existing methods are exponential in N. Here combinatorial geometry methods are applied to establish conditions for a polynomial size of the join output. A polynomial method to compute a join for an arbitrary input is then given. As part of the solution, a new algorithm for convex decomposition of an arbitrary non-convex polyhedron is developed. The H-tree is proposed --- a new data structure combining constraint storage and indexing. H-tree and R-tree implementations of the algorithm are given, and analytical considerations regarding the performance are provided.
ISE-TR-01-02
VirtuE: Virtual Enterprises for Information Markets
Alessandro D'Atri and Amihai Motro
An essential part of a modern economy is an information market. In this market, information products are being traded in countless ways. Information is bought, modified, integrated, incorporated into other products and then sold again. Usually, the production of an information product requires the collaboration of several participants. A virtual enterprise is a community of business entities that collaborate on the manufacturing of new products. This collaboration is often ad hoc, for a specific product only, after which the virtual enterprise may dismantle. The virtual enterprise paradigm is particularly appealing for modeling collaborations for manufacturing information products, and in this paper we present a new model, called VirtuE, for modeling such activities.
ISE-TR-01-03
Generating Test Data From Requirements/Specifications: Phase IV Final Report
Jeff Offutt
This report presents results for the Rockwell Collins Inc. sponsored project on generating test data from requirements/specifications, which started January 1, 2000. The purpose of this project is to improve our ability to test software that needs to be highly reliable by developing formal techniques for generating test cases from formal specifications of the software. Formal specifications give software developers the opportunity to describe exactly what services the software should provide, providing information to software designers in a clear, unambiguous manner. Formal specifications give test engineers information about the expectations of the software in a form that can be automatically manipulated. This Phase IV, 2000 report presents progress on an empirical evaluation of the efficacy of the transition-pair criterion developed in previous years. Transition-pair tests have been developed for the Flight Guidance System (FGS) and they were run on the faulty versions of FGS developed last year. These tests and the results are summarized in this preliminary report. This report also presents progress in our test data generation tool. This tool has been significantly expanded to handle multiple SCR tables, recursively defined tables, event and condition tables, non-boolean variables, and multiple-variable expressions. It is integrated with the Naval Research Laboratory's SCRTool and Rational Software Corporation's Rational Rose tool.
ISE-TR-01-04
Maintainability of the Linux Kernel
Stephen R. Schach and Bo Jin and David R. Wright and Gillian Z. Heller and A. Jefferson Offutt
We have examined 365 versions of Linux. For every version, we counted the number of instances of common (global) coupling between each of the 17 kernel modules and all the other modules in that version of Linux. We found that the number of instances of common coupling grows exponentially with version number. This result is significant at the 99.99% level, and no additional variables are needed to explain this increase. On the other hand, the number of lines of code in each kernel module grows only linearly with version number. We conclude that, unless Linux is restructured with a bare minimum of common coupling, the dependencies induced by common coupling will, at some future date, make Linux exceedingly hard to maintain without inducing regression faults.
ISE-TR-01-05
Covering Bitmap with Trapezoids is Hard
Changzhou Wang and X. Sean Wang
Each sub-series of a time series can be mapped to a point on a plane whose two coordinates represent the starting time and the ending time, respectively. In certain applications, points with similar properties are likely to be clustered in a special trapezoid shape. To save these points in a compact way, the following optimization problem needs to be studied: find the minimum number of trapezoids which cover a given set of points. This paper formalizes the optimization problem as a decision problem, namely, Bitmap Cover with special Trapezoids (BCT), and proves that BCT is NP-hard.
ISE-TR-01-06
Database Schema Matching Using Machine Learning with Feature Selection
Jacob Berlin and Amihai Motro
Schema matching, the problem of finding mappings between the attributes of two semantically related database schemas, is an important aspect of many database applications such as schema integration, data warehousing, and electronic commerce. Unfortunately, schema matching remains largely a manual, labor-intensive process. Furthermore, the effort required is typically linear in the number of schemas to be matched; the next pair of schemas to match is not any easier than the previous pair. In this paper we describe a system, called Automatch, that uses machine learning techniques to automate schema matching. Based primarily on Bayesian learning, the system acquires probabilistic knowledge from examples that have been provided by domain experts. This knowledge is stored in a knowledge base called the attribute dictionary. When presented with a pair of new schemas that need to be matched (and their corresponding database instances), Automatch uses the attribute dictionary to find an optimal matching. We also report initial results from the Automatch project.
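A toy sketch of the dictionary-based matching idea: the attribute dictionary stores value statistics per known attribute, each new column is scored against each dictionary attribute with a smoothed, naive Bayes-style log-likelihood, and columns are assigned greedily. Automatch's actual dictionary, feature model, and optimal-matching step are richer; the functions, smoothing, and data below are purely illustrative.

```python
import math
from collections import Counter

def train_dictionary(examples):
    """examples: dict of attribute name -> list of example values.
    Returns per-attribute value frequencies (a toy 'attribute dictionary')."""
    return {a: Counter(map(str, vals)) for a, vals in examples.items()}

def score(column_values, counter, vocab_size=1000):
    """Smoothed log-likelihood of a new column's values under one dictionary
    attribute (a naive Bayes-style score; the smoothing is an assumption)."""
    total = sum(counter.values())
    return sum(math.log((counter[str(v)] + 1) / (total + vocab_size))
               for v in column_values)

def match_schema(new_columns, dictionary):
    """Greedy one-to-one matching of new columns to dictionary attributes by
    best score (the real system solves a global optimal matching instead)."""
    matches, used = {}, set()
    for col, values in new_columns.items():
        best = max((a for a in dictionary if a not in used),
                   key=lambda a: score(values, dictionary[a]), default=None)
        if best is not None:
            matches[col] = best
            used.add(best)
    return matches

dictionary = train_dictionary({"zip": ["22030", "22031"], "name": ["Ann", "Bo"]})
print(match_schema({"c1": ["22032", "22030"], "c2": ["Cai"]}, dictionary))
```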
ISE-TR-00-01
Suitability of the UML as an Architecture Description Language with Applications to Testing
Aynur Abdurazik
Increasingly, very high level designs of large software systems are being described by software architectures. A software architecture expresses the overall structure of the system in an abstract, structured way. The Unified Modeling Language (UML) is widely used to express mid- and low-level designs of software, and recent proposals have been made to adapt the UML for use as an architecture description language (ADL). This research is looking into problems associated with creating system-level software tests that are based on architecture descriptions. This paper discusses issues with using the UML as an ADL, and general problems with then using the architecture descriptions as a basis for generating tests.
ISE-TR-00-02
Generating Test Data From Requirements/Specifications: Phase III Final Report
Jeff Offutt
This report presents results for the Rockwell Collins Inc. sponsored project on generating test data from requirements/specifications, which started January 1, 1999. The purpose of this project is to improve our ability to test software that needs to be highly reliable by developing formal techniques for generating test cases from formal specifications of the software. Formal specifications represent a significant opportunity for testing because they precisely describe what functions the software is supposed to provide in a form that can be easily manipulated by automated means. This Phase III, 1999 report presents results from an empirical evaluation of the full predicate specification-based testing criterion developed during the first two phases of this project, and a proof-of-concept test data generation tool. The evaluation used a comparative study on a large industrial system, a research version of the Flight Guidance Mode Logic System (FGS) provided by Rockwell Collins. Full predicate tests were generated for FGS, and compared against the T-Vec generation scheme. T-Vec tests for FGS were also provided by Rockwell Collins. While creating and running the tests, one problem was found in the SCR specifications for FGS, and one problem was found in the already well tested implementation of FGS. Both T-Vec and the full predicate tests found almost the same number of faults, but T-Vec required more than five times as many tests; thus, the full predicate tests were more efficient. The proof-of-concept test data generator creates full predicate and transition-pair tests from an SCR specification. Currently, the tool requires the SCR specification to be a single mode transition table, all variables must be boolean, and the transition predicates must be single-variable expressions.
ISE-TR-00-03
Autoplex: Automated Discovery of Contents for Virtual Databases
Jacob Berlin and Amihai Motro
Most virtual database systems are suitable for environments in which the set of member information sources is small and stable. Consequently, present virtual database systems do not scale up very well. The main reason is the complexity and cost of incorporating new information sources into the virtual database. In this paper we describe a system, called Autoplex, which uses machine learning techniques for automating the discovery of new content for virtual database systems. Autoplex assumes that several information sources have already been incorporated (“mapped”) into the virtual database system by human experts (as done in standard virtual database systems). Autoplex learns the features of these examples. It then applies this knowledge to new candidate sources, trying to infer views that “resemble” the examples. In this paper we report initial results from the Autoplex project.
ISE-TR-99-01
Generating Test Data From Requirements/Specifications: Phase II Final Report
Jeff Offutt
This report presents results for the Rockwell Collins Inc. sponsored project on generating test data from requirements/specifications, which started January 1, 1998. The purpose of this project is to improve our ability to test software that needs to be highly reliable by developing formal techniques for generating test cases from formal specifications of the software. Formal specifications represent a significant opportunity for testing because they precisely describe what functions the software is supposed to provide in a form that can be easily manipulated by automated means. This Phase II, 1998 report presents results and strategies for practically applying test cases generated according to the criteria presented in the Phase I, 1997 report. This report presents a small empirical evaluation of the test criteria, and algorithms for solving various problems that arise when applying test cases developed from requirements/specifications. One significant problem in specification-based test data generation is that of reaching the proper program state necessary to execute a particular test case. Given a test case that must start in a particular state S, the test case prefix is a sequence of inputs that will put the software into state S. We have addressed this problem in two ways. The first is to combine various test cases into test sequences that are ordered in such a way that each test case leaves the software in the state necessary to run the subsequent test case. An algorithm is presented that attempts to find test case sequences that are optimal in the sense that as few test cases as possible are used. To handle situations where it is desired to run each test case independently, an algorithm for directly deriving test sequences is presented. This report also presents procedures for removing redundant test case values, and develops the idea of “sequence-pair” testing, which was presented in the 1997 Phase I report, into a more general idea of “interaction-pair” testing.
ISE-TR-99-02
Using Approximations to Scale Exploratory Data Analysis in Datacubes
Daniel Barbará and Xintao Wu
Exploratory Data Analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple, yet robust method to perform Exploratory Data Analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This makes it difficult to apply to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that uses median polish over an approximation of a datacube, easing the burden of I/O. The results obtained are tested for quality, using a variety of measures. The technique scales to large datacubes and proves to give a good approximation of the results that would have been obtained by median polish on the original data.
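For reference, basic median polish repeatedly sweeps row medians and then column medians out of a two-way table, and the residuals that remain flag anomalous cells. The in-memory sketch below shows only this residual computation; the paper runs the procedure over a compressed approximation of a large datacube to limit I/O, which is not modeled here.

```python
import numpy as np

def median_polish_residuals(table, iterations=10):
    """Iteratively sweep row medians and column medians out of a two-way
    table; the remaining residuals highlight anomalous cells (basic in-memory
    sketch of the median polish residual step)."""
    residuals = np.asarray(table, dtype=float).copy()
    for _ in range(iterations):
        residuals -= np.median(residuals, axis=1, keepdims=True)  # row medians
        residuals -= np.median(residuals, axis=0, keepdims=True)  # column medians
    return residuals

print(median_polish_residuals([[1, 2, 9],
                               [2, 3, 4],
                               [1, 2, 3]]))
# The cell that started at 9 keeps a large residual, flagging it as anomalous.
```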
ISE-TR-99-03
Chaotic Mining: Knowledge Discovery Using the Fractal Dimension
Daniel Barbará
Many real datasets behave in a fractal fashion, i.e., they show an invariance with respect to the scale used to look at them. Fractal sets are characterized by a family of fractal dimensions, each with a particular interpretation. In this paper we show examples of how the fractal dimension(s) can be used to extract knowledge from datasets. Techniques that use the fractal dimension to detect anomalies in time series, to characterize time patterns of association rules, to discover patterns in datacubes, and to cluster multidimensional datasets are described as part of an on-going research effort.
ISE-TR-99-04
Recovering Watermarks from Images
Zoran Duric and Neil F. Johnson and Sushil Jajodia
Many techniques for watermarking of digital images have appeared recently. Most of these techniques are sensitive to cropping and/or affine distortions (e.g., rotations and scaling). In this paper we describe a method for recognizing images based on the concept of identification mark; the method does not require the use of the original image, it requires only a small number of salient image points. We show that, using our method, it is possible to recognize distorted images and recover their “original” appearance. Once the image is recognized we use a second technique based on the normal flow to fine-tune image parameters. The restored image can be used to recover the watermark that had been embedded in the image by its owner.
ISE-TR-99-05
Testing the Polymorphic Relationships of Object-Oriented Components
Roger T. Alexander
As we move from developing procedure-oriented to object-oriented programs, the complexity traditionally found in functions and procedures is moving to the connections among components. More faults occur as components are integrated to form higher level aggregates of behavior and state. Consequently, we need to place more effort on testing the connections among components. Although object-oriented technology provides abstraction mechanisms to build components to integrate, it also adds new compositional relations that can contain faults, which must be found during integration testing. This paper describes new techniques for analyzing and testing the polymorphic relationships that occur in object-oriented software. The application of these techniques can result in an increased ability to find faults and overall higher quality software.
ISE-TR-99-06
Change Impact Analysis of Object-oriented Software
Michelle L. Lee
As the software industry has matured, we have shifted our resources from developing new software systems to making modifications in evolving software systems. A major problem for developers in an evolutionary environment is that seemingly small changes can ripple throughout the system to cause major unintended impacts elsewhere. As such, software developers need mechanisms to understand how a change to a software system will impact the rest of the system. Although the effects of changes in object-oriented software can be restricted, they are also more subtle and more difficult to detect. Maintaining current object-oriented systems is more of an art, similar to where we were 15 years ago with procedural systems, than an engineering skill. We are beginning to see “legacy” object-oriented systems in industry. A difficult problem is how to maintain these objects in large, complex systems. Although objects are more easily identified and packaged, features such as encapsulation, inheritance, aggregation, polymorphism and dynamic binding can make the ripple effects of object-oriented systems far more difficult to control than in procedural systems. The research presented here addresses the problems of change impact analysis for object-oriented software. Major results of this research include a set of object-oriented data dependency graphs, a set of algorithms that allow software developers to evaluate proposed changes on object-oriented software, a set of object-oriented change impact metrics to evaluate the change impact quantitatively, and a prototype tool, ChaT, to evaluate the algorithms. This research also results in efficient regression testing by helping testers decide what classes and methods need to be retested, and in supporting cost estimation and schedule planning.
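The report's dependency graphs and impact algorithms are not given in the abstract. As a rough illustration of the ripple-effect computation they build on, the Python sketch below treats impact analysis as reverse reachability over a dependency graph; a real object-oriented analysis must also distinguish encapsulation, inheritance, aggregation, and polymorphic edges, which this toy ignores, and the data layout is hypothetical.

    from collections import defaultdict, deque

    def impacted(depends_on, changed):
        """Estimate the ripple effect of a change: depends_on maps each class or
        method to the members it uses; return everything reachable from the
        changed set by following the dependency edges backwards."""
        reverse = defaultdict(set)
        for member, uses in depends_on.items():
            for used in uses:
                reverse[used].add(member)
        seen, work = set(changed), deque(changed)
        while work:
            used = work.popleft()
            for member in reverse[used]:
                if member not in seen:
                    seen.add(member)
                    work.append(member)
        return seen - set(changed)

The members returned are also natural candidates for retesting, which is how this style of analysis feeds regression test selection.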
ISE-TR-99-07
Resolving Inconsistencies in the Multiplex Multidatabase System
Philipp Anokhin and Amihai Motro
With the development of the Internet and the on-line availability of large numbers of information sources, the problem of integrating multiple heterogeneous information sources requires reexamination, its basic underlying assumptions having changed drastically. Integration methodologies must now contend with situations in which the number of potential data sources is very large, and the set of sources changes continuously. In addition, the ability to create quick, ad-hoc virtual databases for short-lived applications is now considered attractive. Under these new assumptions, a single, complete answer can no longer be guaranteed. It is now possible that a query could not be answered in its entirety, or it might result in several different answers. Multiplex is a multidatabase system designed to operate under these new assumptions. In this paper we describe how Multiplex handles queries that do not have a single, complete answer. The general approach is to define flexible and comprehensive strategies that direct the behavior of the query processing subsystem. These strategies may be defined either as part of the multidatabase design or as part of ad-hoc queries.
ISE-TR-99-08
Using the Fractal Dimension to Cluster Datasets
Daniel Barbará and Ping Chen
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. Clustering of large datasets has received a lot of attention in recent years. However, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the dataset and the number of dimensions that describe the points, to find arbitrarily shaped clusters, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the fractal properties of the datasets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large datasets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
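The abstract states FC's placement rule precisely enough to illustrate it. The Python sketch below is not the FC algorithm of the report (which initializes clusters carefully, uses efficient box counts, and makes a single scan of the data); it only shows the core placement step, using a crude box-counting dimension estimate and arbitrary parameter values.

    import numpy as np

    def dim_estimate(points, scales=(0.5, 0.25, 0.125)):
        """Crude box-counting dimension estimate of a small set of points."""
        pts = np.asarray(points, dtype=float)
        counts = [len(np.unique(np.floor(pts / r), axis=0)) for r in scales]
        slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
        return slope

    def place_point(clusters, point, threshold=0.05):
        """clusters is a list of lists of points. Add the point to the cluster
        whose fractal dimension changes least; if even the smallest change
        exceeds the threshold, start a new cluster (or treat the point as noise)."""
        changes = [abs(dim_estimate(c + [point]) - dim_estimate(c)) for c in clusters]
        best = int(np.argmin(changes))
        if changes[best] <= threshold:
            clusters[best].append(point)
        else:
            clusters.append([point])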
ISE-TR-99-09
Generating Test Cases from UML Specifications
Aynur Abdurazik and Jeff Offutt
The Unified Modeling Language (UML) is a third-generation modeling language for object-oriented software engineering. It provides constructs to specify, construct, visualize, and document artifacts of software-intensive systems. This paper presents a technique that uses Offutt's state-based specification test data generation model to generate test cases from UML statecharts. A tool, TCGen, has been developed to demonstrate this technique with specifications written in the Software Cost Reduction (SCR) method and the Unified Modeling Language (UML). Experimental results from using this tool are presented.
ISSE-TR-98-02
Input Validation Testing: A Requirements-Driven, System Level, Early Lifecycle Technique
Jane Hayes and Jeff Offutt
Syntax-directed software accepts user inputs that, when constructed and arranged properly, control the execution of the application. Input validation testing chooses test data that attempt to show the presence or absence of specific faults pertaining to input tolerance. A large amount of syntax-directed software currently exists and will continue to be developed that should be subjected to input validation testing. System level testing techniques that currently address this area are not well developed or formalized. There is a lack of formal research on system level testing and, accordingly, a lack of formal, standard criteria, general purpose techniques, and tools. Systems are large (in size and domain), so unit testing techniques have had limited applicability. Input validation testing techniques have not been developed or automated to assist in static input syntax evaluation and test case generation. This paper addresses the problem of statically analyzing input command syntax as defined in interface and requirements specifications and then generating test cases for input validation testing. The IVT (Input Validation Testing) technique has been developed, a proof-of-concept tool (MICASA) has been implemented, and validation has been performed. Empirical validation on actual industrial software (for the Tomahawk Cruise Missile) shows that, as compared with senior, experienced testers, MICASA found more requirement specification defects, generated test cases with higher syntactic coverage, and found additional defects. Additionally, the tool performed at significantly less cost.
ISSE-TR-98-03
Quasi-Cubes: A Space-Efficient Way to Support Approximate Multidimensional Databases
Daniel Barbará and Mark Sullivan
A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains at each point an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situations, cubes can require a large amount of storage. The typical approach to reducing storage cost is to materialize parts of the cube on demand. Unfortunately, this lazy evaluation can be a time-consuming operation. In this paper, we propose an approximation technique that reduces the storage cost of the cube without incurring the run time cost of lazy evaluation. The idea is to characterize regions of the cube by using statistical models whose descriptions take less space than the data itself. Then, the model parameters can be used to estimate the cube cells with a certain level of accuracy. To increase the accuracy, some of the “outliers,” i.e., cells that incur the largest errors when estimated, can be retained. The storage taken by the model parameters and the retained cells, of course, should take a fraction of the space of the full cube, and the estimation procedure should be faster than computing the data from the underlying relations. We study three methods for modeling cube regions: linear regression, singular value decomposition, and the independence approach. We study the tradeoff between the amount of storage used for the description and the accuracy of the estimation. Experiments show that the errors introduced by the estimation are small even when the description is substantially smaller than the full cube. Since cubes are used to support data analysis and analysts are rarely interested in the precise values of the aggregates (but rather in trends), providing approximate answers is, in most cases, a satisfactory compromise. Our technique allows systems to handle very large cubes that can be neither fully stored nor efficiently materialized on demand. Even for applications that can accept only tiny errors in the cube data, our techniques can be used to present successive on-line refinements of the answer to a query, so that a coarse picture of the answer can be presented while the remainder of the answer is fetched from disk (i.e., to support on-line aggregation).
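Of the three modeling methods named, the SVD flavor is the easiest to sketch. The Python toy below is a hypothetical, much simplified stand-in for the quasi-cube idea: describe one (here 2-D) region of a cube with a low-rank model, retain the worst-estimated cells exactly as outliers, and answer cell queries from that description; the real technique partitions the cube into regions and balances models and outliers against a space budget.

    import numpy as np

    def describe_region(region, rank=1, n_outliers=5):
        """Describe a 2-D cube region by a rank-k SVD model plus the few cells
        whose estimates are worst, which are kept exactly as outliers."""
        region = np.asarray(region, dtype=float)
        U, s, Vt = np.linalg.svd(region, full_matrices=False)
        model = (U[:, :rank], s[:rank], Vt[:rank])
        approx = (model[0] * model[1]) @ model[2]
        err = np.abs(region - approx)
        worst = np.argsort(err, axis=None)[-n_outliers:]
        outliers = {}
        for flat in worst:
            idx = tuple(int(v) for v in np.unravel_index(flat, err.shape))
            outliers[idx] = float(region[idx])
        return model, outliers

    def estimate_cell(model, outliers, i, j):
        """Answer a cell query from the compact description instead of the cube."""
        if (i, j) in outliers:
            return outliers[(i, j)]
        U, s, Vt = model
        return float((U[i] * s) @ Vt[:, j])

The storage for the model grows with (rows + columns) x rank plus the retained outliers, rather than with rows x columns, which is the kind of space saving the report quantifies.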
ISSE-TR-97-01
Redistributing Secret Shares to New Access Structures and Its Applications
Yvo Desmedt and Sushil Jajodia
Proactive secret sharing deals with refreshing secret shares, i.e., redistributing the shares of a secret to the original access structure. In this paper we focus on the general problem of redistributing shares of a secret key. Shares of a secret have been distributed such that access sets specified in the access structure (e.g., t-out-of-l) can access (or use) the secret. The problem is how to redistribute the secret, without recovering it, in such a way that those specified in the new access structure will be able to recover the secret. We also adapt our scheme such that it can be used in the context of threshold cryptography and discuss its applications to secure databases.
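To make redistribution concrete, here is a minimal Python sketch for the special case of Shamir threshold (t-out-of-n) sharing over a prime field; the report handles general access structures and the threshold-cryptography setting, which this toy does not. The field choice and all names are illustrative, and pow(x, -1, P) requires Python 3.8 or later.

    import random

    P = 2**127 - 1                       # toy prime modulus for the share field

    def _eval(poly, x):
        acc = 0
        for c in reversed(poly):         # poly[0] is the constant term (the secret)
            acc = (acc * x + c) % P
        return acc

    def share(secret, t, holders):
        """Shamir t-out-of-n sharing; holder ids must be distinct and nonzero."""
        poly = [secret % P] + [random.randrange(P) for _ in range(t - 1)]
        return {i: _eval(poly, i) for i in holders}

    def _lagrange_at_zero(ids):
        """Coefficients that combine shares {i: f(i)} into f(0)."""
        coeffs = {}
        for i in ids:
            num, den = 1, 1
            for j in ids:
                if j != i:
                    num = (num * -j) % P
                    den = (den * (i - j)) % P
            coeffs[i] = num * pow(den, -1, P) % P
        return coeffs

    def recover(shares):
        lam = _lagrange_at_zero(list(shares))
        return sum(lam[i] * s for i, s in shares.items()) % P

    def redistribute(old_shares, new_t, new_holders):
        """Move an authorized set's shares to a new_t-out-of-len(new_holders)
        structure without reconstructing the secret: each old holder re-shares
        its own share, and each new holder keeps a Lagrange-weighted sum of the
        sub-shares it receives."""
        lam = _lagrange_at_zero(list(old_shares))
        sub = {i: share(s, new_t, new_holders) for i, s in old_shares.items()}
        return {j: sum(lam[i] * sub[i][j] for i in old_shares) % P
                for j in new_holders}

    # toy usage: two of three old holders move the secret to a 3-out-of-5 structure
    old = share(1234, t=2, holders=[1, 2, 3])
    new = redistribute({1: old[1], 3: old[3]}, new_t=3, new_holders=[1, 2, 3, 4, 5])
    assert recover({i: new[i] for i in (2, 4, 5)}) == 1234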
ISSE-TR-97-02
Generating Test Data from SOFL Specifications
A. Jefferson Offutt and Shaoying Liu
Software testing can only be formalized and quantified when a solid basis for test generation can be defined. Bases that are commonly used include the source code, control flow graphs, design representations, and specifications/requirements. Formal specifications represent a significant opportunity for testing because they precisely describe what functions the software is supposed to provide in a form that can be easily manipulated. In this paper, we present a new method for generating tests from formal specifications. This method is comprehensive in specification coverage, applies at several levels of abstraction, and can be highly automated. We apply our method to SOFL specifications, describe the technique, and demonstrate the application on a case study. A preliminary evaluation using a code-level coverage criterion (mutation testing) indicates that the method can result in very effective tests.
ISSE-TR-96-01
Subsumption of Condition Coverage Techniques by Mutation Testing
A. Jefferson Offutt and Jeffrey M. Voas
Condition coverage testing is a family of testing techniques that are based on the logical flow of control through a program. The condition coverage techniques include a variety of requirements, including that each statement in the program is executed and that each branch is executed. Mutation testing is a fault-based testing technique that is widely considered to be very powerful, and that imposes requirements on testing that include, and go beyond, many other techniques. In this paper, we consider the six common condition coverage techniques, and formally show that these techniques are subsumed by mutation testing, in the sense that if mutation testing is satisfied, then the condition coverage techniques are also satisfied. The fact that condition coverage techniques are subsumed by mutation has immediate practical significance because the extensive research that has already been done for mutation can be used to support condition coverage techniques, including automated tools for performing mutation testing and generating test cases.
ISSE-TR-96-02
Algorithms for Change Impact Analysis of Object-oriented Software
Li Li and A. Jefferson Offutt
As the software industry has matured, we have shifted from almost exclusively devoting our resources to the development of new software systems to devoting large portions of our resources to making modifications to evolving software systems. A major problem for developers in an evolutionary environment is that seemingly small changes in one place can ripple throughout the system to have major unintended impacts elsewhere; thus, software developers need mechanisms to understand what impact a change to a software system will have on the rest of the system. With object-oriented software, the impacts of changes are restricted (largely because of encapsulation), but are also more subtle and harder to determine. This paper presents algorithms to analyze potential impacts of changes to object-oriented software, taking into account encapsulation, inheritance, and polymorphism. This technique allows software developers to perform “what if” analysis on the impacts of proposed changes, and thereby choose the change that has the least impact on the rest of the system. The analysis also offers valuable information to regression testing, by suggesting what classes and methods need to be retested, and to project managers, who can use the results for cost estimation and schedule planning.
ISSE-TR-96-03
A Prototype Domain Modeling Environment for Reusable Software Architectures
H. Gomaa and L. Kerschberg and V. Sugumaran and C. Bosch and I. Tavakoli
Contact authors for abstract and paper
ISSE-TR-96-04
Domain Modeling of the Spiral Process Model
Hassan Gomaa and Larry Kerschberg and Ghulam Ahmad Farrukh and Rajan Girish Mohan
Contact authors for abstract and paper
ISSE-TR-96-05
PROGEN: A Knowledge-based System for Process Model Generation, Tailoring and Reuse
Larry Kerschebrg and Hassan Gomaa and Rajan Girish Mohan and Ghulam Ahmad Farrukh
Contact authors for abstract and paper
ISSE-TR-96-06
Discovering Temporal Relationships with Multiple Granularities in Time Sequences
Claudio Bettini and X. Sean Wang and Sushil Jajodia and Jia-Ling Lin
An important usage of time sequences is to discover temporal patterns. The discovery process usually starts with a user-specified skeleton, called an “event structure”, which consists of a number of variables representing events and temporal constraints among these variables; and the goal of the discovery is to find temporal patterns, i.e., instantiations of the variables in the structure, which frequently appear in the time sequence. This paper introduces event structures that have temporal constraints with multiple granularities, defines the pattern discovery problem with these structures, and studies effective algorithms to solve it. The basic components of the algorithms include timed automata with granularities (TAGs) and a number of heuristics. The TAGs are for testing whether a specific temporal pattern, called a “candidate complex event type”, appears frequently in a time sequence. Since there are often a huge number of candidate event types for a usual event structure, heuristics are presented aiming at reducing the number of candidate event types and reducing the time spent by the TAGs to test whether a candidate type does appear frequently in the sequence. These heuristics exploit the information provided by explicit and implicit temporal constraints with granularity in the given event structure. The paper also gives the results of an experiment to show the effectiveness of the heuristics on a real data set.
ISSE-TR-96-07
The C3 Constraints Object-Oriented Database System
Alexander Brodsky and Victor E. Segal and Pavel A. Exarkhopoulo
Constraints provide a flexible and uniform way to conceptually represent diverse data capturing spatio-temporal behavior, complex modeling requirements, partial and incomplete information, etc., and have been used in a wide variety of application domains. Constraint databases have recently emerged to deeply integrate data captured by constraints in databases. This paper reports on the development of the first constraint object-oriented database system, Ccube, and describes its specification, design and implementation. The Ccube system is designed to be used for both implementation and optimization of high-level constraint object-oriented query languages such as LyriC or constraint extensions of OQL, and for directly building software systems requiring extensible use of constraint database features. The Ccube data manipulation language, Constraint Comprehension Calculus, is an integration of a constraint calculus for extensible constraint domains within monoid comprehensions, which serve as an optimization-level language for object-oriented queries. The data model for constraint calculus is based on constraint spatio-temporal (CST) objects that may hold spatial, temporal or constraint data, conceptually represented by constraints. New CST objects are constructed, manipulated and queried by means of constraint calculus. The model for monoid comprehensions, in turn, is based on the notion of monoids, which is a generalization of collection and aggregation types to structures over which one can iterate and apply a merge operator; this includes disjunctions and conjunctions of constraints. The focal point of our work is achieving the right balance between expressiveness, complexity and representation usefulness, without which the practical use of the system would not be possible. To that end, the C3 constraint calculus guarantees polynomial time data complexity, and, furthermore, is tightly integrated with monoid comprehensions to allow deep global optimization.
ISSE-TR-96-08
Applying Logic-based Database to Impact Analysis of Object-oriented Software
Li Li and Jeff Offutt
Contact authors for paper
As the software industry has matured, we have shifted our resources, once devoted to developing new software systems, to making modifications in evolving software systems. A major problem for developers in an evolutionary environment is that seemingly small changes can ripple throughout the system to have major unintended impacts elsewhere. As a result, software developers need mechanisms to understand how a change to a software system will affect the rest of the system. Although the effects of changes in object-oriented software are restricted, they are also more subtle and more difficult to detect. The technique presented in this paper allows software developers to perform “what if” analysis on the effect of proposed changes, and thereby choose the change that has the least influence on the rest of the system. The analysis also adds valuable information to regression testing, by suggesting what classes and methods need to be re-tested, and to project managers, who can use the results for cost estimation and schedule planning. This paper presents algorithms to analyze the potential impacts of changes to object-oriented software, taking into account encapsulation, inheritance, and polymorphism, and implements the algorithms using datalog, a deductive database model. Using datalog makes the implementation of the algorithms relatively easy. It also expands the range of the impact analysis of an object-oriented system to any interesting question that can be expressed as a datalog query.
ISSE-TR-96-09
Mutation Operators for Ada
A. Jefferson Offutt and Jeff Voas and Jeff Payne
Mutation analysis is a method for testing software and for assessing the adequacy of test data. This report describes the mutation operators defined for the Ada programming language. The mutation operators are categorized using syntactic criteria, in a form suitable for an implementor of a mutation-based system, or a tester wishing to understand how mutation analysis can be used to test Ada programs. Each mutation operator is carefully defined, and when appropriate, implementation notes and suggestions are provided. We include operators for all syntactic elements of Ada, including exception handling, generics, and tasking. A summary table listing all operators for Ada and comparing them with the C and Fortran operators is also provided. The design described here is the result of deliberations among the authors in which all aspects of the Ada language and software development in Ada were considered. These operators can also be viewed as the culmination of previous mutation operator definitions for other languages. This report is intended to serve as a manual for the Ada mutation operators.
ISSE-TR-96-10
A General Framework for Time Granularity and Its Application to Temporal Reasoning
Claudio Bettini and X. Sean Wang and Sushil Jajodia
This paper presents a general framework to define time granularity systems. We identify the main dimensions along which different systems can be characterized, and investigate the formal relationships among granularities in these systems. The paper also introduces the notion of a network of temporal constraints with (multiple) granularities emphasizing the semantic and computational differences from constraint networks with a single granularity. Consistency of networks with multiple granularities is shown to be NP-hard in general and approximate solutions for this problem and for the “minimal network” problem are proposed.
ISSE-TR-95-00
Integration Testing Based on Software Couplings
Zhenyi Jin and A. Jefferson Offutt
Integration testing is an important part of the testing process, but few integration testing techniques have been systematically studied or defined. This paper presents an integration testing technique based on couplings between software components. The coupling-based testing technique is described, and 12 coverage criteria are defined. The coupling-based technique was compared with the category-partition method. Results show that the coupling-based technique detected more faults, and used fewer test cases, than category-partition on a subject program. This modest result indicates that the coupling-based testing approach can benefit practitioners who are performing integration testing on software.
ISSE-TR-95-01
Multilevel Relational (MLR) Data Model
Fang Chen and Ravi S. Sandhu
Many multilevel data models have been proposed, and different models have different merits. In this paper, we unify many of these merits to build a new multilevel data model as a natural extension of the traditional relational data model. We describe our new model, called the Multilevel Relational (MLR) data model, for multilevel relations with element-level labeling. The MLR model builds upon prior work of numerous authors in this area, and integrates ideas from a number of sources. A new data-based semantics is given to the MLR data model; it combines ideas from SeaView, belief-based semantics, and the LDV model, and has the advantages of both eliminating ambiguity and retaining upward information flow. The model is simple, unambiguous and powerful. It has five integrity properties and five operation statements for manipulating multilevel relations. In order to support this integration, several new concepts are introduced and many old ones are redefined. A central contribution of this paper is proofs of soundness, completeness and security of the MLR data model. These proofs respectively show that any of the provided operation statements can keep the database state legal (i.e., satisfying all integrity properties), every legal database state can be constructed, and the MLR data model is non-interfering. This is the first time that such properties have been formally proved for a multilevel relational data model. The expressive power of the MLR model is also discussed in this paper and compared with that of several other models.
ISSE-TR-95-02
The Effectiveness of Category-Partition Testing of Object-Oriented Software
Alisa Irvine and A. Jefferson Offutt
When migrating from conventional to object-oriented programming, developers face difficult decisions in modifying their development process to best use the new technology. In particular, ensuring that the software is highly reliable in this new environment poses different challenges. Developers need understanding of effective ways to test the software. This paper presents empirical data that show that the existing technique of category-partition testing can effectively find faults in object-oriented software, and new techniques are not necessarily needed. For this study, we identified types of faults that are common to C++ software and inserted faults of these types into two C++ programs. Test cases generated using the category-partition method were used to test the programs. A fault was considered detected if it caused the program to terminate abnormally or if the output was different from the output of the original program. The results show that the combination of the category-partition method and a tool for detecting memory management faults may be effective for testing C++ programs in general. Since there is no evidence that traditional techniques are not effective, software developers may not need new testing methods when migrating to object-oriented development.
ISSE-TR-95-03
Multiplex: A Formal Model for Multidatabases and Its Implementation
Amihai Motro
The integration of information from multiple databases has been an enduring subject of research for almost 20 years, and many different solutions have been attempted or proposed. Missing from this research has been a uniform framework. Usually, each solution develops its own ad-hoc framework, designed to address the particular aspects of the problem that are being attacked and the particular methodology that is being used. To address this situation, in this paper we define a formal model for multidatabases, which we call Multiplex. Multiplex is a simple extension of the relational model, which may serve as a uniform abstraction for many previous ad-hoc solutions. Multiplex is based on formal assumptions of integrability, which distinguish between scheme and instance reconcilability among independent databases. Multiplex supports database heterogeneity, and it provides several degrees of freedom that allow it to model actual situations encountered in multidatabase applications. In addition, in situations in which a single answer is not obtainable (either because the global query is not answerable, or there are multiple candidate answers), Multiplex defines approximative answers. Finally, Multiplex provides a practical platform for implementation. A prototype of such an implementation is described briefly.
ISSE-TR-95-04
The Problem of Optimal Approximations of Queries Using Views and Its Applications
Alexander Brodsky and Amihai Motro
In several recent works a common scenario is described: a relational database scheme R and a set of views V of this scheme are defined, and queries on the database scheme R must be transformed to equivalent queries on the views V. Given a query Q on R, the first problem is whether there exists a query Q_V expressed exclusively on V which is equivalent to Q, and, if it exists, how to find it. Clearly, an equivalent Q_V may not always exist. In such a case, it is important to approximate Q as best as possible with a query on V. The problems here are to define approximations, to determine whether a best approximation exists, whether it is unique, and how to find it. In this paper we formalize and answer these questions. Specifically, we show that the following problems are decidable for a conjunctive query Q and a set of conjunctive views V: (1) is there a conjunctive query Q_V on V that is equivalent to Q, and (2) is there a union Q_U of conjunctive queries on V that is equivalent to Q? Furthermore, we show that Q_V and Q_U are effectively computable if they exist. Moreover, the greatest lower bound of Q exists and is unique up to equivalence. This is done by introducing lower bound classifications, and proving their existence, finiteness and effective computability. This problem has many interesting applications, including physical database design (aimed at optimization of query processing), cooperative answering (where answers are annotated with some of their properties), and multidatabase query decomposition. We examine these applications, and we show how our work extends earlier results obtained in these applications.
ISSE-TR-95-05
Temporal Semantic Assumptions and Their Use in Database Query Evaluation
Claudio Bettini and X. Sean Wang and Sushil Jajodia
Temporal data explicitly stored in a temporal database are often associated with certain semantic assumptions. Each assumption can be viewed as a way of deriving implicit information from the explicitly stored data. Rather than leaving the task of deriving (possibly infinite) implicit data to application programs, as is the case currently, it is desirable that this be handled by the database management systems. To achieve this, this paper formalizes and studies two types of semantic assumptions: point-based and interval-based. The point-based assumptions include those assumptions that use interpolation methods, while the interval-based assumptions include those that involve different temporal types (time granularities). In order to incorporate semantic assumptions into query evaluation, this paper introduces a translation procedure that converts a user query into a system query such that the answer of this system query over the explicit data is the same as that of the user query over the explicit AND the implicit data. The paper also investigates the finiteness (safety) of user queries and system queries.
ISSE-TR-95-06
Using Formal Methods To Reason About Semantics-Based Decompositions Of Transactions
Paul Ammann and Sushil Jajodia and Indrakshi Ray
Many researchers have investigated the process of decomposing transactions into smaller pieces to increase concurrency. The research typically focuses on implementing a decomposition supplied by the database application developer, with relatively little attention to what constitutes a desirable decomposition and how the developer should obtain such a decomposition. In this paper, we argue that the decomposition process itself warrants attention. A decomposition generates a set of proof obligations that must be satisfied to show that a particular decomposition correctly models the original collection of transactions. We introduce the notion of semantic histories to formulate and prove the necessary properties. Since the decomposition impacts not only the atomicity of transactions, but isolation and consistency as well, we present a technique based on formal methods that allows these properties to be surrendered in a carefully controlled manner. Note: The text of this paper appears in VLDB '95, Proceedings of the Twenty-First International Conference on Very Large Databases, Zurich, Switzerland, September, 1995. In addition, the appendices of this report contain proofs that space constraints precluded from the VLDB paper.
ISSE-TR-95-07
A Performance Oriented Design Methodology for Large-Scale Distributed Data Intensive Information Systems
Daniel A. Menascé and Hassan Gomaa and Larry Kerschberg
The Earth Observing System (EOS) Data and Information System (EOSDIS) is perhaps one of the most important examples of large-scale, geographically distributed, and data intensive systems. Designing such systems in a way that guarantees that the resulting design will satisfy all functional and performance requirements is not a trivial task. This paper presents a performance-oriented methodology to design large-scale distributed data intensive information systems. The methodology is then applied to the design of EOSDIS Core System (ECS). Performance results, based on queuing network models of ECS are also presented.
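The queuing network models used for ECS are not given in the abstract. As a generic illustration of the kind of analytic model such a methodology rests on, the sketch below implements textbook exact mean-value analysis (MVA) for a closed, single-class network of queueing centres with user think time; the inputs are hypothetical, not ECS figures.

    def mva(service_demands, n_users, think_time=0.0):
        """Exact MVA for a closed, single-class queueing network. service_demands[k]
        is the total service demand at centre k (seconds per interaction).
        Returns the system throughput and the mean queue length at each centre."""
        queue = [0.0] * len(service_demands)
        throughput = 0.0
        for n in range(1, n_users + 1):
            resp = [d * (1.0 + q) for d, q in zip(service_demands, queue)]
            throughput = n / (think_time + sum(resp))
            queue = [throughput * r for r in resp]       # Little's law per centre
        return throughput, queue

For instance, mva([0.05, 0.20], n_users=50, think_time=5.0) gives the throughput and the queue build-up at the two centres; the centre with demand 0.20 saturates first and bounds throughput near 1/0.20 = 5 interactions per second.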
ISSE-TR-95-08
A Software Architectural Design Method for Large-Scale Distributed Data Intensive Information Systems
Hassan Gomaa and Larry Kerschberg and Daniel A. Menascé
This paper describes a Software Architectural Design Method for Large-Scale Distributed Data Intensive Information Systems. The method starts by developing a domain model of the system. The domain model is then mapped to a federated client/server software architecture, where the servers need to cooperate with each other to service client requests. To demonstrate the viability of the architecture, specific user scenarios are applied to the federated client / server software architecture using event sequence diagrams. The method is illustrated by applying it to the design of a complex software system, the Earth Observing System Data and Information System (EOSDIS) Core System.
ISSE-TR-95-09
Data and Information Architectures for Large-Scale Distributed Data Intensive Information Systems
Larry Kerschberg and Hassan Gomaa and Daniel A. Menascé
No abstract or online report available.
ISSE-TR-95-10
A Semantic Model of Program Faults
Jeff Offutt and Jane Hayes
Program faults are artifacts that are widely studied, but there are many aspects of faults that we still do not understand. In addition to the simple fact that one important goal during testing is to detect faults, a full understanding of the characteristics of faults is crucial to several research areas in testing. These include fault-based testing, testability, mutation testing, and the comparative evaluation of testing strategies. In this preliminary paper, we explore the fundamental nature of faults by looking at the differences between a syntactic and semantic characterization of faults. We offer definitions of these characteristics and explore the differentiation. Specifically, we discuss the concept of “size” of program faults --- the measurement of size provides interesting and useful distinctions between the syntactic and semantic characterization of faults. We use the fault size observations to make several predictions about testing and present data that supports this model. We also use the model to offer explanations about several questions that have intrigued testing researchers.
ISSE-TR-95-11
Finding Mode Invariants in SCR Specifications
Zhenyi Jin
This paper presents an algorithm to derive modeclass invariants from SCR specifications. Matrices are formed to help find the invariants. A case study of a cruise control system shows that the algorithm is capable of determining the same invariants that were proved via model checking in earlier investigations. This algorithm provides a mechanical procedure for computing invariants that is much easier and clearer than computing them from reachability graphs. It can be easily automated and can handle large systems. It views the SCR specifications from the SCR structure's point of view and analyzes mode invariants as structural properties. It presents a new way to analyze software requirements documents and to detect implicitly implied system properties that can be very useful and important in understanding, developing, and testing the software system. Related issues, such as generating test cases that are independent of the design structure based on the SCR specifications, are discussed to reveal possible research problems and directions for the future.
ISSE-TR-95-12
Integrating Testing with the Software Development Process
Zhenyi Jin and A. Jefferson Offutt
Software testing is usually postponed to the end of development, after coding has started, or even after it has ended. By waiting until this late in the process, testing winds up being compressed --- there are not enough resources (time and budget) remaining, problems with previous stages have been solved by taking time and dollars from testing, and there is insufficient time to adequately plan for testing. In this paper, we present an integrated approach to testing software, where testing activities begin as soon as development activities begin, and are carried out in parallel with the development activities. We present specific activities --- including planning, active testing, and development influencing activities --- that are associated with each of the traditional lifecycle phases. These activities can be carried out by the developers or by separate test engineers, and can be associated with development activities within the confines of any development process. These activities allow the tester to detect and prevent problems throughout the software development process, leading to more reliable and higher quality software.
ISSE-TR-95-13
Advanced Transaction Processing in Multilevel Secure File Stores
Elisa Bertino and Sushil Jajodia and Luigi Mancini and Indrajit Ray
The concurrency control requirements for transaction processing in a multilevel secure file system are different from those in conventional transaction processing systems; in particular, there is the need to coordinate transactions at different security levels, avoiding both potential timing covert channels and the starvation of transactions at high security levels. Suppose a transaction at a low security level attempts to write a data item that is being read by a transaction at a higher security level. On the one hand, a timing covert channel arises if the transaction at the low security level is either delayed or aborted by the scheduler. On the other hand, the transaction at the high security level may be subjected to an indefinite delay if it is forced to abort repeatedly. This paper extends the two-phase locking mechanism that is suitable for conventional transaction processing systems to multilevel secure data management systems. The scheme presented here avoids aborting higher level transactions while nonetheless guaranteeing serializability. The system programmer is provided with a powerful set of linguistic constructs that support exception handling, partial rollback, and forward recovery. The proper use of these constructs can prevent the indefinite delay in completing a high level transaction, and allows the system programmer to trade off starvation against transaction isolation.
ISSE-TR-95-14
Applying Formal Methods to Semantic-Based Decomposition of Transactions
Paul Ammann and Sushil Jajodia and Indrakshi Ray
In some important database applications, performance requirements are not satisfied by the traditional approach of serializability, in which transactions appear to execute atomically and in isolation on a consistent database state. Although many researchers have investigated the process of decomposing transactions into steps to increase concurrency, the focus of the research is typically on implementing a decomposition supplied by the database application developer, with relatively little attention to what constitutes a desirable decomposition and how the developer should obtain such a decomposition. In this paper, we focus on the decomposition process itself. A decomposition generates a set of proof obligations that must be satisfied to show that the decomposition correctly models the original collection of transactions. We introduce the notion of semantic histories to formulate and prove the necessary properties, and the notion of successor sets to describe efficiently the correct interleavings of steps. The successor set constraints use information about conflicts between steps so as to take full advantage of conflict serializability at the level of steps. We implement the resulting, verified decomposition in a two-phase locking environment.
ISSE-TR-95-15
ViewFinder: An Object-Oriented Browser
Alessandro D'Atri and Amihai Motro and Laura Tarantino
ViewFinder is a graphical tool for browsing in databases that provides a flexible, yet intuitive environment for exploratory searches. The design approach has been to provide maximum functionality and generality without sacrificing simplicity. The constructs of ViewFinder's external model are essentially object-oriented: class and token objects, membership relationships between tokens and classes, generalization relationships between classes, inheritance, and so on. This external model is based on an internal model which resembles a semantic network. Such a network may be extracted from a variety of data models, including object-oriented, entity-relationship and extended relational models. This architecture gives ViewFinder a large degree of model independence. The main constructs of the external model are displays of objects (either classes or tokens), called views. Commands are available for efficient traversal of the database, displaying views of any class or token. To speed up repetitive searches, views may be synchronized: the user sets up several views, linked in a tree-like structure, so that when the information displayed in the root view is modified (e.g., scrolled by the user), the contents of the other views change automatically. Additional commands are available to search, order, aggregate and select the information displayed in a view, thus providing a simple query facility.
ISSE-TR-94-00
An Experimental Determination of Sufficient Mutation Operators
Jeff Offutt and Ammei Lee and Gregg Rothermel and Roland Untch and Christian Zapf
Mutation testing is a technique for unit testing software that, although powerful, is computationally expensive. The principal expense of mutation is that many variants of the test program, called mutants, must be repeatedly executed. Selective mutation is a way to reduce the cost of mutation testing by reducing the number of mutants that must be executed. This paper reports experimental results that compare selective mutation testing to standard, or non-selective, mutation testing. The results support the hypothesis that selective mutation is almost as strong as non-selective mutation; in experimental trials selective mutation provides almost the same coverage as non-selective mutation, with a four-fold or more reduction in cost.
ISSE-TR-94-01
On the Expressive Power of the Unary Transformation Model
Ravi S. Sandhu and Srinivas Ganta
The Transformation Model (TRM) was recently introduced in the literature by Sandhu and Ganta. TRM is based on the concept of transformation of rights. The propagation of access rights in TRM is authorized entirely by existing rights for the object in question. It has been demonstrated in earlier work that TRM is useful for expressing various kinds of consistency, confidentiality, and integrity controls. In our previous work a special case of TRM named the Binary Transformation Model (BTRM) was defined. We proved that BTRM is equivalent in expressive power to TRM. This result indicates that it suffices to allow testing for only two cells of the matrix. In this paper we study the relationship between TRM and the Unary Transformation Model (UTRM). In UTRM, individual commands are restricted to testing for only one cell of the matrix (whereas individual TRM commands can test for multiple cells of the matrix). Contrary to our initial conjecture, we found, quite surprisingly, that TRM and UTRM are formally equivalent in terms of expressive power. The implications of this result for safety analysis are also discussed in this paper. The construction used to prove the equivalence of TRM and UTRM also indicates that they might not be practically equivalent.
ISSE-TR-94-02
A Graduate Course in Object-Oriented Analysis Based on Student-Generated Projects
Bo Sanden
This paper describes the experience from several offerings of a Software Requirements course based on the Object Modeling Technique by Rumbaugh et al. The paper describes the organization of the course, which includes a semester project carried out by groups of 3-4 students. Descriptions of 20 such projects are given, most of which were based on ideas from the student teams.
ISSE-TR-94-03
Entity-Life Modeling and Ada 9X
Jeffrey R. Carter and Bo I. Sanden
Entity-life modeling (ELM) is a design approach for reactive systems. ELM designs are usually implemented in Ada and rely on its packaging and tasking facilities. An ELM solution of a Flexible Manufacturing System (FMS) has been published with an implementation in Ada 83. Ada 83 has a number of shortcomings for reactive systems, many of which are addressed by the proposed revision, Ada 9X. This paper examines the implementation of the FMS in Ada 9X and compares it to the earlier implementation. Although the FMS problem was conceived independently from the Ada 9X effort, the Ada 83 implementation is virtually a catalog of problems that are elegantly addressed by Ada 9X. This indicates the importance of the 9X revision to make Ada useful in reactive, concurrent applications.
ISSE-TR-94-04
Object-Oriented Analysis for Control Systems with Shared Resources
Bo Sanden and Anhtuan Dinh
This paper proposes a modification of existing approaches to object-oriented analysis (particularly the Object Modeling Technique, OMT) to address the problem of resource sharing in control systems. In control-system analysis, it is important to identify certain objects as the resource users and others as the (shared) resources. Once this has been done, the system can often be made deadlock free simply by ensuring that resource users always acquire shared resources in the same order. Resource users can easily be highlighted in the object model, while events and actions associated with resource allocation can be shown in the dynamic model. A flexible manufacturing system (FMS) is given as a non-trivial example. In that example, jobs need simultaneous exclusive access to various devices.
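The deadlock-avoidance rule mentioned in the abstract (resource users always acquire shared resources in the same order) is easy to render in code. The Python sketch below is only a generic illustration of that rule with hypothetical names, not the OMT models or the FMS design of the report.

    import threading

    class OrderedDevices:
        """Grant exclusive access to shared devices only in one fixed global order,
        so jobs that need several devices at once can never form a hold-and-wait
        cycle (and therefore cannot deadlock on these locks)."""
        def __init__(self, names):
            self.rank = {name: k for k, name in enumerate(names)}
            self.locks = {name: threading.Lock() for name in names}

        def acquire(self, names):
            for name in sorted(names, key=self.rank.__getitem__):
                self.locks[name].acquire()

        def release(self, names):
            for name in sorted(names, key=self.rank.__getitem__, reverse=True):
                self.locks[name].release()

A job that needs, say, a robot arm and a milling station always locks them in the globally agreed order, no matter which device it logically uses first.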
ISSE-TR-94-05
Experiments with Data Flow and Mutation Testing
Jeff Offutt and Jie Pan and Kanupriya Tewary and Tong Zhang
This paper presents two experimental comparisons of data flow and mutation testing. These two techniques are widely considered to be effective for unit-level software testing, but can only be analytically compared to a limited extent. We compare the techniques by evaluating the effectiveness of test data developed for each. For a number of programs, we develop ten independent sets of test data; five to satisfy the mutation criterion, and five to satisfy the all-uses data flow criterion. These test sets are developed using automated tools, in a manner consistent with the way a test engineer would apply these testing techniques. We use these test sets in two separate experiments. First we apply a “cross scoring”, by measuring the effectiveness of the test data that was developed for one technique in terms of the other technique. Second, we investigate the ability of the test sets to find faults. We place a number of faults into each of our subject programs, and measure the number of faults that are detected by the test sets. Our results indicate that while both techniques are effective, mutation-adequate test sets are closer to satisfying data flow, and detect more faults.
ISSE-TR-94-06
A Safety Kernel for Traffic Light Control
Paul Ammann
The success of kernels for enforcing security has led to proposals to use kernels for enforcing safety. As a feasibility demonstration of one such proposal, we describe a kernel for traffic light control. We argue the utility of the kernel and state the safety properties the kernel must ensure. We give a formal specification of the kernel and show that the specification maintains the safety properties. We explain an implementation of the kernel in Ada and discuss use of the kernel.
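Neither the formal specification nor the Ada implementation appears in the abstract. Purely to illustrate the kernel idea, that the application proposes signal changes and a small trusted kernel refuses any change violating the safety property, here is a hypothetical Python toy whose invariant is that no two conflicting approaches are non-red at the same time; the approach names and conflict table are made up.

    CONFLICTS = {frozenset({"north_south", "east_west"})}

    class TrafficKernel:
        """Tiny safety-kernel illustration: all signal changes go through request(),
        which rejects any command that would break the safety invariant."""
        def __init__(self, approaches):
            self.state = {a: "red" for a in approaches}

        def request(self, approach, colour):
            if colour != "red":
                for other, current in self.state.items():
                    if (other != approach and current != "red"
                            and frozenset({approach, other}) in CONFLICTS):
                        return False      # refused: conflicting approach is not red
            self.state[approach] = colour
            return True

The application logic above the kernel (timing plans, sensors, pedestrian calls) can be arbitrarily complex and even faulty; safety rests only on the small kernel, which is the property the report argues for and proves of its specification.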
ISSE-TR-94-07
Algebraic Query Languages on Temporal Databases with Multiple Time Granularities
X. Sean Wang
This paper investigates algebraic query languages on temporal databases. The data model used is a multidimensional extension of the temporal modules introduced in [1]. In a multidimensional temporal module, every non-temporal fact has a timestamp that is a set of n-ary tuples of time points. A temporal module has a set of timestamped facts and has an associated temporal granularity (or temporal type), and a temporal database is a set of multidimensional temporal modules with possibly different temporal types. Temporal algebras are proposed on this database model. Example queries and results of the paper show that the algebras are rather expressive. The operations of the algebras are organized into two groups: snapshot-wise operations and timestamp operations. Snapshot-wise operations are extensions of the traditional relational algebra operations, while timestamp operations are extensions of first-order mappings from timestamps to timestamps. Multiple temporal types are only dealt with by these timestamp operations. Hierarchies of algebras are defined in terms of the dimensions of the temporal modules in the intermediate results. The symbol Talg^k_{m,n} is used to denote all the algebra queries whose input, output and intermediate modules are of dimensions at most m, n and k, respectively. (Most temporal algebras proposed in the literature are in Talg^1_{1,1}.) Equivalent hierarchies Tcalc^k_{m,n} are defined in a calculus query language that is formulated by using a first-order logic with linear order. The addition of aggregation functions into the algebras is also studied.
ISSE-TR-94-08
An Application of Entity Life Modeling to Incremental Re-engineering of Fortran Reactive Program Components
F. G. Patterson Jr. and Jean-Jacques Moortgat
This paper describes a case study in which the Entity Life Modeling (ELM) technique is used to reconstruct a single module from a legacy FORTRAN application. The architecture of the module changed as a result of entity identification and mode analysis. Concurrent aspects of the problem domain were identified and modeled as concurrent tasks. To interface with the greater part of the software system that was not reconstructed, an interface object was created. The interface object may be regarded as scaffolding that will be discarded when neighboring increments of the system are reengineered. The reengineered increment has a highly maintainable design that is characteristic of complete systems constructed using the ELM methodology.
ISSE-TR-94-09
Using Constraints to Detect Equivalent Mutants
Jeff Offutt and Jie Pan
Mutation testing is a software testing technique that is considered to be very powerful, but is very expensive to apply. One open problem in mutation testing is how to automatically detect equivalent mutant programs. Currently, equivalent mutants are detected by hand, which makes it a very expensive and time-consuming process, and restricts the use of mutation testing. This paper presents a technique that uses mathematical constraints to automatically detect equivalent mutants. A tool, Equivalencer, has been developed to demonstrate this technique, and experimental results from using this tool are presented.
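The constraint machinery itself is not in the abstract. As a loose illustration of the underlying idea only, the sketch below replaces constraint solving by brute force over a small sample domain: if no input can satisfy the condition that the original and mutated expressions differ, the mutant is flagged as a candidate equivalent mutant. All names are hypothetical, and a real implementation reasons over constraints rather than enumerating inputs.

    from itertools import product

    def looks_equivalent(original, mutant, domains):
        """If no input drawn from the sample domains makes the mutated expression
        differ from the original (i.e. the 'kill this mutant' condition is never
        satisfied), report the mutant as a candidate equivalent mutant."""
        for args in product(*domains):
            if original(*args) != mutant(*args):
                return False              # the condition is satisfiable: killable
        return True

    # e.g. mutating x * 2 into x + x yields an equivalent mutant:
    # looks_equivalent(lambda x: x * 2, lambda x: x + x, [range(-10, 11)])  -> True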
ISSE-TR-94-10
The Dynamic-domain Reduction Approach to Test Data Generation: Design and Algorithms
Jeff Offutt and Zhenyi Jin and Jie Pan
Although there are many techniques and tools available to support the software testing process, one of the most crucial parts of testing, generating the test data, is usually done by hand. Test data generation is one of the most technically challenging steps in testing software, but unfortunately, most practical systems do not incorporate any automation in this step. This paper presents a new method for automatically generating test data that incorporates ideas from symbolic evaluation, constraint-based testing, and dynamic test data generation. It takes an initial set of values for each input, and “pushes” the values through the control-flow graph of the program in a dynamic way, modifying the sets of values as branches in the program are taken. This method is an outgrowth of previous research in constraint-based testing, and combines several of the steps that were previously separate into one coherent process. The major difference is that this method is dynamic, and resolves path constraints immediately; it also includes an intelligent search technique, and improved handling of arrays, loops, and pointers.
ISSE-TR-94-11
Logical Design for Temporal Databases with Multiple Temporal Types
X. Sean Wang and Claudio Bettini and Alexander Brodsky and Sushil Jajodia
The purpose of good database logical design is to eliminate data redundancy and insertion and deletion anomalies. In order to achieve this objective for temporal databases, the notions of temporal types and temporal functional dependencies (TFDs) are introduced. A temporal type is a monotonic mapping from ticks of time (represented by positive integers) to time sets (represented by subsets of reals) and is used to capture various standard and user-defined calendars. A TFD is a proper extension of the traditional functional dependency and takes the form X-u->Y, meaning that there is a unique value for Y during one tick of the temporal type u for one particular X value. An axiomatization for TFDs is given. Since a finite set of TFDs usually implies an infinite number of TFDs, we introduce the notion of and give an axiomatization for a finite closure to effectively capture a finite set of implied TFDs that are essential to the logical design. Temporal normalization procedures with respect to TFDs are then given. More specifically, temporal Boyce-Codd normal form (TBCNF) that eliminates all data redundancies, and temporal third normal form (T3NF) that allows dependency preservation, are defined. Both normal forms are proper extensions of their traditional counterparts, BCNF and 3NF. Decomposition algorithms are presented that give lossless TBCNF decompositions and lossless, dependency preserving, T3NF decompositions.
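Although the report is about design theory (an axiomatization and normal forms) rather than checking, a TFD X -u-> Y is easy to test against concrete data, which helps make the definition concrete. The Python sketch below assumes a hypothetical layout in which each row is a dict with a numeric "time" field, and tick_of maps a timestamp to its tick of the granularity u.

    def tfd_violations(rows, x_attrs, y_attrs, tick_of):
        """Check a temporal functional dependency X -u-> Y: within a single tick
        of granularity u, equal X values must determine a unique Y value.
        Returns the (X value, tick) groups that violate the dependency."""
        seen, violations = {}, []
        for row in rows:
            key = (tuple(row[a] for a in x_attrs), tick_of(row["time"]))
            y = tuple(row[a] for a in y_attrs)
            if key in seen and seen[key] != y:
                violations.append(key)
            seen.setdefault(key, y)
        return violations

    # e.g. "salary is fixed within a month" over day-numbered timestamps:
    # tfd_violations(rows, ["employee"], ["salary"], lambda day: day // 30)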
ISSE-TR-94-12
The LyriC Language: Querying Constraint Objects
Alexander Brodsky and Yoram Kornatzky
Presented in this paper is a novel language for querying object-oriented databases where objects may hold spatial, temporal or constraint data, conceptually represented by equality and inequality constraints. The proposed LyriC language is motivated by application realms such as (1) constraint-based design in two-, three-, or higher-dimensional space, (2) large-scale optimization and analysis, based mostly on linear programming techniques, and (3) spatial and geographic databases. LyriC is an extension of flat constraint query languages, and especially those for linear constraint databases, to complex objects. The extension is based on the object-oriented paradigm, where a database is a collection of objects some of whose attributes might be either described explicitly or implicitly by constraints. Constraints are organized in classes with associated methods. The query language is an extension of the language XSQL, suggested by Kifer, Kim and Sagiv, and is built around the idea of extended path expressions. Path expressions in a query traverse nested structures in one sweep. Constraints are used in a query to filter stored constraints and to create new constraint objects.
ISSE-TR-94-13
Semantic Update Optimization in Active Databases
Jong Pil Yoon and Larry Kerschberg
In an active database, an update may be constrained by integrity constraints, and may also trigger rules that, in turn, may affect the database state. The general problem is to effect the update while also managing the “side-effects” of constraint enforcement and rule execution. In this paper an update calculus is proposed by which updates, constraints and rules are specified and managed within the same formalism. Constraints and production rules are expressed in a constraint language based on first-order logic. These logic expressions are used to semantically transform an original update into a sequence of updates that reflect the relevant constraints and production rules. The inference mechanism associated with processing a reformulated query ensures that: 1) the pre- and post-conditions of an update are satisfied, 2) update side-effects are propagated, and 3) repairs are made to tuples exhibiting constraint violations. Thus, a user-specified “update” is transformed, through semantic reformulation techniques, into a sequence of updates which together ensure semantic integrity of the original update as well as its propagated side-effects. This research presents several contributions. Integrity constraints and production rules are expressed in a declarative formalism so that they may be reasoned about. The update calculus formalism handles the semantic reformulation of an update to reflect relevant constraints and rules governing it. Finally, an algorithm is presented to handle constraint enforcement, production rule firing, and subsequent repair actions.
ISSE-TR-94-14
A Restrictive Definition of Concurrency for Discrete-event Modeling
Bo Sanden
Concurrency is an increasingly popular modeling concept in object-oriented analysis, software design, process modeling and other areas. Many approaches view their problem domains in terms of independent, concurrent processes that occasionally need to be synchronized. This is potentially an extremely concurrent model, in which “essential” processes that represent independent progress may be lost in the wealth of more trivial processes. This is particularly problematic in a software implementation, where each process incurs various overhead costs.
ISSE-TR-93-00
Functional and Test Specifications for the MiStix File System
Paul Ammann and Jeff Offutt
Mistix is a simple file system based on the Unix file system. Mistix is used in classroom exercises in graduate software engineering classes at George Mason University. In this document, we supply several different specifications for Mistix. First we give an informal specification. Next, since the informal specification turns out to be difficult to formalize directly, we supply a revised informal specification that matches subsequent formal specifications. We give a model-based specification for Mistix in Z, followed by functional test specifications based on the Category-Partition method. Finally, we give an algebraic specification for Mistix.
ISSE-TR-93-01
Empirical Comparisons of Data Flow and Mutation Testing
Jeff Offutt and Kanupriya Tewary
Data flow and mutation testing are two powerful white box testing techniques for unit-level software testing. Unfortunately, they cannot be completely compared on an analytical basis; for example, mutation is incomparable on an inclusion basis with most data flow criteria. This paper shows that mutation includes the All-defs data flow criterion, but is incomparable with other data flow criteria, and presents results from two empirical comparisons of mutation with the All-uses data flow criterion. For the first comparison, we define a coverage measure that is used to compare the two criteria. The coverage of data flow by mutation was between 99% and 100%, which means that test data that satisfied the mutation criterion also satisfied the data flow criterion in almost all cases. The coverage of mutation by data flow was between 80% and 90%. For the second comparison, we use test data that satisfy the two criteria to detect faults, and compare the criteria on the basis of the faults found. The empirical evidence strongly indicates that while data flow-adequate test sets are close to being mutation-adequate, mutation-adequate test sets are almost invariably data flow-adequate.
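For readers unfamiliar with mutation testing, the following toy sketch shows what it means for test data to kill a mutant; the program and mutation operator are illustrative and not from the study:

    # Original unit under test.
    def is_positive(x):
        return x > 0

    # Mutant produced by replacing the relational operator > with >=.
    def is_positive_mutant(x):
        return x >= 0

    # A test input "kills" the mutant when the two versions disagree on it.
    tests = [-1, 0, 7]
    print(any(is_positive(t) != is_positive_mutant(t) for t in tests))  # True: x = 0 kills it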
ISSE-TR-93-02
Planar Lattice Security Structures for Multi-level Replicated Databases
Paul Ammann and Sushil Jajodia
We review three algorithms for concurrency control in secure, multi-level replicated databases. We show by counterexample that the various algorithms, even if suitably modified to correct published problems, still do not guarantee serializability for either general partial orders or general lattices. The counterexamples suggest that the difficulty is a general feature of concurrency control algorithms tailored to the replicated architecture. We make two constructive observations. First, sufficient centralization permits algorithms applicable to general partial orders. Second, the reviewed algorithms are applicable to more restrictive security structures. Specifically, we observe that the intuitively appealing restriction to planar lattices is sufficient.
ISSE-TR-93-03
Recognition and Approximation of NTrees
Ravi Sandhu and Srinivas Ganta
The ntree hierarchy for protection groups was recently introduced in the literature and shown to have an efficient representation by assigning a pair of lr-values to each group so that U is a proper subgroup of V if and only if l[U]<l[V] and r[U]<r[V]. NTrees are incrementally constructed by refinement of an existing group into another ntree of groups, with forests of trees and inverted trees as the basic ntrees to start this process. It is shown that ntrees have a simpler characterization as the class of partial orders constructed by binary refinements, in which an existing group is refined into exactly two groups. An algorithm for recognizing ntrees is developed on the basis of this characterization. It is also shown how to assign leftmost lr-values, which play a central role in ensuring that refinement requires a strictly incremental change in the representation of the ntree. Two algorithms for approximating an arbitrary hierarchy by an ntree are also given.
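A minimal sketch of the lr-value test described above, with made-up groups and values (how lr-values are actually assigned to an ntree is the subject of the report and is not shown here):

    # Hypothetical lr-values for three groups.
    lr = {"staff": (1, 1), "faculty": (2, 3), "department": (4, 5)}

    def proper_subgroup(u, v):
        # U is a proper subgroup of V iff l[U] < l[V] and r[U] < r[V].
        (lu, ru), (lv, rv) = lr[u], lr[v]
        return lu < lv and ru < rv

    print(proper_subgroup("staff", "faculty"))      # True for these made-up values
    print(proper_subgroup("faculty", "staff"))      # False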
ISSE-TR-93-04
Globally Consistent Event Ordering In One-Directional Distributed Environments
Paul Ammann and Sushil Jajodia and Phyllis Frankl
We consider communication structures for event ordering algorithms in distributed environments where information flows only in one direction. Example applications are multilevel security and hierarchical databases. Although the most general one-directional communication structure is a partial order, partial orders do not enjoy the property of being consistently-ordered, a formalization of the notion that globally consistent event ordering is ensured. Our main result is that the crown-free property is necessary and sufficient for a communication structure to be consistently-ordered. We discuss the computational complexity of detecting crowns and sketch typical applications.
ISSE-TR-93-05
Using Formal Methods To Mechanize Category-Partition Testing
Paul Ammann and Jeff Offutt
We extend the category-partition method, a specification-based method for testing software. Previous work in category-partition has focused on developing structured test specifications that describe software tests. We offer guidance in making the important decisions involved in transforming test specifications to actual test cases. We present a structured approach to making those decisions, including a mechanical procedure for test derivation. With this procedure, we suggest a heuristic for choosing a minimal coverage of the categories identified in the test specifications, suggest parts of the process that can be automated, and offer a solution to the problem of identifying infeasible combinations of test case values. Our method uses formal schema-based functional specifications and is illustrated with an example of a simple file system. We conclude that our approach eases test case creation and leads to effective tests of software. We also note that by basing this procedure on formal specifications, we can identify anomalies in the functional specifications.
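A small sketch of the mechanical step from categories and choices to test frames, including filtering of infeasible combinations; the categories, choices, and feasibility rule are hypothetical and are not taken from the report's file-system example:

    import itertools

    # Hypothetical categories and choices for an "open file" operation.
    categories = {
        "file_state": ["exists", "missing"],
        "permission": ["readable", "denied"],
    }

    def feasible(frame):
        # Example infeasibility rule: a missing file has no permission setting.
        return not (frame["file_state"] == "missing" and frame["permission"] == "denied")

    frames = [dict(zip(categories, combo))
              for combo in itertools.product(*categories.values())]
    for f in filter(feasible, frames):
        print(f)                       # each surviving frame becomes one test case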
ISSE-TR-93-06
A Formal Framework for Integrating Inconsistent Answers from Multiple Information Sources
Ami Motro
In any environment of multiple information sources it is practically unavoidable that information sources will overlap; that is, describe the same portion of the real world. And when information sources overlap, it is practically unavoidable that information will be inconsistent; that is, describe differently the same portion of the real world. In this report we define a formal framework for resolving extensional inconsistencies that are detected in the course of processing queries against a collection of information sources. This framework establishes basic database concepts, such as schemes, instances, views, queries and integrity constraints, as well as new concepts, such as view equivalence, scheme mappings, multi-databases, multi-database queries and inconsistency. Altogether, it provides the formal foundations necessary for addressing issues of information integration and inconsistency resolution. Our approach to the integration of inconsistent answers is based on information goodness, a measure for estimating the proximity of stored information to the ideal information. We propose to maintain for each information source a goodness base: a set of basic views whose goodness will be useful for inferring the goodness of anticipated queries. When a query is answered inconsistently by individual information sources, we propose to (1) estimate the goodness of each answer from the appropriate goodness base, and (2) integrate the individual answers into an answer of the highest goodness. We sketch the architecture of a software agent, called Harmony, that will implement our approach.
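As a rough sketch of one simplistic reading of the final step, assuming goodness estimates are already available for each source's answer (the sources, answers, and numbers are invented, and real integration may combine parts of several answers rather than pick one):

    # Hypothetical answers to the same query from two overlapping sources.
    answers = [
        {"source": "db1", "answer": {"phone": "555-0100"}, "goodness": 0.7},
        {"source": "db2", "answer": {"phone": "555-0199"}, "goodness": 0.9},
    ]

    # Integrate by preferring the answer of highest estimated goodness.
    best = max(answers, key=lambda a: a["goodness"])
    print(best["source"], best["answer"])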
ISSE-TR-93-07
A Framework for Constraint Management in Object-Oriented Databases
Jong P. Yoon and Larry Kerschberg
We introduce the concept of a rule schema in this paper in order to support constraints in the active object-oriented paradigm. The rule schema provides the meta-knowledge associated with constraints, analogous to the way a database schema provides metadata about the database objects. It is used to compile the constraints into clauses which are then “attached” to appropriate object attributes. The compiled constraints incorporate the semantics associated with inheritance over generalization hierarchies and references to simple objects which are constituents of complex objects. These clauses are evaluated and enforced whenever an attribute value is retrieved or updated. The semantics of update propagation for object instances are embedded in the compiled constraints; constraints therefore play a role similar to methods in the model, but are specified explicitly rather than procedurally. They can therefore be managed by the active object-oriented database.
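The idea of enforcing a compiled constraint whenever an attribute is updated can be sketched in present-day terms as follows (a Python property standing in for the report's compiled clauses; the class and constraint are invented):

    class Employee:
        def __init__(self, salary):
            self.salary = salary           # goes through the setter below

        @property
        def salary(self):
            return self._salary            # a constraint could also be checked on retrieval

        @salary.setter
        def salary(self, value):
            if value < 0:                  # "attached" constraint, enforced on every update
                raise ValueError("salary must be non-negative")
            self._salary = value

    e = Employee(40000)
    e.salary = 45000                       # accepted
    # e.salary = -1                        # would raise: the attached constraint rejects it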
ISSE-TR-93-08
Semantic Query Optimization in Deductive Object-Oriented Databases
Jong P. Yoon and Larry Kerschberg
This paper addresses the problem of semantic query reformulation in the context of object-oriented deductive databases. It extends the declarative object-oriented specifications of F-logic proposed by Kifer and Lausen using the semantic query optimization technique developed by Chakravarthy, Grant, and Minker. In general, query processing in object-oriented databases is expensive when a query incorporates declarative rules, methods and inherited properties. We introduce the technique of semantic query reformulation for F-logic queries which transforms the original query into an equivalent, semantically-rich query that is more efficiently processed. We also discuss the issues of conflict resolution strategies and query evaluation priorities for queries involving the upper bounds of objects in the F-logic “type” lattice.
ISSE-TR-93-09
A Framework for Knowledge Discovery and Evolution in Databases
Jong P. Yoon and Larry Kerschberg
Although knowledge discovery is increasingly important in databases, discovered knowledge is not always useful to users, mainly because it does not fit the users' interests, or because it may be redundant or inconsistent with a priori knowledge. Knowledge discovery in databases depends critically on how well a database is characterized and how consistently the existing and discovered knowledge is evolved. This paper describes a novel concept for knowledge discovery and evolution in databases. The key issues of this work include: using a database query to discover new rules; using not only positive examples (answers to a query) but also negative examples to discover new rules; and harmonizing existing rules with the new rules. The main contribution of this paper is the development of a new tool for (1) characterizing the exceptions in databases and (2) evolving knowledge as a database evolves.
ISSE-TR-93-10
Designing Graphical User Interfaces with Entity-life Modeling
Douglas L. Smith
This paper explores using entity-life modeling (ELM) as an effective method for designing concurrent graphical user interfaces in a message processing environment. In ELM, multiple threads of control in the software are modeled on threads of events in the problem environment, and software objects are modeled on objects in the problem. The goal is to identify a minimum number of threads in the problem environment that drive the system while operating on objects. ELM states that threads are primarily those processes that must be rescheduled at particular points in time and those that vie for resources. The intent of this paper is to show that the ELM approach to problem analysis needs to be modified in order to produce optimal designs in a message passing environment. The paper suggests that some reschedulable processes do not need their own threads of control. Instead, the desired effect can be achieved by timers that insert a message into the message queue of a given process at the appropriate time. The paper also proposes a new category of background entities to account for time-consuming calculations that must be carried out concurrently with ordinary message processing.
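The timer idea, in which a reschedulable activity is handled by posting a message rather than by a dedicated thread, can be sketched as follows (names and delays are illustrative):

    import queue
    import threading

    messages = queue.Queue()

    def post_later(delay_s, msg):
        # A timer inserts a message into the ordinary message queue at the right time,
        # so the scheduled activity needs no thread of control of its own.
        threading.Timer(delay_s, messages.put, args=(msg,)).start()

    post_later(0.1, "refresh-display")     # scheduled work arrives as a message
    messages.put("button-clicked")         # ordinary GUI events use the same queue

    for _ in range(2):
        print("handling", messages.get())  # one loop serves both kinds of events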
ISSE-TR-92-00
Subjects and Objects: Structuring Concepts for Reactive Systems
Bo Sanden
This paper proposes subject-object modeling as a new design approach for reactive software. Subject-object modeling combines the strengths of the existing object and process paradigms of computation. Both paradigms stress the close relationship between the structure of the software and the structure of the problem. The object paradigm sees the world in terms of objects sending messages to each other, while the process paradigm sees it in terms of communicating sequential processes. With both paradigms, the structure of the software can be directly patterned on the structure of the problem environment, providing a seamless transition from analysis to design. The problem with the object paradigm is that it does not handle timing well. Some object-oriented authors play down the timing aspect and regard the operations on objects as “shopping lists” to be picked from in any order. But timing is essential in many systems and particularly in reactive software. The process paradigm captures time dependencies, but forces the user to regard even trivial objects in process terms, which is often cumbersome and counter-intuitive. Subject-object modeling enhances the object paradigm with a means to express time relationships beyond the scope of any one object, and enhances the process paradigm with object decomposition as a means for information hiding and for the handling of simple objects. The analyst's goal is a compelling description of the problem domain based on processes operating on objects, where the processes capture the timing and ordering of operations on objects. Such a problem description is useful in its own right. It can also be mapped directly onto a software design. In an Ada environment, processes map onto tasks and the objects become packages (or instances of abstract data types). In this paper, the subject-object approach is applied to a problem with free-floating buoys and an elevator control problem. The solution to the buoy problem is compared to an earlier-published object-based solution.
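A very small sketch of the subject-object split in present-day terms (Python threads standing in for Ada tasks, a class for a package; the buoy details are invented):

    import threading
    import time

    class Buoy:                             # "object": hides state behind operations
        def __init__(self):
            self._lock = threading.Lock()
            self._temperature = None
        def record(self, value):
            with self._lock:
                self._temperature = value
        def report(self):
            with self._lock:
                return self._temperature

    def sampler(buoy):                      # "subject": captures timing and ordering
        for reading in (17.0, 17.2, 17.1):
            buoy.record(reading)
            time.sleep(0.01)

    b = Buoy()
    t = threading.Thread(target=sampler, args=(b,))
    t.start(); t.join()
    print(b.report())                       # 17.1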
ISSE-TR-92-01
Artificial Realities, Virtual Communities, and Knowbots
Chris Dede
As the power of information technology grows and its cost diminishes, training is shifting into complex, computer-maintained worlds. The primary reason for this transition is effectiveness; richly detailed virtual environments leverage the most important variables for educational success. These types of training applications can enhance learners' motivation to spend time on task, provide collaborative experiences to foster peer teaching, tailor material to each student's needs and background, and promote the transferability of complicated knowledge and skills into real-world settings. This report describes progress in three aspects of virtual environments that draw on ideas from artificial intelligence: artificial realities, virtual communities, and knowbots.
ISSE-TR-92-02
Using Compiler Optimization Techniques to Detect Equivalent Mutants
A. Jefferson Offutt and W. Michael Craft
Mutation is a software testing technique that requires the tester to generate test data that will find specific, well-defined errors. Mutation testing executes many slightly differing versions, called mutants, of the same program to evaluate the quality of the data used to test the program. Although these mutants are generated and executed efficiently by automated methods, many of the mutants are functionally equivalent to the original program and are not useful for testing. Recognizing and eliminating equivalent mutants has traditionally been done by hand, a time-consuming and arduous task, which limits the practical usefulness of mutation testing. This paper presents extensions to previous work in detecting equivalent mutants; specifically, we present algorithms for determining several classes of equivalent mutants, and results from an implementation of these algorithms. These algorithms are based on data flow analysis and six compiler optimization techniques. We describe each of these techniques and how they are used to detect equivalent mutants. The design of the tool and some experimental results from using it are also presented. Finally, a new approach for detecting equivalent mutants that may be more powerful than the optimization techniques is introduced.
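A toy example of the kind of equivalence that compiler optimizations can expose (the program and mutation are illustrative, not from the paper's tool):

    # Original program and a mutant that replaces + with -.
    def f(a):
        x = 0
        return a + x

    def f_mutant(a):
        x = 0
        return a - x

    # Constant propagation folds x to 0 in both versions, reducing each body to
    # "return a"; the mutant can therefore be flagged as equivalent without testing.
    print(all(f(a) == f_mutant(a) for a in range(-5, 6)))   # True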