next up previous
Next: Modular Architecture   Up: CAPTURING REPRESENTING AND OPERATIONALISING   Previous: Survey



Semantic Intensity Spectrum


Semantic integration practitioners belong to, broadly speaking, two major communities: database (DB) and ontology. DB semantic integration is mainly associated with schema integration. It has a long line of research which goes back to more than twenty years when distributed databases caught the attention of practitioners and, at roughly the same time, the notion of Federated Database was introduced [14]. An integration mechanism takes as input local schemata and fuse them as either an integrated view over concrete local schemata (referred to as global-as-view) or a concrete global schema based on which local views are created (referred to as local-as-view). As research of integration evolves from an ad hoc effort by individuals to a joined effort by entire communities, a rather disappointing conclusion which seemed that was accepted by most researchers was that "a fully automatic approach to the schema integration problem is not possible." [14].
On the other end of semantic integration practice, distributed ontology development efforts became popular. Fueled by the need to (semantically) integrate the sheer numbers of publicly available ontologies, ontology practitioners face a similar problem: find mechanisms for ontology mapping. Artificial intelligence (AI) researchers argue that ontology mapping is a more complicated problem compared to DB schema matching due to the semantics and diversity inherited in various knowledge representation formalisms [13]. Such a difference is evident in the fact that many early schema matching systems and methods considered, for example, only names of attributes. However, as DB programming languages have been gradually enriched with more semantics, e.g., Enhanced Entity Relationships (EER), Object-Oriented DBs (OODB), these differences become blurred.
We are currently witnessing a shift of both communities towards a converging point [7]. This is evident from the matching and mapping techniques adopted by both communities, which have much in common and a tendency to take into account the underlying semantics.
We observe a common trend for DB and AI semantic integration practitioners: to progress from semantically-poor to semantically-rich solutions, so to speak. We therefore, used this metaphor of semantic richness to classify works from both communities along a semantic intensity spectrum. We marked several interim points in the spectrum to address string similarity, structure, context, extension and intension awareness as different layers of semantic intensity (see Figure 2.1).

Semantic Intensity Spectrum
Figure 2.1: Semantic Intensity Spectrum

String similarity, occupying the semantically-poor end of the spectrum, compares names of elements from different semantic models. A refinement of such techniques enhances the result by also taking into account the lengthy textual descriptions (a.k.a., comments) associated with concepts and properties. These techniques are based on the assumption that concepts and properties names representing semantic similarity will have similar syntactic features. A string matcher usually first normalises the input string of names and/or descriptions via stemming and tokenisation. In the simplest form, the equality of tokens will be obtained and combined to give a score of the equality for the whole string. In a slightly more complicated form, similarity of two strings is computed by evaluating their substrings, edit distance, etc. Nowadays, pure string similarity measures are seldom used in practice, but rather in combination with external resources, like user-defined lexica and/or dictionaries.
Linguistic Similarity, at a position very close to the semantically-poor end, is an example of string similarity measures blended with some sense of semantics. For instance, pronunciation and soundex are taken into account to enhance the similarity purely based on strings. Also, synonyms and hypernyms will be considered based on generic and/or domain-specific thesauri, e.g. WordNet, Dublin Core. In many cases, user-defined name matches are often treated as useful resources. For lengthy descriptions, Information Retrieval (IR) techniques can be applied to compare and score similarities.
As a basic group of matching techniques, linguistics usually are the initial step to suggest a set of raw mappings that other matchers can work with. Many systems invoke linguistic matchers at some stage: PROMPT [12] relies on a linguistic matcher to give initial suggestions of potential mappings which are then refined and updated in later stages; CUPID [9] employees linguistics at the first phase of its matching process when a thesaurus for short forms, acronyms and synonyms matches individual schema elements based on their names, data types, domains, etc.
Structure-aware, refers to approaches that take into account the structural layout of ontologies and schemata. Going beyond matching names (strings), structural similarity considers the entire underlying structure. That is, when comparing ontologies there is a hierarchical, partially ordered lattice where ontology classes are laid out. Similarly, DB schemata use a lattice of connections between tables and classes, not necessarily in a hierarchical fashion though.
In pure structural matching techniques, ontologies and schemata are transformed into trees with labelled nodes, thus matching is equivalent to matching vertices of the source graph with those of the targeted one. Similarity between two such graphs, G1 and G2 is computed by finding a subgraph of G2 that is isomorphic to G1 or vice versa. Although nodes of such graphs are labelled, their linguistic features rarely play a significant role in computing the similarity. Furthermore, labels of edges are normally ignored with the assumption that only one type of relation holds between connected nodes. For instance, suppose we have two fragments of e-Commerce schemata, one describing an arbitrary Transaction and the other one a PurchaseOrder (see Figure 2.2). Graph's isomorphism then gives us, among other possible mappings: {PO«PurchaseOrder,POShipTo«Address1,POBillTo«Address2, ¼}.
Structure Awareness
Figure 2.2: Structure Awareness

Analogous to pure string similarity methods, structure matching approaches, such as the one presented in [15] are not common in practice, but they are usually enhanced with other matching techniques. We deliberately use the notion of structure similarity in a broad sense in order to accommodate many relevant methods that relate to each other, and which could, and sometimes are used in such a combined fashion.
Typically, algorithms that do structure to structure comparison use the properties found in these structures (transitivity, cardinality, symmetry, etc) as well as their tree form similarity (for example, similar branches). Other algorithms use information at the nodes other than label, for example, attributes such as datatype, range and domain, etc., [11]. These are used as if they were labels (strings) with the range of methods discussed above available for comparing them.
Context-aware, in many cases there are a variety of relations among concepts or schema elements which makes it necessary to differentiate distinct types of connections among nodes. This gives rise to a family of matching techniques which are more semantically rich than structure similarity ones.
Both DB schema and ontology can be transferred into a labelled directed graph of which nodes could be elements and concepts, and edges, could be attributes and properties, respectively, with the names of attributes and properties as labels. A context, defined in graph jargon, is an arbitrary node together with nodes that are connected to it via particular types of edges which at the same time satisfy certain criteria, e.g., a threshold of the length of paths.
Sometimes, context-aware approaches group and weigh the edges from and to a node to impose a view of the domain of discourse from the end user perspective. Depending on whether importing external resources is allowed, there are two types of context-awareness.
In the simplest form, algorithms that compare nodes from two schemata also traverse downwards several layers along the direction of edges from the node under consideration, or upwards against the direction of edges to the node under consideration. All the visited nodes, together with the information about edges connecting them (taxonomic relationships like part-of, subclass-of, etc.) are evaluated as a whole to infer further mappings between nodes in the context. For instance, in Figure 2.3(a), the issue whether "Norway" in S1 corresponds to "Norway" in S2 is evaluated together with the information provided by their ancestors along the part-of relationship path. In this figure, these two nodes do not match, as "Norway" in S1 refers to a map of this country while "Norway" in S2 refers to the country itself.

Context Awareness
Figure 2.3: Context Awareness

Similarity flooding [10] is an example of a context-aware approach. An arbitrary schema Sn is first transformed into a directed labelled graph. The initial mappings between two schemata, S1 and S2, are obtained using certain mapping techniques, e.g., a simple string matcher comparing common prefixes and suffixes of literals, and captured to a Pairwise Connectivity Graph (PCG). Nodes of a PCG are elements from S1×S2, denoted as NS1 ×S2. An edge labelled a: (m×k) ® (n×l) (m, n Î S1 and k, l Î S2) of a PCG means that an a edge is present in the original schemata between m and n as well as k and l, i.e. a: m ® n and a: k ®l.
From a PCG, a similarity propagation graph is induced which assigns to each edge in the PCG a propagation coefficient to indicate the influence between nodes of the PCG. In other words, the weighted edges indicate how well the similarity of a given PCG node propagates to its neighbour. The accumulation of similarity is performed until a pre-set threshold is reached or terminated by the user after some maximal number of iterations. A series of filter methods are then adopted to reduce the size of the resultant mapping candidates and select the most plausible ones.
Following the same philosophy-similarity propagation, Palopoli and colleagues [] integrates multiple ER schemata by using the following principle: similarity of schema elements depends on the similarity of elements in their vicinity (nearby elements influence match more than those farther away). ER schemata are first transformed into graphs with entities, relationships, and attributes as nodes. The similarity coefficient is initialised by standard thesauruses and re-evaluated based on the similarity of nodes in their corresponding vicinities.
With the use of namespaces, along comes another type of context awareness. As illustrated in Figure 2.3(b), "United Kingdom" belongs to both "World Countries Ontology" and "UK Ontology". Articulating these two ontologies summons the resolution of different namespaces that might involve string matchers in certain forms. An example of dealing with co-reference resolution of such namespaces is given in [1].
Extension-aware, when a relatively complete set of instances can be obtained, semantics of a schema or ontology can be reflected through the way that instances are classified. A major assumption made by techniques belonging to this family is that instances with similar semantics might share features [8], therefore, an understanding of such common features can contribute to an approximate understanding of the semantics.
Formal Concept Analysis (FCA) [4] is a representative of instance-aware approaches. FCA is a field of mathematics emerged in the nineties that builds upon lattice theory and the work of Ganter and Wille on the mathematisation of concept in the eighties. It is mostly suited for analysing instances and properties of entities (concepts) in a domain of interest. FCA consists of formal contexts and concept lattices. A formal context is a triple K=(\mcO,\mcP,\mcS), where is a set of objects, is a set of attributes (or properties), and \mcS Í \mcO ×\mcP is a relation that connects each object o with the attributes satisfied by o.
The intent (set of attributes belonging to an object) and the extent (set of objects having these attributes) are given formal definitions in [4]. A formal concept is a pair áA, Bñ consisting of an extent A Í O and an intent B Í P, and these concepts are hierarchically ordered by inclusion of their extents. This partial order induces a complete lattice, the concept lattice of the context. FCA can be applied to semi-structured domains to assist in modelling with instances and properties in hierarchical, partially ordered lattices. This is the main structure most the mapping systems work with. Thus, FCA albeit not directly related to mapping, it is a versatile technology which could be used at the early stages of mapping for structuring a loosely defined domain.
Intension-aware refers to the family of techniques that establish correlations between relations among extent and intent. Such approaches are particularly useful when it is impossible or impractical to obtain a complete set of instances to reflect the semantics.
Barwise and Seligman [2] propose a mathematical theory, Information Flow, that aims at establishing the laws that govern the flow of information. It is a general theory that attempts to describe information flow in any kind of a distributed system. It is based on the understanding that information flow results from regularities in a distributed system, and that it is by virtue of regularities among the connections that information of some components of a system carries information of other components. As a notion of a component carrying information about another component, Barwise and Seligman follow the analogy of types and tokens where tokens and its connections carry information. These are classified against types and the theory of information flow aims to capture this aspect of information flow which involves both types and tokens.
When integration is our major concern, the same pattern arises: two communities with different ontologies (or schemata) will be able to share information when they are capable of establishing connections among their tokens in order to infer the relationship among their types. Kalfoglou and Schorlemmer [6] argued for the relation of information flow to a distributed system like the (Semantic) Web, where the regularities of information flowing between its parts can be captured and used to do mapping. The mathematical background of information flow theory ensures that the corresponding types (concepts) respect token (instance) membership to each of the mapped types. Their approach is community-oriented, in the sense that communities on the (Semantic) Web own and control their data (instances) and they use them (i.e., classify them) against ontologies for the purpose of knowledge sharing and reuse. It is precisely this information of classifying your own instances against ontologies that is used as evidence for computing the mapping relation between communities' heterogeneous ontologies. It is evident that information flow goes beyond extension-awareness towards the tick marked by intension-aware.
Semantic Similarity, very close to the semantically-rich end lays the family of logic satisfiability approaches which focus on the logic correspondences. Logic constructors play a significant role in expressive formalisms, such as DLs, implying that the discovery of similarity is more like finding logic consequence. The idea behind techniques in this category is to reduce the matching problem to one that can be solved by resorting to logic satisfiability techniques. Concepts in a hierarchical structure are transformed into well-formed logic formulae (wffs). To compute the relationships between two set of wffs amounts to examine whether (y, wffs1, wffs2) is satisfiable. y is the set of relationships normally containing not only equivalence but also "more general than" denoted as Ê , "less general than" denoted as Í , "disjoint with" denoted as Å, etc.
The major difference among these approaches is on how the wffs are computed with respect to each concept (and/or label of concept). Bouquet and colleagues [3] introduce an algorithm with the notions of label interpretation and contextualization, called CtxMatch. Each concept in a concept hierarchy is associated with a formula based on the WordNet senses of each word in the label of the concept. The senses associated with each label are refined according to the information provided by its ancestors and direct descendants. Matching of two concepts, C1 and C2, is then transformed into checking the satisfiability of a formula composed by contextualised senses associated with their labels and the known WordNet relations among senses expressed in logic formulae, e.g. art#1 Í WordNet humanities#1 denotes that, according to WordNet, the first sense of the word "art" is less general than the first sense of the word "humanities" where "art" and "humanities" are words from the labels of C1 and C2 respectively.
S-Match [5] goes one step further by distinguishing two different notions of concept, namely the concept of label and the concept of node. Concept of a label is context insensitive concerning only the WordNet senses of the labels of a concept. On the other hand, concept of a node is context-sensitive, its logic formula is computed as the "intersection of the concepts at labels of all the nodes from the root to the node itself." [5]. The concept of label matrix is constructed containing the relations exist between any two concepts of labels in the two hierarchies of which the matching is to be obtained. Based on such a matrix the concept of node matrix is calculated.

Bibliography

[1]
H. Alani, S. Dasmahapatra, N. Gibbins, H. Glasser, S. Harris, Y. Kalfoglou, K. O'Hara, and N. Shadbolt. Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW'02), Siguenza, Spain, pages 317-334, Oct. 2002.
[2]
J. Barwise and J. Seligman. Information Flow: the Logic of Distributed Systems. Cambridge Tracts in Theoretical Computer Science 44. Cambridge University Press, 1997. ISBN: 0-521-58386-1.
[3]
P. Bouquet, B. Magnini, L. Scrafini, and S. Zanobini. A SAT-based Algorithm for Context Matching. In Proceedings of the 4th International and Interdisciplinary Conference on Modeling and Using Context (Context03), 2003.
[4]
B. Ganter and R. Wille. Formal Concept Analysis: mathematical foundations. Springer, 1999. ISBN: 3-540-62771-5.
[5]
F. Guinchiglia, P. Shvaiko, and M. Yatskevich. S-Match: an Algorithm and an Implementation of Semantic Matching. In Proceedings of 1st European Semantic Web Symposium (ESWS'04), Crete, Greece, page 61.
[6]
Y. Kalfoglou and M. Schorlemmer. IF-MAP: an Ontology Mapping method based on Information Flow theory. Journal on Data Semantics, 1:98-127, Oct. 2003. LNCS2800, Springer, ISBN: 3-540-20407-5.
[7]
Y. Kalfoglou, M. Schorlemmer, M. Uschold, A. Sheth, and S. Staab. Semantic Interoperability and Integration. Seminar 04391 - executive summary, Schloss Dagstuhl - International Conference and Research Centre, Sept. 2004.
[8]
J. Madhavan, P. Bernstein, C. Kuang, A. Halevy, and P. Shenoy. Corpus-based Schema Matching. In Proceedings of the IJCAI'03 Workshop on Information Integration on the Web (IIWeb-03), Acapulco, Mexico, Aug. 2003.
[9]
J. Madhavan, P. Bernstein, and E. Rahm. Generic Schema Matching with CUPID. In Proceedings of the 27th International Conference on Very Large Databases (VLDB'01), Roma, Italy, Sept. 2001.
[10]
S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching algorithm and its application to Schema Matching. In Porceedings of the 18th International Conference on Data Engineering (ICDE), pages 117-128, 2002.
[11]
T. Milo and S. Zohar. Using Schema Matching to simplify heterogeneous data translation. In Proceedings of the 24rd International Conference on Very Large Data Bases (VLDB'98), New York, NY, USA, USA, pages 122-133, Aug. 1998.
[12]
F. Noy and M. Musen. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proceedings of the 17th National Conference on Artificial Intelligence, (AAAI'00), Austin, TX, USA, July 2000.
[13]
N. Noy and M. Klein. Ontology Evolution: Not the same as Schema Evolution. Knowledge and Information Systems, 6(4):428-440, 2004.
[14]
A. Sheth and J. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183-230, Sept. 1990.
[15]
J. Wang, K. Zhang, K. Jeong, and D. Shasha. A system for approximate tree matching. IEEE Transactions on Knowledge and Data Engineering, 6(4):559-571, 1994.



This material was prepared under the CROSI project. Copyright remains with the authors. Parts or the whole of this text have been published in conferences, workshops and other knowledge disseminating events.
CROSI presents this information online merely for sake of information dissemination.
This material should not be copy-pasted without acknowledging its origins.
Please contact the authors for information on how to use or reference this material.