Next: Modular Architecture
  Up: CAPTURING REPRESENTING AND OPERATIONALISING
  Previous: Survey
Semantic Intensity Spectrum
Semantic integration practitioners belong to, broadly speaking, two
major communities: database (DB) and ontology. DB semantic
integration is mainly associated with schema integration. It has a
long line of research which goes back to more than twenty years when
distributed databases caught the attention of practitioners and, at
roughly the same time, the notion of Federated Database was
introduced [14]. An integration mechanism takes as input
local schemata and fuse them as either an integrated view over
concrete local schemata (referred to as global-as-view) or a
concrete global schema based on which local views are created
(referred to as local-as-view). As research of integration
evolves from an ad hoc effort by individuals to a joined
effort by entire communities, a rather disappointing conclusion
which seemed that was accepted by most researchers was that "a
fully automatic approach to the schema integration problem is not
possible." [14].
On the other end of semantic integration practice, distributed
ontology development efforts became popular. Fueled by the need to
(semantically) integrate the sheer numbers of publicly available
ontologies, ontology practitioners face a similar problem: find
mechanisms for ontology mapping. Artificial intelligence (AI)
researchers argue that ontology mapping is a more complicated
problem compared to DB schema matching due to the semantics and
diversity inherited in various knowledge representation formalisms
[13]. Such a difference is evident in the fact that many
early schema matching systems and methods considered, for example,
only names of attributes. However, as DB programming languages have
been gradually enriched with more semantics, e.g., Enhanced Entity
Relationships (EER), Object-Oriented DBs (OODB), these differences
become blurred.
We are currently witnessing a shift of both communities towards a
converging point [7]. This is evident from the
matching and mapping techniques adopted by both communities, which
have much in common and a tendency to take into account the
underlying semantics.
We observe a common trend for DB and AI semantic integration
practitioners: to progress from semantically-poor to
semantically-rich solutions, so to speak. We therefore, used this
metaphor of semantic richness to classify works from both
communities along a semantic intensity spectrum. We marked
several interim points in the spectrum to address string similarity,
structure, context, extension and intension awareness as different
layers of semantic intensity (see Figure 2.1).

Figure 2.1: Semantic Intensity Spectrum
String similarity, occupying the semantically-poor end of
the spectrum, compares names of elements from different semantic
models. A refinement of such techniques enhances the result by also
taking into account the lengthy textual descriptions (a.k.a.,
comments) associated with concepts and properties. These techniques
are based on the assumption that concepts and properties names
representing semantic similarity will have similar syntactic
features. A string matcher usually first normalises the input string
of names and/or descriptions via stemming and tokenisation. In the
simplest form, the equality of tokens will be obtained and combined
to give a score of the equality for the whole string. In a slightly
more complicated form, similarity of two strings is computed by
evaluating their substrings, edit distance, etc. Nowadays, pure
string similarity measures are seldom used in practice, but rather
in combination with external resources, like user-defined lexica
and/or dictionaries.
Linguistic Similarity, at a position very close to the
semantically-poor end, is an example of string similarity measures
blended with some sense of semantics. For instance, pronunciation
and soundex are taken into account to enhance the similarity purely
based on strings. Also, synonyms and hypernyms will be considered
based on generic and/or domain-specific thesauri, e.g. WordNet,
Dublin Core. In many cases, user-defined name matches are often
treated as useful resources. For lengthy descriptions, Information
Retrieval (IR) techniques can be applied to compare and score
similarities.
As a basic group of matching techniques, linguistics usually are the
initial step to suggest a set of raw mappings that other matchers
can work with. Many systems invoke linguistic matchers at some
stage: PROMPT [12] relies on a linguistic matcher to give
initial suggestions of potential mappings which are then refined and
updated in later stages; CUPID [9] employees
linguistics at the first phase of its matching process when a
thesaurus for short forms, acronyms and synonyms matches individual
schema elements based on their names, data types, domains, etc.
Structure-aware, refers to approaches that take into
account the structural layout of ontologies and schemata. Going
beyond matching names (strings), structural similarity considers the
entire underlying structure. That is, when comparing ontologies
there is a hierarchical, partially ordered lattice where ontology
classes are laid out. Similarly, DB schemata use a lattice of
connections between tables and classes, not necessarily in a
hierarchical fashion though.
In pure structural matching techniques, ontologies and schemata are
transformed into trees with labelled nodes, thus matching is
equivalent to matching vertices of the source graph with those of
the targeted one. Similarity between two such graphs, G1 and
G2 is computed by finding a subgraph of G2 that is
isomorphic to G1 or vice versa. Although nodes of such
graphs are labelled, their linguistic features rarely play a
significant role in computing the similarity. Furthermore, labels of
edges are normally ignored with the assumption that only one type of
relation holds between connected nodes. For instance, suppose we
have two fragments of e-Commerce schemata, one describing an
arbitrary Transaction and the other one a
PurchaseOrder (see Figure 2.2). Graph's
isomorphism then gives us, among other possible mappings:
{PO«PurchaseOrder,POShipTo«Address1,POBillTo«Address2, ¼}.

Figure 2.2: Structure Awareness
Analogous to pure string similarity methods, structure
matching approaches, such as the one presented in [15] are
not common in practice, but they are usually enhanced with other
matching techniques. We deliberately use the notion of
structure similarity in a broad sense in order to accommodate
many relevant methods that relate to each other, and which could,
and sometimes are used in such a combined fashion.
Typically, algorithms that do structure to structure comparison use
the properties found in these structures (transitivity, cardinality,
symmetry, etc) as well as their tree form similarity (for example,
similar branches). Other algorithms use information at the nodes
other than label, for example, attributes such as datatype, range
and domain, etc., [11]. These are used as if they were
labels (strings) with the range of methods discussed above available
for comparing them.
Context-aware, in many cases there are a variety of
relations among concepts or schema elements which makes it necessary
to differentiate distinct types of connections among nodes. This
gives rise to a family of matching techniques which are more
semantically rich than structure similarity ones.
Both DB schema and ontology can be transferred into a labelled
directed graph of which nodes could be elements and concepts, and
edges, could be attributes and properties, respectively, with the
names of attributes and properties as labels. A context,
defined in graph jargon, is an arbitrary node together with nodes
that are connected to it via particular types of edges which at the
same time satisfy certain criteria, e.g., a threshold of the length
of paths.
Sometimes, context-aware approaches group and weigh the edges
from and to a node to impose a view of the domain of discourse from
the end user perspective. Depending on whether importing external
resources is allowed, there are two types of context-awareness.
In the simplest form, algorithms that compare nodes from two
schemata also traverse downwards several layers along the direction
of edges from the node under consideration, or upwards against the
direction of edges to the node under consideration. All the visited
nodes, together with the information about edges connecting them
(taxonomic relationships like part-of,
subclass-of, etc.) are evaluated as a whole to infer
further mappings between nodes in the context. For instance, in
Figure 2.3(a), the issue whether "Norway" in S1
corresponds to "Norway" in S2 is evaluated together with the
information provided by their ancestors along the part-of
relationship path. In this figure, these two nodes do not match, as
"Norway" in S1 refers to a map of this country while "Norway"
in S2 refers to the country itself.

Figure 2.3: Context Awareness
Similarity flooding [10] is an example of a
context-aware approach. An arbitrary schema Sn is first
transformed into a directed labelled graph. The initial mappings
between two schemata, S1 and S2, are obtained using certain
mapping techniques, e.g., a simple string matcher comparing common
prefixes and suffixes of literals, and captured to a Pairwise
Connectivity Graph (PCG). Nodes of a PCG are elements from S1×S2, denoted as NS1 ×S2. An edge
labelled a: (m×k) ® (n×l) (m, n Î S1 and k, l Î S2) of a PCG means that an a edge is
present in the original schemata between m and n as well as k
and l, i.e. a: m ® n and a: k ®l.
From a PCG, a similarity propagation graph is induced which assigns
to each edge in the PCG a propagation coefficient to indicate the
influence between nodes of the PCG. In other words, the weighted
edges indicate how well the similarity of a given PCG node
propagates to its neighbour. The accumulation of similarity is
performed until a pre-set threshold is reached or terminated by the
user after some maximal number of iterations. A series of filter
methods are then adopted to reduce the size of the resultant mapping
candidates and select the most plausible ones.
Following the same philosophy-similarity propagation, Palopoli and
colleagues [] integrates multiple ER schemata by
using the following principle: similarity of schema elements depends
on the similarity of elements in their vicinity (nearby elements
influence match more than those farther away). ER schemata are
first transformed into graphs with entities, relationships, and
attributes as nodes. The similarity coefficient is initialised by
standard thesauruses and re-evaluated based on the similarity of
nodes in their corresponding vicinities.
With the use of namespaces, along comes another type of
context awareness. As illustrated in Figure
2.3(b), "United Kingdom" belongs to both "World
Countries Ontology" and "UK Ontology". Articulating these two
ontologies summons the resolution of different namespaces that might
involve string matchers in certain forms. An example of dealing
with co-reference resolution of such namespaces is given in
[1].
Extension-aware, when a relatively complete set of
instances can be obtained, semantics of a schema or ontology can be
reflected through the way that instances are classified. A major
assumption made by techniques belonging to this family is that
instances with similar semantics might share features
[8], therefore, an understanding of such common
features can contribute to an approximate understanding of the
semantics.
Formal Concept Analysis (FCA) [4] is a
representative of instance-aware approaches. FCA is a field of
mathematics emerged in the nineties that builds upon lattice theory
and the work of Ganter and Wille on the mathematisation of concept
in the eighties. It is mostly suited for analysing instances and
properties of entities (concepts) in a domain of interest. FCA
consists of formal contexts and concept lattices. A formal context
is a triple K=(\mcO,\mcP,\mcS), where is a set of
objects, is a set of attributes (or properties), and
\mcS Í \mcO ×\mcP is a relation that connects
each object o with the attributes satisfied by o.
The intent (set of attributes belonging to an object) and the extent
(set of objects having these attributes) are given formal
definitions in [4]. A formal concept is a pair áA, Bñ consisting of an extent A Í O and an intent
B Í P, and these concepts are hierarchically ordered by
inclusion of their extents. This partial order induces a complete
lattice, the concept lattice of the context. FCA can be applied to
semi-structured domains to assist in modelling with instances and
properties in hierarchical, partially ordered lattices. This is the
main structure most the mapping systems work with. Thus, FCA albeit
not directly related to mapping, it is a versatile technology which
could be used at the early stages of mapping for structuring a
loosely defined domain.
Intension-aware refers to the family of techniques that
establish correlations between relations among extent and intent.
Such approaches are particularly useful when it is impossible or
impractical to obtain a complete set of instances to reflect the
semantics.
Barwise and Seligman [2] propose a mathematical theory,
Information Flow, that aims at establishing the laws that
govern the flow of information. It is a general theory that attempts
to describe information flow in any kind of a distributed system. It
is based on the understanding that information flow results from
regularities in a distributed system, and that it is by virtue of
regularities among the connections that information of some
components of a system carries information of other components. As a
notion of a component carrying information about another component,
Barwise and Seligman follow the analogy of types and tokens where
tokens and its connections carry information. These are classified
against types and the theory of information flow aims to capture
this aspect of information flow which involves both types and
tokens.
When integration is our major concern, the same pattern arises: two
communities with different ontologies (or schemata) will be able to
share information when they are capable of establishing connections
among their tokens in order to infer the relationship among their
types. Kalfoglou and Schorlemmer [6] argued for the
relation of information flow to a distributed system like the
(Semantic) Web, where the regularities of information flowing
between its parts can be captured and used to do mapping. The
mathematical background of information flow theory ensures that the
corresponding types (concepts) respect token (instance) membership
to each of the mapped types. Their approach is community-oriented,
in the sense that communities on the (Semantic) Web own and control
their data (instances) and they use them (i.e., classify them)
against ontologies for the purpose of knowledge sharing and reuse.
It is precisely this information of classifying your own instances
against ontologies that is used as evidence for computing the
mapping relation between communities' heterogeneous ontologies. It
is evident that information flow goes beyond
extension-awareness towards the tick marked by
intension-aware.
Semantic Similarity, very close to the semantically-rich
end lays the family of logic satisfiability approaches which
focus on the logic correspondences. Logic constructors play a
significant role in expressive formalisms, such as DLs, implying
that the discovery of similarity is more like finding logic
consequence. The idea behind techniques in this category is to
reduce the matching problem to one that can be solved by resorting
to logic satisfiability techniques. Concepts in a hierarchical
structure are transformed into well-formed logic formulae (wffs). To
compute the relationships between two set of wffs amounts to examine
whether (y, wffs1, wffs2) is satisfiable.
y is the set of relationships normally containing not only
equivalence but also "more general than" denoted as Ê ,
"less general than" denoted as Í , "disjoint with"
denoted as Å, etc.
The major difference among these approaches is on how the wffs are
computed with respect to each concept (and/or label of concept).
Bouquet and colleagues [3] introduce an algorithm with
the notions of label interpretation and
contextualization, called CtxMatch. Each concept in a
concept hierarchy is associated with a formula based on the WordNet
senses of each word in the label of the concept. The senses
associated with each label are refined according to the information
provided by its ancestors and direct descendants. Matching of two
concepts, C1 and C2, is then transformed into checking the
satisfiability of a formula composed by contextualised senses
associated with their labels and the known WordNet relations among
senses expressed in logic formulae, e.g. art#1 Í WordNet humanities#1 denotes that,
according to WordNet, the first sense of the word "art" is less
general than the first sense of the word "humanities" where
"art" and "humanities" are words from the labels of C1 and
C2 respectively.
S-Match [5] goes one step further by distinguishing
two different notions of concept, namely the concept of label
and the concept of node. Concept of a label is context
insensitive concerning only the WordNet senses of the labels of a
concept. On the other hand, concept of a node is context-sensitive,
its logic formula is computed as the "intersection of the concepts
at labels of all the nodes from the root to the node itself."
[5]. The concept of label matrix is constructed
containing the relations exist between any two concepts of labels in
the two hierarchies of which the matching is to be obtained. Based
on such a matrix the concept of node matrix is calculated.
Bibliography
- [1]
-
H. Alani, S. Dasmahapatra, N. Gibbins, H. Glasser, S. Harris,
Y. Kalfoglou,
K. O'Hara, and N. Shadbolt.
Managing Reference: Ensuring Referential Integrity of Ontologies for
the Semantic Web.
In Proceedings of the 13th International Conference on Knowledge
Engineering and Knowledge Management (EKAW'02), Siguenza, Spain, pages
317-334, Oct. 2002.
- [2]
-
J. Barwise and J. Seligman.
Information Flow: the Logic of Distributed Systems.
Cambridge Tracts in Theoretical Computer Science 44. Cambridge
University Press, 1997.
ISBN: 0-521-58386-1.
- [3]
-
P. Bouquet, B. Magnini, L. Scrafini, and S. Zanobini.
A SAT-based Algorithm for Context Matching.
In Proceedings of the 4th International and Interdisciplinary
Conference on Modeling and Using Context (Context03), 2003.
- [4]
-
B. Ganter and R. Wille.
Formal Concept Analysis: mathematical foundations.
Springer, 1999.
ISBN: 3-540-62771-5.
- [5]
-
F. Guinchiglia, P. Shvaiko, and M. Yatskevich.
S-Match: an Algorithm and an Implementation of Semantic Matching.
In Proceedings of 1st European Semantic Web Symposium (ESWS'04),
Crete, Greece, page 61.
- [6]
-
Y. Kalfoglou and M. Schorlemmer.
IF-MAP: an Ontology Mapping method based on Information Flow theory.
Journal on Data Semantics, 1:98-127, Oct. 2003.
LNCS2800, Springer, ISBN: 3-540-20407-5.
- [7]
-
Y. Kalfoglou, M. Schorlemmer, M. Uschold, A. Sheth, and S. Staab.
Semantic Interoperability and Integration.
Seminar 04391 - executive summary, Schloss Dagstuhl - International
Conference and Research Centre, Sept. 2004.
- [8]
-
J. Madhavan, P. Bernstein, C. Kuang, A. Halevy, and P. Shenoy.
Corpus-based Schema Matching.
In Proceedings of the IJCAI'03 Workshop on Information
Integration on the Web (IIWeb-03), Acapulco, Mexico, Aug. 2003.
- [9]
-
J. Madhavan, P. Bernstein, and E. Rahm.
Generic Schema Matching with CUPID.
In Proceedings of the 27th International Conference on Very
Large Databases (VLDB'01), Roma, Italy, Sept. 2001.
- [10]
-
S. Melnik, H. Garcia-Molina, and E. Rahm.
Similarity Flooding: A Versatile Graph Matching algorithm and its
application to Schema Matching.
In Porceedings of the 18th International Conference on Data
Engineering (ICDE), pages 117-128, 2002.
- [11]
-
T. Milo and S. Zohar.
Using Schema Matching to simplify heterogeneous data translation.
In Proceedings of the 24rd International Conference on Very
Large Data Bases (VLDB'98), New York, NY, USA, USA, pages 122-133, Aug.
1998.
- [12]
-
F. Noy and M. Musen.
PROMPT: Algorithm and Tool for Automated Ontology Merging and
Alignment.
In Proceedings of the 17th National Conference on Artificial
Intelligence, (AAAI'00), Austin, TX, USA, July 2000.
- [13]
-
N. Noy and M. Klein.
Ontology Evolution: Not the same as Schema Evolution.
Knowledge and Information Systems, 6(4):428-440, 2004.
- [14]
-
A. Sheth and J. Larson.
Federated database systems for managing distributed, heterogeneous,
and autonomous databases.
ACM Computing Surveys, 22(3):183-230, Sept. 1990.
- [15]
-
J. Wang, K. Zhang, K. Jeong, and D. Shasha.
A system for approximate tree matching.
IEEE Transactions on Knowledge and Data Engineering,
6(4):559-571, 1994.
This material was prepared under the CROSI project. Copyright remains with the authors. Parts or the whole of this text have been published in conferences, workshops and other knowledge disseminating events.
CROSI presents this information online merely for sake of information dissemination.
This material should not be copy-pasted without acknowledging its origins.
Please contact the authors for information on how to use or reference this material.