CROSI (Ontology) Mapping System (CMS)
CMS (CROSI Mapping System) is a structure-matching system that capitalizes on the rich semantics of
the OWL constructs found in source ontologies and on a modular
architecture that allows the system to consult external linguistic
resources. It operationalises the modular architecture we developed in this project and employs a multi-strategy approach comprising
four modules, namely:
- Feature Generation
- Feature Selection and Processing
Ontology features used for mapping
In CMS, different features of the input data are generated and
selected to fire off different sorts of feature matchers. Hence, the
first step when deploying CMS was to extract characteristics that
can be used to identify similar entities from different ontologies.
We summarize the characteristics we extracted in Table 4.1.
Table 4.1: Features extracted for ontology mapping.
Two points need further explanation. First, in many cases,
identifying corresponding instances is considered an easier task
than identifying corresponding classes, because instances are
expected to have more grounded variables. Corresponding instances
therefore provide a basis on which the number of candidate mapping
classes can be narrowed down to a few (as we discovered in our past
work with the IF-Map instance-based system). Second, in the case of
complement classes, let $c_s$ be a class from the source ontology
and $c_t$ a class from the target ontology. If
$\mathit{sim}(c_s, c_t) = a$ and $d = \neg c_t$, we can safely
conclude that $\mathit{sim}(d, c_s) = 1 - a$, where
$\mathit{sim}$/2 is the similarity function and $a$, a real number,
gives the confidence value.
The resultant similarity values are then compiled by multiple
similarity aggregators running in parallel or in consecutive order.
The overall similarity is then evaluated to initiate iterations that
backtrack to different stages. We include a screenshot of the
Web-based interface of CMS in Figure 1.
Figure 1: The Web-based Interface of CMS.
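The aggregation step described above can be illustrated with a minimal sketch. The text does not say how the CMS aggregators combine values, so the weighted mean used here, the function name `aggregate`, and the matcher names in the usage note are all assumptions made for illustration only.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-matcher similarity scores for one entity pair.

    Assumption: a weighted mean is only one possible aggregation strategy;
    matchers without an explicit weight default to weight 1.0.
    """
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights.get(name, 1.0) * s for name, s in scores.items())
    return weighted_sum / total_weight
```

For example, `aggregate({"name": 0.8, "structure": 0.4}, {"name": 2.0, "structure": 1.0})` weights the name-based score twice as heavily as the structural one and yields (1.6 + 0.4) / 3, roughly 0.67.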
CMS specific techniques
To fit the requirements of different applications, CMS implements a
series of mapping techniques, which are regarded as independent
components that make up CMS.
Ranging from purely syntactic approaches to more semantically
enriched ones, name matchers are categorised as: string (tokenised)
distance, thesaurus, and WordNet hierarchical distance. Levenshtein
distance is the simplest implementation of string distance. More
sophisticated ones are the Monge-Elkan distance, which optimizes
edit-distance functions with well-tuned editing costs, and the
Jaro metric and its variants, which compute an accumulated
similarity of two strings $s$ and $t$ from the order and number of
characters they have in common. In CMS a thesaurus comes into play
in two forms: WordNet and a predefined corpus, implemented
as WNNameMatcher and CorpusNameMatcher,
respectively. To facilitate the use of WordNet, we assume that local
names of classes are either nouns or noun phrases, while local names
of properties are phrases starting with verbs followed by either
nouns or adjectives. Elements in the retrieved synsets are then
compared against each other using either exact string matching or
one of the string-distance-based algorithms discussed above.
WordNet arranges its entries in hierarchical
structures. Hence, the similarity between names can be computed as
follows: let $w_i$ and $w_j$ be the WordNet entries corresponding to
$name_i$ and $name_j$, $w$ be the least common hypernym of $w_i$ and
$w_j$, $r$ be the root of the underlying WordNet hierarchy, and
$h_i$, $h_j$, $h$ be the distances between $w_i$ and $r$, $w_j$ and
$r$, and $w$ and $r$, respectively. The similarity between $w_i$ and
$w_j$ is then approximated as $2h / (h_i + h_j)$.
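Two of the name-matching measures above can be sketched in a few lines. The first part computes Levenshtein distance and normalises it to a similarity in [0, 1] (the normalisation by the longer string is an assumption, not something the text specifies). The second part applies the hierarchy formula 2h / (h_i + h_j) to a small hand-written hypernym tree; the tree, the `PARENT` table and all function names are hypothetical stand-ins and do not use real WordNet data or the CMS matcher classes.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def string_similarity(s: str, t: str) -> float:
    """Map edit distance to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


# Toy hypernym hierarchy (child -> parent); NOT real WordNet data.
PARENT = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "plant": "organism",
    "dog": "animal",
    "cat": "animal",
}


def path_to_root(word: str) -> list:
    """Return [word, parent, ..., root] by following PARENT links."""
    path = [word]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def hierarchy_similarity(wi: str, wj: str) -> float:
    """Compute 2*h / (h_i + h_j), with distances counted as edges to the root."""
    path_i, path_j = path_to_root(wi), path_to_root(wj)
    ancestors_j = set(path_j)
    # First node on wi's path that also lies on wj's path: least common hypernym.
    w = next(node for node in path_i if node in ancestors_j)
    hi, hj = len(path_i) - 1, len(path_j) - 1
    h = len(path_to_root(w)) - 1
    if hi + hj == 0:
        return 1.0  # both entries are the root itself
    return 2 * h / (hi + hj)
```

In this toy tree, "dog" and "cat" both sit three edges below the root and share the hypernym "animal" at depth two, so `hierarchy_similarity("dog", "cat")` evaluates 2·2 / (3 + 3), about 0.67, while "dog" and "plant" only share "organism" and score 2·1 / (3 + 2) = 0.4.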
In CMS, a semantic flavour is added in two different ways:
structure-aware and intension-aware matchers. Structure-awareness
refers to the capability of traversing class hierarchies and
accumulating similarities along the sub-class (sub-property)
relationships. Let $c$ and $d$ be two classes from the source and
target ontologies, and let $c_i$ and $d_i$ be their direct parents in
the respective ontologies. The similarity between $c$ and $d$ is
recursively defined as
$\mathit{sim}(c, d) = a \cdot \mathit{sim}_{local}(c, d) + b \cdot \mathit{sim}(c_i, d_i)$,
where $a$ and $b$ are arbitrary weights and $\mathit{sim}_{local}$/2
gives the local similarity of $c$ and $d$, which can be computed
using one or a combination of the techniques discussed above.
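The recursive definition above can be sketched as follows. Several details are assumptions, since the text does not fix them: each class has at most one direct parent, the recursion stops with the local similarity alone once either class has no parent, and the weights default to 0.7 and 0.3. The function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Optional


def structure_aware_similarity(
    c: str,
    d: str,
    parent_src: Dict[str, Optional[str]],
    parent_tgt: Dict[str, Optional[str]],
    sim_local: Callable[[str, str], float],
    a: float = 0.7,
    b: float = 0.3,
) -> float:
    """sim(c, d) = a * sim_local(c, d) + b * sim(parent(c), parent(d))."""
    local = sim_local(c, d)
    ci, di = parent_src.get(c), parent_tgt.get(d)
    if ci is None or di is None:
        # Assumed base case: at the top of either hierarchy, use the local score.
        return local
    return a * local + b * structure_aware_similarity(
        ci, di, parent_src, parent_tgt, sim_local, a, b
    )
```

With `parent_src = {"Car": "Vehicle", "Vehicle": None}`, `parent_tgt = {"Automobile": "Vehicle", "Vehicle": None}` and an exact-name local matcher, "Car" and "Automobile" have a local score of 0 but inherit 0.3 · 1.0 = 0.3 from their identically named parents.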
Intension-awareness takes into account the definitions of classes. A
class $c$ is regarded as a tuple $\langle S, P \rangle$, where $S$ is
a set of classes of which $c$ is a subclass and $P$ is a set of
properties having $c$ as domain and other classes or concrete data
types as range. Hence, finding the semantic similarity between
$c = \langle S_c, P_c \rangle$ and $d = \langle S_d, P_d \rangle$ amounts
to finding the similarity between $S_c$ and $S_d$ as well as between
$P_c$ and $P_d$, i.e.
$\mathit{sim}(c, d) = a \cdot \mathit{sim}(S_c, S_d) + b \cdot \mathit{sim}_{property}(P_c, P_d)$,
where $a$ and $b$ are arbitrary weights and $\mathit{sim}_{property}$/2
computes the property similarity. More specifically, we
differentiate the following situations (here $L$, $D$ and $F$ stand
for the names, domains and ranges of the properties in $P_c$ and
$P_d$):
- Classes with matching property names, property domains and property
ranges: $L_{P_c} = L_{P_d}$ and
$\mathit{sim}_{set}(D_{P_c}, D_{P_d}) \geq v$ and
$\mathit{sim}_{set}(F_{P_c}, F_{P_d}) \geq v$, where
$\mathit{sim}_{set}$/2 computes the similarity of two sets of
entities and $v$ is a predefined threshold.
- Classes with matching property names and property domains but
different property ranges: $L_{P_c} = L_{P_d}$ and
$\mathit{sim}_{set}(D_{P_c}, D_{P_d}) \geq v$ and
$\mathit{sim}_{set}(F_{P_c}, F_{P_d}) < v$.
- Classes with matching property names but different property domains
as well as ranges: $L_{P_c} = L_{P_d}$ and
$\mathit{sim}_{set}(D_{P_c}, D_{P_d}) < v$ and
$\mathit{sim}_{set}(F_{P_c}, F_{P_d}) < v$.
The first situation contributes the most to the similarity of $c$
and $d$. We regard classes with matching names and exactly matching
properties, i.e., properties with the same name, domain and range, as
semantically equivalent classes.
In many cases, a match between $D_{P_c}$ and $D_{P_d}$
($F_{P_c}$ and $F_{P_d}$, respectively) can only be concluded
after traversing several levels up or down the class
hierarchy. Although not as strong as an exact match of property
domains and ranges, matching classes of $D_{P_c}$
($F_{P_c}$) to remote ancestors or descendants of classes of
$D_{P_d}$ ($F_{P_d}$) provides a hint of how close the
different properties are, and thus of how similar the two concepts $c$
and $d$ are. This idea is implemented in CMS.
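The three situations distinguished in this part of the text, together with the weighted combination of superclass and property similarity, can be sketched as below. Everything beyond the formulas in the text is an assumption: Jaccard overlap stands in for the set similarity, the per-situation scores (1.0, 0.6, 0.3) are invented to reflect only that the first situation contributes the most, and the `Prop` type and function names are hypothetical. The sketch does not traverse the hierarchy to remote ancestors or descendants.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Prop:
    name: str                 # property name (L)
    domain: FrozenSet[str]    # property domain classes (D)
    range_: FrozenSet[str]    # property range classes or data types (F)


def sim_set(x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Jaccard overlap; an assumed stand-in for the set similarity."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def situation(p: Prop, q: Prop, v: float = 0.5) -> Optional[int]:
    """Return 1, 2 or 3 for the situations in the text, else None."""
    if p.name != q.name:
        return None
    domain_ok = sim_set(p.domain, q.domain) >= v
    range_ok = sim_set(p.range_, q.range_) >= v
    if domain_ok and range_ok:
        return 1
    if domain_ok and not range_ok:
        return 2
    if not domain_ok and not range_ok:
        return 3
    return None  # same name and range but different domain: not listed in the text


# Hypothetical scores; the text only says situation 1 contributes the most.
SITUATION_SCORE = {1: 1.0, 2: 0.6, 3: 0.3}


def sim_property(pc: Iterable[Prop], pd: Iterable[Prop], v: float = 0.5) -> float:
    """Average, over properties of c, of the best situation score found in d."""
    pc, pd = list(pc), list(pd)
    if not pc:
        return 0.0
    total = 0.0
    for p in pc:
        total += max((SITUATION_SCORE.get(situation(p, q, v), 0.0) for q in pd),
                     default=0.0)
    return total / len(pc)


def intension_similarity(sc, pc, sd, pd, a: float = 0.5, b: float = 0.5,
                         v: float = 0.5) -> float:
    """sim(c, d) = a * sim(S_c, S_d) + b * sim_property(P_c, P_d)."""
    return a * sim_set(frozenset(sc), frozenset(sd)) + b * sim_property(pc, pd, v)
```

Keeping the situation test separate from the scoring makes it easy to swap in a different set similarity or a different weighting without touching the classification logic.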
The most distinctive feature of CMS is its capability of combining
ontology/database schema matching systems. Existing matching systems
are wrapped to provide a uniform interface to the other modules of CMS.
In the current implementation, the FOAM alignment framework (FOAM
hereinafter) and the INRIA alignment API (INRIA hereinafter) are
invoked as external sources from which matching candidates are drawn.
The reason for using FOAM and INRIA is twofold: 1) both
systems are programmed in Java, making the integration with CMS
straightforward; 2) as illustrated in Table 4.2, although based
on similar algorithms, FOAM and INRIA produce results that are
disparate enough to make aggregation meaningful. The integration of
other ontology/database schema matching systems is forthcoming.
Table 4.2: Variant results of different mapping systems.
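The idea of wrapping external matchers behind one interface and pooling their candidates can be illustrated with a short sketch. It does not call FOAM or the INRIA alignment API and makes no claim about their actual interfaces; the `Matcher` protocol, the two stand-in matchers and `collect_candidates` are hypothetical names used only to show the wrapping pattern.

```python
from typing import Dict, List, Protocol, Tuple

# A candidate mapping: (source entity, target entity, confidence).
Correspondence = Tuple[str, str, float]


class Matcher(Protocol):
    """Assumed common interface that every wrapped system exposes."""
    name: str

    def match(self, source: List[str], target: List[str]) -> List[Correspondence]:
        ...


class ExactNameMatcher:
    """Stand-in for a wrapped external system (not FOAM or INRIA)."""
    name = "exact"

    def match(self, source, target):
        return [(s, t, 1.0) for s in source for t in target if s == t]


class CaseInsensitiveMatcher:
    """Second stand-in that deliberately produces different results."""
    name = "caseless"

    def match(self, source, target):
        return [(s, t, 0.9) for s in source for t in target
                if s.lower() == t.lower()]


def collect_candidates(
    matchers: List[Matcher], source: List[str], target: List[str]
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Pool candidates from all wrapped matchers, keyed by entity pair."""
    candidates: Dict[Tuple[str, str], Dict[str, float]] = {}
    for matcher in matchers:
        for s, t, confidence in matcher.match(source, target):
            candidates.setdefault((s, t), {})[matcher.name] = confidence
    return candidates
```

Because each pair keeps the scores of every matcher that proposed it, a later aggregation step can see where the wrapped systems agree and where only one of them found a candidate.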
- Y. Kalfoglou, H. Alani, M. Schorlemmer, and C. Walton. On the Emergent Semantic Web and overlooked issues. In Proceedings of the 3rd International Semantic Web Conference (ISWC'04), LNCS 3298, Hiroshima, Japan, pages 576-591, Nov. 2004.
This material was prepared under the CROSI project. Copyright remains with the authors. Parts or the whole of this text have been published in conferences, workshops and other knowledge disseminating events.
CROSI presents this information online merely for sake of information dissemination.
This material should not be copy-pasted without acknowledging its origins.
Please contact the authors for information on how to use or reference this material.