SimMetrics from The University of Sheffield
SimMetrics is an open source, extensible library of similarity and distance metrics, e.g. Levenshtein distance, L2 distance, cosine similarity and Jaccard similarity. It provides float-based similarity measures between strings, as well as each metric's typical unnormalised output.
It is intended for researchers in information integration (II) and related fields. It includes a range of similarity measures from a variety of communities, including statistics, DNA analysis, artificial intelligence, information retrieval, and databases.
What's the Problem?
Information in the real world is:
Information fusion techniques help distinguish the provenance and reliability of sources, thus enabling data validation (or invalidation) via different methods:
An information fusion process integrates statements, detects overlaps, conflicts and inconsistencies, and helps handle the problems it detects.
Towards a Solution
Similarity metrics are a statistical approach to information integration: each metric compares two strings and returns a similarity value.
Different algorithms are available
The SimMetrics library has been developed to provide a consistent interface layer to similarity measures that behave in a normalised manner, allowing metrics to be compared and composed whilst still giving access to each basic algorithm's original output.
All metrics can work on a simple basis whereby they take two strings and return a similarity measure from 0.0 to 1.0, 0.0 being entirely different, 1.0 being identical.
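To make the 0.0 to 1.0 convention concrete, here is a minimal Python sketch of a normalised Levenshtein similarity alongside its unnormalised distance. It is not the SimMetrics API: the function names are hypothetical, and dividing by the length of the longer string is one common normalisation that is assumed here rather than taken from the library.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Unnormalised output: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalised output in [0.0, 1.0]; 1.0 means identical.
    Assumption: normalise by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this assumed normalisation, two identical strings score 1.0 and two equal-length strings with no characters in common score 0.0.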
The metrics have been optimised for fast processing and include methods that provide timing estimates.
Any metric that uses a cost function allows that function to be replaced or modified, so custom metrics can be developed (cost functions are covered in more detail in the descriptions of the various string metrics).
This standardised, interface-based approach allows techniques to be combined, rather than relying on inconsistent strategies that do not 'map' onto one another.
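The idea of a replaceable cost function can be sketched as follows. This is an illustrative Python example, not the library's own interface: `weighted_edit_distance`, `unit_cost`, `case_insensitive_cost` and the `gap_cost` parameter are all hypothetical names chosen for this sketch.

```python
from typing import Callable

# A substitution cost function maps a pair of characters to a non-negative cost.
CostFunction = Callable[[str, str], float]


def unit_cost(x: str, y: str) -> float:
    """Standard Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def case_insensitive_cost(x: str, y: str) -> float:
    """Custom cost: characters differing only in case count as a match."""
    return 0.0 if x.lower() == y.lower() else 1.0


def weighted_edit_distance(
    a: str,
    b: str,
    substitution_cost: CostFunction = unit_cost,
    gap_cost: float = 1.0,
) -> float:
    """Edit distance whose substitution cost is supplied by the caller."""
    previous = [j * gap_cost for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * gap_cost]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + gap_cost,       # delete char_a
                    current[j - 1] + gap_cost,    # insert char_b
                    previous[j - 1] + substitution_cost(char_a, char_b),
                )
            )
        previous = current
    return previous[-1]
```

Passing `case_insensitive_cost` instead of the default turns the same algorithm into a case-insensitive metric without touching the dynamic-programming code, which is the kind of customisation the paragraph above describes.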
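Because every normalised metric shares the same shape (two strings in, a float between 0.0 and 1.0 out), metrics can be composed. The Python sketch below shows one possible composition, a weighted average; the `combine` helper and the two simple metrics are hypothetical illustrations and are not taken from SimMetrics.

```python
from typing import Callable, Sequence, Tuple

# Shared interface assumed for this sketch: two strings in, score in [0.0, 1.0] out.
Metric = Callable[[str, str], float]


def jaccard_token_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def bigram_dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over sets of lower-cased character bigrams."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Strings too short for bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def combine(weighted_metrics: Sequence[Tuple[Metric, float]]) -> Metric:
    """Build a new metric as the weighted average of existing ones.
    The result stays in [0.0, 1.0] because every input metric does."""
    total_weight = sum(weight for _, weight in weighted_metrics)

    def combined(a: str, b: str) -> float:
        score = sum(weight * metric(a, b) for metric, weight in weighted_metrics)
        return score / total_weight

    return combined


# Example: weight token overlap twice as heavily as character bigrams.
name_similarity = combine([(jaccard_token_similarity, 2.0),
                           (bigram_dice_similarity, 1.0)])
```

The combined function has the same signature as its parts, so it can itself be passed to `combine` or used anywhere a single metric is expected.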
The SimMetrics library has been used for a number of different applications:
More details are available at http://sourceforge.net/projects/simmetrics/
Current downloads 8,600+ (15-30 every day)
The project was initially developed within AKT and Dot.KOM and has continuing support under the IPAS and X-Media projects, where it is used in semantic information integration.