AKTors.org
SimMetrics from The University of Sheffield

 

SimMetrics is an open source, extensible library of similarity and distance metrics, e.g. Levenshtein distance, L2 distance, cosine similarity and Jaccard similarity. It provides float-based similarity measures between strings, as well as each metric's typical unnormalised output.

It is intended for researchers in information integration (II) and other related fields. It includes a range of similarity measures from a variety of communities, including statistics, DNA analysis, artificial intelligence, information retrieval, and databases.

SimMetrics fact-file

Owner: The University of Sheffield

Researchers (listed alphabetically): Mr Sam Chapman, Dr Fabio Ciravegna

Description: http://www.dcs.shef.ac.uk/~sam/simmetrics.html

Used by: Armadillo

Addresses challenges: Information Integration

What's the Problem?

Information in the real world is:

  • dynamic
  • uncertain
  • ambiguous
  • heterogeneous
  • distributed
  • imprecise

Information fusion techniques help to distinguish the provenance and reliability of sources, thus enabling data validation (or invalidation) via different methods:

  • Statistical
  • Fuzzy
  • Logic consistency checking

An information fusion process integrates statements, detects overlaps, conflicts and inconsistencies, and helps to handle the problems it detects.

Towards a Solution

Similarity metrics are a statistical approach to Information Integration: they compare two strings and return a similarity value.

Different algorithms are available:

  • Phonetic similarity
  • Linguistic similarity
  • Character-based similarity
  • Term-based similarity
  • Record-based similarity
  • Graph-based similarity

The SimMetrics library has been developed to provide a consistent interface layer to similarity measures that act in a normalised manner, allowing comparison and composition of metrics whilst still allowing use of each basic algorithm's original output.

All metrics can work on a simple basis whereby they take two strings and return a similarity measure from 0.0 to 1.0, 0.0 being entirely different, 1.0 being identical.
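
As an illustration of this contract, the following is a minimal sketch of a metric that exposes both a normalised similarity in the range 0.0 to 1.0 and the underlying unnormalised output. The names StringMetric, LevenshteinSketch, unnormalised and similarity are hypothetical and chosen for this example; they are not taken from the SimMetrics API, and the normalisation shown (dividing by the longer string's length) is one common choice, assumed here.

```java
// Hypothetical names throughout; this is an illustration, not the SimMetrics API.
interface StringMetric {
    /** Raw, unnormalised output of the underlying algorithm. */
    float unnormalised(String a, String b);

    /** Similarity scaled to [0.0, 1.0]; 1.0 means identical. */
    float similarity(String a, String b);
}

final class LevenshteinSketch implements StringMetric {
    @Override
    public float unnormalised(String a, String b) {
        // Two-row dynamic programme with unit insert, delete and substitute costs.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    @Override
    public float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        // Assumed normalisation: edit distance divided by the longer length.
        return 1.0f - unnormalised(a, b) / longest;
    }
}
```

With this sketch, comparing "kitten" and "sitting" gives an unnormalised distance of 3 and a similarity of 1 - 3/7, while two identical strings score 1.0 and two strings of equal length with no characters in common score 0.0.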

The metrics have been optimised for fast processing and include methods that provide timing estimates.

Any metric that uses a cost function allows that function to be replaced or modified, so custom metrics can be developed (cost functions are covered in more detail in the descriptions of the various string metrics).
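
To make the idea of a replaceable cost function concrete, here is a small sketch of an edit distance whose substitution cost is supplied by the caller. SubstitutionCost and WeightedEditDistance are hypothetical names invented for this example and are not claimed to match the library's own classes; the fixed gap cost is likewise an assumption.

```java
// Hypothetical names; a sketch of a pluggable cost function, not the SimMetrics API.
interface SubstitutionCost {
    /** Cost of replacing character a with character b; 0 means "treated as equal". */
    float cost(char a, char b);
}

final class WeightedEditDistance {
    private final SubstitutionCost substitution;
    private final float gapCost; // assumed fixed cost for an insert or delete

    WeightedEditDistance(SubstitutionCost substitution, float gapCost) {
        this.substitution = substitution;
        this.gapCost = gapCost;
    }

    float distance(String a, String b) {
        float[][] d = new float[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            d[i][0] = i * gapCost;
        }
        for (int j = 1; j <= b.length(); j++) {
            d[0][j] = j * gapCost;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                float sub = d[i - 1][j - 1] + substitution.cost(a.charAt(i - 1), b.charAt(j - 1));
                float gap = Math.min(d[i - 1][j], d[i][j - 1]) + gapCost;
                d[i][j] = Math.min(sub, gap);
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Swapping the cost function changes the metric without touching the algorithm. For example, a cost of `(x, y) -> Character.toLowerCase(x) == Character.toLowerCase(y) ? 0f : 1f` makes "Smith" and "smith" distance 0, whereas an exact-match cost of `(x, y) -> x == y ? 0f : 1f` gives distance 1.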

This standardised, interface-based approach allows techniques to be combined, rather than relying on inconsistent strategies that do not 'map'.
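
Because every metric returns a value on the same 0.0 to 1.0 scale, metrics can be composed directly. The sketch below combines two deliberately simple metrics by weighted average. NormalisedMetric, MetricCombiner and its methods are hypothetical names for illustration only, and the weighted average is one assumed way of combining scores, not a method the library is known to prescribe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical names; a sketch of composing normalised metrics, not the SimMetrics API.
interface NormalisedMetric {
    float similarity(String a, String b); // contract: result lies in [0.0, 1.0]
}

final class MetricCombiner {
    /** Term-based metric: Jaccard overlap of lower-cased whitespace-separated tokens. */
    static NormalisedMetric tokenJaccard() {
        return (a, b) -> {
            Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            ta.retainAll(tb); // ta now holds the intersection
            return union.isEmpty() ? 1.0f : (float) ta.size() / union.size();
        };
    }

    /** Character-based metric: shared prefix length over the longer string's length. */
    static NormalisedMetric commonPrefix() {
        return (a, b) -> {
            int longest = Math.max(a.length(), b.length());
            if (longest == 0) {
                return 1.0f;
            }
            int shortest = Math.min(a.length(), b.length());
            int n = 0;
            while (n < shortest && a.charAt(n) == b.charAt(n)) {
                n++;
            }
            return (float) n / longest;
        };
    }

    /** Weighted average of two metrics; the result stays in [0.0, 1.0] for positive weights. */
    static NormalisedMetric weighted(NormalisedMetric m1, float w1, NormalisedMetric m2, float w2) {
        return (a, b) -> (w1 * m1.similarity(a, b) + w2 * m2.similarity(a, b)) / (w1 + w2);
    }
}
```

For "john smith" against "smith john", the token metric scores 1.0 and the prefix metric scores 0.0, so an equally weighted combination yields 0.5. The combined metric is itself a NormalisedMetric, so it can be compared with, or nested inside, other metrics.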

Usage

The SimMetrics library has been used for a number of different applications:

  • Fraud Detection 
  • Plagiarism Detection 
  • Ontology Merging 
  • DNA analysis (student projects) 
  • RNA analysis 
  • Image analysis (used for fast feature comparison) 
  • Evidence based machine learning (merging multiple captures) 
  • MS Excel plugin for cell similarity 
  • Database deduplication 
  • Data Mining 
  • Web Interfaces e.g. Ajax style suggestions as you type 
  • Data Integration 
  • Semantic Knowledge Integration 

Library details

  • License : GNU General Public License (GPL)
  • Operating System : All 32-bit MS Windows (95/98/NT/2000/XP) , OS Independent (Written in an interpreted language)
  • Programming Language : C# , Java , Visual Basic .NET

More details are available at http://sourceforge.net/projects/simmetrics/

Downloads

Current downloads: 8,600+ (15-30 every day)

Project links

The project was developed initially within AKT and Dot.KOM and has continuing support under the IPAS and X-Media projects, where it is utilised in semantic information integration.