Yankova-Doseva, Milena (2010) TERMS: Text Extraction from Redundant and Multiple Sources. PhD thesis, University of Sheffield.
Abstract
In this work we present our approach to the identity resolution problem: discovering references to one and the same object that come from different sources. Solving this problem is important for a number of communities (e.g. Database, NLP and Semantic Web) that process heterogeneous data in which variations of the same objects are referenced in different formats (e.g. textual documents, web pages, database records and ontologies). Identity resolution aims to create a single view of the data in which different facts are interlinked and incompleteness is remedied.
We propose a four-step approach that starts with schema alignment of incoming data sources. In the second step, candidate selection, we discard entities that are clearly different from those they are compared with. In the third step, the main evidence for the identity of two entities comes from similarity measures that compare their attribute values. The last step in the identity resolution process is data fusion: merging entities found to be identical into a single object.
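The four steps can be illustrated with a minimal sketch. This is not the thesis's IDRF implementation: the field mapping `SCHEMA_MAP`, the first-letter candidate filter, the `difflib`-based similarity, the 0.8 threshold and all function names are assumptions chosen only to make the pipeline concrete.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source-specific field names to a shared schema.
SCHEMA_MAP = {"company": "name", "org_name": "name",
              "city": "location", "hq": "location"}


def align_schema(record):
    """Step 1: rename source-specific attributes to the shared schema."""
    return {SCHEMA_MAP.get(key, key): value for key, value in record.items()}


def is_candidate(a, b):
    """Step 2: cheap filter that discards clearly different pairs.

    Assumption: two entities are only compared if their names start
    with the same letter (case-insensitive).
    """
    return a.get("name", "")[:1].lower() == b.get("name", "")[:1].lower()


def similarity(a, b):
    """Step 3: mean string similarity over the attributes both entities share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def fuse(existing, incoming):
    """Step 4: merge two identical entities; existing values win, gaps are filled."""
    merged = dict(incoming)
    merged.update(existing)
    return merged


def resolve(records, threshold=0.8):
    """Run the four steps over a list of raw records from different sources."""
    resolved = []
    for raw in records:
        entity = align_schema(raw)
        for i, known in enumerate(resolved):
            if is_candidate(entity, known) and similarity(entity, known) >= threshold:
                resolved[i] = fuse(known, entity)
                break
        else:
            # No sufficiently similar entity was found, so keep this one as new.
            resolved.append(entity)
    return resolved
```

Calling `resolve` on records such as `{"company": "Acme Ltd", "city": "Sheffield"}` and `{"org_name": "ACME Ltd", "hq": "Sheffield", "sector": "tools"}` would merge them into a single entity that carries the `sector` attribute from the second source.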
The principal novel contribution of our solution is the use of a rich semantic knowledge representation that allows for flexible and unified interpretation during the resolution process. Thus we are not restricted in the type of information that can be processed (although we have focussed our work on problems relating to information extracted from text). We report the implementation of these four steps in an IDentity Resolution Framework (IDRF) and their application to two use cases. We propose a rule-based approach for customisation in each step and introduce logical operators and their interpretation during the process. Our final evaluation shows that this approach facilitates high accuracy in resolving identity.
Metadata
Supervisors: | Cunningham, Hamish |
---|---|
Keywords: | identity resolution, ontologies, semantics, record linkage, deduplication, ontology based information extraction |
Awarding institution: | University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield) |
Identification Number/EthosID: | uk.bl.ethos.527243 |
Depositing User: | Mrs Milena Yankova-Doseva |
Date Deposited: | 23 Jul 2010 08:46 |
Last Modified: | 27 Apr 2016 14:09 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:933 |
Download
Filename: yankova_final.pdf
Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License