Mining Complex Entities from Heterogeneous Information Networks

Abstract

Most research on information mining has focused on classic Information Extraction (IE) tasks over structured and unstructured documents such as newspaper articles and web pages. In recent years, however, the staggering growth of social media as a platform for sharing content has shifted the focus towards a different type of extraction target. Social media pose a number of challenges to information extraction: contributions to social media sites such as blogs, forums and Twitter are conversational in nature and thus tend to be brief and informal, containing imprecise, subjective and ambiguous information. The expanded context (who the author is, their social and geographical context, their social links, etc.) therefore becomes relevant for disambiguating and interlinking information.

The aim of this tutorial is to introduce and discuss issues, methodologies and technologies for extracting information from documents, with a particular focus on mining heterogeneous information networks (e.g. social websites) for complex entities.

The tutorial will last 3.5 hours (including breaks) and will cover:

Introduction to information extraction from documents in general (20 minutes) and from information networks in particular (10 minutes)
Introduction to machine learning based methods for information extraction (75 minutes)
1. representing documents and feature sets
2. entity and terminology recognition
3. learning gazetteers
4. event and relation extraction
5. extraction from multimedia documents
Annotation for training (15 minutes)
1. feature selection
2. annotation and error
3. porting across domains
Information extraction from information networks (45 minutes)
1. using the Twitter and Facebook APIs
2. entity recognition and resolution
3. term association
4. large-scale entity disambiguation
Conclusion and future work (15 minutes)

The focus is on Machine Learning based methods. We will cover, among others, methods using Rule Induction, SVM, CRF, HMM, Transfer Learning and Active Learning. We will also discuss real-world cases from the field of information and knowledge management.

Presenters Info

Prof. Fabio Ciravegna is Director of Research and Innovation in the Digital World at the University of Sheffield. He is also Professor of Language and Knowledge Technologies and Head of the OAK Group in the Department of Computer Science. His research concerns Human Language and Semantic Web Technologies, with a focus on their use for Knowledge Management. He is currently principal investigator on five projects (three European Integrated Projects and two industrial projects). From 2006 to 2010 he was director of the European Integrated Project IST X-Media (€13.6m budget) on knowledge management across media, and from 2002 to 2005 he was director of the EU project Dot.Kom on designing information extraction for knowledge management. He has published more than 100 papers, is a member of the editorial board of the International Journal on “Web Semantics” and was on that of the International Journal of Human Computer Studies until 2009. In 2009 he was General Chair of the European Semantic Web Conference and Sponsorship Chair of the International Semantic Web Conference.

Fabio has given 17 invited speeches at international conferences, workshops and events including the 3rd Asian Semantic Web Conference in Bangkok (2009), the 11th International Conference on Business Information Systems in Innsbruck (2009), the Twelfth International Conference on Artificial Intelligence in Varna (BG, 2006) and the Search Engine Meeting 2004 in Den Haag. He has given 16 tutorials at international conferences, events and summer schools (several of them invited).

Among them:

Invited Tutorial on Introduction to the Semantic Web at the 9th International Semantic Web Conference, Shanghai, China, 7-11 November 2010.
Invited tutor at the Summer School on Multimedia Semantics ’09, Koblenz, 23-28 August 2009. Tutorial on Information Extraction from texts.
Invited Tutorial on Introduction to the Semantic Web at the 7th International Semantic Web Conference, Karlsruhe, Germany, 26-30 October 2008.
Invited Tutorial on Semantic Web Technologies in Large Distributed Enterprises at the Norwegian Semantic Days 2008, Stavanger, Norway, 23-24 April 2008.
Tutorial on Semantic Web Technologies for Knowledge Management in Large Distributed Organisations at the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, Busan, Korea, 11-15 November 2007.
Invited Tutorial on Introduction to the Semantic Web at the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, Busan, Korea, 11-15 November 2007.
Tutorial on Semantic Web Technologies for Knowledge Management in Large Distributed Organisations at the 4th European Semantic Web Conference, Innsbruck, 3-7 June 2007.
Invited tutor seven times at the European Summer School on Ontological Engineering and the Semantic Web, Cercedilla (Spain), July 2003, 2004, 2005, 2006, 2007, 2008 and 2009.
Adaptive Text Extraction and Mining, joint tutorial and workshop with Nicholas Kushmerick at the 14th European Conference on Machine Learning (ECML), Cavtat, Croatia, Sept 2003.
Adaptive Text Extraction and Mining, tutorial with Nicholas Kushmerick at the 15th European Conference on Artificial Intelligence (ECAI 2002), Lyon, France, July 2002.

He holds one patent on Hybrid Search for Knowledge Management. Another patent on terminology recognition in the aerospace domain is pending. He holds a PhD from the University of East Anglia and a Doctorship from the University of Torino, Italy.

Ziqi Zhang is a research associate who has been working in the field of Information Extraction since 2006. As part of his research he is pursuing a Ph.D. specialising in entity recognition that harnesses background knowledge resources. He has been involved in a number of projects related to information extraction, and has presented his work at major conferences oriented towards computational linguistics (EKAW, RANLP, LREC, etc.). He gave a one-day tutorial on “Knowledge Acquisition from Social Media Platforms” during the 2010 EKAW conference in Lisbon, and a tutorial on terminology extraction during the Ontogenesis Network meeting in 2008.

Point of Contact: Fabio Ciravegna, fabio@dcs.shef.ac.uk