
Large Dataset for Keyphrases Extraction

Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio (2009) Large Dataset for Keyphrases Extraction. Technical Report DISI-09-055, Informatica e Studi Aziendali, University of Trento.

Full text available as: PDF

Abstract

We propose a large dataset for machine learning-based automatic keyphrase extraction. The dataset is of high quality and consists of 2,000 scientific papers from the computer science domain published by ACM. Each paper has keyphrases assigned by its authors and verified by reviewers. Different parts of each paper, such as the title and abstract, are separated, enabling extraction based on a part of an article's text. The content of each paper is converted from PDF to plain text. Fragments of formulae, tables, figures, and LaTeX markup were removed automatically. For this removal we used a machine learning approach based on a Maximum Entropy Model and achieved 97.04% precision. A preliminary investigation with the state-of-the-art keyphrase extraction system KEA shows improved keyphrase recognition accuracy on the refined texts.

Keywords: Keyphrases Extraction, Machine Learning, Large Dataset
Subjects: Q Science: QA Mathematics: QA075 Electronic computers. Computer science
ID Code: 1671
Deposited By: Krapivin, Mikalai
Deposited On: 18 September 2009
