Call for Papers

Feature Generation and Selection for Information Retrieval Workshop at the 33rd Annual ACM SIGIR Conference (SIGIR 2010)

July 23, 2010, Geneva, Switzerland

SUBMISSIONS DUE June 10, 2010

Details

We solicit submissions for the Workshop on Feature Generation and Selection for Information Retrieval, to be held on July 23, 2010, in Geneva, Switzerland, in conjunction with the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010). The workshop will bring together researchers and practitioners from academia and industry to discuss the latest developments in various aspects of feature generation and selection for textual information retrieval.

Modern information retrieval systems facilitate information access at unprecedented scale and level of sophistication. However, in many cases the underlying representation of text remains quite simple, often limited to a weighted bag of words. Over the years, several approaches to automatic feature generation have been proposed (such as Latent Semantic Indexing, Explicit Semantic Analysis, hashing, and Latent Dirichlet Allocation), yet their application in large-scale systems remains the exception rather than the rule. On the other hand, numerous studies in NLP and IR resort to manually crafting features, a laborious and expensive process. Such studies often focus on one specific problem, and many of the features they define are task- or domain-specific. As a result, little of this knowledge transfers to other problem domains, which limits our understanding of how to reliably construct informative features for new tasks.

An area of machine learning concerned with feature generation (also known as constructive induction) studies methods that endow computers with the ability to modify or enhance the representation language. Feature generation techniques search for new features that describe the target concepts better than the attributes supplied with the training instances. It is worth noting that traditional machine learning data sets, such as those available from the UCI repository, are distributed only as feature vectors, so their feature set is essentially fixed; indeed, constructing new features for specific UCI benchmark datasets is generally frowned upon. Textual data, on the other hand, is almost always available in its raw form (in some cases as structured data with additional side information). Given the importance of text as a data format, it is well worthwhile to design text-specific feature generation algorithms. Complementary to feature generation is the issue of feature selection, which aims to retain only the most informative features, e.g., to reduce noise and avoid overfitting, and becomes essential when numerous features are constructed automatically. Automatically generated features are often correlated, redundant, or uninformative, and should therefore be pruned through a principled selection process.
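To make the two notions concrete, here is a minimal sketch in Python (assuming scikit-learn as the toolkit; the toy corpus and labels are purely hypothetical, not part of this call) that generates features from raw text, first as a weighted bag of words and then as latent features in the style of Latent Semantic Indexing, and finally selects the most label-informative bag-of-words features with a chi-squared test.

    # Minimal sketch: feature generation vs. feature selection for text.
    # Assumes scikit-learn; the corpus and labels below are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD            # LSI-style feature generation
    from sklearn.feature_selection import SelectKBest, chi2   # feature selection

    corpus = [
        "indexing and ranking of web documents",
        "latent topic models for document retrieval",
        "spatial features for geographical search",
        "query expansion with knowledge bases",
    ]
    labels = [0, 0, 1, 1]  # hypothetical task labels

    # Feature generation, step 1: weighted bag of words.
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(corpus)

    # Feature generation, step 2: derive latent features (Latent Semantic
    # Indexing via truncated SVD) beyond the raw bag-of-words representation.
    lsi = TruncatedSVD(n_components=2)
    X_latent = lsi.fit_transform(X)

    # Feature selection: keep only the bag-of-words features most associated
    # with the labels according to a chi-squared test.
    selector = SelectKBest(chi2, k=5)
    X_selected = selector.fit_transform(X, labels)

    print("original features:", X.shape[1])
    print("latent features:  ", X_latent.shape[1])
    print("selected features:", X_selected.shape[1])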

We believe that much can be done in the quest for automatic feature generation for text processing, for example by using large-scale knowledge bases as well as the vast amounts of textual data readily accessible today. We further believe the time is ripe to bring together researchers from many related areas (including information retrieval, machine learning, statistics, and natural language processing) to address these issues and seek cross-pollination among the different fields.

Papers from a rich set of empirical, experimental, and theoretical perspectives are invited. Topics of interest for the workshop include but are not limited to:

  • Identifying cases when new features should be constructed

  • Knowledge-based methods (including identification of appropriate knowledge resources)

  • Efficiently utilizing human expertise (akin to active learning, assisted feature construction)

  • (Bayesian) nonparametric distribution models for text (e.g., LDA, hierarchical Pitman-Yor model)

  • Compression and autoencoder algorithms (e.g., information bottleneck, deep belief networks)

  • Feature selection (L1 programming, message passing, dependency measures, submodularity)

  • Cross-language methods for feature generation and selection

  • New types of features, e.g., spatial features to support geographical IR

  • Applications of feature generation in IR (e.g., constructing new features for indexing, ranking)

The workshop will include invited talks as well as presentations of accepted research contributions. The schedule will provide time for both organized and open discussion. Registration will be open to all SIGIR 2010 attendees.

Submission Instructions

Submissions should report new (unpublished) research results or ongoing research. Full papers may be up to 8 pages long and short papers up to 4 pages. Papers should be formatted in the double-column ACM SIG proceedings format (for LaTeX, use “Option 2”). Papers must be in English and must be submitted as PDF files.

Papers should be submitted electronically using EasyChair no later than 23:59 Pacific Time on Thursday, June 10, 2010.

At least one author of each accepted paper will be expected to attend and present their findings at the workshop.

Important Dates

  • Submission deadline: June 10, 2010

  • Acceptance notification: June 28, 2010

  • Camera-ready submission: July 5, 2010

  • Workshop date: July 23, 2010

Keynote speakers

  • Dr. Kenneth Church, Chief Scientist of the Human Language Technology Center of Excellence at the Johns Hopkins University

  • Dr. Yee Whye Teh, Lecturer at the Gatsby Computational Neuroscience Unit, University College London

We are grateful to the PASCAL2 Network of Excellence for travel support for our keynote speakers.

Organizing Committee

  • Evgeniy Gabrilovich, Yahoo! Research, USA

  • Alex Smola, Australian National University and Yahoo! Research, USA

  • Naftali Tishby, Hebrew University of Jerusalem, Israel

Program Committee

  • Francis Bach, INRIA, France

  • Misha Bilenko, Microsoft Research, USA

  • David Blei, Princeton, USA

  • Karsten Borgwardt, Max Planck Institute, Germany

  • Wray Buntine, NICTA, Australia

  • Raman Chandrasekar, Microsoft Research, USA

  • Kevyn Collins-Thompson, Microsoft Research, USA

  • Silviu Cucerzan, Microsoft Research, USA

  • Brian Davison, Lehigh University, USA

  • Gideon Dror, Academic College of Tel-Aviv-Yaffo, Israel

  • Arkady Epshteyn, Google, USA

  • Wai Lam, CUHK, Hong Kong SAR, China

  • Tie-Yan Liu, Microsoft Research Asia, China

  • Shaul Markovitch, Technion, Israel

  • Donald Metzler, USC/ISI, USA

  • Daichi Mochihashi, NTT, Japan

  • Patrick Pantel, Yahoo, USA

  • Filip Radlinski, Microsoft Research, United Kingdom

  • Rajat Raina, Facebook, USA

  • Pradeep Ravikumar, University of Texas at Austin, USA

  • Mehran Sahami, Stanford, USA

  • Le Song, CMU, USA

  • Krysta Svore, Microsoft Research, USA

  • Volker Tresp, Siemens, Germany

  • Eric Xing, CMU, USA

  • Kai Yu, NEC, USA

  • ChengXiang Zhai, UIUC, USA

  • Jerry Zhu, University of Wisconsin, USA