last updated: 2012-09-01
The challenge is now closed.
The data have been released publicly on the UCI Machine Learning Repository.
Nomao is a search engine of places: it collects data from multiple sources on the web and aggregates them.
For a given query, the elements describing a place may come from different sources. So, to present an understandable and unified answer, Nomao has (i) to detect that different data refer to the same place, and (ii) to aggregate their different contents. The purpose of the challenge is to improve the deduplication process for places that have not yet been identified as duplicates.
For instance, given a query such as "La poste Lille", the method should be able to detect that spots 2 and 3 refer to the same place, but not spot 1:
|Spot||Name||Address||Phone|
|1||La poste||13 Rue De La Clef 59000 Lille France||3631|
|2||La poste||13 Rue Nationale 59000 Lille France||3631|
|3||La poste lille||13 r. nationale 59000 lille||0320313131|
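As a rough illustration (not Nomao's actual method), such pairs can be compared field by field with a string-similarity measure; the resulting scores are the kind of features a deduplication classifier would consume. The field layout and the `pair_features` helper below are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# The three spots returned for the query "La poste Lille"
# (fields: name, address, phone).
spot1 = ("La poste", "13 Rue De La Clef 59000 Lille France", "3631")
spot2 = ("La poste", "13 Rue Nationale 59000 Lille France", "3631")
spot3 = ("La poste lille", "13 r. nationale 59000 lille", "0320313131")

def pair_features(s, t):
    """Field-by-field similarity scores for a candidate pair of spots."""
    return [similarity(s[i], t[i]) for i in range(3)]

print(pair_features(spot2, spot3))  # the pair that should be merged
print(pair_features(spot1, spot2))  # the pair that should not
```

A real system would combine many such features (and handle missing fields) before letting a classifier decide.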
To automate that deduplication process, Machine Learning is well suited, and to optimize the creation of the training dataset, Active Learning is appropriate. However, in such a real-world setting, labeling data is costly while large amounts of unlabeled data are available. This raises specific issues, the main ones being the scalability of the proposed method, the representativity of the training dataset, and the practicability of the labeling process.
Aims of the challenge
To achieve the deduplication process, a classifier can be (i) trained on an initial dataset (ID) and then (ii) used to designate, in another unlabeled dataset (UD), the data that should be considered first by the Nomao expert, so that the classifier improves its results as much as possible.
A brief tutorial on active learning can be found here: http://www.causality.inf.ethz.ch/activelearning.php?page=tutorial#cont
Given a trained classifier, there are different ways to choose which instances of UD should be considered first:
- Exploitation: the N instances whose predicted labels are the most uncertain
- Exploration: N instances chosen at random
- A compromise between exploration and exploitation
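These three selection strategies can be sketched as follows; this is a generic illustration (the function names and the mixing parameter are my own, not part of the challenge), assuming the classifier outputs a probability of class +1 for each unlabeled instance:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_batch(probs: np.ndarray, n: int, strategy: str = "exploit",
                 mix: float = 0.5) -> np.ndarray:
    """Return indices of n unlabeled instances to send for labeling.

    probs: predicted probability of class +1 for each unlabeled instance.
    Uncertainty is measured as closeness to 0.5.
    """
    if strategy == "exploit":      # the n instances with the most uncertain label
        uncertainty = -np.abs(probs - 0.5)
        return np.argsort(uncertainty)[-n:]
    if strategy == "explore":      # n instances drawn uniformly at random
        return rng.choice(len(probs), size=n, replace=False)
    # compromise: a fraction `mix` by uncertainty, the rest at random
    k = int(n * mix)
    picked = set(select_batch(probs, k, "exploit").tolist())
    pool = [i for i in range(len(probs)) if i not in picked]
    picked.update(rng.choice(pool, size=n - k, replace=False).tolist())
    return np.array(sorted(picked))

probs = rng.uniform(size=1000)
batch = select_batch(probs, 100, "exploit")   # e.g. N = 100, as in this challenge
```

The compromise strategy hedges against a poorly calibrated classifier: pure exploitation can keep querying one confusing region of the space while never discovering others.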
This challenge has specific aspects of a real-world problem with realistic data:
- Training from a small part of the entire distribution:
The distribution of the place pairs cannot be known in advance. Therefore the sampling used to build the initial dataset is not uniform, in the sense that it may not cover the entire domain space. This is illustrated in Figure 1: the entire distribution could be the blue circle and the initial dataset the red circle. Besides, this dataset carries a human prior, since its instances were selected by Nomao experts according to their intuition.
Fig-1: Learning when test and train inputs can have different distributions
- Learning when test and train inputs can have different distributions:
On the one hand, the initial dataset was built by Nomao's human experts; on the other hand, the instances included in the test dataset (TD) were drawn at random. Thus the results obtained on the initial training dataset could differ from those obtained on the test dataset. This is illustrated in Figure 1: the initial dataset could be the red circle, while the test set, the grey circle, could be disjoint from it.
- Purchase of data labels by batches:
With such high volumes of data, the classical way of running active learning (labeling examples one by one and updating the model at each step) is impractical: it takes too long and demands too much of the labeling expert. Instead, sets of examples must be proposed for labeling, which raises specific issues of its own.
- Register for the challenge
and download the initial training dataset (NomaoTrain.txt),
the test dataset (NomaoTest.txt) and the unlabeled dataset (NomaoUnlabeled.txt).
The target variable is only provided for the training dataset.
See the file NomaoNames.txt for more details on the data.
- First Active Campaign - Up to Friday, June 1, 2012 (12:00 PM):
- train a classifier using the training dataset (and the other datasets if you want to use a semi-supervised method)
- send us (by mail to firstname.lastname@example.org) the predicted labels for the test dataset (PL1)
- request N (set to 100) labels (by mail to email@example.com) for instances belonging to the unlabeled dataset
- Second Active Campaign - Up to Friday, June 8, 2012 (12:00 PM):
- train a classifier using the training dataset (and the other datasets if you want to use a semi-supervised method) + the N instances whose labels you requested
- send us (by mail to firstname.lastname@example.org) the predicted labels for the test dataset (PL2)
- request N (set to 100) labels (by mail to email@example.com) for instances belonging to the unlabeled dataset.
- Final Test Campaign - Up to Friday, June 15, 2012 (12:00 PM):
- train a classifier using the training dataset (and the other datasets if you want to use a semi-supervised method) + the 2*N instances whose labels you requested
- send us (by mail to firstname.lastname@example.org) the predicted labels for the test dataset (PL3)
- Write a paper and submit it to the ALRA workshop:
- Paper submission deadline: Friday, June 29, 2012
- Paper acceptance notification: Tuesday, July 31, 2012
- Paper camera-ready deadline: Tuesday, August 14, 2012
- Attend the ALRA Workshop: Friday, September 28, 2012, Bristol, UK
The number N=100 of instance labels each participant is allowed to request depends on the total number of participants, since the Nomao expert can label around 1,000 - 2,000 instances for each campaign.
File NomaoTrain.txt contains the initial training dataset (ID), made of 29,104 examples:
- each line corresponds to one example
- fields (example features) are separated by commas (,)
- first field is the example identifier, composed of the names of the spots being compared, separated by a sharp (#)
- last field is the label of the example: +1 if the spots must be merged, and -1 if they refer to different entities
- missing data are represented by question marks (?)
Files NomaoTest.txt and NomaoUnlabeled.txt contain the same format of data, but with the example labels being hidden, replaced by a question mark (?). NomaoTest.txt contains 1,985 examples and NomaoUnlabeled.txt contains 100,000 examples.
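A loader for this format can be sketched as follows; it is a minimal illustration, assuming field values contain no embedded commas (the function name is my own):

```python
import csv

def load_nomao(path):
    """Parse a Nomao data file: comma-separated fields, '?' marks missing
    values, the first field is the pair identifier (the two spot names joined
    by '#'), and the last field is the label (+1 / -1) or '?' when hidden."""
    ids, X, y = [], [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            ids.append(row[0])
            X.append([None if v == "?" else v for v in row[1:-1]])
            y.append(None if row[-1] == "?" else int(row[-1]))
    return ids, X, y
```

The same function covers NomaoTrain.txt (labels present) and NomaoTest.txt / NomaoUnlabeled.txt (labels read back as None).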
File NomaoNames.txt contains the description of the datasets.
The submission format is the following: one line per example, containing 3 fields separated by commas (,)
- the name of the example (names of the spots separated by #)
- its predicted class (+1 or -1)
- and the measure of confidence in the prediction (set it to 0 if you don't have such information)
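Writing a submission file in this format is a one-liner per example; the helper name and the pair names below are hypothetical:

```python
def write_submission(path, predictions):
    """Write one line per example: name, predicted class, confidence.

    predictions: iterable of (example_name, label, confidence), where the
    name is the two spot names joined by '#', label is +1 or -1, and
    confidence is 0 when no such measure is available."""
    with open(path, "w") as f:
        for name, label, conf in predictions:
            f.write(f"{name},{label:+d},{conf}\n")

# Hypothetical pairs -- real names come from the dataset's first field.
write_submission("PL1.txt", [("spotA#spotB", 1, 0.93), ("spotC#spotD", -1, 0)])
```

The `:+d` format forces the explicit sign on the class, matching the +1 / -1 convention above.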
The goal of the challenge is to liven up the ECML-PKDD 2012 workshop ALRA (Active Learning in Real-world Applications) and to promote new active learning methods.
Do it just for the fun of it!
Please do not share the labels asked during active campaigns.
For now, the organizers have not planned to implement cheat-detection methods. They expect that participants will play fair...
Note that the value of N is much smaller than the size of the initial training dataset. An exchange of labels between participants could distort the classifiers' improvement between the first and second active campaigns.
Time: The reference time for requesting labels and sending results is the one displayed on the challenge website.
Conditions of participation: Anybody who complies with the rules of the challenge is welcome to participate. The participants are not required to attend the workshop (but they are encouraged to) where the results will be discussed, and the workshop is open to non-challenge participants.
Anonymity: All entrants must identify themselves to the organizers (first name, last name and affiliation). A backlink from your website to this challenge page would be welcome. Your identity will not be revealed publicly if you ask for anonymity.
Data: All the datasets will be revealed at the end of the challenge and placed in the UCI Machine Learning Repository.
Reproducibility: Everything that is not explicitly forbidden is allowed. We forbid acquiring labels under fake names, registering multiple times, or exchanging labels with other participants. Participation is not conditioned on delivering your code or publishing your methods. However, we will ask the top-ranking participants to cooperate voluntarily in reproducing their results. This will include filling out a fact sheet about their methods and possibly participating in post-challenge tests and sending us their code, including the source code. The outcome of our attempt to reproduce your results will be published and will add credibility to your results.
The prediction performance will be evaluated according to the Area Under the ROC Curve (AUC). The AUC values AUC1, AUC2 and AUC3, corresponding respectively to the predicted labels PL1, PL2 and PL3, will not be revealed during the challenge but only at the end.
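The organizers' scoring tool is not published, but AUC itself is standard; one way to compute it is via its rank-statistic form, sketched here for the challenge's +1 / -1 labels:

```python
def auc(labels, scores):
    """Area under the ROC curve via its rank-statistic form: the probability
    that a randomly chosen positive example is scored above a randomly chosen
    negative one (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy illustration: a perfect ranking gives AUC 1.0.
print(auc([1, 1, -1, -1], [0.9, 0.6, 0.4, 0.1]))
```

This is why the confidence field in the submission matters: AUC is computed from the ranking induced by the scores, not from the hard +1 / -1 decisions alone.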
The final ranking of the participants will be based on the improvement of the AUC between AUC3 and AUC2 and between AUC2 and AUC1. The improvement will be analyzed in light of the baseline model (revealed at the end of the challenge). The goal is to achieve the best improvement AND to beat the baseline model.
Papers that address this issue are welcome at the ALRA workshop. Their authors will thus contribute to the comparison of the proposed solutions and to the discussions during the workshop.