==================== PIT 2015 ====================
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter
Updated: Nov 9, 2014 (added evaluation plan)
==================================================


ORGANIZERS

  Wei Xu, University of Pennsylvania
  Chris Callison-Burch, University of Pennsylvania
  Bill Dolan, Microsoft Research


EVALUATION

The test data will be released in the same format as the train/dev data, except
that it does not have the "Label" column. Each line has 6 columns:

  | Topic_Id | Topic_Name | Sent_1 | Sent_2 | Sent_1_tag | Sent_2_tag |

Participants are required to produce a binary label (paraphrase or not) for each
sentence pair, and optionally a real number between 0 and 1 measuring semantic
similarity.

!!! Each participant is allowed to submit up to 2 runs.

Each participant is asked to submit the following files (packed as a zip file):

  PIT2015_TEAMNAME.readme
    (follow the format in the example below)
  PIT2015_TEAMNAME_01_nameofthisrun.output
    (!!! check the format before submission with the script pit2015_checkformat.py)
  PIT2015_TEAMNAME_02_nameofthisrun.output
    (optional)

The system output file should match the lines of the test data. Each line has
2 columns separated by a tab:

  | Binary Label (true/false) | Degreed Score (between 0 and 1, in 4-decimal format) |

If your system only gives binary labels, put "0.0000" in the second column of
every line.

Example files and the format-checking script:

  ./eval/sampletest.data             (participants will receive test data in the same format)
  ./eval/PIT2015_UPENN_02_lg.output  (participants need to return system outputs in the same format)
  ./eval/pit2015_checkformat.py      (script that checks the output format before submission)

Please check the shared-task website (http://alt.qcri.org/semeval2015/task1/)
on Dec 12th, 2014 to obtain the test data and submission instructions.


TRAIN/DEV DATA

The dataset contains the following files:

  ./data/train.data  (13063 sentence pairs)
  ./data/dev.data    (4727 sentence pairs)

Note that the train and dev data were collected from the same time period and
the same trending topics. In the evaluation, systems will be tested on data
collected from a different time period.

Both data files come in tab-separated format. Each line contains 7 columns:

  | Topic_Id | Topic_Name | Sent_1 | Sent_2 | Label | Sent_1_tag | Sent_2_tag |

The "Topic_Name" column contains the names of trends provided by Twitter, which
are not hashtags.

The "Sent_1" and "Sent_2" columns are the two sentences, which are not
necessarily full tweets. Tweets were tokenized (thanks to Brendan O'Connor et
al.) and split into sentences.

The "Label" column is in a format such as "(1, 4)", which means that among the
5 votes from Amazon Mechanical Turk workers, 1 was positive and 4 were
negative. We suggest mapping them to binary labels as follows:

  paraphrases:     (3, 2) (4, 1) (5, 0)
  non-paraphrases: (1, 4) (0, 5)
  debatable:       (2, 3)  (which you may discard if training a binary classifier)

The "Sent_1_tag" and "Sent_2_tag" columns are the two sentences with
part-of-speech and named entity tags (thanks to Alan Ritter).


BASELINE

A logistic regression model using simple lexical overlap features:

  ./script/baseline_logisticregression.py

It is our reimplementation in Python. This baseline was originally used by
Dipanjan Das and Noah A. Smith in their ACL 2009 paper "Paraphrase
Identification as Probabilistic Quasi-Synchronous Recognition".
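For illustration only, here is a minimal sketch (not the official baseline; the
full feature set and classifier live in ./script/baseline_logisticregression.py)
of reading train.data, applying the label mapping suggested in the TRAIN/DEV
DATA section, and computing one toy word-overlap feature of the "simple lexical
overlap" kind. The file path and column layout come from this README; the
function names and the Jaccard-style feature are assumptions made for the
example.

    # sketch_read_data.py -- illustrative only; see baseline_logisticregression.py
    # for the actual baseline. Assumes the 7-column tab-separated format above.

    def load_pairs(path, discard_debatable=True):
        """Read train/dev data and map "(n_pos, n_neg)" vote labels to binary."""
        pairs = []
        with open(path, encoding='utf-8') as f:
            for line in f:
                (topic_id, topic_name, sent1, sent2,
                 label, tag1, tag2) = line.rstrip('\n').split('\t')
                n_pos = int(label.strip('()').split(',')[0])
                if n_pos >= 3:                 # (3,2) (4,1) (5,0) -> paraphrase
                    pairs.append((sent1, sent2, True))
                elif n_pos <= 1:               # (1,4) (0,5) -> non-paraphrase
                    pairs.append((sent1, sent2, False))
                elif not discard_debatable:    # (2,3) is debatable
                    pairs.append((sent1, sent2, False))
        return pairs

    def word_overlap(sent1, sent2):
        """One toy lexical-overlap feature: Jaccard overlap of lowercased tokens."""
        w1, w2 = set(sent1.lower().split()), set(sent2.lower().split())
        return len(w1 & w2) / float(len(w1 | w2)) if w1 | w2 else 0.0

    if __name__ == '__main__':
        train = load_pairs('./data/train.data')
        print(len(train), 'pairs kept after discarding debatable cases')
        print(word_overlap(*train[0][:2]))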
To run the script, you will need to install the NLTK and Megam packages:

  http://www.nltk.org/_modules/nltk/classify/megam.html
  http://www.umiacs.umd.edu/~hal/megam/index.html

If you have trouble with Megam, you may need to rebuild it from source:

  http://stackoverflow.com/questions/11071901/stuck-in-using-megam-in-python-nltk-classify-maxentclassifier

Example output, when training on train.data and testing on dev.data, looks like
this:

  Read in 11513 training data ...  (after discarding the data with debatable cases)
  Read in 4139 test data ...       (see details in the TRAIN/DEV DATA section)
  PRECISION: 0.704069050555
  RECALL:    0.389229720518
  F1:        0.501316944688
  ACCURACY:  0.725537569461

The script will provide the numbers for plotting precision/recall curves, or a
single precision/recall/F1 score with a 0.5 cutoff on the predicted probability.
(A minimal sketch of this metric computation appears at the end of this README,
after the references.)


REFERENCES

  @article{Xu-EtAl-2014:TACL,
    author    = {Wei Xu and Alan Ritter and Chris Callison-Burch and William B. Dolan and Yangfeng Ji},
    title     = {Extracting Lexically Divergent Paraphrases from {Twitter}},
    journal   = {Transactions of the Association for Computational Linguistics},
    year      = {2014},
    publisher = {Association for Computational Linguistics},
    url       = {http://www.cis.upenn.edu/~xwe/files/tacl2014-extracting-paraphrases-from-twitter.pdf}
  }

  @phdthesis{xu2014data,
    author = {Xu, Wei},
    title  = {Data-Driven Approaches for Paraphrasing Across Language Variations},
    school = {Department of Computer Science, New York University},
    year   = {2014},
    url    = {http://www.cis.upenn.edu/~xwe/files/thesis-wei.pdf}
  }

  (More details about how this data was collected, plus some analysis, are in
  Chapter 6 of the thesis.)
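
SCORING SKETCH

As a reference for the metrics reported in the BASELINE section, here is a
minimal, self-contained sketch of computing precision, recall, F1, and accuracy
with a 0.5 cutoff on the predicted probability. It is not the official scorer
and not the baseline script's code; the input lists of gold labels and
predicted probabilities are assumptions made for the example.

    # scoring_sketch.py -- illustrative only; the baseline script and the
    # official evaluation may differ in detail.

    def evaluate(gold, prob, cutoff=0.5):
        """Precision/recall/F1/accuracy for predicted probabilities vs. gold labels.

        gold : list of booleans (True = paraphrase)
        prob : list of floats between 0 and 1 (predicted probability of paraphrase)
        """
        pred = [p >= cutoff for p in prob]
        tp = sum(1 for g, y in zip(gold, pred) if g and y)
        fp = sum(1 for g, y in zip(gold, pred) if not g and y)
        fn = sum(1 for g, y in zip(gold, pred) if g and not y)
        correct = sum(1 for g, y in zip(gold, pred) if g == y)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = correct / float(len(gold))
        return precision, recall, f1, accuracy

    if __name__ == '__main__':
        # Tiny made-up example: 3 gold labels and 3 predicted probabilities.
        p, r, f1, acc = evaluate([True, False, True], [0.9, 0.4, 0.3])
        print('PRECISION: %.4f  RECALL: %.4f  F1: %.4f  ACCURACY: %.4f'
              % (p, r, f1, acc))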