==================== PIT 2015 ====================
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter
Updated: Nov 9, 2014 (added evaluation plan)
==================================================


ORGANIZERS

  Wei Xu, University of Pennsylvania
  Chris Callison-Burch, University of Pennsylvania
  Bill Dolan, Microsoft Research


EVALUATION

The test data will be released in the same format as the train/dev data, except
that it does not have the "Label" column. Each line has 6 columns:

  | Topic_Id | Topic_Name | Sent_1 | Sent_2 | Sent_1_tag | Sent_2_tag |

Participants are required to produce a binary label (paraphrase or not) for each
sentence pair, and optionally a real number between 0 and 1 measuring semantic
similarity.

!!! Each participant is allowed to submit up to 2 runs.

Each participant is asked to submit the following files (packed as a zip file):

  PIT2015_TEAMNAME.readme
    (follow the format in the example below)
  PIT2015_TEAMNAME_01_nameofthisrun.output
    (!!! check the format before submission with the script pit2015_checkformat.py)
  PIT2015_TEAMNAME_02_nameofthisrun.output
    (optional)

The system output file should match the lines of the test data. Each line has
2 columns separated by a tab:

  | Binary Label (true/false) | Degreed Score (between 0 and 1, in 4-decimal format) |

If your system only gives binary labels, put "0.0000" in the second column of
every line.

Example files and the format-checking script:

  ./eval/sampletest.data             (participants will receive test data in the same format)
  ./eval/PIT2015_UPENN_02_lg.output  (participants need to return system outputs in the same format)
  ./eval/pit2015_checkformat.py      (script that checks the output format before submission)

Please check the shared-task website (http://alt.qcri.org/semeval2015/task1/)
on Dec 12th, 2014 to obtain the test data and submission instructions.


TRAIN/DEV DATA

The dataset contains the following files:

  ./data/train.data  (13063 sentence pairs)
  ./data/dev.data    (4727 sentence pairs)

Note that the train and dev data were collected from the same time period and
the same trending topics. In the evaluation, systems will be tested on data
collected from a different time period.

Both data files come in tab-separated format. Each line contains 7 columns:

  | Topic_Id | Topic_Name | Sent_1 | Sent_2 | Label | Sent_1_tag | Sent_2_tag |

The "Topic_Name" column contains the names of trends provided by Twitter, which
are not hashtags.

The "Sent_1" and "Sent_2" columns are the two sentences, which are not
necessarily full tweets. Tweets were tokenized (thanks to Brendan O'Connor et
al.) and split into sentences.

The "Label" column is in a format such as "(1, 4)", which means that among the
5 votes from Amazon Mechanical Turk workers, 1 was positive and 4 were
negative. We suggest mapping them to binary labels as follows:

  paraphrases:     (3, 2) (4, 1) (5, 0)
  non-paraphrases: (1, 4) (0, 5)
  debatable:       (2, 3)  (which you may discard if training a binary classifier)

The "Sent_1_tag" and "Sent_2_tag" columns are the two sentences with
part-of-speech and named entity tags (thanks to Alan Ritter).


BASELINE

A logistic regression model using simple lexical overlap features:

  ./script/baseline_logisticregression.py

It is our reimplementation in Python. This baseline was originally used by
Dipanjan Das and Noah A. Smith in their ACL 2009 paper "Paraphrase
Identification as Probabilistic Quasi-Synchronous Recognition".
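For illustration only, here is a minimal sketch (not the official baseline; the
full feature set and classifier live in ./script/baseline_logisticregression.py)
of reading train.data, applying the label mapping suggested in the TRAIN/DEV
DATA section, and computing one toy word-overlap feature of the "simple lexical
overlap" kind. The file path and column layout come from this README; the
function names and the Jaccard-style feature are assumptions made for the
example.

    # sketch_read_data.py -- illustrative only; see baseline_logisticregression.py
    # for the actual baseline. Assumes the 7-column tab-separated format above.

    def load_pairs(path, discard_debatable=True):
        """Read train/dev data and map "(n_pos, n_neg)" vote labels to binary."""
        pairs = []
        with open(path, encoding='utf-8') as f:
            for line in f:
                (topic_id, topic_name, sent1, sent2,
                 label, tag1, tag2) = line.rstrip('\n').split('\t')
                n_pos = int(label.strip('()').split(',')[0])
                if n_pos >= 3:                 # (3,2) (4,1) (5,0) -> paraphrase
                    pairs.append((sent1, sent2, True))
                elif n_pos <= 1:               # (1,4) (0,5) -> non-paraphrase
                    pairs.append((sent1, sent2, False))
                elif not discard_debatable:    # (2,3) is debatable
                    pairs.append((sent1, sent2, False))
        return pairs

    def word_overlap(sent1, sent2):
        """One toy lexical-overlap feature: Jaccard overlap of lowercased tokens."""
        w1, w2 = set(sent1.lower().split()), set(sent2.lower().split())
        return len(w1 & w2) / float(len(w1 | w2)) if w1 | w2 else 0.0

    if __name__ == '__main__':
        train = load_pairs('./data/train.data')
        print(len(train), 'pairs kept after discarding debatable cases')
        print(word_overlap(*train[0][:2]))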
To run the script, you will need to install the NLTK and Megam packages:

  http://www.nltk.org/_modules/nltk/classify/megam.html
  http://www.umiacs.umd.edu/~hal/megam/index.html

If you have trouble with Megam, you may need to rebuild it from source:

  http://stackoverflow.com/questions/11071901/stuck-in-using-megam-in-python-nltk-classify-maxentclassifier

Example output, when training on train.data and testing on dev.data, looks like
this:

  Read in 11513 training data ...  (after discarding the data with debatable cases)
  Read in 4139 test data ...       (see details in the TRAIN/DEV DATA section)
  PRECISION: 0.704069050555
  RECALL:    0.389229720518
  F1:        0.501316944688
  ACCURACY:  0.725537569461

The script will provide the numbers for plotting precision/recall curves, or a
single precision/recall/F1 score with a 0.5 cutoff on the predicted probability.
(A minimal sketch of this metric computation appears at the end of this README,
after the references.)


REFERENCES

  @article{Xu-EtAl-2014:TACL,
    author    = {Wei Xu and Alan Ritter and Chris Callison-Burch and William B. Dolan and Yangfeng Ji},
    title     = {Extracting Lexically Divergent Paraphrases from {Twitter}},
    journal   = {Transactions of the Association for Computational Linguistics},
    year      = {2014},
    publisher = {Association for Computational Linguistics},
    url       = {http://www.cis.upenn.edu/~xwe/files/tacl2014-extracting-paraphrases-from-twitter.pdf}
  }

  @phdthesis{xu2014data,
    author = {Xu, Wei},
    title  = {Data-Driven Approaches for Paraphrasing Across Language Variations},
    school = {Department of Computer Science, New York University},
    year   = {2014},
    url    = {http://www.cis.upenn.edu/~xwe/files/thesis-wei.pdf}
  }

  (More details about how this data was collected, plus some analysis, are in
  Chapter 6 of the thesis.)
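
SCORING SKETCH

As a reference for the metrics reported in the BASELINE section, here is a
minimal, self-contained sketch of computing precision, recall, F1, and accuracy
with a 0.5 cutoff on the predicted probability. It is not the official scorer
and not the baseline script's code; the input lists of gold labels and
predicted probabilities are assumptions made for the example.

    # scoring_sketch.py -- illustrative only; the baseline script and the
    # official evaluation may differ in detail.

    def evaluate(gold, prob, cutoff=0.5):
        """Precision/recall/F1/accuracy for predicted probabilities vs. gold labels.

        gold : list of booleans (True = paraphrase)
        prob : list of floats between 0 and 1 (predicted probability of paraphrase)
        """
        pred = [p >= cutoff for p in prob]
        tp = sum(1 for g, y in zip(gold, pred) if g and y)
        fp = sum(1 for g, y in zip(gold, pred) if not g and y)
        fn = sum(1 for g, y in zip(gold, pred) if g and not y)
        correct = sum(1 for g, y in zip(gold, pred) if g == y)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = correct / float(len(gold))
        return precision, recall, f1, accuracy

    if __name__ == '__main__':
        # Tiny made-up example: 3 gold labels and 3 predicted probabilities.
        p, r, f1, acc = evaluate([True, False, True], [0.9, 0.4, 0.3])
        print('PRECISION: %.4f  RECALL: %.4f  F1: %.4f  ACCURACY: %.4f'
              % (p, r, f1, acc))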