WOJ-DB: a Word Order Judgement Database

Survey data collected for the project leading to the Ph.D. thesis entitled
"The mat sat on the cat: Investigating structure in the evaluation of order
in machine translation" by Martin McCaffery, supervised by Mark-Jan Nederhof.

23 June 2017

INTRODUCTION

This dataset contains both subjective (human) and mechanical judgements of
sentence quality, along with metadata on both the sentences and their
judges. It was produced for the above-mentioned Ph.D. project, but is
intended as a tool to aid researchers in as broad an area as possible. A
more detailed overview can be found in the Ph.D. thesis text, section 6.2.

FORMAT & FILES

The data is provided in tab-separated values format, suitable for use with
the read.delim() command in R (read.csv() assumes comma separators).

- sentences.txt contains three primary columns: a human-produced fluent
  "reference" translation; a machine-produced "hypothesis" translation; and
  a unique identifier for the pair.

- dataset.csv contains all scores and metadata detailed below. Any section
  and page references below are to the Ph.D. thesis, which is freely
  available online.

DATASET FIELDS

### SENTENCE IDENTIFICATION

sent.id          unique ID for the sentence, matching sentences.txt
sent.index       position (1..50) at which the sentence appeared within its survey
participant.id   unique ID for this questionnaire and participant

### PARTICIPANT METADATA (thesis section 6.3.2)

english.native   does the participant consider themselves a native English speaker?
age              participant's self-reported age
english.since    number of years the participant has spoken English, if not native
langs.minimal    number of languages spoken at a minimal fluency level
langs.moderate   number of languages spoken at a moderate fluency level
langs.fluent     number of languages spoken fluently
langs.native     number of languages spoken as a native
langs.points     score based on languages spoken: see 6.5.2
gender           participant's self-reported gender: see 6.3.2 and page 167
nationality      participant's self-reported nationality: see 6.3.2 and page 167
education        participant's self-reported education level: see 6.3.2 and page 167
caring           participant's self-reported interest in language: see 6.3.2 and page 167
dyslexia         whether the participant has dyslexia: see 6.3.2 and page 167

### HUMAN SCORES (thesis section 6.2.1 and page 167)

human.hol        reported overall sentence quality: section 6.2.1 question 1
human.ord        reported quality of word ordering: section 6.2.1 question 2

### SENTENCE METADATA (thesis section 6.3.4)

is.translation   is the hypothesis a machine translation of the same source
                 sentence as the reference?
is.permutation   inverse of is.translation: was this hypothesis produced by
                 altering the reference itself, as per thesis section 6.2.3?
permutation.type        if (is.permutation), text name of the permutation type
is.order                equal to (permutation.type == 'order')
is.phrase               equal to (permutation.type == 'phrase')
is.choice               equal to (permutation.type == 'choice')
is.swap                 equal to (permutation.type == 'swap')
ref.length              number of words in the reference sentence
hyp.length              number of words in the hypothesis sentence
src.length              number of words in the source sentence (not English; not provided)
degree                  if (is.permutation), number of words altered by the permutation algorithm
human.adq               WMT score for the sentence, normalised as per section 6.1.1.5
real.score              either 'degree' or 'human.adq', as appropriate
real.score.factorised   real.score, rounded to a five-point scale
sent.id.count           how many times this sentence appeared across all surveys
both.count              how many rows contain these exact human.hol and human.ord scores

### AUTOMATIC SCORES FOR TOOLS PRODUCED FOR THESIS

# Column names are built from the following underscore-separated parts, in order:
# - "dted" and "derp" are the names of the tools (chapters 4 and 5 respectively)
# - "mlt" and "stn" are the MaltParser and Stanford Parser (section 3.4.3)
# - "met" and "stn" are the MaxEnt Treebank and Stanford Parser (section 3.4.2)
# - "cdec" and "giza" are the cdec and GIZA++ projects (section 3.4.5)
# - 'f' at the end indicates dependency trees were flattened before use (section 3.4.4)
# - other suffix letters indicate DTED configurations (section 4.3)
# The full list of columns for DTED and DERP is as follows:

dted_mlt_met_cdec_fcl  dted_mlt_met_cdec_fco  dted_mlt_met_cdec_cl  dted_mlt_met_cdec_co
dted_mlt_met_cdec_c    dted_mlt_met_cdec_fc   dted_mlt_met_cdec_fb  dted_mlt_met_cdec_b
dted_mlt_met_giza_fcl  dted_mlt_met_giza_fco  dted_mlt_met_giza_cl  dted_mlt_met_giza_co
dted_mlt_met_giza_c    dted_mlt_met_giza_fc   dted_mlt_met_giza_fb  dted_mlt_met_giza_b
dted_mlt_stn_cdec_fco  dted_mlt_stn_cdec_co
dted_mlt_stn_giza_fcl  dted_mlt_stn_giza_fco  dted_mlt_stn_giza_cl  dted_mlt_stn_giza_co
dted_mlt_stn_giza_c    dted_mlt_stn_giza_fc   dted_mlt_stn_giza_fb  dted_mlt_stn_giza_b
dted_stn_stn_giza_fcl   dted_stn_stn_giza_fco    dted_stn_stn_giza_cl    dted_stn_stn_giza_co
dted_stn_stn_giza_c     dted_stn_stn_giza_fc     dted_stn_stn_giza_fb    dted_stn_stn_giza_b
dted_mlt_met_meteor_c   dted_mlt_met_meteor_fc   dted_mlt_met_meteor_fb  dted_mlt_met_meteor_b
dted_mlt_stn_meteor_c   dted_mlt_stn_meteor_fc   dted_mlt_stn_meteor_fb  dted_mlt_stn_meteor_b
dted_stn_stn_meteor_c   dted_stn_stn_meteor_fc   dted_stn_stn_meteor_fb  dted_stn_stn_meteor_b
derp_mlt_met_cdec_f     derp_mlt_met_cdec        derp_mlt_met_giza_f     derp_mlt_met_giza
derp_mlt_stn_cdec_f     derp_mlt_stn_cdec        derp_mlt_stn_giza_f     derp_mlt_stn_giza
derp_stn_stn_cdec_f     derp_stn_stn_cdec        derp_stn_stn_giza_f     derp_stn_stn_giza

### AUTOMATIC SCORES FOR OTHER TOOLS

meteor.1.5       Meteor scores calculated during the Ph.D. project
meteor.1.5.wo    Meteor (chunking) scores: see 3.5.4
meteor.wmt       Meteor scores from WMT data: see 7.3.1

The following columns each contain scores extracted from WMT data (see 7.3.1):

depref.align        depref.exact         LEPOR_v3            MEANT
nLEPOR.baseline     reverted             SIMPBLEU.prec       SIMPBLEU.recall
terrorcat           UMEANT               apac                BEER
DCUcomb.seg         DCU_seg              DiscoTK.light.kool  DiscoTK.light
DiscoTK.party       DiscoTK.party.tuned  nrc_amber           nrc_bleu
upc.ipa             upc.stout            VERTa.EQ            VERTa.W
BEER_Treepel        BS                   chrF3               chrF
DPMFcomb            DPMF                 dreem               LeBLEU.default
LeBLEU.optimized    meteor_wsd           ratatouille         UoW.LSTM
upf.cobalt          VERTa.70Adeq30Flu
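EXAMPLE

As a quick-start sketch, the tab-separated files can be read with nothing
but Python's standard library. This is illustrative only: the miniature
inline sample, the assumption of a header row, and the semantic labels in
the final line (tool/parser/tagger/aligner/configuration, following the
component list above) are assumptions, not guaranteed properties of the
released files.

```python
import csv
import io

def load_tsv(stream):
    """Parse a tab-separated stream into a list of dict rows (header row assumed)."""
    return list(csv.DictReader(stream, delimiter="\t"))

# Hypothetical miniature sample standing in for dataset.csv;
# the real file holds all of the fields described above.
sample = io.StringIO(
    "sent.id\tsent.index\thuman.hol\thuman.ord\n"
    "s001\t1\t4\t5\n"
    "s001\t7\t3\t3\n"
)
rows = load_tsv(sample)

# Group the two human scores by sent.id, e.g. to compare repeated
# judgements of the same sentence across surveys.
by_sentence = {}
for row in rows:
    by_sentence.setdefault(row["sent.id"], []).append(
        (int(row["human.hol"]), int(row["human.ord"]))
    )
print(by_sentence["s001"])  # [(4, 5), (3, 3)]

# Decompose an automatic-score column name into its underscore-separated
# parts; the variable names here are an interpretation of the component
# list above, not terminology from the dataset itself.
tool, parser, tagger, aligner, config = "dted_mlt_met_cdec_fcl".split("_")
```

The same load_tsv() call applies to sentences.txt (reference, hypothesis,
identifier); in R, read.delim("dataset.csv") plays the equivalent role.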