release 2018.0
Jackie Lo committed Jan 23, 2019
0 parents commit 1bc6abf
Showing 97 changed files with 9,575 additions and 0 deletions.
17 changes: 17 additions & 0 deletions LICENSE
@@ -0,0 +1,17 @@
Special Note: The following license applies to the contents of all subdirectories
with the exception of the src/cmdlp subdirectory, which is governed by its own
license. Please refer to src/cmdlp/LICENSE for the license pertaining to the
src/cmdlp subdirectory.

Multilingual Text Processing / Traitement multilingue de textes
Digital Technologies Research Centre / Centre de recherche en technologies numériques
National Research Council Canada / Conseil national de recherches Canada

Copyright 2018, Her Majesty in Right of Canada /
Copyright 2018, Sa Majeste la Reine du Chef du Canada

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
13 changes: 13 additions & 0 deletions NOTICE
@@ -0,0 +1,13 @@
YiSi

Copyright 2018, Her Majesty in Right of Canada /
Copyright 2018, Sa Majeste la Reine du Chef du Canada

YiSi was developed at:
Multilingual Text Processing / Traitement multilingue de textes
Digital Technologies Research Centre / Centre de recherche en technologies numériques
National Research Council Canada / Conseil national de recherches Canada

The command line parser in src/cmdlp was developed by Markus S. Saers.
It is covered by its own license (see src/cmdlp/LICENSE).
The cmdlp package is available from https://github.com/masaers/cmdlp
156 changes: 156 additions & 0 deletions README.md
@@ -0,0 +1,156 @@
# YiSi: A Semantic Machine Translation Evaluation Metric for Evaluating Languages with Different Levels of Available Resources
## Introduction
YiSi<sup>[a]</sup> is a family of semantic machine translation (MT) evaluation metrics
with a flexible architecture for evaluating MT output in languages of different
resource levels. Inspired by MEANT 2.0 (Lo, 2017), YiSi-1 measures the similarity
between the human references and machine translation by aggregating the weighted
distributional lexical semantic similarity, and, optionally, the shallow semantic
structures. YiSi-0 is a degenerate resource-free version using the longest
common character substring accuracy to replace distributional semantics for
evaluating lexical similarity between the human reference and MT output. On the
other hand, YiSi-2 is the bilingual reference-less version using bilingual word
embeddings for evaluating crosslingual lexical semantic similarity between the input
and MT output.
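
To make the YiSi-0 idea concrete, here is a minimal sketch of finding the longest common character substring of two strings. The helper name `lcs_len` is hypothetical and not part of YiSi, and how YiSi-0 turns this length into an accuracy score is not described here, so the sketch stops at the raw length. Character handling for non-ASCII text depends on your `awk` and locale.

```bash
# Hypothetical helper (not part of YiSi): print the length of the longest
# common contiguous substring of two strings, using dynamic programming.
# Note: awk -v interprets backslash escapes in the values passed in.
lcs_len() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        best = 0
        for (i = 1; i <= length(a); i++)
            for (j = 1; j <= length(b); j++)
                if (substr(a, i, 1) == substr(b, j, 1)) {
                    # Extend the common run that ended at (i-1, j-1).
                    cur[i, j] = cur[i - 1, j - 1] + 1
                    if (cur[i, j] > best) best = cur[i, j]
                }
        print best
    }'
}

# Example:
#   lcs_len "translation" "translator"   # prints 8 ("translat")
```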

YiSi-1 achieved the highest average correlation with human direct assessment (DA)
judgment across all language pairs at the system level and the highest median correlation
with DA relative ranking across all language pairs at the segment level in the WMT2018
metrics task (Ma et al., 2018). YiSi-1 was also used successfully in the WMT2018 parallel
corpus filtering task, while YiSi-2 showed comparable accuracy in the same task.

YiSi-0 is readily available for evaluating all languages. YiSi-1 requires a
monolingual corpus in the output language to train the distributional lexical
semantics model. YiSi-1_srl is designed for resource-rich languages that are equipped
with an automatic semantic role labeler in the output language. YiSi-2 requires
bilingual word embeddings and YiSi-2_srl additionally requires an automatic semantic
role labeler for both the input and output languages.

<sup>[a]</sup> YiSi is the romanization of the Cantonese word "意思/meaning".

## Installation

### Prerequisites
#### Base requirements
- YiSi was developed to run on Linux.
- YiSi is written in C++ and requires a version of `g++` that supports C++11; we're using GCC 4.9.3.
- YiSi requires `make`; we're using GNU Make 3.81.
- YiSi requires `bash`; we're using GNU bash, version 4.1.2.

#### Additional requirements to use SRL
- YiSi interfaces with a Java SRL library (mateplus) and therefore requires JDK 1.8 to build `srlmate.jar`.
- Define the `JAVA_HOME` environment variable:
```bash
export JAVA_HOME=/path/to/jdk_install_directory
```
- YiSi depends on mateplus, an extended version of the mate-tools semantic role labeler.
You can download and install mateplus from:
https://github.com/microth/mateplus
- Make sure to install all the mateplus basic dependencies listed in its README, i.e. without FrameNet and ParZu extensions.
- Define the `MATEPLUS_HOME` environment variable:
```bash
export MATEPLUS_HOME=/path/to/mateplus_install_directory
```
Thus, the location of `mateplus.jar` is `$MATEPLUS_HOME/mateplus.jar`
- Put the JAR files for the dependencies you install for mateplus in `$MATEPLUS_HOME/lib`.
- Put the models you download for mateplus in `$MATEPLUS_HOME/lib`.
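
Before building with SRL support, it can help to confirm that both environment variables point where the build expects. The following is a rough sketch only: `check_srl_prereqs` is a hypothetical helper that is not shipped with YiSi, and it checks just the two paths mentioned above (`$JAVA_HOME/bin/javac` and `$MATEPLUS_HOME/mateplus.jar`), not the mateplus dependencies or models.

```bash
# Hypothetical helper (not shipped with YiSi): report whether the SRL
# prerequisites described above appear to be in place.
# Returns 0 if both checks pass, 1 otherwise.
check_srl_prereqs() {
    local status=0
    if [ ! -x "${JAVA_HOME:-}/bin/javac" ]; then
        echo "javac not found under JAVA_HOME='${JAVA_HOME:-}'" >&2
        status=1
    fi
    if [ ! -f "${MATEPLUS_HOME:-}/mateplus.jar" ]; then
        echo "mateplus.jar not found under MATEPLUS_HOME='${MATEPLUS_HOME:-}'" >&2
        status=1
    fi
    return "$status"
}

# Example:
#   check_srl_prereqs && echo "SRL prerequisites look OK"
```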

### Building YiSi
If building YiSi with SRLMATE in order to use SRL, then either define the `JAVA_HOME`
and `MATEPLUS_HOME` environment variables as instructed above, or edit the default
values defined in the YiSi `src/Makefile` and `test/Makefile`.

You may also want to define:
```bash
export YISI_HOME=/path/to/YiSi_git
```

To build YiSi, run the following commands:
```bash
cd $YISI_HOME/src
make all -j 4
```

To run the YiSi tests, either from `$YISI_HOME/src/` or `$YISI_HOME/test/`, run:
```bash
make test
```

If mateplus is not installed or `MATEPLUS_HOME` does not point at your mateplus,
YiSi will be built without SRLMATE; otherwise YiSi will be built with SRLMATE.

No additional `make install` step is needed for YiSi. The `make all` step builds
all the YiSi programs in `$YISI_HOME/bin/`.

The path to SRLMATE, if it was built, is: `$YISI_HOME/obj/srlmate.jar`
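
As a quick post-build check, the sketch below reports which kind of build is present by looking at the two locations named above. `yisi_build_mode` is a hypothetical helper, not part of YiSi, and the result is only a heuristic: a stale `srlmate.jar` left over from an earlier build would still be reported as "with SRLMATE".

```bash
# Hypothetical helper (not part of YiSi): guess the build mode from the
# files that `make all` is described as producing.
# Usage: yisi_build_mode [/path/to/YiSi_git]   (defaults to $YISI_HOME)
yisi_build_mode() {
    local home="${1:-${YISI_HOME:-}}"
    if [ ! -x "$home/bin/yisi" ]; then
        echo "not built"
    elif [ -f "$home/obj/srlmate.jar" ]; then
        echo "with SRLMATE"
    else
        echo "without SRLMATE"
    fi
}
```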

## Running YiSi
Although probably not required, we recommend adding the YiSi bin directory to `$PATH`:
```bash
export PATH=$YISI_HOME/bin:$PATH
```
YiSi has a lot of command line options (see `yisi --help`).
It's easiest to drive YiSi using a config file.
For example:
```bash
> cd $YISI_HOME/test

> cat yisi-1.config
srclang=de
tgtlang=en
lexsim-type=w2v
outlexsim-path=mini.d300.en
reflexweight-type=learn
phrasesim-type=nwpr
ngram-size=3
mode=yisi
alpha=0.8
ref-file=test_ref.en
hyp-file=test_hyp.en
sntscore-file=test_hyp.sntyisi1
docscore-file=test_hyp.docyisi1

> yisi --config yisi-1.config
Reading w2v text model from mini.d300.en
Size of voc: 500 Dimension: 300
Finished reading w2v model.
Learning lex weight from test_ref.en ... Done.
Tokenizing/SRL-ing hyp ... Done.
Tokenizing/SRL-ing ref ... Done.
Evaluating line 1
Evaluating line 2
Evaluating line 3
Evaluating line 4
Evaluating line 5
Evaluating line 6
Evaluating line 7
Evaluating line 8
Evaluating line 9
Evaluating line 10
```
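If you score many file pairs, you may prefer to generate the config file rather than edit it by hand. The sketch below is one possible way to do that: `write_yisi1_config` is a hypothetical helper, the option names and values are copied from the `yisi-1.config` sample above, and the output file naming (`<hyp>.sntyisi1`, `<hyp>.docyisi1`) is an assumption made for illustration.

```bash
# Hypothetical helper (not part of YiSi): write a YiSi-1 config file that
# reuses the settings of the sample yisi-1.config shown above.
# Usage: write_yisi1_config REF_FILE HYP_FILE EMBEDDING_FILE OUT_CONFIG
write_yisi1_config() {
    local ref="$1" hyp="$2" emb="$3" out="$4"
    cat > "$out" <<EOF
srclang=de
tgtlang=en
lexsim-type=w2v
outlexsim-path=$emb
reflexweight-type=learn
phrasesim-type=nwpr
ngram-size=3
mode=yisi
alpha=0.8
ref-file=$ref
hyp-file=$hyp
sntscore-file=$hyp.sntyisi1
docscore-file=$hyp.docyisi1
EOF
}

# Example (run from $YISI_HOME/test):
#   write_yisi1_config test_ref.en test_hyp.en mini.d300.en my-yisi-1.config
#   yisi --config my-yisi-1.config
```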
`$YISI_HOME/test/` contains sample config files for running various YiSi scenarios on toy data:
```bash
> cd $YISI_HOME/test
> ls yisi-*.config
yisi-0.config yisi-1.config yisi-1_srl.config yisi-2.config yisi-2_srl.config
```
Please note: YiSi-2_srl is not ready for release yet, so don't try running `yisi --config yisi-2_srl.config`.

`$YISI_HOME/bin/` also contains many test programs (`*_test`),
which are used primarily for unit-testing.
See `$YISI_HOME/test/Makefile` for examples of how to call these programs, if interested.

## Pretrained word embeddings for YiSi-1
Unit vectors built by word2vec trained on the latest WMT translation task monolingual data are available for download at:
http://chikiu-jackie-lo.org/home/index.php/yisi
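
After downloading a model, you can spot-check that its vectors really are unit length. This sketch assumes the model is in the common word2vec text layout (a `vocab_size dimension` header line followed by one `word v1 v2 ... vD` line per word); that layout is an assumption, and `w2v_norms` is a hypothetical helper, not part of YiSi.

```bash
# Hypothetical helper (not part of YiSi): print the Euclidean norm of the
# first N vectors (default 5) of a word2vec-style text model.
# Assumed layout: header line "vocab_size dimension", then "word v1 ... vD".
# Usage: w2v_norms MODEL_FILE [N]
w2v_norms() {
    awk -v n="${2:-5}" 'NR > 1 && NR <= n + 1 {
        s = 0
        for (i = 2; i <= NF; i++) s += $i * $i
        printf "%s %.3f\n", $1, sqrt(s)
    }' "$1"
}
```

For unit vectors, every printed norm should be close to 1.000.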

## References
[In progress]

## Acknowledgements
I would like to give special thanks to the following people:

Darlene Stewart, for her major efforts in defensive coding and packaging the software. This release would be in much worse shape without her covering up the potholes lying everywhere.

Markus Saers, for his accommodations in licensing the command line parser and fulfilling wishlist items in it.

Everyone in the NRC MTP team and Karteek Addanki, Meriem Beloucif, Nedjma Ousidhoum, Andrew Cattle and Marine Carpuat, for their moral support at the critical moment when YiSi was born.
157 changes: 157 additions & 0 deletions src/Makefile
@@ -0,0 +1,157 @@
# @file Makefile
# @brief Makefile for building YiSi
#
# @author Jackie Lo and Darlene Stewart
#
# Multilingual Text Processing / Traitement multilingue de textes
# Digital Technologies Research Centre / Centre de recherche en technologies numériques
# National Research Council Canada / Conseil national de recherches Canada
# Copyright 2018, Her Majesty in Right of Canada /
# Copyright 2018, Sa Majeste la Reine du Chef du Canada

# Override the value of MATEPLUS_HOME with a command line definition, or
# consider defining MATEPLUS_HOME in your .profile, for example:
# export MATEPLUS_HOME=~/u/sandboxes/mateplus
MATEPLUS_HOME ?= ~/u/tools/MATE/mateplus-master/src

JAVA_HOME ?= /space/group/nrc_ict/pkgs/centos6/gcc-4.9.3/jdk1.8.0_131

MATEPLUS_PATH ?= $(MATEPLUS_HOME)/mateplus.jar

ifneq ("$(wildcard $(MATEPLUS_PATH))","")
WITH_SRLMATE ?= True
else
$(info *** mateplus.jar not found)
endif

ifneq (clean, $(MAKECMDGOALS))
ifneq (cleaner, $(MAKECMDGOALS))
ifdef WITH_SRLMATE
$(info Building with SRLMATE...)
else
$(info Building without SRLMATE...)
endif
endif
endif

CXXFLAGS += -Wall -pedantic -std=c++11 -g -O3
JFLAGS += -cp ${MATEPLUS_PATH}

PROG_NAMES := yisi
TEST_NAMES := cmdlp_test srlgraph_test maxmatching_test lexsim_test w2v_test biw2v_test \
lexweight_test phrasesim_test srl_test srlutil_test util_test \
yisiscorer_test testbin

ifdef WITH_SRLMATE
TEST_NAMES += srlmate_test
endif

# List of binaries that need to be built
BIN_NAMES := $(PROG_NAMES) $(TEST_NAMES)

# List of all possible binaries (programs), including those that won't be built.
ALL_BIN_NAMES := $(BIN_NAMES) srlmate_test

SRC_OBJS := $(patsubst %.cpp,../obj/%.o,$(wildcard *.cpp))
CMDLP_OBJS := $(patsubst cmdlp/%.cpp,../obj/%.o,$(wildcard cmdlp/*.cpp))

# We compile/link SRLMATE objects a bit differently.
SRLMATE_OBJS := $(addprefix ../obj/,srlmate.o)
SRLMATE_BIN_OBJS := $(addprefix ../obj/,srlmate_test.o)
SRLMATE_BINS := $(addprefix ../bin/,srlmate_test)


ifdef WITH_SRLMATE
SRLMATE_OBJS += $(addprefix ../obj/,srl.o)
SRLMATE_BINS += $(addprefix ../bin/,srl_test yisiscorer_test yisi)
endif

# Object files are c++ sources that do not result in stand alone binaries
ALL_OBJECTS := $(filter-out $(addprefix ../obj/,$(ALL_BIN_NAMES:%=%.o)),$(SRC_OBJS) $(CMDLP_OBJS))

OBJECTS := $(filter-out $(SRLMATE_OBJS),$(ALL_OBJECTS))

#
# Targets
#

# Clear default suffix rules
.SUFFIXES:
# Keep dependencies between calls
.PRECIOUS: ../dep/%.d ../obj/%.o

.PHONY: all binaries scripts
all: binaries scripts
ifdef WITH_SRLMATE
all: ../obj/srlmate.jar
all: en.mplsconfig de.mplsconfig es.mplsconfig zh.mplsconfig
endif

binaries: $(BIN_NAMES:%=../bin/%)

YISIBIN_SUB := "s~^YISIBIN=/path/to/your/yisi/bin$$~YISIBIN=$(dir $(CURDIR))bin~"

.PHONY: scripts
scripts: | ../bin
	cp -p ../src/scripts/resolve_yisicmd.sh ../bin
	sed -e $(YISIBIN_SUB) scripts/run_yisi.sh > ../bin/run_yisi.sh
	chmod a+x ../bin/*.sh

../obj/srlmate.jar: *.java
	mkdir -pv ../obj/java
	${JAVA_HOME}/bin/javac $(JFLAGS) -d ../obj/java $^
	cd ../obj/java && jar -cvf ../srlmate.jar *

en.mplsconfig de.mplsconfig es.mplsconfig zh.mplsconfig: %: %.template
	sed -e "s#<YISI_HOME>/#$(dir $(CURDIR))#g; s#<MATEPLUS_HOME>#$(MATEPLUS_HOME)#g;" < $< > $@

$(SRLMATE_BINS): LDFLAGS += -L${JAVA_HOME}/jre/lib/amd64/server
$(SRLMATE_BINS): LIBRARIES += -ljvm
$(SRLMATE_OBJS) $(SRLMATE_BIN_OBJS): CXXFLAGS += -I${JAVA_HOME}/include -I${JAVA_HOME}/include/linux -DWITH_SRLMATE

$(SRLMATE_BINS): $(SRLMATE_OBJS)
$(SRLMATE_BINS): OBJECTS += $(SRLMATE_OBJS)

../bin/%: ../obj/%.o $(OBJECTS) | ../bin
	$(CXX) $(LDFLAGS) $< $(OBJECTS) $(LIBRARIES) -o $@

$(SRC_OBJS): ../obj/%.o: %.cpp | ../obj ../dep

$(CMDLP_OBJS): ../obj/%.o: cmdlp/%.cpp | ../obj ../dep

ifdef WITH_SRLMATE
../obj/srl.o: ../obj/.STAMP.WITH_SRLMATE
else
../obj/srl.o: ../obj/.STAMP.WITHOUT_SRLMATE
endif

../obj/%.o:
	$(CXX) $(CXXFLAGS) -MM -MT '$@' $< > $(@:../obj/%.o=../dep/%.d)
	$(CXX) $(CXXFLAGS) -c $< -o $@

../obj/.STAMP.WITH_SRLMATE ../obj/.STAMP.WITHOUT_SRLMATE: | ../obj
	rm -rf ../obj/.STAMP.WITH*_SRLMATE
	touch $@

.PHONY: test
test:
	$(MAKE) -C ../test MATEPLUS_PATH=$(MATEPLUS_PATH)

../dep ../obj ../bin:
	mkdir -p $@

.PHONY: clean cleaner clean.mplsconfig clean.test
clean: clean.mplsconfig
	$(RM) -r ../bin
	$(RM) *~

clean.mplsconfig:
	$(RM) -f en.mplsconfig de.mplsconfig es.mplsconfig zh.mplsconfig

clean.test:
	$(MAKE) -C ../test clean

cleaner: clean clean.test
	$(RM) -r ../dep ../obj

-include $(wildcard ../dep/*.d)