Interactive Sentence Alignment (ISA) - A Short User Guide
Introduction
ISA is a PHP based web interface for interactive sentence alignment of
parallel XML documents. It uses as the backend the length-based Gale&Church
approach to sentence alignment but it can be used for manual alignment. The
basic idea is to use the interface for
- adding hard boundaries to improve quality and performance of the automatic
alignment
- correcting existing alignments by removing/adding new segment boundaries
The interface allows you to work only on small portions of the document or the
entire document. Alignment results can be saved (if not disabled) or sent via
e-mail (if not disabled) in various formats (XCES align with pointers to
external sentence IDs, plain text format or simple TMX).
PHP is a server-side scripting language. Hence, the documents to be aligned are
stored at the server running the scripts.
Currently, the location of the documents to be aligned are hard-coded in the
script (in the config.php file). An upload form could easily be added (but then
we would need some form of protection).
The documents may contain any kind of markup (it has to be valid XML though)
but they must contain sentence boundary tags with unique id
attributes (e.g. <s id="S1.1">)!
Getting started
ISA shows the source document on the left-had side and the target document in
the right-had side of the blue window, sentence by sentence. Both documents are
split into segments (sequences of one or more sentences). These segments are
aligned to each other one by one starting with the top-most segment in each
document. ISA only shows only parts of the document. As default it shows about
20 sentences for each document (depending on the segmentation). ISA breaks at
1.5*limit sentences if there is no boundary found. You can go to
the next page using the links in the upper-right corner. You can also change
the number of sentences to be shown (10, 20, 50 or all).
If you start the interface for the first time, ISA will create some
internal files from both documents (this may take some time depending on the
size of the documents). The script looks for structural XML tags (all tags
higher than the sentence boundary tags) in both documents and tries to find the
best one for breaking them into initial segments. It basically uses the most
frequent XML tag that occurs the same number of times in both documents. No
segmentation is done if there is no such tag.
Now, you can do the following things:
Adding and removing segment boundaries
Adding boundaries is simply done by clicking on a sentence that should
start a
new segment. Naturally, you cannot do this for the first sentence of an
existing segment. Sentence for which this is possible are highlighted with green
when you move the mouse over the sentence string. Sentence boundaries are
always added in front of the selected sentence! Wait until the screen is
re-built again. All segments will be aligned in their order, i.e. adding
boundaries has consequences for all following segments in the document!
Removing existing boundaries is done in a similar way. Click on the
first sentence in a segment and the boundary in front of it will
disappear. These sentences will be highlighted with red and the entire segment
will be merged with the previous one. This also has consequences for all
following segments in the document.
The changes in segment alignment might feel a little bit confusing in the
beginning. However, many sentence alignment mistakes are due to follow-up
errors. Adding or removing boundaries may put things right for a large portion
following in the document!
Adding XML tag boundaries
You may use any XML tag from the original document that is above and
including the sentence boundary markup for adding additional segment
boundaries. Select the XML tag you like to use from the selection box in the
last row of the form in the upper-right corner of the window. Segment
boundaries according to the markup in the document will be added
immediately after selecting a new tag from the box. Note that boundaries
are added to the existing ones! No segment boundary will disappear,
regardless if it was added manually or automatically! The only way to remove
all existing boundaries and to use only one specific XML tag for segmentation
is to select an XML tag and to press on the
reset button afterwards.
The cognate filter
Another way of adding segmentation boundaries automatically is to use a "cognate
filter". Many names and related words are spelled the same in different
languages. This can be used to adjusted the segmentation of the document. ISA
implements a simple filter using a sliding window approach to find identical
words. It adds new boundaries before each sentence pair containing such
identical words. ISA uses a "sliding-window approach and prefers sentence
pairs with similar distance from the last segment boundary. You can adjust the
size of the sliding window (meaning the maximum distance between the two
sentences from the last boundary) and the length threshold of cognates (meaning
the minimum number of characters in words to be checked). The script uses a
length threshold of 5 characters and a window of 10 sentence as default.
Again, boundaries are added to existing ones. No boundary will disappear
by this approach!
The reset button
You can always go back to the initial segmentation by pressing the reset button
in the form in the upper-right corner. Note that all existing boundaries
for the entire document will be lost (if you didn't save them before)!
The segmentation will be initialized using the currently selected XML tag
(look at the selection box to the left of the reset button).
Run the automatic sentence aligner
The automatic sentence aligner is called if you click on the align button in
the form in the upper-right corner. ISA runs the aligner only for the
part of the document currently shown on the screen. All segment boundaries are
used as "hard boundaries", meaning that the sentence alignment will re-start at
these boundaries. In other words, sentences from each source document segment
will be aligned with sentences in the corresponding target segment.
Note that running the aligner is limited to 5 seconds. The call to the
external program will be killed if this limit is exceeded!
Add or remove empty sentence alignments
Sometimes a sentence is not translated at all. In that case you can click on
the sentence ID next to the sentence to align that sentence with
nothing. Similarily, you can click on the sentence ID to remove existing empty
alignments. This works also for blocks larger than 1 sentence. Simply use one
of the sentence IDs in the block.
Sending alignment results
Sentence alignment results can be sent to you via e-mail. You can choose
between 3 different formats:
- XCES align (the alignment for the entire document will be sent)
- TMX (only the currently shown part will be sent)
- plain text (only the currently shown part will be sent)
Select the format you like and type your e-mail address in the box to the left
of the 'mail' button. Press 'mail' to send the data.
(This function may be disabled in the script!)
Save alignment results
You can save the current alignment into a local file on the server. The
alignment of the entire file will be saved in XCES align format and may be
re-loaded for later modification. There is only
one alignment file that
will be overwritten each time somebody presses 'save'!!!
Re-load previous alignments
Previous alignments saved in a local file may be re-loaded
in the same way as XML tags are used for
segmentation. Select the 'link' tag
from the XML tag selection box and the sentence alignment boundaries from the
saved alignment file will be added to the existing boundaries.
Reset the
document using the XML root tag to use only link-tags for segmentation.
-
tiedeman@let.rug.nl