Interactive Sentence Alignment (ISA) - A Short User Guide

Introduction

ISA is a PHP-based web interface for interactive sentence alignment of parallel XML documents. Its backend uses the length-based Gale & Church approach to sentence alignment, but the interface can also be used for purely manual alignment. The basic idea is to use the interface for

adding hard boundaries to improve quality and performance of the automatic alignment
correcting existing alignments by removing or adding segment boundaries

The interface lets you work on small portions of the document or on the entire document. Alignment results can be saved or sent via e-mail (unless these functions have been disabled) in various formats (XCES align with pointers to external sentence IDs, plain text format or simple TMX).

PHP is a server-side scripting language. Hence, the documents to be aligned are stored on the server running the scripts. Currently, the location of the documents to be aligned is hard-coded in the script (in the config.php file). An upload form could easily be added (but then we would need some form of protection).

The documents may contain any kind of markup (it has to be valid XML though) but they must contain sentence boundary tags with unique id attributes (e.g. <s id="S1.1">)!
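
Before pointing ISA at a document pair, it can be useful to check this requirement offline. The following is a minimal sketch in Python (not part of ISA); it assumes the sentence boundary tag is called "s", as in the example above, and the function name check_sentence_ids is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_sentence_ids(xml_text, sentence_tag="s"):
    """Return (ids, problems) for all sentence elements in document order."""
    # fromstring raises ParseError if the XML is not well-formed
    root = ET.fromstring(xml_text)
    ids = [el.get("id") for el in root.iter(sentence_tag)]
    problems = []
    if not ids:
        problems.append("no <%s> elements found" % sentence_tag)
    if any(i is None for i in ids):
        problems.append("sentence without id attribute")
    duplicates = sorted(i for i, n in Counter(ids).items() if i is not None and n > 1)
    if duplicates:
        problems.append("duplicate ids: " + ", ".join(duplicates))
    return ids, problems

sample = '<text><p><s id="S1.1">Hello.</s><s id="S1.2">World.</s></p></text>'
print(check_sentence_ids(sample))  # (['S1.1', 'S1.2'], [])
```

An empty problem list means the document has sentence tags and every one of them carries a unique id.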

Getting started

ISA shows the source document on the left-hand side and the target document on the right-hand side of the blue window, sentence by sentence. Both documents are split into segments (sequences of one or more sentences). These segments are aligned to each other one by one, starting with the top-most segment in each document. ISA shows only part of the document at a time. By default it shows about 20 sentences for each document (depending on the segmentation). If no segment boundary is found, ISA breaks the page at 1.5*limit sentences, where limit is the number of sentences to be shown. You can go to the next page using the links in the upper-right corner. You can also change the number of sentences to be shown (10, 20, 50 or all).
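
One plausible reading of the paging rule is sketched below in Python. This is an interpretation, not ISA's actual code: it assumes a page ends at the first segment boundary at or after limit sentences, and that the page is cut at 1.5*limit sentences if no boundary turns up before that. The function name page_end and the representation of boundaries as a set of sentence indices are assumptions made for the example.

```python
def page_end(boundaries, start, total, limit=20):
    """Return the index one past the last sentence of the page starting at `start`.

    boundaries: set of sentence indices that start a new segment.
    total:      number of sentences in the document.
    """
    hard_stop = min(total, start + int(1.5 * limit))
    for i in range(start + limit, hard_stop):
        if i in boundaries:
            return i      # end the page right before the next segment starts
    return hard_stop      # no boundary found: break at 1.5 * limit

print(page_end({0, 23, 40}, start=0, total=100))  # 23
print(page_end({0}, start=0, total=100))          # 30
```

This would explain why the number of sentences per page is only approximately 20 and depends on the segmentation.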

If you start the interface for the first time, ISA will create some internal files from both documents (this may take some time depending on the size of the documents). The script looks for structural XML tags (all tags higher than the sentence boundary tags) in both documents and tries to find the best one for breaking them into initial segments. It basically uses the most frequent XML tag that occurs the same number of times in both documents. No segmentation is done if there is no such tag.
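
The tag selection heuristic can be sketched as follows. This is a Python illustration under stated assumptions, not the PHP code used by ISA: "structural tags" are taken to be elements that contain at least one sentence element, the sentence tag is assumed to be "s", and ties between equally frequent tags are broken arbitrarily (here by tag name).

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_counts(xml_text, sentence_tag="s"):
    """Count structural tags, i.e. elements that contain at least one sentence element."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        if el.tag != sentence_tag and el.find(".//" + sentence_tag) is not None:
            counts[el.tag] += 1
    return counts

def best_segmentation_tag(src_xml, trg_xml, sentence_tag="s"):
    """Most frequent structural tag with the same count in both documents, or None."""
    src = tag_counts(src_xml, sentence_tag)
    trg = tag_counts(trg_xml, sentence_tag)
    shared = [(n, tag) for tag, n in src.items() if trg.get(tag) == n]
    return max(shared)[1] if shared else None

src = '<doc><p><s id="1">a</s></p><p><s id="2">b</s></p></doc>'
trg = '<doc><p><s id="1">a</s><s id="2">b</s></p><p><s id="3">c</s></p></doc>'
print(best_segmentation_tag(src, trg))  # p
```

In this sketch the root tag always occurs once on each side, so it is the fallback when no other tag matches; using it is equivalent to treating the whole document as one segment.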

Now, you can do the following things:

Adding and removing segment boundaries

Adding a boundary is simply done by clicking on the sentence that should start a new segment. Naturally, you cannot do this for the first sentence of an existing segment. Sentences for which this is possible are highlighted in green when you move the mouse over the sentence string. Boundaries are always added in front of the selected sentence! Wait until the screen has been rebuilt. All segments are aligned in order, i.e. adding a boundary has consequences for all following segments in the document!

Removing existing boundaries is done in a similar way. Click on the first sentence of a segment and the boundary in front of it will disappear, merging the entire segment with the previous one. Sentences for which this is possible are highlighted in red. This also has consequences for all following segments in the document.

The changes in segment alignment might feel a little bit confusing in the beginning. However, many sentence alignment mistakes are due to follow-up errors. Adding or removing boundaries may put things right for a large portion following in the document!
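
The knock-on effect of a single click can be illustrated with a small Python model. It is a sketch only: boundaries are modelled as sets of sentence indices, the function names are invented for the example, and pairing leftover segments with an empty segment is an assumption about how unmatched segments might be handled.

```python
from itertools import zip_longest

def segments(n_sentences, boundaries):
    """Split sentence indices 0..n-1 into segments; each boundary index starts a new segment."""
    starts = sorted({0} | {b for b in boundaries if 0 < b < n_sentences})
    ends = starts[1:] + [n_sentences]
    return [list(range(s, e)) for s, e in zip(starts, ends)]

def toggle_boundary(boundaries, index):
    """Clicking a sentence adds a boundary in front of it, or removes the one already there."""
    return boundaries ^ {index}

def pair_segments(src_segments, trg_segments):
    """Align segments one by one from the top; leftovers are paired with an empty segment."""
    return list(zip_longest(src_segments, trg_segments, fillvalue=[]))

src = segments(6, {3})
trg_bounds = {2}
print(pair_segments(src, segments(6, trg_bounds)))
# [([0, 1, 2], [0, 1]), ([3, 4, 5], [2, 3, 4, 5])]

trg_bounds = toggle_boundary(trg_bounds, 4)   # click on target sentence 4
print(pair_segments(src, segments(6, trg_bounds)))
# [([0, 1, 2], [0, 1]), ([3, 4, 5], [2, 3]), ([], [4, 5])]
```

Because segments are paired strictly in order, the one new boundary changes the partner of every segment that follows it.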

Adding XML tag boundaries

You may use any XML tag from the original document, from the sentence boundary markup upwards, for adding additional segment boundaries. Select the XML tag you would like to use from the selection box in the last row of the form in the upper-right corner of the window. Segment boundaries according to the markup in the document will be added immediately after selecting a new tag from the box. Note that these boundaries are added to the existing ones! No segment boundary will disappear, regardless of whether it was added manually or automatically. The only way to remove all existing boundaries and to use only one specific XML tag for segmentation is to select that XML tag and press the reset button afterwards.

The cognate filter

Another way of adding segment boundaries automatically is to use a "cognate filter". Many names and related words are spelled the same in different languages. This can be used to adjust the segmentation of the document. ISA implements a simple filter that uses a sliding-window approach to find identical words and adds a new boundary before each sentence pair containing such words. It prefers sentence pairs with a similar distance from the last segment boundary. You can adjust the size of the sliding window (meaning the maximum distance between the two sentences from the last boundary) and the length threshold for cognates (meaning the minimum number of characters in words to be checked). By default, the script uses a length threshold of 5 characters and a window of 10 sentences.
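
A rough Python sketch of such a filter is given below. It is a simplified interpretation, not ISA's implementation: it assumes both sentences must lie within the window, counted from the last boundary; it matches words case-sensitively; and it stops as soon as no match is found inside the window. All function names are invented for the example.

```python
import re

def cognates(sentence, min_len=5):
    """Words of at least min_len characters, used as candidate cognates."""
    return {w for w in re.findall(r"\w+", sentence) if len(w) >= min_len}

def cognate_boundaries(src, trg, window=10, min_len=5):
    """Return (source_index, target_index) pairs in front of which a boundary would be added."""
    src_words = [cognates(s, min_len) for s in src]
    trg_words = [cognates(t, min_len) for t in trg]
    result = []
    last_s = last_t = 0   # position of the last boundary on each side
    while True:
        candidates = [
            (abs(i - j), i + j, i, j)
            for i in range(1, window + 1) if last_s + i < len(src)
            for j in range(1, window + 1) if last_t + j < len(trg)
            if src_words[last_s + i] & trg_words[last_t + j]
        ]
        if not candidates:
            break         # simplification: give up when the window holds no match
        _, _, i, j = min(candidates)   # most similar distance first, then the nearest pair
        last_s, last_t = last_s + i, last_t + j
        result.append((last_s, last_t))
    return result

src = ["The meeting was held in Strasbourg.", "It was long.",
       "President Barroso spoke first.", "Then it ended."]
trg = ["Mötet hölls i Strasbourg.", "Det var långt.",
       "Ordförande Barroso talade först.", "Sedan slutade det."]
print(cognate_boundaries(src, trg))  # [(2, 2)]
```

Here "Barroso" is the only identical word of at least 5 characters after the first sentence, so one boundary is proposed in front of the third sentence on both sides.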

Again, boundaries are added to existing ones. No boundary will disappear by this approach!

The reset button

You can always go back to the initial segmentation by pressing the reset button in the form in the upper-right corner. Note that all existing boundaries for the entire document will be lost (if you didn't save them before)! The segmentation will be initialized using the currently selected XML tag (look at the selection box to the left of the reset button).

Run the automatic sentence aligner

The automatic sentence aligner is called if you click on the align button in the form in the upper-right corner. ISA runs the aligner only for the part of the document currently shown on the screen. All segment boundaries are used as "hard boundaries", meaning that the sentence alignment will re-start at these boundaries. In other words, sentences from each source document segment will be aligned with sentences in the corresponding target segment.

Note that running the aligner is limited to 5 seconds. The call to the external program will be killed if this limit is exceeded!
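
The effect of such a time limit can be reproduced in Python as sketched below. This only illustrates the idea of a hard time limit on an external call; ISA itself is written in PHP, and the commands used here are stand-ins, since the aligner binary and its arguments are not specified in this guide.

```python
import subprocess
import sys

def run_aligner(command, time_limit=5.0):
    """Run an external command; return its stdout, or None if it exceeded the time limit."""
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=time_limit)
    except subprocess.TimeoutExpired:
        return None   # subprocess.run kills the child process before raising
    return done.stdout

# Stand-in commands (hypothetical; not the real aligner).
fast = [sys.executable, "-c", "print('ok')"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_aligner(fast))
print(run_aligner(slow, time_limit=0.5))
```

If the aligner is killed, no new alignment is produced for the current page; adding a few hard boundaries first reduces the work the aligner has to do.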

Add or remove empty sentence alignments

Sometimes a sentence is not translated at all. In that case you can click on the sentence ID next to the sentence to align that sentence with nothing. Similarly, you can click on the sentence ID to remove an existing empty alignment. This also works for blocks larger than one sentence. Simply use one of the sentence IDs in the block.

Sending alignment results

Sentence alignment results can be sent to you via e-mail. You can choose between 3 different formats:

XCES align (the alignment for the entire document will be sent)
TMX (only the currently shown part will be sent)
plain text (only the currently shown part will be sent)

Select the format you like and type your e-mail address in the box to the left of the 'mail' button. Press 'mail' to send the data.

(This function may be disabled in the script!)

Save alignment results

You can save the current alignment into a local file on the server. The alignment of the entire file will be saved in XCES align format and may be re-loaded for later modification. Note that there is only one alignment file, and it is overwritten each time somebody presses 'save'!

Re-load previous alignments

Previous alignments saved in a local file may be re-loaded in the same way as XML tags are used for segmentation. Select the 'link' tag from the XML tag selection box and the sentence alignment boundaries from the saved alignment file will be added to the existing boundaries. Reset the document using the XML root tag to use only link-tags for segmentation.
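
For processing a saved alignment file outside ISA, the link tags can be read with a few lines of Python. The sketch below rests on an assumption about the file layout that this guide does not spell out: each link element is taken to carry an xtargets attribute listing source and target sentence IDs, separated by a semicolon, as is common in XCES align files. The sample document, including the file names in it, is invented for illustration.

```python
import xml.etree.ElementTree as ET

def read_links(xces_text):
    """Return a list of (source_ids, target_ids) pairs, one per <link> element."""
    links = []
    for link in ET.fromstring(xces_text).iter("link"):
        # assumed layout: "srcId srcId;trgId trgId"
        src, _, trg = link.get("xtargets", "").partition(";")
        links.append((src.split(), trg.split()))
    return links

sample = """<cesAlign version="1.0">
 <linkGrp fromDoc="source.xml" toDoc="target.xml">
  <link xtargets="S1.1;S1.1" />
  <link xtargets="S1.2 S1.3;S1.2" />
  <link xtargets="S1.4;" />
 </linkGrp>
</cesAlign>"""

for src_ids, trg_ids in read_links(sample):
    print(src_ids, "<->", trg_ids)
```

Under this assumption, a link with an empty side, such as the last one in the sample, would correspond to an empty sentence alignment.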

- tiedeman@let.rug.nl