A wishlist for parallel corpora tools
Pluczek
Flexibility
Support for more input and output formats. The most wanted are: ParaConc, CWB, Parasol, InterCorp, TMX. Conversion from one format to another.
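As an illustration of what a TMX export could look like, here is a minimal sketch using only the Python standard library. The function name `pairs_to_tmx` and the tool name in the header are hypothetical; the sketch assumes the aligned text is already available as a list of (source, target) sentence pairs and is not based on Pluczek's actual code.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:lang attribute used on <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "pluczek-sketch",  # hypothetical tool name
        "creationtoolversion": "0.0",      # placeholder version
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(pairs_to_tmx([("Hello.", "Cześć.")], "en", "pl"))
```

Keeping the internal representation as plain sentence pairs would let each additional format (ParaConc, CWB and so on) be added as a separate reader or writer around the same data.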
Functionality
- implementation of a sentence-splitting module, preferably language-specific, with the possibility for the user to modify it depending on the genre of the text. (This could also be developed as an independent module to be used with UGTag and other programs.)
- exposing more of Hunalign's functionality, with a description of each feature and an explanation of the consequences of using it (this will most likely require contacting the Hunalign developers, as the tool does not seem to be well documented).
- (optional) enabling the user to modify Hunalign's heuristics
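The sentence-splitting wish above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the class name `SentenceSplitter`, the abbreviation lists and the boundary rule (sentence-final punctuation followed by whitespace) are all illustrative, and a real module would need far richer, language-specific rules.

```python
import re

# Illustrative, deliberately tiny abbreviation lists (assumed, not exhaustive).
DEFAULT_ABBREVIATIONS = {
    "en": {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."},
    "pl": {"np.", "prof.", "ul.", "tzn."},
}


class SentenceSplitter:
    """Rule-based splitter with a user-extensible abbreviation list."""

    def __init__(self, lang, extra_abbreviations=()):
        self.abbreviations = set(DEFAULT_ABBREVIATIONS.get(lang, ()))
        # Genre-specific additions supplied by the user, e.g. "Fig." for papers.
        self.abbreviations.update(a.lower() for a in extra_abbreviations)

    def split(self, text):
        sentences, start = [], 0
        # Candidate boundary: ., ! or ? followed by whitespace.
        for match in re.finditer(r"[.!?]+\s+", text):
            candidate = text[start:match.end()].strip()
            last_token = candidate.split()[-1].lower()
            if last_token in self.abbreviations:
                continue  # not a real boundary, keep scanning
            sentences.append(candidate)
            start = match.end()
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


if __name__ == "__main__":
    splitter = SentenceSplitter("en", extra_abbreviations=["Fig."])
    print(splitter.split("See Fig. 3 for details. Dr. Smith agrees."))
```

Because the splitter takes only a language code and a list of extra abbreviations, it has no dependency on the aligner and could be reused from UGTag or other programs.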
Robustness
- The program crashes when it encounters problems, and the error reports are illegible. It should fail gracefully and report errors in a readable form.
- Hunalign sometimes cuts off the end of the text. Pluczek should try to restore the lost pieces. Some testing and visual reports could be helpful.
- Visualization of the texts before and after alignment would help in evaluating how Hunalign performs and in adjusting the configuration accordingly.
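A simple test for text lost during alignment could look like the sketch below. It assumes that the original sentences and the aligned segments for one language are already available as lists of strings; the function name `find_lost_sentences` is hypothetical, and parsing of Hunalign's output is deliberately left out.

```python
def find_lost_sentences(original_sentences, aligned_segments):
    """Return (index, sentence) for each original sentence missing after alignment."""

    def normalize(text):
        # Collapse whitespace so merged or re-wrapped segments still match.
        return " ".join(text.split())

    aligned_text = normalize(" ".join(aligned_segments))
    return [
        (index, sentence)
        for index, sentence in enumerate(original_sentences)
        if normalize(sentence) not in aligned_text
    ]


if __name__ == "__main__":
    original = ["A first one.", "A second one.", "The last one."]
    aligned = ["A first one. A second one."]  # the tail was cut off
    for index, sentence in find_lost_sentences(original, aligned):
        print(f"lost sentence {index}: {sentence}")
```

The returned indices show where the loss occurred, so the missing pieces could be appended back or highlighted in a visual report.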
Major
- cross-platform support
- localization to more languages
UGTag related features
Adapting UGTag to work with standoff TEI-based corpora.
Extending UGTag to use LanguageTool's disambiguation and Jarek Lipski's SRX library for sentence splitting.