vnTokenizer is an automatic tokenizer that segments Vietnamese text into lexical units. It is written in Java, so it runs on any platform with a Java runtime. Its segmentation precision and recall are both in the range of 96% to 98%.
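To make the precision and recall figures concrete, here is a minimal sketch of how these two ratios are commonly computed for word segmentation. It is not vnTokenizer's own evaluation code. It assumes that the syllables of a multi-syllable word are joined with an underscore, and the function names `to_spans` and `precision_recall` are hypothetical, chosen only for this illustration.

```python
def to_spans(words):
    """Map a segmentation to a set of (start, end) syllable-index spans.

    Assumption for this sketch: syllables inside one word are joined by "_".
    """
    spans = set()
    start = 0
    for word in words:
        n_syllables = len(word.split("_"))
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted segmentation against a gold one.

    A predicted word counts as correct only if a gold word covers exactly
    the same syllable span.
    """
    pred_spans = to_spans(predicted)
    gold_spans = to_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = ["học_sinh", "học", "sinh_học"]
    predicted = ["học_sinh", "học", "sinh", "học"]
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy example the predicted segmentation matches two of the three gold words and produces four words in total, so precision is 2/4 and recall is 2/3.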
NEW: In May 2016, vnTokenizer 5.0 was released as part of Vitk, a new toolkit designed to process very large text data sets. Vitk runs on Apache Spark, a fast cluster computing platform. It is available at https://github.com/phuonglh/vn.vitk
If you use vnTokenizer in a publication, please cite our article "A hybrid approach to word segmentation of Vietnamese texts". vnTokenizer is integrated into vnTagger.