word_tokenize

underthesea.word_tokenize.word_tokenize(sentence, format=None)

Vietnamese word segmentation

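Parameters:	sentence ({unicode, str}) – raw sentence
	format ({None, 'text'}) – output format; None returns a list of words, 'text' returns a single string in which the syllables of each word are joined by underscores
Returns:	tokens – segmented sentence
Return type:	list of str, or str when format='text'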

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = "Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư"

>>> word_tokenize(sentence)
['Bác sĩ', 'bây giờ', 'có thể', 'thản nhiên', 'báo tin', 'bệnh nhân', 'bị', 'ung thư']

>>> word_tokenize(sentence, format="text")
'Bác_sĩ bây_giờ có_thể thản_nhiên báo_tin bệnh_nhân bị ung_thư'
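
The two outputs above carry the same segmentation: in the text form, the syllables of each word are joined by an underscore and words are separated by spaces. A minimal sketch of that conversion, using the sample output as literal data. `tokens_to_text` is a hypothetical helper written for illustration, not part of underthesea, and it may not match how the library builds the string internally.

```python
def tokens_to_text(tokens):
    """Join the syllables of each word with '_' and separate words with spaces."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Sample list output copied from the example above.
tokens = ['Bác sĩ', 'bây giờ', 'có thể', 'thản nhiên',
          'báo tin', 'bệnh nhân', 'bị', 'ung thư']

text = tokens_to_text(tokens)
print(text)
```

The underscore form is convenient when a downstream tool expects whitespace-separated tokens.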

pos_tag

underthesea.pos_tag.pos_tag(sentence, format=None)

Vietnamese POS tagging

Parameters:	sentence ({unicode, str}) – raw sentence
Returns:	tokens – tagged sentence
Return type:	list of (word, POS tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> sentence = "Chợ thịt chó nổi tiếng ở TPHCM bị truy quét"
>>> pos_tag(sentence)
[('Chợ', 'N'),
('thịt', 'N'),
('chó', 'N'),
('nổi tiếng', 'A'),
('ở', 'E'),
('TPHCM', 'Np'),
('bị', 'V'),
('truy quét', 'V')]
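
Because the result is a plain list of (word, tag) tuples, it can be filtered with ordinary list operations. A minimal sketch that keeps only the words carrying selected tags, using the sample output as literal data. `words_with_tags` is a hypothetical helper, not part of underthesea.

```python
def words_with_tags(tagged, wanted):
    """Return the words whose POS tag is in the `wanted` set, in sentence order."""
    return [word for word, tag in tagged if tag in wanted]


# Sample output copied from the example above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('TPHCM', 'Np'), ('bị', 'V'), ('truy quét', 'V')]

nouns = words_with_tags(tagged, {'N', 'Np'})
verbs = words_with_tags(tagged, {'V'})
print(nouns)
print(verbs)
```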

chunking

underthesea.chunking.chunk(sentence, format=None)

Vietnamese chunking

Parameters:	sentence ({unicode, str}) – raw sentence
Returns:	tokens – tagged sentence
Return type:	list of (word, POS tag, chunk tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> sentence = "Nghi vấn 4 thi thể Triều Tiên trôi dạt bờ biển Nhật Bản"
>>> chunk(sentence)
[('Nghi vấn', 'N', 'B-NP'),
('4', 'M', 'B-NP'),
('thi thể', 'N', 'B-NP'),
('Triều Tiên', 'Np', 'B-NP'),
('trôi dạt', 'V', 'B-VP'),
('bờ biển', 'N', 'B-NP'),
('Nhật Bản', 'Np', 'B-NP')]
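
The tuples map naturally onto a column-per-annotation text layout (one token per line, fields separated by tabs), which many sequence-labelling tools read. A minimal sketch of that export, using the sample output as literal data. `to_columns` is a hypothetical helper, not part of underthesea.

```python
def to_columns(rows):
    """Render one token per line, with the tuple fields separated by tabs."""
    return "\n".join("\t".join(row) for row in rows)


# Sample output copied from the example above.
chunks = [('Nghi vấn', 'N', 'B-NP'), ('4', 'M', 'B-NP'),
          ('thi thể', 'N', 'B-NP'), ('Triều Tiên', 'Np', 'B-NP'),
          ('trôi dạt', 'V', 'B-VP'), ('bờ biển', 'N', 'B-NP'),
          ('Nhật Bản', 'Np', 'B-NP')]

column_text = to_columns(chunks)
print(column_text)
```

Words that contain a space, such as 'Nghi vấn', stay intact because the tab is the only field separator.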

ner

underthesea.ner.ner(sentence, format=None)

Locate and classify named entities in text

Parameters:	sentence ({unicode, str}) – raw sentence
Returns:	tokens – tagged sentence
Return type:	list of (word, POS tag, chunk tag, NER tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> sentence = "Ông Putin ca ngợi những thành tựu vĩ đại của Liên Xô"
>>> ner(sentence)
[('Ông', 'Nc', 'B-NP', 'O'),
('Putin', 'Np', 'B-NP', 'B-PER'),
('ca ngợi', 'V', 'B-VP', 'O'),
('những', 'L', 'B-NP', 'O'),
('thành tựu', 'N', 'B-NP', 'O'),
('vĩ đại', 'A', 'B-AP', 'O'),
('của', 'E', 'B-PP', 'O'),
('Liên Xô', 'Np', 'B-NP', 'B-LOC')]
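
The NER tag is the last field of each tuple, with 'O' marking tokens outside any entity. A minimal sketch that collects the entities from the sample output above. It assumes the tags follow the usual B-/I- convention, in which an 'I-' tag continues the preceding entity of the same type; the sample only shows 'B-' tags, so that part is an assumption. `extract_entities` is a hypothetical helper, not part of underthesea.

```python
def extract_entities(rows):
    """Collect (entity text, entity type) pairs from B-/I-/O tagged rows."""
    entities = []
    words, etype = [], None
    for word, _pos, _chunk, tag in rows:
        if tag.startswith("B-"):
            # A new entity starts; flush the one in progress, if any.
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            # Continuation of the current entity (assumed convention).
            words.append(word)
        else:
            # 'O' or an unmatched 'I-' tag ends the current entity.
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


# Sample output copied from the example above.
rows = [('Ông', 'Nc', 'B-NP', 'O'), ('Putin', 'Np', 'B-NP', 'B-PER'),
        ('ca ngợi', 'V', 'B-VP', 'O'), ('những', 'L', 'B-NP', 'O'),
        ('thành tựu', 'N', 'B-NP', 'O'), ('vĩ đại', 'A', 'B-AP', 'O'),
        ('của', 'E', 'B-PP', 'O'), ('Liên Xô', 'Np', 'B-NP', 'B-LOC')]

entities = extract_entities(rows)
print(entities)
```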

classify

Install the dependencies and download the default model:

$ pip install Cython
$ pip install future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data

underthesea.classification.classify(X, domain=None)

Text classification

Parameters:	X ({unicode, str}) – raw sentence
	domain ({None, 'bank'}) – domain of the text; None: general domain, 'bank': bank domain
Returns:	tokens – categories of the sentence
Return type:	list (the bank-domain example below returns a tuple)

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import classify
>>> sentence = "HLV ngoại đòi gần tỷ mỗi tháng dẫn dắt tuyển Việt Nam"
>>> classify(sentence)
['The thao']

>>> sentence = "Tôi rất thích cách phục vụ của nhân viên BIDV"
>>> classify(sentence, domain='bank')
('CUSTOMER SUPPORT',)
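
The two examples above return different container types: a list for the general domain and a tuple for the bank domain. Code that handles both may want to normalize the result first. A minimal sketch, using the sample outputs as literal data. `normalize_labels` is a hypothetical helper, not part of underthesea, and the handling of `None` and bare strings is a defensive assumption rather than documented behavior.

```python
def normalize_labels(result):
    """Return the predicted labels as a list, whatever container they came in."""
    if result is None:
        return []
    if isinstance(result, str):
        # Guard against a single bare label being split into characters.
        return [result]
    return list(result)


# Sample outputs copied from the examples above.
general = normalize_labels(['The thao'])
bank = normalize_labels(('CUSTOMER SUPPORT',))
print(general, bank)
```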

sentiment

Install dependencies

$ pip install future scipy numpy scikit-learn==0.19.0 joblib

underthesea.sentiment.sentiment(X, domain=None)

Sentiment Analysis

Parameters:	X ({unicode, str}) – raw sentence
	domain ({'bank'}) – domain of the text; 'bank': bank domain
Returns:	tokens – sentiment of the sentence
Return type:	list (the example below returns a tuple)

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import sentiment
>>> sentence = "Vừa smartbidv, vừa bidv online mà lại k dùng chung 1 tài khoản đăng nhập, rắc rối!"
>>> sentiment(sentence, domain='bank')
('INTERNET BANKING#NEGATIVE',)
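
The label in the sample output looks like an aspect and a polarity joined by '#'. A minimal sketch that splits such labels, under the assumption that every bank-domain label follows this ASPECT#POLARITY pattern; the source shows only one example, so treat the format as inferred. `split_label` is a hypothetical helper, not part of underthesea.

```python
def split_label(label):
    """Split 'ASPECT#POLARITY' on the last '#'; polarity is None if '#' is absent."""
    aspect, sep, polarity = label.rpartition("#")
    if not sep:
        return label, None
    return aspect, polarity


# Sample output copied from the example above.
labels = ('INTERNET BANKING#NEGATIVE',)

pairs = [split_label(label) for label in labels]
print(pairs)
```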