Developer Interface

word_tokenize

underthesea.word_tokenize(sentence, format=None, use_token_normalize=True, fixed_words=[])

Vietnamese word segmentation

Parameters:

sentence (str) – raw sentence
format (str, optional) – output format. Defaults to None, which returns a list of tokens; use format="text" to return a single string
use_token_normalize (bool) – whether to apply token normalization. Default: True
fixed_words (list) – list of fixed words

Returns:	word tokens
Return type:	`list` of `str`, or `str` when format="text"

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = "Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư"

>>> word_tokenize(sentence)
['Bác sĩ', 'bây giờ', 'có thể', 'thản nhiên', 'báo tin', 'bệnh nhân', 'bị', 'ung thư']

>>> word_tokenize(sentence, format="text")
'Bác_sĩ bây_giờ có_thể thản_nhiên báo_tin bệnh_nhân bị ung_thư'
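
The two output formats carry the same segmentation: in text format the syllables of each word are joined with underscores and the words are separated by spaces. A minimal sketch of converting between them follows. It uses the output shown above as literal data, so it runs without underthesea installed; `tokens_to_text` and `text_to_tokens` are hypothetical helper names, not part of the library.

```python
def tokens_to_text(tokens):
    """Join the syllables of each token with "_" and separate tokens with spaces."""
    return " ".join(token.replace(" ", "_") for token in tokens)


def text_to_tokens(text):
    """Inverse conversion: split on whitespace, then restore spaces inside each word."""
    return [word.replace("_", " ") for word in text.split()]


# Literal copy of the list output documented above (not a live library call).
tokens = ["Bác sĩ", "bây giờ", "có thể", "thản nhiên",
          "báo tin", "bệnh nhân", "bị", "ung thư"]

text = tokens_to_text(tokens)
print(text)

# The round trip recovers the original token list.
assert text_to_tokens(text) == tokens
```

The round trip is only lossless when no token contains a literal underscore, so the list format is the safer one to pass between processing steps.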

pos_tag

underthesea.pos_tag(sentence, format=None, model=None)

chunking

underthesea.chunk(sentence, format=None)

Vietnamese chunking

Parameters:	sentence ({unicode, str}) – raw sentence
Returns:	tokens – tagged sentence
Return type:	list of (word, POS tag, chunk tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> sentence = "Nghi vấn 4 thi thể Triều Tiên trôi dạt bờ biển Nhật Bản"
>>> chunk(sentence)
[('Nghi vấn', 'N', 'B-NP'),
('4', 'M', 'B-NP'),
('thi thể', 'N', 'B-NP'),
('Triều Tiên', 'Np', 'B-NP'),
('trôi dạt', 'V', 'B-VP'),
('bờ biển', 'N', 'B-NP'),
('Nhật Bản', 'Np', 'B-NP')]
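
Each element of the result is a (word, POS tag, chunk tag) triple, so ordinary tuple unpacking is enough to post-process it. The sketch below groups words by chunk type and picks out proper nouns, using the documented output as literal data rather than a live call; `words_by_chunk_type` is a hypothetical helper name, not part of underthesea.

```python
from collections import defaultdict


def words_by_chunk_type(tagged):
    """Map each chunk type (e.g. "NP") to the words tagged with it, in sentence order."""
    groups = defaultdict(list)
    for word, _pos, chunk_tag in tagged:
        # "B-NP" -> "NP"; a bare "O" tag has no prefix and is kept as-is.
        chunk_type = chunk_tag.split("-", 1)[-1]
        groups[chunk_type].append(word)
    return dict(groups)


# Literal copy of the chunk(sentence) output documented above.
tagged = [
    ("Nghi vấn", "N", "B-NP"),
    ("4", "M", "B-NP"),
    ("thi thể", "N", "B-NP"),
    ("Triều Tiên", "Np", "B-NP"),
    ("trôi dạt", "V", "B-VP"),
    ("bờ biển", "N", "B-NP"),
    ("Nhật Bản", "Np", "B-NP"),
]

groups = words_by_chunk_type(tagged)
proper_nouns = [word for word, pos, _chunk in tagged if pos == "Np"]

print(groups["VP"])
print(proper_nouns)
```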

ner

underthesea.ner(sentence, format=None, deep=False)

Locate and classify named entities in text

Parameters:	sentence ({unicode, str}) – raw sentence
Returns:	tokens – tagged sentence
Return type:	list of (word, POS tag, chunk tag, NER tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> sentence = "Ông Putin ca ngợi những thành tựu vĩ đại của Liên Xô"
>>> ner(sentence)
[('Ông', 'Nc', 'B-NP', 'O'),
('Putin', 'Np', 'B-NP', 'B-PER'),
('ca ngợi', 'V', 'B-VP', 'O'),
('những', 'L', 'B-NP', 'O'),
('thành tựu', 'N', 'B-NP', 'O'),
('vĩ đại', 'A', 'B-AP', 'O'),
('của', 'E', 'B-PP', 'O'),
('Liên Xô', 'Np', 'B-NP', 'B-LOC')]
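
The last element of each tuple is the entity tag, where `O` marks a token outside any entity. A minimal sketch of collecting the entities from such a result follows. The documented example contains only `B-` tags; the handling of `I-` continuation tags is an assumption based on the usual BIO convention, and `extract_entities` is a hypothetical helper name, not part of underthesea.

```python
def extract_entities(tagged):
    """Collect (entity_text, entity_type) pairs from (word, pos, chunk, ner_tag) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, _pos, _chunk, tag in tagged:
        if tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built.
            current_words.append(word)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            # "B-XX" (or a stray "I-XX") opens a new entity of type XX.
            current_words, current_type = [word], tag[2:]
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Literal copy of the ner(sentence) output documented above.
tagged = [
    ("Ông", "Nc", "B-NP", "O"),
    ("Putin", "Np", "B-NP", "B-PER"),
    ("ca ngợi", "V", "B-VP", "O"),
    ("những", "L", "B-NP", "O"),
    ("thành tựu", "N", "B-NP", "O"),
    ("vĩ đại", "A", "B-AP", "O"),
    ("của", "E", "B-PP", "O"),
    ("Liên Xô", "Np", "B-NP", "B-LOC"),
]

entities = extract_entities(tagged)
print(entities)
```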

classify

Install dependencies and download default model

$ pip install Cython
$ pip install future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data

underthesea.classify(X, domain=None)

Text classification

Parameters:

X ({unicode, str}) – raw sentence
domain ({None, 'bank'}) – domain of text. None: general domain; 'bank': bank domain

Returns:	tokens – categories of sentence
Return type:	list

sentiment

Install dependencies

$ pip install future scipy numpy scikit-learn==0.19.2 joblib

underthesea.sentiment(X, domain='general')

Sentiment Analysis

Parameters:

X (str) – raw sentence
domain (str) – domain of text (bank or general). Default: general

Returns:

Text (Text of input sentence)
Labels (Sentiment of sentence)

Examples

>>> from underthesea import sentiment
>>> sentence = "Chuyen tiền k nhận Dc tiên"
>>> sentiment(sentence, domain='bank')
[MONEY_TRANSFER#negative (1.0)]
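
The label shown above packs an aspect and a polarity into one string separated by `#`, followed by a score in parentheses. A minimal sketch of splitting such a label into its parts follows. It assumes each label can be rendered as a string of the form shown in the example output; the exact return type of `sentiment` is not specified here, and `parse_label` is a hypothetical helper, not part of underthesea.

```python
import re

# Matches "ASPECT#polarity" with an optional trailing "(score)".
LABEL_PATTERN = re.compile(
    r"^(?P<aspect>[^#\s]+)#(?P<polarity>\w+)(?:\s+\((?P<score>\d+(?:\.\d+)?)\))?$"
)


def parse_label(label):
    """Split a label string into (aspect, polarity, score); score is None if absent."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    score = match.group("score")
    return (
        match.group("aspect"),
        match.group("polarity"),
        float(score) if score is not None else None,
    )


# String form of the label in the example output above.
aspect, polarity, score = parse_label("MONEY_TRANSFER#negative (1.0)")
print(aspect, polarity, score)
```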

viet2ipa

underthesea.pipeline.ipa.viet2ipa(text: str, *args, **kwargs)

Generate the IPA transcription of a syllable

Vietnamese syllabic structure (Anh & Trang 2022)

syllable = onset + rhyme + tone

rhyme = medial + nuclear vowel + (coda)

Parameters:

text (str) – the syllable to transcribe
dialect (str) – either "north" or "south". Default: north
eight (bool) – if True, use the eight-tone format; otherwise use the six-tone format. Default: False
tone (str) – either "ipa" or "number". Default: number

Returns:	the IPA transcription of the syllable (str)

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea.pipeline.ipa import viet2ipa
>>> viet2ipa("trồng")
tɕoŋ³²
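
To illustrate the tone component of the syllable structure described above (syllable = onset + rhyme + tone), the sketch below separates the tone mark from the rest of a written syllable using Unicode decomposition. It is an independent illustration, not how `viet2ipa` is implemented; `split_tone` and `TONE_MARKS` are hypothetical names.

```python
import unicodedata

# Combining marks that encode the five marked tones in Vietnamese orthography.
TONE_MARKS = {
    "\u0300": "huyền",  # grave accent
    "\u0301": "sắc",    # acute accent
    "\u0303": "ngã",    # tilde
    "\u0309": "hỏi",    # hook above
    "\u0323": "nặng",   # dot below
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone name); unmarked syllables are "ngang"."""
    tone = "ngang"
    kept = []
    # NFD separates base letters from combining marks, so tone marks can be filtered out
    # while vowel-quality marks (circumflex, breve, horn) are kept.
    for char in unicodedata.normalize("NFD", syllable):
        if char in TONE_MARKS:
            tone = TONE_MARKS[char]
        else:
            kept.append(char)
    return unicodedata.normalize("NFC", "".join(kept)), tone


print(split_tone("trồng"))  # ('trông', 'huyền')
```

The remaining string still holds the onset and rhyme; splitting those apart needs orthographic rules that this sketch does not attempt.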