Developer Interface

word_tokenize

underthesea.word_tokenize(sentence, format=None, use_token_normalize=True, fixed_words=[])[source]

Vietnamese word segmentation

Parameters:
  • sentence (str) – raw sentence
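  • format (str, optional) – output format. Defaults to None, which returns a list of tokens; use format="text" to return a single string in which the syllables of each word are joined by underscores
  • use_token_normalize (bool) – if True, apply token normalization. Defaults to True
  • fixed_words (list) – list of words that should always be kept as single tokens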
Returns:

word tokens

Return type:

list of str (a single str when format="text")

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = "Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư"
>>> word_tokenize(sentence)
["Bác sĩ", "bây giờ", "có thể", "thản nhiên", "báo tin", "bệnh nhân", "bị", "ung thư"]
>>> word_tokenize(sentence, format="text")
"Bác_sĩ bây_giờ có_thể thản_nhiên báo_tin bệnh_nhân bị ung_thư"
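
Judging from the example above, the two output formats seem to differ only in how the syllables of each word are joined. The following is a minimal sketch of that relationship in plain Python. The helper name `to_text_format` is hypothetical and is not part of underthesea; the token list is copied from the example.

```python
def to_text_format(tokens):
    """Join the syllables of each token with "_" and the tokens with spaces.

    Hypothetical helper for illustration only; it mirrors the
    format="text" output shown above but is not an underthesea function.
    """
    return " ".join(token.replace(" ", "_") for token in tokens)


# Token list copied from the word_tokenize example above.
tokens = ["Bác sĩ", "bây giờ", "có thể", "thản nhiên",
          "báo tin", "bệnh nhân", "bị", "ung thư"]

text = to_text_format(tokens)
print(text)
```

The reverse direction (splitting on spaces, then replacing "_" with a space) recovers the token list, which can be handy when downstream tools expect whitespace-delimited input.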

pos_tag

underthesea.pos_tag(sentence, format=None, model=None)[source]

chunk

underthesea.chunk(sentence, format=None)[source]

Vietnamese chunking

Parameters: sentence ({unicode, str}) – raw sentence
Returns: tokens – tagged sentence
Return type: list of (word, POS tag, chunk tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> sentence = "Nghi vấn 4 thi thể Triều Tiên trôi dạt bờ biển Nhật Bản"
>>> chunk(sentence)
[('Nghi vấn', 'N', 'B-NP'),
('4', 'M', 'B-NP'),
('thi thể', 'N', 'B-NP'),
('Triều Tiên', 'Np', 'B-NP'),
('trôi dạt', 'V', 'B-VP'),
('bờ biển', 'N', 'B-NP'),
('Nhật Bản', 'Np', 'B-NP')]
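
Because each item is a plain (word, POS tag, chunk tag) tuple, the result can be filtered with ordinary Python. The sketch below selects words by chunk label; `words_with_chunk` is a hypothetical helper, not an underthesea function, and the data is copied from the example above.

```python
def words_with_chunk(tagged, label):
    """Return the words whose chunk tag carries the given label, e.g. "NP".

    Hypothetical helper for illustration. The "B-" style prefix is
    stripped before comparing, so "B-NP" matches the label "NP".
    """
    return [word for word, _pos, chunk_tag in tagged
            if chunk_tag.split("-", 1)[-1] == label]


# Output copied from the chunk example above.
tagged = [('Nghi vấn', 'N', 'B-NP'),
          ('4', 'M', 'B-NP'),
          ('thi thể', 'N', 'B-NP'),
          ('Triều Tiên', 'Np', 'B-NP'),
          ('trôi dạt', 'V', 'B-VP'),
          ('bờ biển', 'N', 'B-NP'),
          ('Nhật Bản', 'Np', 'B-NP')]

noun_phrase_words = words_with_chunk(tagged, "NP")
verb_phrase_words = words_with_chunk(tagged, "VP")
print(noun_phrase_words)
print(verb_phrase_words)
```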

ner

underthesea.ner(sentence, format=None, deep=False)[source]

Locate and classify named entities in text

Parameters: sentence ({unicode, str}) – raw sentence
Returns: tokens – tagged sentence
Return type: list of (word, POS tag, chunk tag, NER tag) tuples

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> sentence = "Ông Putin ca ngợi những thành tựu vĩ đại của Liên Xô"
>>> ner(sentence)
[('Ông', 'Nc', 'B-NP', 'O'),
('Putin', 'Np', 'B-NP', 'B-PER'),
('ca ngợi', 'V', 'B-VP', 'O'),
('những', 'L', 'B-NP', 'O'),
('thành tựu', 'N', 'B-NP', 'O'),
('vĩ đại', 'A', 'B-AP', 'O'),
('của', 'E', 'B-PP', 'O'),
('Liên Xô', 'Np', 'B-NP', 'B-LOC')]
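
The NER tags in this output look like the common BIO scheme: "B-" opens an entity and "O" marks a word outside any entity. The sketch below collects entities from such tuples. It assumes that multi-word entities would continue with an "I-" prefix, which the example above does not show; `extract_entities` is a hypothetical helper, not part of underthesea.

```python
def extract_entities(tagged):
    """Merge B-/I- NER tags into (entity text, entity type) pairs.

    Hypothetical helper; assumes BIO-style tags in the fourth position
    of each tuple.
    """
    entities = []
    current_words, current_type = [], None
    for word, _pos, _chunk, ner_tag in tagged:
        prefix, _, entity_type = ner_tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        if current_words:
            # Any other tag closes the open entity.
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            # A new entity starts here (an unmatched "I-" is treated as a start).
            current_words, current_type = [word], entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Output copied from the ner example above.
tagged = [('Ông', 'Nc', 'B-NP', 'O'),
          ('Putin', 'Np', 'B-NP', 'B-PER'),
          ('ca ngợi', 'V', 'B-VP', 'O'),
          ('những', 'L', 'B-NP', 'O'),
          ('thành tựu', 'N', 'B-NP', 'O'),
          ('vĩ đại', 'A', 'B-AP', 'O'),
          ('của', 'E', 'B-PP', 'O'),
          ('Liên Xô', 'Np', 'B-NP', 'B-LOC')]

entities = extract_entities(tagged)
print(entities)
```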

classify

Install dependencies and download default model

$ pip install Cython
$ pip install future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data

underthesea.classify(X, domain=None)[source]

Text classification

Parameters:
  • X ({unicode, str}) – raw sentence
  • domain ({None, 'bank'}) –
    domain of text
    • None: general domain
    • bank: bank domain
Returns:

tokens – categories of sentence

Return type:

list

sentiment

Install dependencies

$ pip install future scipy numpy scikit-learn==0.19.2 joblib

underthesea.sentiment(X, domain='general')[source]

Sentiment Analysis

Parameters:
  • X (str) – raw sentence
  • domain (str) – domain of text, either "bank" or "general". Default: general
Returns:

  • Text (Text of input sentence)
  • Labels (Sentiment of sentence)

Examples

>>> from underthesea import sentiment
>>> sentence = "Chuyen tiền k nhận Dc tiên"
>>> sentiment(sentence, domain='bank')
[MONEY_TRANSFER#negative (1.0)]
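
The label in this output appears to combine an aspect and a polarity separated by "#". As a sketch under that assumption, and further assuming the label is available as a plain string (the exact return type is not stated here), it can be split as follows. `split_label` is a hypothetical helper, not an underthesea function.

```python
def split_label(label):
    """Split an "ASPECT#polarity" label into its two parts.

    Hypothetical helper; assumes the "#" separator seen in the
    example output above.
    """
    aspect, _, polarity = label.partition("#")
    return aspect, polarity


# Label text copied from the sentiment example above.
aspect, polarity = split_label("MONEY_TRANSFER#negative")
print(aspect, polarity)
```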

viet2ipa

underthesea.pipeline.ipa.viet2ipa(text: str, *args, **kwargs)[source]

Generate the IPA transcription of a syllable

Vietnamese syllabic structure (Anh & Trang 2022)

syllable = onset + rhyme + tone

rhyme = medial + nuclear vowel + (coda)

Parameters:
  • text (str) – the syllable to transcribe
  • dialect (str) – either "north" or "south". Default: north
  • eight (boolean) – if True, use the eight-tone format; otherwise use the six-tone format. Default: False
  • tone (str) – either "ipa" or "number". Default: number
Returns:

A string containing the IPA transcription of the syllable

Examples

>>> # -*- coding: utf-8 -*-
>>> from underthesea.pipeline.ipa import viet2ipa
>>> viet2ipa("trồng")
tɕoŋ³²
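
The syllabic structure given above (syllable = onset + rhyme + tone, rhyme = medial + nuclear vowel + optional coda) can be modelled as a small data structure. This is only an illustrative sketch: the `Syllable` class and the orthographic decomposition of "trồng" are assumptions made for the example and do not describe how viet2ipa is implemented.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    """Hypothetical container for the parts of a Vietnamese syllable."""
    onset: str
    medial: str
    nucleus: str
    coda: str   # empty string when the syllable has no coda
    tone: str

    @property
    def rhyme(self):
        # rhyme = medial + nuclear vowel + (coda)
        return self.medial + self.nucleus + self.coda


# Assumed decomposition of "trồng": onset "tr", no medial,
# nucleus "ô", coda "ng", huyền (falling) tone.
trong = Syllable(onset="tr", medial="", nucleus="ô", coda="ng", tone="huyền")
print(trong.onset, trong.rhyme, trong.tone)
```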