mt.pandas.word

Custom word accessor for pandas.

Classes

class mt.pandas.word.WordAccessor(pandas_obj)

Accessor for word fields.

Inheritance

[Inheritance diagram: WordAccessor]
property bigram

Returns a list of letter bigrams for each word. See ngram().

property english

Indicates which items look like English words.

property extract_vietnamese_tone

Extracts the tone marks {"`", "'", "?", "~", "."} from each Vietnamese word.

property letter

Returns a list of letters for each word.

property move_vi_tone_to_last

Moves the first tone mark to the end of a word.
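As an illustration only, the following sketch shows one plausible reading of this operation on a single word. It assumes the word is already in "split" form, where each tone mark is written as one of the ASCII symbols `` ` ``, `'`, `?`, `~` or `.` right after its letter. The helper name `move_first_tone_to_end` is hypothetical and is not part of mt.pandas.word; the accessor's actual behaviour may differ.

```python
# Hypothetical helper, not the library's implementation.
# Assumption: tone marks appear as ASCII symbols inside the word.
TONE_SYMBOLS = "`'?~."


def move_first_tone_to_end(split_word):
    """Move the first tone symbol found in `split_word` to its end."""
    for i, ch in enumerate(split_word):
        if ch in TONE_SYMBOLS:
            # Drop the symbol at position i and append it at the end.
            return split_word[:i] + split_word[i + 1:] + ch
    # No tone symbol found: the word is returned unchanged.
    return split_word


print(move_first_tone_to_end("tiê'ng"))  # tiêng'
print(move_first_tone_to_end("ma"))      # ma
```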

ngram(n)

Returns a list of letter n-grams for each word.

Parameters:

n (int) – the size n of the letter n-grams. Must be an integer greater than 1.

Returns:

each element of the returned series is a list of the n-grams of the corresponding element in the input series

Return type:

pandas.Series

Raises:

ValueError – if an argument is invalid

Notes

You can use pandas' Series.explode() to flatten the resulting lists into one n-gram per row for further processing.
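To make the per-word result concrete, here is a minimal pure-Python sketch of letter n-gram extraction for one word. The function name `letter_ngrams` is hypothetical, and returning an empty list for words shorter than n is an assumption; the documentation above only states that n must be an integer greater than 1 and that a ValueError is raised otherwise.

```python
# Hypothetical helper, not the library's implementation.
def letter_ngrams(word, n):
    """Return the list of consecutive letter n-grams of `word`."""
    # Mirror the documented constraint: n must be an integer greater than 1.
    if not isinstance(n, int) or isinstance(n, bool) or n <= 1:
        raise ValueError(f"n must be an integer greater than 1, got {n!r}")
    # Slide a window of width n over the word; words shorter than n
    # yield an empty list (an assumption of this sketch).
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(letter_ngrams("word", 2))  # ['wo', 'or', 'rd']
print(letter_ngrams("word", 3))  # ['wor', 'ord']
```

With pandas, a function like this could be applied element-wise through Series.apply, after which Series.explode() would give one n-gram per row.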

property remove_vietnamese_tone

Removes the tone marks in each Vietnamese word.
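One common way to strip Vietnamese tone marks while keeping the other diacritics (such as the circumflex in ê or the horn in ơ) is Unicode decomposition. The sketch below uses only the standard `unicodedata` module; the name `remove_tone` is hypothetical, and nothing here is confirmed to be how the accessor works internally.

```python
import unicodedata

# Combining characters for the five Vietnamese tone marks:
# grave, acute, hook above, tilde, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0309", "\u0303", "\u0323"}


def remove_tone(word):
    """Strip tone marks from `word`, keeping other diacritics (sketch)."""
    # NFD splits each letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes what is left, e.g. e + circumflex back into ê.
    return unicodedata.normalize("NFC", kept)


print(remove_tone("tiếng"))  # tiêng
print(remove_tone("Việt"))   # Viêt
```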

property split_vi_diacritical

Splits any untoned diacritical Vietnamese letter into its base letter followed by a symbol representing the diacritical mark, in each word.

property split_vi_tone

Splits any Vietnamese toned letter into its base letter followed by a symbol representing the tone mark (one of `, ', ?, ~ or .), in each word.
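The transformation described above can be sketched for a single word as follows. The mapping from combining tone marks to ASCII symbols follows the symbol list in the description, but the helper name `split_tone` and the use of Unicode normalization are assumptions of this sketch, not the accessor's confirmed implementation.

```python
import unicodedata

# Assumed mapping: combining tone mark -> ASCII symbol.
TONE_SYMBOLS = {
    "\u0300": "`",  # grave
    "\u0301": "'",  # acute
    "\u0309": "?",  # hook above
    "\u0303": "~",  # tilde
    "\u0323": ".",  # dot below
}


def split_tone(word):
    """Rewrite each toned letter as base letter + tone symbol (sketch)."""
    out = []
    for letter in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", letter)
        # Keep everything except the tone mark, then recompose the letter.
        base = "".join(ch for ch in decomposed if ch not in TONE_SYMBOLS)
        tones = "".join(TONE_SYMBOLS[ch] for ch in decomposed if ch in TONE_SYMBOLS)
        out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)


print(split_tone("tiếng"))  # tiê'ng
print(split_tone("Việt"))   # Viê.t
```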

sub_map(substr_map)

Substitutes substrings using a dictionary/map.

Each occurrence in a word of a substring that appears as a key of the map is replaced with the corresponding replacement string.

Parameters:

substr_map (dict) – a dictionary mapping each substring to its replacement string
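The documentation does not say how overlapping substrings are resolved or in what order replacements happen. The sketch below shows one reasonable strategy for a single word: a single left-to-right pass that prefers longer keys. The function name `substitute_substrings` and this matching policy are assumptions, not the library's documented behaviour.

```python
import re


# Hypothetical helper, not the library's implementation.
def substitute_substrings(word, substr_map):
    """Replace every occurrence of a key of `substr_map` in `word`."""
    if not substr_map:
        return word
    # Try longer keys first so that "ngh" wins over "ng" (assumed policy).
    keys = sorted(substr_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A single pass means replacement text is never substituted again.
    return pattern.sub(lambda m: substr_map[m.group(0)], word)


print(substitute_substrings("phong", {"ph": "f", "ng": "N"}))  # foN
```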

property trigram

Returns a list of letter trigrams for each word. See ngram().

property truncate_first_vi_mark

Truncates each word to the first occurrence of a split Vietnamese mark.

property vietnamese

Indicates which items look like Vietnamese words.