text

Manipulation of textual data.

Textual data (pre)processing

remove_punctuation(input_text[, ...])

Removes punctuation from textual data.

get_acronym(input_text[, only_capitals, ...])

Generates an acronym (in capital letters) from textual data.

split_on_uppercase(input_text[, join_with])

Extracts words from a string by splitting it at occurrences of uppercase letters.

numeral_english_to_arabic(input_text)

Converts a number written in English words into the equivalent value in Arabic numerals.

count_words(input_text[, lowercase, ...])

Counts the occurrences of each word in the given text.

calculate_idf(documents[, lowercase, ...])

Calculates Inverse Document Frequency (IDF) for a sequence of textual documents.

calculate_tfidf(documents, **kwargs)

Calculates TF-IDF (Term Frequency-Inverse Document Frequency) for the given textual documents.
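To illustrate what the IDF and TF-IDF entries above compute, here is a minimal standalone sketch. It is not this module's implementation. The helper names (`tokenize`, `sketch_idf`, `sketch_tfidf`), the tokenization rule, the unsmoothed formula `log(N / df)` and the use of raw counts as term frequency are all assumptions made for illustration; the actual functions may tokenize, smooth or normalize differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sketch_idf(documents):
    """Return {word: log(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set so each word is counted at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def sketch_tfidf(documents):
    """Return one {word: tf * idf} dict per document, with raw counts as tf."""
    idf = sketch_idf(documents)
    return [
        {word: count * idf[word] for word, count in Counter(tokenize(doc)).items()}
        for doc in documents
    ]


docs = ["the cat sat", "the dog sat", "the bird flew"]
idf = sketch_idf(docs)
tfidf = sketch_tfidf(docs)
```

Under these assumptions, a word that appears in every document (here "the") gets an IDF of zero, so it contributes nothing to the TF-IDF weights, while a word confined to a single document receives the largest weight.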

Textual data similarity

euclidean_distance_between_texts(txt1, txt2)

Computes the Euclidean distance between two sentences.

cosine_similarity_between_texts(txt1, txt2)

Calculates the cosine similarity between two sentences.

find_matched_str(input_str, lookup_list[, ...])

Finds all strings (in a sequence) that match a given string or regex pattern.

find_similar_str(input_str, lookup_list[, ...])

Finds n strings in a sequence of candidates that are similar to input_str.
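As a rough illustration of the two distance measures listed above, the following standalone sketch compares two sentences through their bag-of-words count vectors. It is not this module's implementation. The helper names (`word_counts`, `sketch_euclidean_distance`, `sketch_cosine_similarity`) and the tokenization rule are assumptions; the actual functions may vectorize the texts differently.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Return word counts for the lowercased text (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def sketch_euclidean_distance(txt1, txt2):
    """Euclidean distance between the word-count vectors of two texts."""
    c1, c2 = word_counts(txt1), word_counts(txt2)
    vocab = set(c1) | set(c2)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in vocab))


def sketch_cosine_similarity(txt1, txt2):
    """Cosine of the angle between the word-count vectors of two texts."""
    c1, c2 = word_counts(txt1), word_counts(txt2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm1 * norm2)


a = "This is an apple."
b = "This is a pear."
dist = sketch_euclidean_distance(a, b)
sim = sketch_cosine_similarity(a, b)
```

With this vectorization the two example sentences share two of their four words each, so the distance grows with every word they do not share while the similarity stays between 0 and 1.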