Description
An open-source NLP library: fast text cleaning and preprocessing.
TL;DR
This library provides quick, ready-to-use tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, extra whitespace, and more.
Installation
To get dobbi, either fork this GitHub repo or install it from PyPI with pip:
$ pip install dobbi
Usage
Import the library:
import dobbi
Interaction
The library uses method chaining in order to simplify text processing:
import pandas as pd
d = {'text': ['#fun #lol Why @Alex33 is so funny here: https://some-url.com',
              '#looool =) 😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)
cln_func = dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .function()
df['text'] = df['text'].map(cln_func)
repl_func = dobbi.replace() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
df['text'] = df['text'].map(repl_func)
Result:
print(df['text'][0]) # 'Why is so funny here'
print(df['text'][1]) # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'
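To make the cleaning stage more concrete, here is a rough standard-library sketch of what removing hashtags, nicknames, and URLs could look like. The regular expressions and the helper name `strip_social_markup` are illustrative assumptions, not dobbi's own patterns or API, and real-world inputs will likely need more careful patterns.

```python
import re

# Illustrative patterns only (assumptions); dobbi's internal patterns may differ.
HASHTAG = re.compile(r"#\w+")
NICKNAME = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")


def strip_social_markup(text: str) -> str:
    """Remove URLs, hashtags, and nicknames, then collapse whitespace."""
    # URLs go first so that '#' or '@' inside a URL is not matched separately.
    for pattern in (URL, HASHTAG, NICKNAME):
        text = pattern.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


sample = "#fun #lol Why @Alex33 is so funny here: https://some-url.com"
cleaned = strip_social_markup(sample)
print(cleaned)  # 'Why is so funny here:'
```

Unlike the dobbi example above, this sketch leaves punctuation untouched, which is why the trailing colon survives.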
Supported methods and patterns
The process consists of three stages:
- Initialization methods: initialize a dobbi Work object
- Intermediate methods: chain patterns in the needed order
- Terminal methods: choose whether you need a function or a result
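The three stages above can be illustrated with a small, self-contained sketch of the method-chaining idea. The class `ChainSketch` and its `apply` method are hypothetical names invented for this example; this is not dobbi's implementation, only one plausible way such a chain could be structured.

```python
import re
from typing import Callable, List


class ChainSketch:
    """Hypothetical stand-in for a chainable work object (not dobbi's class)."""

    def __init__(self) -> None:
        # Stage 1 (initialization): start with an empty list of steps.
        self._steps: List[Callable[[str], str]] = []

    # Stage 2 (intermediate): each method registers a step and returns self,
    # which is what makes chaining possible.
    def hashtag(self) -> "ChainSketch":
        self._steps.append(lambda s: re.sub(r"#\w+", " ", s))
        return self

    def whitespace(self) -> "ChainSketch":
        self._steps.append(lambda s: re.sub(r"\s+", " ", s).strip())
        return self

    # Stage 3 (terminal): return either a reusable function or a result.
    def function(self) -> Callable[[str], str]:
        steps = tuple(self._steps)  # snapshot of the steps chained so far

        def run(text: str) -> str:
            for step in steps:  # steps run in the order they were chained
                text = step(text)
            return text

        return run

    def apply(self, text: str) -> str:  # hypothetical name for "a result"
        return self.function()(text)


clean = ChainSketch().hashtag().whitespace().function()
print(clean("#fun   so   funny"))  # 'so funny'
print(ChainSketch().hashtag().whitespace().apply("#lol nice"))  # 'nice'
```

Returning a plain function from the terminal step is what lets the chain plug into calls such as `df['text'].map(...)`, while a result-returning terminal is convenient for one-off strings.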