Closed
Labels
API Design, Enhancement, Strings
Description
To use a compiled regular expression with str.replace, the current solution is to call str.replace(<compiled_re>.pattern, <repl>, flags=<compiled_re>.flags),
which is relatively ugly and verbose in my opinion.
Here's a contrived example of removing stopwords and normalizing whitespace afterwards:
import pandas as pd
import re
some_names = pd.Series(["three weddings and a funeral", "the big lebowski", "florence and the machine"])
stopwords = ["the", "a", "and"]
stopwords_re = re.compile(r"(\s+)?\b({})\b(\s+)?".format("|".join(stopwords)), re.IGNORECASE)
whitespace_re = re.compile(r"\s+")
# desired code:
# some_names.str.replace(stopwords_re, " ").str.strip().str.replace(whitespace_re, " ")
# actual code:
some_names.\
str.replace(stopwords_re.pattern, " ", flags=stopwords_re.flags).\
str.strip().str.replace(whitespace_re.pattern, " ", flags=whitespace_re.flags)
Why do I think this is better?
- It's nice to keep commonly used regular expressions compiled so that they carry their flags around with them (this also allows the use of "verbose" regular expressions).
- It's not that compiled regular expressions should quack like strings... it's that in this case we're making strings quack like compiled regular expressions, but at the same time not letting those compiled regular expressions quack their own quack.
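To illustrate the point about flags, here is a small sketch using only the standard `re` module. The verbose layout of the stopword pattern is my own illustration, not code from a real project. It shows that a compiled pattern remembers its flags, and that passing `.pattern` on its own silently changes what the expression means.

```python
import re

# Illustrative verbose version of the stopword pattern: the layout and the
# comments live inside the pattern and are ignored thanks to re.VERBOSE.
verbose_stopwords_re = re.compile(
    r"""
    (\s+)?            # optional leading whitespace
    \b(the|a|and)\b   # one stopword, matched as a whole word
    (\s+)?            # optional trailing whitespace
    """,
    re.IGNORECASE | re.VERBOSE,
)

# The compiled object carries the flags it was built with.
has_verbose = bool(verbose_stopwords_re.flags & re.VERBOSE)
has_ignorecase = bool(verbose_stopwords_re.flags & re.IGNORECASE)

# Recompiling from .pattern alone drops the flags, so the layout whitespace
# and the "#" comments are now treated as literal text to match.
pattern_only_re = re.compile(verbose_stopwords_re.pattern)

with_flags = verbose_stopwords_re.sub(" ", "The Big Lebowski")
without_flags = pattern_only_re.sub(" ", "The Big Lebowski")
print(repr(with_flags), repr(without_flags))
```

This is why the flags have to be passed alongside the pattern every time in the workaround above.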
Is there a good reason not to implement this?
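In the meantime, the unpacking can at least be hidden behind a small wrapper. This is only a sketch: `replace_compiled` is a hypothetical helper name, not part of pandas, and the explicit `regex=True` assumes a pandas version whose `str.replace` accepts that keyword.

```python
import re

import pandas as pd


def replace_compiled(series, compiled_re, repl):
    """Hypothetical helper: apply a compiled regex via Series.str.replace.

    Unpacks the compiled object into the pattern string and flags that
    str.replace takes as separate arguments.
    """
    return series.str.replace(
        compiled_re.pattern,
        repl,
        flags=compiled_re.flags,
        regex=True,  # assumption: this keyword is available in your pandas
    )


some_names = pd.Series(
    ["three weddings and a funeral", "the big lebowski", "florence and the machine"]
)
stopwords = ["the", "a", "and"]
stopwords_re = re.compile(
    r"(\s+)?\b({})\b(\s+)?".format("|".join(stopwords)), re.IGNORECASE
)
whitespace_re = re.compile(r"\s+")

# Remove stopwords, trim the ends, then collapse runs of whitespace.
cleaned = replace_compiled(some_names, stopwords_re, " ").str.strip()
cleaned = replace_compiled(cleaned, whitespace_re, " ")
print(cleaned.tolist())
```

It works, but every project ends up carrying its own copy of a helper like this, which is the argument for having str.replace accept the compiled object directly.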