Using Regular Expression As A Tokenizer?
I am trying to tokenize my corpus into sentences. I tried using spaCy and NLTK and they did not work well since my text is a bit tricky. Below is an artificial sample I made which cov
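For reference, a typical baseline attempt with those libraries looks roughly like the sketch below. This is a generic illustration, not the asker's actual code; it assumes NLTK's punkt data has been downloaded and spaCy's en_core_web_sm model is installed, and the sample string is taken from the test data further down.
from nltk.tokenize import sent_tokenize
import spacy

text = "It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised."

# NLTK's pre-trained Punkt sentence splitter (needs: nltk.download("punkt"))
print(sent_tokenize(text))

# spaCy's default English pipeline and its sentence boundaries
nlp = spacy.load("en_core_web_sm")
print([sent.text for sent in nlp(text).sents])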
Solution 1:
In general you can't rely on one single Great White infallible regex; you have to write a function which uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
Following that guideline, the function below uses several regexes to parse your sentences. It is a modification of D Greenberg's answer.
Code
import re

def split_into_sentences(text):
    # Regex patterns
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = r"([0-9])"
    section = r"(Section \d+)([.])(?= \w)"
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])

    # Text replacement:
    # replace unwanted stop periods with <prd>,
    # mark actual stop periods with <stop>
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...", "<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    text = re.sub(section, "\\1<prd>", text)
    text = re.sub(item_number, "\\1<prd>", text)
    text = re.sub(abbreviations, "\\1<prd>", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")

    # Tokenize into sentences based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # drop the last element since it is only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
Usage
for index, token in enumerate(split_into_sentences(s), start=1):
    print(f'{index}) {token}')
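The answer above also mentions keeping a dictionary of abbreviations. If your corpus has abbreviations beyond the hard-coded prefixes, one way to plug that idea in, assuming you keep the <prd> convention, is a small helper like the sketch below; the extra_abbreviations list and the protect_abbreviations name are illustrative, not part of the original answer.
import re

# Hypothetical list of domain-specific abbreviations (legal citations here)
extra_abbreviations = ["No", "Ex", "v", "Hon", "Adv"]

def protect_abbreviations(text, abbreviations=extra_abbreviations):
    # Rewrite "Abbrev." as "Abbrev<prd>" so the main splitter
    # does not treat that period as a sentence boundary.
    pattern = r"\b(" + "|".join(map(re.escape, abbreviations)) + r")[.]"
    return re.sub(pattern, r"\1<prd>", text)
Calling protect_abbreviations(text) at the start of split_into_sentences would leave the rest of the logic untouched, since every <prd> is converted back to '.' just before splitting.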
Tests
1. Input
s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
to one cannot be generalised. However, the High Court while enhancing the same from life to
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it isnot a
rarest of rare casewhere extreme penalty of death is called for instead sentence of
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,
award of extreme penalty of death by the High Court isset aside and we restore the sentence of
life imprisonment as directed by the trial Court.
'''
Output
1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it isnot a rarest of rare casewhere extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, award of extreme penalty of death by the High Court isset aside and we restore the sentence of life imprisonment as directed by the trial Court.
2. Input
s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight. He's arriving on flight No. 48213 out of Denver. He'll take the No. 2 bus from the airport. However, he may grab a taxi instead.'''
Output
1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.
3. Input
s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''
Output
1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.
4. Input
s = '''He went to New York. He is 10 years old.'''
Output
1) He went to New York.
2) He is 10 years old.
5. Input
s = '''15) In the light of Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''
Output
1) 15) In the light of Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.
Solution 2:
Are you looking for the regex below?
'(?<=[^A-Z][a-z]\w)[/.] '
Explanation:
- [^A-Z][a-z]\w[/.] --> This requires the three characters before the '.' (or '/') to be a non-uppercase character, a lowercase letter and a word character, so the period must end an ordinary lowercase word rather than a short capitalized abbreviation such as 'Dr.' or 'No.'; the trailing space in the pattern then marks the sentence boundary.
- (?<=....) --> This is a lookbehind: the preceding characters are only checked, not consumed, so the actual match (and the split delimiter) is just the '. '.
Now this can be used in split:
sent = re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', s)
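As a quick check (the sample string, variable names and comments below are mine, not from the answer), applying this split shows that short capitalized abbreviations survive while an ordinary word followed by '. ' still triggers a split.
import re

s = ("Mr. or Mrs. or Dr. (not sure of their title) Smith will be here "
     "in the morning at eight. He's arriving on flight No. 48213 out of Denver.")

# "Mr.", "Mrs.", "Dr." and "No." fail the lookbehind, so they do not split;
# "eight. " does, and re.split consumes the matched ". ", dropping that period.
for part in re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', s):
    print(part)
Note that the split consumes the '. ' delimiter, so sentence-ending periods are lost from the pieces; you would need to add them back (or use re.findall with a fuller pattern) if you want to keep them.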