Using Regular Expression As A Tokenizer?
I am trying to tokenize my corpus into sentences. I tried using spaCy and NLTK and they did not work well since my text is a bit tricky. Below is an artificial sample I made which cov
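For reference, a typical baseline attempt with those libraries looks roughly like the sketch below. This is a generic illustration, not the asker's actual code; it assumes NLTK's punkt data has been downloaded and spaCy's en_core_web_sm model is installed, and the sample string is taken from the test data further down.
from nltk.tokenize import sent_tokenize
import spacy

text = "It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised."

# NLTK's pre-trained Punkt sentence splitter (needs: nltk.download("punkt"))
print(sent_tokenize(text))

# spaCy's default English pipeline and its sentence boundaries
nlp = spacy.load("en_core_web_sm")
print([sent.text for sent in nlp(text).sents])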
Solution 1:
In general you can't rely on one single Great White infallible regex; you have to write a function which uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
Following that guideline, the function below uses several regexes to parse your sentences. It is a modification of D Greenberg's answer.
Code
import re

def split_into_sentences(text):
    # Regex patterns
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = r"([0-9])"
    section = r"(Section \d+)([.])(?= \w)"
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])

    # Text replacement:
    # replace unwanted stop periods with <prd>,
    # mark actual stop periods with <stop>
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...", "<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    text = re.sub(section, "\\1<prd>", text)
    text = re.sub(item_number, "\\1<prd>", text)
    text = re.sub(abbreviations, "\\1<prd>", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")

    # Tokenize into sentences based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # drop the last element since it is only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
Usage
for index, token in enumerate(split_into_sentences(s), start=1):
    print(f'{index}) {token}')
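The answer above also mentions keeping a dictionary of abbreviations. If your corpus has abbreviations beyond the hard-coded prefixes, one way to plug that idea in, assuming you keep the <prd> convention, is a small helper like the sketch below; the extra_abbreviations list and the protect_abbreviations name are illustrative, not part of the original answer.
import re

# Hypothetical list of domain-specific abbreviations (legal citations here)
extra_abbreviations = ["No", "Ex", "v", "Hon", "Adv"]

def protect_abbreviations(text, abbreviations=extra_abbreviations):
    # Rewrite "Abbrev." as "Abbrev<prd>" so the main splitter
    # does not treat that period as a sentence boundary.
    pattern = r"\b(" + "|".join(map(re.escape, abbreviations)) + r")[.]"
    return re.sub(pattern, r"\1<prd>", text)
Calling protect_abbreviations(text) at the start of split_into_sentences would leave the rest of the logic untouched, since every <prd> is converted back to '.' just before splitting.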
Tests
1. Input
s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
to one cannot be generalised. However, the High Court while enhancing the same from life to
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it isnot a
rarest of rare casewhere extreme penalty of death is called for instead sentence of
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,
award of extreme penalty of death by the High Court isset aside and we restore the sentence of
life imprisonment as directed by the trial Court.
'''
Output
1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it isnot a rarest of rare casewhere extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, award of extreme penalty of death by the High Court isset aside and we restore the sentence of life imprisonment as directed by the trial Court.
2. Input
s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight. He's arriving on flight No. 48213 out of Denver. He'll take the No. 2 bus from the airport. However, he may grab a taxi instead.'''
Output
1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.
3. Input
s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''
Output
1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.
4. Input
s = '''He went to New York. He is 10 years old.'''
Output
1) He went to New York.
2) He is 10 years old.
5. Input
s = '''15) In the light of Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''
Output
1) 15) In the light of Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.
Solution 2:
Are you looking for the regex below?
'(?<=[^A-Z][a-z]\w)[/.] '
Explanation:
- [^A-Z][a-z]\w[/.] --> This requires the three characters before the '.' (or '/') to be a non-uppercase character, a lowercase letter and a word character, so the period must end an ordinary lowercase word rather than a short capitalized abbreviation such as 'Dr.' or 'No.'; the trailing space in the pattern then marks the sentence boundary.
- (?<=....) --> This is a lookbehind: the preceding characters are only checked, not consumed, so the actual match (and the split delimiter) is just the '. '.
Now this can be used in split:
sent = re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', s)
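As a quick check (the sample string, variable names and comments below are mine, not from the answer), applying this split shows that short capitalized abbreviations survive while an ordinary word followed by '. ' still triggers a split.
import re

s = ("Mr. or Mrs. or Dr. (not sure of their title) Smith will be here "
     "in the morning at eight. He's arriving on flight No. 48213 out of Denver.")

# "Mr.", "Mrs.", "Dr." and "No." fail the lookbehind, so they do not split;
# "eight. " does, and re.split consumes the matched ". ", dropping that period.
for part in re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', s):
    print(part)
Note that the split consumes the '. ' delimiter, so sentence-ending periods are lost from the pieces; you would need to add them back (or use re.findall with a fuller pattern) if you want to keep them.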