How To Do A Partial Conditioning On A Tag For Find_all() In Bs4?
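I have an XML document with multiple tags that look like the textblock tag in the code below, and I want to select only those whose id starts with Page1. One way to do this is to pass a function as the id filter: find_all('textblock', id = lambda value: value and value.startswith("Page1"))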
This is my entire code:
from bs4 import BeautifulSoup
xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""
xml_soup = BeautifulSoup(xml, 'lxml')
# The "value and" guard skips tags that have no id attribute (value is None).
text_blocks = xml_soup.find_all('textblock', id=lambda value: value and value.startswith("Page1"))
Explanation:
The lambda function checks whether the id starts with Page1. If it does, find_all() returns the tag. I have also added a few more values to the xml variable. Here is the test data that I used:
xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block4" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page2_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block1" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""
As you can see, there are 3 textblock tags with an id that starts with Page1. When I ran my code using this test data and printed out the length of the variable text_blocks, this is the output that I got:
>>> len(text_blocks)
3
This shows that the code works! Hope that this helps!
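For comparison, here is a minimal sketch of two other ways to express the same prefix match: a compiled regular expression passed as the id filter, and a CSS attribute selector passed to select(). The sample data is a trimmed-down assumption (closed tags, only the id attribute, plus one tag without an id), and it uses Python's built-in html.parser instead of lxml so that it runs without the extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Simplified, hypothetical test data: only the id attribute is kept,
# and the last tag has no id at all.
xml = """
<textblock id="Page1_Block5"></textblock>
<textblock id="Page1_Block4"></textblock>
<textblock id="Page2_Block5"></textblock>
<textblock id="Page1_Block1"></textblock>
<textblock></textblock>
"""

# html.parser is used here only to avoid requiring lxml.
soup = BeautifulSoup(xml, "html.parser")

# 1. Function filter, as in the answer above.
by_lambda = soup.find_all(
    "textblock", id=lambda value: value and value.startswith("Page1")
)

# 2. Compiled regular expression anchored at the start of the id.
by_regex = soup.find_all("textblock", id=re.compile(r"^Page1"))

# 3. CSS attribute selector: ^= means "starts with".
by_css = soup.select('textblock[id^="Page1"]')

ids_lambda = [tag["id"] for tag in by_lambda]
ids_regex = [tag["id"] for tag in by_regex]
ids_css = [tag["id"] for tag in by_css]

print(ids_lambda)
print(ids_regex)
print(ids_css)
```

All three approaches should select the same three tags. The function filter is the most flexible of the three, while the CSS selector is the shortest when a plain prefix match is all that is needed.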
P.S.: You can refer to this link for more details about extracting elements with an id that starts with a particular string.