How To Deal With ® In Url For Urllib2.urlopen?

I received this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions. url=u'https://www.packtpub.com/vi

Solution 1:

A URL must be a valid bytestring, with any non-ASCII code points percent-encoded. Encode the URL to UTF-8 first, then URL-quote its path:

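# Python 2
import urllib
import urllib2
import urlparse

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
# Encode the unicode URL to UTF-8 bytes, then split it into its components
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
# Percent-encode only the path; the scheme and host are left untouched
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
# Reassemble the components into a single URL string
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()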

Demo:

>>> import urllib
>>> import urllib2
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758
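
The code in this answer targets Python 2. In Python 3, urllib2 and urlparse were merged into urllib.request and urllib.parse, so the same idea might look like the sketch below. The helper names quote_url_path and fetch are hypothetical, chosen only for illustration. In Python 3, quote() encodes a str as UTF-8 by default, so the explicit .encode('utf8') step is not needed.

```python
from urllib.parse import quote, urlsplit
from urllib.request import urlopen


def quote_url_path(url):
    """Percent-encode the path of `url`, leaving scheme, host and query as-is.

    Hypothetical helper, not part of the original answer.
    """
    parts = urlsplit(url)
    # quote() keeps '/' by default and encodes non-ASCII text as UTF-8 bytes
    return parts._replace(path=quote(parts.path)).geturl()


def fetch(url):
    """Download `url` after quoting its path (performs a network request)."""
    with urlopen(quote_url_path(url)) as response:
        return response.read()


original_url = ('https://www.packtpub.com/virtualization-and-cloud/'
                'citrix-xenapp\xae-75-desktop-virtualization-solutions')
print(quote_url_path(original_url))
# prints https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions
```

This sketch assumes the path has not been percent-encoded already. Calling quote() on a path that already contains sequences such as %C2%AE would encode the % signs a second time; passing safe='/%' to quote() is one way to avoid that, at the cost of leaving stray literal % characters unescaped.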