Minor bug in Seafax example #80

Open
andypiper opened this issue Oct 22, 2023 · 0 comments

The RSS feed that the Seafax sample pulls from by default (BBC Technology news) can contain non-ASCII Unicode characters. Specifically, I noticed that smart single quotes (U+2019, RIGHT SINGLE QUOTATION MARK) show up in the displayed headlines as a literal \u2019. For example, at the time of reporting, the feed contains this:

    <item>
      <title><![CDATA[Google Pixel’s face-altering photo tool sparks AI manipulation debate]]></title>

This is then displayed on-screen as:

Google Pixel\u2019s face-altering photo tool sparks AI manipulation debate

I've hacked together a couple of simple (but horrible) solutions that address this one situation.

Either around line 125:

                # Populate our result dict
                if top_tag in accept_tags:
                    current[top_tag.decode("utf-8")] = text.decode("utf-8").replace("\u2019","'")
                    # this replaces unicode RIGHT_SINGLE_QUOTATION_MARK with basic mark

An alternative which may be better(?) is to do the replacement in the get_rss() function instead (this is the one I'm using for now).

def get_rss():
    try:
        stream = urequest.urlopen(URL)
        output = list(parse_xml_stream(stream, [b"title", b"description", b"guid", b"pubDate"], b"item"))

        # replace smart quotes with basic ones in titles
        # (loop variable renamed so it no longer shadows the built-in dict)
        for item in output:
            if "title" in item:
                item["title"] = item["title"].replace("\u2019", "'")

        return output

    except OSError as e:
        print(e)
        return False

There's probably a better way to clean up the string data and handle other potential Unicode invaders, but so far the smart right quotation mark is the only one I'm seeing in the RSS data. The feed is also inconsistent: elsewhere in the BBC feed I'm seeing basic single quotes in exactly the same context as the smart quote here.
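
A more general cleanup could look something like the sketch below. It is only an illustration: `ASCII_REPLACEMENTS`, `to_ascii()` and `clean_items()` are hypothetical names that do not exist in the Seafax example, the list of characters is a guess at what a news feed might contain, and it assumes the parsed items are dicts with `str` keys and values, as in `get_rss()` above. It sticks to `str.replace` and `ord`, which should be available in MicroPython as well as CPython.

```python
# Hypothetical helper, not part of the Seafax example.
# Maps a few common typographic characters to plain ASCII equivalents.
ASCII_REPLACEMENTS = (
    ("\u2018", "'"),    # LEFT SINGLE QUOTATION MARK
    ("\u2019", "'"),    # RIGHT SINGLE QUOTATION MARK
    ("\u201c", '"'),    # LEFT DOUBLE QUOTATION MARK
    ("\u201d", '"'),    # RIGHT DOUBLE QUOTATION MARK
    ("\u2013", "-"),    # EN DASH
    ("\u2014", "-"),    # EM DASH
    ("\u2026", "..."),  # HORIZONTAL ELLIPSIS
    ("\u00a0", " "),    # NO-BREAK SPACE
)


def to_ascii(text, fallback="?"):
    """Return text with known characters replaced and the rest made ASCII."""
    for src, dst in ASCII_REPLACEMENTS:
        text = text.replace(src, dst)
    # Anything still outside 7-bit ASCII becomes the fallback character.
    return "".join(ch if ord(ch) < 128 else fallback for ch in text)


def clean_items(items, keys=("title", "description")):
    """Clean the chosen text fields of each parsed RSS item in place."""
    for item in items:
        for key in keys:
            if key in item:
                item[key] = to_ascii(item[key])
    return items
```

With something like this in place, `get_rss()` could return `clean_items(output)` instead of patching a single character, and new characters could be handled by adding a pair to the table.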

I'm happy to submit a PR if that would be useful, though there's a good chance that this is such a niche case that it's not warranted.
