RTL text is mirrored #66

Closed
mtigas opened this issue Mar 31, 2016 · 15 comments
@mtigas
Member

mtigas commented Mar 31, 2016

Per Eva’s tweet, I started looking into whether we have issues with Arabic script.

Not sure if this was a bug in the older tabula-extractor.

Anyway, given this file (or any other PDF with Arabic script) and just trying to pull out any run of text, you’ll get output from Tabula and tabula-java that’s mirrored:

[image: reversed-txt]

reference text:
[image: reference]

(Note question mark position in first line.)


After looking into it some, here’s what I’ve dug up:

  • The PDFbox site here mentions at the bottom that

    Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew) in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text if the ICU4J jar file has been placed on the classpath (it is an optional dependency). Note that you should also enable sorting with either org.apache.pdfbox.util.PDFTextStripper or org.apache.pdfbox.ExtractText to ensure accurate output.

  • Our TextElement class rips some bits from that very PDFTextStripper class. Here's ours, noting that it’s "ported from from PDFBox's PDFTextStripper.writePage, with modifications"

    (lol "Here be dragons")

  • So that upstream writePage function has a bunch of extra bits, starting around L629 regarding normalizing RTL scripts. We're missing those bits. Here's the block comment from that part:

    /* Before we can display the text, we need to do some normalizing.
     * Arabic and Hebrew text is right to left and is typically stored
     * in its logical format, which means that the rightmost character is
     * stored first, followed by the second character from the right etc.
     * However, PDF stores the text in presentation form, which is left to
     * right.  We need to do some normalization to convert the PDF data to
     * the proper logical output format.
     *
     * Note that if we did not sort the text, then the output of reversing the
     * text is undefined and can sometimes produce worse output then not trying
     * to reverse the order.  Sorting should be done for these languages.
     * */
    
mtigas added the bug label Mar 31, 2016
@jeremybmerrill
Member

Is it possible that PDFBox's output is correct? That "mirrored" output may be displaying the text in logical order, right? Hard to know what the expected output on the command-line should be.
On the command line, the expected output is text that displays correctly in Arabic. Curling aljazeera.net yields <title>الجزيرة نت</title>, which is in the right order. When I copy-paste that text into Sublime, it is "reversed" into logical order (i.e. illegible in Arabic).

We may have to do some "guessing" to figure out if text in a PDF is RTL, so that we can enable the RTL HTML attrs for preview output. (might be as simple as counting non-numeric word characters and seeing if 50% or more are in a script that's RTL, which is probably doable as a Unicode thingy)
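
A minimal sketch of that guessing heuristic, in Python for brevity rather than tabula-java's Java. It counts characters with a strong bidi class and treats the text as RTL when at least half of them are RTL. The function name `looks_rtl` and the `threshold` parameter are made up for illustration; only `unicodedata.bidirectional` is a real standard-library call.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def looks_rtl(text: str, threshold: float = 0.5) -> bool:
    """Guess whether text is mostly RTL (hypothetical helper, not Tabula code).

    Digits, spaces and punctuation have weak or neutral bidi classes,
    so they are ignored, matching the "non-numeric word characters" idea.
    """
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    strong = rtl + ltr
    return strong > 0 and rtl / strong >= threshold


if __name__ == "__main__":
    print(looks_rtl("\u0627\u0644\u062c\u0632\u064a\u0631\u0629 \u0646\u062a 2016"))
    print(looks_rtl("Tabula 2016"))
```

Text with no strong characters at all (for example a cell holding only numbers) is reported as not RTL, which seems like the safer default for preview output.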

I also have no idea how CSVs in RTL languages work.

Also, dealing with diacritics in these languages will be a disaster.

@mtigas
Member Author

mtigas commented Mar 31, 2016

Here's some output from upstream PDFBox; looks like they’ve got it working right:
[image: pdfbox]

Since we're missing their RTL modifications, we're naively taking the visual character order (the character positions on the page, left-to-right) as the logical character order, and not correcting for that.

There's some chance that their RTL code (and maybe something in ICU4J) already handles the diacritics (and direction switches, etc.). Maybe.

Also not sure about dealing with CSVs (probably some fun juggling strong and weak characters); for now we should mainly worry about outputting the character stream for a given cell in the correct logical order.
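
To make the visual-versus-logical distinction concrete, here is a deliberately naive Python sketch. It is not PDFBox's or ICU4J's algorithm and not the full Unicode Bidirectional Algorithm. It assumes a single line whose base direction is RTL, reverses the visual character order, and then restores any runs of Latin letters or digits that the reversal flipped. The name `visual_to_logical_rtl` is hypothetical.

```python
import unicodedata

# Characters that are displayed left to right even inside an RTL line:
# Latin-style letters ("L"), European digits ("EN"), Arabic-Indic digits ("AN").
LTR_RUN_CLASSES = {"L", "EN", "AN"}


def visual_to_logical_rtl(visual: str) -> str:
    """Convert one RTL line from visual (left-to-right on the page) order
    to logical (reading) order. Naive sketch, assumptions as described above."""
    out = []
    run = []  # pending LTR run, collected in reversed order
    for ch in reversed(visual):
        if unicodedata.bidirectional(ch) in LTR_RUN_CLASSES:
            run.append(ch)
        else:
            out.extend(reversed(run))  # put the LTR run back in reading order
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


if __name__ == "__main__":
    # Page shows, left to right: "123", a space, then final-mem, vav, lamed, shin.
    visual = "123 \u05dd\u05d5\u05dc\u05e9"
    logical = visual_to_logical_rtl(visual)
    print([f"U+{ord(c):04X}" for c in logical])
```

Known gaps in this sketch: number separators such as the "." in "2.6" break a digit run, so decimals come out flipped, and mixed-direction lines whose base direction is LTR are not handled at all. Those are the cases where a real bidi implementation is needed.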

@mtigas
Member Author

mtigas commented Mar 31, 2016

Actually, scratch that: in the last couple lines of PDFBox’s output, it also looks like it’s only partially right (blows up where the directions start getting mixed).

@jeremybmerrill
Member

I'd be happy to try to pick this up. Not sure when I'll get to it though, so if anyone wants to do it first, that's cool too.

@jazzido
Contributor

jazzido commented Apr 1, 2016

As the team's linguist, you're definitely the man for this job. Besides, I need to turn in my thesis 1 month from now :)

@jeremybmerrill
Member

In case anyone's curious!

curl -s www.aljazeera.net/portal | head -n4 | tail -n1 | xxd
   <title>الجزيرة نت</title>
20 20 20 20 3c 74 69 74 6c 65 3e d8a7 d984       <title>.....
sp sp sp sp <  t  i  t  l  e  >  alif lam  
d8ac d8b2 d98a d8b1 d8a9 20 d986 d8aa 3c 2f  ......... ....</
jim   zay  ya   ra   ta  sp  nun  ta  <  /
74 69 74 6c 65 3e 0d 0a                          title>..
t  i  t  l  e  > 

Indeed, the way this works is that the text is transmitted in logical order. There's no special RTL mark.
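
The same check can be reproduced without curl. This small Python sketch encodes the title string from the dump above and lists each character's UTF-8 bytes in storage order; `describe` is a made-up helper name.

```python
import unicodedata


def describe(text: str):
    """Return (code point, UTF-8 bytes as hex, Unicode name) per character,
    in the order the characters are stored (logical order)."""
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8").hex(), unicodedata.name(ch))
        for ch in text
    ]


# The <title> text from the dump, written with escapes so that an editor's
# bidi rendering cannot disguise the storage order.
title = "\u0627\u0644\u062c\u0632\u064a\u0631\u0629 \u0646\u062a"

if __name__ == "__main__":
    for code_point, utf8_hex, name in describe(title):
        print(code_point, utf8_hex, name)
```

The first entry is alif (`d8a7`), which is the rightmost letter when the title is displayed, so the byte stream starts with the first letter a reader would read.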

@jeremybmerrill
Member

Here's a question: if you have an all-RTL table, is the cell at 0,0 the top-left or top-right cell?

I think the right answer for us is the top-left, like in LTR languages, but it's a bit of a philosophical puzzle. (And worth reconsidering.)
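
Whichever convention wins, converting between the two is mechanical. A small Python sketch with hypothetical helper names (`cell`, `mirror_columns`), assuming rows are stored with the visually leftmost cell first:

```python
from typing import List


def mirror_columns(rows: List[List[str]]) -> List[List[str]]:
    """Flip every row so the visually rightmost cell comes first."""
    return [list(reversed(row)) for row in rows]


def cell(rows: List[List[str]], row: int, col: int, origin: str = "top-left") -> str:
    """Look up a cell, counting columns from the chosen origin corner."""
    if origin == "top-right":
        return rows[row][len(rows[row]) - 1 - col]
    return rows[row][col]


if __name__ == "__main__":
    table = [["a", "b", "c"], ["d", "e", "f"]]
    print(cell(table, 0, 0))               # leftmost cell of the first row
    print(cell(table, 0, 0, "top-right"))  # rightmost cell of the first row
    print(mirror_columns(table))
```

Keeping storage top-left and mirroring only at output time would leave the choice open until there is better evidence about what RTL readers expect.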

@mtigas
Member Author

mtigas commented Apr 7, 2016

Just took a quick look at that branch. Pretty great so far.

Also @jeremybmerrill: For an all-RTL table (no bidi marks), it seems like the right place for the (0,0) logical location is the top-right? At least, based on quickly looking at the example in the Hebrew Wikipedia article on CSV. (There’s no Arabic version of the article, and the Farsi one doesn’t have an example on the page.)

Maybe worth actually crafting better test PDFs instead of this contrived text document. The cases I can think of are:

  • Document completely RTL.
  • Document initially LTR containing some RTL script within some cells, and vice versa.
  • Document mixing all-RTL and all-LTR cells in the same row?
  • Inline numbers -- as you note on the branch pull req. Most of the time numbers should be rendered in their LTR display format, I guess?
  • Probably other really weird combinations that are real-world realistic that I can't think of since my knowledge of RTL languages is still pretty light.

@jeremybmerrill
Member

@mtigas that's a good idea, the Hebrew CSV article.

I'm gonna get the spaces working (which I think is all that's left for coping with bidi cells) in my test case and then can expand to some real world documents.

@jeremybmerrill
Member

So some contrary evidence -- that might be an artifact of LibreOffice, I dunno -- is that opening one of the XLS files from the Israeli Central Bureau of Statistics results in LTR-ordered CSVs:

,2.6,3.2,3.0,4.2,5.8,1.9,         סך הכל
,0.7,1.3,1.1,2.3,3.8,0.1,         סך הכל - לנפש

that is to say, the first cell in the data (as shown by a hex editor) is the one that appears in the top-left of the table when presented visually.

and sweet 1996-style gifs:

[image: israeli-flag-waving]

@jeremybmerrill
Member

And this CSV published by the Israeli newspaper Yediot Ahronot has English headers: http://mediadownload.ynet.co.il/download/gatso_speed_camera_01_2012.csv

I may switch to looking for real-world Arabic examples. In Israel, I wouldn't be surprised if, because a lot of the early computer-literate folks were probably fluent English speakers, there's a default towards English headers and English ordering when RTL is a pain in the ass.

The same may be the case in places like Lebanon and Tunisia, but with French...

@jeremybmerrill
Member

Here's another real-life XLS, this time from Tunisia: http://www.data.gov.tn/index.php?option=com_mtree&task=viewlink&link_id=72&Itemid=187. Exported to CSV -- again subject to the caveat that LibreOffice may be doing it wrong -- the spreadsheet is laid out the same as an English one, except that the row labels are in column 10 (or whatever). That is, the exported CSV is oriented LTR.

@jeremybmerrill
Member

I think the lam-alif ligature bug is actually in LibreOffice generating crappy PDFs, not PDFBox.

And it turns out various PDF generators do all sorts of weird stuff, so I'm going to have to work on the comparisons. (I created a PDF that ought to be identical with Google Drive and it gives totally different results.) Ugh.

(But Google Docs's tables are also a disaster.)

@jeremybmerrill
Member

I believe this was fixed (nearly a year ago!) in 18d6268

I'm sure there are additional RTL bugs, but in at least some cases, it works.

3 participants