RTL text is mirrored #66

mtigas · 2016-03-31T21:16:57Z

Per Eva’s tweet, I started looking into whether we had some issues with Arabic script.

Not sure if this was a bug in the older tabula-extractor.

Anyway, given this file (or any other PDF with Arabic script) and just trying to pull out any run of text, you’ll get output from Tabula and tabula-java that’s mirrored:

reference text:

(Note question mark position in first line.)

After looking into it some, here’s what I’ve dug up:

The PDFBox site here mentions at the bottom that:

> Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew) in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text if the ICU4J jar file has been placed on the classpath (it is an optional dependency). Note that you should also enable sorting with either org.apache.pdfbox.util.PDFTextStripper or org.apache.pdfbox.ExtractText to ensure accurate output.

Our TextElement class rips some bits from that very PDFTextStripper class. Here's ours, noting that it’s "ported from from PDFBox's PDFTextStripper.writePage, with modifications"

tabula-java/src/main/java/technology/tabula/TextElement.java

Line 108 in 7b56c46

/**

(lol "Here be dragons")

So that upstream writePage function has a bunch of extra bits, starting around L629 regarding normalizing RTL scripts. We're missing those bits. Here's the block comment from that part:

/* Before we can display the text, we need to do some normalizing.
 * Arabic and Hebrew text is right to left and is typically stored
 * in its logical format, which means that the rightmost character is
 * stored first, followed by the second character from the right etc.
 * However, PDF stores the text in presentation form, which is left to
 * right.  We need to do some normalization to convert the PDF data to
 * the proper logical output format.
 *
 * Note that if we did not sort the text, then the output of reversing the
 * text is undefined and can sometimes produce worse output then not trying
 * to reverse the order.  Sorting should be done for these languages.
 * */


jeremybmerrill · 2016-03-31T21:21:46Z

~~Is it possible that PDFBox's output is correct? That "mirrored" output may be displaying the text in logical order, right? Hard to know what the expected output on the command-line should be.~~
The expected output on the command line, somehow, is to show up in the correct way for Arabic. Curling aljazeera.net yields <title>الجزيرة نت</title> which is in the right order. When I copy-paste the text into Sublime Text, it is "reversed" into logical order (i.e. illegible in Arabic).

We may have to do some "guessing" to figure out if text in a PDF is RTL, so that we can enable the RTL HTML attrs for preview output. (might be as simple as counting non-numeric word characters and seeing if 50% or more are in a script that's RTL, which is probably doable as a Unicode thingy)
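A minimal sketch of that guessing heuristic, in Python rather than tabula-java's Java so it can run with the standard library alone. The function name `looks_rtl` and the `threshold` parameter are made up for illustration and are not part of Tabula. It counts only characters with a strong bidi class, so digits, punctuation, and combining diacritics don't vote either way:

```python
import unicodedata


def looks_rtl(text: str, threshold: float = 0.5) -> bool:
    """Guess whether text is predominantly right-to-left.

    Hypothetical helper: counts characters whose Unicode bidi class is
    strongly RTL ("R" for Hebrew etc., "AL" for Arabic letters) against
    strongly LTR ones ("L"). Digits, spaces, punctuation and combining
    marks have other classes and are ignored.
    """
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    strong = rtl + ltr
    # With no strong characters at all (e.g. a purely numeric cell),
    # fall back to "not RTL".
    return strong > 0 and rtl / strong >= threshold


if __name__ == "__main__":
    arabic_title = "\u0627\u0644\u062c\u0632\u064a\u0631\u0629 \u0646\u062a"
    print(looks_rtl(arabic_title + " 2016"))
    print(looks_rtl("Tabula 2016"))
```

Whether 50% is the right cutoff, and whether the guess should be made per cell, per page, or per document, is left open here just as it is in the comment above.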

I also have no idea how CSVs in RTL languages work.

Also, dealing with diacritics in these languages will be a disaster.

mtigas · 2016-03-31T21:32:12Z

Here's some output from upstream PDFBox; looks like they’ve got it working right:

Since we're missing their RTL modifications, we're naively taking the visual character order (the character positions on the page, left-to-right) as the logical character order, and not correcting for that.
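To make the visual-versus-logical distinction concrete, here is a deliberately crude Python sketch. It is not what PDFBox or ICU4J do, and it is not the Unicode bidi algorithm; `visual_to_logical` is a hypothetical helper that assumes a line is either untouched LTR text or an RTL line that may contain simple ASCII numbers or Latin words. It reverses the whole line, then flips the ASCII runs back so they still read left to right:

```python
import re
import unicodedata

# A run of ASCII letters/digits, optionally joined by spaces or common
# separators, e.g. "2016", "12.5" or "hello world". Assumption: anything
# matching this should keep its left-to-right order.
LTR_RUN = re.compile(r"[0-9A-Za-z]+(?:[ .,:/-]+[0-9A-Za-z]+)*")


def visual_to_logical(visual: str) -> str:
    """Convert one line from visual (page) order to logical order.

    Heuristic only: leaves the line alone when it has no strongly RTL
    characters; otherwise reverses it and un-reverses ASCII runs.
    """
    has_rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in visual)
    if not has_rtl:
        return visual
    reversed_line = visual[::-1]
    # Reversing the line also flipped numbers and Latin words; flip each
    # such run back so "6102" becomes "2016" again.
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


if __name__ == "__main__":
    # Visual order as read off the page left to right: "2016", a space,
    # then the Arabic word with its letters in reverse of typing order.
    visual = "2016 \u062a\u0646"
    logical = visual_to_logical(visual)
    print([f"U+{ord(ch):04X}" for ch in logical])
```

Diacritics, ligatures, brackets, and nested direction switches are all outside what this sketch handles, which is exactly the territory the upstream normalization code is meant to cover.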

There's some chance that their RTL code (and maybe something in ICU4J) already handles the diacritics (and direction switches, etc.). Maybe.

Also not sure about dealing with CSVs (probably some fun juggling strong and weak characters); for now we should mainly worry about outputting the character stream for a given cell in the correct logical order.

mtigas · 2016-03-31T21:34:27Z

Actually, scratch that: in the last couple lines of PDFBox’s output, it also looks like it’s only partially right (blows up where the directions start getting mixed).

jeremybmerrill · 2016-04-01T03:12:12Z

I'd be happy to try to pick this up. Not sure when I'll get to it though, so if anyone wants to do it first, that's cool too.

jazzido · 2016-04-01T03:14:39Z

As the team's linguist, you're definitely the man for this job. Besides, I need to turn in my thesis 1 month from now :)

jeremybmerrill · 2016-04-02T21:21:50Z

In case anyone's curious!

curl -s www.aljazeera.net/portal | head -n4 | tail -n1 | xxd
   <title>الجزيرة نت</title>
20 20 20 20 3c 74 69 74 6c 65 3e d8a7 d984       <title>.....
sp sp sp sp <  t  i  t  l  e  >  alif lam  
d8ac d8b2 d98a d8b1 d8a9 20 d986 d8aa 3c 2f  ......... ....</
jim   zay  ya   ra   ta  sp  nun  ta  <  /
74 69 74 6c 65 3e 0d 0a                          title>..
t  i  t  l  e  >

Indeed, the way this works is that the text is transmitted in logical order. There's no special RTL mark.
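The same check can be reproduced without curl. This Python sketch takes the Arabic bytes copied from the dump above, decodes them as UTF-8, and lists the characters in stored order; the `BIDI_CONTROLS` set and the `contains_bidi_controls` helper are names invented here for illustration:

```python
import unicodedata

# Bytes of the <title> text, copied from the hex dump in this thread.
HEX_FROM_DUMP = "d8a7 d984 d8ac d8b2 d98a d8b1 d8a9 20 d986 d8aa"

# Explicit direction marks, embeddings, overrides and isolates.
BIDI_CONTROLS = set(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def contains_bidi_controls(text: str) -> bool:
    """Return True if text carries any explicit bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in text)


# bytes.fromhex skips the spaces between the byte groups.
title = bytes.fromhex(HEX_FROM_DUMP).decode("utf-8")

if __name__ == "__main__":
    # Stored order is reading order: the first character is the alif
    # that a reader sees at the far right of the rendered title.
    for ch in title:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    print("explicit bidi controls:", contains_bidi_controls(title))
```

The first character stored is the first one read, and nothing in the byte stream tells the terminal or browser to lay it out right to left; the renderer works that out from the characters themselves.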

jeremybmerrill · 2016-04-02T21:45:08Z

Here's a question: if you have an all-RTL table, is the cell at 0,0 the top-left or top-right cell?

I think the right answer for us is the top-left, like in LTR languages, but it's a bit of a philosophical puzzle. (And worth reconsidering.)

jeremybmerrill · 2016-04-03T23:38:46Z

Useful links:

unicode bidi algo
normalizePres method from PDFBox; calls out to ICU4J
an explanation from the ICU project

and for HTML
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/WAI/GL/WCAG20/WD-WCAG20-TECHS-20071102/H34.html

mtigas · 2016-04-07T22:55:37Z

Just took a quick look at that branch. Pretty great so far.

Also @jeremybmerrill: For an all-RTL table (no bidi marks), it seems like the right place for the (0,0) logical location is the top-right? At least, based on quickly looking at the example on the Hebrew Wikipedia article on CSV. (There’s no Arabic version of the article, and the Farsi one doesn’t have an example on the page.)

Maybe worth actually crafting better test PDFs instead of this contrived text document. The cases I can think of are:

Document completely RTL.
Document initially LTR containing some RTL script within some cells? Vice-versa.
Document mixing all-RTL and all-LTR cells in the same row?
Inline numbers -- as you note on the branch pull req. Most of the time numbers should be rendered in their LTR display format, I guess?
Probably other really weird combinations that are real-world realistic that I can't think of since my knowledge of RTL languages is still pretty light.

jeremybmerrill · 2016-04-07T23:12:12Z

@mtigas that's a good idea, the Hebrew CSV article.

I'm gonna get the spaces working (which I think is all that's left for coping with bidi cells) in my test case and then can expand to some real world documents.

jeremybmerrill · 2016-04-09T21:41:20Z

So some contrary evidence -- that might be an artifact of LibreOffice, I dunno -- is that opening one of the XLS files from the Israeli Central Bureau of Statistics results in LTR-ordered CSVs:

,2.6,3.2,3.0,4.2,5.8,1.9,         סך הכל
,0.7,1.3,1.1,2.3,3.8,0.1,         סך הכל - לנפש

that is to say, the first cell in the data (as shown by a hex editor) is the one that appears in the top-left of the table when presented visually.

and sweet 1996-style gifs:

jeremybmerrill · 2016-04-09T21:50:06Z

And this CSV published by the Israeli newspaper Yediot Ahronot has English headers: http://mediadownload.ynet.co.il/download/gatso_speed_camera_01_2012.csv

I may switch to looking for real-world Arabic examples. In Israel, I wouldn't be surprised if, because a lot of the early computer-literate folks were probably fluent English speakers, there's a default towards English headers and English ordering whenever doing otherwise is a pain in the ass.

The same may be the case in places like Lebanon and Tunisia, but with French...

jeremybmerrill · 2016-04-09T22:04:26Z

Here's another real-life XLS, this time from Tunisia: http://www.data.gov.tn/index.php?option=com_mtree&task=viewlink&link_id=72&Itemid=187. Exported to CSV -- again subject to the caveat that LibreOffice may be doing it wrong -- the spreadsheet is the same as English, but the row labels are just in column 10 (or whatever). That is, the exported CSV is oriented LTR.

jeremybmerrill · 2016-04-10T01:00:09Z

I think the lam-alif ligature bug is actually in LibreOffice generating crappy PDFs, not PDFBox.

And it turns out various PDF generators do all sorts of weird stuff, so I'm going to have to work on the comparisons. (I created a PDF that ought to be identical with Google Drive and it gives totally different results.) Ugh.

(But Google Docs's tables are also a disaster.)

jeremybmerrill · 2017-03-30T15:35:00Z

I believe this was fixed (nearly a year ago!) in 18d6268

I'm sure there are additional RTL bugs, but in at least some cases, it works.

mtigas added the bug label Mar 31, 2016

jeremybmerrill mentioned this issue Apr 3, 2016

first stab at an RTL solution #67

Merged

jeremybmerrill mentioned this issue Apr 9, 2016

second stab at fixing RTL #70

Merged

jeremybmerrill closed this as completed Mar 30, 2017
