RTL text is mirrored #66
Comments
We may have to do some "guessing" to figure out if text in a PDF is RTL, so that we can enable the RTL HTML attrs for preview output. (Might be as simple as counting non-numeric word characters and seeing if 50% or more are in a script that's RTL, which is probably doable as a Unicode thingy.) I also have no idea how CSVs in RTL languages work. Also, dealing with diacritics in these languages will be a disaster.
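A rough sketch of that "count the characters" guess, using only the JDK's `Character` directionality tables. The class and method names (`RtlGuess`, `looksRtl`) are made up for illustration and are not part of tabula-java; the 50% threshold is just the number floated above, not something tuned against real PDFs.

```java
final class RtlGuess {
    /**
     * Returns true when at least half of the letters in the text have a
     * strong right-to-left directionality (Hebrew, Arabic, etc.).
     * Digits, punctuation, whitespace and combining marks are ignored.
     * Hypothetical helper; the 50% cutoff is an assumption.
     */
    static boolean looksRtl(String text) {
        int letters = 0;
        int rtlLetters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // only count "word" characters that are not numeric
            }
            letters++;
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlLetters++;
            }
        }
        // No letters at all means there is nothing to base a guess on.
        return letters > 0 && rtlLetters * 2 >= letters;
    }
}
```

Something like `RtlGuess.looksRtl(pageText)` could then decide whether the preview gets the RTL attrs. It says nothing about mixed-direction text, which needs more than a single yes/no answer.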
Here's some output from upstream PDFBox; looks like they’ve got it working right:
Since we're missing their RTL modifications, we're naively taking the visual character order (the character positions on the page, left-to-right) as the logical character order, and not correcting for that. There's some chance that their RTL code (and maybe something in ICU4J) already handles the diacritics (and direction switches, etc.). Maybe. Also not sure about dealing with CSVs (probably some fun juggling strong and weak characters); for now we should mainly worry about outputting the character stream for a given cell in the correct logical order.
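To make the visual-vs-logical point concrete, here's a minimal sketch for the easy case only: a run that is entirely one direction. For an all-RTL run, the first logical character is the rightmost glyph on the page, so sorting by x descending recovers logical order. `VisualToLogical` and `Glyph` are hypothetical names, not tabula-java or PDFBox classes, and this deliberately ignores mixed-direction runs, which would need the full Unicode bidi algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VisualToLogical {
    /** A character plus its horizontal position on the page (hypothetical type). */
    static final class Glyph {
        final String text;
        final float x;

        Glyph(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Joins the glyphs of a single-direction run in logical order.
     * LTR runs read left to right (ascending x); RTL runs read right to
     * left (descending x). Assumes the whole run has one direction.
     */
    static String logicalOrder(List<Glyph> glyphs, boolean rtl) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        Comparator<Glyph> byX = Comparator.comparingDouble(g -> g.x);
        sorted.sort(rtl ? byX.reversed() : byX);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Taking the visual order as-is is the same as always calling this with `rtl = false`, which is exactly how an RTL run comes out mirrored.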
Actually, scratch that: in the last couple lines of PDFBox’s output, it also looks like it’s only partially right (blows up where the directions start getting mixed).
I'd be happy to try to pick this up. Not sure when I'll get to it though, so if anyone wants to do it first, that's cool too.
As the team's linguist, you're definitely the man for this job. Besides, I need to turn in my thesis 1 month from now :)
In case anyone's curious!
Indeed, the way this works is that the text is transmitted in logical order. There's no special RTL mark.
Here's a question: if you have an all-RTL table, is the cell at 0,0 the top-left or top-right cell? I think the right answer for us is the top-left, like in LTR languages, but it's a bit of a philosophical puzzle. (And worth reconsidering.)
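Whichever answer wins, the two conventions differ only by a mirror of the column order within each row, so it may be easiest to extract top-left-first and flip later if needed. A minimal sketch, with hypothetical names (`TableOrientation`, `mirrorColumns`) that don't exist in tabula-java:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TableOrientation {
    /**
     * Returns a copy of the table with the column order reversed in every
     * row, so the cell that was visually top-right becomes (0,0).
     * Row order is left alone and the input is not modified.
     */
    static List<List<String>> mirrorColumns(List<List<String>> rows) {
        List<List<String>> mirrored = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> copy = new ArrayList<>(row);
            Collections.reverse(copy);
            mirrored.add(copy);
        }
        return mirrored;
    }
}
```

This only reorders cells; the text inside each cell still has to be in logical order on its own.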
Just took a quick look at that branch. Pretty great so far. Also @jeremybmerrill: for an all-RTL table (no bidi marks), it seems like the right place for the (0,0) logical location is the top-right? At least, based on quickly looking at the example on the Hebrew Wikipedia article on CSV. (There’s no Arabic version of the article, and the Farsi one doesn’t have an example on the page.) Maybe worth actually crafting better test PDFs instead of this contrived text document. The cases I can think of are:
@mtigas that's a good idea, the Hebrew CSV article. I'm gonna get the spaces working (which I think is all that's left for coping with bidi cells) in my test case and then can expand to some real-world documents.
So some contrary evidence -- that might be an artifact of LibreOffice, I dunno -- is that opening one of the XLS files from the Israeli Central Bureau of Statistics results in LTR-ordered CSVs:
That is to say, the first cell in the data (as shown by a hex editor) is the one that appears in the top-left of the table when presented visually. And sweet 1996-style gifs:
And this CSV published by the Israeli newspaper Yediot Ahronot has English headers: http://mediadownload.ynet.co.il/download/gatso_speed_camera_01_2012.csv
I may switch to looking for real-world Arabic examples. In Israel, I wouldn't be surprised if, because a lot of the early computer-literate folks were probably fluent English speakers, there's a default towards English headers and English ordering when it's a pain in the ass. The same may be the case in places like Lebanon and Tunisia, but with French...
Here's another real-life XLS, this time from Tunisia: http://www.data.gov.tn/index.php?option=com_mtree&task=viewlink&link_id=72&Itemid=187. Exported to CSV -- again subject to the caveat that LibreOffice may be doing it wrong -- the spreadsheet is laid out the same as an English one, but the row labels are just in column 10 (or whatever). That is, the exported CSV is oriented LTR.
I think the lam-alif ligature bug is actually in LibreOffice generating crappy PDFs, not PDFBox. And it turns out various PDF generators do all sorts of weird stuff, so I'm going to have to work on the comparisons. (I created a PDF that ought to be identical with Google Drive and it gives totally different results.) Ugh. (But Google Docs's tables are also a disaster.)
I believe this was fixed (nearly a year ago!) in 18d6268. I'm sure there are additional RTL bugs, but in at least some cases, it works.
Per Eva’s tweet, started looking into whether we had some issues with Arabic script.
Not sure if this was a bug in the older tabula-extractor. Anyway, given this file (or any other PDF with Arabic script) and just trying to pull out any run of text, you’ll get output from Tabula and tabula-java that’s mirrored:
reference text:
[image: reference]
(Note question mark position in first line.)
After looking into it some, here’s what I’ve dug up:
The PDFBox site here mentions at the bottom that
Our TextElement class rips some bits from that very PDFTextStripper class. Here's ours, noting that it’s "ported from from PDFBox's PDFTextStripper.writePage, with modifications":
tabula-java/src/main/java/technology/tabula/TextElement.java
Line 108 in 7b56c46
So that upstream writePage function has a bunch of extra bits, starting around L629, regarding normalizing RTL scripts. We're missing those bits. Here's the block comment from that part: