Test Scrivener's New .docx Converters

KB
Site Admin
Posts: 20709
Joined: Tue Jun 13, 2006 11:23 pm
Platform: Mac
Location: Truro, Cornwall

Wed May 01, 2019 2:37 pm Post

Those of you who like to read these things may have noticed this hidden away in the release notes for Scrivener 3.1.2:

  • Scrivener now contains a brand new, native, in-house converter for Word .docx files (affecting import, export and Compile). Because this converter has not yet been tested widely enough, it is turned off by default - by default Scrivener still uses the mature Java-based converters from third-party company Aspose, as it has been doing for several years now. If you would like to test the new converters, open Preferences and then, under Sharing > Conversion, turn off “Use enhanced converters for Microsoft Word and OpenOffice documents”. This will turn off the Aspose Java converters, causing Scrivener to use the new in-house .docx converter instead. (Note that this will also result in poor OpenOffice conversions, however.)
  • When the Java-based Aspose converters for creating Word and Office files fail, Scrivener now shows a message to warn the user that Apple’s more basic converters will be used instead.
  • When the Aspose converters fail, Scrivener now falls back on the in-house converter rather than the low-fidelity Apple converter.


Converting to and from Word format is an important feature of Scrivener, but it's not actually that easy to do. Apple provides some standard converters, but they are terrible - they don't support images, comments or footnotes, they lose line spacing, and they obliterate other formatting (needless to say, Apple does not use them in its own apps). For the past few years, therefore, Scrivener has used a third-party, Java-based converter (from Aspose) for importing and exporting DOCX files. Apple's own converter is only ever used if the Java converter fails for some reason; you could also force Scrivener to use Apple's converter instead of the Java-based one by turning off "Use enhanced converters for Microsoft Word and OpenOffice documents" in Scrivener's Sharing > Conversion preferences.

The Aspose Java-based converters work very well, but as Apple tightens app security, my fear is that it is eventually going to be very difficult to call a Java app from a Cocoa one.

Meanwhile, on iOS, things are even more difficult: Apple's own semi-functional converter isn't even available there, and it's not possible to build in and invoke a separate Java process as you can on the Mac. For our iOS version, then, I built my own .docx converter that did just enough - everything that the iOS version required. This involved getting fairly familiar with the OfficeOpenXML docs and writing a lot of XML code. Fun!

Over the past few months, I have begun to expand on that work by making it cross-platform (Mac and iOS rather than just iOS) and adding to it support for everything that the Mac version of Scrivener needs for its many Compile features. As a result of this work, as of Scrivener 3.1.2, Scrivener now uses my own custom .docx converters when the Java converters either fail or are turned off via the Preferences.

Building these converters to do everything required for the Mac version has been quite complex, and they require a lot more testing. At some point, they may well replace the Java converters for .docx conversion, but they need more real-world use first. I'd therefore be grateful to anyone who is feeling intrepid or experimental enough to use them instead of the Java converters and report any issues with them.

To test out my custom converter, all you need to do is this:

1. Open Scrivener's Preferences (Scrivener > Preferences...).

2. Click on "Sharing" in the Preferences panel toolbar.

3. Select "Conversion".

4. Deselect "Use enhanced converters for Microsoft Word and OpenOffice documents." (Note that the information under this checkbox is currently misleading when it comes to .docx - it says that if you un-tick it, the macOS standard converters will be used and you may lose formatting. This is no longer true, but I'll be leaving the text like this to discourage users turning off the setting until I know my own converters are rock solid.)

Important Note: If you turn off this setting, the standard macOS converters *will* be used for OpenOffice and legacy .doc files, so only do this if you don't use those formats much.

We have tested lots of documents using our own converter and it should be working well, creating and importing documents just as well as the Java converter. However, it is brand new and there are bound to be issues we have missed, so we're thankful for any issues you find and report - please post anything you find in a reply to this thread, or email us at mac.support AT literatureandlatte.com. And of course, if you do find any show-stopping bugs, you can easily switch back to using the Java-based converters.

Thanks and all the best,
Keith
"You can't waltz in here, use my toaster, and start spouting universal truths without qualification."

xiamenese
Posts: 3971
Joined: Mon Jan 29, 2007 1:32 am
Platform: Mac
Location: London or Exeter, UK.

Wed May 01, 2019 5:44 pm Post

Just tried compiling Shirley's English–Chinese WIP—with which I'm helping—using your converter to .docx on my 13" MBP. 187 pages of text in Chinese, compile time 5 secs, opened automatically in Word (v. 15.13.3 so quite old) in 10 secs. Text is perfect, though it is straightforward text with no images, footnotes, comments, etc.

Great job!

:)

Mark
The Scrivenato sometimes known as Mr X.
rMBP 13" (early 2015) 10.14.5, 8GB RAM, 512GB SSD
MBP17" (late 2011) 10.13.6, 8GB RAM, 512GB SSD
2017 iPad, iOS 12.3.1, 128GB, Apple Pencil
Scrivener, Scapple, Nisus Writer Pro, Bookends …

KB

Wed May 01, 2019 7:01 pm Post

Thanks Mark! I'm glad it's working so far.
All the best,
Keith


KB

Thu May 02, 2019 8:06 pm Post

Many thanks for those files, they are really useful. I'm not seeing any crashes with those documents in my current development version, which is good news - I fixed a crash in the importer recently so it looks like it has fixed things for these documents too.

I have spotted a few inconsistencies when importing those documents, though, with fonts and issues in tables. I'm working on that now. My main focus is always on ensuring that export/Compile is 100% solid and reliable, but obviously I want import to work as well as possible too.

All the best,
Keith

nontroppo
Posts: 1147
Joined: Mon Mar 05, 2007 5:22 pm
Platform: Mac
Location: Airstrip One

Sun May 05, 2019 4:06 am Post

Hi Keith, gosh writing your own docx converter is a walk on hot coals I'm sure!!!

One point I see is compiler image size conversion. If I have an image at the default sizing:

[Image: Screenshot 2019-05-05 at 11.55.47_SMALL.png]


In Word the size becomes this:

[Image: Screenshot 2019-05-05 at 11.56.08_SMALL.png]


The "final" size itself seems fine ((1410px / 72dpi) * 2.54 = 49.74cm width), but the original size isn't, and the scale is set to 200% to compensate?

xiamenese

Sun May 05, 2019 9:17 am Post

Are you working on a retina screen? When you open an image in Graphic Converter, you are presented with an option of "Full Size" or "Smaller". "Smaller" is retina sizing, full size is non-Retina and reports at 200%.

Could that be involved?

Mark

nontroppo

Sun May 05, 2019 3:13 pm Post

Yes, I have a retina screen, though I am confused about why you would scale an image to 200% on a retina screen (it is already downsampling automagically). The default result is a massive image that does not fit the Word page width and gets cut off. Actually, I tried the Aspose docx engine option in the compiler and it generates the same scaling, so Keith is probably just trying to stay backwards compatible with the Aspose engine.

Pandoc docx writer has much more useful behaviour, in that it rescales images that do not specify dimensions so that they fit the width of the Word page as a default. This is also how the Scrivener editor visualises images in page view. Why wouldn't you want this as the default behaviour?

The performance improvement between Keith's converter and the Aspose one is striking!

xiamenese

Sun May 05, 2019 4:00 pm Post

nontroppo wrote:Yes, I have a retina screen, though I am confused about why you would scale an image to 200% on a retina screen (it is already downsampling automagically). The default result is a massive image that does not fit the Word page width and gets cut off. Actually, I tried the Aspose docx engine option in the compiler and it generates the same scaling, so Keith is probably just trying to stay backwards compatible with the Aspose engine.

I always load images in the retina-aware mode, though sometimes I then magnify to 200% for precision manipulation.

nontroppo wrote:Pandoc docx writer has much more useful behaviour, in that it rescales images that do not specify dimensions so that they fit the width of the Word page as a default. This is also how the Scrivener editor visualises images in page view. Why wouldn't you want this as the default behaviour?

Mmm. I have tried compiling a project which includes images to .docx. My version of Word, at least, chokes on images which have spaces in their names, though I understand Keith has already set up a workaround for that in the next update. Apart from that, interestingly, only one image, downloaded from the web, displayed improperly, not reducing to fit within the right margin; its dimensions were easy enough to correct, even for someone who loathes Word and doesn't understand its arcana! So I just wondered if the problem you identify might not be more on Word's part. Even if it is, I would trust Keith to find a workaround. :)

nontroppo wrote:The performance improvement between Keith's converter and the Aspose one is striking!

Agree 100%

:D

Mark

KB

Tue May 07, 2019 8:00 am Post

nontroppo - I can't reproduce this on my end. Are you using any of the resizing or scaling options in the Compile options, for instance the option to scale the image to the page width? Could you provide a sample project? Thanks.

nontroppo

Wed May 08, 2019 2:10 am Post

KB wrote:nontroppo - I can't reproduce this on my end. Are you using any of the resizing or scaling options in the Compile options, for instance the option to scale the image to the page width? Could you provide a sample project? Thanks.


Actually, I hadn't seen this compile setting before, so it was turned off... So we can expect that the image is not scaled to the page width, but why is it at 200% rather than 100%? Anyway, the setting solves the real issue, but if you want to scratch an itch on the 200%, I have attached a test project.

Test.scriv.zip

KB

Wed May 08, 2019 2:42 pm Post

Thanks for the test file. It seems that the difference is down to using a linked image that is then embedded during the Compile process. However, I'm completely baffled as to how it is happening - I think I have to call shenanigans on Word for this one. Word must be examining the images in some odd ways and finding differences in their dimensions that aren't obvious. I come to this conclusion because I tried embedding an image and also inserting it as a linked image, and the data for both came out exactly the same in Word, and yet Word reported it as 200%.

You can see that this doesn't seem to be Scrivener's doing yourself:

1. Generate the Word .docx file.
2. Change the extension .docx to .zip in the Finder.
3. Unzip the file using something like Stuffit Expander (Apple's own archive utility most likely won't unzip it properly).

Now drill down, open word/document.xml and look for the <w:drawing> data. Check out the "wp:extent" element, which determines the size of the image in Word. For your image, this is given as:

<wp:extent cx="28905200" cy="15697200"/>


The sizes here are given in EMUs, which are 12700 to a point.
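The unzip-and-inspect steps above can also be scripted. This is only a minimal Python sketch, assuming the exported .docx is an ordinary zip archive that the standard zipfile module can open; the function name read_extents is made up for illustration and is not part of Scrivener or Word.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the wp: prefix in word/document.xml.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def read_extents(docx):
    """Return a list of (cx, cy) pairs in EMUs, one per wp:extent element.

    `docx` may be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    extents = []
    for element in root.iter(f"{{{WP_NS}}}extent"):
        extents.append((int(element.get("cx")), int(element.get("cy"))))
    return extents
```

Calling read_extents on a compiled file (for example read_extents("export.docx"), where the file name is a placeholder) should give one pair per image, which can then be compared with the pixel dimensions of the files in the word/media folder.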

Now open FleschFig1.png from the word/media folder and examine its dimensions in e.g. Preview. You'll see that it is 72dpi with a size of 2276 x 1236.

Well:
2276 x 12700 = 28905200
1236 x 12700 = 15697200

So, we have a 72dpi file and its size is set correctly. So why is it reporting it as scaled to 200%?
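The same arithmetic as a small Python sketch, for checking other images. It relies on the Office Open XML definition of 914400 EMUs per inch (which gives the 12700 per point quoted above); the function names and the 72dpi default are assumptions made for this example.

```python
EMU_PER_INCH = 914400
EMU_PER_POINT = EMU_PER_INCH // 72  # 12700, as quoted above


def pixels_to_emu(pixels: int, dpi: int = 72) -> int:
    """Size in EMUs of `pixels` pixels at the given resolution."""
    return round(pixels * EMU_PER_INCH / dpi)


def emu_to_cm(emu: int) -> float:
    """Convert EMUs to centimetres (2.54 cm per inch)."""
    return emu / EMU_PER_INCH * 2.54


# The 2276 x 1236 image at 72dpi from the example above.
print(pixels_to_emu(2276), pixels_to_emu(1236))  # prints 28905200 15697200
```

If the values printed for an image match the cx and cy in its wp:extent element, the size written to the file is the image's natural size at that resolution.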

It gets weirder when you try the same with the attached project, which has the same image embedded in different ways. If you examine the images and <w:drawing> info in the unzipped exported Word file, you'll see that the images are both 72dpi with the same dimensions, the size given in <w:drawing> is the same, and yet Word reports one as being 200% and the other as 100% (but they look the same in the editor).

Bizarre!
Attachments
ImgSizeTest.zip

nontroppo

Thu May 09, 2019 12:30 am Post

KB wrote:It gets weirder when you try the same with the attached project, which has the same image embedded in different ways. If you examine the images and <w:drawing> info in the unzipped exported Word file, you'll see that the images are both 72dpi with the same dimensions, the size given in <w:drawing> is the same, and yet Word reports one as being 200% and the other as 100% (but they look the same in the editor).


Yes, something strange. There are differences in the embedded/linked images though (unzipped word/media folder):

[Image: Screenshot 2019-05-09 at 08.07.05.png]


The linked image (ZXScrivener-1.jpg) gets compressed while the embedded one doesn't. Using exiftool (install with homebrew, or https://www.sno.phy.queensu.ca/~phil/ex ... index.html), you can see differences in the metadata for the two files; the most significant is that Resolution Unit is set to None in the compressed file but to inches in the original:

File Size                       : 21 kB
File Name                       : ZXScrivener-1.jpg
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
JFIF Version                    : 1.01
Resolution Unit                 : None
X Resolution                    : 72
Y Resolution                    : 72
Exif Byte Order                 : Big-endian (Motorola, MM)
Create Date                     : 2010:02:02 00:36:20
Color Space                     : sRGB
Exif Image Width                : 580
Exif Image Height               : 484
XMP Toolkit                     : XMP Core 5.4.0


File Size                       : 60 kB
File Name                       : ZXScrivener.jpg
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
JFIF Version                    : 1.02
Exif Byte Order                 : Big-endian (Motorola, MM)
Orientation                     : Horizontal (normal)
X Resolution                    : 72
Y Resolution                    : 72
Resolution Unit                 : inches
Software                        : Adobe Photoshop CS4 Macintosh
Modify Date                     : 2010:02:02 00:36:20
Color Space                     : sRGB
Exif Image Width                : 580
Exif Image Height               : 484
Compression                     : JPEG (old-style)
Thumbnail Offset                : 332
Thumbnail Length                : 5977
IPTC Digest                     : 00000000000000000000000000000000
Displayed Units X               : inches
Displayed Units Y               : inches


Lots of other colourspace metadata is also removed in the compressed one, but I suspect the resolution unit difference is the more likely culprit. Whether that is what triggers Word to show 200%, or whether it even makes sense for Word to do so, is another matter!
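For anyone without exiftool, the JFIF density unit can be read straight from the file header. This is a rough Python sketch that only looks at the JFIF APP0 segment (units byte 0 = none, 1 = dots per inch, 2 = dots per cm); it ignores Exif resolution tags entirely, and the function name read_jfif_density is invented for the example. Whether this field is what Word actually reacts to remains a guess.

```python
import struct

JFIF_UNITS = {0: "none", 1: "inches", 2: "cm"}


def read_jfif_density(data: bytes):
    """Return (unit, x_density, y_density) from the JFIF APP0 segment, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker; give up rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: header segments are over
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 12:
            # payload[5:7] is the version; then units, Xdensity, Ydensity.
            units, x_density, y_density = struct.unpack(">BHH", payload[7:12])
            return JFIF_UNITS.get(units, "unknown"), x_density, y_density
        pos += 2 + length
    return None
```

Reading each file from the unzipped word/media folder as bytes and passing it to read_jfif_density should show the same unit difference that exiftool reports for files that carry a JFIF header.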

Aspose compressed BOTH embedded/linked images, and it results in significantly worse image quality (look at the JPG compression artifacts):

[Image: Screenshot 2019-05-09 at 07.52.06.png]


I don't really understand why Scrivener compresses the linked image but not the embedded one, or why Aspose compresses both, if there is no resizing in the editor and no rescaling in the compiler?

KB

Thu May 09, 2019 9:39 am Post

nontroppo wrote:I don't really understand why Scrivener compresses the linked image but not the embedded one, or why Aspose compresses both, if there is no resizing in the editor and no rescaling in the compiler?


I'm not sure why Aspose compresses both - that must be down to Aspose as the same text is passed through to both converters. I can answer why this happens in Scrivener, though - it's down to a technicality. To embed the linked images into the text before export, I have to load their data and create an embedded image wrapper for them. If the image has been resized, I have to resize the actual image (changing the resolution) in order for it to work (it's the only way to support image resizing). At this point, I have to create a new version of the image data.

But here's the rub: there is no way (at least no way I can find) in Cocoa of retrieving the original compression from a JPEG file. So when new JPEG data is created, I have to apply an arbitrary compression factor. In Scrivener, you can set this via the Sharing > Export preferences.

There is one hitch in the way I'm doing things at the moment: the code that embeds linked images always generates the JPEG data from the resizing code, even if there is no need to resize (i.e. it just passes in a scale factor of 1.0). I've fixed this for the next update so that if there is no need to alter the size of the image, the data used by the embedded image will just be that of the original JPEG. This solves the problem of the 200% in your example file. As soon as an image is resized, though, Scrivener has to use the JPEG compression setting from the Preferences.

Note: this only applies to JPEG files, of course.
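The logic of that fix can be illustrated with a short Python sketch. This is not Scrivener's code (which is not shown in this thread); the function name, the reencode callable and the quality parameter are all hypothetical stand-ins for the resizing code and the JPEG compression preference described above.

```python
from typing import Callable


def jpeg_data_for_embedding(
    original: bytes,
    scale: float,
    reencode: Callable[[bytes, float, float], bytes],
    quality: float,
) -> bytes:
    """Pick the JPEG bytes to embed for a linked image (illustrative only)."""
    if scale == 1.0:
        # No resize needed: keep the original bytes, so the original
        # compression and metadata survive untouched.
        return original
    # A resize forces new JPEG data, so an arbitrary quality setting
    # has to be applied to the re-encoded image.
    return reencode(original, scale, quality)
```

The point of the early return is that an unresized image never passes through the encoder at all, so it cannot pick up a different compression level or lose metadata along the way.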

All the best,
Keith

nontroppo

Thu May 09, 2019 11:21 am Post

OK, makes sense! Exiftool doesn't show what the JPEG compression is either, and from what I gleaned online, storing this info is not part of the JFIF spec at all. It probably doesn't matter too much anyway: JPEG is lossy, so whenever a 75% JPEG is saved again at 75%, it will gradually become worse AFAIK.