Case transformation regexes \u and \U do not work in compile

jandavid
Posts: 38
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac

Tue Jul 10, 2018 10:51 pm Post

This was discussed already a while ago here (https://www.literatureandlatte.com/foru ... se#p224631), and was recommended to be posted in the Bug Hunt forum but, alas, I never did. My apologies!

Yet, here we go again, new project, same problem ...

In the compile replacement tabs, when using regexes, case transformations such as \u or \U do not seem to work.

How to reproduce:
Compile a document with some words in it :)
Enter (\w+?) in the replace column and \u$1 in the "With" column. Check RegEx.
Compile.
Every "word" is changed to "\uword" when it should be changed to "Word".

Thank you very much for looking into that (or helping me figure out what I'm doing wrong).

jandavid
Posts: 38
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac

Wed Jul 11, 2018 7:58 pm Post

In addition, I've now realized that in the with column of replacements \n or \r are also interpreted as a literal "n" and "r".

AmberV
Posts: 21665
Joined: Sun Jun 18, 2006 4:30 am
Platform: Mac + Linux
Location: Santiago de Compostela, Galiza

Thu Jul 12, 2018 12:42 am Post

According to Apple’s documentation, it is ICU compliant. It’s very close to PCRE, but it doesn’t support any special backslash transformations or string replacements. To insert a whitespace such as tab or return, you have to insert the literal string (which you can do into fields with ⌥⇥ and ⌥↩). Backslash only works for backslashes and $.
.:.
Ioa Petra'ka
“Whole sight, or all the rest is desolation.” —John Fowles

jandavid
Posts: 38
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac

Thu Jul 12, 2018 5:04 am Post

You're right, the literal return works ...
that's a workaround for now. But what about case transformations?

I don't have sufficient programming knowledge to fully understand what you write (assuming it's an Apple problem given the links?), but is there any workaround for changing the case of the replacement string?

AmberV
Posts: 21665
Joined: Sun Jun 18, 2006 4:30 am
Platform: Mac + Linux
Location: Santiago de Compostela, Galiza

Thu Jul 12, 2018 3:24 pm Post

Yes, the regular expression engine is a framework provided by the Mac, it’s not something we can modify ourselves. The operators supported in the replacement pattern are pretty limited—table 3 in the documentation lists everything you can do. Namely: insert capture groups with $1, $2, etc., and insert the characters “$” and “\”.

As for workarounds, using the Processing compile option pane, you could make use of other regular expression engines to further manipulate the output in ways Replacements cannot. Beyond simple command-line approaches, that can dip into programming however.

jandavid
Posts: 38
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac

Thu Jul 12, 2018 6:38 pm Post

AmberV wrote: As for workarounds, using the Processing compile option pane, you could make use of other regular expression engines to further manipulate the output in ways Replacements cannot. Beyond simple command-line approaches, that can dip into programming however.


I do postprocess on the command line with `pandoc`. To do that and the replacements together in one go via a script would indeed be elegant. Would you be able to help me implement this? I'm sorry, I'm not a programmer, my knowledge ends with regexes and what I find on the internet. ...

I have different compile formats defined for different purposes all using pandoc. Postprocessing arguments look like this

Code: Select all

<$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120  -o <$outputname>.tex

or

Code: Select all

-t docx --bibliography=$HOME/.pandoc/Bibliography.bib -M reference-section-title=References -N --reference-doc=$HOME/.pandoc/templates/refdoc-num-headings.docx -o <$outputname>.docx


I also have one format defined with the filesplitter script that you had posted here (viewtopic.php?f=2&t=52114&p=267229&hilit=split+multiple+files#p267229), which I pasted into the script field in Scrivener, replacing the MultiMarkdown command that you had provided with this:

Code: Select all

 `pandoc -f markdown-auto_identifiers -t latex --biblatex --top-level-division=chapter --columns=120 -o #{filename} #{tmpfile.path}`


Works flawlessly BTW, I'm always impressed when I use some code that I don't understand half of it, and it does some magic for me :wink: :)

I assume I could do all of the postprocessing as different scripts, where I first execute the regexes I need and then process it via pandoc similar to this one. If you could help me with a script that I could modify, where I could add some regexes, that would be awesome (I can probably construct the regexes myself.)

Thank you very much!

AmberV
Posts: 21665
Joined: Sun Jun 18, 2006 4:30 am
Platform: Mac + Linux
Location: Santiago de Compostela, Galiza

Fri Jul 13, 2018 6:22 pm Post

Well, one cool thing about shell scripts is that at their most basic they can be thought of as merely a sequence of individual commands that you’d input by hand into Terminal one after the other. There is of course much more that can be done with them, but if all you need to do is run sed or something first, and then pandoc to finish it off, you can put both lines into the “Script” field. So that’s one really easy way to automate or chain together several tools.

But for simple cases, it may be better to use pipes. I provide an example of this in the Processing pane documentation, bottom of page 670. This example takes MMD output and injects it into the clipboard instead of making a file when you compile. The principle can be applied to other things however, such as:

Path:

Code: Select all

/usr/bin/sed


Arguments:

Code: Select all

-E 's/replace/with/' <$inputfile> | pandoc ...


It’s a little quirky because you’re putting the first part of the command in one field and two commands in the second, as arguments to the first, but separating path from arguments is a bit of an artificial contrivance anyway. The result that is sent to the shell is “path arguments”, so as long as you recognise that all of this will end up on the same line together, you can do most of the stuff you would do in a “one-liner” in Terminal.

Naturally you would need to modify the Pandoc command slightly to take standard input from the pipe, which will have the text that is modified by sed, instead of opening the original file. The output would remain the same, as you still want a file in the end, and you want Pandoc to create it.

In the case of the Ruby splitter script (glad to hear you’re getting good use out of it ;)), then that would be a decent place for the transformation, since we’re already processing the full text. Try something like the following. In the script, look for the line of code in the first line given below, and paste in the second line after it:

Code: Select all

...

next if chunk.length < 1
chunk.gsub!(/PATTERN/) { |match| match.capitalize }

...


Put your regular expression into the “PATTERN” spot, between the slashes, and see if that does what you’re looking for. A lot of that syntax is pretty magic and should be left alone, but that “match.capitalize” should be pretty straightforward, and you should know you can do other things there if you want. Capitalize will upcase the first character in the matched string (and downcase the rest), which I think is what you want. But if not, let me know; there really is no limit to what can be done to the matched string.

Oh and something worth mentioning is that in the example above, the whole string that is matched gets stored in the ‘match’ variable for processing, so there is no need to use parentheses in your pattern. “\w+?” would suffice.
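
To make the behaviour of the block form concrete, here is a minimal sketch with made-up sample strings (everything except the `gsub` and `capitalize` calls is an illustrative assumption, not part of the splitter script). It uses a greedy `\w+`, because a lazy `\w+?` with nothing following it matches one character at a time. It also shows that `capitalize` lowercases everything after the first character, plus one possible way around that.

```ruby
# Made-up sample text; in the splitter script this would be `chunk`.
sample = "macOS and iOS"

# Greedy \w+ matches each whole word, so capitalize sees "macOS", "and", "iOS".
# capitalize upcases the first character and downcases the remainder.
capitalized = sample.gsub(/\w+/) { |match| match.capitalize }

# To change only the first character, upcase it separately and keep the rest.
first_only = sample.gsub(/\w+/) { |match| match[0].upcase + match[1..-1] }

# A lazy \w+? with nothing after it matches single characters,
# so every letter ends up being "capitalized" on its own.
lazy = "word".gsub(/\w+?/) { |match| match.capitalize }

puts capitalized  # Macos And Ios
puts first_only   # MacOS And IOS
puts lazy         # WORD
```

If the lazy quantifier is followed by something else in the pattern (a literal brace, for instance), it extends as far as needed and behaves as expected.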

jandavid
Posts: 38
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac

Sun Jul 15, 2018 1:29 am Post

Thank you, this is immensely helpful.
AmberV wrote: ... if all you need to do is run sed or something first, and then pandoc to finish it off, you can put both lines into the “Script” field. So that’s one really easy way to automate or chain together several tools.

I've played around with it a bit and I think I can get it to do everything I need, but I'd need some more help. I tried pasting two sed commands and then the pandoc line into the script field as you suggest, but I'm doing something wrong as I get a "$inputfile: ambiguous redirect" error message.

Code: Select all

sed 's/\\(\w+?)\{\}/\\\u$1\{\}/g;'
sed 's/\\label{\S+?}//g;'
pandoc <$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120 --top-level-division=chapter -o <$outputname>.tex


Any suggestions?
I do like the replacements tab that Scrivener provides, as it gives a great overview over all that's going on, so having it all in the script field would be preferable over pipes, which would make the whole thing too long (with the additional benefit, that I could place a comment after each regex to remind me of what they are doing). What would be the best way to do this? A string of simple sed commands one after the other as I'm trying above? Or would it make sense to turn it into an actual script? like perl? ... but I'm not sure how to pass the arguments <$inputfile> <$outputname> to the script ...(which also seems the problem above) ...

Some of the regex patterns I need to reproduce would be:

Code: Select all

s/\\(\w+?)\{\}/\\\u$1\{\}/g; # Capitalize leipzig glosses
s/\\label{\S+?}//g; # delete all LaTeX labels for Word export
s/\\ref{\S+?}//g; # delete all crossrefs
s/(\[\@\w+\:.+?\])\s\./$1\./g; # fix for extra space before period after citation.
s/^#\s+?(.+?)\s+?\{\.unnumbered\}/\:\:\: \{custom-style=\"Unnumbered Heading\" 1\}\n$1\n\:\:\:/g; # convert unnumbered section to custom word style
s/(^\\\w+?\[??.*?\]??\{.*?\}\s*?)\%+?\s*?(.*)/$1 <!-- $2 -->/g; # convert LaTeX to HTML comment so that Pandoc ignores them (otherwise it  escapes the % sign)


There's probably a more elegant solution to this, but that's all that I can do with my (and google's) expertise. Could you help me properly frame this?
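
One possible way to frame this is a small standalone Ruby script that keeps the rules in a table, each with a note, and then hands the result to pandoc on standard input. This is only a sketch under assumptions: the file name `preprocess.rb`, the `RULES` and `preprocess` names, and the idea of passing the input and output paths as two command-line arguments are all hypothetical, and how Scrivener substitutes `<$inputfile>` and `<$outputname>` for such a script has not been verified here.

```ruby
require "open3"

# Each rule is [pattern, replacement, note]. The replacement is either a
# string (\1, \2 refer to capture groups) or a lambda receiving the MatchData.
RULES = [
  [/\\label\{\S+?\}/, "", "delete all LaTeX labels"],
  [/\\ref\{\S+?\}/, "", "delete all crossrefs"],
  [/\\(\w+?)\{\}/, ->(m) { "\\#{m[1].capitalize}{}" }, "capitalize Leipzig glosses"],
].freeze

# Apply every rule in order and return the transformed text.
def preprocess(text)
  RULES.reduce(text) do |current, (pattern, replacement, _note)|
    if replacement.respond_to?(:call)
      current.gsub(pattern) { replacement.call(Regexp.last_match) }
    else
      current.gsub(pattern, replacement)
    end
  end
end

# Hypothetical invocation: ruby preprocess.rb INPUT.md OUTPUT.tex
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  input_path, output_path = ARGV
  cleaned = preprocess(File.read(input_path))
  # With no input file named, pandoc reads the document from standard input.
  command = ["pandoc", "-f", "markdown-auto_identifiers", "-t", "latex",
             "--biblatex", "--columns=120", "-o", output_path]
  _stdout, status = Open3.capture2(*command, stdin_data: cleaned)
  abort("pandoc failed") unless status.success?
end
```

Keeping the rules in one table gives roughly the same overview as the Replacements tab, and the third column is a natural place for the reminder comments.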


And then for the ruby script:
AmberV wrote: ... Capitalize will upcase the first character in the matched string (and downcase the rest), which I think is what you want. But if not, let me know; there really is no limit to what can be done to the matched string.


Capitalizing works, yes! But, as you suspect, the example is not as simple as I had put it in the post. I'd need to do some similar regexes like the ones above, where I exclude part of the pattern and add stuff to the matched string, or matching multiple capturing groups, like the example above putting a comment into HTML tags. And I don't know how to translate the simple s/.../.../g syntax into ruby. ... :?

What does seem to work is to place multiple
chunk.gsub!(/PATTERN/) { ... }
chunk.gsub!(/PATTERN/) { ... }
one after the other, so if I figure out how to write the correct patterns in ruby this can probably go a long way ...
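
As a rough guide to the translation, here is a hedged sketch of four of the `s/.../.../g` rules from this post rewritten as `gsub!` calls. The method name `apply_rules` and the variable `text` (standing in for `chunk`) are illustrative assumptions. Two points carry most of the weight: in the string form of the replacement, `\1` takes the place of `$1`, and in the block form `$1` and `$2` are ordinary Ruby strings that can be transformed before they are put back. Braces are escaped in the patterns to keep them unambiguous, and in Ruby `^` matches at the start of every line.

```ruby
# Sketch: Perl/sed-style s/PATTERN/REPLACEMENT/g rules as Ruby gsub! calls.
def apply_rules(text)
  result = text.dup

  # s/\\(\w+?)\{\}/\\\u$1\{\}/g  (capitalize Leipzig glosses)
  # Block form: $1 is the first capture group of the current match.
  result.gsub!(/\\(\w+?)\{\}/) { "\\#{$1.capitalize}{}" }

  # s/\\label{\S+?}//g  (delete LaTeX labels)
  result.gsub!(/\\label\{\S+?\}/, "")

  # s/(\[\@\w+\:.+?\])\s\./$1\./g  (no space between citation and period)
  # String form: \1 inside single quotes plays the role of $1.
  result.gsub!(/(\[@\w+:.+?\])\s\./, '\1.')

  # LaTeX % comment after a macro becomes an HTML comment (two groups).
  # The strip calls tidy the whitespace that the lazy \s*? leaves behind.
  result.gsub!(/(^\\\w+?\[??.*?\]??\{.*?\}\s*?)%+?\s*?(.*)/) do
    "#{$1.rstrip} <!-- #{$2.strip} -->"
  end

  result
end

puts apply_rules("\\nom{} and \\acc{}")
puts apply_rules("as shown [@smith:2010] .")
puts apply_rules("\\gls{nom} % gloss macro")
```

Inside the splitter script, the same calls could be written directly against `chunk` instead of being wrapped in a method.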

Thank you very much for your help!

nontroppo
Posts: 1005
Joined: Mon Mar 05, 2007 5:22 pm
Platform: Mac
Location: Airstrip One

Sun Jul 15, 2018 5:20 am Post

I'm currently very short on time so can't help with regexes, but one general point: the whole purpose of a tool like pandocomatic is that it provides a principled way to manage Pandoc and the scripts to run. It allows you to run general setup/cleanup scripts, direct "pipe" scripts (pre- and post-processors that work on the raw character stream), and manage Pandoc filters (very cool functionality that works on the semantic chunks of Pandoc documents). You don't need it, but it provides a more elegant way to combine all of these disparate elements into templates that are simply specified from Scrivener.

https://heerdebeer.org/Software/markdow ... -templates