## Case transformation regexes \u and \U do not work in compile

jandavid
Posts: 40
Joined: Wed Apr 06, 2016 8:41 am
Platform: Mac
This was discussed already a while ago here (https://www.literatureandlatte.com/foru ... se#p224631), and was recommended to be posted in the Bug Hunt forum but, alas, I never did. My apologies!

Yet, here we go again, new project, same problem ...

In the compile replacement tabs, when using regexes, case transformations such as \u or \U do not seem to work.

How to reproduce:
Compile a document with some words in it.
Enter (\w+?) in the "Replace" column and \u$1 in the "With" column. Check RegEx. Compile.
Every "word" is changed to "\uword" when it should be changed to "Word".

Thank you very much for looking into that (or helping me figure out what I'm doing wrong).

jandavid
In addition, I've now realized that in the "With" column of replacements, \n and \r are also interpreted as a literal "n" and "r".

AmberV
Posts: 21681
Joined: Sun Jun 18, 2006 4:30 am
Platform: Mac + Linux
Location: Santiago de Compostela, Galiza
According to Apple’s documentation, it is ICU compliant. It’s very close to PCRE, but it doesn’t support any special backslash transformations or string replacements. To insert whitespace such as a tab or return, you have to insert the literal character (which you can do in fields with ⌥⇥ and ⌥↩). Backslash only works for backslashes and $.
.:.
Ioa Petra'ka
“Whole sight, or all the rest is desolation.” —John Fowles

jandavid
You're right, the literal return works ...
that's a workaround for now. But what about case transformations?

I don't have sufficient programming knowledge to fully understand what you write (assuming it's an Apple problem given the links?), but is there any workaround for changing the case of the replacement string?

AmberV
Yes, the regular expression engine is a framework provided by the Mac; it’s not something we can modify ourselves. The operators supported in the replacement pattern are pretty limited; table 3 in the documentation lists everything you can do. Namely: insert capture groups with $1, $2, etc., and insert the characters “$” and “\”.

As for workarounds, using the Processing compile option pane, you could make use of other regular expression engines to further manipulate the output in ways Replacements cannot. Beyond simple command-line approaches, that can dip into programming however.

jandavid
AmberV wrote: As for workarounds, using the Processing compile option pane, you could make use of other regular expression engines to further manipulate the output in ways Replacements cannot. Beyond simple command-line approaches, that can dip into programming however.

I do postprocess on the command line with pandoc. To do that and the replacements together in one go via a script would indeed be elegant. Would you be able to help me implement this? I'm sorry, I'm not a programmer; my knowledge ends with regexes and what I find on the internet. ...

I have different compile formats defined for different purposes, all using pandoc. Postprocessing arguments look like this:

Code: Select all

<$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120 -o <$outputname>.tex

or

Code: Select all

-t docx --bibliography=$HOME/.pandoc/Bibliography.bib -M reference-section-title=References -N --reference-doc=$HOME/.pandoc/templates/refdoc-num-headings.docx -o <$outputname>.docx

I also have one format defined with the filesplitter script that you had posted here (viewtopic.php?f=2&t=52114&p=267229&hilit=split+multiple+files#p267229), which I pasted into the script field in Scrivener, replacing the MultiMarkdown command that you had provided with this:

Code: Select all

pandoc -f markdown-auto_identifiers -t latex --biblatex --top-level-division=chapter --columns=120 -o #{filename} #{tmpfile.path}

Works flawlessly, BTW. I'm always impressed when I use some code that I only half understand and it does some magic for me.

I assume I could do all of the postprocessing as different scripts, where I first execute the regexes I need and then process it via pandoc similar to this one. If you could help me with a script that I could modify, where I could add some regexes, that would be awesome (I can probably construct the regexes myself.)

Thank you very much!

AmberV
Well, one cool thing about shell scripts is that at their most basic they can be thought of as merely a sequence of individual commands that you’d input by hand into Terminal one after the other. There is of course much more that can be done with them, but if all you need to do is run sed or something first, and then pandoc to finish it off, you can put both lines into the “Script” field. So that’s one really easy way to automate or chain together several tools.

But for simple cases, it may be better to use pipes. I provide an example of this in the Processing pane documentation, bottom of page 670. This example takes MMD output and injects it into the clipboard instead of making a file when you compile. The principle can be applied to other things however, such as:

Path:

Code: Select all

/usr/bin/sed

Argument

Code: Select all

-E 's/replace/with/' <$inputfile> | pandoc ...

It’s a little quirky because you’re putting the first part of the command in one field and two commands in the second, as arguments to the first, but separating path from arguments is a bit of an artificial contrivance anyway. The result that is sent to the shell is “path arguments”, so as long as you recognise all of this will be ending up on the same line together, you can do most of the stuff you would do in a “one-liner” in Terminal.

Naturally you would need to modify the Pandoc command slightly to take standard input from the pipe, which will have the text that is modified by sed, instead of opening the original file. The output would remain the same, as you still want a file in the end, and you want Pandoc to create it.

In the case of the Ruby splitter script (glad to hear you’re getting good use out of it), that would be a decent place for the transformation, since we’re already processing the full text. Try something like the following. In the script, look for the line of code in the first line given below, and paste in the second line after it:

Code: Select all

...
next if chunk.length < 1
chunk.gsub!(/PATTERN/) { |match| match.capitalize }
...

Put your regular expression into the “PATTERN” spot, between the slashes, and see if that does what you’re looking for. A lot of that syntax is pretty magic and should be left alone, but that “match.capitalize” should be pretty straightforward, and you should know you can do other things there if you want. Capitalize will upcase the first byte in the matched string, which I think is what you want. But if not, let me know; there really is no limit to what can be done to the matched string.

Oh, and something worth mentioning is that in the example above, the whole string that is matched gets stored in the ‘match’ variable for processing, so there is no need to use parentheses in your pattern. “\w+?” would suffice.

jandavid
Thank you, this is immensely helpful.

AmberV wrote: ... if all you need to do is run sed or something first, and then pandoc to finish it off, you can put both lines into the “Script” field. So that’s one really easy way to automate or chain together several tools.

I've played around with it a bit and I think I can get it to do everything I need, but I'd need some more help. I tried pasting two sed commands and then the pandoc line into the script field as you suggest, but I'm doing something wrong, as I get a "$inputfile: ambiguous redirect" error message.

Code: Select all

sed 's/\\(\w+?)\{\}/\\\u$1\{\}/g;'
sed 's/\\label{\S+?}//g;'
pandoc <$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120 --top-level-division=chapter -o <$outputname>.tex

Any suggestions?

I do like the replacements tab that Scrivener provides, as it gives a great overview of all that's going on, so having it all in the script field would be preferable to pipes, which would make the whole thing too long (with the additional benefit that I could place a comment after each regex to remind me of what it is doing). What would be the best way to do this? A string of simple sed commands one after the other, as I'm trying above? Or would it make sense to turn it into an actual script, like perl? ... but I'm not sure how to pass the arguments <$inputfile> <$outputname> to the script ... (which also seems to be the problem above) ...

Some of the regex patterns I need to reproduce would be:

Code: Select all

s/\\(\w+?)\{\}/\\\u$1\{\}/g; # Capitalize leipzig glosses
s/\\label{\S+?}//g; # delete all LaTeX labels for Word export
s/\\ref{\S+?}//g; # delete all crossrefs
s/(\[\@\w+\:.+?\])\s\./$1\./g; # fix for extra space before period after citation
s/^#\s+?(.+?)\s+?\{\.unnumbered\}/\:\:\: \{custom-style=\"Unnumbered Heading\" 1\}\n$1\n\:\:\:/g; # convert unnumbered section to custom word style
s/(^\\\w+?\[??.*?\]??\{.*?\}\s*?)\%+?\s*?(.*)/$1 <!--$2 -->/g; # convert LaTeX to HTML comment so that Pandoc ignores them (otherwise it escapes the % sign)

There's probably a more elegant solution to this, but that's all that I can do with my (and google's) expertise. Could you help me properly frame this?

And then for the ruby script:
AmberV wrote: ... Capitalize will upcase the first byte in the matched string, which I think is what you want. But if not, let me know—there really is no limit to what can be done to the matched string.

Capitalizing works, yes! But, as you suspect, the example is not as simple as I had put it in the post. I'd need to do some similar regexes like the ones above, where I exclude part of the pattern and add stuff to the matched string, or matching multiple capturing groups, like the example above putting a comment into HTML tags. And I don't know how to translate the simple s/.../.../g syntax into ruby. ...

What does seem to work is to place multiple
chunk.gsub!(/PATTERN/) { ... }
chunk.gsub!(/PATTERN/) { ... }
one after the other, so if I figure out how to write the correct patterns in ruby this can probably go a long way ...
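
The translation from sed's s/.../.../g form to sequential Ruby replacements can be sketched roughly as follows. This is only an illustration under stated assumptions: the method name postprocess is hypothetical (the splitter script works on a variable called chunk directly), and the citation and comment patterns are simplified stand-ins for the ones listed above. The main difference from sed is that, inside a single-quoted replacement string, Ruby writes captured groups as \1 and \2.

```ruby
# Hypothetical helper, not part of the splitter script. It applies several
# replacements one after the other, like a chain of sed s/.../.../g commands.
def postprocess(text)
  chunk = text.dup

  # sed: s/\\ref{\S+?}//g
  # Plain deletion: the replacement is an empty string.
  chunk.gsub!(/\\ref\{\S+?\}/, '')

  # One capture group: keep the citation (\1), drop the space before the period.
  # Simplified stand-in pattern for a pandoc citation such as [@key:2020].
  chunk.gsub!(/(\[@[^\]]+\])\s+\./, '\1.')

  # Two capture groups: \1 is the LaTeX command, \2 is the text of the % comment.
  chunk.gsub!(/^(\\\w+\{.*?\}\s*)%+\s*(.*)$/, '\1<!-- \2 -->')

  chunk
end

puts postprocess("as shown [@smith:2020] .")
puts postprocess("\\section{Intro} % draft note")
```

Each gsub! line stands on its own, so a comment can sit above each pattern as a reminder of what it does.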

Thank you very much for your help!

nontroppo
Posts: 1009
Joined: Mon Mar 05, 2007 5:22 pm
Platform: Mac
Location: Airstrip One
I'm currently very short on time so can't help with regexes, but one general point: the whole purpose of a tool like pandocomatic is that it provides a principled way to manage Pandoc and the scripts to run... It allows you to run general setup/cleanup scripts, direct "pipe" scripts (pre- and post-processors that work on the raw character stream), and manage Pandoc filters (very cool functionality that works on the semantic chunks of Pandoc documents). You don't need it, but it provides a more elegant way to combine all of these disparate elements into templates that are simply specified from Scrivener.

https://heerdebeer.org/Software/markdow ... -templates

AmberV
Before getting into the nitty gritty, I’d second the use of a prepared system like the excellent pandocomatic, as it’ll handle these details for you. But if you’d like to learn how this stuff is done, here are some tips.

Firstly, Scrivener’s placeholders won’t work inside of a script because the compiler isn’t going to go into a script and modify it (that would be too risky). So you’re sending <$inputfile> directly to the command line, which is in fact syntax that is confusing it (and that it is syntax is why we can’t just blindly change text inside of a script). As you guessed, what you need to do is supply these values to the script somehow, and that is what the Arguments field is for. Something as simple as this should work:

Code: Select all

<$inputfile> <$outputname>.tex

From within a shell script, you can refer to arguments by $1, $2 and so forth. Thus:

Code: Select all

sed 'pattern' $1 > tmp
sed 'pattern' tmp > tmp2
pandoc -o $2 tmp2
rm tmp; rm tmp2

What this does is first run sed on the input file ($1), redirecting the results to a file called “tmp” in the compile folder. Next we run the second sed command using this output file as input, and redirect its result to a new file called “tmp2”. Then we run Pandoc with the output set to our designated compile name, which is stored in $2, using the second temp file as input. Lastly we clean up the two temp files (feel free to leave that last line off as you test, as it can give you valuable insight into the process if something isn’t working right). I’d stress this approach is better as an educational step rather than a solution.

With your script, you are running sed with no inputs or outputs, which is fine to do, but only if you intend to use pipes to move data around. You could for example dispense with the temp files with a single command like this:

Code: Select all

sed 'pattern' $1 |
sed 'pattern' |
pandoc -o $2

The pipe character on the end means to send the output to the next command directly. Thus we do not need to provide input data save for with the first command. The middle command looks for standard input via the pipe and sends its results to standard output. The third command has a built-in output file call with “-o”, but no input, so it’s using the standard input from the second command.

⠂─────── ⟢⟡⟣ ─────── ⠂

Now on to the Ruby code: as you note, you can just use this command sequentially if you need to perform multiple search and replace operations. That’s probably the easiest, from a not-having-to-learn-Ruby-programming perspective. If you find things get a bit slow in real-world usage, it’s the sort of thing that could be optimised.

As for translating s/replace/with/g to Ruby syntax, the form described above is an alternate approach that can be used when you need to do something more complicated than what regular expressions all by themselves can do. You’d want to use that for capitalisation, but for the things that are doing simple replacements, like deleting cross-refs, you can use a simpler syntax:

Code: Select all

chunk.gsub!(/\\ref{\S+?}/, '')

Ruby documentation can be accessed on the command-line with the “ri” command. So to look up documentation on the gsub method, you can type in “ri gsub”. You’ll find usage examples in there for doing replacements that insert captured groups (back-references). Ruby uses the \1, \2 format instead of $1, $2 in the replacement string, by the way. Here is the one that removes the extra space between a citation and the period:

Code: Select all

chunk.gsub!(/(\[\@\w+\:.+?\])\s\./, '\1.')

Hopefully it should be pretty straightforward!

jandavid
Thank you very much, Amber! This is very comprehensive, and I'm learning quite a bit.
I'm still running into a few problems though:

Sed:

I've followed your instructions and, to test, put

Code: Select all

sed 's/\\(\S+?)\{\}/\\\u$1\{\}/g;' $1 > tmp
pandoc -f markdown-auto_identifiers -t latex --biblatex --columns=120 -o $2 tmp

into the script field while providing <$inputfile> <$outputname>.tex as Arguments.

However, I get the following error, which I assume is because I am using $1 to refer back to my capturing group while the same placeholder is used to refer to the input.

sed: 1: "s/\\(\S+?)\{\}/\\\u$1\{ ...": RE error: invalid repetition count(s)

I tried with \1 but with the same problem. A simple replacement, without capturing group and backreference works fine.
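
One possible way around the sed problem is to do this particular replacement in Ruby instead, in the same style as the splitter script. The sed error is probably not caused by the $1 clash alone: the \u case conversion is a GNU sed extension that the sed shipped with macOS does not provide, and without -E a sequence like \{\} is likely being read as an (empty) repetition count. The sketch below is only an illustration; the method name capitalize_glosses is hypothetical, and it assumes the macros look like \pst{} in the compiled text.

```ruby
# Hypothetical helper: turns \pst{} into \Pst{}, \sg{} into \Sg{}, and so on.
def capitalize_glosses(text)
  text.gsub(/\\(\w+)\{\}/) do
    name = $1 # the macro name captured by the first group
    # Upcase only the first letter and leave the rest untouched.
    # (String#capitalize would also downcase the remaining letters.)
    '\\' + name[0].upcase + name[1..-1] + '{}'
  end
end

puts capitalize_glosses("walk-\\pst{} \\sg{}")
```

Because the replacement is computed in a block, the capture group is available as $1 there, and no shell quoting is involved.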

⠂─────── ⟢⟡⟣ ─────── ⠂

Ruby:

I managed to get ruby to work ... well, mostly.

BTW, just for the sake of completeness, some context to this (as you might wonder what the point of such strange macros is): I'm a linguist, and we often use so-called "functional glosses", i.e., words/parts of words that have some grammatical function, such as PST, meaning "past", or SG meaning "singular". In the text they usually appear as small-caps. Now there is a LaTeX package, called "leipzig" (after the Leipzig Glossing Rules) that has the most common glosses predefined as macros (and makes it fairly easy to define your own). The package not only prints them in the right form in small-caps, but also works together with the glossaries package to create glossaries. The leipzig macros are mostly (with a few exceptions) simply the name of the gloss but with the peculiarity that they start with a capital letter (and are usually followed by {} to ensure that a potential space after the gloss is not eaten up). So to get the small-caps PST gloss in LaTeX I would type \Pst{}, SG is produced by \Sg{}, etc.

So now with Scrivener 3 I've defined a character style for glosses, which (a) in Scrivener has small-caps activated so it looks as it should and (b) when compiling for Word the style gets a prefix "[" and a suffix "]{.smallcaps}". Now (c), when compiling for LaTeX while it needs a prefix "\" and a suffix "{}" and initially I wanted to supply these directly in compile, but, since I also need to capitalize, and therefore need to postprocess anyways, I figured it's probably better (to avoid false positives) to give my Scrivener style a more unique prefix and suffix so other potential \LaTeX{} macros don't get messed with, so now my gloss prefix is <gl> and my suffix </gl>, hence I need something that turns <gl>word</gl> into \Word{}. After some more googling, I finally came up with this, which works

Code: Select all

chunk.gsub!(/<gl>(\S+?)<\/gl>/) { |match| '\\' + $1.capitalize + '{}' }

(not sure it's the right way to do it, but it does the job). Yay, I wrote my first ruby replacement pattern ...

Out of idle curiosity: I've experimented a bit and the one thing I didn't figure out was how to add \n newlines to the replacement pattern. For example, this replacement. I thought either

Code: Select all

chunk.gsub!(/^\#\s+?(.+?)\s+?\{\.unnumbered\}/, '::: {custom-style="Unnumbered Heading 1}:::\n\1\n:::')

or

Code: Select all

chunk.gsub!(/^\#\s+?(.+?)\s+?\{\.unnumbered\}/) { |match| '::: {custom-style="Unnumbered Heading 1}' + \n$1\n + ':::' }

would do the trick, but ... they don't.
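
A likely reason is quoting: in Ruby, \n only becomes a real newline inside double-quoted strings, while in single quotes it stays a backslash followed by the letter n, and a bare \n outside any quotes is a syntax error. The following is a minimal sketch under those assumptions; the method name unnumbered_to_div is hypothetical, and the placement of the closing quote after "Unnumbered Heading 1" is a guess at what was intended.

```ruby
# Hypothetical helper: wraps a pandoc "unnumbered" heading in a fenced div
# that carries a custom style name.
def unnumbered_to_div(text)
  text.gsub(/^#\s+?(.+?)\s+?\{\.unnumbered\}/) do
    # $1 holds the heading text captured by the first group.
    # The double-quoted string turns "\n" into a real newline and
    # interpolates the capture with #{...}.
    "::: {custom-style=\"Unnumbered Heading 1\"}\n#{$1}\n:::"
  end
end

puts unnumbered_to_div("# Preface {.unnumbered}")
```

In the splitter script the same block could be attached to chunk.gsub! directly.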
This particular example is actually unnecessary, since pandoc passes unnumbered headers quite well to latex on its own ... I've constructed something like that for my other workflow to Word, but as I was experimenting with the ruby replacements a bit I wanted to see what it all can do and got stuck there. Which brings me to ...

⠂─────── ⟢⟡⟣ ─────── ⠂

Pandocomatic:

Yes, I fully agree, this would very likely be a much better and much more elegant solution ... in fact, when I started to set up my first Scrivener 3 project a few weeks ago, I downloaded pandocomatic and also scrivomatic, and played around with that workflow a bit. Yet, I must admit, it was a bit overwhelming, so I ended up abandoning it again. I will likely never need the vast number of options, and I thought if I can get it all done with styles and a few regexes after, I'll be fine. Also, at least I'll know more or less what I'm doing. So it felt like it would be overkill for my needs, as well as make it harder for me to troubleshoot and tweak. Please don't get me wrong, this is not intended as a criticism of those tools, or their documentation, but rather a criticism of my own limited skills.

However, then the problem with Scrivener (or rather Apple) not being able to capitalize a word in the regex replacement appeared ... and ... here we are, and now I have to tweak a ruby script and all kinds of other nasty things where I'm equally lost how to troubleshoot ...
So, in the end, maybe I should have stuck with pandocomatic and invested the time necessary to learn all that. I leave that to the experts to judge. Maybe over the weekend I'll give it another shot ...

In any case, I feel like I'm now so close to getting it working; hopefully I can figure out the few remaining problems. Thank you very much for your help!