Coding XML Formats in Cocoa

I’ve long been intending to add some technical, coding-related content to this blog. I’ve always admired the blogs of other developers who share some of the coding problems they have faced and solutions they have reached, and in developing Scrivener there are a number of issues that I’ve come across and found solutions for that I’m sure could be useful to other developers. So, this is the first more techie, coding-orientated post to make it to the blog. Those with no interest in Cocoa development, look away now.

I have recently been rewriting Scrivener’s file format. As you may or may not know, .scriv files are packages – essentially folders that on OS X just look like, and are treated the same as, files. Were you to move a .scriv file to a different platform, it would appear as a regular folder. On OS X, you can ctrl-click on a .scriv file in the Finder and select “Show Package Contents”.

File packages are great for programs such as Scrivener. Regular files usually have to be loaded into the program’s memory in their entirety when loaded; with a file package, the program can just look inside it and open whatever it needs as and when it needs it. Given that Scrivener can import movie, sound, PDF and image files, you can imagine how much memory might get eaten up if it had to load everything into memory right from the get-go. Instead, with its package format, it can just load the binder structure file and then load up each document as you select it in the binder, and flush from memory any large files that aren’t currently being used.

For Scrivener 2.0, .scriv files will still be file packages, but I’ve been doing some work on the format of the files inside the .scriv file. Currently all text files are saved internally as RTFD files (RTFD stands for “rich text format directory”). RTF files are a standard rich text format that can be opened on all major platforms (they are essentially plain text files with formatting mark-up), and were designed by Microsoft; RTF files support pretty much everything that Word documents do. RTFD is Apple’s extension of this format. An RTFD file is a file package in itself (as with .scriv files, you can ctrl-click on them in the Finder and select “Show Package Contents” – you will find a TXT.rtf file inside there, for instance, which holds the actual text). Apple designed the format so that such files could also hold QuickTime files and any other file type; but the trouble is that RTFD files can only be opened on Macs.

Most of the other files inside .scriv packages use the Apple .plist format – I may have given some of them the .xml extension (.plist files are technically XML files), but internally they are just Apple .plists. (.plist stands for “property list”).

This format for Scrivener files was generally a great and solid 1.x file format. It works, and it doesn’t take too much code to maintain on my part – the Cocoa frameworks make it very easy to write to RTFD and PLIST formats. However, for 2.0 I wanted to make Scrivener’s format less platform-specific. The current format has two main flaws:

1) Its use of .plist and .rtfd files means it’s a format that can only be read on the Mac. Although I personally have no plans to switch to or code for other platforms, this would be a significant hurdle for anyone we wanted to work with to port Scrivener to, say, Windows.

2) The .plist format is not human-readable – at least, not when used with the sort of data that Scrivener has to write out. This makes it difficult for anyone on any platform, including the Mac, to write utilities that might work with the Scrivener format.

For these reasons, I am in the process of making the following changes to the .scriv package format:

1) The .scriv package will no longer contain all files in the root folder. Instead, subdirectories will be used to make it easier to navigate. That way, should it be ported to a different platform that doesn’t support packages, it will be easy for the user to find the file required to open the project. And in general it’s just neater, of course.

2) I will no longer use the RTFD format and will switch to using RTF instead. The only reason I didn’t use RTF to begin with was that Apple’s standard RTF reader and writer – the one provided in the Cocoa frameworks – ignores images. That is, it fails to load or save images in the text. Over the years this is something I’ve fixed myself, though, so I use a modified version of the RTF reader/writer to save and load RTF files that retain images with no problems. This not only means that all the text files stored inside a .scriv file are now platform-independent, but also that they are using a file format that has been around for over twenty years. It’s also a format that can be opened in a plain-text reader.

3) Instead of using .plist files, I am creating my own XML file formats where applicable.

(There are other changes too – for instance I am now using a checksum file that can tell which files have been changed since the last session; this means that should Scrivener crash, you will no longer be faced with the time-consuming “Synchronising…” panel, as Scrivener will be able to update only the search indexes for the files that have changed rather than going through every single file in the project.)

Needless to say, writing my own XML file formats is the most time-intensive part of this process. Fortunately, Cocoa has some excellent and easy-to-use classes for generating and reading XML – the NSXML… classes. For instance, suppose I wanted to create the following XML:

<MyXMLFormat Foo="Yes">
     <Text>This is some XML.</Text>
</MyXMLFormat>

It’s as easy as this:

NSXMLElement *myXMLElement = [[NSXMLElement alloc] initWithName:@"MyXMLFormat"];
[myXMLElement addAttribute:[NSXMLElement attributeWithName:@"Foo" stringValue:@"Yes"]];
NSXMLElement *textElement = [[NSXMLElement alloc] initWithName:@"Text" stringValue:@"This is some XML."];
[myXMLElement addChild:textElement];
[textElement release];
// Do something with myXMLElement, such as write it to disk using NSXMLDocument.
[myXMLElement release];

Still, I came across a couple of problems in writing my own XML, and that’s what I want to share with you here – the problems and the solutions.

Writing Less Code

The first aspect of working with an XML format in Cocoa I’m going to talk about isn’t actually a problem at all. It’s just that I’m lazy – I get bored of writing out the same things over and over again. The NSXML classes and methods are fantastic – they contain everything you need to read and write good XML – but they are also rather generic, designed to cover all uses. For many common uses, this can mean writing more code.

For instance, suppose you need to read the example XML above and you know that the MyXMLFormat element only has one <Text> sub-element; or, if it has more than one, your app can only handle one anyway, so that either way all you really need to do is get the first child element named “Text”. Using the NSXML classes as is, this is easy enough but it takes a few lines of code:

NSString *textString = nil; // Initialise the variable with nil.
NSArray *childElements = [myXMLElement elementsForName:@"Text"]; // Gets all of the child elements named "Text".
if ([childElements count] > 0) // Did we find "Text" sub-elements?
{
     textString = [[childElements objectAtIndex:0] stringValue];
}
if (textString != nil)
{
     // Do something.
}

Pedants will note that I can easily reduce the string-reading part of the code above to two lines of code if I sacrifice a micron of readability:

NSArray *childElements = [myXMLElement elementsForName:@"Text"];
NSString *textString = ([childElements count] > 0 ? [[childElements objectAtIndex:0] stringValue] : nil);

Still, as I say, I’m lazy. Why write two lines of code when you can reduce it to one? So I wrote my own NSXMLElement category to reduce the lines of code in my XML read/write classes. Here are the two methods that do the job:

@implementation NSXMLElement (KBAdditions)

- (NSXMLElement *)firstChildElementWithName:(NSString *)name
{
     NSArray *elements = [self elementsForName:name];
     if ([elements count] == 0)
          return nil;
     return (NSXMLElement *)[elements objectAtIndex:0];
}

- (NSString *)stringValueOfFirstChildElementWithName:(NSString *)name
{
     NSXMLElement *element = [self firstChildElementWithName:name];
     return (element != nil ? [element stringValue] : nil);
}

@end

(Note that I split it into two methods as sometimes you will not want only the string value but will want to get the whole element – for instance to examine its attributes. -firstChildElementWithName: does that, and -stringValueOfFirstChildElementWithName: is just a convenience method that calls the former method and returns only its string value.)

Now, to get the string value of the sub-element <Text> from <MyXMLFormat>, I only have to write one line of code:

NSString *textString = [myXMLElement stringValueOfFirstChildElementWithName:@"Text"];

Nice. This reduced how much code I had to write for reading my custom XML files, but I was still finding there was one sequence of code I was repeating quite a lot in the code for writing to XML.
Again, consider the example from above:

NSXMLElement *myXMLElement = [[NSXMLElement alloc] initWithName:@"MyXMLFormat"];
[myXMLElement addAttribute:[NSXMLElement attributeWithName:@"Foo" stringValue:@"Yes"]];
NSXMLElement *textElement = [[NSXMLElement alloc] initWithName:@"Text" stringValue:@"This is some XML."];
[myXMLElement addChild:textElement];
[textElement release];
// Do something with myXMLElement, such as write it to disk using NSXMLDocument.
[myXMLElement release];

In this case the “Text” element is really basic – it contains only a string value and no attributes. Still, it takes three lines of code to add it:

NSXMLElement *textElement = [[NSXMLElement alloc] initWithName:@"Text" stringValue:@"This is some XML."];
[myXMLElement addChild:textElement];
[textElement release];

Again, pedants will note that I can reduce this to one line if I really want:

[myXMLElement addChild:[[[NSXMLElement alloc] initWithName:@"Text" stringValue:@"This is some XML."] autorelease]];

That’s not so bad. But what if I want to add an attribute to the element? In that case I have to go back to my original code so that I have a reference to the element to work with:

NSXMLElement *textElement = [[NSXMLElement alloc] initWithName:@"Text" stringValue:@"This is some XML."];
[textElement addAttribute:[NSXMLElement attributeWithName:@"SomeAttribute" stringValue:@"SomeValue"]];
[myXMLElement addChild:textElement];
[textElement release]; // Release only once we have finished using our own reference.

So, I added another method to my NSXMLElement category:

- (NSXMLElement *)addChildElementWithName:(NSString *)name stringValue:(NSString *)stringValue
{
     NSXMLElement *childElement = [[NSXMLElement alloc] initWithName:name stringValue:stringValue];
     [self addChild:childElement];
     return [childElement autorelease];
}

This adds the child element for me, creating it using the name and string value passed in, and then returns a reference to the child element that was added.
So now, to add the “Text” element, I can just do this:

[myXMLElement addChildElementWithName:@"Text" stringValue:@"This is some XML."];

And if I need to add an attribute, that’s easy too:

NSXMLElement *textElement = [myXMLElement addChildElementWithName:@"Text" stringValue:@"This is some XML."];
[textElement addAttribute:[NSXMLElement attributeWithName:@"SomeAttribute" stringValue:@"SomeValue"]];

My code is tidier and easily readable; I’m happy.

The other XML class I decided to write a category for was NSXMLDocument. To load an XML document from file, I need to do this:

NSXMLDocument *xmlDoc = [[NSXMLDocument alloc] initWithContentsOfURL:url options:NSXMLNodePreserveWhitespace error:&error];

I need the “preserve whitespace” option because otherwise Cocoa’s XML loader will obliterate any return and tab characters, and obviously I want to preserve them. (Although I use RTF for saving the text of files, there may be plain text representations of the text in some files – for instance, in the search.indexes file – and the user may have entered a tab or return in some other locations that get saved in the XML file.)

However, sometimes I found that this could return nil unless I also passed in the NSXMLDocumentTidyXML option to tidy up any malformed XML. (Obviously I’m working hard to ensure that Scrivener can’t generate any malformed XML, but there are instances in the program where it may need to try to read some, and equally obviously I need to ensure that it doesn’t fail to read projects in the event that it has created some dodgy XML somewhere.) But you wouldn’t normally want the NSXMLDocumentTidyXML option passed in, because it wipes leading tab characters among other things; you only want to use it (or I did in this case) if there is a problem reading the XML document without this option.

In other words, I want to initialise an NSXMLDocument with the NSXMLNodePreserveWhitespace option, but if that fails I want to try again using NSXMLDocumentTidyXML.
That’s easy enough:

NSXMLDocument *xmlDoc = [[NSXMLDocument alloc] initWithContentsOfURL:url options:NSXMLNodePreserveWhitespace error:&error];
// Did it fail? If so, try passing in the tidy option in case it failed because of malformed XML.
if (xmlDoc == nil)
     xmlDoc = [[NSXMLDocument alloc] initWithContentsOfURL:url options:NSXMLNodePreserveWhitespace|NSXMLDocumentTidyXML error:&error];
if (xmlDoc == nil) // If it still didn’t work, show the error.
     [[NSAlert alertWithError:error] runModal];

That does the job nicely, but rather than have three lines of code every time I need to do this, I can reduce it to one by separating the above out into an NSXMLDocument category:

- (id)initWithContentsOfURLPreservingWhitespace:(NSURL *)url error:(NSError **)error
{
     // Remember the class first: a failed init returns nil, and sending the
     // retry to nil would silently do nothing.
     Class documentClass = [self class];
     if ((self = [self initWithContentsOfURL:url options:NSXMLNodePreserveWhitespace error:error]))
          return self;
     // The first attempt failed (by Cocoa convention a failed init releases
     // the receiver), so allocate a fresh object for the tidy retry.
     self = [[documentClass alloc] initWithContentsOfURL:url options:NSXMLNodePreserveWhitespace|NSXMLDocumentTidyXML error:error];
     return self;
}

So, again, I have less code to maintain in all of my XML reading objects:

NSXMLDocument *xmlDoc = [[NSXMLDocument alloc] initWithContentsOfURLPreservingWhitespace:url error:&error];
if (xmlDoc == nil) // If it still didn’t work, show the error.
     [[NSAlert alertWithError:error] runModal];

That’s the convenience methods built to aid and abet my laziness out of the way. But I had a bigger problem.

Dealing with invalid XML characters

Scrivener users paste text in from Word and other sources, and some of that text is going to end up in my XML files – maybe because the user copied some Word text into a document title, or maybe into the text itself, which will have a plain-text representation saved in the search.indexes XML file. The trouble is, it turns out that not all Word characters are valid XML characters. Not all characters are valid XML characters, period.
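That last point can be made concrete. XML 1.0 only allows the code points matched by the Char production in the spec (quoted in full further down this post). The following is a minimal sketch of that rule in plain C, which also compiles unchanged inside an Objective-C file; xml_char_is_valid is a made-up name for illustration, not something Cocoa provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a Cocoa API: returns true if the Unicode code
   point c is allowed in an XML 1.0 document (the spec's Char production). */
static bool xml_char_is_valid(uint32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD      /* tab, newline, carriage return */
        || (c >= 0x20    && c <= 0xD7FF)         /* up to the surrogate blocks */
        || (c >= 0xE000  && c <= 0xFFFD)         /* excludes 0xFFFE and 0xFFFF */
        || (c >= 0x10000 && c <= 0x10FFFF);      /* everything above the BMP */
}
```

Under this rule a form feed (0x0C) is rejected, as is 0xFFFE, while tab, newline and carriage return are the only characters below 0x20 that pass.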
I discovered this when trying to load a project that had been converted to use XML and having it fail with an XML reading error message about “invalid pcdata char value 12” or some such. After some poking around, I discovered that this was caused by a page break character – Cocoa’s NSFormFeedCharacter – occurring in the text. The text is stored as RTF on disk – that part is fine – but also has a plain text representation saved in the search.indexes XML file. And that was the problem; the Unicode character used for NSFormFeedCharacter (0x000c, which is 12 in decimal, hence the error message) made the XML reader choke when trying to load the search indexes table from the XML file.

So I took the path of least resistance and wrote a shoddy method that removed all form feed characters from the plain-text version of any text that got saved in the search indexes XML file, making a mental note to look into it further later as I knew that couldn’t be the only character that would cause problems. Sure enough, a while later another file refused to load – and this time it was some other random Word character that had been pasted in that was causing the problems. The character appeared as some odd symbol in Scrivener and isn’t supported by the text system anyway, but when saved in the XML file it would cause the XML reader to choke and fail to open the file.

That was when I started doing some more digging around and came across this:

http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html

Which in turn led me to the XML specs:

http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

It turns out that there are whole ranges of Unicode characters that make for invalid XML, and I had wrongly assumed that the NSXML classes would take care of all of this for me when they won’t. As specified in the specs above, valid Unicode ranges for XML files are:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The invalid characters aren’t the sort of thing you are likely to be using in your document titles or text anyway, but that doesn’t mean they can’t appear in Scrivener documents when pasted in from other sources (such as Word). And it turns out that the NSXML classes will allow you to write any string value, but will fail when you try to read a string value that contains invalid XML characters. In other words, the Cocoa classes leave it up to the individual developer to ensure that whatever is being written to XML contains only valid XML characters.

So the trick now was to write an NSString category method that would clean up any strings to be written to XML by removing any characters that fell within the invalid XML ranges. I hacked something together based on the web page mentioned above, essentially going through a string character by character and checking it didn’t fall within the invalid ranges, but it threw up some odd warnings and, although it seemed to work, I knew it wasn’t very efficient. Thinking that I might need some low-level C magic to build an NSString out of unichars or some such (can you tell that I’m a high-level Cocoa guy?), I asked for advice on the Cocoa developer lists (lists.apple.com), which are an amazing resource. There, the always helpful Jens Alfke (thank you, Jens!) pointed out I could specify ranges of Unicode characters using NSMutableCharacterSet, which was something I had completely missed and which led to the solution, namely:

• Build a character set containing all of the Unicode characters that are valid XML as specified in the XML specs.
• Invert this character set so that we have another character set containing all of the invalid characters, and keep this character set around in memory so that it only ever needs creating once (because we don’t want to go through the process of creating it and adding all the character ranges every time it is required, which could be a lot).
• Check to see if the passed-in string contains any of the invalid characters contained in the character set.
• If not, the string can be used as-is.
• If it does contain invalid XML characters, though, go through it looking for all invalid characters and remove them.

The resulting NSString category method is below:

@implementation NSString (XMLMethods)

- (NSString *)validXMLString
{
     // Not all Unicode characters are valid XML.
     // See:
     // http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
     // (Also see: http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html )
     //
     // The ranges of Unicode characters allowed, as specified above, are:
     // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
     //
     // To ensure the string is valid for XML encoding, we therefore need to remove any characters that
     // do not fall within the above ranges.

     // First create a character set containing all invalid XML characters.
     // Create this once and leave it in memory so that we can reuse it rather
     // than recreate it every time we need it.
     static NSCharacterSet *invalidXMLCharacterSet = nil;
     if (invalidXMLCharacterSet == nil)
     {
          // First, create a character set containing all valid XML characters.
          // NSMakeRange() takes a location and a length, so an inclusive range
          // from first to last has a length of (last - first + 1).
          NSMutableCharacterSet *xmlCharacterSet = [[NSMutableCharacterSet alloc] init];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0x9, 1)];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0xA, 1)];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0xD, 1)];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0x20, 0xD7FF - 0x20 + 1)];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0xE000, 0xFFFD - 0xE000 + 1)];
          [xmlCharacterSet addCharactersInRange:NSMakeRange(0x10000, 0x10FFFF - 0x10000 + 1)];

          // Then create and retain an inverted set, which will thus contain all invalid XML characters.
          invalidXMLCharacterSet = [[xmlCharacterSet invertedSet] retain];
          [xmlCharacterSet release];
     }

     // Are there any invalid characters in this string?
     NSRange range = [self rangeOfCharacterFromSet:invalidXMLCharacterSet];

     // If not, just return self unaltered.
     if (range.length == 0)
          return self;

     // Otherwise go through and remove any illegal XML characters from a copy of the string.
     NSMutableString *cleanedString = [self mutableCopy];
     while (range.length > 0)
     {
          [cleanedString deleteCharactersInRange:range];
          range = [cleanedString rangeOfCharacterFromSet:invalidXMLCharacterSet options:0 range:NSMakeRange(range.location, [cleanedString length] - range.location)];
     }

     return [cleanedString autorelease];
}

@end

So now I can ensure that I’m writing valid XML by passing any strings written to XML through my -validXMLString method:

NSXMLElement *fooElement = [[NSXMLElement alloc] initWithName:@"Foo" stringValue:[[binderDoc title] validXMLString]];

Phew. Now I can get back to finishing Scrivener’s new internal XML generators confident that they won’t end up creating any projects that refuse to open because of bad XML.

P.S. I apologise for the really poor formatting of the code on this page – no matter what I try, I can’t get Blogger to format it nicely. I’ve tried http://formatmysourcecode.blogspot.com and other methods, but no matter what it just looks rubbish. If you know of a good way of formatting code so that it doesn’t look terrible on Blogger, please let me know (I’d love to make it automatic in Scrivener’s “Copy as HTML” code too).
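For anyone who wants a framework-free cross-check when testing a method like -validXMLString: NSString exposes its contents as UTF-16 code units (unichar), so characters above 0xFFFF appear as surrogate pairs. The sketch below applies the same XML 1.0 rule directly to a UTF-16 buffer in plain C (which also compiles inside an Objective-C file). The function name xml_strip_invalid_utf16 and its buffer-based interface are assumptions made for illustration; this is not part of Cocoa and not the code Scrivener uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Cocoa API. Copies the UTF-16 code units in
   src[0..len) to dst, skipping anything XML 1.0 forbids, and returns the
   number of units written. dst must have room for at least len units. */
static size_t xml_strip_invalid_utf16(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t u = src[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: keep it only when a low surrogate follows.
               Such a pair always encodes a code point in 0x10000-0x10FFFF. */
            if (i + 1 < len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                dst[out++] = u;
                dst[out++] = src[++i];
            }
            continue; /* a lone high surrogate is dropped */
        }

        if (u >= 0xDC00 && u <= 0xDFFF)
            continue; /* a lone low surrogate is dropped */

        /* Remaining units are single characters in the Basic Multilingual Plane. */
        if (u == 0x9 || u == 0xA || u == 0xD ||
            (u >= 0x20 && u <= 0xD7FF) ||
            (u >= 0xE000 && u <= 0xFFFD))
            dst[out++] = u;
    }
    return out;
}
```

In a unit test, the buffers could be filled with -getCharacters:range: and turned back into a string with +stringWithCharacters:length:, then compared against what -validXMLString returns for the same input.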
