[personal profile] migmit
I'm playing with a certain pet project, an iOS app. I started wondering if it's possible to handle copy-pasting from Notes.app, while preserving some formatting. Here's what I've got:

1) Using a string (as in, UIPasteboard.general.string) keeps indentation, and even checkboxes are represented normally, with [ ] for an unchecked item and [x] for a checked one. Unfortunately, you lose even the most basic formatting, like bold/italic.
2) Using the RTF representation from the same pasteboard (which is inconsistent: iOS Notes.app generally adds an RTFD format, while macOS prefers RTF) doesn't even give you information about indentation. Well, not exactly: the information is there, but it is in the textLists property, which is not available on iOS without hacking.
3) Using the HTML representation does give you some information about indentation (in the form of tab positions, but whatever, they are available on iOS) and, of course, about fonts (including bold/italic stuff), but it loses the checkboxes. That's right, we can't reliably figure out which boxes are checked and which aren't.
4) Most interesting: the internal pasteboard format of Notes.app. It's not documented and probably isn't supposed to be used by others, but that is no reason not to use it.
So, we start by getting the data:
if let data = UIPasteboard.general.data(forPasteboardType: "com.apple.notes.richtext") {

But then all we have is some binary data. Inspecting that data showed that it's a property list in binary format. I converted that binary list to XML format and looked at it in a text editor; it appeared to be a keyed archive. There is a standard tool for those: NSKeyedUnarchiver. There is just one problem: the whole serialized data seems to be of an internal, non-exported type, ICNotePasteboardData. But there is a simple remedy for that: we can substitute one type for another. And, meditating on the XML property list, I realized that there is only one part of it that might contain something interesting. So, what it boiled down to is this:
class FakeNotesData: NSObject, NSCoding {
    var attributedStringData: Data?
    required init?(coder: NSCoder) {
        attributedStringData = coder.decodeObject(forKey: "attributedStringData") as? Data
    }
    func encode(with coder: NSCoder) {
        if let data = attributedStringData {coder.encode(data, forKey: "attributedStringData")}
    }
}

let keyed = try! NSKeyedUnarchiver(forReadingFrom: data)
keyed.requiresSecureCoding = false
keyed.setClass(FakeNotesData.self, forClassName: "ICNotePasteboardData")
if let obj = keyed.decodeObject(forKey: NSKeyedArchiveRootObjectKey) as? FakeNotesData,
   let attributedStringData = obj.attributedStringData
{
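
The inspection step mentioned earlier (recognizing a binary property list, converting it to XML, and spotting the keyed archive) can also be done in code rather than by hand. Here is a minimal sketch, assuming Apple's Foundation. The names PlistInspection and inspectPlist are made up for illustration, and the "$archiver" check relies on the usual top-level layout of keyed archives, not on anything specific to Notes.app:

```swift
import Foundation

/// Hypothetical helper type: what we learned about an opaque blob.
struct PlistInspection {
    let isBinary: Bool        // true if the blob was a binary property list
    let isKeyedArchive: Bool  // true if the top level looks like NSKeyedArchiver output
    let xml: String           // the same property list, re-serialized as XML for reading
}

/// Tries to parse `data` as a property list and reports its format.
/// Returns nil if the data is not a property list at all.
func inspectPlist(_ data: Data) -> PlistInspection? {
    var format = PropertyListSerialization.PropertyListFormat.xml
    guard
        let plist = try? PropertyListSerialization.propertyList(
            from: data, options: [], format: &format),
        let xmlData = try? PropertyListSerialization.data(
            fromPropertyList: plist, format: .xml, options: 0),
        let xml = String(data: xmlData, encoding: .utf8)
    else { return nil }

    // Keyed archives are dictionaries whose "$archiver" entry names the archiver class.
    let top = plist as? [String: Any]
    let isKeyed = (top?["$archiver"] as? String) == "NSKeyedArchiver"
    return PlistInspection(isBinary: format == .binary, isKeyedArchive: isKeyed, xml: xml)
}
```

Printing the xml field gives roughly what a command-line plist converter would show, which is enough to go hunting for class names and interesting keys.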

But, unfortunately, this attributedStringData is just another (smaller) chunk of binary data. However, just printing it out as a string, I've noticed immediately that part of it is just the text I copied, without any formatting or indentation. Other parts seemed to be garbage. So, I converted the whole data into a hex string and started digging deeper.

It really helped that a few years ago I worked with Google's protobuf (specifically, I worked on removing it from our system). So, it took me a while, but I realized that this binary data is nothing but a protobuf-serialized message. I've even written a simple protobuf spec for it:
message PasteInfo {
  required string str = 2; // UTF-8
  repeated ChunkInfo chunks = 5;
  repeated AttachmentInfo attachments = 6;
}
message ChunkInfo {
  required uint64 length = 1;
  optional ParagraphStyle paragraphStyle = 2;
  optional float textSize = 3;
  optional TextStyle textStyle = 5;
  optional bool isUnderlined = 6 [default = false];
  optional bool isStrikethrough = 7 [default = false];
  optional int64 baselineOffset = 8; // 1 for superscripts, -1 for subscripts; yes, they didn't use sint for that
  optional Color color = 10;
  optional Attachment attachment = 12;
}
message ParagraphStyle {
  optional ParagraphType paragraphType = 1;
  optional Alignment alignment = 2 [default = LEFT];
  optional WritingDirection writingDirection = 3 [default = LTR];
  optional uint64 listDepth = 4 [default = 0];
  optional CheckedListInfo checkedListInfo = 5;
  optional uint64 startNumberingFrom = 7; // not sure about that
}
enum TextStyle { // could be a bitmask?
  BOLD = 1;
  ITALIC = 2;
  BOLD_ITALIC = 3;
}
enum ParagraphType {
  TITLE = 0;
  HEADING = 1;
  SUBHEADING = 2;
  BULLETED = 0x64;
  DASHED = 0x65;
  NUMBERED = 0x66;
  CHECKLIST = 0x67;
}
message CheckedListInfo {
  optional bytes unknown = 1; // seems always present, with opaque 128-bit values — maybe GUIDs
  required bool isChecked = 2;
}
message Attachment {
  required string guid = 1;
  required string type = 2; // reverse domain name
}
message AttachmentInfo {
  required string guid = 2;
  optional string content = 6;
  required string type = 8; // reverse domain name
  optional fixed64 unknown_ptr = 17; // probably a pointer to somewhere inside Notes.app
  optional uint64 unknown_int = 25; // seems always present, with value 2
}
enum Alignment {
  LEFT = 0;
  CENTER = 1;
  RIGHT = 2;
  JUSTIFY = 3;
}
enum WritingDirection {
  LTR = 0;
  DEFAULT = 1;
  RTL = 2;
}
message Color {
  required float red = 1;
  required float green = 2;
  required float blue = 3;
  required float alpha = 4;
}
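
For digging through such data without a generated parser, a tiny scanner over the standard protobuf wire format is enough. The sketch below is not the code I actually used. WireValue, Field, readVarint and scanFields are illustrative names, and groups (wire types 3 and 4) are simply rejected:

```swift
import Foundation

/// One decoded value, tagged by protobuf wire type.
enum WireValue: Equatable {
    case varint(UInt64)   // wire type 0: uint64, int64, bool, enum
    case fixed64(UInt64)  // wire type 1: fixed64, double
    case bytes([UInt8])   // wire type 2: string, bytes, nested message
    case fixed32(UInt32)  // wire type 5: fixed32, float
}

struct Field: Equatable {
    let number: Int
    let value: WireValue
}

/// Reads a base-128 varint at `pos`, advancing it; nil on truncated or overlong input.
func readVarint(_ bytes: [UInt8], _ pos: inout Int) -> UInt64? {
    var result: UInt64 = 0
    var shift: UInt64 = 0
    while pos < bytes.count, shift < 64 {
        let b = bytes[pos]
        pos += 1
        result |= UInt64(b & 0x7F) << shift
        if b & 0x80 == 0 { return result }
        shift += 7
    }
    return nil
}

/// Splits one message into its top-level fields; nil if the bytes are malformed.
func scanFields(_ bytes: [UInt8]) -> [Field]? {
    var fields: [Field] = []
    var pos = 0
    while pos < bytes.count {
        guard let key = readVarint(bytes, &pos) else { return nil }
        let number = Int(key >> 3)
        switch key & 7 {
        case 0:
            guard let v = readVarint(bytes, &pos) else { return nil }
            fields.append(Field(number: number, value: .varint(v)))
        case 1:
            guard pos + 8 <= bytes.count else { return nil }
            // Fixed-width values are little-endian.
            let v = bytes[pos..<pos + 8].reversed().reduce(UInt64(0)) { ($0 << 8) | UInt64($1) }
            pos += 8
            fields.append(Field(number: number, value: .fixed64(v)))
        case 2:
            guard let len = readVarint(bytes, &pos), len <= UInt64(bytes.count - pos)
            else { return nil }
            let n = Int(len)
            fields.append(Field(number: number, value: .bytes(Array(bytes[pos..<pos + n]))))
            pos += n
        case 5:
            guard pos + 4 <= bytes.count else { return nil }
            let v = bytes[pos..<pos + 4].reversed().reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
            pos += 4
            fields.append(Field(number: number, value: .fixed32(v)))
        default:
            return nil  // groups and unknown wire types are out of scope for this sketch
        }
    }
    return fields
}
```

Nested messages such as ChunkInfo come out as .bytes and can be fed back into scanFields. A float like textSize arrives as .fixed32 and converts with Float(bitPattern:).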

So, now it seems that I have all the details I wanted: indentation (as list depth), basic bold/italic formatting, and checkboxes.
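
On the Swift side, the decoded values could be modeled roughly like this. It's only a sketch: TextStyle is treated as a bitmask, which the spec above merely guesses at, ParagraphInfo and linePrefix are hypothetical names, and the tab-per-level indentation and bullet characters are my assumptions. Only the [ ] and [x] markers match what the plain-string representation shows:

```swift
import Foundation

/// TextStyle read as a bitmask (an assumption): BOLD = 1, ITALIC = 2, BOLD_ITALIC = 3.
struct TextStyle: OptionSet {
    let rawValue: Int
    static let bold = TextStyle(rawValue: 1)
    static let italic = TextStyle(rawValue: 2)
}

/// Raw values copied from the ParagraphType enum in the spec.
enum ParagraphType: Int {
    case title = 0, heading = 1, subheading = 2
    case bulleted = 0x64, dashed = 0x65, numbered = 0x66, checklist = 0x67
}

/// Hypothetical flattened view of ParagraphStyle plus CheckedListInfo.
struct ParagraphInfo {
    var type: ParagraphType?
    var listDepth: Int = 0
    var isChecked: Bool?
}

/// Plain-text prefix for a paragraph: one tab per list level, then a marker.
func linePrefix(for p: ParagraphInfo) -> String {
    let indent = String(repeating: "\t", count: p.listDepth)
    switch p.type {
    case .checklist?: return indent + (p.isChecked == true ? "[x] " : "[ ] ")
    case .bulleted?: return indent + "* "
    case .dashed?: return indent + "- "
    default: return indent  // numbered lists need a running counter, omitted here
    }
}
```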

Word of caution: those chunks have no respect for paragraph boundaries. I've already encountered cases where one paragraph contains several chunks, split at seemingly random points (with no formatting change at the split), and cases where a single chunk spans several paragraphs that share the same formatting (even several list items, including numbered ones). But parsing that doesn't seem complicated.
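
One way to normalize this is to walk the chunks, cut them at newlines, and merge neighbours that ended up in the same paragraph with the same style. A minimal sketch follows, with loud assumptions: chunk lengths are taken to count UTF-16 code units (I have not verified the unit), the style is reduced to a plain string stand-in, and Chunk, Piece and splitIntoPieces are illustrative names:

```swift
import Foundation

/// Stand-in for ChunkInfo: a length plus some comparable style description.
struct Chunk {
    let length: Int
    let style: String
}

/// A run of text inside exactly one paragraph, with one style.
struct Piece: Equatable {
    let paragraph: Int
    let text: String
    let style: String
}

func splitIntoPieces(text: String, chunks: [Chunk]) -> [Piece] {
    let units = Array(text.utf16)  // assumption: lengths are in UTF-16 code units
    var result: [Piece] = []
    var pos = 0
    var paragraph = 0
    for chunk in chunks {
        let end = min(pos + max(0, chunk.length), units.count)
        var start = pos

        // Emits units[start..<upTo], merging into the previous piece when
        // it has the same paragraph and style (undoing "random" split points).
        func emit(_ upTo: Int) {
            guard upTo > start else { return }
            let s = String(decoding: units[start..<upTo], as: UTF16.self)
            if let last = result.last, last.paragraph == paragraph, last.style == chunk.style {
                result[result.count - 1] =
                    Piece(paragraph: paragraph, text: last.text + s, style: chunk.style)
            } else {
                result.append(Piece(paragraph: paragraph, text: s, style: chunk.style))
            }
        }

        // Cut the chunk at every newline it contains.
        for i in pos..<end where units[i] == 0x0A {
            emit(i)
            paragraph += 1
            start = i + 1
        }
        emit(end)
        pos = end
    }
    return result
}
```

The newline itself is dropped, so each Piece carries only visible text. Empty paragraphs produce no pieces and show up as gaps in the paragraph numbering.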

Notes.app has a lot of features, and I'm sure this simple format misses many of those; but that's already quite interesting, I think.

There is one more sad thing about it: it looks like it's impossible to reliably extract table dimensions from this. Tables are represented as attachments (with a "com.apple.notes.table" type), so we have GUIDs, which, presumably, can be used to get information from iCloud, and we have something that looks like an internal pointer in Notes.app, but not much more.