
Serialization.

Published 26 November 2023 at Yours, Kewbish. 2,417 words.


Introduction

I remember when I first started learning Python and realized what an f-string was. It was mind-blowing to me back then that you could not only print variables in a certain format, but also modify those variables and do other computation within the braces, and have it all display nicely. I was amazed that I could get variables of all different types to pretty-print themselves with an f-string.

I now know that under the hood, f-string formatting calls format() on each value, which by default falls back to __str__(), and that all the types I tried just had good __str__()s defined. This is an example of serialization: formally, transforming an object into a different representation so it can be saved to disk or transmitted over the wire. I like to think of it as dimension reduction in a representation — kind of like squashing a cube into a square, or in this case taking a photo of the cube (converting it into 2D data) to be faxed[1]. To bring this dimension reduction back to Python, if I ran the following:

d = {"a": 1, "b": 2}  # avoid shadowing the built-in dict
print(f"{d}")

I’d get {'a': 1, 'b': 2} as the output. This converts the dictionary into a string so it can be transmitted via printing, but in doing so, the string loses some of the properties of the original dictionary. You can’t call .items() on the string, nor can you add new key-value pairs. During serialization, the object’s lost some of its intrinsic properties while retaining the same core information.

This is an interesting problem (feature?) of serialization, and it’s something I’d like to explore further, especially in the context of exporting metadata from apps and tool interoperability. In this post, I’ll dive into JavaScript’s infamous [object Object] and its JSON serialization, then into alternative serialization methods, like dehydration of data, before pivoting a little into what it means to really own our data (spoiler alert: it has to do with serialization too!).

[object Object]

I was once helping Maple Bacon run their UBC-specific CTF, SaplingCTF. I was working primarily on the CTFd theme, and for this iteration of SaplingCTF in particular, I was trying to get some custom styling working for progression challenges. I thought I had it all working, until things inevitably broke during the competition and I had to do some debugging and hotfixing. While trying to deploy a fix as soon as possible, I left an alert(obj) in there somewhere, which meant that each time someone reloaded the challenge page they’d be greeted with [object Object]. Not my best work.

[object Object] has been ubiquitous in my JavaScript experiences. Forgot to wrap something in JSON.stringify() before concatenating it into a console.log() message or a request body? [object Object][2].

[object Object] comes from Object.prototype’s toString() method returning [object Object]. By the same mechanism, functions get tagged as [object Function] and date objects as [object Date] (though Date and Function override toString() with friendlier output, so you have to call Object.prototype.toString on them directly to see the tags). While all of these are objects, the second word in their serialization depends on their constructor type.
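
A quick console session to see this in action:

const obj = {};
`${obj}`; // "[object Object]", since string coercion calls toString()

// Date and Function override toString(), so call the base method directly:
Object.prototype.toString.call(new Date()); // "[object Date]"
Object.prototype.toString.call(() => {}); // "[object Function]"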

But this all changes if you JSON.stringify() the object. If you try it on {}, you’ll get "{}". If you try it on a date, you’ll get an ISO date string. What gives?

For one, Dates implement the toJSON() method, so JSON.stringify() knows to show the ISO representation of the date as its serialized value. In general, properties that have a .toJSON() will have that method called to determine how to serialize them; undefined values and functions get dropped entirely, while Maps and Sets flatten into empty objects, since their contents aren’t enumerable own properties. MDN has more about the specifics of the algorithm here.
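
To make those rules concrete, here’s how a few values come out of JSON.stringify():

JSON.stringify(new Date(0)); // '"1970-01-01T00:00:00.000Z"', via Date's toJSON()
JSON.stringify({ a: undefined, b: () => {} }); // '{}', both properties dropped
JSON.stringify(new Map([["k", 1]])); // '{}', entries aren't own enumerable properties
JSON.stringify(new Set([1, 2, 3])); // '{}'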

I recently learned that you can also provide your own replacer function to JSON.stringify() to tell it how to serialize certain types, like ArrayBuffers or the aforementioned Maps and Sets. It’s called on the object being stringified, then called recursively on each of the object’s properties. You can check the type of the object (e.g. object instanceof SpecialClass) and return a value satisfactorily serializing the object’s properties to be included into the JSON. JSON.parse() also has a reciprocal function called the reviver function. This is useful for including things like Sets in a JSON string, which I typically serialize as an array and rebuild into a Set on the other side.
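
As a minimal sketch of that Set round-trip (the __type marker is just a convention I’m assuming for illustration, not anything standard):

const replacer = (key, value) =>
  value instanceof Set ? { __type: "Set", values: [...value] } : value;
const reviver = (key, value) =>
  value && value.__type === "Set" ? new Set(value.values) : value;

const json = JSON.stringify({ tags: new Set(["a", "b"]) }, replacer);
// '{"tags":{"__type":"Set","values":["a","b"]}}'
const restored = JSON.parse(json, reviver); // restored.tags is a Set again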

Serialization in the Wild

There are other ways to approach serialization in JavaScript that avoid having to write your own custom replacers for common datatypes and that handle cyclical references well. For example, Cloudflare’s local Workers simulator, miniflare, uses devalue to flatten objects representing Workers abstractions into JSON so they can be piped through to different parts of the simulator[3]. devalue brings support for stringifying common objects like Maps and handles cyclical references, which you’d otherwise have to write replacer functions for. A nice bonus is that it can even unflatten values that are part of a larger string. This is called rehydration: if we think of serializing values as vacuum-packing and drying food for easier storage and later consumption, rehydration is the ‘just add water’ part of eating the MREs.

While using it, I noted that devalue has some pretty neat raw output as well. If I recall correctly, it outputs seemingly nonsensical nested arrays of numbers and string keys — think something like [[0, "a", []], "b"]. The project goals state that it’s intentionally not human readable, but it’s interesting that if you squint hard enough, you can kinda tell where the structure comes from.
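
Going from my memory of devalue’s API (a stringify/parse pair; check the project’s docs for the current surface), usage looks roughly like this:

import * as devalue from "devalue";

const value = { tags: new Set(["a", "b"]) };
value.self = value; // a cyclical reference; JSON.stringify would throw here

const flattened = devalue.stringify(value); // the flat, index-based array format
const restored = devalue.parse(flattened); // Set and cycle both rebuilt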

On the more theoretical PL side, I also recently learned about thunks. Thunks are a way to delay evaluation of an expensive function (and historically, to delay evaluation of its arguments) until later. In JavaScript, they can be implemented quite easily: wrap a call to some function, with its arguments already fixed, in an arrow function, and pass the thunk around by name.

const expensiveFunction = (n: number) => {
  /* calculate factorization of n */
};
const thunk = () => expensiveFunction(1337);
doSomethingElse(thunk); // can pass in the thunk function to later be evaluated

Thunks are mostly used to avoid executing code, but I’ve seen them used in serializing expensive or unreliable function results as well. I was optimizing a toy distributed KV store recently, and I was using ‘thunks’ to store mappings of unique IDs to values. This let me retrieve the underlying values only occasionally, when needed to hydrate parts of the KV map, while keeping other parts unevaluated and ready for computation later (see the sketch below). To me, thunks feel like a type of serialization too, since they package up a function in a representation for use (evaluation) in another context. However, since they’re usually implemented with basic built-in language features and are rarely read in their bytecode form outside of the program’s execution, maybe this one’s a bit of a stretch.
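
Here’s a tiny sketch of that ID-to-thunk pattern, where fetchValue is a hypothetical stand-in for whatever actually retrieves the underlying value:

// Map of unique IDs to thunks; nothing is fetched until it's asked for.
const lazyStore = new Map();
lazyStore.set("user:42", () => fetchValue("user:42")); // fetchValue is hypothetical

function get(id) {
  const value = lazyStore.get(id)(); // force the thunk only when needed
  lazyStore.set(id, () => value); // memoize so later reads are cheap
  return value;
}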

While Python’s pickling lives in an entirely different language, it also bears mentioning in a post about serialization. This past semester, I was working on a research project that had been started by another student the summer before. The project involved a lot of web scraping and I wasn’t looking forward to having to do it again, but the previous student had the foresight to save all the scraped Python objects as pickles! This made it incredibly easy to load in a big class instance and start manipulating data. The scraping had also been performed with the help of some other libraries, and since pickle restores each object with its class’s methods intact, even without those libraries installed I was able to call basic functions and get started quickly.

Pickling in Python is the richest form of serialization I’ve worked with yet: the fact that an object, its properties, and (via its class definition) its methods can be recreated from a simple file means that almost none of the original data or behaviour is lost. At the start of this article, I mentioned how serialization often feels like dimension reduction: after data is serialized, it typically loses some behaviour or some data attributes, even if it’s parsed and reconstructed later. Pickling makes serialization without this reduction practical.

A section on the forms of serialization I’ve encountered so far wouldn’t be complete without a final little hat-tip to Racket and its Lispy concepts of data as code and code as data with S-expressions. I can’t put it into words, but there seems to be something intrinsically connecting the concepts of serializability and having your code live as data and vice versa.

<iframe>: Pickling the Web

If serializing is converting something into a form that can be transmitted, there’s also something interesting in primitives that don’t have to be converted to be transmitted. Pickling is one such modality, but so are <iframe>s. Pardon my continual gross misuse of “serialization”, but I also think <iframe> embed tags are, if not a form of serialization, at least something very closely adjacent. They’re arguably one of the most ubiquitous and well-embedded (pun not intended) elements on the web, and they were introduced in 1997[4] (after Python’s pickle was implemented!).

In some ways, they’re similar to pickle files — they preserve the entire functionality of the embedded site and allow it to be integrated into the rest of another page. Like pickle files, they enable websites to be part of each other in a rich way, with some extra-strict container boundaries for security.

On the other hand, they’re not persistent like pickle files, even if the hosting website is a local file. They depend on the current iteration of the embedded website, and if that site goes down, the hosting site won’t display properly either. You could point an <iframe> at an archive.org snapshot instead, but that won’t necessarily reflect the most recent updates to the site. Maybe embedding an archive.org link is somewhat closer to serialization than a live <iframe>.

Tool Interoperability

At the core of serialization is the idea that data moves: between scripts (pickling), between obscured code and developer-facing output (JSON), and between domains on the web (<iframe>s). We’ve seen in this post all the ways serialization can enable interesting theoretical behaviour, but what happens in the real world?

I think serialization-forward software is almost necessarily at odds with the walled-garden, mystery-cloud approach of most SaaS tools. Nowadays, most tools support some way to export your data, but moving between software typically causes you to lose something. I once did a big migration from OneNote to raw Markdown, which, to state the obvious, meant losing all my image positioning, coloured highlights, and so on. That’s a big drop in fidelity, so I’ll forgive OneNote, but there are so many more examples online. You can export webpages to PDF, losing easy access to their hyperlinks or to interactive media. Depending on the service, exports can range from CSVs to JSON — you might get all the required data for your use case, but you also might not. You’ll certainly have to recreate the underlying functionality of the data, or find a tool to do so. And if you find another app and switch, it’ll often take significant work to convert to the serialization format the new program expects.

Serialization, and ways to extract rich metadata, is key for true interoperability. In an ideal world, we’d also be able to get some base idea of the object’s original interactivity models. Today, we’re stuck with copy-and-pasting raw unformatted text and taking static screenshots that don’t encode any of a tool’s behaviour.

The performance of all this conversion is another issue to tackle. To enable a practical workflow where you can switch seamlessly between apps, the context-switching processing time should be as low as possible. That’s a challenge when you’ll likely have to convert between file formats and reconcile differences in structure on the fly, all of which looks very expensive time-wise.

I see aspects of this interoperability in tools already — MS Word allows you to embed and interact with MS Excel charts, for example. There are also a plethora of sketchy scripts on GitHub that allow you to convert and import between tools. But they lack the authoritative support, polish, and ease of use that I think are necessary to bring this potential ecosystem together.

Conclusion

Serialization and representation are so closely intertwined that if I ran s/serialize/represent/g on this post it’d still make sense (and I’d still have to apologize for my overly stretched usage of either word). I don’t know how to differentiate between them, and I think in this post I’ve mixed both up and added a few more tablespoons of general interest in data to boot.

I’m interested in thinking about more open ways to move data around, relying perhaps on more easily serialized forms of data, like JSON, and figuring out how to codify original behaviour or intent. Ink & Switch’s Potluck and Cambria projects both sit close to this space, and I’ve also been inspired by Alexander Obenauer’s idea of universal data portability and his thoughts on data views. Streamlit, a Python framework to build data-based apps, seems like an adjacent step towards what I’m envisioning as well. I’d like to dive specifically into how to represent original intent at a rich level sometime — we’ll see what comes out of that thought.

Also, on a final note, I’ll be giving a lightning talk at Neovimconf 2023 about my personal knowledge management system in Vim! Tickets are free, so tune in on December 8th, 2023 if you’re interested in learning more about taking notes with Vim, a little splash of Ripgrep, and some FZF.vim too.


  1. By ‘dimension’ here, I mean one ‘aspect’ or ‘axis’ of data. I loosely also include ‘the ability to do something’ (something like an instance function) as a dimension. I think the term ‘dimension’ is probably overloaded here, and there’s probably a better way to express this sort of ‘squishing’. ↩︎

  2. This mistake once derailed an entire hackathon project for about 6 of the 36 hours. It was extremely frustrating to diagnose as a JavaScript beginner: what, was I supposed to know that fetch() was expecting a string for a request’s body? ↩︎

  3. When I was working on updating the Queues implementation, I looked into it for sending the Queue’s messages around. To be honest, when making the PR, all the serialization and buffer manipulation felt a bit like black magic, primarily because I was unfamiliar with how Buffers and ArrayBuffers worked. Learning more of the JS internals, like this, is something I want to improve at. ↩︎

  4. It was a surprise to me that Python’s pickle library predates <iframe>s. The first commit to CPython’s pickle library file was in 1995, but there are references to an even older ‘flatten’ version of the library. ↩︎

