
500 Level.

Published 21 June 2026 at Yours, Kewbish. 2,675 words.


I’d say my undergrad ended on a high note. My last semester, I happily wrote my thesis and took classes that I truly, thoroughly enjoyed. Free of juggling pesky prerequisites, I filled my days with formal methods, French electives, randomized algorithms, and automated testing.

This last class in particular was called CPSC539L. My supervisor had nagged me to take it several times already, and I finally acquiesced. It promised “Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis”, and it certainly delivered. Being a 500-level grad course, it was conducted seminar-style. With about ten of us in total, it was the smallest course I’ve ever been in, which built some sense of scholarly camaraderie. Each week, we’d convene twice and discuss a paper each time. In between, we’d exchange initial responses on the forum and squeeze in a little time for our course research projects. This project, as well as the content of our readings, taught me plenty about different approaches in a new field. Yet I think this course left such a mark on me mostly because it taught me to read a paper.

I’d participated in reading groups before, albeit superficially. In retrospect, I was playing at reading papers more than actually taking much away. I was trying on the academic aesthetic as if in a fitting room. At some point I had probably read “How to Read a Paper”, studiously taken notes, and then turned around to read the next paper without internalizing any of the advice. I’d diligently summarize and underline and scrawl notes in the margins, all while not really critically thinking. One time someone pointed out my notes, cheerfully asking what I had to add to the discussion, seeing as I had lots jotted down. I was mortified.

I still have much to learn about getting the most out of paper reading; this comes with research maturity and experience. In the meantime, this is a meta-review of my favourite course last term and why it taught me so much.

13% – Paper Responses

Each class, we’d have a paper assigned. During the term, we’d take turns to present one in more detail, but most weeks we were tasked with reading the paper and writing a response. This writeup would need to include both a summary of the work as well as questions and reflection to feed in-class discussion. Helpfully, our professor provided a short list of questions to guide our responses: these covered topics like what clarifications might substantially change our impression of the work or what was most compelling about the research.

This process of re-articulating the paper was the most valuable learning process for me. Before, I would write down some disparate notes, but never to the level of identifying contributions, limitations, or research questions. Learning to make these explicit was one of my most meaningful takeaways from the course. Doing so while roughly referencing a structured list of questions was also useful to develop my ability to relativize between papers, since I revisited similar points each time and could find things to compare.

One step I added for myself was trying to track the implicit assumptions each paper made. This was quite fun detective work, like trying to figure out whodunnit in a spy novel. I felt like this helped clarify where certain framing decisions might have come from and gave me material to argue about. For instance, from my notes on “An empirical study of the reliability of UNIX utilities”:

- assumption: finding any bug is useful even if it's out of distribution
	- would argue only useful if erroneous within POSIX spec?
- assumption: distinction between crash, hang, and well-behaved exit is important
	- if program crashes and exits to shell and allows ACE that's bad, maybe uses up some disk space for storing coredumps, but there aren't really external consequences that differentiate them further?
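
To make the crash / hang / exit distinction from those notes concrete, here is a minimal Python sketch of how a fuzzing harness might bucket a single run. It is not the paper's actual tooling: the function name `classify_run`, the three labels, and the two-second timeout are all assumptions for illustration. It also assumes a POSIX system, where a negative return code means the process was killed by a signal.

```python
import subprocess
import sys


def classify_run(cmd, stdin_bytes, timeout_s=2.0):
    """Run cmd with stdin_bytes on stdin and bucket the outcome.

    Hypothetical helper: returns "hang", "crash", or "exit".
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_bytes,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child was still running at the deadline; run() kills it for us.
        return "hang"
    if proc.returncode < 0:
        # On POSIX, a negative code means termination by a signal (e.g. SIGSEGV).
        return "crash"
    # Any normal termination, including a nonzero error status, counts as an exit here.
    return "exit"


if __name__ == "__main__":
    # Feed a few arbitrary bytes to a child Python process that just reads stdin.
    reader = [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"]
    print(classify_run(reader, b"\x00\xff garbage"))
```

Lumping every nonzero exit status into "exit" is a deliberate simplification, and it is exactly the kind of boundary my second assumption questions.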

It was also amusing to note overly defensive framing and caveats in writing that was otherwise strongly confident. I tried to reverse engineer what reviewer feedback they might have gotten that prompted such additions. Of course, maybe the authors themselves had wanted to guard against misinterpretation, but sometimes this came across as a bit suspect. From my response to “Automatic generation of oracles for exceptional behaviors”, the Toradocu paper:

In the intro, I thought the framing of focusing on exceptional behaviour was overjustified to the point of repetitiveness. I felt this wordcount would have been better spent on concrete examples […]

Writing these responses also forced me to consider how the work was positioned in the broader landscape, with regard to past and future work. Explaining how a paper was different from the prior art was part of the expected response, so I had to at least try to understand and glance over the references when I was drafting. It helped that I had some background in model checking and specification in TLA+, so I could at least contrast the new tools I was learning about against a more familiar philosophy. Another guiding prompt revolved around future work, and I tried to always identify some next steps. I’m quite drawn to the concept of gap maps, especially for getting an overview of a field’s frontier, and thinking about this for each paper helped me develop my “gap sense”. This will come to be of use especially when I start my own research projects.

Through taking notes, I realized one mark of a solid paper was the number of questions I’d come up with in my raw notes compared to the number of lingering clarifications I wrote up in my response. A good paper would usually answer all my questions as I continued reading; I usually didn’t like the ones that left me with too many open questions.

I will admit that the papers start blurring together after some time. I feel for the actual grad students who’re usually taking two classes, sometimes each with four papers a week, alongside their RA or TA work. I felt two a week was a good pace to quickly explore a vast domain while retaining enough fuzzy context that I could recall which paper another reminded me of and later find the exact detail I was searching for based on my notes.

Recurring Motifs

While writing my paper responses, I often returned to similar “shapes” of questions, primarily around applicability, evaluation, and contributions. I noticed classmates did the same thing: some would frequently harp on the data visualizations, and others were skeptical about what types of bugs mattered. It seems folks have their own internal criteria and priorities when it comes to evaluating a paper: it’s not just “research taste” that we need to cultivate, but also “review taste”. Below are some of the regular angles in my paper reflections.

My main concern was usually how practical a tool was in context. One of my gripes with academia is this undercurrent incentive towards prioritizing the research novelty of a work over making it usable in practice. I also work adjacent to formal methods, which usually gets discounted as a pure academic luxury that’s not feasible at scale. Put together, this means I value work a lot more when I can see how it might be, or already has been, adopted in real workflows. Often, I’d ask questions about:

  • How fast tools were, and if they could be embedded in PR CI or CD loops — keep in mind these papers were broadly bug-finding tools, so I was thinking of how they could realistically fit into the dev process
  • How tools deal with incremental changes in the codebase, and if artifacts were maintainable long-term
  • How much effort tools take to adopt in practice: do they just require low-lift annotations or do they require a complex, custom DSL? When using the tool, how many components have to be rewritten for each new system we apply it on? How leaky is the abstraction?
  • How tools fail — I really liked when authors would clearly state failure modes, with examples — and how realistically tolerable these limitations would be
  • How much tools generalize to other systems, languages, or applications, and how judiciously the authors demonstrated this with their choice of case studies and results

Another guiding question prompt I liked was “How does the paper’s evaluation match with the proposed problem statement?” I think the smattering of papers we read ran the gamut of eval quality. It’s obviously easier to criticize an eval plan than to execute one, but even without background I would sometimes think of other experiments that would feel more natural and convincing. Other times, though, the eval hid some interesting gems. For instance, for the KLEE paper, I wrote: “differential testing like CSmith but with even stronger proof guarantees => really compelling and should have been bumped up in the intro”. Besides this, I usually questioned how the authors chose certain baseline systems, where overhead or improvements over these baselines came from, and setup details for reproducing experiments: I had a pet peeve over authors not explaining how and why they picked certain subsets of benchmarks to present eval on. Moreover, I found evals most convincing when confirmed by experiences with outside users or real-world adoption: I went on about this for the TSVD paper in particular. In a way, this outsources the work of convincing a reader the work is worthy of attention, since it’s social proof others have already been persuaded. Sometimes this can also paper over cases where the eval does not quite match the contributions.

Something more nuanced that I’d think about was the “hardness” of the contribution compared to how it was described. There’s nothing wrong with intuitive approaches, and in fact I really liked how obvious some of the papers we read felt (e.g. “Semantic fuzzing with zest” and the Delta Debugging paper). I was interested in whether the contributions made real advances. For example, most of the papers focused on finding bugs, so I’d often ponder if the bugs a new approach caught would have also been identified by existing tools given small modifications. I was usually harsher on LLM-based papers, because sometimes I didn’t think their contributions would stand the test of time, or even stand alone had it not been for external AI progress.[1]

Indeed, there were a few questions that came up often for LLM-based papers specifically:

  • How much an approach cost, both in tokens and inference time — I got so fed up with the recurrent lack of transparency that one of my responses reads “another LLM paper, another missing cost discussion”
  • How the approach would perform now, and how it would perform in the future as base model capabilities improve: even if a paper was published in the last year, half my response questions usually targeted how the approach would need to change
  • What contribution exactly causes the improvements over the baseline — I’m big on ablations, even though they’re expensive, since they can at least approximate some explainability. One of my notes includes:

    Evals for LLM papers have to be very thoroughly thought through and guarded, with defensive writing => it’s quite easy to be on the other side saying “but it’s just probabilistic” and throwing up one’s hands at the black box

Reading so many papers also gave me a sense of what presentation elements consistently boosted clarity. I liked:

  • Callout boxes at the end of results sections for main takeaways
  • Consistent colour coding, both throughout the paper and in graphs
  • Sparing and well-thought-through usage of math notation, as opposed to LaTeX vomit that tried to make an approach seem more complicated and rigorous than it was
  • Clear value statements in the introduction, but only if they’re backed up by the eval and rest of the narrative
  • Illustrative examples, especially for concepts that only differ slightly
  • Straightforward explanations of what the contribution or actual approach is — for my least favourite paper of the course, I had to reread it twice to understand what their tool actually was

20% – Participation

Besides the paper responses, another very fruitful component of the class was the discussions we had. While the selection of papers cut across the field quite nicely, the discussion per paper was also quite diverse, since we came from a variety of labs and research backgrounds across the CS/ECE departments. There was rarely strong consensus in liking or disliking a paper, and even then it was always for different reasons. Our professor mentioned during the first class that she always got more out of papers when discussing them than when reading them on her own, and I think the range of perspectives in class really made this ring true.

As I mentioned before, another class component was leading one class’s discussion. We were expected to summarize the paper’s approach and main contributions, then trawl through our classmates’ forum responses to organize discussion starters by theme. This exercise of picking out common threads was valuable, not only in picking up new ideas but also seeing where others’ thoughts had clustered, if these differed from my own. I tried to do this briefly myself for most papers, even if I wasn’t presenting, to start thinking where I might be able to riff off of others in class discussions.

The discussions were always extra fun when our professor would share lore about what she’d heard off-the-record from the paper authors at conferences. Extra follow-on work was also good fodder: I especially liked when the discussion lead would collect some replication studies or retrospective reports that could contextualize the work in hindsight. One example was ’“Synthesizing input grammars”: a replication study’: in this case we didn’t read the original paper, but this replication study (and another of my professor’s papers) provides evidence that fairly strongly refutes some of the original claims, and we got a little back-and-forth from the original authors in the form of this rebuttal.

I’ve never been in one, but I have the impression that in our discussions, we got to play pretend as reviewers in a PC meeting, just without the tedium of filtering papers, balancing a program, and the wordsmithing in carefully writing critical reviews. This feels like good grad student training, and I’m glad I got to get a peek at it.

Conclusion

I found a lot of joy in completing my readings each week. I set aside my Saturday mornings (and sometimes early afternoons), and I reveled in the ritual of opening my ACM DL tabs and sitting down to write. There’s something silly in this, reminiscent of kids reading the newspaper to imitate their parents. More often than not it was sunny when I was reading, and to further romanticize things sometimes I’d make a tisane. It was a halcyon habit. I felt very grateful to get up in the morning to read papers.

CPSC539L, and taking a grad course in general, was not as intimidating as I’d thought it would be. I don’t feel that I lacked any background, despite not having much prior PL/SE experience. However, I’d also say that to get the most out of them, seminar-style courses are also not as “chill” as other undergrads make them out to be. Yes, there are fewer drudgerous assignments, and in principle, readings aren’t so demanding, but engaging thoughtfully with a high volume of new research takes a certain mental load. This load has now become a load-bearing skill, though, and overall I’d recommend the experience.

I feel like I’ve started to learn how to read papers seriously, not just as someone cosplaying a grad student, but as someone who can meaningfully dissect the work critically. I’m looking forward to my CAPlendar series later this year, and all the papers I’ll read in between, so I can keep picking up more patterns and hone my reading taste.


  1. I will be obtuse and say several classmates and I were heavily criticizing one such paper in our lab lunchroom just before reading break, though maybe we were just jaded and needed a break.

