SQLGOD
Documenting my journey as a graduate student working with SQL & trying to publish a paper on its error messages
Fall 2025
August/September
Week 1 (August 25th - 31st)
- Class orientations (CS689: Deep Learning, CS682: Artificial Intelligence, CS784: Scheduling)
- Caught up with professors that I haven't spoken to since the Spring 2025 semester
- Attended Dr. Stefik's kickoff meeting with his research team
Week 2 (September 1st - 7th)
- Met with Dr. Stefik to discuss the steps needed to conduct the SQL research study formally under IRB approval, and how realistic the paper's chances of publication are
- Met with Dr. Cisneros to seek advice regarding the progression of my Master's degree. Learned that the average student does not know their research topic until the second year of the Master's program and that the first year is heavily loaded with classes.
- Scheduled to meet with Dr. Stefik's research team next week to discuss the methodology of the experiment and how it can be modified before filing for IRB exemption
Week 3 (September 8th - 14th)
- I began a literature review to see whether any papers on SQL had been published in the past three months, as it had been three months since my last literature review. While searching, I discovered that Toni Taipalus had recently published the paper "Enhanced SQL error messages facilitate faster error fixing". After skimming the paper, it was apparent that his experiment was designed in an eerily similar way to the prototype study I had conducted 3+ months ago. When I discussed this with Dr. Stefik, he told me that, in his own words, I had been "scooped". On one hand, it's unfortunate that my approach has been taken, so my paper wouldn't add something totally new to the literature. On the other hand, this does mean that my approach and thought process are aligned with those of a well-decorated PhD. Not bad. To continue this journey without making it a replication study, as the aim is to one day publish this work, my goal has now shifted towards targeting the errors of aggregate functions in SQL, rather than general SQL errors.
Week 4 (September 15th - 21st)
- Met with Dr. Nasoz to catch up and gather their opinion on the direction of my study. They emphasized that I should focus on demographics and change the queries entirely from what I had previously used in my prototype. Additionally, they warned against making the study too complex, as that would narrow the number of viable participants.
- Learned that I need to complete Collaborative Institutional Training Initiative (CITI) training before submitting to the IRB, as it is required before approval is granted.
Week 5 (September 22nd - 28th)
- Contacted Toni Taipalus via e-mail to establish a line of communication, express my appreciation and admiration for his work, and ask for suggestions regarding my own desire to research SQL in education. Toni is a Finnish researcher who has published numerous papers regarding SQL and how novices interact with it. After discussion, Toni was very pleased with my proposed study idea and willing to help however possible.
- Demographics and aggregate functions were not considered in any of Toni's work and may serve as my opportunity to bring something new to the publishing space
- Dr. Stefik suggested working with a friend of his from Germany who has previously published a paper on SQL, "An Empirical Study on the Possible Positive Effect of Imperative Constructs in Declarative Languages: The Case with SQL"
- My todo list consists of deciding what my target demographic will be (heavily considering who is and isn't a viable candidate), determining what aggregate functions I want to use, determining what functions were used by Toni in his paper, and reading the suggested paper
October
Week 6 (September 29th - October 5th)
- No research work was done this week, as it was busy with classwork deadlines. Last week's todo list still needs to be addressed.
Week 7 (October 6th - October 12th)
- In preparation for midterms, research work has yet again been neglected for the most part of this week.
- Target demographic will be the following:
Undergraduate Computer Science students that have completed CS326, Programming Languages
Graduate CS students without any prerequisites
Recent CS graduates, defined as being students who have graduated in the past 12 months.
- Any aggregate functions that are chosen need to be straightforward to understand. With this target demographic, it is important that the study doesn't expect too much of participants, given that they're novices in the field.
Week 8 (October 13th - October 19th)
- A Zoom meeting was held with researchers whose names are redacted. The purpose of this meeting was to learn about the work they're currently doing with SQL and to get a sense of direction for my own study. It was suggested to use N-of-1 studies to help quickly determine potential avenues regarding SQL error messages and syntax. More specifically, it was suggested to create an N-of-1 study that measures whether or not the new SQL syntax introduced by Google actually has any validity to it among novice programmers.
- I found Google's paper "SQL Has Problems. We Can Fix Them: Pipe Syntax in SQL" to have little validity and to read more as self-promotion. Its evidence consists of Google's own employees using a version of SQL the company developed, and it reports that as time went on, and as the company continued to push people to use it, more people preferred the new approach over traditional SQL syntax. The real difference between the two is the use of "|>" before every line and a change in the order in which the parts of a query are written.
Week 9 (October 20th - October 26th)
- Dr. Stefik proposed the idea of creating an N-of-1 study that randomly generates tables and a corresponding query that consists of nested COUNTs, requiring the user to read and comprehend the query, then provide an answer between 0 and 9. This approach aims to isolate pure syntax readability and measure whether or not there's a meaningful difference across multiple formats. My responsibility now is to use existing resources and create a simple program that can do just this.
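A minimal sketch of what one trial of such a program might look like, assuming Python and its built-in sqlite3 module. Everything named here (the items table, its columns, the HAVING threshold, the Trial class, and the make_trial function) is a hypothetical placeholder for illustration, not the actual study design. Capping the number of groups at 9 is one simple way to keep the correct answer between 0 and 9.

```python
import random
import sqlite3
from dataclasses import dataclass


@dataclass
class Trial:
    rows: list       # (category, value) tuples shown to the participant
    threshold: int   # minimum group size used in the inner query
    query: str       # the nested-COUNT query the participant must read
    answer: int      # ground-truth result, always between 0 and 9


def make_trial(seed: int) -> Trial:
    """Generate one random table and a nested-COUNT query over it."""
    rng = random.Random(seed)          # seeded so a trial can be reproduced
    n_groups = rng.randint(2, 9)       # at most 9 groups keeps the answer in 0-9
    threshold = rng.randint(1, 3)

    rows = []
    for g in range(n_groups):
        label = chr(ord("A") + g)      # categories A, B, C, ...
        for _ in range(rng.randint(1, 4)):
            rows.append((label, rng.randint(1, 99)))
    rng.shuffle(rows)

    # Outer COUNT counts the groups that survive the inner COUNT filter.
    query = (
        "SELECT COUNT(*) FROM ("
        " SELECT category, COUNT(*) AS n FROM items"
        " GROUP BY category"
        f" HAVING COUNT(*) >= {threshold}"
        ") AS grouped"
    )

    # Let the database compute the ground truth instead of re-deriving it by hand.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (category TEXT, value INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    answer = conn.execute(query).fetchone()[0]
    conn.close()

    return Trial(rows, threshold, query, answer)


if __name__ == "__main__":
    trial = make_trial(seed=1)
    for category, value in trial.rows:
        print(category, value)
    print(trial.query)
```

To compare formats, the same generated table could be paired with the query rendered in each syntax under test, with the participant's answer and response time recorded per trial.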
November/December
Week 10 (October 27th - November 2nd)
- The last midterm of the semester was this week. That was a nightmare, but now hopefully I'll be able to begin dedicating more time towards research as the month of October comes to a close.
- Began experimenting on how to create the experimental environment as described in last week's update.
Week 11 (November 3rd - November 9th)
- Absolutely nothing, CS689 is taking up all of my time.
Week 12 (November 10th - November 16th)
- Absolutely nothing, all of my classes are eating up my time. At this point, I do not anticipate any work to be done until winter break, but only time will tell.
Week 13 - 15 (November 17th - Dec 7th)
- This period of time consisted of Thanksgiving break, finishing the remaining homework for my classes, and working on the SQL Assistant model for CS689. No work was done towards my own independent research.
- Taken from the abstract of my SQL Error Assistant paper:
SQL is the industry standard language for relational database management systems and is commonly introduced to novice programmers through systems such as MySQL. Despite its long history, SQL error messages are often perceived by novices as unintuitive, confusing, vague, or unhelpful. Over the past fifty years, relatively little effort has been made to improve these error messages to support novice programmers' ability to understand and adopt the system. This paper proposes a modern solution by leveraging Salesforce's CodeT5 model, a transformer model designed to understand and interpret coding languages, and fine-tuning it to support MySQL, enabling the generation of error messages that are tailored to a programmer's understanding of the language.
- The TLDR of the paper is that although I generated 100,000 unique samples, the model was overfitting and never learned the context of the queries.
Week 16 (December 8th - December 14th) - FINALS WEEK
- This week consisted of two finals (CS682, CS784) and the project presentation for the SQL error assistant I've been working on in CS689.
- With the first semester of grad school in the books, I look back on it as a rollercoaster of emotions. I would never retake or recommend CS682 so long as the professor I took it under continues to teach it.
This is not because of them as a person, but because they put no effort into helping their students learn the material beyond writing on the whiteboard.
CS689 had the growing pains of being the first class of its kind ever taught. It has promising aspects, but it still needs serious revision.
CS784 gave me the experience I anticipated having in grad school. It was a healthy balance of not too much work, but still challenging and requiring you to be engaged with the material at all times.
I look forward to next semester, where I can hopefully begin to make progress on my research and live a much more balanced lifestyle with school.
Over this winter break, I look forward to designing and building experiments that I may use in my research, as I have no desire to become a couch potato.