Blog

Friday, 2nd May 2025

Codurance AI Hackathon: Exploring AI-Powered Software Development

On Saturday 26th April I attended an invite-only AI Hackathon at the Codurance headquarters in London. This blog post is a rundown of what happened, why it happened and what I (and the other attendees) learned. At the end I briefly discuss some limitations of the findings.
The full info can be found on the Codurance website.

The Aim

“AI is transforming software development, but how effective is AI-powered coding in real-world scenarios? Join Codurance's [...] AI Hackathon to put AI-assisted development to the test!” Online there are lots of examples of people building web apps “entirely” using AI/LLMs, but on closer inspection these projects are generally not up to standard: they are usually not well structured, tested or maintainable. So the aim of the day was to see how good the AI tools are at writing real-world, production-standard code that developers would be proud of.

Format

09:30 - Arrival & Registration
10:30 - Kick-off & Challenge 1
12:45 - Lunch
13:45 - Challenge 2
16:00 - Playback & Discussion
16:45 - Event Close
17:00 - Pub

I arrived at the Codurance office a bit early due to train times (about 9:15am). Once I was buzzed in I was met by Matt Belcher and Rowan Lea. Matt is “Head of Emerging Technology” and Rowan is a “Software Craftsperson”, both working at Codurance. I was warmly welcomed and, as I was the first to arrive, given a brief tour of the office - it's a really nice space with plenty of areas for collaboration as well as quiet spots to get your head down.

Over the next 45 minutes other developers filtered in. It was great to meet everyone! There was a wide range of experience both in the software development industry and with AI tools - I think this played into the whole day very well. Some people were veterans of the industry with 30+ years of experience but had only briefly used ChatGPT; at the other end there was a developer who was still quite early in their career but had been following and using AI tools extensively since they first arrived on the scene. This was great as it meant that everyone had something to contribute but also something to learn and improve on!

After some chatting and fuelling (coffee drinking), Matt and Rowan invited everyone into the space we would be working in. They explained that everyone was to be split into group A and group B, where group A would use AI tools for the first exercise whilst group B would use traditional non-AI methods. In the afternoon this would be swapped around so everyone got a go! I was assigned to group A, so I got to dive right into the AI tools. It was explained that we should get into pairs or threes within our group to tackle the two exercises.

Exercise 1

Exercise 1 was revealed as “StyleDen”. You can read the full brief here, but to summarise, it asks you to “build a minimal viable product (MVP) for their e-commerce website.” For the first exercise I paired with the person I was sitting next to. They were a C# developer but had been exploring and learning Python. As I was most comfortable with Python, we decided to work together and knowledge share along the way. Since we were assigned to the AI-first group, we had a discussion about the best way to use it and, more importantly, the best way to put it through its paces. We decided to use it to its full potential and avoid writing a single line of code ourselves if possible, i.e. just prompting and guiding it. So where do we start?

At the beginning of the task we had the README, which contained all the requirements and was all the information we had (and were going to get). The first thing we needed was a plan of attack. We did what anyone trying out AI would do: we passed the entire requirements doc to the AI and asked it to spit out a solution to the exercise. You can find the full conversation here.

The first thing it produced was titled “Overview”, which was essentially a list of the parts of the app we would need, with suggested technologies for each. It mentioned a frontend built in React, a backend built using Python's FastAPI, a SQLite database and a few other bits and bobs. It then laid out a file directory structure to help us visualise how to split up the app. It listed some key API endpoints which, after a review, we were happy covered all the requirements and were named and described pretty much as we would have designed them ourselves.

One part of the output that I was really interested and happy to see was a section titled “Tech Stack (Quick Justification)”. In this section ChatGPT outlined WHY it chose the technologies described above. For me this is a really key aspect of using AI. In most uses of AI I see, we ask it to complete some task or answer a question; we very rarely, if ever, ask it why it has done what it has done (a key point that came up again later in the day).

The last part it produced was a “Plan of Attack (MVP Steps)”. This was really useful as it gave us smaller, bite-sized chunks to iterate on as we created our MVP. My only issue with the plan of attack was that “Write some unit tests (especially backend)” was at the bottom of the list. This highlights an issue I have seen time and time again with AI-developed (and human-developed) code: testing is either not considered or, if it is, it's an afterthought. As an advocate of Test Driven Development (TDD), this is a real issue for me. Ideally I want tests to be written first based on the requirements, then code written to pass those tests.

After using ChatGPT to create a plan of attack we had another discussion about next steps. As we had agreed at the beginning, we wanted to fully embrace AI/LLMs, so after a brief chat about the technologies we would use, we decided to continue with those suggested by ChatGPT. This was partly because we had some experience with them, but also because they are widely used, which means in theory the LLMs will have plenty of training data and should produce decent code. That was the theory at least…

To actually start writing the code I used GitHub Copilot built into VSCode, which has an “agent” mode (see here for more info) - essentially this allows you to prompt an AI agent which will then make edits directly in your VSCode workspace. We started at the first step of ChatGPT's plan of attack and asked it to create a SQLite database along with a seed script (to convert the CSV into a database). This worked first time and produced a file that ran successfully without any tweaks. As previously mentioned, it did not create any tests. However, by discarding the changes and adding “Using TDD…” to the start of the prompt, the second attempt created a very similar script whilst also writing some tests. The first commit shown here showcases the seed script.
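For a flavour of what that kind of seed script looks like, here is a minimal sketch, assuming a simple products CSV with id, name, price and stock columns - the real StyleDen data and the generated script differ (see the commit linked above).

```python
import csv
import sqlite3

# Hypothetical file names and columns, for illustration only.
CSV_PATH = "products.csv"
DB_PATH = "styleden.db"


def seed_database(csv_path: str = CSV_PATH, db_path: str = DB_PATH) -> None:
    """Create a products table and load every row from the CSV into it."""
    connection = sqlite3.connect(db_path)
    try:
        connection.execute(
            """
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY,
                name TEXT NOT NULL,
                price REAL NOT NULL,
                stock INTEGER NOT NULL
            )
            """
        )
        with open(csv_path, newline="") as csv_file:
            rows = [
                (int(row["id"]), row["name"], float(row["price"]), int(row["stock"]))
                for row in csv.DictReader(csv_file)
            ]
        connection.executemany(
            "INSERT INTO products (id, name, price, stock) VALUES (?, ?, ?, ?)",
            rows,
        )
        connection.commit()
    finally:
        connection.close()


if __name__ == "__main__":
    seed_database()
```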

The second step was creating a boilerplate FastAPI app. I used prompts such as “Create a boilerplate FastAPI app using TDD”, which created a very basic app using the FastAPI framework. Another thing we explored was documentation writing. If we were writing real production code, this app would need to be worked on by other developers who may not have experience writing or running these APIs. So after getting some working code we asked Copilot to “Add local setup and running steps to the README”. All 3 files can be found in the second commit here.
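As a rough sketch of what that boilerplate-plus-test pairing looks like, it is something along these lines - the /health endpoint is my own illustrative choice, not the exact code Copilot produced (that's in the commit linked above).

```python
# main.py - a minimal FastAPI app, roughly what a boilerplate prompt produces
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health_check() -> dict:
    # A trivial endpoint that proves the app is wired up and running.
    return {"status": "ok"}
```

```python
# test_main.py - the TDD-style check, using FastAPI's built-in TestClient
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)


def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```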

The rest of the first session followed this flow. After a basic API was created we moved onto the frontend. Neither I nor my partner had extensive experience with React (though I am trying to learn a bit more). The first thing Copilot did was run “create-react-app”. I was surprised that it simply created a directory using the terminal and then ran the create-react-app command directly in the terminal too.

My part of the task was to create the cart page, so I prompted Copilot to create a new cart page with tests. I asked it to add some basic functionality, such as increasing/decreasing the item count in the cart and removing an item once its count reached zero. After some manual testing of the app I discovered that once I removed the last item from the cart, the table was still displayed, just empty. This was bad UX in my opinion. I was pleasantly surprised by how easy this was to improve by prompting Copilot: “Currently when no items are left in the cart nothing happens, update this code and tests to display a message such as 'No items in cart'”. It updated the code and tests in a straightforward way and in very little time.

By this point we were running out of time. I wanted to add a couple of finishing touches, so I asked the AI to add some images and a dynamic total at the bottom of the table. You can see the code's final state in my fork here, along with all the local running instructions. All the code and documentation was written entirely by AI tools - my partner and I edited no code manually. All in all I was very impressed with how quickly we got a working app up and running with very little intervention from us humans.

Lunch

Lunch was provided by Codurance and gave us all some well-earned time to reflect. It was hosted in the kitchen space, along with drinks. Of course, Exercise 1 dominated the discussion. There was lots of chatting between pairs within group A about which tools were used, which prompts worked well and other tips and tricks. There were also lots of discussions between groups A and B about various aspects of the task. The key takeaways were:

  • Group A got further in the exercise (a more complete solution with more features) than Group B
    • This was clearly due to the AI tools, which allowed them to work faster
  • Copilot and ChatGPT were the most widely chosen AI tools
    • This seemed to be due to familiarity, and to Copilot being built into VSCode, most of the developers' IDE of choice
  • The AIs did not write unit tests unless specifically asked, but when prompted they did write them, mostly to an acceptable standard

Exercise 2

The second exercise was revealed as “StreamStack”; the full brief can be found here. Essentially it was to build something similar to Letterboxd. For this task we decided to mix up the pairs, which allowed for new ideas and more networking. I ended up forming a three with two other developers who were happy to use Python. We knew we would have no AI help for this exercise, so we needed to stick to tools and technology we had experience with.

The entire code output for exercise 2 can be found here. We started off by working out that we would need a backend and a frontend. We wrote down some questions and design decisions on post-it notes and created a rough architecture/design doc. One of the team had experience with React and so offered to handle the frontend. This left myself and the other team member to create the backend. As the functionality was on the simpler side, I suggested using FastAPI. It is my preferred technology for creating APIs as it is simple, integrates with Pydantic for validation and has a great testing framework. My backend partner had not used FastAPI before and preferred Flask, but it didn't take me long to persuade them to give it a try!
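To show why I like this combination, here is a minimal sketch of the FastAPI-plus-Pydantic pattern we leaned on; the Film model and its fields are illustrative, not the exact StreamStack schema.

```python
# Illustrative only: a Pydantic model gives request validation for free,
# and FastAPI's TestClient makes the TDD loop cheap.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel, Field

app = FastAPI()


class Film(BaseModel):
    title: str
    year: int = Field(ge=1888)         # no films before the camera existed
    rating: float = Field(ge=0, le=5)  # ratings out of five, Letterboxd-style


films: list[Film] = []  # in-memory store, fine for a hackathon MVP


@app.post("/films", status_code=201)
def add_film(film: Film) -> Film:
    films.append(film)
    return film


client = TestClient(app)


def test_add_film():
    response = client.post("/films", json={"title": "Alien", "year": 1979, "rating": 4.5})
    assert response.status_code == 201


def test_rejects_invalid_rating():
    # Pydantic rejects out-of-range values before our code ever runs.
    response = client.post("/films", json={"title": "Alien", "year": 1979, "rating": 9})
    assert response.status_code == 422
```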

We continued much as you'd expect at a hackathon: we used Test Driven Development to put together the backend API and started integrating it with the UI. It was noticeably slower this time round compared to using the AI tools (especially without the auto-complete/in-line editing). However, I personally felt I understood every line of code and was happy that it would pass a code review. I also spent next to no time reviewing the code, as I had actually written it myself.

An example of this slowness came right at the start. We needed to create the FastAPI app, initially just with a “Hello world” endpoint to make sure we had set it up correctly. Previously I would have asked Copilot or ChatGPT to write a very brief boilerplate file for a FastAPI app. This time we had to google the FastAPI docs, navigate to the quick start guide and copy and paste the code from there. As I had used this many times before I knew where to look, which sped things up somewhat. However, this process would certainly have been faster with an AI tool.

By the end of the exercise we had a slightly crude web app with a UI and a backend. It had some basic filtering and sorting functionality, but we did not have time to complete all of the requested features in the time available. It had a full test suite though! And I'm sure with a bit more time we could rapidly add new features to it.

End of day discussion

This was the part of the day that was most insightful to me. A pair from group B kicked off the “show and tell” by showing their “StreamStack” app. They had used Cursor, and it was immediately very impressive. They had a complete application with every piece of functionality asked for, it looked nice, and they had even had time to add bonus things like images. One member of the pair said something that really stuck with me though. They explained that the application was practically a black box, as they had only given it a few prompts and simply asked it to create the application. After the AI had finished, they had tried to use images on a different page and were unable to get it working - something that should have been trivial. They said, “This application code was written two hours ago and I already feel like I'm working with legacy code”. They believed that if they had written it all by hand then adding these images would have been trivial, but because it was a bit of a black box it would take them much longer to understand the code and make the changes.

After this demonstration, that point resonated with me, as I felt my first pair had had a similar problem with the AI code being a bit of a black box. This prompted me to ask the question: “There has been lots of talk about black box code and not easily understanding the AI changes - did anyone ask the AI to explain the code?” There was a long pause, as it was clear no one had done this, including myself! It seems all groups had spent the day asking AI to write and change code and had not once asked it to explain code. This is a feature that has been advertised, particularly with Copilot's chat feature, and I have used it a few times at work when moving into a new project. I think this was a large unexplored area and a use we should have tested more during the hackathon.

Another group spoke about abstraction and refactoring. They said that the AI tools seem to heavily favour “copying and pasting” similar code instead of extracting it into its own function for reuse elsewhere. They had similar functionality in three places in their app and the AI re-wrote the logic every time. Again, this meant that if they wanted to tweak it in the future they would have to change it in multiple places. It seems AI does not follow DRY. They did say that with some guidance and prompting the AI tools could refactor and extract logic, but it wasn't natural for them and had to be asked for specifically.
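To make the point concrete, here is a toy illustration (not code from the day) of the kind of duplication described, versus the DRY version that had to be prompted for explicitly.

```python
# The pattern the group described: the same rule written out inline in
# multiple places...
def cart_total(prices: list[float]) -> float:
    total = sum(prices)
    return total * 0.9 if total > 100 else total      # 10% off over 100


def order_summary(prices: list[float]) -> dict:
    total = sum(prices)
    total = total * 0.9 if total > 100 else total      # same rule, duplicated
    return {"items": len(prices), "total": total}


# ...versus the DRY version, where the rule lives in one function and a future
# tweak only has to happen in one place.
def apply_discount(total: float) -> float:
    return total * 0.9 if total > 100 else total


def cart_total_dry(prices: list[float]) -> float:
    return apply_discount(sum(prices))


def order_summary_dry(prices: list[float]) -> dict:
    return {"items": len(prices), "total": apply_discount(sum(prices))}
```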

Another pair of developers followed on from this point. They had asked the AI tools to refactor some code in a specific file; it did manage this, but along the way it would update and change unrelated code in other files. A further person raised their hand and agreed, venting some frustration with this happening in production. They told us an anecdote: they had been working on a large codebase with many, many files and wanted to update/refactor a specific file. By default Copilot will take your whole workspace as “context” to make these changes. Unfortunately that also means it can access and make changes to every file in your workspace. They suggested a good improvement would be to tell the AI to “read” these files for context but only allow “write” changes in files X, Y and Z.

Lastly, a member of my three for Exercise 2 said, “I have achieved a lot less in this problem compared with using AI tools, however I can say for sure I am more proud of the code I have written”. I think this is a key point because, as developers, all code we commit has our name on it. We should be proud of the code we write. This fosters ownership and, in my opinion, results in better code being written.

Post-event

After the event we headed to the pub. There was still a bit of chatting about AI, but we were mostly all done with discussing it for the day. It was nice to chat about other, non-AI stuff over a beer. We all agreed we would love to attend a similar event in the future - thank you Codurance for having us!

Limitations

If we were to do this again, there are some things I would like to test. I think we gave the AI tools the best possible chance by picking problems that are widely solved, with lots of examples on the internet. Having said that, some questions remain:

  • How well does it do when writing code for embedded systems?
  • How well does it perform in a different problem domain?
    • What about a domain that requires lots of context which may not be widely documented in the training data?
  • How well does it perform in an existing code base?
    • Both of these exercises were building something new. How well does it work when asked to change/write new code in an existing project?
  • Would developers with more AI experience do better?
    • Some of the developers had little experience with AI tools. Are there ways of working that unlock better output? Had we known these, would we have done better?
  • As mentioned before, how good and useful is asking AI to summarise/explain code?

TLDR

My key takeaways are:

  • The AI tools are great for writing boilerplate/setup code
  • AI tools tend not to follow DRY
  • The AIs did not write unit tests unless specifically asked, but when prompted they did write them to an acceptable standard
  • The AI tools did better when asked to work in smaller steps
  • Developers are more proud of their work when using less AI
  • Some tools are better than others, with the tools that can edit directly in the IDE saving more time
  • The “auto-complete”/in-line functionality is the way most developers use the AI tools

Ultimately, it is clear to me: developers can already move faster and be more productive with AI tools and these effects are only increasing.