A Story About Failure In Software Projects

This is a story about failure. Think about anything that can go wrong when building software. In our last project, we made almost every mistake you can think of. But this is also a story about how to learn from failure. And I learned what separates good management from bad management in these situations.

I believe most of our learnings can be applied to any software project. Hopefully, telling this story helps other teams make better decisions than we did. And maybe it helps management better understand what their team needs.

But let’s start from the beginning.

Disclaimer

Most of my colleagues who were involved with this project reviewed this post. They added valuable perspectives or corrected some of my perceptions. Nevertheless, this story is primarily based on my point of view. I tried to be as objective as possible, though.

To make it more appealing to non-technical readers, I simplified technical details where I felt they were not relevant. I plan to write a separate post about the technical learnings.

The Project

When the project started, there was no Business Solutions team. Just me, a Salesforce Developer. We had roughly 90 users back then and I was the sole person responsible for everything. My responsibilities encompassed administration, user training, requirements engineering, backlog prioritization, solution design, DevOps engineering, and implementation.

What We Were Trying To Achieve

Our operations and support team asked me to automate the monitoring process of our main product, the market-leading solution for dynamic load management of electric vehicles. When one of the controllers that control up to 100 charging stations goes offline or recognizes an error in one of its charging stations, the customer wants to be informed within minutes. We already have a backend that recognizes anomalies on a technical level, but the whole monitoring process that follows was mostly manual and driven by multiple business rules: Finding the customer record and their contacts in the CRM, finding out if they purchased our monitoring services, and informing them of actions taken.

To make things more interesting, the monitoring process is handled in OpsGenie (an Atlassian tool for incident management), while all customer data and all customer communication are maintained in Salesforce. OpsGenie is used by an outsourced team that works in 24×7 shifts. Salesforce is used by our internal support and operations team.

To sum it up, we had to coordinate three teams with three systems:

The technology team. They own the Backend of our product.
The monitoring team. They work in OpsGenie and handle technical actions.
Our internal support team. They work in Salesforce and handle customer communication.

As you can imagine, the potential for optimization was huge.

So far, the Backend was directly integrated with OpsGenie. However, our internal team was working in Salesforce to communicate with the customer. So we had our monitoring team copy-pasting all details from OpsGenie alerts into cases in Salesforce.
As I pointed out earlier, Salesforce owns all contact information for a customer installation, the services a customer purchased, the project status, and the preferred language of the customer. All this information is required to filter only those anomalies that are actually relevant. The monitoring team did not have access to this information.
Emails were composed manually. The team had a couple of text templates they managed in Outlook or Word, manually looked up the customer contacts, and had to send emails manually from cases.

As a result, the outsourced team spent almost 40% of their working time on controllers that should not have been monitored. Due to the manual hand-overs, informing a customer typically took hours, not minutes.
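To make the filtering problem described above more tangible, here is a minimal sketch in Python. It is not our actual implementation: the field names, the status value "completed", and the rule itself are assumptions for illustration. It only shows the idea that an anomaly becomes relevant once it is matched against CRM data such as purchased services, project status, and contacts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Installation:
    """Hypothetical, simplified view of a customer installation in the CRM."""
    controller_id: str
    monitoring_purchased: bool
    project_status: str          # assumed values, e.g. "in_progress" or "completed"
    preferred_language: str
    contact_emails: list[str] = field(default_factory=list)


def is_eligible(installation: Installation | None) -> bool:
    """Decide whether an anomaly for this installation is relevant at all."""
    if installation is None:
        return False             # unknown controller: nobody to inform
    return (
        installation.monitoring_purchased
        and installation.project_status == "completed"   # assumed business rule
        and bool(installation.contact_emails)
    )


def filter_anomalies(anomalies: list[dict], crm: dict[str, Installation]) -> list[dict]:
    """Keep only anomalies whose controller belongs to an eligible installation."""
    return [a for a in anomalies if is_eligible(crm.get(a["controller_id"]))]
```

Without access to the CRM data, the monitoring team had no way to evaluate a rule like this, which is why so much of their effort went into controllers that were never supposed to be monitored.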

Our Timeline

The first prototype started a few years ago in December. I was still working alone back then, so we focussed on Salesforce to start with. We picked the functionality to identify eligible customers, their contacts, and the automatic sending of emails as our minimum viable product (MVP).

The initial design implemented two endpoints to process anomalies that our Backend sends us, store all decisions for documentation purposes, and automatically create cases for them. These cases would then trigger email notifications to our customers. We already knew that it would take months to integrate with our backend directly (as I learned later, this was still too optimistic). Therefore, I designed the system so we could create cases manually. That way, my users could start sending emails without the need to integrate our Backend.

The first MVP was needed by the end of the year. Naturally, I worked massive overtime during Christmas and New Year’s to make this happen, only to learn that everybody who could test and review my prototype was on vacation.

It took two more months until we rolled out the first version. The design still had some rough edges, but essentially it did what it was supposed to do: Automatically determining the recipients for a given case and sending email notifications to eligible customers only.

In the meantime, I tried to integrate our Backend and OpsGenie with Salesforce. Because we had a couple more systems planned for integration, it was apparent that we had to introduce a middleware. Salesforce could never have handled the complexity of implementing all those integrations directly. I knew from a previous job that there is a technology to solve these kinds of problems: messaging. So of course, I wanted to introduce it. I chose RabbitMQ, just because I knew a little about it and it was also used by other teams in our company. I already had some experience with the Streaming API, so how hard could it be to write an adapter connecting Salesforce to a message broker?

Well, as it turns out: Pretty hard if you have no idea what you’re doing. Since I am not particularly familiar with Python or TypeScript, I had to hire a service provider to do that for me. Unfortunately, we had a misunderstanding about the required experience, and the developer they provided was a fresh graduate. I was not able to explain to him what messaging even is and how I imagined the adapter. I failed even harder at recognizing why he did not understand a word of what I was specifying, and together we failed for 6 months until we abandoned the cooperation. After that, I was lucky enough to find a freelancer who had extensive experience with the technology and built the adapter in less than 3 weeks.

I was desperate to tell my management that this is way too much complexity and workload for a single person. I was still working alone, had no significant experience with the technology, and had a maximum of 10 hours a week I could put into this topic.

In April that year, we shifted our focus back to Salesforce-only projects. It was refreshing to deliver some value again and I had the time to build solutions around quote management, work order management, and a simple, one-way ERP integration.

We finally started to build a team. Or at least we failed trying. As you probably know, 9 women can’t make a baby in 1 month. Adding three more people in the middle of the project doesn’t make you go faster. Naturally, we hired a junior developer and a working student to support me.

It came as a big surprise that this didn’t help at all. A couple of months later, someone in management suggested bringing in an interim Head, who was tasked with building a department: Business Solutions. Over the next few months, our new interim Head set a lot of great changes in motion.

In late 2021, we actually started to build a team: We introduced Scrum as our development framework and the junior dev switched to the product owner (PO) role. Since we were now two people, we could finally distribute responsibilities a little bit. For the first time, we started to get an understanding of the enormous size of our backlog. I was given a small budget to work with a Salesforce consultancy to scale our Salesforce operation so I could put more focus on solution architecture and delivery.

This could have been our blissful happy ending. Well … I wouldn’t tell this story if it was. In January of the following year, our colleagues from the US approached us with an urgent requirement: They needed this integration done by May because one of our most important customers demanded it.

And this is when things started to go south.

We took a brief look at our backlog and my existing concept from February of the previous year and confidently said: Yes, everything’s ready. All requirements are basically implemented, and there are only a few requirements that need refinement. We just need a couple of weeks to finish it.

So we took my original concept and picked up where the original prototype stopped. We were quick to build all the features that were missing. In parallel, our PO worked on gathering more requirements. Without going into too much detail, we ended up with 15 functionalities. For the initial MVP. The business made it very clear that they needed all of them right from the beginning, at our initial deadline in May.

We went to work based on my original concept and the new requirements. We worked hard for more than 4 sprints to get all features done. The business neglected to review the work in progress, they only wanted to be involved when everything was done. Our PO did his best to provide feedback and we were very satisfied with the results.

So what can go wrong if you cram every feature your company can think of into your first release? Well … everything.

Every time we had a release candidate ready for approval on QA, our business users found one detail that didn’t work for them. This always blocked us from releasing, even though most other features were working just fine. In total, we had to postpone the release 4 times. When we finally got the approval, we started to write minimal documentation for user training. After our key stakeholder reviewed the documentation, she immediately found a small but fundamental flaw in our process. One day prior to rollout. We had to postpone it. Again.

It took us until June 2022 to fix all these problems. Luckily, our system integration problems were solved by then. In mid-April, our third internal team member started. He is our backend developer and owns our system integration infrastructure (which I so gloriously messed up). He is one of the brightest guys I ever worked with and was able to take over in a matter of weeks.

By the end of June 2022, we were done. Or so we thought. We finally had the capabilities to test it with production load. As we learned, Salesforce couldn’t handle it. The whole system got clogged with asynchronous jobs and started to throw database errors. To make things worse, we inserted way too much data into the database. By our projection, we would run out of space in less than 20 days. There were several other massive flaws in my design that only became apparent to me after I saw it handling production load. It became clear that the initial design didn’t scale (to put it mildly).

We had to start from scratch, to potentially scale to thousands of controllers (and not just a few hundred). Additionally, we had to re-think the way we documented all the decisions the system made. The business logic to check if a customer was eligible for monitoring was complex and it was extremely helpful to have some sort of documentation available. However, having to go through hundreds of records was not helping our users. Keeping this logic would have required extensive investment in data licenses and a complex UI to present the information in an accessible way. We had to find a different design that was easier on the Salesforce limits and was better at fulfilling the requirements for documentation and integration with OpsGenie.

Unfortunately, we didn’t have the time.

Instead, we had to go live with the flawed design and find a way to filter the incoming data to reduce the load on our system. We found a way to only route traffic from US controllers to our system (as of today, these make up only 10% of our controllers). This way, we could still go live, deliver some value, and test our understanding of the domain. Out of the 15 features that were originally planned for our MVP, we learned that only two of them were actually vital. And fortunately, those two were working already. It was easy to disable the other 13 features to make the rollout less challenging for our users. On July 11th, we rolled out the true MVP of our monitoring solution.

We split the team so we could stabilize and support the old design while I was working on the re-design. This was a little bit challenging but definitely paid off. Most of the work was experimenting anyway. The hard part was optimizing for Salesforce-specific limits while still maintaining the functionality of documenting all relevant decisions.

I totally failed to communicate that this meant a full rewrite. Most of my stakeholders were under the impression that “everything is basically done” and that this redesign was merely a little bit of variable renaming and code reorganization. Out of the 15 features I mentioned earlier, more than 10 were somewhat implemented with the old design. It came as a massive shock to everyone that we had to rewrite major parts of all of them.

The hardest part was maintaining the functionality of the old design while silently introducing the new design. However, after two more weeks, I was able to migrate the first of our MVP’s features. Feedback was positive and, fortunately, the new design also fixed all the flaws of the old one. So we rolled it out after a couple of days of testing in August. As it turns out, you can deliver functionalities one by one. Each one delivers value and allows us to learn.

This was our breakthrough.

Equipped with that learning, we designed a release plan for the rest of our functionalities. One or two features per release. Start the next release only if the previous release is working in production.

Three weeks later, we rolled out the next two features with the new design.

Two weeks later, we rolled out the feature to process alert updates from OpsGenie.

Another two weeks later, we were able to migrate the last feature to the new design.

Our Mistakes and Learnings

This story was tough to tell. Looking back, I am still baffled at how stupid and naive I was. But on the other hand, I am very grateful for all these learning opportunities. This is not a matter of course. In most organizations, you do not have the leeway for more than one or two mistakes. This is a real problem because mistakes happen. If we punish people who make them, we strip them (and the whole organization) of the chance to learn from them.

I will use the word mistake very broadly in the following paragraphs. I include things like bad decisions due to incomplete information or bad judgment, completely misjudging something even though you should have known better, wrong predictions about the future, or simply a development that invalidates something we thought we knew.

Mistakes Will Happen. Inevitably.

This is probably the biggest learning I had. It may sound trivial, but the complexity lies in how they happen, and whether you notice them. Most of the things I was doing, I was doing exactly how I had been doing them for years. But I hadn’t failed so miserably before. So what was different?

I believe the worst mistake I have been making throughout my career is falsely concluding that just because something didn’t blow up in my face, it was actually a good decision. The more volatile and uncertain your environment, the more dangerous mistakes or misjudgments can become. Usually, a small number of them isn’t even noticed. You can compensate for them with a little more effort. Or, if you can’t, your project has a small delay. But in the end, nobody cares.

However, once your little mistakes accumulate above a certain threshold, everything goes down. And then, people will notice every single mistake that was made along the way. In order to avoid that, we need to pay attention to how risky an approach is.

The Development Team

Most of the mistakes are my personal failures. As the team grew over the course of the project, we made some of them together.

I Used The Project As My Personal Playground

My whole career I have been constantly learning new things along the way. This is one of the traits I pride myself on. This involves learning new technologies, experimenting with them, and introducing them as part of the project.

Up until today, I never consciously assessed the risks of said experimentation. I just did it, because I was eager to learn new things. More than 80% of the time, my experiments work, and everyone benefits from them. And for the other 20%, I can usually improvise. However, that didn’t work this time. I failed so miserably (more on that later), that I wasn’t able to correct the course.

From now on, I will try to be more honest when I review risks, so I can more consciously decide if they are worth taking. Not every project is your personal playground for learning new things.

I Underestimated The Complexity

In retrospect, the whole mess started when I decided to introduce a message broker. I still stand by the technical assessment that this is the only way to scalably integrate multiple systems. However, I completely underestimated the complexity of learning this from scratch. The fundamental mistake was that I wanted to introduce the broker myself.

I thought I could learn a new technology (Node.js or Python) while still fulfilling my primary responsibilities (administering and developing a Salesforce Org for almost a hundred users). I did not make it clear enough to my management that this was a more-than-stupid overestimation of myself.

There is no shame in focusing on what you are good at. On the flip side, there is no glory in overworking yourself to impress people who have no clue of what you are doing anyway. It is your job to protect your mental well-being. As a technical expert, it is your job to explain to your management what is needed to get a certain job done. Failing out of vanity is much worse than explaining why something is not a one-man job.

I Did Not Choose My Outsourcing Partner

The first agency I failed so horribly with had a long history of working with another team in our company. To keep things easy for me, we planned the collaboration under the management of that team.

Initially, I thought that it was a good idea. As I learned, it’s one of the worst things that can happen.

The partner had highly skilled developers who had all the skills I needed. However, those were not available to me. They were working on the projects of our other team.
The other team was the partner’s first priority. When they needed resources, they got them.
I had no transparency and no control over resources and billing. So I didn’t even have the chance to review and reject the invoices for work that was never delivered.
To this day, I do not have the slightest clue how much money we wasted on this collaboration.

It’s not the company that has the skills you need. It’s their individual consultants. Specific people. If you do not get access to them specifically, there is no point in working with the company at all. Even a proven partnership is completely useless if the involved people change.

We Did Not Hire The Right People

As I said earlier, I did not make it clear to my management what is needed in order to get this job done. I still thought that I would be able to learn the technologies in days.

There are two very bitter learnings I had from managing a partner I had no control over.

Don’t expect a junior developer who just graduated to speak fluent English. Also, don’t overestimate your own English if you are not a native speaker. Because let’s be honest: On our resumes, we are all C2-level speakers. It is irrelevant how good their technical skills are – if you cannot communicate your requirements, you will fail.
Non-technical people usually don’t understand how important skills and experience in tech are. They will always try to bargain you down to work with a less experienced (hence, cheaper) resource. This can work if you possess the expert knowledge to train them. If you don’t, it won’t.

If you do not have the skills and experience to do something by yourself, outsourcing will be a gamble. And just like real gambling, in the long run, the odds are against you. Eventually, you will end up with a partner who doesn’t deliver. Because you don’t have the skills, you will realize it way too late and you will lack the skills to fix it. This will cost you more money than you ever saved, and eventually, it will come at a time when the delay will break your neck.

Don’t get me wrong: You may end up with highly skilled developers or consultants who understand your problems and work in your best interest. These service providers or freelancers do exist. However, if you do not have the competencies, you have absolutely no means of controlling the outcome. You won’t even notice if you are ripped off.

I was too conceited to understand that this part of the project needed someone more capable than me. I was too convinced that I could learn anything within weeks. Instead, I should’ve focussed on what I am good at. And I should’ve put much more pressure on my management to get the right people for the job.

It is not your responsibility if a project fails because management refuses to hire the right people. However, it becomes your responsibility, if you take over and fail because you lack the skills.

We Did Not Revisit Our Requirements

When we got asked to continue the work, we pulled out our 10-ish months-old requirements and concepts. We did not take a single hour to revisit them and check whether they were still current and accurate. Instead, we blindly continued our work as if nothing had changed.

This was fatal because as we learned months later, my understanding back then was flawed. And it didn’t get better as the concept aged. So we worked based on an imperfect understanding of the problem and with wrong assumptions. The design was not yet proven to handle production load. The new processes were not even reviewed, let alone approved by the business.

Based on our shitty understanding, we completely misjudged our progress. Not only did we work with outdated requirements, but we also communicated way too optimistically that all the heavy lifting is basically done.

The problem is not with being honest. The problem is with raising completely unrealistic expectations by communicating too optimistically. Other people will make promises based on these expectations, and this is what will blow up in your face eventually.

Our MVP Was Not An MVP

Even though we called it our “minimum viable product”, it was, in fact, the full solution. Instead of walking the business through prioritizing the features, we simply accepted about 15 features as “must haves”. I had a rough understanding of what was really important, but we never tried to pin down two or three features to go live with.

This made us work for weeks on multiple features, without any real-world feedback. We had our assumptions, but we never validated them. Because we never thought of delivering these features separately, we also did a bad job of decoupling them technically. As a result, we were literally merging into the same package version. Needless to say, this greatly increases the effort of merging pull requests.

Since the MVP was months of developers’ work, it was also fairly complex. It brought too many changes for the business to review at once. Therefore, we had lengthy reviews with lots of questions, where we only discussed one or two features at a time. Every time we thought we were done, we discovered a new flaw in one feature.

Because we couldn’t release them independently, we had to hold back the whole version.

The First Design Was Not Optimized

This was probably one of the harder-to-anticipate mistakes. Things like that just happen in agile environments. To reiterate, the biggest mistake was not just messing up the first design. These things happen. The actual mistake was taking more than 14 months to realize it. The more features you build on a non-validated design, the higher the cost of fixing that mistake.

One endpoint was not designed to process records in bulk. Additionally, we inserted way too many records into the database. If you are not using Big Objects, Salesforce is not very efficient at handling large amounts of records. Without additional licenses, you will run out of space after 10m records. After less than 2m, you will already see a significant decline in performance.
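A back-of-the-envelope projection like the following would have shown the problem long before the production test. This is a sketch in Python. Only the limit of 10m records comes from the paragraph above; the current record count and the daily insert rate are invented for illustration and are not our real numbers.

```python
import math


def days_until_full(record_limit: int, current_records: int, inserts_per_day: int) -> float:
    """Estimate how many days remain until the record limit is reached."""
    if inserts_per_day <= 0:
        return math.inf          # nothing is being inserted, so the limit is never hit
    remaining = max(record_limit - current_records, 0)
    return remaining / inserts_per_day


# Illustrative values: 10m record limit, 500k records stored, 500k inserts per day.
print(days_until_full(10_000_000, 500_000, 500_000))  # 19.0
```

The point is not the exact result but that the estimate takes minutes, while discovering the same thing under production load cost us months.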

Unless you absolutely need to store the data, updating existing records is much more efficient. There is also no reason to write endpoints that process single payloads. Instead, design every endpoint to process records in bulk.
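The two ideas from the paragraph above, accepting payloads in bulk and updating existing records instead of inserting new ones, can be sketched as follows. This is plain Python rather than Apex, the in-memory dictionary stands in for the database, and all names and payload fields are hypothetical. It also assumes that payloads within a batch arrive in chronological order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ControllerState:
    """One record per controller, updated in place (hypothetical structure)."""
    controller_id: str
    status: str
    last_seen: str
    event_count: int = 0


def process_bulk(payloads: list[dict], store: dict[str, ControllerState]) -> int:
    """Apply a whole batch of payloads and return how many records were created."""
    created = 0
    for payload in payloads:
        controller_id = payload["controller_id"]
        state = store.get(controller_id)
        if state is None:
            # First event for this controller: create exactly one record.
            state = ControllerState(controller_id, payload["status"], payload["timestamp"])
            store[controller_id] = state
            created += 1
        else:
            # Later events update the existing record instead of adding a new one.
            state.status = payload["status"]
            state.last_seen = payload["timestamp"]
        state.event_count += 1
    return created
```

With this shape, the number of stored records grows with the number of controllers, not with the number of events, and a single call handles one payload or a thousand.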

We Waited Too Long To Provide Documentation

We were so busy building features that never made it to production, that we didn’t waste our time documenting them.

Therefore, we only documented the system when the business had approved it. Only then did we have the time to build an actual BPMN chart. And only then did our most important key user (she would be the main operator of the module) have the opportunity to really understand how those features worked together. This made her see some inconsistencies in our understanding of the process. The worst part: they were there from the beginning. Had we taken the time to flesh out these parts of the documentation right from the beginning, we would have saved a lot of time.

Documentation is an excellent way to challenge your own understanding of the problem. Only when you try to describe it from a user perspective do you see its flaws. And it also helps the business to see rough edges and misunderstandings much better than requirements documents. Requirements focus on the what. The solution concept focuses on the how from a technical perspective. Only the documentation tells the how from the user’s perspective.

It Took Too Long To Release

Because we crammed all of the features into the first release, we never really went live. There was a small release without system integration about 3 months after the project started. But even though this was production on Salesforce, it didn’t bring us real feedback.

The key to “going live” is not only bringing something to production somewhere but bringing everything to production on all systems. Only then can you learn if your concepts actually make sense and work in real-world conditions. In our case, it was the proverbial fatal blow to throw production load on the first design. This would have been an easy fix if we had only had one or two features built on it. We would have quickly come up with a more useful design and rewritten them. However, in our case, we had more than 10 half-baked features based on the faulty design. Because of all the other mistakes that were made earlier, it would have taken 10 times longer to fix the design. And we were out of time.

If I had to name one thing that everyone should take away from this story, it is this: Work as hard as you can to bring a tracer bullet live. There is literally nothing more valuable than validating your hypotheses in the real world. Don’t theorize too much, don’t cram too many features into your MVP, and if you are missing system integration, find a way to make it happen.

The Business Side

Mistakes were not only made on the technical side. We also made some regarding scope and prioritization.

The Scope Was Too Large

For various reasons, we tried to cram too many features into our so-called “minimum viable product”. We failed to keep the scope small for our MVP.

This made the project much more complex than it needed to be. Many of the problems we ran into were most certainly caused by how hard it had become to understand the domain.

Everyone should understand that the MVP is not equivalent to “phase one” or the “must haves” of the project. The MVP should never include all the must-haves at once. It is not intended for prioritization, it is intended for learning.

We Did Not Work With The Domain Experts

It took us months until we learned that the people who requested some functionalities weren’t the ones who fully understood them. They didn’t even own the process. Nevertheless, we were only talking to them.

Make sure you really understand who will use the things you build. Explicitly ask around. You will be very surprised by the people you will find. Speaking from experience, the business users that approach you rarely have the full picture.

Timelines And Scope Were Forced

Almost all teams with customer contact will eventually use the ultimate wild card to get their things prioritized: The strategic customer that will terminate the contract if a certain thing is not made possible within x days.

I’ve experienced this a couple of times now, but there are two things I have never seen: a) the feature being delivered on time and with the required quality and b) the customer quitting as a result.

Just because you are forced to work with Waterfall doesn’t mean the methodology magically works. Be vigilant when someone tries to force a fixed timeline with a fixed scope on you. It is important to understand that we don’t do Agile because we are lazy or want to annoy them, but because we firmly believe that it’s the best way to deliver software.

What A Team Needs From Their Management

There’s only one question management needs to ask themselves: Do I want my team to succeed?

The answer to that question is not as obvious as it seems. There may be situations where your team is in over their heads and there is no way they will be able to learn quickly enough to succeed eventually. This is the time to bring in new people and change responsibilities.

But typically, the mess is just a combination of badly managed expectations and bad luck due to the volatility and uncertainty. In these situations, your team has everything they need to succeed, except enough time. This is not the time to bring in new people or take responsibilities away.

In these situations, your team needs only two things: Trust and active support.

How To Trust Your Team

Trusting a team that already failed is not easy. Especially for middle management that is too detached to understand the technical details, yet experiences all the pressure from top management.

Let me tell you one thing: If the team has the experts for all the required technology, there is no one more qualified than them.

Why? Because they already did all the learning. Working in an agile environment essentially means experimentation. They know everything that doesn’t work. It doesn’t mean the next try will be a guaranteed success. But the chance of failing again is considerably lower.

On the other hand, nothing is more frustrating and discouraging than getting responsibility taken away after learning so much. This is why people quit.

How To Actively Support Your Team

When you’re actively supporting your team, you listen to their needs. This may be communicating a new timeline, a change in methodology, securing resources they know they need or offering guidance where they do not know they need it.

Being passive means letting them fail without caring for their needs. Active support without the necessary trust usually ends up in micromanagement or replacing people without aligning with the team first.

This creates an atmosphere of distrust and detaches the team from the solution. Nobody will be willing to take ownership, and eventually, nobody will care for the project anymore.

Summary

Looking back, the most interesting thing was to understand how single mistakes or misjudgments interact. Most things never become problems if they happen alone. Only in combination do they create problems that ultimately result in failure.

This is why it is so hard to anticipate their effects and not repeat the same mistakes over and over again.

The way a company treats mistakes has a massive influence on how a team can recover from making them. There are companies where the smallest mistakes that inevitably happen result in blame games and the replacement of people. Other cultures embrace them and understand that they are the necessary components of learning in an uncertain and volatile environment. These cultures produce more resilient and more trusting teams that eventually get more done, are faster, and have less turnover.