Continuous Quality for AI Coding: Checksum | SourceForge Podcast, episode #116

By Community Team May 20th, 2026

Checksum brings continuous quality to the AI coding era by automatically generating, running, and maintaining tests against real APIs, real data, and real-world edge cases before code ever reaches production. Powered by its Code World Model and intelligent testing agents, Checksum helps engineering teams ship 10x faster with scalable, always-on testing that learns from every commit and production bug.

In this episode, we speak with Gal Vered, Co-Founder and CEO of Checksum. We discuss the evolving role of AI in software development, particularly focusing on AI-generated code and its verification. We highlight the rapid advancements in AI’s ability to write code and the emerging challenges in ensuring that AI-generated code is reliable in real-world environments. Gal explains how their company addresses these challenges by providing a continuous quality layer that uses AI to generate and maintain end-to-end tests. This helps bridge the gap between code that appears correct and code that functions correctly in practice.

The conversation delves into the complexities of software engineering, emphasizing the importance of verification as AI tools become more prevalent. Gal discusses the concept of “continuous quality” and how Checksum autonomously generates tests to ensure code reliability. We also explore the balance between automation and human judgment, noting that while AI can handle repetitive tasks, critical thinking and decision-making remain essential human roles. The discussion touches on the future of AI in software engineering, predicting that verification will become a significant part of the process. Gal envisions a future where AI-powered testing tools are integrated into coding platforms, enhancing the overall development workflow.

Learn more about Checksum.

Interested in appearing on the SourceForge Podcast? Contact us here.

Show Notes

Takeaways

AI-generated code increases the need for reliable verification systems.
Faster code generation creates new bottlenecks around trust and testing.
Engineers now spend more time reviewing and validating AI-written code.
End-to-end testing is critical for shipping AI-assisted code confidently.
Manual testing approaches struggle to keep pace with rapid AI development.
Maintaining large test suites becomes difficult as applications evolve quickly.
Continuous quality automation helps reduce testing maintenance burdens.
AI-generated tests must still allow humans to control quality decisions.
Engineering roles are shifting from writing code to guiding AI systems.
Human judgment remains essential for prioritization and quality standards.
AI coding tools risk producing average outcomes without strong human direction.
Production complexity matters more than surface-level code generation.
Real-world production data reveals bugs traditional testing often misses.
Verification may become half of the autonomous software engineering stack.

Chapters

00:01 – Intro: the growing trust problem in AI-generated code
02:10 – What Checksum does and why verification matters
04:09 – Explaining AI-powered testing to engineering teams
05:56 – Why DIY testing approaches eventually break down
08:25 – Continuous quality and autonomous testing workflows
11:39 – Human judgment versus automated testing systems
14:05 – How AI changes the role of software engineers
16:20 – Critical thinking concerns in AI-assisted development
20:26 – Real-world bugs and production verification challenges
22:11 – AI-generated code and the future of secure software
23:05 – Why verification could become half of engineering
24:16 – Potential partnerships with AI coding platforms
26:01 – Understanding the “context void” in software systems
30:24 – The idea behind the “code world model”
32:33 – Why simulation is essential for autonomous engineering

Transcript

Beau Hamilton (00:01.57)
Hello, everyone, and welcome to the SourceForge Podcast. I’m your host, Beau Hamilton. Now, over the last couple of years, the conversation around AI and software has moved extremely fast. At first, the headline was that AI could help developers write code faster. And then it became AI could write entire chunks of code on its own. And now we’re seeing AI tools write the majority, in some cases, of the new code inside organizations. And the conversation is starting to shift yet again, this time from what code AI can generate to what AI code can be trusted. Because writing code is one thing, but knowing whether that code actually works in the real world is something else entirely.

And that’s where things get a little bit complicated. Software doesn’t live in an isolated environment. It doesn’t live in a vacuum. It runs inside browsers. It runs inside APIs, user interfaces, authentication flows. There are all sorts of things that have to come into play, and there are all sorts of unpredictable real-world conditions. And a piece of AI-generated code can look perfectly fine in isolation. It can pass a bunch of checklists. It can still fail, though, once it runs inside the actual environment it was written for. So the bottlenecks really start to appear in all sorts of different ways, shapes, and sizes, right? And I think that’s a pretty useful setup for today’s conversation with Gal Vered.

Gal Vered is co-founder and CEO of Checksum. And Checksum’s broader thesis is that software engineering is heading toward a world where generation is increasingly automated, but verification is still the hard part. And the company describes itself as building a continuous quality layer for that next phase, using AI to generate and maintain end-to-end tests that help close the gap between code that looks right and code that actually proves itself in practice. So with all of that said, I want to bring Gal in to talk more about this world of AI generated code. We’ve got a lot to cover. So I’m excited to have him here. Gal, welcome to the podcast. Glad you could join us.

Gal Vered (02:04.526)
Yeah, thank you for hosting, and hello to everyone who’s listening. I’m excited.

Beau Hamilton (02:10.006)
Awesome, yeah, well, glad to have you here. And I want to get right into it. So I know Checksum is really concerned with what happens after AI code gets written and whether teams can actually trust it in production. For listeners hearing about Checksum for the first time, can you just help us unpack that further and describe what it is you guys offer?

Gal Vered (02:27.246)
Yeah, sure. So I think in a sentence, we help you go from prompt to production. What it means in reality is that you said it exactly right at the beginning. AI helps you write code significantly faster. If I’m going to take Claude Code or Cursor and point it at the entire Checksum backlog, at the end of the day I will have 100 PRs ready. But do they work? What do I do with them? How do I ship them? Do I understand what each PR is actually doing? That is the real bottleneck.

And the reality is, and we see it with our engineers and with our customers’ engineers, 90% of the time just goes to reviewing PRs, finding the edge cases in production, firefighting. Engineers used to ship one or two features every month. Now they’re shipping 10 features every week. That’s a lot of balls in the air, and you just need to constantly context switch. So I jokingly say that AI made engineers 10x more efficient, but also 10x more busy. With that said, our goal at Checksum is to help you fully test every PR, every change you make, and fully test your app through millions of test cases, or hundreds of test cases, depending on the type of tests, based on real sessions. So you can actually ship code with confidence. So if you get 100 PRs from Claude Code, you also know those 100 PRs are working.

You can ship them to production with confidence and you can just move faster. We help you realize the gains of AI. If AI makes code 10x faster to write, we want to help you ship 10x faster. So move all of those efficiencies into actually shipping.

Beau Hamilton (04:09.816)
Got it. OK, that’s a really good overview. And I think really the core idea is just instilling this higher sense of confidence in the code that you’re generating. Again, producing code more quickly is one thing, but it’s about knowing whether you actually ship something that is working effectively. That’s really the real value. So on that note, if I’m coming at this from the engineering side or the QA side, and I’m hearing about Checksum for the first time, how would you alter that description to me to help explain it to me?

Gal Vered (04:45.259)
Yeah, look, at the end of the day, engineers already know that they ship features significantly faster. The app keeps changing, and now they have a choice, and they’re facing this choice. I don’t even need to explain it to them. They can either slow down and make sure the app is entirely covered, write tests, write end-to-end tests, test their app manually, deploy carefully, monitor everything, or they can move fast and firefight and let quality slip.

So what Checksum does is something very, very simple. We continuously and autonomously generate a testing suite (end-to-end tests, API tests, integration tests) based on real usage. So those are actually testing your app the same way your users are using it, in order to make sure everything is working 100% of the time. So the engineer can be the hero who is not only able to write 10 or 15 features, but can actually get those features to production nicely with zero bugs and ship fast. So that’s the kind of empowerment we want to give to our customers. And we think a robust, comprehensive testing suite in every layer is the way to do that.

Beau Hamilton (05:56.118)
Now I know a lot of teams are aware of this issue, this trust issue, but what they do to bolster that trust, I think, varies quite considerably from user to user or organization to organization. They know that they should have this end-to-end coverage. And I think many of them build it themselves. For the engineering teams out there that do go down this route, this DIY route, where do those approaches start to break down and run into issues?

Gal Vered (06:26.987)
Yeah, it breaks down at two main points. The first point where it breaks down is simply coverage. Again, you’re shipping features on a daily basis, and new PRs are changing existing features and how they work. The reality today is that you’re on a customer call on Monday, they give you feedback. One hour later, you already have a PR to address this feedback, right? All you need to do is paste some description into Claude Code, and it writes it. But your app is changing so much that it’s very hard to get coverage unless you’re willing to slow down. Again, you can always slow down and say, hey, you don’t merge this PR until you update a lot of the tests and you have a full testing suite.

Now, even if you slow down and you’re able to keep up, maintenance becomes the problem. Today small teams are able to ship significantly complex apps, and that’s definitely true for enterprise. Let’s say you have 500 end-to-end tests in your testing suite, or 1,000. Every week, 100 or 200 of those tests are going to break because you’re shipping so fast. And when they break, essentially someone needs to stop their work and firefight. Because when they break, either your pipeline is broken and you can’t ship, or you need to ship blind because for 30% of your app, you don’t know if it’s working or not. With Checksum, it’s always on. When they break, we do the triage, we reduce the noise, we reduce the false positives. So you can just move faster again. It’s all about moving fast.

Beau Hamilton (07:59.918)
It’s all about moving fast and making sure you have all your ducks in a row. You have everything set up. Because I know it’s almost like a trap that I feel teams might fall into, where they know they need to have this end-to-end coverage. They invest in it. But over time, like you were saying, it turns into this sort of maintenance project. And then the value starts slipping because things start breaking and you start running into problems.

Now I think that leads me to my next question. Obviously this problem has really started to appear with the rise of these AI generation tools and the speed at which people operate, with everyone having to move faster and faster. And so there’s a lot more competition out there. The market has changed quite a bit. I’m curious what makes Checksum different from other automated testing tools out there. And then where does it fit into a team’s existing workflow?

Gal Vered (08:54.37)
Yeah, I think there are three main things. One, we call it continuous quality. Typical testing tools, or even Claude Code and Cursor, are passive. Someone needs to go open a web interface, open Claude Code, open Cursor, type something in there, write the test, fix the test. Checksum is proactive. You still have full control, so you can still generate the tests you want. You can delete, you can remove. But at the end of the day, Checksum continuously and autonomously wakes up in a workflow and adds new tests to cover the new features. It updates existing tests based on the changes you’ve made. When you run your testing suite and a test fails, we automatically triage the failure using AI. Essentially, one of the big problems with test suites in general is the noise, right? It’s every time something fails. So we automatically triage it within minutes and you always get a zero-noise report.

If a test needs to be healed, we heal it automatically. You don’t need to do anything about it. And if there’s a bug, we report the bug. We create a very nice report that’s very clear so you can just move forward. So I think quality is a lot about discipline, and we keep a very high bar of discipline automatically. And then on top of it, you have full control to shape your quality and fully own your quality. So that’s the first thing.

The second thing is that we deliver tests as code. At the end of the day, a lot of the platforms are kind of black boxes. The tests are on their side. You don’t really know how it works. It’s some magical AI thing with some proprietary algorithms. We provide tests as Playwright. You fully own the tests. Nothing in the tests themselves is proprietary to Checksum. And it’s also very nice because you can use coding agents with those tests to run those tests. We provide tests as Playwright for end-to-end tests, for example, or pytest for API testing, et cetera, and those are in the LLM training data. So you don’t need to explain to it what it is or how it works. It already knows the language because it knows the code. And the third one is accuracy. We have benchmarks: 75% of the failing tests are fully healed by Checksum within the first few minutes. We then have a deeper workflow, supported by forward-deployed engineers on our side, that can get the healing to 100%.

Gal Vered (11:14.317)
So we can deliver results, and that’s the most important thing. And if we need to, we have forward-deployed engineers who are able to tune the models and make sure it works for very complex use cases for enterprises. So we deliver results and solve your problem, versus a passive tool that you need to use every day and spend hours on. And with those tools it’s all in the platform, so you can’t really use the things you’ve built if you decide to do something else.
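
Editor’s note: the triage step Gal describes (separate noise from stale tests from real bugs, then report only the bugs) can be pictured with a minimal Python sketch. This is not Checksum’s implementation. The `Failure` fields, the `Verdict` categories, and the keyword rule for spotting a stale locator are all assumptions made up for illustration; a real system would use far richer signals.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FLAKY = "flaky"          # passed on retry, so treat it as noise
    HEAL_TEST = "heal_test"  # the test is stale, the app is fine
    BUG = "bug"              # the app itself misbehaved


@dataclass
class Failure:
    # Hypothetical record of one failed test run.
    test_name: str
    error: str               # error message from the test runner
    passed_on_retry: bool    # did an immediate rerun pass?
    ui_changed_in_pr: bool   # did the change touch the element under test?


def triage(failure: Failure) -> Verdict:
    """Classify a failure with simple, assumed heuristics."""
    if failure.passed_on_retry:
        return Verdict.FLAKY
    message = failure.error.lower()
    stale_locator = "locator" in message or "selector" in message
    if stale_locator and failure.ui_changed_in_pr:
        return Verdict.HEAL_TEST
    return Verdict.BUG


def report(failures: list) -> list:
    """Return only the names of tests that point at real bugs."""
    return [f.test_name for f in failures if triage(f) is Verdict.BUG]


failures = [
    Failure("checkout", "TimeoutError", True, False),
    Failure("login", "Locator not found: #submit", False, True),
    Failure("cart_total", "AssertionError: expected 20 got 25", False, False),
]
print(report(failures))  # prints ['cart_total']
```

In this toy run, the flaky timeout and the stale locator are filtered out, so the report contains only the failure that needs a human decision.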

Beau Hamilton (11:39.864)
Yeah. Well, one thing I want to know too is what is automated and what still needs a person in the loop with a lot of these automated features and tools you offer. I know at Checksum you guys talk a lot about autonomous generation and the ability to execute on these capabilities. But you also make a point to bring in the QA teams and still offer lots of relevance and meaning for these individuals. How much of the testing life cycle can Checksum automate today? And then where does the human element, the human judgment part, still play a role?

Gal Vered (12:18.891)
That’s a perfect question, because I think that’s true generally of working with LLMs. LLMs can make you move faster, and they can save you a lot of time and do all of the grunt work, but the thinking and the taste and the decisions, that’s a human’s job. So look, your automation engineers, or engineers in general, know their app the best. They know their customers, they know all of the edge cases. They have years of deploying this software to production, and they know all of the quirks. So they’re able to decide what’s more important, what’s the best ROI, and what’s the best way to test the app that gives you 100% confidence. Checksum can still do those things autonomously and provide good results. And especially as our agent learns from your decisions, because our agent is also continuously learning, it’s able to get its decisions closer to yours.

But at the end of the day, you own quality. And if there’s a certain way that you feel a feature needs to be tested, you have full control over that. In addition, what we’ve learned is that each engineering team has a different tolerance for bugs, right? If you serve an app to developers, maybe CSS bugs are not such a big deal. If two divs aren’t fully aligned, it’s not a big deal, and if you’re very much in the MVP stage, some bugs are acceptable and the most important thing is to ship.

Whereas if you’re an enterprise company that sells a product to a non-technical person, everything needs to be perfect. So everything is considered a bug. That level of tolerance and organizational knowledge, which is crucial to understanding how you make the organization faster and how you ship faster, is still owned by the engineers, because our model at the end of the day needs to learn from their decisions.

Beau Hamilton (14:05.966)
Okay, so in sum, humans obviously aren’t being removed entirely. They’re just being moved away from a lot of the repetitive maintenance work and into areas where there’s more judgment and context and where priorities really matter, I would say. Do you feel like there’s maybe a learning curve when you pivot to this more judgment-based way of working, as opposed to the technical, hands-on experience with coding? Or is it a pretty natural pivot?

Gal Vered (14:39.34)
Yeah, this is a fantastic question, because it’s a full topic: how do you use AI? I think it’s natural in the sense that we provide tests as code and we don’t force you to change any of your workflows, and Checksum is just another team member who deploys and constantly cares about testing and uploads it. Everyone would love to have this team member who just writes your tests automatically and always maintains them. They’ve got it covered. So it’s natural in this case. Where it’s not natural, and that’s not specific to Checksum, is in the way the role of software engineers and automation engineers in general is changing, right? It’s always easy to explain with writing code, because people have experience there. It used to be that 90% of the role of a software engineer would be to write code. Today, I think if not 100%, probably 90% of the code is written by an LLM. An engineer’s job becomes more about defining what’s most crucial, defining the architecture, reviewing the code, and making sure we don’t accumulate too much tech debt.

So the role is changing. And the role of people in general is changing when working with AI, because you have this huge force multiplier. So for people who understand that and have already made this change and who are AI adopters, I think it’s a very natural way to use Checksum. And I think for people who still feel like they want to do the work themselves, it’s not a natural change, but I think every month we see fewer and fewer of those people because, again, regardless of Checksum, the world is moving there, and regardless of Checksum, people are making this transition. So it’s not unique to Checksum, is what I’m trying to say.

Beau Hamilton (16:20.672)
No, that’s interesting to think about. When you rely on a lot of these AI tools to generate the code for you, let’s say, it makes me think of the critical thinking skills that are really important. Speaking of headlines, like in the introduction, a lot of the headlines are talking about findings that when people rely heavily on AI tools and chatbots, their critical thinking skills start to be depleted. Teams and individuals aren’t able to use as much of their own judgment because they’re outsourcing it to AI. So I’m curious: with this pivot, do you worry that there could be an issue with the critical thinking part of the equation for some of the QA and engineering teams that are switching to a more judgment-focused role?

Gal Vered (18:09.549)
I don’t know if I have a specific answer from the end-to-end testing world. I think that in general, I would assume that as you outsource the thinking to an LLM, the LLM is regressing to the average, right? Regardless, because in a world where everyone uses LLMs, the LLM becomes the average. Even if it’s very, very good today, it will still be the average.

So I think when you fully lean into LLMs, you’re kind of guaranteeing average results. And the definition of great results changes, right? Two years ago, great results was X. Today great results is 10X, because with LLMs you can. But if you want to achieve great results, if you want to break out from the mean, you must have very, very clear opinions and very, very clear directions. I think with testing and quality and engineering in general, the alpha comes from understanding the customer or the task and focusing on the impact, and that’s where they are. So it’s a balance, I would say.

Beau Hamilton (19:15.974)
Yeah, absolutely. I think that’s a good answer. It’s a balance, and it kind of puts the onus on the individual to make sure to pivot and work on their skill set, to continue to test their own critical thinking skills and lean in, and to look at things from a more abstract level, I suppose. But I think this is all getting kind of beside the point. I want to pivot back to focus more on Checksum and some of the real-world examples, because I don’t want to go too far down the philosophical tangent. We could do a whole other episode around that topic. But I’m curious about some of the real-world examples you might have accumulated, having worked with a number of clients from all around the world using these AI tools for coding. Can you share an example where Checksum has caught, let’s say, a critical bug or maybe a regression that a team’s existing tests completely missed?

Gal Vered (20:26.358)
Yes. I think Checksum catches hundreds, if not thousands, of bugs every week for our customers. People ship with AI, there are a lot of outages, and there are a lot of problems. There are a lot of issues. Sometimes those are minor bugs. Sometimes those are major bugs. I think the coolest bugs we catch, because we use real information, are the ones where the bug is related to data. It’s a specific customer or a specific data query that you only have in production, in real mass, and no one will catch it. But Checksum, because we base our tests on that, is suddenly able to actually catch a bug that none of the engineers thought about, none of the UI managers thought about.

But I think what’s more important, and what I want to highlight here, is that in the age of AI, testing and quality in general are not just about catching bugs. It’s also about knowing that there aren’t any bugs in your application, and that’s extremely important. Go back to the example of pointing Claude Code at our backlog, where I have 100 PRs. In a world where humans wrote all of those PRs, my goal is just to find the bugs, because I have high confidence that it works, and ship it. In the world where AI wrote those PRs, knowing that there are zero bugs in any of those PRs allows me to ship to production today, versus needing to test every PR extremely thoroughly and probably being able to ship only 10, 15, 20 days from now. So it’s a different way to think about quality. Because, again, knowing that this thing is shippable is the value. And it’s not always about finding bugs. It’s about very comprehensive tests and the ability to communicate those tests so the customer can see them and say, yeah, I get it. I trust it. Let’s ship.

Beau Hamilton (22:11.842)
Yeah, it makes me think of Anthropic, which recently released a preview of their Mythos agent that’s going out and kind of causing a stir in the cybersecurity world with its ability to find all these bugs and zero days. But if you think of a future where this sort of agent, this model, is built into the code generation tools themselves, you’re going to be able to ship much more secure code in the first place, like you were saying.

Right? Yeah. Now you’ve been pretty explicit that as AI makes writing code easier, the bottleneck shifts to verification. Looking ahead, how do you see the relationship between AI code generation and AI-powered testing evolving over the next few years, let’s say, if we can plan that far into the future? Where do you think this is all headed?

Gal Vered (23:05.73)
Yeah, I think 50% of engineering becomes verification. So if you think about an autonomous engineer, which I assume at a certain point is where the world is headed, you have an agent that writes code, and this agent is extremely good. And for the most part, we’re already there, right? Claude Code writes pretty decent code. Even if you don’t have any further improvement, I think we’re already there. But knowing that the code works and actually shipping the code is a completely different job. I don’t think it’s a job for LLMs. LLMs have a relatively small context window. Even if the context window grows 10x or 100x, it’s still very small when you talk about complex systems, complex production systems. And being able to fully verify the code probably requires a different approach, or maybe an approach that infuses with the LLMs, which is what we’re doing at Checksum. So we see verification as 50% of an autonomous AI engineering system that allows you to ship code to production autonomously.

Beau Hamilton (24:16.642)
Now I can see some serious value for some of the established AI coding platforms to integrate your tool set, your product directly. Is this a partnership or scenario you envision in the future? Or do you think of your company as a separate offering?

Gal Vered (24:39.98)
Yeah, no, I can see it happening. To be honest, we see such strong demand from the market that we’re just going after the demand we’re seeing. We were partners with JCP, and this was very useful for us, but we don’t have any partnerships specifically with coding agents for now. I think it’s also an evolving space. But that’s definitely something we’re looking at.

Beau Hamilton (25:02.466)
That makes sense. It’s definitely an evolving space. And yeah, just based on what you’ve shared with us, I can see the value there. I also think of Anthropic: when I’m generating code with Claude, it has that verification layer where it kind of reviews everything, makes sure the prompt and the output align, and checks everything. And that’s something that’s not exactly universal, at least at this phase, with other tools out there.

So just thinking about this out loud. So I want to circle back around to one of the more interesting ideas in your framework for all of this. You talk about something called the context void. And it’s this gap between what a coding agent can see and what actually determines whether software works in production. Can you explain what that means some more? What is this context void? And then why does closing that gap matter more than just writing better tests?

Gal Vered (26:06.646)
Yeah, that’s a fantastic question. So when you think about complex systems, think about a product like Twitter, right? I think that’s an example I like to use a lot. On the surface, it’s a very, very simple product. You can probably vibe code it in one day. It’s tweets and likes and users and comments. And that’s basically it. Then you have a clone of Twitter. Again, I’m pretty sure that if I tell Claude Code or Cursor to do it right now, in 30 minutes I’ll get a clone of Twitter that actually works and that I can deploy. But Twitter is complex. And the reason it’s complex is that it’s a product used by hundreds of millions of users every day. It’s deployed into production, and the deployment in production is extremely complex: an array of machines all working together to give you a response extremely quickly that is uniform across everyone, so everyone sees the right and updated data.

That’s a very complex system. The code might not be super complex, but the system itself is extremely complex. And going through all of the edge cases, third-party integrations, services, there’s a lot of data there. And the value comes from being able to infuse all of this data and all of this information into your coding and into your testing and verification. And this is not a job that LLMs are suited for. And we’re not talking about just building an MCP that connects LLMs to production so they can query the data. I’m not talking about that, because in order to verify significant changes and to push code to production regularly, it’s about running millions of scenarios, like Checksum is doing, and not just reading your app and saying it works. It’s actually running your app through millions of scenarios. And that’s the context void.

Basically, LLMs are very similar to a senior engineer who joined your company today. They have no idea about your customers, no idea what’s happening in production. You can give them access to those tools and they can start learning, but there’s so much context that they need to absorb into their knowledge base. And the way humans work, after six months or one year, suddenly the senior engineer becomes super effective because they have all of this knowledge. LLMs are just not well suited for that, unless you can actually ingrain this data into training, which will never happen because it’s very specific and sensitive data. So Checksum provides the other layer: basically, all of the information from production brought into the coding workflow.

Beau Hamilton (28:43.214)
That’s really interesting. That’s great context. It gives you some reassurance about that existential threat that AI can just clone websites, or the threat it poses to some open source software, for example. There’s so much more complexity behind the scenes. I just think of working with APIs and databases and configurations and permissions and user behavior. And then not to mention the community aspect. Twitter is a good example: when it got bought out, there was a community there. And when you don’t have the community, or they feel alienated, they leave. And when you have a platform with no community, it just becomes kind of sad. The value is lost. So there’s a lot more at play than just the surface-level cloning that some of these AI tools can accomplish.

That kind of leads me also to this bigger idea behind all this, which is something you call the Code World Model, right? Your argument, as I understand it, is that autonomous software engineering won’t really happen just because coding agents get smarter. It requires a simulation of the digital environment that the software actually interacts with, so an agent can test and adapt against something much closer to reality before the code actually reaches the end user.

When did this concept of the Code World Model come into play? Where did you come up with it in the first place? And why do you think simulation is the missing piece for autonomous software engineering?

Gal Vered (30:24.108)
Yeah, the Code World Model is essentially the solution for the context void. So what we’re saying is, okay, there’s a context void. It’s not just about MCPs and skills. LLMs just cannot reason over that amount of data. And a good example of it is autonomous cars, right? With autonomous cars, the first thing you do when you want to develop an autonomous AI, an autonomous driver, is build a world model for cars. You need to be able to simulate pedestrians, snow, rain, traffic cones, traffic jams, everything. In order to have Waymo cars roam the streets of San Francisco without a driver, it’s not enough to just run five simulations or maybe some examples. You need to be able to run every new model you have through millions of simulations, with 20 different sensors, 30 different sensors, to make sure it actually works, right? It’s not going to be enough to just say, yeah, here’s a kid jumping into the road, the car stopped, ship it, right? You need to actually test it across the board. For us, it’s very clear that that’s exactly the same thing you need for coding agents.

You need the ability to simulate millions of scenarios in a real production environment and run your app through those scenarios, really run your app, not just read the code and say whether there’s a bug or not, which is what LLMs do, in order to get a very strong signal of whether your app is working or not. And we also think it’s not a nice-to-have. It’s not going to make you 20% faster. It’s a requirement, in the same way that I couldn’t imagine autonomous cars being shipped without very robust simulation of millions of scenarios. We think the same is true for code, at least at the enterprise level. And I think as coding agents become better, smaller teams’ deployments and software will look more like enterprise, because smaller teams can just achieve more. So the complexity will increase, because basically you can create higher complexity without hiring a lot of engineers.

Beau Hamilton (32:33.858)
It’s really interesting. Do you have a Substack or a place where you write about this on a regular basis? Just out of curiosity.

Gal Vered (32:40.524)
Yeah, we actually mostly use the company blog. So on our blog at checksum.ai, we publish all of our thoughts pretty regularly about the Code World Model, the context void, et cetera.

Beau Hamilton (32:54.136)
That’s awesome. Okay, so for listeners who’d love to hear more about some of these ideas, where would you recommend they visit? You mentioned the company blog. What’s the URL to follow there?

Gal Vered (33:05.922)
Yeah, checksum.ai/blog, or go to checksum.ai and check out our full website. The documentation touches on how we take those concepts and productize them, so how the product actually works.

Beau Hamilton (33:21.57)
Perfect. Okay, well that’s where I’m headed after this. And I hope listeners visit as well. We’ll have links in the description and the article associated with this episode. But Gal, thank you so much for all the insights you shared with us, all these really interesting ideas. And I think you guys are doing some impressive work and I’d love to chat with you again in the future and talk about updates.

Gal Vered (33:43.512)
Perfect. I appreciate you having me and I enjoyed the conversation and hope your listeners enjoyed it as well.

Beau Hamilton (33:49.878)
Awesome. All right. Well, thank you, Gal. And thank you all for listening to the SourceForge Podcast. I am your host, Beau Hamilton. Make sure to subscribe to stay up to date with all of our upcoming B2B software-related podcasts. I will talk to you in the next one.

Tags: B2B software, Checksum, End-to-End (E2E) Testing, Podcast, Software