The quality of AskTrip’s answers is fundamental to earning users’ trust, and we’ve recently shared the key areas we’re focusing on to make improvements [see: 1,400 Qs = lots of learning]. But as we’ve passed the 1,500-question mark, something important has become clear: scaling up is revealing issues that weren’t visible in our earlier testing.

We manually review every Q&A and flag any that we feel don’t meet our standards. So far, we’ve identified 13 clear failures – under 1% of the total. We suspect a similar number sit in a grey zone: not outright bad, but not good enough either. Call it 26 out of 1,500 (about 1.7%) that are sub-optimal. That’s a small share, but we’re determined to drive it down further.

As noted in our previous post, we’re analysing these issues closely and have already identified concrete steps that should lead to significant improvements. But this phase has also highlighted a broader insight: these kinds of flaws only emerge at scale.

Just as randomized controlled trials often lack the power to detect rare side effects, early pilots of AI systems – like our initial 250-question evaluation – can miss edge-case failures. It’s only through broader, real-world use that such issues surface. And that’s invaluable. These findings help us better understand the limits of our system and guide the next wave of improvements.
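To make the sample-size point concrete, here’s a rough back-of-the-envelope sketch in Python. The failure rates (0.5% and 0.1%) and the helper function are purely illustrative assumptions – not measured AskTrip figures or part of our evaluation tooling. The point is simply that a failure mode too rare to show up in a 250-question pilot becomes very likely to surface by 1,500 questions.

```python
def p_at_least_one(failure_rate: float, n_questions: int) -> float:
    """Chance of observing a given failure mode at least once across n independent questions."""
    return 1 - (1 - failure_rate) ** n_questions

# Hypothetical per-question rates for a rare failure mode (illustrative only),
# checked at a 250-question pilot versus 1,500 real-world questions.
for rate in (0.005, 0.001):
    for n in (250, 1500):
        print(f"rate={rate:.2%}  n={n:>4}  P(seen at least once)={p_at_least_one(rate, n):.1%}")
```

Under these assumed rates, a 0.5% failure mode has roughly a 71% chance of appearing at least once in 250 questions but is all but certain to appear by 1,500, while a 0.1% mode would most likely stay hidden in the pilot and only become visible at scale.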

We increasingly see AskTrip as a journey. The launch went well, and now we’re building on that strong foundation with meaningful refinements. Will it ever be perfect? Probably not. But our commitment to continual improvement is unwavering.

It’s been an incredibly rewarding learning process so far—here’s to the next 1,500 questions.