Vibe coding is a new term for creating software using AI. The term was introduced by Andrej Karpathy, a founding member of OpenAI and former Director of AI at Tesla. It describes a workflow where developers express tasks in natural language, and AI tools generate code, tests, and supporting components.
What started as a wave of interest around AI-assisted coding quickly turned into daily practice. Few expected the term itself to gain this level of traction. Yet in 2025, “vibe coding” was named word of the year by Collins Dictionary.
Adoption continues to grow. More teams include AI tools in their daily workflow. However, opinions still differ.
Some developers treat AI as a practical assistant. Others question the reliability of generated output. People choose different sides. At the same time, AI adoption in software development is not a choice teams can delay for long. The market has already moved forward.
Usage is widespread and measurable: more than 90% of development teams use AI tools in their workflow, saving roughly 6 hours per developer each week by reducing repetitive tasks, according to McKinsey.
At Aristek, curiosity about emerging technology has always been a working principle. Since capable AI coding tools became available, our engineers have been testing them across different project types, team configurations, and technical environments, accumulating practical knowledge of where the tools perform reliably and where they require careful management.
AI is already embedded in how we work at Aristek. We have a defined policy for its use across the software development lifecycle, and our engineers apply it regularly on real projects.
Within that context, one engineer decided to take the experiment further: to test whether AI, given the right instructions and a human guiding the process, could produce a fully functioning application from start to finish. The goal was to find out exactly what that looks like in practice, and where the limits are.
Eugene, a software engineer at Aristek, spent a weekend building a real application using GitHub Copilot as the primary agent. The process wasn’t smooth, but still, the result worked. The final version included a working frontend, backend services, automated tests, and a generated design. And 99% of the code was produced by AI.
How much time does it actually save? What works as expected, and what breaks under pressure?
This article documents the full experiment in detail. Each stage also includes expert comments from Aristek engineers and specialists who work with these tools on live projects, so the conclusions reflect more than one session on a weekend.
We didn’t plan this. The industry did
AI did not enter software development quietly. It showed up, gained attention fast, and became part of daily work before most teams had time to form a clear position. We have seen this pattern before. jQuery, frameworks, and Agile methods followed a similar path. At first, they felt optional. Then they became expected.
Today, the same shift is happening with AI. Most teams already use AI tools in some part of the development process. The difference is usually not whether AI is used, but how deeply it is included in daily work. Some engineers rely on it heavily. Others use it selectively and review the output more cautiously. At the same time, the pace of delivery continues to increase.
Recent data reflects this shift:
- Up to 55% faster task completion when developers use AI coding tools (GitHub)
- 39% increase in time spent in focused work when using AI-assisted tools (Microsoft)
- 3–5× productivity gains reported in some cases, depending on the task and setup (Docker)
These results do not apply evenly to every task. But they do show a clear direction: teams that include AI in their workflow reduce time spent on routine steps and move faster through implementation.
This changes how developers approach their work. AI does not replace engineers. It changes the way they write, test, and structure code. Those who adapt to this workflow gain an advantage in speed and output.
This context led to the experiment described in this article. The goal was simple. Take a real task and work through it with AI as part of the process. Observe how it behaves across different stages. Note where it helps, where it slows things down, and how much guidance it requires.
To make the process clear, we structured it around five stages of the development lifecycle: business analysis, design, development, testing, and deployment. Each stage reflects a typical step in real project work and shows how AI fits into it.
Business analysis stage: where the first decision is already delegated
Before a single file existed, before a framework was chosen, before any technical decision was made, the project concept itself came from a conversation with ChatGPT.
Eugene opened ChatGPT and asked a simple question: suggest a few ideas for a demo project. The model returned three options. He selected one and asked a follow-up: describe how this product should work.
The response outlined user flows, basic functionality, and system behavior. Eugene saved it as a draft file. That document became the starting point for everything that followed.
From a rough idea to a structured brief (without a single meeting)
There was no detailed specification at this stage. No predefined architecture. Just a direction, and AI helped turn that direction into a structured description.
The model asked clarifying questions along the way: which backend framework, which frontend approach, what scale to plan for. Eugene answered, the AI incorporated the answers, and the result was a coherent project brief – not ideal, but with enough structure to begin implementation. The whole process took one conversation.
This matters because writing a proper brief is usually slow work. It requires aligning stakeholders, translating business intent into technical language, and resolving ambiguity before development begins. AI does not replace that process entirely, but it shortens the distance between “we have an idea” and “we have something to build from.”
Speed changes the scope of early analysis
This approach affects how early-stage analysis is done.
Tasks that usually take days can be reduced to a shorter cycle. Competitor research, for example, can be outlined in one session. AI can summarize existing products, highlight common patterns, and suggest positioning. This does not replace direct research, but it provides a starting point.
The same applies to technical decisions. Given enough context, models can suggest architecture options and technology stacks. These suggestions are based on patterns present in training data, not on project-specific knowledge. The quality of output depends on the clarity of input.

Where AI needs direction
AI can structure information, but it does not define priorities on its own.
If the input is vague, the output will be generic. If key constraints are missing, the model fills the gaps with assumptions. In practice, this leads to plausible but incorrect decisions.
Eugene’s role at this stage was to guide the process. He did not write the document from scratch, but he controlled the direction through answers and clarifications.
Verdict
AI works well in early analysis when the goal is to move from idea to structure.
It reduces the time needed to produce a working draft. It helps identify gaps and prompts the right questions. At the same time, it depends on human input to stay relevant.
In this experiment, the business analysis phase did not require formal documentation or extended planning activities that usually accompany production projects. A short interaction was enough to define the project at a level suitable for the next stage.
In real delivery environments, this phase includes deeper requirements analysis, stakeholder alignment, prioritization, risk review, and technical discussions. AI works well here as a support tool. It speeds up the creation of initial requirements, drafts user flows, and structures information quickly, but human review remains necessary before implementation starts.

Design stage: 3 hours to set up, 10 minutes to build
“The gap between what an agent can do and what it does is usually a gap in instructions, not a gap in capability.”
With a project brief in hand and a clear implementation plan generated from it, the next step was design. This is the stage where most engineers would hand work to a designer and wait. Eugene decided to see how far an AI agent could take it.
The short answer: further than expected, but not without significant setup.
“It knew exactly what to do. It just couldn’t see the result.”
The agent had access to Figma through an MCP integration and could create components, set colors, define spacing, and build layouts. Technically, it understood design. The problem was that it had no way to verify its own output visually.
It placed a button, wrote text inside it, and had no mechanism to check whether the text actually fit. It set padding values that looked correct in the API response and were wrong on screen. The first iterations had overlapping elements, misaligned components, and text spilling outside its containers.
The fix was straightforward once identified. Eugene added an explicit instruction to the agent: after placing any element, query its dimensions through the Figma API and confirm they match the intended values. The agent already had access to this information. It simply had not been told to use it. Once that validation step was part of the instructions, the output quality improved substantially.
This is a recurring pattern in AI-assisted work. The model does not automatically apply every capability it has. It applies what it is told to apply. The gap between what an agent can do and what it does is usually a gap in instructions, not a gap in capability.
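The validation step described above can be sketched roughly as follows. This is an illustration only, not the instruction Eugene wrote: the `PlacedNode` shape and the `checkFits` helper are hypothetical names, and the measured sizes are assumed to come from whatever the Figma integration reports after an element is placed.

```typescript
// Hypothetical shape of what an agent reads back after placing an element.
// Field names are assumptions for illustration, not the Figma API.
interface PlacedNode {
  label: string;
  width: number;  // measured width in px
  height: number; // measured height in px
}

interface FitReport {
  fits: boolean;
  problems: string[];
}

// Check that a child (for example a text label) fits inside its container
// once the container's padding on both sides is taken into account.
function checkFits(container: PlacedNode, child: PlacedNode, padding: number): FitReport {
  const problems: string[] = [];
  const availableWidth = container.width - 2 * padding;
  const availableHeight = container.height - 2 * padding;

  if (child.width > availableWidth) {
    problems.push(`${child.label} is ${child.width - availableWidth}px too wide for ${container.label}`);
  }
  if (child.height > availableHeight) {
    problems.push(`${child.label} is ${child.height - availableHeight}px too tall for ${container.label}`);
  }
  return { fits: problems.length === 0, problems };
}

// Example: a 140px label placed in a 120px button with 12px padding.
const buttonNode: PlacedNode = { label: "Button", width: 120, height: 40 };
const labelNode: PlacedNode = { label: "Label", width: 140, height: 16 };
const fitReport = checkFits(buttonNode, labelNode, 12);
```

The check returns a list of concrete problems rather than a bare pass or fail, so the agent has something specific to correct before it places the next element.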

The tooling situation is honest about where it stands
Setting up the design pipeline required navigating a fragmented tool ecosystem. Figma’s official MCP offered approximately five read-only actions at the time of the experiment. It was sufficient for reading an existing design and not useful for creating one.
Eugene switched to an open-source alternative with 93 available actions, which worked but crashed unpredictably and required a specific startup sequence each time.
This is the current state of MCP integrations across most design tools:
- Official integrations from major vendors are conservative, stable, and limited in scope
- Open-source alternatives offer broader functionality but vary significantly in reliability
- Crashes and connection failures are common and require manual recovery steps
- Tool behavior can differ depending on which agent system calls them
- Setup time is real and should be budgeted separately from actual design time
None of this makes the approach unworkable. It does mean that the first use takes longer than subsequent ones, and that someone on the team needs to understand the integration well enough to debug it when it fails.
What came out at the end
The setup took two to three hours. Once the pipeline was stable, the agent produced a complete design system in approximately ten minutes: components, design tokens, color palette, spacing rules, and page layouts.
The result was not polished to a professional standard, but it was coherent, consistent, and ready to hand to a frontend implementation agent.
A senior designer working on the same scope would typically spend several days, sometimes a full week, depending on the complexity of the component library. The agent produced a working baseline in an afternoon, most of which Eugene spent configuring the tools rather than directing the design work itself.
Beyond this specific project, AI at the design stage can produce useful output in several areas:
- Generating initial design systems with tokens, typography scales, and color palettes from a brief description
- Producing multiple layout variations for the same component quickly
- Translating design specifications into structured data for frontend implementation
- Checking dimension consistency and spacing rules across components
- Documenting design decisions in a format that can be referenced by development agents
One decision worth making early
Agent instructions for design tasks can be saved and reused. The Figma connection sequence, the validation steps, the dimension-checking rules, and the overlap prevention logic that Eugene built during this stage now exist as a reusable configuration.
The next project that needs a design agent does not start from zero. It starts from a working baseline and adjusts for specific requirements.
The cost of building a reliable design agent is paid once. After that, it is shared.

Verdict
The design stage required the most setup effort. Around two to three hours were spent configuring and stabilizing the workflow.
Once configured, the agent produced usable design output quickly. The final result was visually consistent and ready for implementation. The quality was closer to the work of a mid-level product designer preparing an internal MVP or early prototype than to a polished senior-level design prepared for a mature commercial product.
The layouts, spacing, and component structure were coherent enough for development work to continue without major redesign. At the same time, the result still lacked the detail, refinement, and product thinking that experienced designers usually add during later review stages.
The key takeaway is practical. AI can generate design structure at speed. Quality depends on validation rules and tool stability. Without both, errors accumulate even when the process appears correct.
Development stage: Can AI handle a full stack on its own?
“I wasn’t watching it work most of the time. I’d write the instruction, press enter, and go do something else. The code was there when I came back.”
After the design stage, the process moved into development. This was the most illustrative part of the experiment – not because the output was perfect, but because of how much was produced with limited direct input.
Starting from the draft document and staged instructions, GitHub Copilot generated a full-stack application.
The final codebase included a Next.js frontend, a Node/Express API gateway, four backend microservices structured as a monorepo, shared component libraries, common TypeScript interfaces, and shared contracts between services.
The frontend read design tokens directly from Figma through the MCP connection and implemented the UI without Eugene specifying a single color value or padding rule manually.
Individual backend services were built in six to nine minutes each, running in parallel through Copilot’s FleetMod execution mode. Total agent-running time across all backend services was under two hours, and Eugene was present for only a fraction of that time.
Instructions define the system
The structure of the project depended on how instructions were organized.
Each service had its own instruction file with local context. A root instruction described the overall system. This separation reduced conflicts that appear when one agent tries to manage the entire codebase in a single context.
Without this separation, inconsistencies appeared early:
- Different package managers used across services
- Conflicting port configurations
- Variations in file structure and conventions
Adjusting the instruction structure resolved most of these issues at the source.
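Two of the inconsistencies listed above, mixed package managers and conflicting ports, are easy to detect mechanically. The sketch below shows one way such a check could look. It is an assumption for illustration: `ServiceConfig`, `findConflicts`, and the service names are hypothetical and do not come from the project.

```typescript
// Hypothetical per-service settings; names and fields are illustrative only.
interface ServiceConfig {
  service: string;
  packageManager: "npm" | "yarn" | "pnpm";
  port: number;
}

// Report mixed package managers and ports claimed by more than one service.
function findConflicts(configs: ServiceConfig[]): string[] {
  const issues: string[] = [];

  const managers = new Set(configs.map((c) => c.packageManager));
  if (managers.size > 1) {
    issues.push(`Mixed package managers: ${Array.from(managers).sort().join(", ")}`);
  }

  // Group service names by the port they claim.
  const byPort = new Map<number, string[]>();
  configs.forEach((c) => {
    const owners = byPort.get(c.port) ?? [];
    owners.push(c.service);
    byPort.set(c.port, owners);
  });
  byPort.forEach((owners, port) => {
    if (owners.length > 1) {
      issues.push(`Port ${port} is used by: ${owners.join(", ")}`);
    }
  });

  return issues;
}

// Example input with one mixed package manager and one port clash.
const exampleConfigs: ServiceConfig[] = [
  { service: "gateway", packageManager: "pnpm", port: 4000 },
  { service: "users", packageManager: "npm", port: 4001 },
  { service: "orders", packageManager: "pnpm", port: 4001 },
];
const configConflicts = findConflicts(exampleConfigs);
```

In the experiment the fix was made in the instruction files rather than in a script, but a check like this shows what the root instruction has to pin down: one package manager and a unique port per service.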

Parallel work instead of sequence
Development did not follow a typical step-by-step process.
Using Copilot’s parallel execution mode, multiple parts of the system were generated at the same time. Backend services were created in minutes. The frontend consumed design data directly from Figma and implemented the interface without manual specification of styles.
This changed the pace of development:
- Backend services generated in 6–9 minutes each
- Frontend implemented using design tokens and references
- Shared contracts created alongside services, not after
What would normally take sequential effort was handled as parallel tasks.
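The effect of parallel execution on elapsed time comes down to simple arithmetic, sketched below. The 6 and 9 minute figures are the per-service times quoted in this article. Treating them as two independent runs with no coordination overhead is a simplifying assumption, and the function names are hypothetical.

```typescript
// Durations in minutes for each agent run. The 6 and 9 come from the
// article; treating them as two independent runs is an assumption.
const runMinutes: number[] = [6, 9];

// Sequential execution: each run waits for the previous one to finish.
function sequentialMinutes(durations: number[]): number {
  return durations.reduce((total, d) => total + d, 0);
}

// Parallel execution: elapsed time is bounded by the slowest run
// (coordination overhead is not modeled in this sketch).
function parallelMinutes(durations: number[]): number {
  return durations.length === 0 ? 0 : Math.max(...durations);
}

const sequentialTotal = sequentialMinutes(runMinutes); // 15
const parallelTotal = parallelMinutes(runMinutes);     // 9
```

The saving grows with the number of services, because sequential time adds up while parallel time stays close to the longest single run.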
Mistakes are part of the process, not a sign it is failing
The agent produced working code, but not clean code on the first pass.
Typical issues included:
- Mismatched dependencies
- Incorrect service configuration
- Deviations from defined architecture
- Incomplete or inconsistent file structures
The response was not to fix each issue manually. Instead, Eugene updated the instructions and restarted the stage. This approach produced consistent results across iterations.
Traditional code review was not practical at this scale. Instead of reading 40,000 generated files line by line, the focus moved to running the application, verifying expected behavior, identifying visible issues, and requesting targeted fixes from the agent. This shifts part of quality control from manual inspection to execution and validation.
Where AI is effective in development
AI performs well when tasks are clearly defined and repeatable. Common strengths include:
- Generating standard application structure
- Creating boilerplate and service scaffolding
- Implementing APIs based on defined contracts
- Connecting frontend and backend through shared types
- Following consistent patterns across multiple services
Where control is still required
AI does not maintain consistency without guidance. Human input is required for:
- Defining architecture and boundaries
- Enforcing standards across services
- Resolving conflicts between generated components
- Deciding when to restart versus patch existing output
Without this control, small inconsistencies accumulate into larger issues.
Verdict
The development stage demonstrated how quickly AI can generate large amounts of application code when the project structure, architecture, and instructions are already defined. The full backend was generated in under two hours of runtime, but the process still depended on continuous human oversight.
Eugene reviewed outputs, adjusted instructions, restarted stages when inconsistencies appeared, and verified whether the generated system behaved as expected.
The result was a working system with multiple services and a connected frontend. At the same time, the experiment also showed that generation speed alone is not enough. AI handled implementation tasks efficiently, but system consistency depended on architecture decisions, instruction quality, validation mechanisms, and human review throughout the process.
QA & testing stage: 42 tests written. None of them by a human.
“I asked for positive flow, negative flow, and edge cases. It wrote 42 tests. And somewhere in there, it decided to try XSS injection on its own. I didn’t ask for that.”
After development, the process moved into testing. Eugene gave the agent a single instruction: write end-to-end tests covering positive flow, negative flow, and edge cases. The agent produced 42 Playwright tests running across five parallel workers.
It launched a real browser, navigated the live application, and identified UI locators dynamically by inspecting the running interface rather than reading the source code. The tests ran and passed.
The biggest surprise was edge cases. The agent included XSS injection attempts and tests with extremely long input strings without being asked to. Given only the phrase “edge cases,” it inferred that security-relevant inputs belonged in scope.
Whether the specific tests it wrote represent thorough security coverage is a separate question, but the behavior itself reflects something useful: when given a broad instruction and enough context about the application, the agent applies general knowledge about what good testing looks like rather than taking the minimum interpretation of the request.
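To make "edge cases" concrete, here is a small sketch of the kind of inputs described above, run against a minimal escaping function. It is not the Playwright code the agent wrote, which this article does not reproduce. The payloads, `escapeHtml`, and `isSafelyEscaped` are hypothetical stand-ins chosen for illustration.

```typescript
// Hypothetical edge-case inputs of the kind described above.
// The exact payloads the agent used are not given in the article.
const edgeCaseInputs: string[] = [
  "",                                // empty value
  "   ",                             // whitespace only
  "<script>alert('xss')</script>",   // script injection attempt
  "\"><img src=x onerror=alert(1)>", // attribute-breaking injection attempt
  "a".repeat(10_000),                // extremely long string
];

// Minimal HTML escaping, used here as the behavior under test.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// An input passes when its escaped form contains no raw markup characters.
function isSafelyEscaped(input: string): boolean {
  return !/[<>"']/.test(escapeHtml(input));
}

// Inputs that would still carry raw markup after escaping (expected: none).
const unsafeInputs = edgeCaseInputs.filter((value) => !isSafelyEscaped(value));
```

A list like this is a starting point, not security coverage. Whether it matches the application's actual threat surface is the kind of question that still needs a human reviewer.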

Hooks: the part of QA that runs before a mistake can travel far
Beyond test creation, the experiment introduced a structured validation mechanism through hooks.
Two hooks were configured:
- PostGenerationLint
- ValidateDTOSync
These hooks executed automatically during code generation.
PostGenerationLint checked each file immediately after creation. It returned structured feedback with error details and required fixes. The agent processed this feedback before moving to the next step.
ValidateDTOSync enforced consistency across services. If a data contract changed in one service, the hook detected mismatches in others and blocked further progress until alignment was restored.
The distinction between hooks and instructions is worth stating clearly. Instructions tell the agent what to do. Hooks enforce what the agent system will allow. The agent does not decide whether to comply with a hook.
The hook fires, the system parses the response, and execution either continues or stops. This makes hooks the most reliable form of quality control in an agentic workflow, because they operate independently of the model’s judgment.
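The sketch below illustrates the idea behind a contract-sync hook: compare the same DTO as declared in two services and return a structured result the agent system can act on. It is a simplified assumption, not the actual ValidateDTOSync implementation. The `DtoShape` representation, the `HookResult` fields, and the `UserDto` example are all hypothetical.

```typescript
// A DTO described as field name -> type name. Hypothetical representation;
// the article does not show how ValidateDTOSync reads contracts.
type DtoShape = Record<string, string>;

// Structured result an agent system could parse to continue or stop.
interface HookResult {
  hook: string;
  allow: boolean;
  errors: string[];
}

// Compare the same DTO as declared in two services and list mismatches.
function validateDtoSync(dtoName: string, producer: DtoShape, consumer: DtoShape): HookResult {
  const errors: string[] = [];
  const fields = new Set([...Object.keys(producer), ...Object.keys(consumer)]);

  fields.forEach((field) => {
    const left = producer[field];
    const right = consumer[field];
    if (left === undefined) {
      errors.push(`${dtoName}.${field} is missing in producer`);
    } else if (right === undefined) {
      errors.push(`${dtoName}.${field} is missing in consumer`);
    } else if (left !== right) {
      errors.push(`${dtoName}.${field} is ${left} in producer but ${right} in consumer`);
    }
  });

  // Any mismatch blocks further progress until the contracts are aligned.
  return { hook: "ValidateDTOSync", allow: errors.length === 0, errors };
}

// Example: the consumer has a different id type and lacks createdAt.
const dtoCheck = validateDtoSync(
  "UserDto",
  { id: "string", email: "string", createdAt: "string" },
  { id: "number", email: "string" },
);
```

The important part is the `allow` flag. The agent receives the error list as feedback, but the decision to continue or stop is made by the system that reads the flag, not by the model.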
Where AI handles testing well
Beyond what appeared in this specific experiment, AI contributes usefully to testing in several areas:
- Writing end-to-end tests from a plain-language description of expected behavior
- Generating test data sets that cover boundary conditions, empty states, and malformed inputs
- Producing test scaffolding and boilerplate that engineers then refine
- Running regression checks against an existing test suite after changes
- Documenting what each test covers in a format that makes review faster
Where a human still needs to be in the room
Eugene was direct about the limits of what he verified. The 42 tests ran and passed. He did not audit whether they tested the right things at the right depth. This is the central risk of AI-generated test suites: coverage percentage is a metric the agent can optimize for, but coverage percentage does not measure whether the tests reflect the actual requirements of the system.
Specific areas that require human judgment in AI-assisted testing:
- Evaluating whether test cases reflect real business logic rather than surface behavior
- Identifying gaps in coverage that a metric would not reveal
- Reviewing tests for redundancy, specifically multiple tests asserting the same condition in slightly different ways
- Assessing whether security-relevant tests cover the actual threat surface or only obvious cases
- Making the final call on what constitutes an acceptable failure threshold before release
Telling an agent to achieve 80% test coverage produces 80% coverage. Whether that 80% covers the parts of the code that matter most is a question only someone who understands the product can answer.
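A small example shows how full coverage can still miss a requirement. The business rule, the function names, and the deliberately broken variant below are hypothetical and exist only to illustrate the point. They are not taken from the project.

```typescript
// Hypothetical business rule: orders of 100 or more get a 10% discount.
function priceAfterDiscount(total: number): number {
  if (total >= 100) {
    return total - total / 10;
  }
  return total;
}

// A deliberately broken variant: the boundary uses > instead of >=.
function priceAfterDiscountBuggy(total: number): number {
  if (total > 100) {
    return total - total / 10;
  }
  return total;
}

type PriceFn = (total: number) => number;

// Executes both branches, so line coverage is 100%,
// but it never checks the boundary value.
function shallowSuitePasses(fn: PriceFn): boolean {
  return fn(200) === 180 && fn(50) === 50;
}

// Adds the one case that reflects the actual requirement.
function boundarySuitePasses(fn: PriceFn): boolean {
  return shallowSuitePasses(fn) && fn(100) === 90;
}

const shallowMissesBug = shallowSuitePasses(priceAfterDiscountBuggy);     // true
const boundaryCatchesBug = !boundarySuitePasses(priceAfterDiscountBuggy); // true
```

Both suites report the same coverage for this function. Only the second one encodes the rule that an order of exactly 100 qualifies, and knowing that rule matters is product knowledge, not something a coverage metric can supply.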
Verdict
The testing stage showed that AI can generate and run tests with minimal input. The system produced a working test suite and executed it successfully.
At the same time, trust in the results requires verification. AI can create tests quickly, but it does not determine their relevance without guidance.
As in the previous stages, the most reliable outcome comes from combining automated generation with targeted human review.
Deployment stage: The stage that almost ran itself
“I didn’t write a single line of Docker configuration. It generated everything. One file didn’t work, I fixed it in a few minutes, and we moved on. That’s roughly where things stand right now.” – Eugene
Deployment was the shortest stage in the experiment, and in some ways the most straightforward.
It followed the same pattern as earlier steps. The agent used the existing instructions to generate infrastructure configuration without manual setup.
It created Dockerfiles for each service and two Docker Compose configurations: one for development and one for production. The setup reflected the structure defined during development. No separate infrastructure design was introduced at this point.
From instructions to a running environment
The deployment setup was derived directly from the project context.
The agent:
- Generated Dockerfiles for all services
- Created a production Docker Compose configuration
- Created a development Docker Compose configuration
- Connected services based on previously defined ports and dependencies
The only manual input was related to port selection. Eugene specified non-default ports to avoid conflicts on his local machine. The agent applied these values without additional adjustments.
Where configuration needs correction
One issue required direct intervention. The development Docker Compose setup did not run as generated.
Eugene reviewed the configuration and fixed it manually. The correction took a few minutes. The rest of the setup worked as expected.
This reflects a common pattern. AI-generated infrastructure often reaches a near-complete state. Final adjustments are still required for edge cases.
Working with legacy systems
Most modern agent systems include an initialization stage. The agent scans the repository, maps dependencies, reviews available documentation, and generates a working instruction file describing the codebase structure. This reduces the time needed to navigate large projects and helps engineers start working faster.
However, this should not be interpreted as automatic understanding of a production system. Generating a feature inside a mature application still requires architectural awareness, validation, and engineering control. Existing systems contain business rules, undocumented dependencies, historical decisions, infrastructure constraints, and edge cases that are often distributed across teams rather than stored in a single source of truth.
Documentation is only one part of the problem. System consistency also depends on code quality, naming conventions, service boundaries, test coverage, release processes, and how predictable the existing architecture is. AI performs best in environments where these elements are already structured and maintained.
A monolith with clear patterns and stable documentation is easier for an agent to analyze. A fragmented microservices environment with inconsistent standards and missing ownership introduces more uncertainty, because the agent can only infer relationships from the information available in the repository.
In practice, AI shortens the onboarding and discovery phase. It does not replace the engineering work required to safely extend or modify a production system.
Where AI is effective in deployment tasks
AI performs well when infrastructure follows standard patterns.
It can:
- Generate container configurations for services
- Define service relationships in Docker Compose
- Reuse configuration patterns across environments
- Align infrastructure with application structure
Where human input is still required
Deployment still depends on validation and environment awareness.
Human input is required for:
- Resolving configuration errors
- Adjusting environment-specific parameters
- Verifying service communication and dependencies
- Extending setup to production-grade infrastructure
CI/CD pipelines, monitoring, and alerting were not part of this experiment. These areas require additional configuration and validation.
Verdict
Deployment was the stage where AI required the least intervention and produced the most complete output relative to what was asked. The infrastructure configuration was generated from instructions, worked almost entirely on the first attempt, and needed one manual correction on a single file.
For new projects, this stage is where the time savings are clearest and the risks are most manageable.
What this experiment shows about AI-assisted development: results, time, and honest limits
Across five stages, the pattern was consistent. AI accelerates the execution of well-defined tasks significantly. But it does not replace the engineering judgment that makes those tasks well-defined in the first place.
- In business analysis, AI compressed the distance from idea to structured brief from days to a single conversation.
- In design, it produced a coherent component system in minutes once the tooling was configured correctly.
- In development, it generated backend services, a frontend, shared contracts, and infrastructure in under two hours of agent runtime.
- In testing, it wrote 42 Playwright tests and inferred that XSS injection belonged in scope without being asked.
- In deployment, it produced a near-complete Docker configuration on the first pass.
At every stage, the quality of output depended on the quality of the instructions behind it. This is the central finding. AI does not drift because it lacks capability. It drifts because it lacks structure.
The engineer’s job shifted from writing code to defining the system that generates it: instruction architecture, validation rules, context management, and knowing when to restart rather than patch. These are engineering decisions, and they require experience.
The other consistent finding is that gains compound. The first project in this workflow required the most setup. The instruction files, validation hooks, and design pipeline configurations built during this experiment are reusable. The next project starts from a working baseline. Setup cost is paid once.
And at what cost? Time.
The full process, from the first prompt in ChatGPT to a working application, took around 10 hours. Not bad for a weekend, but that needs a bit of context.
Those were not 10 focused hours at a desk. The engineer worked across a weekend, in the evenings, writing an instruction and leaving the agent to run while he went and did other things. He cooked dinner. He drove. He came back, checked the output, wrote the next instruction, and left again.
About half of the total effort went into setup. This included configuring the agent system, stabilizing the design pipeline, and restarting the project several times.
Once the setup was stable, individual steps were short:
- Backend services generated in 6–9 minutes
- Full test run completed in about 17 minutes
- Design system produced in roughly 10 minutes per run
- Infrastructure configuration generated within a single stage
What was built, and how difficult is this type of application?
The application built during the experiment was not a production-scale platform. It was a relatively compact service-based system created to test how far AI-assisted workflows could go across the full development lifecycle.
The final version included:
- A Next.js frontend
- Multiple backend services
- Shared contracts between services
- Automated testing
- Containerized deployment
- AI-generated design assets
For an experienced engineering team, this is a manageable scope. At the same time, it still represents enough moving parts to expose coordination problems, architectural inconsistencies, testing gaps, and workflow limitations.
That is what made the experiment useful.
Without AI assistance, a similar prototype would typically require more implementation time across design, backend development, frontend integration, testing, and infrastructure setup. In this case, AI reduced the amount of manual implementation work substantially, especially during scaffolding and repetitive development tasks.
However, the reduction came mostly from accelerating execution, not from removing engineering complexity.
The project still required:
- Architectural decisions
- Instruction management
- Output validation
- Workflow corrections
- Review of generated behavior
- Repeated refinement across stages
This distinction is important. AI reduced the time spent writing predictable code manually. It did not eliminate the need for experienced engineering oversight.
For teams already working with structured development processes, this is where the largest practical gains appear today.
How this compares to standard delivery
For context: building a backend with this number of services from scratch typically takes an experienced engineer around 20 hours for the data layer alone. Adding frontend, design, infrastructure, and testing brings total effort to 50–60 hours under normal delivery conditions.
The same scope was completed here in 10 hours, roughly half of which was environment setup. The gap narrows on more complex systems, where domain knowledge and architectural judgment cannot be delegated to an agent. But for well-scoped greenfield work, the time difference is real.
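To put the comparison in proportion, the short sketch below restates the figures quoted above as a speedup factor and a percentage reduction. It uses only the numbers from this section (50–60 hours estimated, 10 hours actual); the function and variable names are illustrative, and the baseline is an estimate rather than a measured control run.

```python
# Figures quoted in this section; names are illustrative.
baseline_hours = (50, 60)   # estimated effort under standard delivery
actual_hours = 10           # time spent in the experiment


def speedup(baseline: float, actual: float) -> float:
    """How many times faster the actual run was than the baseline."""
    return baseline / actual


def reduction_pct(baseline: float, actual: float) -> float:
    """Share of the baseline effort saved, as a percentage."""
    return (baseline - actual) / baseline * 100


for b in baseline_hours:
    print(f"{b} h baseline: {speedup(b, actual_hours):.1f}x faster, "
          f"{reduction_pct(b, actual_hours):.0f}% less time")
# 50 h baseline: 5.0x faster, 80% less time
# 60 h baseline: 6.0x faster, 83% less time
```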
The application ran. The services communicated. The tests passed. Eugene identified visible issues and fixed them; anything not surfaced during that review remained in the codebase. That is the trade-off stated plainly: AI-assisted development at this pace produces working software, not reviewed software. Closing that gap is a process question, not a capability question.
How to integrate AI into real development workflows
Every stage of Eugene’s experiment ended with the same observation: AI performs well when it has clear instructions, a feedback loop, and a human who knows what a correct result looks like.
Remove any of those three, and the output degrades in predictable ways. The design agent drew crooked layouts until it was told to validate dimensions. The development agent mixed package managers until the instructions enforced consistency. The test suite looked complete until someone noted that coverage numbers and coverage quality are different things.
This pattern points to something worth stating directly. AI does not fail because it lacks capability. It drifts because it lacks structure. And structure, in a real team environment, comes from a framework that defines how AI is used, how its outputs are checked, and how the whole thing fits into existing workflows.
At Aristek, integrating AI across the development lifecycle is structured around six operational layers. Each layer addresses a specific question about how AI fits into real team work, from deciding where to apply it to tracking whether it is delivering value over time.
1. Use layer — apply AI where it adds value
AI is most effective when applied to specific, well-defined tasks across roles.
Typical use cases include:
- Business analysis: drafting requirements, structuring scenarios
- Design: prototyping, exploring UI options
- Development: generating code, refactoring existing logic
- QA: creating test cases, identifying edge conditions
- Operations: analyzing logs, identifying anomalies
This layer defines scope. It answers a simple question: where does AI save time without reducing clarity?
A common risk appears here. AI can generate more output than a team can realistically review.
2. Control layer — validate outputs early
Speed without validation creates inconsistency.
Control mechanisms introduce checks at the moment output is produced:
- Developer review during generation
- Defined code review rules
- Automated test validation
- Output constraints and guardrails
These controls reduce the chance of incorrect logic moving forward.
Without them, common issues include:
- Inconsistent implementations
- Incorrect assumptions in generated code
- Tests that pass but do not validate real behavior
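The last issue in the list is easy to miss, because coverage tools count executed lines, not verified behavior. The sketch below is a made-up illustration, not code from the experiment: `apply_discount` is a hypothetical function with a deliberate bug, and the two checks execute exactly the same lines.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test. Deliberate bug: adds instead of subtracts."""
    return price + price * percent / 100


def smoke_check() -> None:
    # Executes every line of apply_discount, so line coverage is complete,
    # but nothing is asserted about the result.
    apply_discount(100.0, 10.0)


def behavior_check() -> None:
    # Same coverage, but the expected value is stated, so the bug is caught.
    assert apply_discount(100.0, 10.0) == 90.0


def passes(check) -> bool:
    """Run a check and report whether it completed without an assertion failure."""
    try:
        check()
        return True
    except AssertionError:
        return False


print(passes(smoke_check))     # True: passes despite the bug
print(passes(behavior_check))  # False: fails because 110.0 != 90.0
```

Both checks would contribute the same coverage number. Only the second one tells a reviewer anything about whether the generated code is correct.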
3. Integration layer — make AI part of daily work
AI creates impact only when it is part of existing workflows.
This includes:
- Integration into IDEs and development tools
- Use within CI/CD pipelines
- Inclusion in pull request workflows
- Connection to documentation and knowledge bases
This shifts AI from individual use to team-level practice.
Without integration, usage remains fragmented. Results vary between developers, and outputs do not align.
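As one illustration of moving a rule from agent instructions into the pipeline, consider the mixed package managers mentioned earlier. The sketch below is an assumption about how such a check could look, not the check used in the experiment: it scans a repository for npm, Yarn, and pnpm lockfiles and reports whether more than one package manager is in use. The function names are hypothetical.

```python
from pathlib import Path

# Lockfile names written by npm, Yarn, and pnpm.
LOCKFILES = {
    "package-lock.json": "npm",
    "yarn.lock": "yarn",
    "pnpm-lock.yaml": "pnpm",
}


def package_managers_in(repo_root: Path) -> set[str]:
    """Return the package managers whose lockfiles appear under repo_root."""
    found: set[str] = set()
    for path in repo_root.rglob("*"):
        # Ignore lockfiles shipped inside installed dependencies.
        if "node_modules" in path.relative_to(repo_root).parts:
            continue
        if path.name in LOCKFILES and path.is_file():
            found.add(LOCKFILES[path.name])
    return found


def check_single_package_manager(repo_root: Path) -> bool:
    """True when the repository uses at most one package manager."""
    return len(package_managers_in(repo_root)) <= 1
```

A pipeline step could call `check_single_package_manager(Path("."))` and fail the build when it returns `False`, so the rule is enforced for every contributor, human or agent, and not only where the instructions happen to be loaded.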
4. Context layer — provide the system with real data
AI depends on context. The quality of output reflects the quality of input.
Relevant context includes:
- Access to the codebase
- Existing design systems
- Project documentation
- Data models and domain rules
When this context is available, outputs align with the system.
Without it, results become generic. Rework increases because decisions do not match the actual project.
5. Observability layer — track what is happening
AI usage needs to be visible.
This includes:
- Tracking how often AI is used
- Measuring output quality
- Monitoring performance impact
- Tracking associated costs
Visibility helps teams understand where AI adds value and where it introduces inefficiencies.
Without it, adoption is difficult to manage. Costs and results remain unclear.
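A minimal sketch of what such tracking could record is shown below. The event fields, the `summarize` function, and the sample values are all assumptions made for illustration; they do not describe Aristek's internal tooling or real measurements.

```python
from dataclasses import dataclass


@dataclass
class AiUsageEvent:
    """One AI-assisted task, as a team might log it (hypothetical schema)."""
    task: str             # e.g. "code generation", "test creation"
    accepted: bool        # did the output survive human review?
    cost_usd: float       # cost attributed to the request
    minutes_saved: float  # reviewer's estimate, not a measured value


def summarize(events: list[AiUsageEvent]) -> dict[str, float]:
    """Aggregate usage, quality, impact, and cost into one summary."""
    total = len(events)
    accepted = [e for e in events if e.accepted]
    return {
        "requests": total,
        "acceptance_rate": len(accepted) / total if total else 0.0,
        "total_cost_usd": sum(e.cost_usd for e in events),
        # Rejected output saves no time, so only accepted events count.
        "minutes_saved": sum(e.minutes_saved for e in accepted),
    }


# Placeholder values for illustration only.
events = [
    AiUsageEvent("code generation", True, 0.5, 30.0),
    AiUsageEvent("test creation", False, 0.25, 0.0),
    AiUsageEvent("refactoring", True, 0.25, 15.0),
    AiUsageEvent("log analysis", False, 0.0, 0.0),
]
print(summarize(events))
```

Even a summary this small covers the four items above: how often AI is used, how much of its output is accepted, what it saves, and what it costs.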
6. Evolution layer — improve the process over time
AI workflows do not stay static.
They require continuous adjustment:
- Refining prompts and instructions
- Updating workflows based on results
- Optimizing cost and execution time
- Adapting to new tools and models
Without this step, initial gains decrease over time.
What this experiment actually proves
The experiment produced a working application, an honest account of where AI helped and where it didn’t, and one conclusion that holds across every stage: AI in software development is not a future consideration. It is a present one.
The teams that treat it as such are already moving faster. The teams waiting for more certainty are falling further behind with every sprint.
The point was never that AI replaces developers. The experiment clearly demonstrates the opposite. At every stage where the output was good, a developer had defined the structure, written the instructions, and reviewed the result. At every stage where it drifted, that oversight was missing. The tool is capable, but the judgment belongs to the engineer.
At Aristek, this experiment is not an isolated weekend project. It reflects how our engineers are approaching development today, on real projects, across different environments and constraints.
We have worked through the setup costs, the tooling gaps, the context management decisions, and the points where human review is non-negotiable. That experience is what we bring when we help teams integrate AI into their development process.
We work with engineering teams to define where AI fits into their specific workflow, how outputs are validated before they move forward, and how to maintain full visibility over what is being generated, reviewed, and shipped. The goal is a development process where AI handles execution and engineers remain responsible for every decision that matters.
If your team is already working with AI but results are inconsistent
… or if you are planning to introduce it into active development, book a free consultation to discuss this.
If you have questions about the experiment and its results, you can also reach out to our team.




