
Re-architecting a legacy data system with scalable pipelines

The client is a company that helps colleges and universities improve enrollment through a combination of data analytics, marketing, and technology. It supports institutions across the full student lifecycle, from recruitment to retention.

  • AI Summary

    A higher education data platform built on a legacy Azure stack became difficult to maintain and scale. It operates in a complex environment with varied data formats, strict compliance requirements, and integration constraints, serving multiple institutions from a single codebase.

    In this case study, we show how the platform was reworked into a scalable data product within five months. The material covers system analysis, architecture design, data pipeline implementation, and results, including cost reduction, improved data flow, and successful onboarding of 20+ clients.

Key achievements

  • 20-30% infrastructure cost reduction
  • 20+ clients onboarded
  • Up to 30% development cost savings

Challenge

An existing data platform had become difficult to maintain and scale, which led the client to seek outside support for modernization. The system, built on Microsoft Azure, had been in place for years and no longer supported new requirements.

This situation began to impact the business directly. Product quality declined, and day-to-day operations slowed due to the outdated stack and unclear system logic. Plans to onboard new customers and expand the offering faced delays, as the platform could not support these goals reliably.

To address these challenges, external expertise was brought in to assess the current state and define a clear path forward. The focus was on stabilizing the platform and building a foundation for future growth. Close collaboration with internal stakeholders helped clarify constraints and align priorities.

Several issues affected progress from the start.

  • Lack of system clarity

    The legacy platform lacked documentation, and internal teams could not fully understand parts of the system logic. The data structure had evolved over time, which made the system more difficult to maintain and extend.

  • Unclear infrastructure direction

    The client needed a flexible solution without dependence on a single cloud provider. At the same time, the team had to evaluate IBM Cloud, AWS, and Databricks without a final infrastructure decision in place.

  • Restricted access and slow onboarding

    Strict security policies introduced multiple approval layers. This slowed onboarding and limited access to data, which delayed early progress and reduced team efficiency.

Solution

The client needed a scalable data management solution to replace a legacy Azure-based system and support growing data operations across multiple institutions. The goal was to modernize the platform within its existing ecosystem, improve interoperability between components, support flexible data flows and customization, and reduce reliance on a single cloud provider.

They turned to the Aristek team for deep technical expertise to assess the situation, develop a scalable data product, and guide key decisions. From the start, the team worked closely with the client’s architects to evaluate technology options, including Databricks and multi-cloud approaches.

A dedicated team of data engineers and DevOps specialists, supported by a project manager, was assembled within one month. In the initial phase, the team reverse-engineered the legacy system to recover missing knowledge and define a clear migration path.

The resulting solution was implemented on AWS and includes the following components:

  • End-to-end data pipelines

    Automated pipelines process data from ingestion through transformation, validation, and delivery. The design supports consistent data flow across systems and reduces manual intervention.

  • Flexible data ingestion layer

    Supports databases, CSV and JSON files, and REST APIs with authentication and pagination. This lets the platform integrate with multiple data sources without additional complexity (see the sketch after this list).

  • Custom transformation and mapping logic

    Implements business rules, data mapping, enrichment, and validation. The structure improves consistency and makes future updates easier to manage.

  • Automated data delivery

    Processed data is exported, sent to downstream systems, or returned to the client. This reduces delays and supports reliable data distribution.

  • Event-based orchestration

    Triggers and workflows run automatically when new data appears. This enables faster processing and reduces dependency on manual execution.

  • Infrastructure flexibility

    The architecture was designed based on a joint evaluation of AWS, Databricks, and other options. The final setup supports future migration and avoids reliance on a single provider.

  • Security and access management

    Credentials are managed through AWS Secrets Manager with secure authentication practices. This aligns with enterprise security requirements and controlled access policies.

  • Code quality processes

    Development includes linters, mandatory code reviews by two engineers, and GitHub-based version control. These practices improve maintainability and reduce the risk of errors.
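
As a concrete illustration of the ingestion and security cards above, here is a minimal Python sketch of reading a paginated REST source with a token held in AWS Secrets Manager. The endpoint, secret name, response shape, and pagination scheme are assumptions for illustration, not the project's actual code.

    # Hypothetical example: read all pages of a REST source, authenticating
    # with a token stored in AWS Secrets Manager rather than in code.
    import json

    import boto3
    import requests

    def fetch_all(base_url: str, secret_name: str) -> list[dict]:
        secrets = boto3.client("secretsmanager")
        secret = secrets.get_secret_value(SecretId=secret_name)
        token = json.loads(secret["SecretString"])["api_token"]  # assumed key

        headers = {"Authorization": f"Bearer {token}"}
        records, page = [], 1
        while True:
            resp = requests.get(base_url, headers=headers, params={"page": page})
            resp.raise_for_status()
            batch = resp.json().get("results", [])  # assumed response field
            if not batch:
                break
            records.extend(batch)
            page += 1
        return records

The same pattern extends to databases and file-based sources: each connector reads from its source, normalizes the records, and hands them to the shared transformation layer.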

Project scope

The team integrated into the client’s workflows, tools, and communication channels, allowing the internal team to stay in control. A daily overlap of 2 to 4 hours with US-based stakeholders supported alignment despite the time zone difference, with project management ensuring clear coordination and communication.

The project was divided into the following key stages:

  • 1. Discovery phase & system analysis

    • Analyzed the legacy system and the existing data warehouse
    • Performed partial reverse engineering to understand existing logic and data flows
    • Identified gaps, inconsistencies, and areas requiring restructuring
    • Documented findings to define requirements for the new system
  • 2. Architecture design

    • Designed a scalable data processing architecture based on AWS
    • Defined pipeline structure, data flow logic, and integration points
    • Focused on flexibility, cost efficiency, and future portability
  • 3. MVP development (3 months)

    • Built core data pipelines covering ingestion, transformation, mapping, and delivery
    • Implemented support for multiple data sources, including files, databases, and APIs
    • Maintained 2-4 hours of daily overlap with US-based teams
  • 4. Testing & validation (1 month)

    • Tested pipelines with available data and refined transformation logic
    • Validated data accuracy, processing flows, and system behavior
    • Prepared the system for production use
  • 5. Deployment & launch (1 month)

    • Deployed the solution into the client’s environment
    • Completed integration with existing systems
    • Ensured stable operation and readiness for handling real data workloads
  • 6. Ongoing support & improvement

    • Provided continuous support after launch
    • Monitored system performance
    • Refined pipelines and adjusted workflows

How it works

The system operates as an automated data pipeline with event-driven processing. The client provides data as CSV, Excel, or other file formats. An SFTP server is monitored for new uploads, and when new files appear, a pipeline is triggered.
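
As a minimal sketch of that trigger, assuming the SFTP endpoint stores uploads in Amazon S3 (as with AWS Transfer Family) and the pipeline runs as an AWS Glue job, a small handler might look like this; the job name and argument keys are hypothetical:

    # Hypothetical AWS Lambda handler: start a Glue job for each new file
    # that lands in the S3 bucket behind the SFTP endpoint.
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            glue.start_job_run(
                JobName="process-client-files",  # assumed job name
                Arguments={"--source_bucket": bucket, "--source_key": key},
            )

In production, the trigger could equally be an EventBridge rule or a Glue workflow trigger; the point is that processing starts automatically when data arrives.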

1. Files are received and copied from the source (e.g., an SFTP server) into the system.
2. Data is extracted from the source files (CSV, Excel, or other formats).
3. Transformation and mapping logic is applied based on client-specific rules (see the sketch after this list).
4. Data is validated, enriched, and prepared for further use.
5. Processed data is stored in internal storage.
6. Data is distributed to target systems or returned to the client as downloadable files.
7. In parallel, data can be used for analytics and visualization (e.g., Power BI).
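
To make steps 3 and 4 more concrete, here is a minimal PySpark sketch of client-specific mapping and validation. The column names, mapping rules, and paths are hypothetical, not taken from the project:

    # Hypothetical PySpark step: rename source columns to a canonical schema,
    # drop rows that fail a basic check, and write the result as Parquet.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("client-transform").getOrCreate()

    # Per-client column mapping, e.g. loaded from a configuration store.
    COLUMN_MAP = {"STUDENT_ID": "student_id", "ENROLL_DT": "enrollment_date"}

    df = spark.read.option("header", True).csv("s3://example-bucket/incoming/")

    for src, dst in COLUMN_MAP.items():
        df = df.withColumnRenamed(src, dst)

    # Basic validation: require a student identifier on every row.
    df = df.filter(F.col("student_id").isNotNull())

    df.write.mode("overwrite").parquet("s3://example-bucket/processed/")

In the real pipeline, rules like these are driven by client-specific configuration rather than hardcoded values.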

Team

  • Data Engineers x6
  • DevOps Engineers x2
  • Project Manager x1

Tools & technologies

Python
Apache Spark
AWS
AWS Glue
AWS EventBridge
AWS Secrets Manager
GitHub

Project results

5 months from kickoff to launch
The platform went from reverse engineering to production in 5 months. This included MVP, testing, and launch for one product within a larger environment.

20-30% infrastructure cost reduction
A simpler approach kept the platform lean and reduced infrastructure spend by 20-30%. The client retained the processing power it needed without a heavier setup.

Up to 30% development cost savings
The staff augmentation model gave access to experienced engineers without the cost of building an in-house team, saving the client up to 30% on development.

20+ clients onboarded
The new setup carried more than 20 clients from the legacy platform into a cleaner operating model. It provided a more efficient way to manage client data.

Key takeaways

The project replaced a legacy setup with a working platform for one part of a much larger system. New files now move through SFTP monitoring, automated processing, validation, and storage before reaching downstream use cases such as client downloads and Power BI.

Because access arrived in stages, the team kept progress moving by working from test data first, then switching to real inputs once approvals came through. With 2-4 hours of daily overlap with US-based teams, handoffs and reviews stayed active across time zones.

The result is a stable setup that gives the client cleaner operations, lower costs, and a clearer path for future client onboarding.

Further development focuses on:

  • adding new data sources and clients
  • expanding transformation logic
  • improving monitoring and system stability

If your system is holding you back, it might be time to rethink the approach.

We can help you shape a clear solution and next steps.
