Beryl Analytics Blog

Building a Data Pipeline From Scratch: A Reference Architecture for SMEs

By Beryl Analytics • 12 February 2026 • 10 min read

Every analytics and AI project eventually runs into the same wall: the data is scattered, inconsistent, and locked inside tools that do not talk to each other. Sales lives in the CRM, money lives in the accounting software, product usage lives in the app database, and marketing lives in three more platforms. Before you can forecast demand or predict churn, you have to get that data into one trustworthy place, kept fresh automatically. That is what a data pipeline does. This article lays out a reference architecture aimed at small and mid-sized businesses, right-sized so you are not building a platform fit for a thousand-person enterprise when you have twenty.

The five stages of a data pipeline

Almost every data pipeline, whether it powers a corner shop dashboard or a global retailer, breaks into the same five stages. Understanding them keeps the architecture clear no matter which tools you pick.

1. Ingestion

Ingestion is how raw data gets pulled from its source systems into your pipeline. Sources include SaaS apps (via their APIs), operational databases, flat files, and event streams. The key decision is batch versus streaming. Batch ingestion pulls data on a schedule, say every hour or once a night, and is simple, cheap, and sufficient for the vast majority of business reporting. Streaming ingests data continuously and is only worth the extra complexity when you genuinely need up-to-the-second freshness.

For most SMEs, managed connector tools handle ingestion without custom code, letting you point at a source and a destination and move on.
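To make the batch idea concrete, here is a minimal sketch of incremental batch ingestion. It assumes each source record carries an `updated_at` timestamp and that the pipeline remembers a watermark between runs; the function name, field names, and sample records are hypothetical and stand in for whatever a managed connector would do for you.

```python
from datetime import datetime, timezone

def ingest_batch(source_rows, raw_table, watermark):
    """Copy rows changed since `watermark` into `raw_table` and return the new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    raw_table.extend(new_rows)
    # If nothing changed, the watermark stays where it was.
    return max((r["updated_at"] for r in new_rows), default=watermark)

# Hypothetical records standing in for a CRM API response.
source = [
    {"id": 1, "updated_at": datetime(2026, 2, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2026, 2, 3, tzinfo=timezone.utc)},
]
raw = []
wm = datetime(2026, 2, 2, tzinfo=timezone.utc)

# One scheduled run: only the record updated after the watermark is pulled.
wm = ingest_batch(source, raw, wm)
```

Running the same function again with the returned watermark pulls nothing new, which is what makes an hourly or nightly schedule safe to repeat.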

2. Storage

The destination for raw and processed data is typically a cloud data warehouse. Most modern cloud warehouses separate storage from compute, so you pay for what you use and scale without managing servers. For smaller teams, a warehouse is almost always a better choice than a sprawling data lake, because it is simpler to query and govern. The warehouse becomes your single source of truth.

3. Transformation

Raw data is messy. Transformation cleans it, standardizes formats, deduplicates records, joins sources together, and shapes the data into the clean, well-modeled tables your analysts and models will actually use. This is where business logic lives: how you define an active customer, how you calculate revenue, how you bucket products. Getting transformation right is what turns raw exhaust into trustworthy reporting.
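As an illustration, the sketch below standardizes a key, deduplicates records, joins two sources, and applies one piece of business logic (revenue counts only paid orders). The field names, the sample rows, and the revenue rule are assumptions made for the example, not a prescribed model.

```python
def clean_customers(rows):
    """Standardize the email format and keep the most recently updated row per email."""
    latest = {}
    for row in rows:
        email = row["email"].strip().lower()
        cleaned = {**row, "email": email}
        # ISO dates compare correctly as strings.
        if email not in latest or cleaned["updated"] > latest[email]["updated"]:
            latest[email] = cleaned
    return list(latest.values())

def revenue_by_customer(customers, orders):
    """Join orders to customers on email and sum the paid order amounts."""
    totals = {c["email"]: 0.0 for c in customers}
    for order in orders:
        email = order["email"].strip().lower()
        if email in totals and order["status"] == "paid":
            totals[email] += order["amount"]
    return totals

# Hypothetical raw extracts from two systems.
raw_customers = [
    {"email": " Ana@Example.com", "updated": "2026-01-05"},
    {"email": "ana@example.com", "updated": "2026-02-01"},
    {"email": "ben@example.com", "updated": "2026-01-20"},
]
raw_orders = [
    {"email": "ANA@example.com", "amount": 120.0, "status": "paid"},
    {"email": "ana@example.com", "amount": 30.0, "status": "refunded"},
    {"email": "ben@example.com", "amount": 45.5, "status": "paid"},
]

customers = clean_customers(raw_customers)
revenue = revenue_by_customer(customers, raw_orders)
```

In practice this logic usually lives in SQL inside the warehouse, but the shape is the same: normalize keys first, then deduplicate, then join and aggregate.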

4. Orchestration

Orchestration is the conductor that runs each stage in the right order, on schedule, and handles failures gracefully. It makes sure transformation only runs after ingestion finishes, retries when a source is temporarily down, and alerts you when something breaks. Without orchestration you end up with brittle manual scripts that someone has to remember to run.
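The core of what an orchestrator does can be sketched in a few lines: run tasks in dependency order and retry the ones that fail. This is a simplified illustration using Python's standard `graphlib` module (Python 3.9 or later), not a replacement for a real orchestration tool; the task names and the `run_pipeline` helper are hypothetical.

```python
import time
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies, max_retries=2, delay_seconds=0.0):
    """Run tasks in dependency order, retrying a failed task up to max_retries times."""
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                if attempt == max_retries:
                    # A real orchestrator would send an alert at this point.
                    raise RuntimeError(
                        f"task {name!r} failed after {attempt + 1} attempts"
                    ) from exc
                time.sleep(delay_seconds)
    return completed

# Hypothetical ingestion task that fails once, as if the source were briefly down.
attempts = {"ingest": 0}

def flaky_ingest():
    attempts["ingest"] += 1
    if attempts["ingest"] == 1:
        raise ConnectionError("source temporarily down")

tasks = {"ingest": flaky_ingest, "transform": lambda: None, "checks": lambda: None}
# Each task maps to the set of tasks that must finish before it starts.
dependencies = {"transform": {"ingest"}, "checks": {"transform"}}

order = run_pipeline(tasks, dependencies)
```

Here the first ingestion attempt fails, the retry succeeds, and transformation only starts afterwards.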

5. Observability

Observability is how you know your pipeline is healthy and your data is correct. It covers freshness (did today's data actually arrive?), volume (did we get roughly the expected number of rows?), and quality (are values in the ranges we expect?). Skipping observability is the most common reason teams quietly lose trust in their dashboards. We treat data correctness as a first-class concern, which is why data quality runs through everything we build.
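The three checks named above can be expressed as a small function. This is a sketch under assumptions: the `loaded_at` and `amount` fields, the thresholds, and the sample rows are all hypothetical, and sensible limits depend entirely on your own data.

```python
from datetime import datetime, timedelta, timezone

def check_table(rows, now, max_age, expected_rows, tolerance=0.5, amount_range=(0, 100_000)):
    """Return a list of problems found; an empty list means the table looks healthy."""
    if not rows:
        return ["volume: table is empty"]
    problems = []
    # Freshness: did recent data actually arrive?
    newest = max(r["loaded_at"] for r in rows)
    if now - newest > max_age:
        problems.append("freshness: newest row is older than allowed")
    # Volume: is the row count roughly what we expect?
    low, high = expected_rows * (1 - tolerance), expected_rows * (1 + tolerance)
    if not low <= len(rows) <= high:
        problems.append("volume: row count outside expected band")
    # Quality: are values inside the range we expect?
    lo, hi = amount_range
    if any(not lo <= r["amount"] <= hi for r in rows):
        problems.append("quality: amount outside expected range")
    return problems

now = datetime(2026, 2, 12, 6, 0, tzinfo=timezone.utc)
rows = [
    {"loaded_at": now - timedelta(hours=2), "amount": 10.0},
    {"loaded_at": now - timedelta(hours=2), "amount": 20.0},
    {"loaded_at": now - timedelta(hours=2), "amount": -5.0},  # suspicious negative amount
]
problems = check_table(rows, now, max_age=timedelta(hours=24), expected_rows=4)
```

With these sample rows the data is fresh and the volume is within the band, so only the quality check reports a problem.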

ETL vs ELT: which order to clean

The classic debate is whether to transform data before loading it (ETL) or after (ELT). The modern, cloud-warehouse answer for most businesses is ELT.

In ETL, you transform data in transit before it lands in the warehouse. This made sense when storage and compute were expensive and you wanted to load only clean, final data. Its downside is rigidity: if you need the data shaped differently later, you have to reprocess from the source.

In ELT, you load raw data into the warehouse first and transform it there using the warehouse's own compute. Because storage and compute in cloud warehouses are now inexpensive and scale on demand, this has become the default. It keeps a raw copy of everything (so you can always reshape later), simplifies the ingestion step, and lets analysts iterate on transformations without touching the source systems. For an SME building from scratch, start with ELT unless you have a specific reason not to.
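The ELT pattern can be sketched with Python's built-in `sqlite3` module standing in for a cloud warehouse. The table names, columns, and sample rows are hypothetical; the point is only that raw data is loaded untouched and the cleaning happens afterwards in SQL, inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw data exactly as the source delivered it, all as text.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT, status TEXT)"
)
raw_rows = [
    ("1001", " Ana ", "120.00", "PAID"),
    ("1001", " Ana ", "120.00", "PAID"),  # duplicate delivered by the source
    ("1002", "Ben", "45.50", "paid"),
    ("1003", "Ben", "30.00", "refunded"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: build a clean table from the raw one using the database's own compute.
conn.execute("""
    CREATE TABLE paid_orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        TRIM(customer)            AS customer,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE LOWER(status) = 'paid'
""")

paid = conn.execute(
    "SELECT order_id, customer, amount FROM paid_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

The raw table still holds all four rows, so if the definition of a paid order changes later, the clean table can be rebuilt without going back to the source system.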

Right-sizing for a smaller team

The biggest mistake SMEs make is copying the architecture of a tech giant. You do not need a streaming platform, a data lake, a lakehouse, and a fleet of specialized engineers to get clean, reliable reporting. Right-sizing means deliberately choosing the simplest stack that meets your needs.

Prefer managed services. Managed ingestion connectors and a serverless warehouse mean you are not babysitting infrastructure. Your scarce engineering time goes to business logic, not plumbing.
Default to batch. Hourly or nightly refreshes cover the vast majority of business decisions. Add streaming only where a real decision depends on real-time data.
Model only what you need. Build clean tables for the questions you actually ask, and resist the urge to model every conceivable future use case up front.
Document as you go. A short description of what each table means and where it comes from saves enormous pain when the team grows or memory fades.

A well-built small pipeline that runs reliably every night beats an ambitious architecture that is half-finished and breaks every week. Start small, prove it works, and extend it as your needs grow.

Frequently asked questions

Do we need a data engineer to build a pipeline?

Not necessarily for a first version. Managed tools have lowered the bar significantly, and many SMEs build their initial pipeline with a partner and a small internal team. The hard part is usually transformation logic and governance, not raw engineering.

How much does a data pipeline cost to run?

With cloud-native, pay-as-you-go services, a modest SME pipeline can run on a small monthly budget, since you pay mostly for the compute and storage you actually use. Costs scale with data volume and refresh frequency, which is another reason to default to batch.

How long does it take to stand one up?

A focused, well-scoped first pipeline that brings a few key sources into a warehouse with clean reporting tables can often be live in weeks, not months, provided the requirements are clear and access to the sources is granted promptly.

The takeaway

A data pipeline is the foundation everything else stands on. Get the five stages right (ingestion, storage, transformation, orchestration, and observability), default to ELT on a cloud warehouse, and resist the temptation to over-engineer. The goal is not the most impressive architecture. It is clean, fresh, trustworthy data that your dashboards and models can rely on every single morning. If you want help designing one that fits your team, get in touch through our contact page.

