SageMaker Pipelines vs. DIY Orchestration - What I've Learned

There's a conversation that happens in almost every ML team at some point. Someone's just finished training a model in a notebook, it's working beautifully, and now the question comes up: "How do we actually run this thing reliably, on a schedule, with proper monitoring, and without it exploding in production?"

That's where orchestration enters the picture. And that's where the arguments start.

Do you reach for SageMaker Pipelines - AWS's managed offering that promises to handle the heavy lifting - or do you roll your own setup with Airflow, Prefect, Step Functions, or some combination of Docker containers and optimism?

I've worked with both. Here's my honest take.


First, What Are We Even Talking About?

Before we dive in, let's make sure we're on the same page.

SageMaker Pipelines is AWS's built-in orchestration layer for ML workflows. You define steps - processing, training, evaluation, registration - and it handles scheduling, compute provisioning, and logging, and integrates with the rest of the SageMaker ecosystem.

from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="my-ml-pipeline",
    steps=[preprocessing_step, training_step, evaluation_step],
    sagemaker_session=session,
)

pipeline.upsert(role_arn=role)
pipeline.start()

Clean, right? Looks almost too easy.

DIY orchestration is the umbrella term for everything else. Maybe you're using Apache Airflow to schedule DAGs. Maybe it's Prefect or Metaflow. Maybe it's an EC2 instance running a cron job that no one is fully sure about anymore. (We've all been there.)

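# Airflow DAG - roughly
# run_preprocessing, run_training and run_evaluation are your own functions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    "ml_pipeline",
    schedule="@daily",  # use schedule_interval= on Airflow older than 2.4
    start_date=datetime(2024, 1, 1),  # placeholder date; Airflow needs one to schedule runs
    catchup=False,  # don't backfill a run for every day since start_date
) as dag:
    preprocess = PythonOperator(task_id="preprocess", python_callable=run_preprocessing)
    train = PythonOperator(task_id="train", python_callable=run_training)
    evaluate = PythonOperator(task_id="evaluate", python_callable=run_evaluation)

    # Run the three tasks strictly in sequence.
    preprocess >> train >> evaluate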

Both approaches get the job done. The question is: at what cost?


What SageMaker Pipelines Actually Gets Right

Let me give credit where it's due. SageMaker Pipelines solves some genuinely annoying problems.

1. The AWS Ecosystem Just Works

If you're already living in AWS - and many production ML teams are - SageMaker Pipelines is native. It talks to S3, ECR, IAM, CloudWatch, and the SageMaker Model Registry without you writing glue code for each integration.

You define a training step, and it knows how to spin up the right instance type, pull your container from ECR, mount your S3 data, and write logs to CloudWatch. No custom integrations. No surprise boto3 calls at 2am.

from sagemaker.workflow.steps import TrainingStep
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image_uri,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{bucket}/output",
    role=role,
)

training_step = TrainingStep(
    name="ModelTraining",
    estimator=estimator,
    inputs={"train": training_input},
)

That's it. It handles the rest.

2. Step Caching Is Genuinely Useful

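This one doesn't get enough attention. SageMaker Pipelines can cache step outputs: once you enable caching on a step, a later run with the same step arguments reuses the previous result instead of running the step again. For expensive preprocessing jobs, this is huge. One caveat: as far as I can tell, the cache lookup is based on the step's arguments (input S3 URIs, instance settings, and so on), not on the contents behind them, so new data written to the same S3 path won't trigger a rerun by itself.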

from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="30d")

preprocessing_step = ProcessingStep(
    name="Preprocessing",
    processor=processor,
    inputs=[...],
    outputs=[...],
    cache_config=cache_config,  # Reuse the cached result if the step arguments haven't changed
)

With DIY setups, you're building this yourself. Which is fine, until you're three weeks in and you've accidentally rebuilt it four different ways across four different pipelines.
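
To give a sense of what "building this yourself" involves, here is a minimal sketch of input-hash caching in plain Python. Everything in it is an assumption for illustration: the names cache_key and cached_step, the local cache_dir, and the one-JSON-file-per-result layout are all made up, and a real setup would more likely keep results in S3 or a database and deal with concurrent runs.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cache_key(step_name: str, params: dict[str, Any], input_files: list[Path]) -> str:
    """Hash the step name, its parameters, and the bytes of every input file."""
    digest = hashlib.sha256()
    digest.update(step_name.encode())
    digest.update(json.dumps(params, sort_keys=True).encode())
    for path in sorted(input_files):
        digest.update(path.read_bytes())
    return digest.hexdigest()


def cached_step(
    step_name: str,
    fn: Callable[..., Any],
    params: dict[str, Any],
    input_files: list[Path],
    cache_dir: Path,
) -> tuple[Any, bool]:
    """Run fn(**params) unless a result for identical inputs is already stored.

    Returns (result, cache_hit). Results must be JSON-serialisable.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / f"{cache_key(step_name, params, input_files)}.json"
    if marker.exists():
        return json.loads(marker.read_text()), True
    result = fn(**params)
    marker.write_text(json.dumps(result))
    return result, False
```

Hashing file contents is a deliberate choice in this sketch: it costs a full read of the inputs on every run, but it notices when data changes underneath an unchanged path. Expiry, cleanup, and large outputs are all left out, which is exactly the sort of thing that gets rebuilt four different ways.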

3. Model Registry Integration

Once your pipeline runs, registering a model into the SageMaker Model Registry is a first-class citizen of the workflow. Approval workflows, versioning, deployment triggers - it's all connected.

In a DIY world, your model might end up as a .pkl file in an S3 bucket with a timestamp in the name. (Not ideal. Don't do this.)
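
If you do go DIY, the bare minimum is a manifest that travels with each artifact. The sketch below is one hypothetical way to do that on a local filesystem: ModelVersion, register_model, and approve are names I've invented, and the status strings simply mirror the ones the SageMaker Model Registry uses. It's meant to show the shape of the problem, not to replace a real registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_sha256: str
    metrics: dict
    approval_status: str = "PendingManualApproval"


def register_model(registry_dir: Path, name: str, artifact: Path, metrics: dict) -> ModelVersion:
    """Record a new version of `name` as a JSON manifest next to earlier versions."""
    model_dir = registry_dir / name
    model_dir.mkdir(parents=True, exist_ok=True)
    # Next version number = number of existing manifests + 1.
    version = len(list(model_dir.glob("v*.json"))) + 1
    entry = ModelVersion(
        name=name,
        version=version,
        artifact_sha256=hashlib.sha256(artifact.read_bytes()).hexdigest(),
        metrics=metrics,
    )
    (model_dir / f"v{version}.json").write_text(json.dumps(asdict(entry), indent=2))
    return entry


def approve(registry_dir: Path, name: str, version: int) -> None:
    """Flip a version's status to Approved so deployment code can pick it up."""
    manifest = registry_dir / name / f"v{version}.json"
    data = json.loads(manifest.read_text())
    data["approval_status"] = "Approved"
    manifest.write_text(json.dumps(data, indent=2))
```

Even this toy version gives you an explicit version number, a content hash to prove which artifact was evaluated, and an approval flag. It says nothing about concurrent writers, access control, or deployment triggers, which is the part a managed registry is really selling.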

4. The Visual Pipeline Graph

Minor thing, but: there's a DAG visualisation in SageMaker Studio. For onboarding new engineers or explaining pipelines to non-technical stakeholders, this is worth more than you'd expect.


Where SageMaker Pipelines Will Frustrate You

Here comes the honest part.

1. Local Development Is Painful

This is the big one. With DIY orchestration, you run your code locally, iterate fast, fix bugs in minutes. With SageMaker Pipelines, debugging often means pushing code to AWS, waiting for a job to spin up, watching it fail, reading CloudWatch logs, and doing it again.

That feedback loop is slow. Annoyingly slow.

Yes, SageMaker has local mode. Yes, it helps somewhat. But it's not seamless, and the parity between local and remote behaviour is never quite perfect.

# Local mode - helps, but not a full substitute
estimator = Estimator(
    ...,  # other arguments as before
    instance_type="local",  # Runs locally in Docker
)

You'll still hit cases where it works locally and breaks in the cloud because of an IAM permission, a VPC routing issue, or some other infrastructure quirk that only exists when you're running for real.

2. The SDK Is Verbose

SageMaker's Python SDK requires a lot of ceremony. You're defining processors, estimators, inputs, outputs, parameters, and pipeline objects just to do something relatively straightforward. For complex pipelines, the code can feel like you're configuring more than you're building.

Compare this to Prefect, where a pipeline is basically just annotated Python functions:

# Prefect - clean, readable, Pythonic
from prefect import flow, task

@task
def preprocess(data_path: str) -> str:
    processed_path = f"{data_path}.processed"  # placeholder for your logic
    return processed_path

@task
def train(data_path: str) -> str:
    model_path = f"{data_path}.model"  # placeholder for your logic
    return model_path

@flow
def ml_pipeline(data_path: str):
    processed = preprocess(data_path)
    model = train(processed)
    return model

This is just... Python. There's a lot to like about that.

3. Cost Creep Is Real

Every SageMaker Processing step spins up a managed instance. That's great for production workloads. It's less great when you're running a lightweight validation step that could run on a Lambda for a few cents.

With DIY setups, you have direct control over compute. Run cheap steps on Lambda, expensive steps on EC2 spot instances, and pay accordingly. SageMaker Pipelines isn't impossible to optimise for cost (the SDK includes a LambdaStep for lightweight work, for example), but it's not the default - you have to think about it explicitly.
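
As a rough illustration of what "direct control over compute" can look like, here is a small routing function. StepProfile, choose_compute, the target labels, and the safety factor are all hypothetical; the only hard facts in it are Lambda's limits (15-minute timeout, 10,240 MB memory at the time of writing).

```python
from dataclasses import dataclass

# AWS Lambda hard limits at the time of writing: 15-minute timeout, 10,240 MB memory.
LAMBDA_MAX_SECONDS = 15 * 60
LAMBDA_MAX_MEMORY_MB = 10_240


@dataclass(frozen=True)
class StepProfile:
    name: str
    est_seconds: int
    est_memory_mb: int
    needs_gpu: bool = False
    interruptible: bool = True  # safe to restart if a spot instance is reclaimed


def choose_compute(step: StepProfile, safety_factor: float = 2.0) -> str:
    """Return 'lambda', 'ec2-spot', or 'ec2-on-demand' for a step."""
    fits_lambda = (
        not step.needs_gpu
        and step.est_seconds * safety_factor <= LAMBDA_MAX_SECONDS
        and step.est_memory_mb * safety_factor <= LAMBDA_MAX_MEMORY_MB
    )
    if fits_lambda:
        return "lambda"
    return "ec2-spot" if step.interruptible else "ec2-on-demand"
```

The safety factor is there because runtime estimates are usually optimistic; a step that "takes about ten minutes" should not be sent to something with a fifteen-minute ceiling.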

4. Vendor Lock-in Is Real Too

This is the uncomfortable truth. Once your ML workflows are deeply embedded in SageMaker Pipelines - pipeline definitions, step configurations, SDK calls everywhere - moving away from AWS becomes a significant project.

DIY setups with Airflow or Prefect are more portable. Your orchestration logic isn't coupled to a specific cloud provider, even if your compute and storage are.


What DIY Gets Right

Full Control

You decide how jobs are scheduled, where they run, how failures are handled, and how results are logged. You're not fighting an opinionated framework.
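
Failure handling is a good example of that control. A minimal sketch of retry-with-exponential-backoff, written by hand, might look like this. The function name and its parameters are my own invention, and Airflow and Prefect both ship their own retry settings, so treat this as what you'd write for a bare cron-and-scripts setup.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last error once the attempts are used up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep in as a parameter keeps the function testable without real waiting, and retry_on lets you retry a throttling error without also retrying a bug in your own code.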

Easier Testing

Your pipeline steps are functions. Test them like functions:

def test_preprocessing():
    output = run_preprocessing(input_data="test_data.csv")
    assert output is not None
    assert len(output) > 0

With SageMaker Pipelines, testing the full pipeline requires cloud infrastructure. Unit testing individual steps is possible but requires you to architect around the framework's constraints.
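
One way to "architect around the framework", sketched under my own assumptions: keep the step's logic in a pure function and put everything that knows about the container's file layout in a thin wrapper. clean_rows and main are illustrative names, and the /opt/ml/processing paths in the closing comment are only the conventional locations; the real ones are whatever your ProcessingInput and ProcessingOutput are configured with.

```python
import csv
from pathlib import Path


def clean_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Pure step logic: drop rows that have any empty field. No AWS, no file paths."""
    return [row for row in rows if all((value or "").strip() for value in row.values())]


def main(input_dir: Path, output_dir: Path) -> None:
    """Thin I/O wrapper: the only part that knows where files live."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(input_dir.glob("*.csv")):
        with src.open(newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames or []
            rows = clean_rows(list(reader))
        with (output_dir / src.name).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)


# In the container's entry script you would call something like:
#   main(Path("/opt/ml/processing/input"), Path("/opt/ml/processing/output"))
```

With this split, clean_rows gets ordinary unit tests, main can be exercised against a temporary directory, and only the final wiring needs a real Processing job to verify.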

Flexibility

Need to pull data from a non-AWS source mid-pipeline? No problem. Need to call an external API between training and evaluation? Easy. Need to branch based on model performance using some custom logic? You're writing Python - do whatever you want.
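
The performance-based branching really can be this small. In the sketch below, gate_on_metrics and the callback names are hypothetical, and the thresholds are whatever your project decides; the point is only that the branch is ordinary Python.

```python
from typing import Callable


def gate_on_metrics(
    metrics: dict[str, float],
    thresholds: dict[str, float],
    on_pass: Callable[[], str],
    on_fail: Callable[[list[str]], str],
) -> str:
    """Run on_pass if every metric meets its minimum, otherwise on_fail with the reasons."""
    failures = [
        f"{name}={metrics.get(name)} is below the minimum of {minimum}"
        for name, minimum in thresholds.items()
        # A metric that is missing entirely counts as a failure.
        if metrics.get(name, float("-inf")) < minimum
    ]
    return on_pass() if not failures else on_fail(failures)
```

In practice on_pass would register or deploy the model and on_fail would page someone or open a ticket; here they just need to be callables, which also makes the gate trivial to test.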


Where DIY Falls Apart

Here's the thing nobody says until they're knee-deep in it: DIY orchestration is a second engineering project running in parallel with your ML project.

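You end up building the things the earlier sections took for granted: compute provisioning, step caching, centralised logging, model versioning and approval, and some way to visualise a run.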

SageMaker Pipelines gives you most of this for free. With DIY, you're either building it yourself or living without it - and "living without it" tends to mean "suffering quietly until something breaks in production."

Also, onboarding engineers becomes harder. "Here's our bespoke ML orchestration system" is a sentence that fills engineers with a specific kind of dread.


The Honest Recommendation

After working with both, here's where I've landed:

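Go with SageMaker Pipelines if you're already committed to AWS, you want the caching, logging, and Model Registry integrations without building them, and you don't have spare engineering time to run an orchestration platform of your own.

Go DIY (or hybrid) if you need portability across providers, fast local iteration matters more to you than managed infrastructure, you want fine-grained control over compute cost, or your pipeline leans heavily on non-AWS systems.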

The realistic answer:

In my experience, most production ML teams end up with a hybrid. SageMaker Pipelines for the core model training workflow, because the integrations are too convenient to ignore. Something else - often Airflow or Step Functions - for the broader data orchestration that feeds into it.

[External Data Sources]
        |
  [Airflow / Custom ETL]
        |
  [SageMaker Pipeline]
    ├── Preprocessing
    ├── Training
    ├── Evaluation
    └── Model Registry
        |
  [SageMaker Endpoint / Lambda]
        |
    [Production]

That's not a cop-out - it's just the reality of building ML systems at any reasonable scale. The art is in drawing the boundary between the two cleanly, so you're not maintaining two orchestration systems that do the same thing in slightly different ways.
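
One way to keep that boundary clean is to make the outer orchestrator treat the whole SageMaker pipeline as a single task: start it, wait, fail loudly if it fails. The helper below is a sketch of that idea. run_sagemaker_pipeline is my own name; it assumes a client that behaves like boto3.client("sagemaker") and uses that client's start_pipeline_execution and describe_pipeline_execution calls. The polling interval is arbitrary.

```python
import time
from typing import Any, Callable


def run_sagemaker_pipeline(
    client: Any,
    pipeline_name: str,
    parameters: dict[str, str],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a pipeline execution, wait for it to finish, and return its ARN.

    `client` is expected to behave like boto3.client("sagemaker").
    Raises RuntimeError if the execution ends as Failed or Stopped.
    """
    response = client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineParameters=[{"Name": k, "Value": v} for k, v in parameters.items()],
    )
    arn = response["PipelineExecutionArn"]
    while True:
        status = client.describe_pipeline_execution(PipelineExecutionArn=arn)[
            "PipelineExecutionStatus"
        ]
        if status == "Succeeded":
            return arn
        if status in ("Failed", "Stopped"):
            raise RuntimeError(f"Pipeline execution {arn} ended with status {status}")
        # Still Executing or Stopping: wait and poll again.
        sleep(poll_seconds)
```

From Airflow, this would sit inside a single PythonOperator callable, so the outer DAG owns scheduling and data dependencies while everything between preprocessing and registration stays inside SageMaker. Taking the client as an argument also means the function can be tested with a stub instead of real AWS calls.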


Final Thought

The SageMaker Pipelines vs. DIY debate is really a proxy for a bigger question: how much do you want to own vs. how much do you want to buy?

SageMaker Pipelines is buying. You get speed, integration, and managed infrastructure at the cost of flexibility, portability, and some local dev pain.

DIY is owning. You get control and portability in exchange for building and maintaining more of the stack yourself.

Neither is wrong. Both have a time and a place. The mistake is treating one as universally correct and finding out three months later why it isn't.

Build accordingly.