Local AI Models in Engineering Workflows

April 26, 2025

Over the first few months of this year, we’ve continued our quiet but steady work on building and refining engineering agents — the kind that orchestrate complex design and analysis workflows, combine different computational tools, and produce results you can trust in real projects. Much of that work has revolved around local AI models.

We run models locally primarily for privacy reasons. Customer data often can’t be sent to a cloud LLM, and even when it technically could, keeping it in-house is the safest path. Thanks to recent progress in open models, this hasn’t meant compromising on capability.

New Models in Our Stack

Two releases stood out this year:

  • DeepSeek R1 (January): a reasoning‑focused model that runs efficiently enough for engineering agents. It handles decision‑heavy workflows well, keeping chains of steps coherent and productive.
  • Gemma 3 (March): a capable, general‑purpose model from Google. A reliable all‑rounder for routine and procedural tasks.

We use LangGraph as the main orchestration framework for our agents. In our setup, LLMs act as decision‑making engines, deciding which tools to run, when to iterate, and whether to reprocess results. While parametric data and certain numeric evaluations are handled by traditional ML models, the LLM remains the conductor of the whole process.
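
To make this concrete, here is a minimal sketch of such a loop built with LangGraph's StateGraph, with a local model served through Ollama acting as the decision step. The node names, the stand‑in analysis function, and the model name are illustrative assumptions, not our production setup.

```python
# Minimal sketch: a local LLM decides whether to accept an analysis result
# or request another iteration. Nodes, tool, and model name are illustrative.
from typing import TypedDict

from langchain_ollama import ChatOllama          # local model served by Ollama
from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    parameters: dict   # current design parameters
    result: float      # latest analysis result
    verdict: str       # LLM decision: "accept" or "iterate"


llm = ChatOllama(model="gemma3", temperature=0)  # assumed local model name


def run_analysis(state: AgentState) -> dict:
    """Stand-in for a traditional computational tool (FEM solver, ML model, ...)."""
    return {"result": state["parameters"]["depth"] * 0.1}  # hypothetical relation


def decide(state: AgentState) -> dict:
    """The LLM is the conductor: accept the result or ask for another pass."""
    prompt = (
        f"Analysis result: {state['result']:.2f}. "
        "Reply with exactly 'accept' if it exceeds 1.0, otherwise reply 'iterate'."
    )
    reply = llm.invoke(prompt).content.strip().lower()
    return {"verdict": "accept" if "accept" in reply else "iterate"}


def adjust(state: AgentState) -> dict:
    """Update parameters before the next analysis pass."""
    params = dict(state["parameters"])
    params["depth"] += 2.0
    return {"parameters": params}


graph = StateGraph(AgentState)
graph.add_node("analyse", run_analysis)
graph.add_node("decide", decide)
graph.add_node("adjust", adjust)
graph.add_edge(START, "analyse")
graph.add_edge("analyse", "decide")
graph.add_conditional_edges("decide", lambda s: s["verdict"],
                            {"accept": END, "iterate": "adjust"})
graph.add_edge("adjust", "analyse")
app = graph.compile()

final = app.invoke({"parameters": {"depth": 4.0}, "result": 0.0, "verdict": ""})
print(final["result"], final["verdict"])
```

The point is the shape of the graph: the traditional tools do the computation, and the LLM only decides whether to iterate or stop.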

When a Better Model Makes Old Tricks Obsolete

One interesting finding this spring came when we plugged newer models into older pipelines. In a few cases, the results were so much better that entire retrieval‑augmented generation (RAG) steps became unnecessary for those tasks.

For example, in an older soil investigation interpretation workflow, the new general‑purpose multimodal model produced better summaries and conclusions without accessing the reference documents at all; the old model needed a detailed RAG setup to get close to that level. Of course, that is not always the case: for complex or highly specialized tasks, RAG still makes sense. But it is a reminder that model improvements can simplify systems.

Running Locally Is Easier Than You Think

You don’t need exotic hardware to run these models. A standard NVIDIA RTX 4080 — found in many high‑end gaming PCs — can handle modern LLMs at 40–70 tokens per second, which is plenty for real‑time engineering pipelines.
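
If you want a rough number for your own hardware, Ollama's generate endpoint reports how many tokens were produced and how long generation took, so throughput falls out directly. The port and response fields below are Ollama's defaults; the model name is an assumption about what you have pulled locally.

```python
# Rough throughput check against a local Ollama server (default port 11434).
# The model name is an assumption; substitute whatever model you have pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",
        "prompt": "Summarise the main checks in a retaining wall design.",
        "stream": False,
    },
    timeout=120,
)
data = resp.json()

# Ollama reports the generated token count and generation time in nanoseconds.
tokens = data["eval_count"]
seconds = data["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f} s -> {tokens / seconds:.1f} tokens/s")
```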

Our workflows are often built to provide live feedback on iterative parameters, sometimes with visual updates during the process. Latency isn’t just about LLM speed — it’s about the whole system staying responsive — and modern GPUs make this practical.
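
In practice that means streaming matters as much as raw throughput: tokens should reach the interface as they are produced rather than after the full completion. Here is a small sketch of reading Ollama's streamed output, which arrives as newline‑delimited JSON chunks (again with an assumed model name):

```python
# Streaming sketch: handle tokens as they arrive so the UI can update live
# instead of waiting for the full completion. Model name is an assumption.
import json

import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",
        "prompt": "List three checks for pile settlement.",
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # partial token text
        if chunk.get("done"):
            print()
            break
```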

Why We’ll Keep Doing It This Way

Local models let us:

  • Keep customer data totally private
  • Avoid vendor dependency on external APIs
  • Experiment with very complex iterative workflows without incurring hosted‑model costs
  • Run models securely even when the internet is down

When to Choose Local vs Hosted

  • Data sensitivity: choose local for regulated or confidential datasets, on‑premises requirements, or customer contracts that restrict external AI services
  • Latency and control: choose local for interactive agent loops, edge or offline operation, or when you need tight control over execution
  • Model specialization: choose hosted when it clearly outperforms available open models for a specific task, or when a proprietary toolchain is required
  • Cost and scale: owned GPUs are cost‑effective at steady, predictable usage; bursty workloads may favor hosted
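
One way to keep these trade‑offs explicit is a small routing step in front of the agent. The sketch below is purely illustrative: the task attributes, labels, and threshold are assumptions made for the sake of the example, not a policy we ship.

```python
# Illustrative router: pick a local or hosted backend per task, following the
# criteria above. Attributes and thresholds are assumptions, not a real policy.
from dataclasses import dataclass


@dataclass
class Task:
    contains_customer_data: bool   # data sensitivity
    needs_offline: bool            # latency and control / edge operation
    needs_frontier_model: bool     # model specialization
    expected_daily_calls: int      # cost and scale


def choose_backend(task: Task) -> str:
    # Hard constraints first: confidential data or offline operation => local.
    if task.contains_customer_data or task.needs_offline:
        return "local"
    # A capability gap that open models cannot close => hosted.
    if task.needs_frontier_model:
        return "hosted"
    # Steady, high-volume usage amortises owned GPUs; bursty usage may not.
    return "local" if task.expected_daily_calls > 1000 else "hosted"


print(choose_backend(Task(True, False, False, 50)))   # -> local
print(choose_backend(Task(False, False, True, 50)))   # -> hosted
```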

Local models are not always the right tool — there are still cases where a specialized hosted model is worth the privacy trade‑off — but in most of our work, the local‑first approach has paid off.

In engineering, the best tools are the ones that disappear into the workflow — the ones that make it easier to focus on the problem, not the process. This year’s model updates have made that a little easier, and we’re looking forward to seeing what the next releases bring.

Looking for expert solutions? Discover Crestia’s professional services today.