AI forecasting has a credibility problem in retail and ecommerce, and the technology itself is rarely to blame. Most companies are asking their models to predict a business their own systems cannot fully see. The model ingests demand history, carrier scans, and purchase frequency, but three operational signals that reshape what those numbers mean sit in separate systems, owned by separate teams, with no path back into the forecast.
Those three signals are returns volumes, the delivery promises made at checkout, and failed deliveries that suppress repurchase. Each one changes the picture the model is working from. When they are missing, the AI is forecasting a simplified version of the business, and the places where it is most wrong are the places where operations are most complex.
The evidence has been building for years, across multiple research disciplines.
A 2018 PLOS One paper tested machine-learning methods against traditional statistical forecasting across more than 1,000 time series. Traditional methods won on every accuracy measure, at every forecasting horizon, using less computing power. A more advanced model does not automatically produce a more useful forecast when the underlying signal is weak.
A 2025 paper in the European Journal of Operational Research pushed the finding further. Better forecast accuracy does not reliably improve inventory performance. The relationship depends on product mix, cost structure, and replenishment policy. Optimizing a forecast metric and optimizing the business are two different things.
A 2025 International Journal of Forecasting study pitted five large language models against 123 human forecasters in retail. The AI did not consistently win. Both humans and AI performed worst during promotional periods, exactly when delivery capacity, returns volumes, and customer expectations collide at once.
The pattern across all three studies points the same way: AI forecasting disappoints most where operations are most complex, and closing that gap requires architectural changes to the data the model can reach.
- 25 studies: zero integrate demand forecasting with returns forecasting (Management Review Quarterly, 2024 systematic review)
- 3 days, 2 ratings: identical delivery speed rated differently depending on the checkout promise (MSOM, millions of deliveries tracked on a major ecommerce platform)
- Late = longer gap: late deliveries measurably increase time between orders (Journal of Service Research, 2025 Western European quick-commerce study)
Returns belong inside the forecasting model
Most retailers treat returns as a reverse-logistics cost center. A 2024 systematic review in Management Review Quarterly examined 25 studies on ecommerce returns forecasting and found something stark: no paper in the field integrates demand forecasting and returns forecasting. The two are treated as separate problems managed by separate teams.
In fashion and general merchandise, that gap is expensive. Returns already represent a major operating cost across ecommerce, and in fashion the burden can be much higher. Each return creates inbound carrier volume, warehouse labor, inspection time, and resale decisions, none of which the outbound demand model anticipated.
The delivery management layer is where forward and reverse flows meet. If returns stay outside the forecasting model, the carrier network absorbs volume that no forecast accounted for. The model calls it noise. The warehouse calls it Tuesday.
Hunkemoller, one of Europe's largest lingerie retailers, faced exactly this problem. Before digitizing returns with nShift, the company had no advance visibility into how many returns would arrive on any given day, what was driving them, or how to prepare warehouse capacity. After connecting returns data across six European markets, warehouse teams now see expected volumes days in advance. "We've made returns part of a seamless omnichannel customer experience with increased returns control and insights," says Robin Visser, Omni Channel Business Development Manager at Hunkemoller. "What was a historical pain point for the company and our customers has been changed into something that adds real value."
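As a rough illustration of what it means to treat returns as forecastable inbound volume, the sketch below projects daily return arrivals from an outbound forecast. The function name, the flat `return_rate`, and the `lag_distribution` are all assumptions made for the example, not figures from the research or from any retailer. A real model would likely vary the rate by category, market, and promotion.

```python
from typing import Dict, List


def expected_return_arrivals(
    outbound_forecast: List[float],
    return_rate: float,
    lag_distribution: Dict[int, float],
) -> List[float]:
    """Project daily inbound return parcels from a daily outbound forecast.

    outbound_forecast: forecast parcels shipped per day (index = day).
    return_rate: assumed share of shipped parcels that come back (0..1).
    lag_distribution: assumed share of returns arriving N days after
        shipment, e.g. {2: 0.5, 3: 0.5}; the shares must sum to 1.
    """
    if not 0.0 <= return_rate <= 1.0:
        raise ValueError("return_rate must be between 0 and 1")
    if abs(sum(lag_distribution.values()) - 1.0) > 1e-9:
        raise ValueError("lag_distribution shares must sum to 1")

    # Returns keep arriving after the last shipping day, so extend the horizon.
    horizon = len(outbound_forecast) + max(lag_distribution)
    arrivals = [0.0] * horizon
    for ship_day, shipped in enumerate(outbound_forecast):
        for lag, share in lag_distribution.items():
            arrivals[ship_day + lag] += shipped * return_rate * share
    return arrivals


# Illustrative numbers only: three days of outbound volume, 25% returned,
# half arriving two days after shipment and half arriving three days after.
outbound = [1000.0, 1200.0, 800.0]
inbound = expected_return_arrivals(
    outbound, return_rate=0.25, lag_distribution={2: 0.5, 3: 0.5}
)
print(inbound)  # [0.0, 0.0, 125.0, 275.0, 250.0, 100.0]
```

Even a crude projection like this gives the warehouse and the carrier network an inbound number to plan against, so return volume no longer shows up as unexplained noise.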
Checkout promises are rewriting your historical data
Carrier performance data looks clean at first glance, sorted into on-time or late, rated or not. Customers, though, do not experience delivery against those metrics. They experience it against what was promised at checkout.
Research published in Manufacturing & Service Operations Management tracked logistics ratings across millions of deliveries on an ecommerce platform. Ratings were shaped as much by the promised delivery speed as by actual performance. A three-day delivery rated well when customers expected four. The same delivery rated poorly when they expected two.
That creates a compounding data-quality problem. Every time a retailer changes its delivery promise, adds same-day options, adjusts cutoffs, or enters new regions, historical carrier data stops measuring what it used to. An AI model trained on that history thinks it is learning carrier quality. It is actually learning the gap between what was promised and what was delivered, and that gap keeps moving.
Connecting checkout promise logic to the data that trains the model is the only way to stabilize the signal.
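One way to picture that connection is to store the checkout promise next to each delivery record and derive a promise-relative feature from it. The sketch below is a minimal illustration; the `Delivery` record and its field names are hypothetical, not a schema from any particular platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Delivery:
    order_date: date
    promised_date: date    # date shown to the customer at checkout (assumed field)
    delivered_date: date


def transit_days(d: Delivery) -> int:
    """Raw carrier speed: days from order to delivery."""
    return (d.delivered_date - d.order_date).days


def days_vs_promise(d: Delivery) -> int:
    """Promise-relative outcome: negative = early, positive = late."""
    return (d.delivered_date - d.promised_date).days


# Two deliveries with identical three-day transit but different promises.
promised_four = Delivery(date(2025, 3, 3), date(2025, 3, 7), date(2025, 3, 6))
promised_two = Delivery(date(2025, 3, 3), date(2025, 3, 5), date(2025, 3, 6))

for d in (promised_four, promised_two):
    print(transit_days(d), days_vs_promise(d))
# 3 -1  (one day early against a four-day promise)
# 3 1   (one day late against a two-day promise)
```

Training on `days_vs_promise` alongside `transit_days` lets a model tell a change in carrier performance apart from a change in what customers were told to expect.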
Failed deliveries suppress demand, and most models miss it
A failed delivery does more than generate a support ticket. It pushes the next order further out.
A 2025 Journal of Service Research study tracked purchase behavior on a Western European quick-commerce platform and found that late deliveries measurably increase the time between orders. Early deliveries compress it. The negative effect from a late delivery is stronger than the positive effect from an early one of the same magnitude.
Most AI demand models treat purchase history as a clean signal of customer intent. After a stretch of carrier disruption or missed delivery windows, the model quietly learns that demand is lower than it really is. The business trims capacity and inventory. Then conditions improve and the model is still reading the wrong baseline.
When tracking and exception data feeds back into the demand model, the AI can distinguish between a customer who stopped buying and a customer whose last delivery went wrong.
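A simple way to start is to label each gap between orders by how the preceding delivery went. The sketch below assumes a hypothetical `Order` record with a `delivered_late` flag sourced from tracking data; the sample orders are invented for illustration and are not results from the study. The comparison is descriptive only and does not establish causation.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Order:
    customer_id: str
    order_date: date
    delivered_late: bool  # assumed to come from tracking / exception data


def mean_gap_by_prior_delivery(orders: List[Order]) -> Dict[str, float]:
    """Average days to the next order, split by whether the prior delivery was late."""
    by_customer: Dict[str, List[Order]] = {}
    for o in sorted(orders, key=lambda o: (o.customer_id, o.order_date)):
        by_customer.setdefault(o.customer_id, []).append(o)

    gaps: Dict[str, List[int]] = {"after_late": [], "after_on_time": []}
    for history in by_customer.values():
        # Pair each order with the same customer's next order.
        for prev, nxt in zip(history, history[1:]):
            key = "after_late" if prev.delivered_late else "after_on_time"
            gaps[key].append((nxt.order_date - prev.order_date).days)

    # Skip groups with no observations rather than report a made-up average.
    return {k: mean(v) for k, v in gaps.items() if v}


# Invented sample data for two customers.
orders = [
    Order("A", date(2025, 1, 1), delivered_late=False),
    Order("A", date(2025, 1, 8), delivered_late=True),
    Order("A", date(2025, 1, 22), delivered_late=False),
    Order("B", date(2025, 1, 3), delivered_late=False),
    Order("B", date(2025, 1, 9), delivered_late=False),
]
print(mean_gap_by_prior_delivery(orders))
# {'after_late': 14, 'after_on_time': 6.5}
```

With gaps labeled this way, a demand model can take the prior delivery outcome as an input instead of reading every longer gap as lost intent.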
The fix is a connected architecture
In practice, four capabilities keep showing up in the organizations where forecasting actually drives operational decisions.
Connected data. Demand, inventory, promotions, delivery promises, carrier events, failed deliveries, returns, and refunds need to be linked in a way that preserves cause and context. A sales dip from a stockout and a sales dip from weak demand look identical in a time series. They require completely different responses.
Probabilistic outputs. Operations teams need ranges, thresholds, and action triggers, not a single number. The difference between "we expect 80,000 orders next week" and "there is a meaningful probability that parcel volume exceeds carrier capacity in these regions if promotion conversion comes in above plan" is the difference between a number on a slide and a decision the ops team can act on.
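To make the contrast concrete, the sketch below turns a point forecast into a capacity-exceedance probability with a small Monte Carlo simulation. Everything except the 80,000 baseline is an assumption chosen for illustration: the uniform promotion uplift, the Gaussian noise, the 90,000 capacity, and the 20% trigger threshold.

```python
import random
from typing import List, Sequence


def simulate_weekly_parcels(
    baseline: float,
    uplift_low: float,
    uplift_high: float,
    noise_sd: float,
    n_draws: int = 10_000,
    seed: int = 0,
) -> List[float]:
    """Draw plausible weekly parcel volumes under an assumed promo uplift range."""
    rng = random.Random(seed)  # seeded so the scenario set is reproducible
    draws = []
    for _ in range(n_draws):
        uplift = rng.uniform(uplift_low, uplift_high)
        volume = baseline * (1.0 + uplift) + rng.gauss(0.0, noise_sd)
        draws.append(max(0.0, volume))  # volume cannot be negative
    return draws


def exceedance_probability(samples: Sequence[float], capacity: float) -> float:
    """Share of simulated scenarios in which volume exceeds capacity."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s > capacity) / len(samples)


samples = simulate_weekly_parcels(
    baseline=80_000, uplift_low=0.0, uplift_high=0.2, noise_sd=3_000
)
p_over = exceedance_probability(samples, capacity=90_000)  # hypothetical capacity
print(f"P(volume > capacity) = {p_over:.2f}")

# Hypothetical action trigger: act when the risk crosses an agreed threshold.
if p_over > 0.20:
    print("Trigger: reserve extra carrier capacity for this region")
```

The useful output here is the probability and the trigger attached to it. The threshold itself is a business decision that the operations team would set.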
Post-deployment monitoring. A 2026 NIST report on deployed AI systems makes the issue explicit. Pre-deployment testing happens in controlled conditions. Deployed models face a world that keeps changing: customer behavior shifts, carrier networks degrade, promotional strategy evolves. A model that passed validation six months ago may be quietly wrong today.
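A minimal monitoring sketch, under stated assumptions: it compares a rolling weighted absolute percentage error (WAPE) against the error level observed at validation and flags windows that drift beyond a tolerance. The metric, window size, and tolerance are illustrative choices, not recommendations from the NIST report.

```python
from typing import List, Sequence


def wape(actuals: Sequence[float], forecasts: Sequence[float]) -> float:
    """Weighted absolute percentage error: sum|actual - forecast| / sum|actual|."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    total = sum(abs(a) for a in actuals)
    if total == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total


def drift_alerts(
    actuals: Sequence[float],
    forecasts: Sequence[float],
    window: int,
    baseline_wape: float,
    tolerance: float = 1.5,
) -> List[int]:
    """Return the end index of each rolling window whose WAPE exceeds
    tolerance * baseline_wape (the error level seen at validation)."""
    alerts = []
    for end in range(window, len(actuals) + 1):
        rolling = wape(actuals[end - window:end], forecasts[end - window:end])
        if rolling > tolerance * baseline_wape:
            alerts.append(end - 1)
    return alerts


# Invented example: the forecast tracks well at first, then degrades.
actuals = [100, 100, 100, 100, 100, 100]
forecasts = [100, 95, 105, 80, 75, 70]
print(drift_alerts(actuals, forecasts, window=3, baseline_wape=0.05))
# [3, 4, 5]
```

An alert like this does not say why the model drifted. It tells the owner when to look, which is where the governance capability comes in.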
Governance. Someone owns the model, the inputs, the override logic, and the call to retrain or roll back. In Europe, AI governance is increasingly a compliance question as well as an operational one. The EU AI Act entered into force in August 2024 and applies progressively, with stricter obligations for certain high-risk systems.
Four questions to test any AI forecast
Before trusting an AI forecast, ask:
- Can the model see the delivery promise the customer was given at checkout?
- Can it separate weak demand from stockouts, late deliveries, failed deliveries, and poor service availability?
- Can it account for returns as future parcel volume, warehouse labor, inventory movement, and customer friction?
- Can the forecast trigger a real operational decision: changing delivery promises, adjusting carrier rules, protecting capacity, or communicating earlier with customers?
If the answer to any of these is no, the AI is predicting outcomes from disconnected evidence. The companies that get forecasting right will not necessarily have the most sophisticated models. They will have the most connected delivery architecture.
AI forecasting will keep disappointing until the delivery layer is part of the forecast, not downstream from it.
This is the conversation at DELIVER Europe 2026
nShift is at DELIVER Europe in Amsterdam on June 3-4 (Stand B39). The session on the Solar Stage, Thursday June 4 at 10:30 CEST, picks up exactly where this argument lands: when AI agents start mediating discovery, comparison, and checkout on the shopper's behalf, the delivery layer becomes one of the last places where the brand earns trust in public.
If you are working on forecasting, carrier orchestration, or connected delivery data, book a 30-minute meeting with the team.
About the author
Thomas Bailey
Thomas plays a key role in shaping how new features and platform improvements deliver real value to customers. With a background spanning product, tech, and go-to-market strategy, he brings a pragmatic view of what innovation looks like in practice and how to make delivery experiences work harder for your business.