Network-effect fraud detection: every merchant makes every other merchant safer

Most fraud-prevention apps on Shopify operate as per-merchant tools. Each merchant's account, each merchant's data, each merchant's risk model. The tool knows what its current merchant has seen. It does not know what any other merchant has seen.

That model is the safe default for a B2B SaaS app. It is also the wrong default for fraud detection.

A customer who commits fraud on store A is statistically very likely to attempt fraud on store B. A customer who has been chargeback-flagged across three different stores is more useful information than a customer who has been chargeback-flagged on one. A return-fraud operator running their pattern across ten DTC apparel stores is invisible to any single store and obvious to a system that joins the data across all ten.

The structural decision behind RefundSentry is that the database is built so the data flows between merchants. The decision is small in the schema and load-bearing in the product. This post walks through what we did, why we did it, and how the privacy story works.

The merchant who taught us

A streetwear merchant with under $1M in annual sales had been on RefundSentry for about three months when they got their first cross-shop signal. The signal said "this customer's email hash has been flagged for chargeback at another shop in the network within the last 90 days."

The customer was placing an order for $480, a normal size for this merchant's order distribution. With the cross-shop signal contributing, the pre-ship score came in at 71, in the HIGH zone. The merchant held the order, asked for a phone verification, and the customer walked away. Two days later the same email hash showed up at a different RefundSentry merchant, attempting an order for $620.

That sequence is the structural payoff of cross-shop network detection. Three months ago, on a closed per-merchant tool, the customer's first-time order would have scored at 25 and shipped without question. The fraud-flagged history would have lived inside one merchant's fraud-prevention tool and never escaped.

The structural choice in the database

The thing that makes cross-shop signals possible is a one-line decision in the database schema: the indexes on customer-identifier hashes deliberately do not include a shopId prefix.

In Prisma syntax it looks like this:

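model CustomerIdentity {
  id        String   @id @default(cuid())
  shopId    String
  emailHash String?
  phoneHash String?
  // ... other fields ...

  // Bare indexes: no shopId prefix, so cross-shop lookups by hash stay cheap.
  @@index([emailHash])
  @@index([phoneHash])
  // Scoped index: serves per-shop lookups.
  @@index([shopId, emailHash])
}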

The default Shopify-app pattern would be to scope the index as @@index([shopId, emailHash]) only. That index is fast for "find a customer with this email at this shop" queries. It does little for "find every shop where this email hash appears" queries, because shopId is the leading column and a lookup by emailHash alone cannot seek into the index.

We added the bare @@index([emailHash]) index alongside the scoped one. The bare index exists specifically so we can ask cross-shop questions cheaply. The scoped index still exists for per-shop queries. Both indexes are maintained on every write.

That one extra index is the structural moat. Every additional merchant who installs RefundSentry contributes their customer-hash records to the cross-shop index. Every other merchant on the network gets a slightly more populated cross-shop signal as a result. The flywheel is in the indexing strategy.
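As a rough illustration of why the bare index matters, the sketch below models the two index shapes as in-memory maps. It is a toy: a real database uses B-trees, and the names IdentityRow, IndexSketch, and push are invented for this example rather than taken from the RefundSentry codebase.

```typescript
// Toy model of the two index shapes. All names here are hypothetical.
interface IdentityRow {
  id: string;
  shopId: string;
  emailHash: string;
}

class IndexSketch {
  // Mirrors @@index([shopId, emailHash]): the key starts with the shop.
  private scoped = new Map<string, IdentityRow[]>();
  // Mirrors @@index([emailHash]): the key is the hash alone.
  private bare = new Map<string, IdentityRow[]>();

  insert(row: IdentityRow): void {
    // Both indexes are maintained on every write.
    push(this.scoped, `${row.shopId}|${row.emailHash}`, row);
    push(this.bare, row.emailHash, row);
  }

  // "Find a customer with this email at this shop."
  findAtShop(shopId: string, emailHash: string): IdentityRow[] {
    return this.scoped.get(`${shopId}|${emailHash}`) ?? [];
  }

  // "Find every shop where this email hash appears."
  // Without the bare map, this would need a scan over every scoped key.
  shopsWithHash(emailHash: string): string[] {
    const rows = this.bare.get(emailHash) ?? [];
    return Array.from(new Set(rows.map((r) => r.shopId))).sort();
  }
}

// Append a value to the list stored under a key, creating the list if needed.
function push<K, V>(map: Map<K, V[]>, key: K, value: V): void {
  const list = map.get(key);
  if (list) list.push(value);
  else map.set(key, [value]);
}
```

The scoped map can only answer a question that already names a shop. The cross-shop question needs a structure keyed by the hash itself, which is the role the bare index plays.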

The privacy story

The reason we can do cross-shop joins safely is that the indexed columns are sha256 hashes, not raw PII. The emailHash is the sha256 of the normalized lowercase email. The phoneHash is the sha256 of the normalized E.164 phone string. The hashes are non-reversible. A merchant cannot look at another merchant's customer hashes and learn the actual emails or phone numbers of those customers.
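A minimal sketch of the hashing step described above, using Node's built-in crypto module. The normalization rules are assumptions for illustration: the post says only "normalized lowercase email" and "normalized E.164 phone string", so the exact trimming and digit-stripping here, and the function names, are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: trim surrounding whitespace, lowercase everything.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Assumed normalization: the input already includes a country code, so
// dropping every non-digit and prefixing "+" yields an E.164-style string.
// A production system would validate the number with a phone library.
function normalizePhone(phone: string): string {
  return `+${phone.replace(/\D/g, "")}`;
}

// Hex-encoded SHA-256 digest of a UTF-8 string.
function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

function hashEmail(email: string): string {
  return sha256Hex(normalizeEmail(email));
}

function hashPhone(phone: string): string {
  return sha256Hex(normalizePhone(phone));
}
```

Normalizing before hashing is what makes the join work: two shops that record " Alice@Example.com" and "alice@example.com" end up with the same 64-character hex string.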

What a merchant can learn through the cross-shop signal is statistical: "this hashed identifier has been flagged X times across N merchants in the network in the last Y days." The signal does not name the merchants. The signal does not show the underlying email or phone. The signal does not let a merchant query another merchant's customer list.
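One way to picture that statistical shape is a summary function that takes raw flag events and returns counts only. The FlagEvent and CrossShopSummary types below are hypothetical, as is the summarize function; the point of the sketch is that shop identifiers are used for counting and never appear in the returned value.

```typescript
// Hypothetical record of one flag event held by the network.
interface FlagEvent {
  identifierHash: string;
  shopId: string; // used for counting distinct merchants, never returned
  flaggedAt: Date;
}

// What a merchant sees: X flags across N merchants in the last Y days.
interface CrossShopSummary {
  flagCount: number; // X
  merchantCount: number; // N
  windowDays: number; // Y
}

function summarize(
  events: FlagEvent[],
  identifierHash: string,
  windowDays: number,
  now: Date,
): CrossShopSummary {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const recent = events.filter(
    (e) => e.identifierHash === identifierHash && e.flaggedAt.getTime() >= cutoff,
  );
  const merchants = new Set(recent.map((e) => e.shopId));
  return { flagCount: recent.length, merchantCount: merchants.size, windowDays };
}
```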

When a customer's data is redacted via the GDPR customers/redact flow, the cascade removes that customer's hashes from every merchant's CustomerIdentity rows and from the cross-shop aggregation rows. The redaction is global. The cross-shop signal stops contributing the redacted customer's history immediately upon the redaction commit.
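The cascade can be sketched over an in-memory store. This is an assumption-heavy illustration: the NetworkStore shape, the AggregateRow type, and the choice to null out hash columns rather than delete whole rows are all hypothetical, and a real implementation would run both steps inside one database transaction.

```typescript
interface IdentityRow {
  shopId: string;
  emailHash: string | null;
  phoneHash: string | null;
}

// Hypothetical cross-shop aggregation row keyed by a hashed identifier.
interface AggregateRow {
  identifierHash: string;
  flagCount: number;
}

interface NetworkStore {
  identities: IdentityRow[];
  aggregates: AggregateRow[];
}

// Returns a new store with the given hashes removed everywhere:
// from every shop's identity rows and from the aggregation rows.
function redactHashes(store: NetworkStore, hashes: string[]): NetworkStore {
  const doomed = new Set(hashes);
  const scrub = (h: string | null) => (h !== null && doomed.has(h) ? null : h);
  return {
    identities: store.identities.map((row) => ({
      shopId: row.shopId,
      emailHash: scrub(row.emailHash),
      phoneHash: scrub(row.phoneHash),
    })),
    aggregates: store.aggregates.filter((a) => !doomed.has(a.identifierHash)),
  };
}
```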

The reason we use sha256 hashes specifically (rather than HMAC with a per-merchant key, or some other privacy-preserving scheme) is that sha256 hashes are still useful for cross-shop joins because they are deterministic. An email hashed at shop A produces the same hash at shop B. HMAC with a per-merchant key would produce different hashes per shop and break the cross-shop signal entirely. The privacy property we need is "non-reversibility of a single hash" rather than "no shop can recognize another shop's data," because the cross-shop signal explicitly requires shops to recognize each other's hashed identifiers.
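The determinism argument is easy to check with Node's crypto module. In this sketch the two shop keys are made-up strings, and plain and keyed are helper names invented for the example.

```typescript
import { createHash, createHmac } from "node:crypto";

const normalizedEmail = "customer@example.com";

// Plain SHA-256: no key, so every shop computes the same value.
const plain = (value: string): string =>
  createHash("sha256").update(value, "utf8").digest("hex");

// HMAC-SHA-256 with a per-merchant key (hypothetical keys below).
const keyed = (shopKey: string, value: string): string =>
  createHmac("sha256", shopKey).update(value, "utf8").digest("hex");

const plainAtShopA = plain(normalizedEmail);
const plainAtShopB = plain(normalizedEmail);
const keyedAtShopA = keyed("shop-a-secret", normalizedEmail);
const keyedAtShopB = keyed("shop-b-secret", normalizedEmail);

console.log(plainAtShopA === plainAtShopB); // true: the cross-shop join works
console.log(keyedAtShopA === keyedAtShopB); // false: the join has nothing to match on
```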

What signals we run on it

The cross-shop signals fall into three categories:

Cross-shop chargeback history: this email hash has been flagged for chargeback at any RefundSentry merchant within the last 90 days. The signal contributes hard evidence to the score because chargebacks are labeled outcomes, not just suspicious patterns.

Cross-shop address velocity: this address fingerprint has had an unusually high number of orders across the network in a short window. Velocity at the network level is a different signal from velocity at the per-shop level; an operator distributing fraud across multiple shops looks low-velocity to each shop and high-velocity to the network.

Cross-shop return-rate anomaly: this email hash has a return rate across the network that is statistically anomalous compared to the global distribution. This signal catches the operator who returns 80% of their orders across ten shops where each individual shop sees only one or two returns.
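To make the gap between the per-shop view and the network view concrete, here is a small sketch of the return-rate case. The OrderOutcome and ReturnView types and the returnView function are hypothetical, and the sketch stops at computing the rate; the comparison against the global distribution that makes a rate "anomalous" is not shown.

```typescript
interface OrderOutcome {
  shopId: string;
  emailHash: string;
  returned: boolean;
}

interface ReturnView {
  networkReturnRate: number; // returns divided by orders, across all shops
  maxReturnsAtAnyShop: number; // the most returns any single shop sees
}

function returnView(outcomes: OrderOutcome[], emailHash: string): ReturnView {
  const mine = outcomes.filter((o) => o.emailHash === emailHash);
  const returnsByShop = new Map<string, number>();
  let returns = 0;
  for (const o of mine) {
    if (!o.returned) continue;
    returns += 1;
    returnsByShop.set(o.shopId, (returnsByShop.get(o.shopId) ?? 0) + 1);
  }
  let maxReturnsAtAnyShop = 0;
  returnsByShop.forEach((n) => {
    if (n > maxReturnsAtAnyShop) maxReturnsAtAnyShop = n;
  });
  return {
    networkReturnRate: mine.length === 0 ? 0 : returns / mine.length,
    maxReturnsAtAnyShop,
  };
}

// Ten shops, two orders each; both orders come back at eight of the shops.
const outcomes: OrderOutcome[] = [];
for (let i = 0; i < 10; i++) {
  for (let j = 0; j < 2; j++) {
    outcomes.push({ shopId: `shop-${i}`, emailHash: "h1", returned: i < 8 });
  }
}
const view = returnView(outcomes, "h1");
// Network view: 16 of 20 orders returned. Per-shop view: at most 2 returns.
```

No individual shop in this example sees more than two returns, yet the network-level rate is 16 out of 20.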

There is no cross-shop signal for "this customer is good." We do not have positive labels of similar quality at the network level. Cross-shop signals contribute exclusively as risk-elevating evidence. The merchant's local data dominates the score for normal customers; cross-shop only kicks in when the network sees something the local shop has not yet seen.
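The risk-only rule can be expressed as a clamp. Everything numeric in this sketch is an assumption: the post does not describe the scoring formula, so the additive points, the 0 to 100 scale, and the 46-point weight (chosen only to reproduce the 25 and 71 from the anecdote earlier) are illustrative.

```typescript
// Hypothetical shape for one cross-shop contribution to the score.
interface CrossShopSignal {
  label: string;
  points: number;
}

// Cross-shop evidence may raise the score but never lower it.
function combineScore(localScore: number, signals: CrossShopSignal[]): number {
  const uplift = signals.reduce((sum, s) => sum + Math.max(0, s.points), 0);
  // Assumed scale: scores live in the range 0 to 100.
  return Math.min(100, Math.max(0, localScore + uplift));
}

// No cross-shop evidence: the local score passes through unchanged.
const quietCustomer = combineScore(25, []);

// Made-up weight for a recent cross-shop chargeback flag.
const flaggedCustomer = combineScore(25, [
  { label: "cross-shop chargeback (90d)", points: 46 },
]);
```

Because negative contributions are clamped to zero, a clean network history leaves the merchant's local score exactly where it was.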

How the Signifyd network compares

Signifyd, the largest fraud-prevention SaaS in the Shopify space, has its own network. The Signifyd network is the data flow between Signifyd's enterprise customers, who tend to be very large merchants. The minimum order volume to qualify for Signifyd's enterprise contract is high enough that most Shopify merchants under $10M in annual revenue cannot use it.

The Signifyd network is closed in two ways: (a) it is closed to small merchants because they do not qualify, and (b) it is closed to non-Signifyd-customer data because Signifyd does not buy or trade fraud signals with other vendors. Both choices are reasonable for an enterprise product.

The choice we made for RefundSentry is the opposite: open to small merchants from the day they install (no minimum volume), no shared-network gating. The smallest merchant on RefundSentry contributes the same cross-shop signal weight as the largest. The result is a network that fills in faster on the small-merchant tail of Shopify, which is exactly the segment Signifyd does not serve.

We are not competing with Signifyd directly. The two products serve different segments. But the shape of the network effect is the structural difference between them and us, and small Shopify merchants deciding between fraud-prevention tools should know it.

Engineer detail. The cross-shop indexes live in prisma/schema.prisma. The relevant lines are:

  @@index([emailHash])
  @@index([phoneHash])

Both indexes are global B-tree indexes on the hashed column. The cardinality is high (most customers have unique emails) so the indexes are efficient. The space cost is modest because the underlying data is already indexed for per-shop queries; the cross-shop index is incremental space, not a doubling.

We do not yet have an addressHash index. The reason is that address hashing requires a normalized representation of the address that survives spelling variation, abbreviations, and component reordering ("123 Main St" vs "123 Main Street" vs "Main St 123"). The libpostal-backed parser shipping in spec 205 is the first piece of that work; spec 206 is layering address fingerprinting on top of it. Until the parser is in place, address-based cross-shop signals run on address1 substring matching, which is brittle and produces lower-quality signals than email-hash and phone-hash matching. The cross-shop address signal will move to the libpostal-derived fingerprint as soon as the parser stabilizes.
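The three spellings above show why substring matching is brittle. The sketch below contrasts it with a deliberately naive fingerprint: lowercase, expand a handful of abbreviations, sort the tokens. This is a toy, not libpostal and not the spec 206 design; the abbreviation table and function names are invented for illustration.

```typescript
// Tiny, hypothetical abbreviation table. A real parser knows far more.
const ABBREVIATIONS = new Map<string, string>([
  ["st", "street"],
  ["ave", "avenue"],
  ["rd", "road"],
  ["blvd", "boulevard"],
  ["dr", "drive"],
]);

// Toy fingerprint: lowercase, strip punctuation, expand abbreviations,
// then sort tokens so component order does not matter.
function toyAddressFingerprint(address1: string): string {
  return address1
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => ABBREVIATIONS.get(token) ?? token)
    .sort()
    .join(" ");
}

// The brittle baseline: one lowercased string must contain the other.
function substringMatch(a: string, b: string): boolean {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x.includes(y) || y.includes(x);
}

const variants = ["123 Main St", "123 Main Street", "Main St 123"];
const fingerprints = variants.map(toyAddressFingerprint);
// All three fingerprints are "123 main street".
// substringMatch("123 Main St", "Main St 123") is false: reordering defeats it.
```

Sorting tokens is itself lossy (it cannot tell "12 Main St Apt 3" from "3 Main St Apt 12"), which is one reason a real parser that labels address components is the better foundation.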

What this unblocks

Cross-shop network signals are the substrate for the per-merchant ML model that ships in this series' next post. The model takes cross-shop history as input features. It does not predict anything about cross-shop behavior on its own; it consumes the cross-shop signal as one of many features for refund-propensity prediction. The cross-shop signal makes the model's prediction quality scale with the size of the merchant network. As more merchants install, every per-merchant model in the network gets a richer cross-shop feature.

Spec 198 (open at the time of this post) is exploring what additional cross-shop features should be added: cross-shop velocity at finer time granularities, cross-shop return-reason coherence, cross-shop product-affinity. Each one is a question of "is this signal worth the index cost." The bare indexes on emailHash and phoneHash are the foundation; everything else is an aggregation question.

Take-away

The cross-shop network effect is not a marketing claim. It is a database structural decision (the indexes do not have a shopId prefix), a privacy decision (sha256 hashing makes the cross-shop join safe), and an open-network business decision (no minimum volume to qualify). Every merchant who installs makes the network slightly better for every existing merchant. That kind of compounding does not happen in a per-merchant tool no matter how good the model is.

If you are evaluating fraud-prevention tools, the question to ask is: does the data flow between merchants on the network, and how? The answer for most tools on Shopify is "no." For RefundSentry, the answer is "yes, through hashed identifiers, on every install."

RefundSentry is an intelligence layer for Shopify return fraud. See pricing for plans during the private beta.

RefundSentry Team
