Does Data Location Still Matter?
Jyothsna Santosh
AI & Data Science Leader | Human-Centered Innovation | Banking, Retail & Healthcare | Shaping Scalable, Trusted Intelligence Systems
July 31, 2025
Enterprises Can Derive Value Without Moving Every Byte
Enterprise data is everywhere, and often everywhere at once. It lives across:
Legacy systems (IBM mainframes, Oracle DBs, Teradata)
Multiple clouds (AWS, Azure, GCP)
SaaS platforms (Salesforce, Workday, ServiceNow)
For decades, the default approach was to centralize everything into a data warehouse, a data lake, or some hybrid. But these migrations can take years, millions of dollars, and armies of ETL jobs. Worse, by the time the project is complete, the business and its priorities may have already shifted.
So, should enterprises stop obsessing over where the data lives and start focusing on how to deliver value faster?
From Migration-First to Value-First
Rather than hauling every byte into a new destination, leading organizations are embracing abstraction and stitching:
Stitching data across disparate systems
Querying virtually without heavy replication
Exposing insights through APIs and microservices
Activating data through event streams or federated models
This is a strategic shift, not a tactical workaround. The aim is speed, adaptability, and value creation rather than mere consolidation.
Handling Legacy Data: The Toughest Part
Legacy systems like IBM mainframes, Oracle, and on-prem transactional databases often hold the most valuable but least accessible data. Instead of full-scale migrations, enterprises are leveraging:
Connectors and Wrappers: Lightweight connectors (ODBC/JDBC, API gateways, COBOL adapters) make legacy tables queryable.
Change Data Capture (CDC): Tools like Debezium, GoldenGate, or Fivetran replicate changes in near-real time without heavy ETL.
Metadata & Canonical Models: Legacy fields are mapped into canonical data models (via tools like Talend, Collibra, or custom mappings) so they can be stitched with modern systems.
Caching Layers: Frequently accessed legacy data is cached (e.g., Redis, Memcached) to reduce mainframe load.
These methods keep the source systems intact, while still making their data available for analytics and applications.
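The caching approach above can be sketched as a cache-aside read path. This is a minimal illustration, not a production pattern: a plain dict stands in for Redis or Memcached, and `fetch_from_mainframe` is a hypothetical stand-in for an expensive call through an ODBC/JDBC or COBOL adapter.

```python
import time

calls = {"mainframe": 0}  # counts backend hits, to show the cache absorbing load

# A plain dict stands in for Redis/Memcached; values stored as (payload, expiry_ts).
_cache: dict[str, tuple[dict, float]] = {}
TTL_SECONDS = 300  # hypothetical freshness window for legacy records

def fetch_from_mainframe(account_id: str) -> dict:
    """Stand-in for an expensive call into the legacy system of record."""
    calls["mainframe"] += 1
    return {"account_id": account_id, "balance": 1200.50}

def get_account(account_id: str) -> dict:
    """Cache-aside read: serve from cache while fresh, else hit the legacy source."""
    entry = _cache.get(account_id)
    if entry is not None and entry[1] > time.time():
        return entry[0]  # cache hit: the mainframe is never touched
    record = fetch_from_mainframe(account_id)
    _cache[account_id] = (record, time.time() + TTL_SECONDS)
    return record
```

Repeated reads within the TTL window are served from the cache, so the mainframe sees one call instead of many.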
Common Integration Points
Stitching isn’t merely about making connections; it’s about creating shared seams where data can meet consistently. Common integration patterns include:
Canonical Data Model (CDM): A standardized schema for key entities like Customer, Product, Transaction, and Account—allowing Salesforce, Oracle, and AWS Redshift to speak the same “language.”
API Gateways: Unified endpoints (e.g., Kong, Apigee) for accessing multiple backends with consistent contracts.
Event Streams: Kafka topics or Pulsar streams that carry changes from all systems, normalizing them into event-driven architectures.
Virtual Data Catalogs: Business users discover and query datasets via tools like Atlan, Collibra, or Alation, regardless of where the data sits.
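The canonical data model idea can be shown in a few lines. This is a hedged sketch: the field names on both sides (Salesforce-style `Id`/`Name`/`Email`, Oracle-style `CUST_ID`/`CUST_NM`/`EMAIL_ADDR`) are illustrative assumptions, not the actual schemas of those products.

```python
from dataclasses import dataclass

@dataclass
class CanonicalCustomer:
    """Shared Customer schema every source system maps into (illustrative fields)."""
    customer_id: str
    full_name: str
    email: str
    source_system: str

def from_salesforce(rec: dict) -> CanonicalCustomer:
    # Salesforce-style field names are assumed here for illustration.
    return CanonicalCustomer(rec["Id"], rec["Name"], rec["Email"], "salesforce")

def from_oracle(row: dict) -> CanonicalCustomer:
    # Oracle-style column names are likewise hypothetical.
    return CanonicalCustomer(str(row["CUST_ID"]), row["CUST_NM"], row["EMAIL_ADDR"], "oracle")
```

Once both sources emit `CanonicalCustomer`, downstream consumers can join, deduplicate, or analyze records without caring where each one originated.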
Key Approaches
1. Data Virtualization
Platforms like Denodo, Starburst, and Dremio allow querying across heterogeneous databases and clouds without physically moving the data.
Pro: Rapid deployment, no duplication.
Con: Dependent on source system performance; can hit latency for massive queries.
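The shape of a virtualized query can be sketched with two in-memory SQLite databases standing in for heterogeneous sources. Real platforms like Denodo, Starburst, or Dremio push sub-queries down to each engine and optimize the federation; this only illustrates the principle that each source answers locally and only small result sets are stitched together.

```python
import sqlite3

# Two in-memory SQLite databases stand in for heterogeneous sources (e.g. a CRM and a billing system).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id TEXT, name TEXT)")
crm.execute("INSERT INTO customers VALUES ('C1', 'Ada'), ('C2', 'Grace')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id TEXT, amount REAL)")
billing.execute("INSERT INTO invoices VALUES ('C1', 100.0), ('C1', 50.0), ('C2', 75.0)")

def virtual_spend_by_customer() -> dict[str, float]:
    """Join across systems at query time: each source runs its own sub-query,
    and only the aggregated results are stitched in the virtualization layer."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id")
    return {names[cid]: total for cid, total in totals}
```

Note how no raw invoice rows are copied into a central store; the downside, as above, is that every query depends on both sources being up and responsive.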
2. Microservices & APIs
Wrapping legacy and SaaS systems in API layers enables modular, controlled access and decouples consumers from fragile backends.
Often combined with event streaming (Kafka, Pulsar) for near-real-time delivery.
Pro: Modular, reusable, supports agility.
Con: Requires robust governance and versioning.
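A minimal sketch of the wrapping idea: a versioned facade presents a stable contract while translating the backend's idiosyncratic field names. The backend shape (`CUST_NO`, `STAT_CD`) and the v1 contract here are hypothetical, invented for illustration.

```python
def legacy_lookup(raw_key: str) -> dict:
    """Stand-in for a fragile backend call with its own field conventions."""
    return {"CUST_NO": raw_key, "STAT_CD": "A"}

def get_customer_v1(customer_id: str) -> dict:
    """Versioned API contract: consumers see a stable shape, never the backend's.
    A future v2 can change the contract without touching v1 consumers."""
    raw = legacy_lookup(customer_id)
    return {
        "version": "v1",
        "customerId": raw["CUST_NO"],
        "status": {"A": "active", "I": "inactive"}.get(raw["STAT_CD"], "unknown"),
    }
```

The versioning discipline mentioned above lives in that contract: as long as `get_customer_v1` keeps its shape, the legacy backend can be refactored or even replaced behind it.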
3. Data Mesh Principles
Treating data as a product, owned by domain teams, accessible via APIs and catalogs.
Pro: Reduces bottlenecks, aligns with business domains.
Con: Requires cultural and process transformation.
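The "data as a product" idea can be sketched as a contract plus a catalog entry. This is a toy illustration of the principle, not any particular mesh platform: names, fields, and the catalog itself are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataProduct:
    """Minimal 'data as a product' contract: a name, an owning domain,
    a published schema, and a read interface."""
    name: str
    owner_domain: str
    schema: dict
    read: Callable[[], list]  # the product's published access method

catalog: dict[str, DataProduct] = {}

def publish(product: DataProduct) -> None:
    # Discovery happens through the catalog, not point-to-point integrations.
    catalog[product.name] = product

# A domain team publishes its product; consumers discover it by name.
publish(DataProduct(
    name="retail.orders",
    owner_domain="retail",
    schema={"order_id": "str", "total": "float"},
    read=lambda: [{"order_id": "O1", "total": 19.99}],
))
```

Consumers call `catalog["retail.orders"].read()` without knowing which system backs it; the cultural shift is that the retail team, not a central data team, owns that contract.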
4. Federated Learning & Query
Models are trained or queries are executed across distributed sources without centralizing raw data.
Pro: Preserves privacy, great for regulated industries (finance, healthcare).
Con: Complex orchestration, performance tradeoffs.
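The federated idea in its simplest form: each source computes an aggregate locally and shares only that, never the raw rows. This sketch computes a global mean from per-source (sum, count) pairs; real federated learning exchanges model updates rather than statistics, but the privacy-preserving shape is the same.

```python
def local_stats(values: list[float]) -> tuple[float, int]:
    """Each source computes its aggregate locally; raw rows never leave the system."""
    return sum(values), len(values)

def federated_mean(sources: list[list[float]]) -> float:
    """Combine only the (sum, count) aggregates to recover the global mean."""
    parts = [local_stats(v) for v in sources]
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count

# Three hypothetical sites: only two numbers per site cross the boundary,
# which is what makes this attractive in regulated settings.
```

The orchestration cost noted above shows up even here: someone has to schedule the local computations, collect the partial results, and handle sources that fail mid-round.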
Pros and Cons of “Don’t Move, Just Stitch”
Advantages:
Rapid time-to-insight (no years-long ETL projects)
Lower cost (avoids data duplication)
Data remains governed in its native environment
Flexibility to onboard new sources quickly
Challenges:
Latency and performance for heavy analytics
Source system uptime becomes critical
More complex query and API governance
Difficult to optimize for massive-scale batch workloads
The Bottom Line
Data location still matters for performance, governance, and certain use cases, but it shouldn’t dictate your ability to deliver value.
With the right mix of virtualization, APIs, microservices, event streaming, data mesh, federated approaches, and canonical models, enterprises can deliver insights and build products faster without getting stuck in endless migration cycles.
The real question isn’t where your data lives, but:
Are you architecting for your systems’ history or for the value your business needs now?
References
Debezium: https://debezium.io/
GoldenGate: https://www.oracle.com/middleware/technologies/goldengate.html
Fivetran: https://www.fivetran.com/
Talend: https://www.talend.com/
Collibra: https://www.collibra.com/
Redis: https://redis.io/
Memcached: https://memcached.org/
Kong: https://konghq.com/
Apigee: https://cloud.google.com/apigee
Kafka: https://kafka.apache.org/
Pulsar: https://pulsar.apache.org/
Atlan: https://atlan.com/
Alation: https://www.alation.com/
Denodo: https://www.denodo.com/
Starburst: https://www.starburst.io/
Dremio: https://www.dremio.com/
