Until now, you’ve made manually gathering data from individual databases work. However, the company has plateaued in its ability to make more complex, data-backed decisions with siloed information. You need to level up your data management system and analysis capabilities so stakeholders can get a holistic view of the company’s customers and make more advanced business decisions.
It’s time to invest in a data warehouse. One that will save your engineering team time by having all historical data in one central repository so they can run analyses in one place.
What is a data warehouse?
A data warehouse, or enterprise data warehouse (EDW), is a system to aggregate your data from multiple sources so it’s easy to access and analyze. Data warehouses typically store large amounts of historical data that can be queried by data engineers and business analysts for the purpose of business intelligence.
Instead of only having access to your data in individual sources, a data warehouse will funnel all your data from disparate sources (like transactional systems, relational databases, and operational databases) into one place. Once it’s in the warehouse, it’s accessible and usable across the business to get a holistic view of your customers. When your data is in one place, you can analyze related data from different sources, make better predictions, and ultimately make better business decisions.
There are two ways to go about implementing a new data warehouse. You can have one on-premise, designed and maintained by your team at your physical location, or you can use a cloud data warehouse—one that lives entirely online and doesn’t require any physical hardware. Cloud data warehouse architecture makes it easier to implement and scale, and they’re typically less expensive than on-premise data warehouse systems. We’ll go more into what to consider and your options for the best data warehouses below.
Data warehouse vs. database: how they're different
Databases and data warehouses are related but not the same.
A database is a way to record and access information from a single source. A database is often handling real-time data to support day-to-day business processes like transaction processing.
A data warehouse is a way to store historical information from multiple sources to allow you to analyze and report on related data (e.g., your sales transaction data, mobile app data, and CRM data). Unlike a database, the information isn’t updated in real-time and is better for data analysis of broader trends.
Data warehouse vs. data lake: how they're different
A data lake is for storing any and all raw data that may or may not yet have an intended use case. A data warehouse, on the other hand, holds data that has already been processed and filtered, so it’s ready to be used and analyzed.
A data lake, hosted on big data platforms like IBM or Hadoop, is ideal for data scientists and analysts to store raw data until they know what they want to do with it, or as a repository to store large amounts of unstructured data.
A data warehouse is perfect for giving access to structured and semi-structured data to multiple business users so they can run queries against it and make decisions quickly.
When and why to use a data warehouse
If you’ve outgrown the insights your current analytics tools can provide, it’s time to integrate a data warehouse into your tech stack. You’ll be able to dive deeper than you can with individual database management.
When to use a data warehouse
You should consider a data warehouse if you want to:
store all of your historical data in a central repository
analyze your web, mobile, CRM, and other applications together in a single place
get deeper business insights than traditional analytics tools by querying data directly with SQL
provide multiple people access to the same data set simultaneously
Why to use a data warehouse
Data warehousing helps you answer those tough analytical questions that your board may be asking that aren’t possible to address with your standard data analytics tool. Reports and analyses you run in data warehouses can include elements from every one of the data sources you’ve connected to it—pretty powerful stuff! This data integration means you can analyze data from your website and app, as well as other platforms you may use like Salesforce, Zendesk, Stripe, and more.
For example, Google Analytics can give you a good sense of what actions customers are taking on your website or app. However, you’re limited to asking questions that can be answered with the number of variables, properties, and types of charts that it provides. When you hook up Google Analytics to your data warehouse, you can tie that information to data from your CRM, sales platform, and so on for a complete view of your customers.
When you have all your data in one place, you can easily run queries directly in your warehouse or through a business intelligence tool like Tableau, Looker, or Mode to automate and visualize those queries and aid in decision making.
6 factors to consider when choosing a data warehouse
Now you know the benefits of a data warehouse—but how do you go about picking one? Consider these factors when figuring out which data warehouse will best suit your business needs.
1. Data types
There are three types of data that you might want to store for your business: structured, unstructured, and semi-structured. Most data warehouses support structured and semi-structured data management, but unstructured data is a better fit for data lakes.
Structured data is quantifiable data that can be organized neatly into rows and columns (e.g., sales records or customer contacts).
Unstructured data is data that can’t be easily managed and analyzed. Think written content (like blog posts or answers to open-ended survey questions), images, videos, audio files, and PDFs. If you’re looking to store purely unstructured data, you should consider a data lake instead of a data warehouse.
Semi-structured data is a mix of structured and unstructured data. Take an email, for example. The content of that email is unstructured, but there are quantifiable aspects to the email, such as who sent it, when they sent it, when it was opened, etc. Similarly, an image itself is unstructured, but you also often have access to structured data like the time the photo was taken, device type, photo size, geotags, etc.
2. Scaling for data storage
Most data warehouses typically allow you to store massive amounts of data without much overhead cost. You probably won’t need more than what they offer, especially if analytics is the primary use case.
However, you’ll want to consider how a particular warehouse scales data storage during times of demand. For example, Amazon Redshift will require you to manually add more nodes (the basic structures in data warehousing that store data and execute queries) when you need more storage and computing power. On the other hand, Snowflake offers an auto-scale function that adds and removes clusters of nodes dynamically as needed.
3. Scaling for performance
The performance of a data warehouse refers to how fast your queries can run and how you maintain that speed in times of high demand. As you can imagine, scaling for performance and data storage are closely connected. Like storage, performance will increase as you scale up the nodes in your warehouse.
These days, speed is a non-issue. Every warehouse is about as fast as the others. What you really want to consider in regards to performance is how much control you want over your speed.
Similar to how a data warehouse’s storage scales, you can add and remove nodes for faster queries. For some warehouses, like Redshift, you need to do that manually, but you’ll be able to tune it as precisely as you like. For others, like Snowflake, it will happen automatically for a hands-off experience.
You likely want your engineers focused on building and maintaining your products instead of worrying about ETL pipelines and day-to-day management of your warehouse—especially if you have a small team. In that case, you’ll want a data warehouse that is self-optimizing like BigQuery, Snowflake, or IBM Db2.
However, by maintaining your warehouse manually, experienced data warehouse architects can have greater control and flexibility to optimize it precisely for your company’s needs. If you want this level of control over your warehouse’s performance and cost, Redshift and PostgreSQL are your best options.
Consider using a data warehouse that is within the ecosystem of the applications you already use. For example, Azure Synapse Analytics is in the ecosystem of Microsoft products, Redshift within AWS, and BigQuery within the Google Cloud ecosystem. This will simplify implementation since you already have an infrastructure in place.
Otherwise, you’ll need your engineers to develop multiple custom ETL pipelines to get your data where it needs to be. You may still need to write a custom ETL to get data into your warehouse from certain data sources, but the goal is to minimize that work.
Many factors go into data warehouse pricing, including storage, warehouse size, run time, and queries. For Redshift, you pay per hour based on nodes or per bytes scanned. BigQuery, on the other hand, has both a flat-rate model and a per-query model. Snowflake, IBM Db2, and Azure are all based on storage and compute time.
Ultimately, you want to choose the data warehouse that will do what you need it to do, not just choose the cheapest option.
For companies with a limited budget, PostgreSQL is a great free option and still has plenty of features. When you’re ready to upgrade, it’s easy to switch data warehouses, especially if you’re using a customer data platform like Segment that can communicate between the two warehouses seamlessly.
The best data warehouses in 2021
You should have a good idea of what you need based on the factors above. Here’s how the top data warehouses on the market compare. The good news: They all integrate with Segment so you can load your data in minutes.
Find the right data warehouse for your business
Ready to add a data warehouse to your stack? Before you sign up for a tool, don’t forget to consider:
the type and amount of data you want to store
how dynamically you need it to scale
how fast you need your queries
whether you want manual or automatic maintenance
the compatibility of the data warehouse with your existing tech stack
Still not sure which data warehouse is best for your business? Do some demos with your top picks to make sure you’ll get what you need from the warehouse you choose. Bonus: They all offer free trials.
Once you’ve picked a data warehouse, you’ll be well on your way to having better access to your business data. You’ll be able to analyze it, identify trends, and make better predictions for the future—and ultimately make better business decisions.
For more information on the data warehouses that Segment offers, check out our catalog of integrations.
New to Segment?
Segment is the industry-leading customer data platform to collect, standardize, and forward data to 400+ marketing and analytics tools. It sits between your data sources (like your website, email marketing, and data warehouse) and the destinations you want to utilize your data (like advertising, customer success, and BI tools). Segment enables high-performance teams to make data-driven decisions using a complete data toolkit.
Sign up for a demo to learn how Segment can help you better understand your customers and engage with them effectively.
Frequently asked questions
How to collaborate across marketing & engineering teams when purchasing new technology
Learn how to align and collaborate across marketing and engineering teams, especially when it comes to launching new marketing software that benefits them both.
AI + Personalization: 5 Ways to Use It
Companies that use AI and a CDP can create strong, personalized campaigns that are unique to their customers. This blog explores 5 use cases, complete with examples.
Four recent GDPR changes your business needs to know about
We cover GDPR changes: AI's impact, updated cookie banners, cross-border enforcement law, EU-U.S. Data Privacy Framework; Twilio Segment's role in GDPR compliance through consent, PII protection, local data processing.