You Should Use Snowplow

(This article was originally published on Viget.com.)

What is Snowplow?

Snowplow is a flexible and dynamic data collection platform that gives you complete ownership of your event data pipeline. It runs on the cloud of your choice, Amazon Web Services (AWS) or Google Cloud Platform (GCP).

Snowplow tracks standard events, entities (users, products, transactions, and their associated behavior across your platforms), and fully custom events into scalable data warehouses like Amazon Redshift, Google Cloud BigQuery, or Snowflake. You can consume data in real time from Amazon Kinesis streams, Google Cloud Pub/Sub topics, and Elasticsearch clusters. From there, analyze the data using business intelligence tools like Mode and Looker, aggregate the data for holistic reporting, or use the hit-level data to power machine learning applications.

There are six main components to a Snowplow pipeline: Trackers, Collectors, Validation, Enrichment, Storage, and Data Modeling. You can use one of Snowplow’s sixteen trackers in your website, app, or server, or make use of a Snowplow webhook to capture third-party data. The tracker sends events to the collector, which then sinks events to services like Amazon S3 or Kinesis. Snowplow validates your data using predefined schemas and parses the data using Hadoop, Kinesis, or Kafka. Snowplow enriches the data with third-party data sources such as weather data from OpenWeatherMap. Enriched and validated data is saved in your data store of choice. (Viget prefers using AWS’s data warehouse, Redshift.) Lastly, you can model the data and perform analysis that will help your company make data-informed decisions.

Snowplow comes in two flavors: Snowplow Insights and Snowplow Open Source. Snowplow Open Source requires you to have a team of analysts, data engineers, and developers who are familiar with cloud platforms like AWS or GCP and with advanced data modeling. Setting up Snowplow Open Source is not like setting up Google Analytics (GA), where you add a bit of JavaScript to the site and are up and running.

If you don’t have those capabilities in-house and still want to use Snowplow’s world-class event data platform, Snowplow Insights and Viget will be a better option than Snowplow Open Source. With Snowplow Insights, we (Snowplow and Viget) can set up all of the cloud-based components, implement data collection models that align with your business model, send data to your business intelligence tool of choice, and get you to data maturity quickly.

Why should you use Snowplow?

When your analytics data is siloed in an analytics platform and your business analysts are running queries on production databases, your data scientists may be sitting on their hands because they can’t access data in the format they need. Snowplow can help solve this issue.

There are three main reasons why we recommend Snowplow to our clients.

  1. Ownership: Snowplow gives you complete ownership of your data. The data doesn’t need to go to Google or Adobe when you run a query or want to pull a report. Your data is always on your cloud platform, giving you more flexibility and control.

  2. Multi-platform consistency: Because you aren’t constrained to Google’s or Adobe’s data model, you can more easily track consistently across platforms based on your business logic. You can add trackers on your backend databases, servers, and mobile apps, sending event data to the same place with the same structure. This setup allows you to more easily identify your users across multiple owned properties and analyze their behavior using consistently formatted data.

  3. Real-time: Snowplow gives you a world-class real-time event data platform, using tools like Amazon Kinesis. Depending on your industry or business, real-time data applications built on top of Snowplow data (for tasks such as fraud detection, recommendation, personalization, defect tracking, or short-term forecasting) can be a game changer. By comparison, the standard Google Analytics account may have a processing latency of up to 48 hours.

Basics of Event Data Modeling on Snowplow

Part of the benefit of using Snowplow is that you can use your own data model. Let’s choose what type of data you need to collect and how to collect it.

Start by asking yourself and your team, “What events are mission-critical to our business? What types of contextual data (metadata, timestamps, geographic, browser data, etc.) do we need around key events? What decisions can we make based on learning more about these events?” After you answer these questions and understand your company’s event data situation, you will be better equipped to use Snowplow to its fullest.

Document what these key events look like and the context in which they happen. These details will help you–or Viget and Snowplow Insights–choose the appropriate Snowplow trackers to set up for your event data collection.

If you are familiar with GA and Adobe Analytics, you may be thinking about events purely from a client-side standpoint. But when you use Snowplow, I encourage you to think about and use some of Snowplow’s server-side trackers (Python, Java, Scala, Ruby, C++, or PHP) for critical pieces of data. You can then stitch server-side data to your client-side data, giving you a more thorough picture of a session or user.
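
To make the idea of stitching concrete, here is a minimal sketch in plain Python. It is not Snowplow code: the event dictionaries, the "user_id" and "ts" field names, and the stitch function are all hypothetical, and it assumes both the client-side and server-side events carry the same user identifier.

```python
from collections import defaultdict

# Hypothetical events; "user_id" is an assumed identifier shared by both sides.
client_events = [
    {"user_id": "u-1", "source": "web", "event": "page_view", "ts": 1},
    {"user_id": "u-1", "source": "web", "event": "add_to_cart", "ts": 2},
    {"user_id": "u-2", "source": "web", "event": "page_view", "ts": 3},
]
server_events = [
    {"user_id": "u-1", "source": "server", "event": "payment_processed", "ts": 4},
]


def stitch(*event_streams):
    """Group events from every stream by user and order them by timestamp."""
    timeline = defaultdict(list)
    for stream in event_streams:
        for event in stream:
            timeline[event["user_id"]].append(event)
    for events in timeline.values():
        events.sort(key=lambda e: e["ts"])
    return dict(timeline)


timelines = stitch(server_events, client_events)
u1_events = [e["event"] for e in timelines["u-1"]]
print(u1_events)
```

In a real pipeline this join would more likely happen in SQL inside your data warehouse, but the principle is the same: a shared identifier turns two separate event streams into one ordered history per user.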

Snowplow supports a lot of the standard events we recognize from GA or Adobe, like page views, link clicks, form submissions, and transactions. Snowplow also accepts custom events, structured with five (very familiar) parameters: Category, Action, Label, Value, and Property. Custom events could include sign in, account creation, and search events.
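
As a rough illustration of the five-parameter structure, the sketch below models a custom search event in plain Python. The StructuredEvent class and its to_payload method are hypothetical and are not part of any Snowplow tracker; the sketch also assumes that Category and Action are required while Label, Property, and Value are optional.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class StructuredEvent:
    """Hypothetical container for the five structured-event parameters."""

    category: str                    # assumed required
    action: str                      # assumed required
    label: Optional[str] = None
    property: Optional[str] = None
    value: Optional[float] = None

    def to_payload(self) -> dict:
        # Drop unset optional fields so the payload only carries real values.
        return {k: v for k, v in asdict(self).items() if v is not None}


search_event = StructuredEvent(
    category="search",
    action="submit",
    label="site-header",
    property="snowplow pricing",
)
payload = search_event.to_payload()
print(payload)
```

Writing events down this way, even on paper, is a quick check that every key event in your plan has a clear category and action before you pick a tracker.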

One reason you need a fully structured event data collection plan is that Snowplow validates your data during enrichment. It uses JSON Schema to validate that your event data is “good data.” This means that your data has to conform to your predefined event schema. But don’t worry, event schemas can evolve over time as your business needs change.
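
The sketch below shows the general idea of schema validation in plain Python. It is a simplified stand-in, not Snowplow’s validator and not a full JSON Schema implementation: the SIGN_IN_SCHEMA layout, the field names, and the validate function are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical schema for a "sign in" event, in a simplified JSON Schema style.
SIGN_IN_SCHEMA: Dict[str, Any] = {
    "required": ["user_id", "method"],
    "properties": {
        "user_id": {"type": str},
        "method": {"type": str},
        "attempts": {"type": int},
    },
}


def validate(event: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event conforms."""
    errors = []
    for field in schema["required"]:
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, value in event.items():
        rule = schema["properties"].get(field)
        if rule is None:
            errors.append(f"unexpected field: {field}")
        elif not isinstance(value, rule["type"]):
            errors.append(f"wrong type for field: {field}")
    return errors


events = [
    {"user_id": "u-1", "method": "password", "attempts": 1},
    {"user_id": "u-2"},                                       # missing "method"
    {"user_id": "u-3", "method": "sso", "attempts": "two"},   # wrong type
]

# Split events into those that conform to the schema and those that do not.
good = [e for e in events if not validate(e, SIGN_IN_SCHEMA)]
bad = [e for e in events if validate(e, SIGN_IN_SCHEMA)]
print(len(good), "good,", len(bad), "bad")
```

The point of the exercise is that a vague tracking plan produces events like the second and third ones, which fail validation, so agreeing on the schema up front saves data later.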

Example Industries

Below I have included a few examples of industries and companies I believe could benefit from using Snowplow, but there are many more examples of organizations that could benefit from advanced event collection and analysis. And I promise this isn’t just a list of companies I admire:

Financial platforms like Robinhood, TD Ameritrade, or Fundrise can benefit from implementing Snowplow on their websites and applications. Snowplow can answer simple analysis questions such as, “Are users using your platforms to do research prior to making a trade or purchasing an investment vehicle? How long does it take them to purchase an investment vehicle? Are there user characteristics or features that are correlated with higher trade volumes?” With rich Snowplow event data, these financial platforms can also answer more complex user-behavior questions such as, “How likely is a user to drop off the platform or churn, based on account usage patterns and user attributes?” Snowplow data could power on-platform personalization, enable advanced anomaly detection and fraud protection, and predict future platform usage.

Large content sites like BiggerPockets or The Verge, with tons of online resources, could get more data on who is using the platform, for what, and when. BiggerPockets could answer questions like, “What are a few articles we could recommend for new pro members in the San Francisco area? Which users are about to become pro members and just need a nudge? Which pro members are about to churn and need to get more engaged with the platform?” Subscriptions like BiggerPockets’ Pro Membership are difficult to analyze with standard GA because “subscriptions” don’t necessarily map cleanly to sessions or page views. Churn prediction and personalized outreach to prevent churn can have a significant impact on the bottom line of subscription-based businesses.

Retail companies like Warby Parker or Patagonia can use Snowplow to increase conversion rates across web and mobile platforms. Snowplow event data allows retail companies to build robust customer acquisition funnels, track users as they move through those funnels, and nudge users who get stuck to keep moving down the funnel, for example by automatically emailing them about an ongoing sale. Check out Snowplow’s awesome blog post on retail use-cases to learn about how a retail company could utilize the platform!

In all likelihood, Snowplow and advanced event data analysis can be used in nearly every vertical and company. Small companies, video game companies, non-profits, higher education, and more can leverage Snowplow’s pipelines to learn more about their users and help develop a more user-friendly environment.

Conclusion

How do you learn more about Snowplow?

  • Visit the Snowplow website and check out their GitHub.

  • Attend a Snowplow Meetup. In the US, they have meetups in New York and San Francisco. They also have a ton of meetups in Europe.

  • Give Viget a call! We are proud to partner with the team at Snowplow.

Snowplow is an amazing tool that can help your company move from data adolescence to data maturity. If you want to implement Snowplow, shoot us (Viget) a line and we can help you get up and running with Snowplow.
