Data processing

How and where is my data processed?

For security and compliance reasons, it is often important to understand which infrastructure processes which exact data points when you are using Datawisp.

Some of the information below depends on which plan and type of deployment you chose for Datawisp. For more information, check out our deployment overview.

Primary data processing infrastructure

In general, when using Datawisp, your data is processed in four places:

  1. Your existing data warehouse / database: This is the primary place where most of the processing takes place. Whenever possible, Datawisp processes everything on your own infrastructure, which is completely under your control and already complies with all of your security / certification requirements.

  2. Your browser: To display charts, tables, and other visualizations, your browser processes and caches some of the data locally. Most of this data is retained only until you close the browser tab, but some (e.g. names of tables and columns) may persist until you log out. Your browser’s data retention can be controlled and managed by your organization’s IT department, and complies with any security / certification requirements you may have.

  3. Datawisp servers: Some processing can happen on the Datawisp server, for example when you combine data from a database with an Excel sheet you’ve uploaded to Datawisp. Except for files you’ve explicitly uploaded, our servers generally do not retain a copy of your data, and all cached data is destroyed when you close the browser tab. However, some data points may still be stored by the Datawisp server. For example, the answers Wispy gives are stored in your account, and those may reference specific data points. The data for charts and other visualizations in dashboards is cached to prevent long loading times.

     For customers using app.datawisp.io: All of Datawisp’s extensive protections for your data apply. For example, data is encrypted at rest, connections are isolated, and strict access controls are enforced.

     For managed / enterprise deployments: All processing and storage of your data is controlled and managed by your organization’s infrastructure / IT department, and complies with any security / certification requirements you may have. For more information, consult our deployment overview.

  4. Large Language Models / AI: Datawisp minimizes the amount of data sent to LLMs, and only enables LLM vendors that will never use your data or your interactions with Wispy to train future models. Your data is retained for the minimal period required by your selected vendor, and then deleted. For a more detailed breakdown, refer to the Large Language Models section below.

Non-data-processing vendors / infrastructure

In addition, some usage data is sent to other vendors as necessary to provide specific features. Those vendors do not process or gain access to your data; any information they receive about it is incidental.

Nonetheless, in enterprise / managed deployments, those features can be turned off.

  1. AI Voice Input (Gladia): For voice input, your browser may send voice recordings directly to gladia.io. This only happens if you press the “microphone” icon in a text input field. For managed / enterprise deployments, this feature can be turned off.

  2. Log collection (Honeycomb, Sentry, Datadog, Datawisp): By default, Datawisp collects some usage information via Honeycomb, Sentry, and Datadog. The Datawisp server also stores some usage logs. Logs generally do not contain specific data points, but they may sometimes reference table or column names. For managed / enterprise deployments, automatic log collection can be turned off; in that case, individual server and client logs can be sent to the Datawisp team to resolve potential issues.

  3. Login: your respective SSO provider (Google, Microsoft, LinkedIn, GitHub, Stytch): If you log in using a third party, Datawisp verifies with that provider which user is trying to authenticate. From this information, the provider could, for example, derive when and how often you use Datawisp. If you log in with email / password, the relevant provider is Stytch. For managed / enterprise deployments, we recommend enabling only your existing / trusted SSO provider.

Large Language Models

To enable Wispy to analyze your data, create charts, give answers and create dashboards, Datawisp leverages large language models.

Supported Vendors / Models

The current generation of state-of-the-art large language models is often proprietary and/or difficult to deploy. As a result, the best models for Datawisp are generally not deployable directly on your infrastructure. If you do have access to a private instance of any of the supported models, Datawisp can, of course, connect to it.

By default, Datawisp currently supports models provided by:

  • Microsoft Azure (GPT)

  • OpenAI (GPT)

Note: Your data and chats are never used to train AI models. In general, data is retained for at most 30 days, and both Azure and OpenAI offer zero-retention options. For exact and up-to-date information, refer to the respective vendors (Azure, OpenAI).

Specifically, for organizations with strict security and/or compliance needs, we recommend using models available via Microsoft Azure. Azure holds all necessary certifications (incl. HIPAA, SOC 2, …) and is trusted by the largest organizations around the world.

Other Models

If neither Microsoft Azure nor OpenAI deployments are a possibility for your organization, other model providers can be explored, for example:

  • Google Cloud (Gemini)

  • AWS (Claude, DeepSeek, Llama, …)

  • Anthropic (Claude)

  • Self-hosted (DeepSeek, Llama, …)

In general, for enterprise customers, Datawisp can be set up to work with the mentioned model families.

However, this can have implications for performance or accuracy. This is not necessarily because those models are “worse”; in many cases, it is because Datawisp has been more extensively tested with and optimized for the quirks and behaviors of the GPT family of models.

If another vendor is chosen, that vendor’s processing and retention terms apply (e.g. Google Cloud, AWS, …).

Some of these models are also available as “open-weight” models and can be self-hosted by your infrastructure team. In that case, please be prepared to host the best-performing / largest model of the respective model family.

What data is transmitted to the LLM?

To make sure Wispy gives the best possible answer, Datawisp carefully balances the amount of data sent to the LLM. Like human data analysts, LLMs need to see some data in order to understand its layout or to summarize the results of queries.

However, sharing too much data is undesirable for both accuracy and security. Therefore, Datawisp often limits the amount of data it sends to the LLM.

In general, Datawisp may transmit some of the following data to the LLM you selected:

  • the prompt / question entered by the user

  • your organization's global prompt / memories

  • information about the selected tables (names, column names, data types, relations, data dictionaries, ...)

  • a small, random sample of rows from those tables

  • the first few rows of each result set for queries the AI writes
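To make the list above more concrete, here is a minimal sketch of the kind of context that might accompany a single prompt. All field names and values are hypothetical illustrations, not Datawisp's actual internal format, and the exact makeup may change as Wispy improves:

```python
# Hypothetical illustration of the categories of data sent to the LLM.
# Field names are invented for this sketch; they are not Datawisp's
# actual payload format.
llm_context = {
    # The prompt / question entered by the user.
    "user_prompt": "Which region had the highest revenue last quarter?",
    # Your organization's global prompt / memories.
    "org_memories": ["Fiscal year starts in February."],
    # Metadata about the selected tables: names, columns, types, relations.
    "schema": {
        "tables": [
            {
                "name": "orders",
                "columns": [
                    {"name": "region", "type": "text"},
                    {"name": "revenue", "type": "numeric"},
                    {"name": "ordered_at", "type": "timestamp"},
                ],
            }
        ]
    },
    # A small, random sample of rows -- never the full table.
    "row_sample": [
        {"region": "EMEA", "revenue": 1240.50, "ordered_at": "2024-03-01"}
    ],
    # Only the first few rows of each query result are included.
    "result_preview_rows": 5,
}
```

The key point of the sketch is what is absent: full tables never leave your infrastructure; only metadata, a small sample, and truncated result previews are transmitted.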

However, as Wispy improves, the exact makeup may change. Every Datawisp subscription includes access to our AI inspector. With it, your team can see exactly which of your data was sent to the LLM for any given prompt, and which model was used.

Datawisp's AI inspector

In general, our recommendation is to choose a trustworthy LLM provider instead of trying to minimize which exact data points are transmitted. The correct amount of sensitive data to send to an LLM provider you cannot trust is, unfortunately, zero.
