Lexio ingests data from a prospect's or customer's data source via "push" or "pull" data flows, then processes and stores the data as "extracts" in AWS S3.


Extracts are compressed snapshots of data, optimized for aggregation, that are loaded into system memory or a cache so they can be recalled quickly for analysis.



Pull Flows

We use Stitch, AWS Glue, and custom data source extractors to ingest and transform data from a suite of SaaS tools, databases, Google BigQuery, and Tableau. These are flows where we "pull" data from a prospect or customer, and with these data sources Lexio handles the Extract/Transform logic that analysts would handle themselves in traditional BI.


Available Pull Flows

  1. Stitch extracts the customer's data into its platform
  2. Stitch loads the data into our Data Lake S3 bucket
    • Data is stored in JSON Lines format
    • Loads happen at most every 30 minutes
    • Loads occur for each data source independently
    • We provide Stitch with AWS IAM user credentials that have only write access to the S3 bucket (see the policy sketch after this list)
    • The Data Lake S3 bucket has public access blocked and encrypts the data at rest with AES-256 encryption
    • Access to the bucket is restricted to AWS IAM users with the proper permissions. Permissions are controlled by a central Cloud Engineering team.
  3. BigQuery Data Pulls
  4. Tableau Data Extractor
    • A Narrative Science application that translates dashboards into Lexio content
    • Runs on Lexio infrastructure
    • Tableau credentials are stored in AWS Secrets Manager
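
To illustrate the write-only access described above, here is a minimal sketch of the kind of IAM policy that could be attached to the Stitch user, using Python and boto3. The user name, bucket name, and policy name are placeholders, not our actual resource identifiers.

    import json
    import boto3

    # Placeholder names for illustration only; not actual resource identifiers.
    DATA_LAKE_BUCKET = "example-data-lake"
    STITCH_USER = "stitch-loader"

    # Write-only policy: the Stitch IAM user may put objects into the Data Lake
    # bucket but cannot read, list, or delete anything.
    write_only_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{DATA_LAKE_BUCKET}/*",
            }
        ],
    }

    iam = boto3.client("iam")
    iam.put_user_policy(
        UserName=STITCH_USER,
        PolicyName="stitch-data-lake-write-only",
        PolicyDocument=json.dumps(write_only_policy),
    )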



Push Flows

Alternatively, customers can build their own extracts and push them to our drop zone, either directly from their data source (for example, Snowflake or Looker), via ETL tools that have this capability (for example, Alteryx or Informatica), or via data flow automation tools (for example, Apache Airflow or Prefect). With these data sources, the prospect or customer handles the Extract/Transform logic, as they would in traditional BI.


Available Push Flows

  1. Lexio Drop Zone
    • The Lexio Drop Zone is an AWS S3 bucket where customers can send data
    • Customers use their own tooling to push data to the drop zone (see the push sketch after this list)
    • Customers are provided credentials to a portion of an S3 bucket that is unique to that customer.
    • The Data Lake S3 bucket has public access blocked and encrypts the data at rest with AES-256 encryption
    • Access to the bucket is restricted to AWS IAM users with the proper permissions. Permissions are controlled by a central Cloud Engineering team.
  2. Data Cache Upload API
    • Smaller datasets can be uploaded directly to the Data Cache via the API
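
As an illustration of the push model, here is a minimal sketch of what a customer-side job (an Airflow task, a cron job, and so on) might run to send a prepared extract to its drop zone prefix, using Python and boto3. The bucket name, prefix, credentials profile, and file name are placeholders; the actual credentials and customer-specific location are provided as described above.

    import boto3

    # Placeholder values for illustration; each customer receives its own
    # credentials and a customer-specific portion of the bucket.
    DROP_ZONE_BUCKET = "example-lexio-drop-zone"
    CUSTOMER_PREFIX = "customers/acme"

    # The customer's own tooling pushes the prepared extract into its portion
    # of the drop zone bucket.
    session = boto3.Session(profile_name="lexio-drop-zone")  # customer-provided credentials
    s3 = session.client("s3")
    s3.upload_file(
        Filename="daily_sales_extract.csv",
        Bucket=DROP_ZONE_BUCKET,
        Key=f"{CUSTOMER_PREFIX}/daily_sales_extract.csv",
    )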


Data Loading

  1. Lexio runs a Data Load process every 4 hours
    • Redshift Spectrum is used to:
      1. Deduplicate records from the Data Lake
      2. Create optimized Parquet files in the Data Warehouse
    • Data is not stored in Redshift; we use the compute nodes of the cluster to perform queries
    • The Data Warehouse S3 bucket has public access blocked and encrypts the data at rest with AES-256 encryption
    • Access to the bucket is restricted to AWS IAM users with the proper permissions. Permissions are controlled by a central Cloud Engineering team.
  2. Lexio transforms the raw data into denormalized tables
    • Athena is used to coerce, project, and filter rows according to how the user configured the data source in Lexio (see the query sketch after this list)
    • Athena outputs Parquet files into the Data Marts S3 bucket
    • The Data Marts S3 bucket has public access blocked and encrypts the data at rest with AES-256 encryption
    • Access to the bucket is restricted to AWS IAM users with the proper permissions. Permissions are controlled by a central Cloud Engineering team.
  3. Lexio publishes the transformed data to the Data Cache so it can be used by our story-writing and analytic components
    • Application code running in an ECS Fargate task loads the data into the Data Cache.
    • Datasets stored here are timestamped. At least once per day, old datasets are removed.
    • The Data Cache is a Lexio Service (an application running on Lexio infrastructure). Data is stored on encrypted EBS volumes that are accessed by application code running on AWS ECS backed by EC2 instances. Data is encrypted at rest and in transit.
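
As an illustration of the transform step above, here is a sketch of a CTAS-style Athena query issued via boto3 that coerces types, projects columns, filters rows, and writes Parquet output to a Data Marts location. The database, table, column, and bucket names are hypothetical; the actual projection and filters depend on how the user configured the data source.

    import boto3

    # Hypothetical database, table, column, and bucket names; illustration only.
    query = """
    CREATE TABLE data_marts.opportunities_mart
    WITH (
        format = 'PARQUET',
        external_location = 's3://example-data-marts/opportunities/'
    ) AS
    SELECT
        CAST(opportunity_id AS varchar) AS opportunity_id,
        CAST(amount AS double)          AS amount,
        CAST(close_date AS date)        AS close_date
    FROM data_warehouse.opportunities
    WHERE is_deleted = false
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://example-athena-query-results/"},
    )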


Data Retention

Specific Stores

  • Data Lake: Datasets are added or updated from external systems. Stitch only adds files; it does not change or remove existing files. Other systems, such as external data warehouses and the Tableau Data Extractor, overwrite files.
  • Data Warehouse: Datasets stored here are immutable and timestamped. At least once per day, old datasets are removed (see the cleanup sketch after this list).
  • Data Marts: Datasets stored here are immutable and timestamped. At least once per day, old datasets are removed.
  • Data Cache: Datasets stored here are immutable and timestamped. At least once per day, old tables are removed. Lexio stores backups of our RDS instances for 14 days.
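
The daily removal of old datasets can be implemented in several ways; the sketch below shows one illustrative approach (a scheduled job that deletes objects older than a cutoff) and is not necessarily the mechanism Lexio uses. The bucket and prefix names are placeholders.

    from datetime import datetime, timedelta, timezone

    import boto3

    # Placeholder bucket and prefix; illustration only, not the actual cleanup job.
    BUCKET = "example-data-marts"
    PREFIX = "datasets/"
    CUTOFF = datetime.now(timezone.utc) - timedelta(days=1)

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Delete objects belonging to timestamped datasets older than the cutoff.
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        stale = [
            {"Key": obj["Key"]}
            for obj in page.get("Contents", [])
            if obj["LastModified"] < CUTOFF
        ]
        if stale:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})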


When a data source is deleted in the Lexio UI

  • If the data source was created in Lexio (Salesforce, CSV source types), the Stitch connection, as well as the Data Lake S3 prefix, Data Warehouse S3 prefix, Data Marts S3 prefix, and Data Cache tables, are promptly deleted (a prefix-deletion sketch follows this list).
  • If the data source was created directly in Stitch, set up as a drop zone integration, or set up as a Tableau integration (by working with our Customer Success team), then the Data Warehouse S3 prefix, Data Marts S3 prefix, and Data Cache tables are promptly deleted. The Stitch connection, Tableau connection, and Data Lake S3 prefix are deleted manually by our Customer Success team within 30 days.
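
Because S3 has no single "delete this prefix" operation, promptly deleting a data source's prefix means removing every object stored under it. The sketch below illustrates this with boto3; the bucket and prefix are hypothetical, and this is not the code Lexio actually runs.

    import boto3

    # Hypothetical bucket and data source prefix; illustration only.
    BUCKET = "example-data-warehouse"
    DATA_SOURCE_PREFIX = "customer-123/source-456/"

    # Deleting a prefix means deleting every object under it; the resource API
    # batches the underlying DeleteObjects calls.
    s3 = boto3.resource("s3")
    s3.Bucket(BUCKET).objects.filter(Prefix=DATA_SOURCE_PREFIX).delete()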


In addition, customer data is not logged. Metadata related to queries, such as column names and customer IDs, is logged to CloudWatch. Our logs expire after 60 days.
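
CloudWatch Logs can enforce this expiry directly by setting a retention policy on each log group. A minimal sketch, assuming a hypothetical log group name:

    import boto3

    logs = boto3.client("logs")

    # Hypothetical log group name; expire log events after 60 days.
    logs.put_retention_policy(
        logGroupName="/lexio/example-query-service",
        retentionInDays=60,
    )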