Skip to main content

Speaking "Data Maintenance"An intro to Data-related presales

Glossary of Partners

Mulesoft

Salesforce-owned company.

Salesforce will tell you about Mulesoft, and forget to tell you if they’re speaking about Anypoint, Composer, or anything else about Mulesoft. Make sure they are targeting the discussion in ways that serve the client see data volumes, mappings, complexity) rather than the product.

Talend

The Free version is limited and generally is just used to cross load CSV files. The paid version is very powerful but requires an IT team to wield properly, and has setup costs for us regarding how to set the environment in place.

Jitterbit

Is bad. Run away. It used to be the king, but lack of updates, bad infrastructure and bad support lead to it losing ground over the last years.

Boomi

A paid ETL by Dell. Powerful, used by US corporations, but paid. Rarely seen in the wild unless the client already has a license.

Informatica

A paid ETL by Informatica. Powerful, used by US corporations, but paid. Extremely rarely seen in the wild unless the client already has a license.

Kafka

An event bus by Apache. Used by Event Driven Systems

Glossary of Technologies

API

An Application Programming Interface (API) is a set of functions, procedures, methods or classes used by computer programs to request services from the operating system, software libraries or any other service providers running on the computer. A computer programmer uses the API to make application programs.

MDM

Master data management[1] (MDM) is a technology-enabled discipline in which business and information technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets.[2][3]

In simpler terms, it is the act of defining which system has the correct data, where, when, and how it is kept up to date.

Clients often request an “MDM”, which actually just means a “centralized system of record”, meaning a database where they know the data is correct and should always prime in case of data differences with other systems

ETL (Extract-Transform-Load)

A software tool that extracts data from a source system, transforms the data (using rules, lookup tables, and other functionality) to convert it to the desired state, and then loads (writes) the data to a target database.

Web service

A Web service is defined as "a software system designed to support interoperable machine-to-machine interaction over a network". Web services are frequently just Web APIs that can be accessed over a network, such as the Internet, and executed on a remote system hosting the requested services.

REST

Representational state transfer (REST) is a software architectural that was made to guide the development of the World Wide Web. Systems which implement REST are called 'RESTful' systems. REST documents a way for computer systems to communicate with each other using HTTP requests.
It is supported by most recent players, is flexible and cheap.

It is also less secure than SOAP by design, and for high volumes, Events can be better suited.

SOAP

SOAP is a protocol used in computing. Web services use this protocol to communicate. SOAP uses XML to encode a message. It uses other application-layer protocols, for transport, and content negotiation, for example HTTP and Remote procedure call.

It is less flexible than REST and harder to implement, but it offers more security and some calls are specific to SOAP.

GraphQL

An API type that’s similar to REST but has technical differences in implementation and scope of data recovery. Great if multiple calls need to be done of varying scopes on the same endpoint.

Web socket

Much like REST, it is an HTTP API protocol. It has way less flexibility but is great if you want to “just push a message somewhere”, if that message corresponds to a very specific format.

Events

Events operate on the opposite of REST/SOAP calls. In REST/SOAP you tell a system what you want it to do, and add information needed for the action. Events just say “something happened, here’s the data about that”. It becomes the receiving system’s job to interpret the action to do.

Events are asynchronous, and by nature harder to manipulate and ensure than REST/SOAP calls. It’s great for high-volume, low-latency situations, but expensive.

ESB (Enterprise Service Bus)

REST and SOAP historically integrate two different platforms directly. These platforms become “coupled” - if one changes, the other must change to allow the integration to continue.

An Enterprise Service Bus is a platform that sits in the middle of these integrations. All platforms speak to the ESB, and the ESB then manipulates data, streams, events, and whatever else is necessary to allow the platforms to get the information they need back.

Setting up an ESB is costly, and generally leads to restructures in existing integrations so they leverage the new ESB. It does however lower the cost of future integrations, and lowers platform coupling.

It is a good idea to implement an ESB when you have at least 4 platforms speaking together, and it can be valuable to look at it for lower numbers.

Batch

The default Data Loading mode for Data Loader and REST calls.
Accepts Data passed via REST, in batches. Processes these batches synchronously and then returns the results as a response with the same number of records as in the original batch, with a status code.

The default batch size in Data Loader is 200. The number of batches submitted for a data manipulation operation (insert, update, delete, etc) depends on the number of records and batch size selected.
One API call is used per batch, which can lead to limit issues for big loads.

Bulk

A different Data Loading mode, usable via Data Loader or REST calls.

Accepts Data passed as a CSV file which must be sent to the server in a series of REST calls. Once all the data has been received, a final call tells the bulk to start. It then processes these batches asynchronously and returns the results to the batch, which must then be downloaded via REST calls.

The default BULK size in Data Loader is 2000. The amount of records loadable is by nature very high (a few million), and as such this API is recommended for big data transfers.

Event-Driven Architecture

A situation where the client already uses Event-based systems and expects you to implement a receiving Event Bus and get Events for integrations. See Events.

Database

Often conflated with Relational Database Management System, actually just means a place where data is stored. Can be relational, graph based, events based, whatever. If “database” is said, try to see which kind.

Data Warehouse

Often conflated for “lots of tables”. Actually means place where data frim multiple systems are stored. Doesn’t have to mean that the data is transformed to serve an MDM - you can just store multiple systems and call it a day.

Data Lake

Often conflated for “lots of tables”. Actually has nothing to do with tables, and defines an architecture for data storage, with a heavy focus on data “flatness”, hence “lake”. The data can be structured, semi-structured, unstructured - meaning a Data Analysis team will be needed to use it properly.

If the client is misusing this term, it’s fine. If they’re using it correctly, the complexity of your project just went up.

Data Archival

Taking data from a system and storing it in another when it’s no longer useful but you don’t want to lose it. Generally done for Cost considerations - storing in a local postgredb is cheap.

Glossary of Volumes

Data Storage

Amount of records salesforce stores. Records in Salesforce are generally (exceptions apply to tasks, events, email messages) abstracted to 2kb per record. Storage is expensive in Salesforce, keeping it reasonable generally lowers project cost.

File Storage

Amount of ContentDocument Salesforce stores. Very expensive, and Salesforce does document management poorly. You might want to look into third party solutions.

LDV (Large Data Volumes)

Above 500000 rows in a single table, LDV applies. This is a key word for Architects that will understand they need to watch out for volumes, flows, api calls, storage over time etc.