Description
Imagine you work for a tech company that handles a lot of data. Suddenly you have a requirement to implement an advanced search mechanism that allows your users to filter documents (or something else entirely) by a customised arrangement of parameters.
Have you ever wondered how you would implement this? Treat it as an interview question if you like, and you would probably find yourself reaching for ElasticSearch, since this is one of its most common use cases.
ElasticSearch can be used for different types of applications, such as e-commerce websites, document management systems, and social networks, to provide fast and accurate search results.
That is not what happened in my case. Today, let me guide you through a list of poorly made decisions that impacted a business, so that you can take this information and act on it by implementing better code and architecture.
The Before
Let’s say you have a lot of data in a Postgres RDS cluster whose data is replicated into a lot of other RDS clusters, albeit in different formats.
Why would this be the case? Well, it had been previously decided that the business should follow an ELT (Extract — Load — Transform) data approach instead of ETL (Extract — Transform — Load). Both approaches have their own advantages and disadvantages but for that you can read this AWS article.
What if you had a lot of micro-projects that also depend on said data, and there is no clear way of telling whether it is actually being used? To add more pressure, could you implement the advanced search on said data within a 3-month schedule? Tough to say.
To remove the trouble of having to connect to different databases (all SQL), let’s say we have successfully replicated the data we need to search into a single Postgres cluster: same database, same schema, spread across multiple tables.
Let’s establish a baseline for costs. Initially the RDS cluster cost 1 600 USD per month in the dev environment and 4 900 USD per month in the prod environment, totalling 6 500 USD per month (excluding tax and the cost of management).
The Proposed Approach
Given everything raised above, I put forward an approach and architecture to get the implementation started. There were a couple of important constraints to keep track of:
- We don’t have a lot of time to implement and test;
- We can’t change current data format or tables;
- We can’t have downtime;
- Search results should be somewhat “quick” (less than 5s to user).
So let us go through the architecture for the approach:
Although it might seem complicated, this is a pretty simple architecture and quite quick to implement. The approach leverages LogStash to replicate data from the RDS tables into their own indexes on OpenSearch, keeping replication in sync automatically every 5 minutes (easily adjustable). That's it!
Doing so, we can keep the cost and size of the RDS cluster as they are, while developing new components around it that enable faster data processing with OpenSearch, making use of its indexing and search capabilities.
The following steps were followed so that I could deploy a testable MVP and validate the cost and query times shown above. The steps were:
1) Set up a Dockerfile and docker-compose file with the desired LogStash image.
version: "3.8"
services:
  logstash:
    # For testing this on a Mac you will need to change the Dockerfile image from amd64 to arm64
    # Make sure you have your DB tunnel up and running (Postgres on port 5432)
    # Run with `docker-compose up --build`
    hostname: logstash
    build:
      context: ./
      dockerfile: Dockerfile
    ports:
      - "5432:5432"
    environment:
      - LS_JAVA_OPTS=-Xmx2g -Xms2g
      - RDS_CONNECTION_STRING=jdbc:postgresql://host.docker.internal:5432/db
      - OPENSEARCH_HOST=https://opensearch-node1:9200
    networks:
      - default
  opensearch-node1:
    image: opensearchproject/opensearch:2.11.0
    hostname: opensearch-node1
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - default
...
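The Dockerfile referenced in step 1 is not shown here, so below is a minimal sketch of what it could look like. The base image, tag, driver version, and paths are assumptions for illustration, not the original project's values.
# Logstash distribution that already ships the OpenSearch output plugin (tag is illustrative)
FROM opensearchproject/logstash-oss-with-opensearch-output-plugin:8.9.0
# Postgres JDBC driver for the jdbc input; placing it in logstash-core/lib/jars
# puts it on LogStash's classpath, so the pipeline only needs jdbc_driver_class
ADD https://jdbc.postgresql.org/download/postgresql-42.7.3.jar /usr/share/logstash/logstash-core/lib/jars/postgresql.jar
# Pipeline configuration, index templates, and the sql_last_value metadata directory used below
COPY pipeline/ /usr/share/logstash/pipeline/
COPY os_templates/ /usr/share/logstash/os_templates/
RUN mkdir -p /usr/share/logstash/config/sql_last_value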
2) Implement the pipeline configuration file for LogStash. I used a JDBC input and an OpenSearch HTTP output. Also implement the mutations from the SQL row format into the desired shape of the OpenSearch document;
input {
  jdbc {
    jdbc_connection_string => "${RDS_CONNECTION_STRING}"
    jdbc_user => "${RDS_USERNAME}"
    jdbc_password => "${RDS_PASSWORD}"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_paging_enabled => true
    # Track the last processed last_update timestamp between runs
    use_column_value => true
    tracking_column => "last_update"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/config/sql_last_value/table.yml"
    # Poll every 5 minutes
    schedule => "*/5 * * * *"
    statement => "
      SELECT
        ...
      FROM table t
      LEFT JOIN table_2 t2
        ON t.id = t2.id
      -- incremental load: only pull rows changed since the last successful run
      WHERE t.last_update > :sql_last_value
      ORDER BY t.last_update ASC
    "
  }
}
filter {
  mutate {
    # Drop Logstash bookkeeping fields before indexing
    remove_field => ["@timestamp", "@version", "_score", "_type"]
  }
}
output {
  opensearch {
    hosts => ["${OPENSEARCH_HOST}"]
    index => "os-table"
    # Upsert by primary key so re-processed rows update the existing document
    action => "update"
    doc_as_upsert => true
    document_id => "%{id}"
    user => "${OPENSEARCH_USERNAME}"
    password => "${OPENSEARCH_PASSWORD}"
    ssl => true
    ssl_certificate_verification => false
    # Apply the index template from step 3
    manage_template => true
    template => "/usr/share/logstash/os_templates/table.json"
    template_name => "table-template"
    template_overwrite => true
  }
}
3) Set up the OpenSearch index template (and its index pattern) and adjust the appropriate data types;
{
  "index_patterns": ["os-table"],
  "mappings": {
    "properties": {
      "attributes": {
        "type": "object"
      },
      "category_id": {
        "type": "integer"
      },
      "display_name": {
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "type": "text"
      },
      ...
    }
  }
}
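As a concrete illustration of what this buys us, here is a hypothetical search request against the os-table index that combines an exact filter with the fuzzy text matching discussed later; the field values are made up.
POST os-table/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category_id": 42 } }
      ],
      "must": [
        { "match": { "display_name": { "query": "widgt", "fuzziness": "AUTO" } } }
      ]
    }
  }
}
The filter hits the integer and keyword mappings defined above and the fuzzy match runs against the analysed text field, which is essentially why responses stay well inside the "less than 5s" constraint.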
4) Create a CDK wrapper to deploy an ECR image with said Docker image and LogStash configuration.
// deploy docker image to ecr with :latest tag
new ECRDeployment(scope, 'Latest', {
src: new DockerImageName(logstashImageAsset.imageUri),
dest: new DockerImageName(logstashRepository.repositoryUri),
});
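For context, logstashImageAsset and logstashRepository are not shown in the snippet above. A sketch of how they might be defined is below, assuming the aws-cdk-lib and cdk-ecr-deployment packages; the directory and repository names are illustrative.
import * as path from 'path';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import { DockerImageAsset } from 'aws-cdk-lib/aws-ecr-assets';
import { ECRDeployment, DockerImageName } from 'cdk-ecr-deployment';

// Build the LogStash image (Dockerfile + pipeline + templates) as a CDK asset
const logstashImageAsset = new DockerImageAsset(scope, 'LogstashImage', {
  directory: path.join(__dirname, '../logstash'), // illustrative path to the Docker context
});

// Dedicated ECR repository that the ECS task will pull from
const logstashRepository = new ecr.Repository(scope, 'LogstashRepository', {
  repositoryName: 'logstash-advanced-search', // illustrative name
});
The ECRDeployment then copies the asset image into that repository under the :latest tag, which is what the ECS service deployed in the next step pulls.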
5) Deploy the ECS cluster with the LogStash image.
6) Set up an OpenSearch domain and link the two (a rough CDK sketch for steps 5 and 6 follows).
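Steps 5 and 6 are one-liners above, so here is a rough CDK sketch of what they could look like, continuing from the previous snippet. The instance types, node counts, and wiring are assumptions for illustration, not the values used in the actual deployment.
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

// 6) A small OpenSearch domain to hold the replicated indexes (sizing is illustrative)
const domain = new opensearch.Domain(scope, 'SearchDomain', {
  version: opensearch.EngineVersion.openSearch('2.11'),
  capacity: { dataNodes: 2, dataNodeInstanceType: 'r6g.large.search' },
  ebs: { volumeSize: 200 },
});

// 5) Run the LogStash image from ECR as a long-lived Fargate service
const vpc = ec2.Vpc.fromLookup(scope, 'Vpc', { isDefault: true }); // illustrative VPC lookup
const cluster = new ecs.Cluster(scope, 'LogstashCluster', { vpc });
const taskDefinition = new ecs.FargateTaskDefinition(scope, 'LogstashTask', {
  cpu: 1024,
  memoryLimitMiB: 4096,
});
taskDefinition.addContainer('logstash', {
  image: ecs.ContainerImage.fromEcrRepository(logstashRepository, 'latest'),
  environment: { OPENSEARCH_HOST: `https://${domain.domainEndpoint}` },
  // RDS and OpenSearch credentials would be injected from Secrets Manager (omitted here)
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'logstash' }),
});
new ecs.FargateService(scope, 'LogstashService', {
  cluster,
  taskDefinition,
  desiredCount: 1,
});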
This is what was done as a PoC: everything got deployed, the data replication worked, and you can see the results on the architecture diagram. Given the current RDS size and the number and capacity of reader instances, the sample queries that were going to be made took roughly 500 ms on OpenSearch, versus 2 to 12 minutes as SELECT statements on RDS. With this approach we would incur an extra 200 USD per month to maintain the OpenSearch cluster with all the data loaded. Let's round this up to 2 000 USD just to be safe, in case we need 10x the current number of worker nodes or a couple hundred extra GB of storage.
The total cost of this approach would be estimated at the current 6 500 USD + (2000 * 2 envs) = 10 500 USD per month for all environments, 4 000 USD more than currently.
The Implemented Approach
Unfortunately, although the suggestion above was validated and shared with co-workers who were interested in the approach, the proposed solution was NOT the one implemented. There was management pressure to pursue and explore other approaches; no reasoning was provided.
A couple of extra PoCs for different, alternative solutions were tested, but in the end the advanced search was implemented with extra tables, which we called flat tables. This means the data from all the other tables (roughly 400 GB) was loaded into a single, very wide (>40 columns) and very tall (>200 million records) flat table, which also needed “a couple” of indexes!
The problem now was the response time of the SELECT statements. It was about 30 minutes at the beginning, which is unacceptable unless customers want to search for something over lunch. Through some compromises, the queries could be shortened to about 12 minutes. The compromise was the type of data being searched: instead of keeping raw data, we would store booleans indicating whether a record matched a filter condition or not! This was already getting tricky to manage. A sketch of what such a table looks like follows.
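To make the shape of this concrete, here is a hypothetical sketch of such a flat table; the table, column, and index names are made up for illustration.
-- One very wide, precomputed row per searchable record (hypothetical shape)
CREATE TABLE flat_search (
    id                BIGINT PRIMARY KEY,
    display_name      TEXT,
    category_id       INTEGER,
    -- instead of raw values, booleans precomputed per filter condition
    matches_filter_a  BOOLEAN NOT NULL DEFAULT FALSE,
    matches_filter_b  BOOLEAN NOT NULL DEFAULT FALSE,
    matches_filter_c  BOOLEAN NOT NULL DEFAULT FALSE,
    -- plus dozens more columns, one per condition
    last_update       TIMESTAMP
);
-- "a couple" of indexes, roughly one per heavily used filter combination
CREATE INDEX idx_flat_search_filter_a ON flat_search (matches_filter_a, category_id);
CREATE INDEX idx_flat_search_filter_b ON flat_search (matches_filter_b, category_id);
Every new filter means another precomputed column, a backfill across more than 200 million rows, and usually another index, which is exactly the growth problem described below.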
Although it was noted multiple times that this would be a tactical and very temporary approach, everyone knew this was not the case. When work started, other reader instances began stalling, since loads of IO requests were being made to the cluster. The solution? Well, scale the cluster up, of course!
Below you can see the result of scaling up the cluster for a release that enabled search on only 3 different conditions.
Cost was on the rise. We were now at 5 100 USD on dev (from 1 600) and 8 600 USD on prod (from 4 900), totalling 13 700 USD per month instead of the 6 500 USD from before. Yikes: the OpenSearch approach had already proved more cost-effective while allowing a more flexible search.
The issues became more noticeable when we had to add more to the filtering logic. The more filters and columns we added, the more space the table took, the heavier the query got, and the more expensive the computing power for the cluster became. And this was not all: at the same time there were discussions about new requirements, and we wanted new search capabilities such as fuzzy string search on a couple of fields!
I will not go through that struggle, but let's just say that in the end it was not done, because the cluster wouldn't scale up any further and the query times were simply not acceptable. For the better, no manager thought of horizontal scaling.
We are now more than 6 months into this approach. As more customers use it, more IO operations are thrown at the cluster. The economic impact of the implemented solution keeps growing, while the advantages it was supposed to deliver do not.
The After
This article is getting out of hand at this point, but I wouldn't leave you without a current cost analysis!
See for yourself.
Yes. Dev costs have increased from the original 1 600 USD to 12 000 USD (a 7.5x increase), while prod costs have jumped from the original 4 900 USD per month to 32 000 USD (a 6.5x increase). This is insane, and it illustrates perfectly how the wrong solution can affect the cost of an implementation.
The total cost of the advanced search ended up being roughly 34 000 USD more than the OpenSearch approach would have cost, PER MONTH! By year's end this solution will have cost 400 000 USD more.
You can see the trend for June, for which we only have about two thirds of the data so far. It's not looking great.
The Conclusion
Trust your engineers to make the best decisions, otherwise, why did you hire them in the first place?!
If you are in a place of power, you should guide your team members, but do not enforce your opinions on something without first carefully analysing their consequences. In this example, it led to a precarious solution that is not only hard to adapt to new requirements but also to maintain.
Now, I am looking forward to implementing the initially proposed approach!
See you around.