We have a robust in-house tool (written in Go) that performs parallel exports of MySQL tables to Parquet files on HDFS clusters.
It is currently a CLI tool that resides on a dedicated MySQL ETL server and is triggered remotely by our ETL application.
We would like to turn it into a distributed service (we have dozens of MySQL clusters),
so that a call to this service is always answered on the ETL machine, while the workload is distributed among other machines (which are currently idle!).
We would appreciate any insight into this kind of architecture.
Questions we are currently trying to answer:
API – to gRPC or to not gRPC
Sync – message broker vs DB
Configuration – DB vs Config file
Parallel export: for a single-table request we split the original query into a slice of 16/32 range queries (when possible; otherwise it is treated as a regular export), and all queries are then executed in "parallel" on the same machine. By making this a distributed service we could achieve true, or at least better, parallelism.