This is more of an investigatory post about the proper way to do this in Spark Streaming. I have a Spark Streaming app that takes in a Kafka stream. For each message we receive from the Kafka stream, we call 2 APIs that hit a Spring Boot server running on top of a Postgres database.
The issue is that we are getting on the order of 1 million messages a day, which means we hit our API server at least 2 million times currently, and the scale is growing. We are also planning on adding 2 more calls per message, doubling the number of calls to the server. The reason we need to hit the API server is that rules apply to each message, and those rules change over time. One thing that has come to mind is to take the tables behind the API calls and put them in variables that the streaming application can call upon, then set up an agent to poll the Postgres tables for changes and have it update the variables the streaming job reads.
The issue there is that a broadcast variable can only be refreshed by restarting the Spark Streaming application; a rough sketch of the problem is below. Does anyone know of a framework or tool we can place in between the API server and the Spark Streaming app that would allow us to grow without fear of DDoS'ing ourselves?
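To make the limitation concrete, here is a minimal sketch of the broadcast-variable approach described above (the Message case class, the loadRulesFromPostgres helper and the keying are hypothetical, just to illustrate the pattern):

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical message shape and rules lookup, standing in for the real ones.
    case class Message(ruleKey: String, payload: String)

    def loadRulesFromPostgres(): Map[String, String] =
      Map("some-key" -> "some-rule") // would really query the tables behind the APIs

    def applyRules(ssc: StreamingContext, messages: DStream[Message]): DStream[(Message, String)] = {
      // The rules are broadcast once, when the job starts...
      val rules: Broadcast[Map[String, String]] =
        ssc.sparkContext.broadcast(loadRulesFromPostgres())

      messages.map { msg =>
        // ...so every batch sees the same frozen snapshot. If a rule changes
        // in Postgres, the only way to pick it up is to restart the
        // application and broadcast again.
        (msg, rules.value.getOrElse(msg.ruleKey, "no-rule"))
      }
    }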
I guess, off the top of my head, there are 3 options:
1. Use a caching proxy, offloading the problem to a DB cache. This comes with the problem of cache invalidation. It can work if you know which queries to expect and cache the materialised values; that way you avoid making multiple calls to the cache server.
2. Have the database change log available as a Kafka topic, and join these 2 streams in the Spark Streaming app. That way, changes to the database records come directly to the Spark application (see the sketch after this list). See if this can help; it was written by Martin Kleppmann: https://github.com/confluentinc/bottledwater-pg
3. Use off-heap memory such as Alluxio (which Spark integrates nicely with). The setup is quite complex, since the memory grid has to span the executor nodes. I am not sure of the viability of this solution; you would have to investigate it.
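For option 2, here is a minimal sketch of what the join could look like with DStreams, assuming the change log lands on its own Kafka topic keyed by rule id (the case classes, the keying and the update semantics are assumptions; the real change-log format from Bottled Water would need adapting):

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical shapes for the two streams: the raw messages, and the
    // Postgres change log keyed by rule id.
    case class Message(ruleKey: String, payload: String)
    case class RuleChange(ruleKey: String, rule: String)

    def enrich(messages: DStream[Message], changes: DStream[RuleChange]): DStream[(Message, String)] = {
      // Fold the change log into per-key state holding the latest version of
      // each rule. updateStateByKey re-emits the full state on every batch,
      // so rules survive across batches even when no new changes arrive.
      // (It requires checkpointing: ssc.checkpoint("...") somewhere in setup.)
      val latestRules: DStream[(String, String)] =
        changes
          .map(c => (c.ruleKey, c.rule))
          .updateStateByKey[String] { (updates: Seq[String], current: Option[String]) =>
            updates.lastOption.orElse(current)
          }

      // Key the messages the same way and join against the rule state, so
      // rule changes flow into the job without hitting the API server at all.
      messages
        .map(m => (m.ruleKey, m))
        .join(latestRules)
        .map { case (_, (msg, rule)) => (msg, rule) }
    }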