Stargate v2 Design #1381
dougwettlaufer started this conversation in RFCs
Stargate v2
At a high level, Stargate v2 will consist of three logical parts: Services, the Bridge, and Persistence.
Services
The services component of Stargate will be composed of discrete, user-facing services running as independent pods. Each service will be responsible for serving requests on its respective interface by 1) accepting requests, 2) transforming them into CQL, 3) passing them along to the Bridge via a gRPC request, and 4) transforming the gRPC response back into its own format (such as JSON or Protobuf) and returning the payload to the user.
The REST, GraphQL, and gRPC services continue to convert their respective request types into CQL strings. These CQL strings are then passed along to the persistence module via the Bridge. The Bridge will then pass the CQL string into a DataStore-like interface for the query to be executed. Going through this intermediate gRPC layer will 1) provide the services with a gRPC interface rather than relying on drivers, 2) take advantage of cross-cutting functionality provided by the Bridge, and 3) leave the door open for future support of other persistence types via the payload abstraction in the Bridge.
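As a rough illustration of that flow, the sketch below shows how a REST-style handler might build a CQL string and hand it to the Bridge over gRPC. BridgeGrpc, QueryRequest, and QueryResponse are placeholder names for the generated Bridge stubs, not the actual Stargate proto definitions.

```java
// Hypothetical request flow for a user-facing service: HTTP in, CQL to the
// Bridge over gRPC, gRPC response mapped back to the service's own format.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RestQueryHandler {

  private final BridgeGrpc.BridgeBlockingStub bridge; // placeholder generated stub

  public RestQueryHandler(String bridgeHost, int bridgePort) {
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress(bridgeHost, bridgePort).usePlaintext().build();
    this.bridge = BridgeGrpc.newBlockingStub(channel);
  }

  /** 1) accept the request, 2) build CQL, 3) call the Bridge, 4) map the result back to JSON. */
  public String getRow(String keyspace, String table, String key) {
    String cql = String.format("SELECT * FROM %s.%s WHERE key = ?", keyspace, table);

    QueryResponse response =
        bridge.executeQuery(QueryRequest.newBuilder().setCql(cql).addValues(key).build());

    return toJson(response);
  }

  private String toJson(QueryResponse response) {
    // Service-specific translation of the gRPC result set into this API's JSON payload.
    return "{}";
  }
}
```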
Bridge
The Bridge will be a gRPC service that connects the various user-facing services with Persistence. Its responsibilities will be to accept gRPC requests and to perform authentication and authorization.
The gRPC service (aka the Bridge) is implemented in much the same manner as our user-facing gRPC API, allowing for payloads of different types, with CQL being the default. This service will live within the same JVM as the persistence module and support cross-cutting concerns such as 1) authn/z, 2) request/response filters (e.g. encryption, data masking, …), and 3) metrics.
As a cross-cutting concern, it makes sense to push authn/z into this layer rather than expecting each service to reimplement the same functionality.
Running bundled together with Persistence will reduce the complexity introduced by the number of running pods, but may result in code duplication across the various Persistence implementations and tie the Bridge to the lowest common denominator in terms of language support. Initially it will be best to combine it with Persistence for simplicity; it can be broken out later if that proves beneficial.
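As a sketch of what pushing authn/z into the Bridge could look like, the example below uses a standard io.grpc ServerInterceptor to validate a token before a request ever reaches the persistence module. The header name and the AuthService interface are illustrative assumptions, not the actual Stargate auth plugin API.

```java
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;

public class AuthInterceptor implements ServerInterceptor {

  /** Hypothetical pluggable token validator standing in for the auth plugin. */
  public interface AuthService {
    boolean isValid(String token);
  }

  private static final Metadata.Key<String> TOKEN_KEY =
      Metadata.Key.of("x-auth-token", Metadata.ASCII_STRING_MARSHALLER); // assumed header name

  private final AuthService authService;

  public AuthInterceptor(AuthService authService) {
    this.authService = authService;
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    String token = headers.get(TOKEN_KEY);
    if (token == null || !authService.isValid(token)) {
      // Reject the call before it reaches the persistence layer.
      call.close(Status.UNAUTHENTICATED.withDescription("Invalid auth token"), new Metadata());
      return new ServerCall.Listener<ReqT>() {};
    }
    return next.startCall(call, headers);
  }
}
```

Because every service talks to persistence through the Bridge, a single interceptor like this covers all of them without each service reimplementing the check.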
CQL Service
The CQL service will continue to function in the same way as it does today. It will be considered part of the Bridge and live within the same JVM as the persistence module. As the “native” protocol for this particular instance, it would be up to future persistence implementations to provide access to their own respective “native” protocols. Since it is part of the Bridge, it will also be able to take advantage of the same cross-cutting libraries.
REST API Service
The current rest-api module will be refactored into a separate microservice that uses the internal gRPC API. Dependence on OSGi will be removed (although we will likely reuse the health-checker module for this and the other HTTP-based API services to provide endpoints for liveness/readiness). The REST API service will continue to support both the REST v1 and v2 APIs. Any future deprecation of the v1 API, as determined by the PM, should have a horizon of several months, which means removal of v1 is not in scope for this effort.
GraphQL API Service
The current GraphQL API implementation will be factored out from its current place under the graphqlapi module into a separate microservice that uses the internal gRPC API. Dependence on OSGi will be removed.
Document API Service
The current Document API implementation will be factored out from its current place under the rest-api module into a separate microservice that uses the internal gRPC API. Dependence on OSGi will be removed. Initial analysis has indicated only a small amount of code overlap between the REST and Document API implementations; we'll create or add code to a common module (such as core) if that proves necessary.
Because the Docs API operates at a higher level than REST or GraphQL, it will not be able to follow the same pattern as those two services. Instead it will need to either pull back more data than it needs and perform filtering on its own side, or send a different sort of payload to the Bridge which is then interpreted and executed there. An example of such a payload would be an OR query. Both approaches have their benefits. Pulling back more data than necessary reduces the complexity of the Bridge and moves the resource penalty (high memory usage) to the Docs API service. On the other hand, handling this on the Bridge/persistence-module side would let us reuse the functionality in other services; one application would be adding ORs or JOINs to GraphQL. Resolution of this question is not required for at least the first two milestones. Learnings from this Docs API effort will be valuable for future complex APIs such as SQL, or for improvements to GraphQL.
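To make the trade-off concrete, here is a minimal sketch of the “filter on the Docs API side” option, assuming a hypothetical Row shape returned by the Bridge; the OR across two fields is evaluated in the service after over-fetching the candidate rows, which is where the memory cost shows up.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DocsOrFilterSketch {

  /** Hypothetical shape of a shredded document row returned by the Bridge. */
  interface Row {
    String path();
    String stringValue();
  }

  /** Emulates "field a = 'x' OR field b = 'y'" by over-fetching and filtering locally. */
  static List<Row> filterOr(List<Row> candidateRows) {
    return candidateRows.stream()
        .filter(row ->
            ("a".equals(row.path()) && "x".equals(row.stringValue()))
                || ("b".equals(row.path()) && "y".equals(row.stringValue())))
        .collect(Collectors.toList());
  }
}
```

The alternative, a structured OR payload interpreted by the Bridge, would push this logic (and the memory cost) down a layer and make it reusable by other services.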
Persistence
There will be little difference in the persistence modules offered by Stargate today. They will still operate as coordinators and retain all of their existing functionality. The primary changes required will be cosmetic, such as removing OSGi (if deemed necessary).
The Bridge will be packaged with each Persistence layer to produce deployable containers for each supported backend (C* 4.0, DSE 6.8, etc.).
Load Balancing, Authentication, and other cross-cutting concerns
We do not assume the use of an ingress for OSS deployments, but we should encourage it and perhaps add a sample configuration to the documentation.
For HTTP services we could move some of the cross-cutting functionality into a load balancer by exposing endpoints for rate limiting and ext_authz, but we would then leave it up to each user to implement this for their particular load balancer (assuming it supports that functionality). For CQL, we still need to implement this logic within Stargate since load balancers would not be able to act on the binary protocol.
Auth will continue to be pluggable as it is today.
Local development considerations
An important aspect of this new design will be ensuring ease of use on a developer’s local machine. We must avoid the need to spin up multiple services in order to test a change to a single part of Stargate.
For the full experience it should be straightforward for someone to deploy the entirety of Stargate to a local Kubernetes environment, while also having a “developer mode” that is lightweight and debugger friendly. Given that all components will be independently Dockerized, one option is for the developer to run the minimal set of Docker containers locally for integration testing. Another option is to allow isolation of a single service by providing a mock or in-memory backend that is capable of satisfying the gRPC interface without the full complexity of a coordinator.
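A rough sketch of that second option follows, using gRPC's in-process transport so a single service can be run and debugged against a fake Bridge without a coordinator. BridgeGrpc, QueryRequest, and QueryResponse are again placeholder names for the generated stubs.

```java
import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.inprocess.InProcessChannelBuilder;
import io.grpc.inprocess.InProcessServerBuilder;
import io.grpc.stub.StreamObserver;

public class MockBridge {

  /** Fake Bridge that returns a canned response instead of executing CQL. */
  static class FakeBridge extends BridgeGrpc.BridgeImplBase {
    @Override
    public void executeQuery(QueryRequest request, StreamObserver<QueryResponse> observer) {
      observer.onNext(QueryResponse.newBuilder().build()); // empty result set
      observer.onCompleted();
    }
  }

  /** Starts the fake Bridge in-process and returns a channel for the service under test. */
  public static ManagedChannel start() throws Exception {
    String name = InProcessServerBuilder.generateName();
    Server server =
        InProcessServerBuilder.forName(name).directExecutor().addService(new FakeBridge()).build();
    server.start();
    return InProcessChannelBuilder.forName(name).directExecutor().build();
  }
}
```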
Operationalizing
Deployment
Breaking apart Stargate will require changes to how we do our deployments, which will in turn require us to create a Helm chart. As we iterate, this chart can become more advanced and allow for a more customizable deployment (e.g. just REST or some other subset of services).
In the base case the chart will deploy each of our user-facing services as a separate pod, in addition to the coordinator pod (persistence, CQL, and Bridge). For resource-constrained environments, or environments that are sensitive to the number of pods, we can offer various combinations of pods and containers.
Monitoring
Stargate can be monitored in much the same manner as it is today by exposing a REST endpoint in each of the services. We can either implement this from scratch or rely on the support most Java frameworks provide out of the box. If we choose to do it from scratch, we should be able to reuse the existing health-checker module for much of what we need. This REST endpoint can also be used for Kubernetes readiness and liveness checks. By checking each of the services independently we'll be able to reduce the blast radius if something goes wrong: if the gRPC API fails a liveness check, the Document API won't be impacted. If the coordinator pod goes down, however, all other services will need to be marked down as well.
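As a sketch of what such a check could look like if we reuse the health-checker module on top of Dropwizard's HealthCheck abstraction (an assumption, not a decided approach), with GrpcBridgeClient as a placeholder for the service's Bridge client:

```java
import com.codahale.metrics.health.HealthCheck;

public class BridgeHealthCheck extends HealthCheck {

  /** Hypothetical wrapper around the service's Bridge stub. */
  public interface GrpcBridgeClient {
    boolean ping();
  }

  private final GrpcBridgeClient bridge;

  public BridgeHealthCheck(GrpcBridgeClient bridge) {
    this.bridge = bridge;
  }

  @Override
  protected Result check() {
    // If the Bridge (and therefore the coordinator) is unreachable, report unhealthy so
    // Kubernetes stops routing traffic to this service while others stay up.
    return bridge.ping() ? Result.healthy() : Result.unhealthy("Bridge unreachable");
  }
}
```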