-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(observability): API health checks + GraphQL #3514
Conversation
The actual default is 10, not 1000. Signed-off-by: Jesse Szwedko <[email protected]>
Signed-off-by: Kirill Fomichev <[email protected]>
Signed-off-by: Duy Do <[email protected]>
* Add fix Signed-off-by: Kruno Tomola Fabro <[email protected]> * Bump Signed-off-by: ktf <[email protected]> * Propagate panic Signed-off-by: Kruno Tomola Fabro <[email protected]> * Propagate panic Signed-off-by: Kruno Tomola Fabro <[email protected]>
…3492) Signed-off-by: Ana Hobden <[email protected]>
* Add process events Signed-off-by: ktf <[email protected]> * Use past tense Signed-off-by: Kruno Tomola Fabro <[email protected]> * One last metric capture Signed-off-by: Kruno Tomola Fabro <[email protected]> * One more move Signed-off-by: Kruno Tomola Fabro <[email protected]> * Remove unused use Signed-off-by: Kruno Tomola Fabro <[email protected]>
* Beep Signed-off-by: Kruno Tomola Fabro <[email protected]> * Comment Signed-off-by: Kruno Tomola Fabro <[email protected]>
* chore(add_tags transform): Add events to add_tags transform Signed-off-by: Ana Hobden <[email protected]> * Use refs Signed-off-by: Ana Hobden <[email protected]> * fmt Signed-off-by: Ana Hobden <[email protected]>
Bumps [pin-project](https://github.com/taiki-e/pin-project) from 0.4.22 to 0.4.23. - [Release notes](https://github.com/taiki-e/pin-project/releases) - [Changelog](https://github.com/taiki-e/pin-project/blob/master/CHANGELOG.md) - [Commits](taiki-e/pin-project@v0.4.22...v0.4.23) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [indexmap](https://github.com/bluss/indexmap) from 1.3.1 to 1.5.1. - [Release notes](https://github.com/bluss/indexmap/releases) - [Commits](indexmap-rs/indexmap@1.3.1...1.5.1) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
let server = new_pieces.api.take().unwrap(); | ||
|
||
info!( | ||
message = format!("{} API server", message).as_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be really nice if we could include the URL as a copy/pastable string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally it would be better to add the generator
events in a different PR, but given the simplicity of that source I don't see an issue.
This code is lacking documentation adequate to satisfy the review requirements listed: https://github.com/timberio/vector/blob/master/REVIEWING.md#documentation.
Otherwise, I think this is a very good first implementation step of this great epic!
@Hoverbear, thanks for the review! Subscriptions should work. I can't tell from your screenshots, but it may be that your client isn't attempting a WebSocket handshake and is instead defaulting to a regular POST/GET? A query with a |
So I should do like this? query foo {
subscription bar {
...
}
} |
The original subscription {
heartbeat(interval: 1000) {
utc
}
} The issue is likely that whichever GraphQL client you're using isn't switching to WebSockets when it detects a To give you an example of how we'd typically distinguish between regular queries and subscriptions in the UI, we'd split logic based on the query type, like this: return new ApolloClient({
cache,
link: ApolloLink.from([
// Split on HTTP and WebSockets
split( // <-- this is a helper from the Apollo Client to branch logic based on a boolean test
({ query }) => {
const definition = getMainDefinition(query);
return (
definition.kind === "OperationDefinition" &&
definition.operation === "subscription" // <-- determines the operation type is 'subscription'
);
},
// Use WebSockets for subscriptions
new WebSocketLink(ws),
// ... fall-back to HTTP for everything else
authHttp
),
]),
}); WebSockets are the de facto subscription protocol, but since it's opt-in and not always implemented, it often takes some configuration in the client. The GraphQL Playground electron app is one that should work out-the-box. Alternatively, you can go to When the schema stabilises, we'll wind up putting this into the vector-ui playground too, so we'll be able to test this stuff out without having to mess around with third-party clients and within our own UI. |
@leebenson Perfect, thanks. Great answer! |
Signed-off-by: MOZGIII <[email protected]>
Signed-off-by: Lee Benson <[email protected]>
Signed-off-by: Lee Benson <[email protected]>
Signed-off-by: Lee Benson <[email protected]>
Signed-off-by: Lee Benson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really interesting! For now, I'd really like to focus on the pros and cons of GraphQL, understanding its implementation and runtime characteristics, etc. It's hard to look at this and not think that we could do something simpler, so I'm curious how you weigh the complexity here.
|
||
// health handler, responds with { ok: true } | ||
pub async fn health() -> Result<impl Reply, Rejection> { | ||
Ok(json(&Health { ok: true })) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could probably be simplified a bit with the serde_json::json
macro. Something along the lines of
Ok(json!({"ok": true}))
// - balanced_or_tree!(@internal a, b ; c, d, e ; e) // now only one elem in counter | ||
// - balanced_or_tree!(a, b, c).or(balanced_or_tree(d, e)) // recurse on each sublist | ||
#[macro_export] | ||
macro_rules! balanced_or_tree { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I would really like to see numbers for our use case before we add such a complex optimization.
|
||
/// Returns a stream of `Metric`s, collected at the provided millisecond interval | ||
fn get_metrics(interval: i32) -> impl Stream<Item = Metric> { | ||
let controller = get_controller().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I'd caution against this is that it ties us pretty directly to the current implementation of metrics and doesn't give us a clear path to exposing non-metric data. The idea of making internal metrics a source is that they become less special (e.g. represented by the same data type as user metrics) and any improvements we make to vector's internal observability tools should help improve our users' ability to handle their own metrics.
Basically, internal metrics as a source was an intentional design decision so let's not sidestep it.
pub struct ConfigDiff { | ||
#[cfg(feature = "api")] | ||
pub api: Option<api::Difference>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would honestly drop all of this Difference
stuff for now. We can stick an on/off switch on GlobalOptions
and leave it at that for now. I don't care about people being able to turn this on and off dynamically.
#[derive(Deserialize, Serialize, Debug)] | ||
#[serde(deny_unknown_fields)] | ||
pub struct Config { | ||
#[serde(flatten)] | ||
pub global: GlobalOptions, | ||
#[cfg(feature = "api")] | ||
#[serde(default)] | ||
pub api: api::Options, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'd prefer we push this down into GlobalOptions
.
future::ready(high > 0) | ||
}, | ||
5, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that we try to avoid changing unrelated code for the time being, just to help focus the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't taken the time to follow some of the API discussions in depth, so this is mostly just a surface scan.
A lot of the comments you have started conversations with in the PR would be good in-code explanations of why you did things that way for those who will read it after the PR is resolved. Consider adding them to the code.
let mut interval = tokio::time::interval(Duration::from_millis(interval as u64)); | ||
|
||
stream! { | ||
while let _ = interval.tick().await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be an infinite loop (Interval::tick
cannot fail), so:
while let _ = interval.tick().await { | |
loop { | |
interval.tick().await; |
#[derive(Default, Debug, Deserialize, Serialize, PartialEq)] | ||
pub struct Options { | ||
#[serde(default = "default_enabled")] | ||
pub enabled: bool, | ||
|
||
#[serde(default = "default_bind")] | ||
pub bind: Option<SocketAddr>, | ||
|
||
#[serde(default = "default_playground")] | ||
pub playground: bool, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use #[derivative(Default)]
to have the same defaults for serde and within Rust?
#[derive(Debug)] | ||
pub struct GeneratorEventProcessed; | ||
|
||
impl InternalEvent for GeneratorEventProcessed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be rebasing that after some other things are merged, so no worries at this point.
@@ -87,6 +88,8 @@ impl GeneratorConfig { | |||
.lines | |||
.iter() | |||
.map(|line| { | |||
emit!(GeneratorEventProcessed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be the best place, but it should work fine.
Thanks for your review, @lukesteensen. Will address the points you raised more specifically on Monday, but I wanted to touch base on this real quick:
I'd summarise the pros as:
And the cons as:
I think it's important a strong emphasis is played on both the client, as much as the server, when considering how the data we're exposing is being consumed. You comment about opting for something simpler, I think, is weighted toward Vector; the UI has to solve connection handling, data fetching, retries, state management, caching, propogation across multiple components, single and real-time queries (often multiple in-flight at once), error handling and other concerns that we have to define an approach for, irrespective of protocol, with make many of these 'simpler approaches' anything but. GraphQL simplifies many of those areas by providing an ecosystem of tooling, and a well-defined spec, to avoid the need for DIY implementations. I've found other choices lack in one or more of these areas, namely:
This doesn't encompass real-time, which is where much of the complexity is added. My original PR featured a DIY WebSockets approach, but after speaking more about this with @sghall, it became clearer that we'd ultimately be re-writing much of what GraphQL was intended to solve and likely costing ourselves (what I stab-in-the-dark estimate to be) several weeks/months of dev time. The client libs we're swapping a custom approach for, by comparison, have a 2-5+ year commit history. |
@lukesteensen, just on this point:
There's a Slack thread with some extra comments on that point that might be worth weighing in on. If we're ultimately surfacing internal configs and event lines, which the UI designs seem to suggest will be the case, insecure-by-default and 'not able to flick the off switch' are positions we should consider carefully. |
@lukesteensen - Apologies, GH isn't letting me inline respond to all comments, hence the break-out:
To be clear, this means of consuming metrics data isn't intended to limit the consumption of non-metric data. It doesn't, for example, prevent the API from reading topology data, or exposing It's simply a way for a an individual client subscription (which may exist in its own green thread, detached from all other subscribing clients) to return a stream that's unique to that one connection, and more specifically, to serve the query that was executed against it -- factoring both the requested fields, and any params that control the stream (such as the interval by which to return data.) If the intention is to both detach the API from the topology module and also consume it as a source, I'm not clear how those two objectives coincide. Treating the API as a quasi-sink seems to require it to be a component of toplogy, unless I'm missing something. Reading from the controller directly, per-subscription, and being able to granularly define an Again, apologies if I'm misunderstanding something fundamental here, I'm both trying to learn how the code works and catch up on design decisions that may impact the future landscape, which isn't a balance I have right yet! |
@lukesteensen I think it would be better for the GraphQL debate (if there is going to be another one) to go back to the RFC process. As @leebenson pointed out, this has many implications for client/server communication, tooling and the way the UI would be built. This IMO is outside the scope of this PR. Everyone working on this project should be included in the process and have a chance to comment 👍 |
Closing this and moving to a separate PR to avoid a painful rebase. The API is being pulled out of topology anyway so v2 should be slimmer. |
Initial integration of the API server. Closes #1073. Closes #3225.
Motivation
This forms the foundation of making a Vector instance observable, and exposing metrics via
vector top
(#3211) and wider work on an upcoming Vector UI.Demo
https://www.loom.com/share/a2a83e4a6c0847b7b820af8b439c23f7
Features
/playground
GraphQL playground/health
endpoint, returning 200 response +{"ok": true}
/graphql
GraphQL endpoint, taking the following health query:{ health }
And responding with:
Caveats / blockers
Due to the use ofEdit: Switched to async-graphql for stability.weak_into_raw
in Juniper, the current Rust 1.44 toolchain is incompatible.TODOS
lib/api
. Note: later metrics subscriptions will rely on chore: Move the Event and related types into a library #3357JoinHandle
fromapi.stop()
, instead of the current fire-and-forget for graceful shutdown of the API (may be overkill) Update: I don't think it's worth it. Punting.tls
to a separate crate?)heartbeat
)Later
/graphql
+/playground
configurable