
feat(observability): API health checks + GraphQL #3514

Closed · wants to merge 118 commits

Conversation

@leebenson (Contributor) commented Aug 21, 2020

Initial integration of the API server. Closes #1073. Closes #3225.

Motivation

This forms the foundation for making a Vector instance observable, exposing metrics via vector top (#3211), and wider work on an upcoming Vector UI.

Demo

https://www.loom.com/share/a2a83e4a6c0847b7b820af8b439c23f7

Features

A GraphQL endpoint, served at /graphql, currently handling a health query:

{
  health
}

And responding with:

{
  "data": {
    "health": true
  }
}
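For context, a minimal sketch of the kind of resolver that backs this in async-graphql; the names and wiring here are illustrative, not necessarily the PR's actual code:

use async_graphql::{EmptyMutation, EmptySubscription, Object, Schema};

pub struct Query;

#[Object]
impl Query {
    /// Resolves the `{ health }` query above; returns `true` while the API server can respond.
    async fn health(&self) -> bool {
        true
    }
}

/// Builds a query-only schema (no mutations or subscriptions).
pub fn schema() -> Schema<Query, EmptyMutation, EmptySubscription> {
    Schema::build(Query, EmptyMutation, EmptySubscription).finish()
}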

Caveats / blockers

Due to the use of weak_into_raw in Juniper, the current Rust 1.44 toolchain is incompatible. Edit: Switched to async-graphql for stability.

TODOS

  • Move to a cargo workspace crate in lib/api. Note: later metrics subscriptions will rely on #3357 (chore: Move the Event and related types into a library).
  • Add tests. This should include an integration test for spawning the server and asserting health checks (see the sketch after this list).
  • Consider whether it's worth returning a JoinHandle from api.stop(), instead of the current fire-and-forget, for graceful shutdown of the API (may be overkill). Update: I don't think it's worth it. Punting.
  • Punt on TLS for now; tackle later (possibly moving tls to a separate crate?)
  • Add a health subscription (in the form of heartbeat)
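To make the testing TODO concrete, a hypothetical integration test for the health check; the bind address, the reqwest/serde_json/tokio usage, and the assumption that the server has already been spawned by the test harness are all illustrative, not taken from this PR:

#[tokio::test]
async fn api_health_check_responds() {
    // Assumes the API server has been started elsewhere in the test setup.
    let addr = "127.0.0.1:8686";

    // Issue the GraphQL health query as a plain HTTP POST.
    let client = reqwest::Client::new();
    let res = client
        .post(format!("http://{}/graphql", addr))
        .json(&serde_json::json!({ "query": "{ health }" }))
        .send()
        .await
        .expect("request failed");

    // The response should match the shape shown in the PR description.
    let body: serde_json::Value = res.json().await.expect("invalid JSON");
    assert_eq!(body["data"]["health"], true);
}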

Later

  • Enable TLS binding. Separately? Globally?
  • Improve error handling. Currently binding to an existing port will panic (one possible non-panicking shape is sketched after this list).
  • Make /graphql + /playground configurable
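A generic sketch of the non-panicking bind mentioned above, using a plain tokio listener as a stand-in; the PR's actual server wiring goes through warp, so this is illustrative only:

use std::net::SocketAddr;
use tokio::net::TcpListener;

/// Attempts to bind the API listener, surfacing the error instead of panicking.
async fn try_bind(addr: SocketAddr) -> Result<TcpListener, String> {
    TcpListener::bind(addr)
        .await
        .map_err(|e| format!("Unable to bind API server to {}: {}", addr, e))
}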

leebenson and others added 17 commits August 18, 2020 14:50

  • The actual default is 10, not 1000.
  • Add fix / Bump / Propagate panic
  • Add process events / Use past tense / One last metric capture / One more move / Remove unused use
  • Beep / Comment
  • chore(add_tags transform): Add events to add_tags transform / Use refs / fmt
  • Bump pin-project from 0.4.22 to 0.4.23 (taiki-e/pin-project@v0.4.22...v0.4.23)
  • Bump indexmap from 1.3.1 to 1.5.1 (indexmap-rs/indexmap@1.3.1...1.5.1)
@leebenson added the "type: enhancement" and "domain: observability" labels Aug 21, 2020
@leebenson self-assigned this Aug 21, 2020
@leebenson marked this pull request as ready for review August 28, 2020 07:20
@leebenson requested a review from bruceg as a code owner August 28, 2020 07:20
let server = new_pieces.api.take().unwrap();

info!(
    message = format!("{} API server", message).as_str(),

It'd be really nice if we could include the URL as a copy/pastable string.
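For illustration, one way the bind address could be folded into that log line; the config.api.bind field name is an assumption here, not the PR's actual API:

// Assumed: `config.api.bind` holds the API server's SocketAddr.
let addr = config.api.bind.expect("API bind address");
info!(
    message = format!("{} API server", message).as_str(),
    url = format!("http://{}/graphql", addr).as_str(),
);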

@Hoverbear (Contributor)

I presume subscriptions are for a future PR?

Otherwise, looking good. :)

(screenshots attached)

@Hoverbear (Contributor) left a review comment

Generally it would be better to add the generator events in a different PR, but given the simplicity of that source I don't see an issue.

This code lacks documentation adequate to satisfy the review requirements listed at https://github.com/timberio/vector/blob/master/REVIEWING.md#documentation.

Otherwise, I think this is a very good first implementation step of this great epic!

@leebenson (Contributor, Author)

@Hoverbear, thanks for the review!

Subscriptions should work. I can't tell from your screenshots, but it may be that your client isn't attempting a WebSocket handshake and is instead defaulting to a regular POST/GET? A query with a subscription block should do a WS upgrade.

@Hoverbear (Contributor)

So I should do it like this?

query foo {
  subscription bar {
    ...
  }
}

@leebenson (Contributor, Author) commented Aug 28, 2020

The original subscription block you have is correct:

subscription {
  heartbeat(interval: 1000) {
    utc 
  }
}

The issue is likely that whichever GraphQL client you're using isn't switching to WebSockets when it detects a subscription block.

To give you an example of how we'd typically distinguish between regular queries and subscriptions in the UI, we'd split logic based on the query type, like this:

return new ApolloClient({
  cache,
  link: ApolloLink.from([
    // Split on HTTP and WebSockets
    split(  // <-- this is a helper from the Apollo Client to branch logic based on a boolean test
      ({ query }) => {
        const definition = getMainDefinition(query);
        return (
          definition.kind === "OperationDefinition" &&
          definition.operation === "subscription" // <-- determines the operation type is 'subscription'
        );
      },
      // Use WebSockets for subscriptions
      new WebSocketLink(ws),
      // ... fall-back to HTTP for everything else
      authHttp
    ),
  ]),
});

WebSockets are the de facto subscription protocol, but since support is opt-in and not always implemented, it often takes some configuration in the client.

The GraphQL Playground Electron app is one that should work out of the box. Alternatively, you can go to https://localhost:8686/playground, which this PR includes. The Electron app is a slightly more recent version that has 'tail' scrolling and is slightly easier to test subscriptions with, but otherwise it's identical.

When the schema stabilises, we'll wind up putting this into the vector-ui playground too, so we'll be able to test this stuff out within our own UI without having to mess around with third-party clients.

@Hoverbear (Contributor)

@leebenson Perfect, thanks. Great answer!

MOZGIII and others added 5 commits August 28, 2020 19:27
@lukesteensen (Member) left a review comment

This is looking really interesting! For now, I'd really like to focus on the pros and cons of GraphQL, understanding its implementation and runtime characteristics, etc. It's hard to look at this and not think that we could do something simpler, so I'm curious how you weigh the complexity here.


// health handler, responds with { ok: true }
pub async fn health() -> Result<impl Reply, Rejection> {
    Ok(json(&Health { ok: true }))

This could probably be simplified a bit with the serde_json::json macro. Something along the lines of

Ok(json!({"ok": true}))
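Spelled out as a compilable sketch, assuming warp's reply::json helper (as used in the original handler) plus serde_json:

use serde_json::json;
use warp::{reply::json as to_json, Rejection, Reply};

// Simplified health handler: no dedicated Health struct, just an inline JSON body.
pub async fn health() -> Result<impl Reply, Rejection> {
    Ok(to_json(&json!({ "ok": true })))
}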

// - balanced_or_tree!(@internal a, b ; c, d, e ; e) // now only one elem in counter
// - balanced_or_tree!(a, b, c).or(balanced_or_tree(d, e)) // recurse on each sublist
#[macro_export]
macro_rules! balanced_or_tree {

Yeah, I would really like to see numbers for our use case before we add such a complex optimization.


/// Returns a stream of `Metric`s, collected at the provided millisecond interval
fn get_metrics(interval: i32) -> impl Stream<Item = Metric> {
    let controller = get_controller().unwrap();

The main reason I'd caution against this is that it ties us pretty directly to the current implementation of metrics and doesn't give us a clear path to exposing non-metric data. The idea of making internal metrics a source is that they become less special (e.g. represented by the same data type as user metrics) and any improvements we make to vector's internal observability tools should help improve our users' ability to handle their own metrics.

Basically, internal metrics as a source was an intentional design decision so let's not sidestep it.

pub struct ConfigDiff {
    #[cfg(feature = "api")]
    pub api: Option<api::Difference>,

I would honestly drop all of this Difference stuff for now. We can stick an on/off switch on GlobalOptions and leave it at that for now. I don't care about people being able to turn this on and off dynamically.

#[derive(Deserialize, Serialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct Config {
    #[serde(flatten)]
    pub global: GlobalOptions,
    #[cfg(feature = "api")]
    #[serde(default)]
    pub api: api::Options,

Again, I'd prefer we push this down into GlobalOptions.

future::ready(high > 0)
},
5,
)

I'd prefer that we try to avoid changing unrelated code for the time being, just to help focus the PR.

@bruceg (Member) left a review comment

I haven't taken the time to follow some of the API discussions in depth, so this is mostly just a surface scan.

A lot of the comments you've used to start conversations in this PR would make good in-code explanations of why you did things that way, for those who read the code after the PR is resolved. Consider adding them to the code.

let mut interval = tokio::time::interval(Duration::from_millis(interval as u64));

stream! {
    while let _ = interval.tick().await {

This appears to be an infinite loop (Interval::tick cannot fail), so:

Suggested change:
-    while let _ = interval.tick().await {
+    loop {
+        interval.tick().await;
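For illustration, the suggested loop form dropped into a self-contained stream of the same shape; the heartbeat-style timestamp item and the async-stream/tokio/chrono usage are stand-ins of mine, not this diff's Metric stream:

use async_stream::stream;
use futures::Stream;
use std::time::Duration;

/// Emits the current UTC time at the requested millisecond interval.
fn heartbeat(interval_ms: u64) -> impl Stream<Item = chrono::DateTime<chrono::Utc>> {
    let mut interval = tokio::time::interval(Duration::from_millis(interval_ms));
    stream! {
        loop {
            interval.tick().await;
            yield chrono::Utc::now();
        }
    }
}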

Comment on lines +5 to +15
#[derive(Default, Debug, Deserialize, Serialize, PartialEq)]
pub struct Options {
    #[serde(default = "default_enabled")]
    pub enabled: bool,

    #[serde(default = "default_bind")]
    pub bind: Option<SocketAddr>,

    #[serde(default = "default_playground")]
    pub playground: bool,
}

Would it make sense to use #[derivative(Default)] to have the same defaults for serde and within Rust?
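For reference, a sketch of how that might look with the derivative crate; the concrete default values and the container-level serde(default) are assumptions for illustration, not the PR's settings:

use derivative::Derivative;
use serde::{Deserialize, Serialize};
use std::net::SocketAddr;

#[derive(Derivative, Debug, Deserialize, Serialize, PartialEq)]
#[derivative(Default)]
#[serde(default)]
pub struct Options {
    /// API server enabled by default.
    #[derivative(Default(value = "true"))]
    pub enabled: bool,

    /// Default bind address (illustrative value).
    #[derivative(Default(value = r#"Some("127.0.0.1:8686".parse().unwrap())"#))]
    pub bind: Option<SocketAddr>,

    /// Playground served by default.
    #[derivative(Default(value = "true"))]
    pub playground: bool,
}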

#[derive(Debug)]
pub struct GeneratorEventProcessed;

impl InternalEvent for GeneratorEventProcessed {

I'll be rebasing that after some other things are merged, so no worries at this point.

@@ -87,6 +88,8 @@ impl GeneratorConfig {
    .lines
    .iter()
    .map(|line| {
        emit!(GeneratorEventProcessed);

This might not be the best place, but it should work fine.

@leebenson (Contributor, Author) commented Aug 28, 2020

Thanks for your review, @lukesteensen. Will address the points you raised more specifically on Monday, but I wanted to touch base on this real quick:

For now, I'd really like to focus on the pros and cons of GraphQL, understanding its implementation and runtime characteristics, etc. It's hard to look at this and not think that we could do something simpler, so I'm curious how you weigh the complexity here.

I'd summarise the pros as:

  • A strongly typed API, front and back. Clients know what's available, and the shape of data at compile time.
  • Trivial to consume in the UI. This is perhaps less obvious in the back-end, but the UI benefits from this enormously. There's a glut of complexity that we side-step by opting for a well-defined spec and relatively well-understood tooling that will lighten the burden of what will undoubtedly be a very complex front-end app. I'd defer to @sghall's thoughts here too.
  • A year's worth of experience working with GraphQL in Alloy, across both the UI and the back-end (albeit in Go; my personal experience with GraphQL goes back to 2015), and a compatible toolchain in vector-ui.
  • Payloads that lend themselves well to highly structured data.
  • Regular POST semantics for one-time queries. No magic over HTTP.
  • Subscriptions over WebSockets; a well-tested web protocol.
  • JSON payloads.
  • Relatively lightweight to overlay existing types and libs.
  • Leverages existing libs; Warp in our case, for example.
  • Strong tooling: A heavily used React client, GraphQL Code Generator to generate Typescript types + declarative queries/subscriptions, Chrome extensions, a built-in GraphQL Playground, etc.
  • Heavy use in production across a variety of APIs; not niche.

And the cons as:

  • Generally geared toward graph data, although this isn't a hard requirement.
  • Queries with unknown levels of nesting can't be queried statically; generally, a need for queries with flatter structures arises.
  • Versioning is trickier, given the lack of distinct endpoints. Nothing stopping multiple endpoints representing versions, although typically this is done in the schema.
  • Some annoying rough edges (the lack of input unions, for example), typically worked around with more specific queries.
  • Reliance on a relatively young library rather than a 'pure' client like a stdlib tcp server.
  • A need to invest, longer-term, in understanding and contributing to that external library.

I think it's important that a strong emphasis is placed on the client, as much as the server, when considering how the data we're exposing is being consumed.

Your comment about opting for something simpler is, I think, weighted toward Vector; the UI has to solve connection handling, data fetching, retries, state management, caching, propagation across multiple components, single and real-time queries (often multiple in-flight at once), error handling, and other concerns that we have to define an approach for irrespective of protocol, which make many of these 'simpler approaches' anything but. GraphQL simplifies many of those areas by providing an ecosystem of tooling, and a well-defined spec, to avoid the need for DIY implementations.

I've found other choices lack in one or more of these areas, namely:

  • REST. Not type safe by default, moving the burden to runtime. Heavy type duplication during dev. Swagger or JSON schemas can be utilised, but the tooling tends to be less geared toward our UI libs and often requires augmentation at the server level to generate the required spec (and if we're augmenting it, why not just go with GraphQL?)

  • gRPC. Issues mentioned in the PR. Namely, gRPC proper and its lack of web compatibility; issues with gRPC-Web, both deployment and tooling, that add complexity/limitations.

  • Protobufs. Handles the payload, but comms and the other listed concerns are unanswered.

This doesn't encompass real-time, which is where much of the complexity is added. My original PR featured a DIY WebSockets approach, but after speaking more about this with @sghall, it became clearer that we'd ultimately be re-writing much of what GraphQL was intended to solve and likely costing ourselves (what I stab-in-the-dark estimate to be) several weeks/months of dev time. The client libs we're swapping a custom approach for, by comparison, have a 2-5+ year commit history.

@leebenson (Contributor, Author)

@lukesteensen, just on this point:

I don't care about people being able to turn this on and off dynamically.

There's a Slack thread with some extra comments on that point that might be worth weighing in on.

If we're ultimately surfacing internal configs and event lines, which the UI designs seem to suggest will be the case, insecure-by-default and 'not able to flick the off switch' are positions we should consider carefully.

@leebenson (Contributor, Author) commented Aug 28, 2020

@lukesteensen - Apologies, GH isn't letting me inline respond to all comments, hence the break-out:

The main reason I'd caution against this is that it ties us pretty directly to the current implementation of metrics and doesn't give us a clear path to exposing non-metric data. The idea of making internal metrics a source is that they become less special (e.g. represented by the same data type as user metrics) and any improvements we make to vector's internal observability tools should help improve our users' ability to handle their own metrics.

Basically, internal metrics as a source was an intentional design decision so let's not sidestep it.

To be clear, this means of consuming metrics data isn't intended to limit the consumption of non-metric data. It doesn't, for example, prevent the API from reading topology data, or exposing vector.toml validation tooling, or later introducing the 'tap' command, which I imagined would be unrelated to the internal_metrics source.

It's simply a way for an individual client subscription (which may exist in its own green thread, detached from all other subscribing clients) to return a stream that's unique to that one connection, and more specifically, to serve the query that was executed against it, factoring in both the requested fields and any params that control the stream (such as the interval at which to return data).

If the intention is to both detach the API from the topology module and also consume it as a source, I'm not clear how those two objectives coincide. Treating the API as a quasi-sink seems to require it to be a component of topology, unless I'm missing something. Reading from the controller directly, per-subscription, and being able to granularly define an interval outside the context of a sink, seemed to fit the API paradigm better.

Again, apologies if I'm misunderstanding something fundamental here; I'm both trying to learn how the code works and catch up on design decisions that may impact the future landscape, which isn't a balance I have right yet!

@sghall (Member) commented Aug 30, 2020

@lukesteensen I think it would be better for the GraphQL debate (if there is going to be another one) to go back to the RFC process. As @leebenson pointed out, this has many implications for client/server communication, tooling and the way the UI would be built. This IMO is outside the scope of this PR. Everyone working on this project should be included in the process and have a chance to comment 👍

@leebenson (Contributor, Author)

Closing this and moving to a separate PR to avoid a painful rebase. The API is being pulled out of topology anyway so v2 should be slimmer.

@leebenson closed this Sep 4, 2020
@leebenson deleted the leebenson/api branch October 1, 2020 08:36
Successfully merging this pull request may close these issues: "Real-time communication R&D and POC" and "Add HTTP ready check".