GCP Developer Exam Study Guide - Part V
Part 5 of 6
Last month I took the Google Cloud Platform Professional Developer Exam. To prepare, I put together a study guide. I'm posting it here in five parts. Hopefully, it will help someone else with the exam. You can see the full study guide at my GitHub.
Section 5: Managing Application Performance Monitoring
5.1 Installing the logging and monitoring agent:
- The monitoring agent can be installed with the following:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
There must be a workspace/project set up for Stackdriver. When using Stackdriver for AWS resources, there must be a GCP connector project. The agent must be authorized (the service account must have the correct rights).
5.2 Managing VMs:
- Debugging a custom VM image using a serial port: Use the gcloud compute connect-to-serial-port [INSTANCE_NAME] command to connect.
- Analyzing a failed Compute Engine VM startup: First, check the serial port output (above). The BIOS, bootloader, and kernel print debug messages there. You can enable interactive access to log in to an instance that's not fully booted. Next, verify that the file system on the disk is valid. Detach the disk (or delete the VM and keep the disk with gcloud compute instances delete old-instance --keep-disks boot), create a new VM, attach the disk, SSH in, identify the root partition of the disk in question, run a file system check, mount the file system, and check that the disk has kernel files. If all of that checks out, verify that the disk has a valid master boot record.
- Sending logs from a VM to Stackdriver: Stackdriver automatically receives all GCP infrastructure logs (start, stop, etc.). To send logs from the app or OS level, the Stackdriver agent must be installed.
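As a rough illustration of the first troubleshooting step (checking the serial port output), the sketch below builds a gcloud compute instances get-serial-port-output command and scans the captured text for lines that hint at a boot failure. The instance name, zone, and the list of failure markers are made-up examples, not an official list, so adjust them for your own image.

```python
import subprocess

# Hypothetical markers; real boot failures may print different messages.
FAILURE_MARKERS = ("kernel panic", "unable to mount root", "no bootable device")


def serial_output_command(instance, zone):
    """Build the gcloud command that prints an instance's serial port output."""
    return [
        "gcloud", "compute", "instances", "get-serial-port-output",
        instance, "--zone", zone,
    ]


def find_boot_errors(serial_text, markers=FAILURE_MARKERS):
    """Return the lines of serial output that contain any failure marker."""
    hits = []
    for line in serial_text.splitlines():
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append(line.strip())
    return hits


def check_instance(instance, zone):
    """Fetch serial output with gcloud (must be installed and authenticated)."""
    result = subprocess.run(
        serial_output_command(instance, zone),
        capture_output=True, text=True, check=True,
    )
    return find_boot_errors(result.stdout)


if __name__ == "__main__":
    # Canned sample text so the sketch runs without a GCP project.
    sample = "Booting...\nKernel panic - not syncing: VFS: Unable to mount root fs\n"
    print(find_boot_errors(sample))
```

Calling check_instance with a real instance name and zone runs the same scan against live serial output. An empty list only means none of the assumed markers appeared, not that the boot was healthy.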
5.3 Viewing application performance metrics using Stackdriver.
- Creating a monitoring dashboard: Stackdriver SLI metrics can be used for monitoring the four 'golden signals' described in the SRE book: latency, traffic, errors, and saturation. Determine which metrics best show the telemetry for those signals. Once you know which metrics you want, go to the Stackdriver Monitoring UI and select Dashboard => Create dashboard. Click the Add Chart button to, well, add a chart. Build a chart for each of the metrics you identified.
- Viewing syslogs from a VM: Add the agent to the VM (see above) and add a custom fluentd config pointing to the log location. Reload the google-fluentd service, and the logs should be sent to Stackdriver.
- Writing custom metrics and creating metrics from logs: You can create and import metrics with OpenCensus. When doing it by hand, the metric must be assigned a unique name that begins with custom.googleapis.com/. The rest of the name can be a simple string or a path, to organize metrics logically. Provide a MetricDescriptor with information about the metric: name, project, value/type/units, and what resources will be included in the metric's time series data points. Log-based metrics are metrics that are based on log entries. They can be based on the number of log entries of a specific type, or on latency information recorded in the entries.
- Graphing metrics: You can view metrics as charts. Charts can be created on any metric, including custom metrics. You can specify the data to appear and the configuration of the chart. You can also use a third-party data viz/log platform like Grafana if you want.
- Using Stackdriver Debugger: This tool allows you to debug production applications. You can insert snapshots, which capture the state (local variables and call stack) of an application at a specific line in the code. The snapshot is taken when that line of code is hit while running. You can also request specific info, like self.request.environ['HTTP_USER_AGENT'], in a snapshot. You can inject a debug logpoint, which lets you add logging to a running app without restarting it. Debugger can be configured for all GCP compute environments with most runtimes.
- Streaming logs from the GCP console: Go to the Logs Viewer page in the console. Logs are scoped to the project level in the console. Logs can be filtered with the basic or advanced interface. Use the play arrow icon to stream incoming logs as they are received.
- Reviewing stack traces for error analysis: Tracing allows users to see what path through various services a call takes, with information about latency, what functions were invoked, and other details to identify bottlenecks.
- Setting up log sinks: You can export logs outside of Stackdriver. This lets you store logs for longer than Stackdriver's retention period, perform big data analysis on them, or send them to other logging tools. To export logs, you need a sink, which routes matching log entries to a destination. Sinks have an identifier, a parent resource (usually a project, but it can be a folder, billing account, or org), a filter that determines which logs are exported (for example, only errors, or only those related to a specific service), and a destination (a bucket, a BigQuery dataset, or a Pub/Sub topic to stream to another application).
- Viewing logs in the GCP console: As with streaming logs in the console, navigate to Stackdriver > Logging > Logs. Logs will be displayed there. They can be searched or filtered.
- Profiling performance of request-response: Profiler can show where in the request-response lifecycle the most resources are being used, to determine where source code may need to be optimized. There was nothing in the docs that focused on this particular use case, so YMMV.
- Profiling services: You can use Stackdriver Profiler to gather information about CPU usage and memory allocation from apps and map the consumption back to source code, to identify intensive operations and other information about the source code.
- Reviewing application performance using Stackdriver Trace and Stackdriver Logging: Use Trace to see the spans of HTTP requests in an SOA app. You can see which calls are taking the most time and where the bottlenecks are (similar to Jaeger, the open source distributed tracing tool). Logging provides a single pane of glass to view platform and application logs. Based on bottlenecks identified in Trace, you can filter the logs to view those related to the specific service that is performing poorly, to determine what changes would best address the issues.
- Monitoring and profiling a running application: After configuring Profiler in an app, you can view the app in the Profiler console. It generates a flame graph for examining the data. Data can be viewed by service and filtered on a number of categories. The levels in the graph represent all processes, from the entire executable (100% of all resources used), down through the modules, into the specific functions. The exact breakout will vary by runtime/language. Using Profiler, you can identify specific functions in an application that are consuming the most resources. These may be candidates for refactoring or other optimization.
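Going back to the custom metrics item in this section, here is a minimal sketch that builds the JSON body for a MetricDescriptor using only the standard library, to make the naming rule concrete. The field names follow the Monitoring API's MetricDescriptor resource as I understand it. The metric path, description, and helper name are made up for illustration. Actually creating the descriptor would still take an authenticated API call, which is not shown.

```python
import json

# Required prefix for hand-written custom metric names.
CUSTOM_PREFIX = "custom.googleapis.com/"


def build_metric_descriptor(metric_path, description, metric_kind="GAUGE",
                            value_type="DOUBLE", unit="1"):
    """Return a dict shaped like a Monitoring API MetricDescriptor body.

    metric_path is the part after the required prefix, e.g. "shop/queue_depth".
    """
    if not metric_path or metric_path.startswith("/"):
        raise ValueError("metric_path must be a non-empty relative path")
    if metric_kind not in ("GAUGE", "DELTA", "CUMULATIVE"):
        raise ValueError("unknown metric kind: " + metric_kind)
    return {
        "type": CUSTOM_PREFIX + metric_path,
        "metricKind": metric_kind,
        "valueType": value_type,
        "unit": unit,
        "description": description,
    }


if __name__ == "__main__":
    # Hypothetical metric used only to show the resulting structure.
    descriptor = build_metric_descriptor(
        "shop/queue_depth", "Items waiting in the work queue"
    )
    print(json.dumps(descriptor, indent=2))
```

Using a path like shop/queue_depth rather than a flat string is what the "organize metrics logically" advice amounts to: related metrics end up grouped under a shared prefix.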
5.4 Diagnosing and resolving application performance issues.
- Setting up uptime checks and other basic alerts: An uptime check is a GET request on a URL at a specified interval. The results of the check are written into Stackdriver logs (and can be exported to another logging platform). You can set up alerting to act on failures by sending an email or using another notification channel.
- Setting up logging and tracing: Enable logging by installing and configuring the Stackdriver agent on the relevant VM/service. See above for details. Trace is enabled by default on App Engine standard. It can be configured on other compute resources using C#, Java, Go, Node.js, PHP, Python, and Ruby.
- Setting up resources monitoring: GCP monitors a staggering number of resource types. Monitoring of the default metrics for GCP resources is set up in Stackdriver. Custom metrics can be created (see above). AWS resources can be monitored, but this must be configured through a connector project.
- Troubleshooting network issues: Trace is the service best suited to identifying network issues. It follows a request from origin to completion, so each part of the lifecycle can be viewed to identify latency and other issues.
- Debugging/tracing cloud apps: Use trace to follow calls through your app to identify what calls what, and where bottlenecks occur. If a bottleneck or other issue is identified, use debugger to create a snapshot of the 'state' of the app at that point, to understand what the issue is.
- Troubleshooting issues with the image/OS: If a root drive is not working as intended, detach it, and mount it as a secondary volume on another vm. From there, you can search for corrupted files or configuration issues that may be impacting the vm/image. If the stackdriver agent is installed, the logs may be useful as well.
- GCP docs are pretty good. Relevant links to docs, tutorials, blogs, and other resources are peppered in throughout.
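Going back to the uptime check item in 5.4, here is a minimal sketch of what a single check boils down to: a GET request against a URL that reports whether it succeeded and how long it took. This is not how Stackdriver implements it. The function name, the default timeout, and the rule that only 2xx responses count as "up" are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def run_uptime_check(url, timeout_seconds=10.0):
    """Send one GET request and report status, success, and latency."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status code.
        status = err.code
    except (urllib.error.URLError, OSError):
        # No usable answer: DNS failure, refused connection, timeout, etc.
        status = None
    latency_ms = (time.monotonic() - started) * 1000.0
    return {
        "url": url,
        "status": status,
        # Assumption: only 2xx responses count as "up".
        "ok": status is not None and 200 <= status < 300,
        "latency_ms": latency_ms,
    }
```

Calling run_uptime_check on a URL you own inside a loop, with time.sleep between calls, would approximate the "specified interval" part. An alert is then just a rule on the ok field, for example notifying someone after several consecutive failures.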