Access Service: Temporary Access to the Cloud
Jun 1, 2021
By Andy Li
Access is always changing. When you start at a new company, you usually are given access to a set of apps provisioned to you on day one, based on your team and role. Even on day one, there can be a difference between the access you are granted and the access you need to do your job. This results in two outcomes: underprovisioned or overprovisioned access.
For the IT and security teams who manage cloud infrastructure accounts, securing access to them can be difficult and scary; the systems are complex, and the stakes are high. If you grant too much access, you might allow bad actors access to your tools and infrastructure, which at best results in a breach notification; at worst, it results in a company-ending, game-over scenario. If you grant too little access, you put roadblocks between your colleagues and the work they need to do, meaning you are decreasing your company’s productivity.
Overprovisioned access
A common approach taken by startups and small companies is to grant access permissively. In these companies, early productivity can be critical to the success of the business. An employee locked out of a system because of missing access means lost productivity and lost income for the business.
If you give employees permanent admin access to every system, you optimize for velocity, but at the expense of increased risks from compromised employee accounts and insider threats. This results in an increased attack surface. As your company grows, it becomes more important to secure access to critical resources, and this requires a different approach.
Underprovisioned access
If you give employees too little access, it forces them to request access more often. Although new employees are initially given access based on their team and role, new duties and new projects can quickly increase the scope of the access they need. Depending on your company’s process for providing access, this can be cumbersome for the requester, for the approver, or oftentimes, for both.
Here at Segment, we have production environments across Amazon Web Services (AWS) and Google Cloud Platform (GCP). We need to secure access to these accounts thoughtfully so that our engineers can continue to build fast and safely. At many companies, you might rely on a centralized team to manage internal access. While this is a simple approach, it does not scale – team members have a limited amount of context surrounding requests, and might accidentally over-provision the requester’s access. At Segment, we approached the problem of managing least-privilege cloud access by building Access Service: a tool that enables time-based, peer-reviewed access.
Setting the stage: access at Segment
At Segment, we have hundreds of roles across dozens of SaaS apps and cloud providers representing different levels of access. In the past, we used to have to log in to each app or system individually to grant a user access. Our IT team managed to “federate” our cloud access and use Okta as our Identity Provider. This gave us a single place to manage which users have access to which roles and applications. The rest of this blog post builds on this federated access system.
If your organization hasn’t built something similar, the following resources that can help you build and set up your own federated cloud access system.
Blog posts:
Docs:
Mapping Okta apps to AWS roles
By configuring Okta applications to cloud provider roles, engineers are one click away from authenticating to a cloud provider with single sign-on (SSO) with appropriate permissions.
Each Okta app is mapped to a “Cloud Account Role” (or “Cloud Project Role” for GCP). For example, in AWS, we have a Staging account with a Read role that provides read access to specific resources. In Okta, we have a corresponding app named “Staging Read - AWS Role” that allows engineers to authenticate to the AWS Staging Account and assume the Read role.
This requires configuring an Okta app for each “Cloud Account Role” combination, which at the time of writing is 150+ Okta apps.
Configuring GCP with Okta is slightly different, and technical details for how to do this are at the bottom of this blog.
Mapping Okta groups to SaaS app groups
In addition to authentication, Identity Providers can also help with authorization. Users get understandably frustrated when they get access to an application, but don’t have the correct permissions to do their job.
Identity Providers have agreed upon a common set of REST APIs for managing provisioning, deprovisioning, and group mapping called SCIM (the System for Cross-domain Identity Management).
If an application supports SCIM, you can create groups within your Identity Provider (e.g. Okta), which will map user membership into the application. With this setup, adding users to the Okta group will automatically add them to the corresponding group in the application. Similarly, when a user is unassigned from the application in Okta, their membership in the application group will also be lost.
SCIM allows us to provide granular, application-level access, all while using our Identity Provider as the source of truth.
With a single place to manage access for all of our cloud providers, the problem should be solved, right? Not quite…
While the underlying Okta apps and groups system worked great, we quickly ran into more human problems.
Pitfalls of centralized access management
Even with our awesome new Okta+AWS system, we still needed a process for a centralized team to provision access through Okta. At many companies, this team would be IT. At Segment, this was a single person named Boggs. Requests would go into his inbox, and he would manually review the request reason, and decide if there was a more suitable level of access for the task. Finally, he would go to the Okta admin panel and provision the appropriate app to the user. Although this system worked for a time, it was not scalable and had major drawbacks.
Permanent access
Once an app was provisioned to a user, they would have access until they left Segment. Despite having permanent access, they might not need permanent access. Unfortunately, our manual provisioning process did not have a similar scalable way to ensure access was removed after it was no longer needed. People granted access for one-off tasks now had permanent access that hung around long after they actually needed it.
Difficulty scaling due to limited context
As an engineering manager, Boggs had a strong sense of available IAM roles and their access levels. This allowed him to reduce unnecessary access by identifying opportunities to use less sensitive roles. This context was difficult to replicate and was a big reason why we could not simply expand this responsibility to our larger IT team.
Most centralized IT teams don’t work closely with all of the apps that they provision, and this makes it difficult for them to evaluate requests. Enforcing the principle of least privilege can require intimate knowledge of access boundaries within a specific app. Without this knowledge, you’ll have a hard time deciding if a requester really needs “admin”, or if they could still do the work with “write” permissions, or even just with “read” access.
Kyle from the Data Engineering team is requesting access to the Radar Admin role to “debug”. What do they actually need access for? Would a Read only role work? And wait… who is Kyle?! Did they start last week? They say that they need this access to do their job and I still need to do mine… APPROVE.
It was slow
Despite being better equipped than most people to handle access requests, Boggs was a busy engineering manager. Although at first provisioning access was an infrequent task, as the company grew, it began to take up valuable chunks of time and became increasingly difficult to understand the context of each request.
We considered involving extra team members from our IT team, but this would still take time, as they would need to contact the owners of each system to confirm that access should be granted. Ultimately, having a limited pool of centralized approvers working through a shared queue of requests made response times less than ideal.
Breaking Boggs
Boggs tried automating parts of the problem away using complex scripted rules based on roles and teams, but there were still situations that broke the system. How would he handle reorgs where teams got renamed, switched, merged, or split? What happens when a user switches teams? What happens when a team had a legitimate business need for short-term access to a tool they didn’t already have? Using that current system, any access Boggs provisioned lasted forever - unless somebody went in and manually audited Okta apps for unused access.
Ultimately, we found ourselves in a situation where we had a lot of over-provisioned users with access to sensitive roles and permissions. To make sure we understood how bad the problem actually was, we measured the access utilization of our privileged roles. We looked at how many privileged roles each employee had access to, and compared them to how many privileged roles had actually been used in the last 30 days.
The results were astonishing: 60% of access was not being used.
Managing long-lived access simply did not scale. We needed to find a way to turn our centralized access management system into a distributed one.
Access Service
In the real world of access, we shouldn’t see a user's access footprint as static, but instead view it as amorphous and ever-changing.
When we adopted this perspective, it allowed us to build Access Service, an internal app that allows users to get the access they really need, and avoid the failure modes of provisioning too little or too much access.
Access Service allows engineers to request access to a single role for a set amount of time, and have their peers approve the request. The approvers come from a predefined list, which makes the access request process similar to GitHub pull requests with designated approvers.
As soon as the request is approved, Access Service provisions the user with the appropriate Okta app or group for the role. A daily cron job checks if a request has expired, and de-provisions the user if it has.
At a high level this is a simple web app, but let’s look closer at some specific features and what they unlock.
Temporary access
The magic of Access Service is the shift from long-lived access to temporary access. Usually, an engineer only needs access temporarily to accomplish a defined task.
Once that task is done, they have access they no longer need, which violates the principle of least privilege. Fixing this using the old process would mean manually deprovisioning Okta apps – adding yet another task to a workflow that was already painfully manual.
With Access Service, users specify a duration with their access request. Approvers can refuse to approve the request if they think the duration is unnecessarily long for the task. This duration is also used to automatically deprovision their access once the request expires.
Access Service offers two types of durations: “time-based access” and “activity-based access”.
Time-based access is a specific time period, such as one day, one week, two weeks, or four weeks. This is ideal for unusual tasks such as:
fixing a bug that requires a role you don’t usually need
performing data migrations
helping customers troubleshoot on production instances you don’t usually access
Activity-based access is a dynamic duration that extends the access expiration each time you use the app or role you were granted. This is ideal for access that you need for daily job functions – nobody wants to make a handful of new access requests every month. However, we don’t offer this type of access for our more sensitive roles. Broad-access roles, or roles that have access to sensitive data require periodic approvals to maintain access. Activity-based access provides a more practical balance between friction and access, aligning with our goals of enabling our engineers to build quickly and safely
Designated approvers
One of the biggest limitations with our previous process was that one person had to approve everything. In Access Service, each app has a vetted list of approvers who work closely with the system. By delegating decision making to experts, we ensure that access is approved by the people who know who should have it.
To start out with, you can’t approve your own access requests. (Sorry red team.) Each app has a “system owner” who is responsible for maintaining its list of approvers. When a user creates an access request, they select one or more approvers to review their request. Because the approvers list contains only people who work closely with the system, the approvers have better context and understanding of the system than a central IT team.
This makes it easier for approvers to reject unreasonable or too-permissive access requests, and encourages users to request a lower tier of access (for example, telling them to request a read-only role instead of a read/write role). Since incoming requests are “load balanced” between approvers, users also see a much faster response time to their requests.
Provisioning access always requires two people, much like a GitHub pull request. Users cannot select themselves as an approver, even if they are a system owner. Access Service also supports an “emergency access” mechanism with different approval requirements. This prevents Access Service from blocking an on-call or site reliability engineer if they need access in the middle of the night.
With system owners appointed for each app, our distributed pool of approvers continues to scale as we introduce new tools with new access roles and levels. This is what the security community calls “pushing left”.
When you “push left”, you introduce security considerations earlier in the development lifecycle, instead of trying to retrofit a system after it is in use. In the software engineering space, “pushing left” resulted in engineers learning more about security. This means that the people most familiar with the systems are the most knowledgeable people to implement security fixes. Since the engineers are the ones who designed and now maintain the software, they have much more context than the central security team. Similarly, Access Service unburdens the central IT team, and empowers system owners to make decisions about who should have access to their systems, and at what level. This significantly reduces the amount of time the IT team spends provisioning access, and frees them up to do more meaningful work.
How it works
Access Service, like many of our internal apps, is accessible to the open internet, but protected behind Okta.
The basic unit of Access Service is a “request”. A user who wants access creates a request that includes four pieces of information:
the application they want access to
the duration they want access for
a description for why they need access
the approver(s) they want to review the request
When they click “Request Access”, Access Services sends the selected approvers a Slack notification. Segment, like many modern companies, has a high degree of Slack presence. Using this platform makes Access Service a more natural, less disruptive part of people’s workflows. Even if the user requesting access is an approver for the particular app, they must receive approval from a different approver – every request must involve two people.
The access request is tracked in a web app, so you can see what requests you have open, and what roles you currently have access to.
The requester is notified via Slack when their request has been approved, so they know they can now get back to the task they needed access for in the first place.
The results
After we migrated our access process to Access Service, the result was zero long-lived access to any of our privileged cloud roles in AWS and GCP. All access granted to these roles expires if it is not actively used.
In the graph below, “Access Points” refers to the number of users with access to each admin role. After moving to Access Service, we reduced the number of people who had privileged access by 90%.
In the next graph below, “Active” refers to the number of people who used an app within the last 30 days. Because this number is higher than the number of Access Points, this shows that more access was used in the last 30 days than was currently provisioned.
That seems strange – how could admin apps have been used by more people than the total amount of people provisioned access? That’s because expired access had already been automatically deprovisioned, reducing the number of Access Points by the end of the 30 day window!
Conclusion
By acknowledging that access needs are constantly changing, we were able to create a more practical way to manage access control.
Access Service allows us to streamline the access approval process. By routing requests directly to designated approvers, we are able to get fast approvals from people with rich context. The time-based component of access requests allows the service to regularly remove unneeded access, preventing our access attack surface from growing too large. Finally, integrating Slack into the system makes approvals faster, ensures that you know immediately when your request has been approved, and reminds you when the request is expiring so you don’t run into unexpected access loss when just trying to do your job.
While it can be daunting to try to reinvent an existing, well-established process, the results can be incredibly rewarding. Start by writing down your goals, thinking about what you don’t like and what is painful about the current state, and reevaluate your core assumptions. Companies are always changing, and your processes have to keep up; the circumstances that led to the previous system may no longer be applicable today. Most importantly, remember to build with the user’s workflow in mind, because security depends on participation of the whole company.
Future development
Policies
Apps in Access Service are currently individually customizable. However, this can lead to issues with scalability if we want to make changes across multiple, similar apps. For example, if we decide that we want to limit access to several AWS accounts to no more than one week, we would currently have to edit the allowed durations for each individual role. With the introduction of policies, we would be able to map several roles to a single policy, allowing us to easily apply the change from the previous example.
Dynamic Roles
Currently, Access Service grants users access to predefined AWS roles. These roles are typically made to be general-purpose, but there may be use-cases not fully captured by an existing role. Instead of configuring a new role for one-off needs, or using an overly permissive role, Access Service could allow users to create a dynamic role. When making a request, users would check boxes corresponding to what permissions they wanted (e.g. “S3 Read”, “CloudWatch Full Access”, etc) to create a custom, dynamic role.
Special thanks to David Scrobonia for creating Access Service and setting up the foundation for this blog. Thank you to John Boggs, Rob McQueen, Anastassia Bobokalonova, Leif Dreizler, Eric Ellett, Pablo Vidal, Arta Razavi, and Laura Rubin, all of whom either built, designed, inspired, or contributed to Access Service along the way.
References
Configuring GCP roles in Okta
Connecting a GCP role to Okta is harder than with AWS, and after struggling to figure it out for a while, we thought it would be worth sharing. To connect a GCP role to our Okta instance, we had to use Google Groups in GSuite.
First, we created a single GSuite Group for each of our Project-Role pairs. In GCP, a Google Group is a member (principal) that can be assigned a role, and all users added to the group are also assigned that role.
We then assigned each GCP role to its corresponding Google Group. Next, we needed to connect the Google Groups to Okta.
You can do this by using Okta Push Groups, which link an Okta “group” to a Google Group. Adding a user to an Okta Push Group automatically adds the correct GSuite user to the Google group. We created an Okta Group for each of the roles and configured it as a Push Group to its corresponding Google Group.
To summarize, the flow looked like this:
Add Okta User david@segment.com to Okta Group “Staging Read - GCP Role”
Okta Push Groups adds the GSuite user david@segment.com to the “Staging Read” Google Group
Because he is a member of the “Staging Read” Google Group , david@segment.com is assigned the “Read” IAM role for the “Staging” project.
A BeyondCorp approach to internal apps
All of our internal apps use an OpenID Connect (OIDC) enabled Application Load Balancer (ALB) to connect to Okta. This provides a BeyondCorp approach to access for our internal apps: all are publicly-routable, but are behind Okta.
This is also nice from a tooling developer standpoint, because not only is authentication taken care of, but we can use the signed JSON web token (JWT) that Okta returns to the server through the ALB to get the identity of the user interacting with Access Service. This allows us to use Okta as a coarse authorization layer and manage which users have access to different internal apps.
The State of Personalization 2023
Our annual look at how attitudes, preferences, and experiences with personalization have evolved over the past year.
Get the reportThe State of Personalization 2023
Our annual look at how attitudes, preferences, and experiences with personalization have evolved over the past year.
Get the reportShare article
Recommended articles
How to accelerate time-to-value with a personalized customer onboarding campaign
To help businesses reach time-to-value faster, this blog explores how tools like Twilio Segment can be used to customize onboarding to activate users immediately, optimize engagement with real-time audiences, and utilize NPS for deeper customer insights.
Introducing Segment Community: A central hub to connect, learn, share and innovate
Dive into Segment's vibrant customer community, where you can connect with peers, gain exclusive insights, and elevate your success with expert guidance and resources!
Using ClickHouse to count unique users at scale
By implementing semantic sharding and optimizing filtering and grouping with ClickHouse, we transformed query times from minutes to seconds, ensuring efficient handling of high-volume journeys in production while paving the way for future enhancements.