In any microservice, managing database interactions with precision is crucial for maintaining application performance and reliability. Odd database connection issues often surface only during performance testing. Recently, a critical issue appeared in the repository layer of a Spring microservice application, where improper exception handling led to unexpected failures and service disruptions during performance testing. This article delves into the specifics of the issue and highlights the pivotal role of the @Transactional annotation, which remedied it.

Spring microservice applications rely heavily on stable and efficient database interactions, often managed through the Java Persistence API (JPA). Properly managing database connections, particularly preventing connection leaks, is critical to ensuring these interactions do not negatively impact application performance.

Issue Background

During a recent round of performance testing, a critical issue emerged within one of our essential microservices, which was designated for sending client communications. This service began to experience repeated gateway time-out errors. The underlying problem was rooted in our database operations at the repository layer. An investigation into these time-out errors revealed that a stored procedure was consistently failing. The failure was triggered by an invalid parameter passed to the procedure, which raised a business exception from the stored procedure. The repository layer did not handle this exception properly; it simply bubbled up. Below is the source code for the stored procedure call:

```java
public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName,
        List<Notif> notifList, String attributes, String notifTitle, String notifSubject,
        String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter,
        String groupId) throws EDeliveryException {
    try {
        StoredProcedureQuery query = entityManager.createStoredProcedureQuery("p_create_notification");
        DbUtility.setParameter(query, "v_notif_code", notifCode);
        DbUtility.setParameter(query, "v_user_uuid", userId);
        DbUtility.setNullParameter(query, "v_user_id", Integer.class);
        DbUtility.setParameter(query, "v_acct_id", acctId);
        DbUtility.setParameter(query, "v_message_url", s3KeyName);
        DbUtility.setParameter(query, "v_ecomm_attributes", attributes);
        DbUtility.setParameter(query, "v_notif_title", notifTitle);
        DbUtility.setParameter(query, "v_notif_subject", notifSubject);
        DbUtility.setParameter(query, "v_notif_preview_text", notifPreviewText);
        DbUtility.setParameter(query, "v_content_type", contentType);
        DbUtility.setParameter(query, "v_do_not_delete", doNotDelete);
        DbUtility.setParameter(query, "v_hard_copy_comm", isLetter);
        DbUtility.setParameter(query, "v_group_id", groupId);
        DbUtility.setOutParameter(query, "v_notif_id", BigInteger.class);
        query.execute();
        BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id");
        return notifId.longValue();
    } catch (PersistenceException ex) {
        logger.error("DbRepository::createInboxMessage - Error creating notification", ex);
        throw new EDeliveryException(ex.getMessage(), ex);
    }
}
```

Issue Analysis

As illustrated in our scenario, when the stored procedure encountered an error, the resulting exception propagated upward from the repository layer to the service layer and finally to the controller. This propagation was problematic, causing our API to respond with non-200 HTTP status codes, typically 500 or 400.
Following several such incidents, the service container reached a point where it could no longer handle incoming requests, ultimately resulting in a 502 Gateway Timeout error. This critical state was reflected in our monitoring systems, with Kibana logs indicating the issue:

`HikariPool-1 - Connection is not available, request timed out after 30000ms.`

The root cause was improper exception handling: exceptions bubbled up through the system layers without being managed, which prevented database connections from being released back into the connection pool and led to the depletion of available connections. Once all connections were exhausted, the container could no longer process new requests, resulting in the error reported in the Kibana logs and the non-200 HTTP responses.

Resolution

One option is to handle the exception gracefully instead of letting it bubble up, allowing JPA and the Spring context to release the connection back to the pool. Another is to annotate the method with @Transactional. Below is the same method with the annotation:

```java
@Transactional
public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName,
        List<Notif> notifList, String attributes, String notifTitle, String notifSubject,
        String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter,
        String groupId) throws EDeliveryException {
    ………
}
```

The implementation below demonstrates the first approach: it prevents exceptions from propagating further up the stack by catching and logging them within the method itself:

```java
public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName,
        List<Notif> notifList, String attributes, String notifTitle, String notifSubject,
        String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter,
        String loanGroupId) {
    try {
        .......
        query.execute();
        BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id");
        return notifId.longValue();
    } catch (PersistenceException ex) {
        logger.error("DbRepository::createInboxMessage - Error creating notification", ex);
    }
    return -1;
}
```

With @Transactional

The @Transactional annotation in the Spring framework manages transaction boundaries. It begins a transaction when the annotated method starts and commits or rolls it back when the method completes. When an exception occurs, @Transactional ensures that the transaction is rolled back, which releases the database connection back to the connection pool.

Without @Transactional

If a repository method that calls a stored procedure is not annotated with @Transactional, Spring does not manage the transaction boundaries for that method, and transaction handling must be implemented manually. If the stored procedure throws an exception that is not properly managed, the database connection may never be closed or returned to the pool, producing a connection leak.

Best Practices

Always use @Transactional when the method's operations should execute within a transaction scope. This is especially important for operations involving stored procedures that can modify database state. When not using @Transactional, ensure that exception handling within the method includes proper transaction rollback and the closing of any database connections.
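To make that second practice concrete, here is a minimal sketch of manual transaction handling, assuming a Jakarta Persistence setup; the class and method names are hypothetical, and only the begin/commit/rollback/close structure matters:

```java
import jakarta.persistence.EntityManager;
import jakarta.persistence.EntityManagerFactory;
import jakarta.persistence.EntityTransaction;
import jakarta.persistence.PersistenceException;

public class ManualTxRepository {
    private final EntityManagerFactory emf;

    public ManualTxRepository(EntityManagerFactory emf) {
        this.emf = emf;
    }

    // Sketch: explicit transaction boundaries when @Transactional is not used.
    public long callStoredProcedure() {
        EntityManager em = emf.createEntityManager();
        EntityTransaction tx = em.getTransaction();
        try {
            tx.begin();
            // ... create and execute the StoredProcedureQuery here ...
            tx.commit();
            return 0L;
        } catch (PersistenceException ex) {
            if (tx.isActive()) {
                tx.rollback(); // ensure the transaction does not stay open on failure
            }
            return -1L;
        } finally {
            em.close(); // returns the underlying connection to the pool
        }
    }
}
```

The finally block is the essential part: whether the call succeeds or fails, the EntityManager is closed and the connection goes back to Hikari's pool.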
Conclusion

Effective transaction management is pivotal in maintaining the health and performance of Spring microservice applications using JPA. By employing the @Transactional annotation, we can safeguard against connection leaks and ensure that database interactions do not degrade application performance or stability. Adhering to these guidelines enhances the reliability and efficiency of our Spring microservices, providing stable and responsive services to consuming applications and end users.
This article is part of a series called "Mastering Object-Oriented Design Patterns." The series consists of four articles and aims to provide thorough guidance on object-oriented design patterns. The articles address where design patterns come from, the problems they solve, and the advantages of using them, and the series provides full explanations of the most common patterns. Every article starts with real-life analogies, discusses the pros and cons of each pattern, and provides a Java example implementation. Search for the title "Mastering Object-Oriented Design Patterns" to explore the whole series.

Once upon a time, a new notion called "design patterns" appeared in software engineering, and it has revolutionized how developers approach complex software design. Design patterns are verified solutions to frequently encountered problems. But where did this idea originate, and how did it come to contribute so significantly to object-oriented programming?

Origin of Design Patterns

Design patterns first appeared in architecture, not in software. Christopher Alexander, an architect and design theorist, introduced the idea in his influential work "A Pattern Language: Towns, Buildings, Construction." Alexander sought to develop a pattern language to solve spatial and communal problems in cities. These patterns covered details ranging from window heights to the organization of green zones within neighborhoods, laying the ground for a design approach focused on reusable solutions to recurring problems.

Captivated by Alexander's concept, a group of four software engineers (Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides), also known as the Gang of Four (GoF), recognized its potential in software development. In 1994, they published "Design Patterns: Elements of Reusable Object-Oriented Software," which translated the pattern language of architecture into the world of object-oriented programming (OOP). This seminal publication presented twenty-three design patterns targeting typical design issues. It soon became a best-seller and a vital tool in software engineering instruction.

Introduction to Design Patterns

What Are Design Patterns?

Design patterns are not recipes but recommendations for solving typical design problems. They are a pool of bright ideas and accumulated experience from the software development community. These patterns help developers build flexible, maintainable, and reusable code, and they provide a common language and methodology for solving design problems, simplifying collaboration among developers and speeding up development.

Picture making software as assembling a puzzle in which you keep being handed the same pieces; design patterns are your map showing how to fit those pieces together every time. Rather than giving you ready-made code snippets, they present ways to solve particular problems in your projects. The purpose of design patterns is to reduce coding complexity, help you solve problems faster, and keep your code as flexible as possible for the future.

Design Patterns vs. Algorithms
Both algorithms and design patterns provide solutions, but an algorithm is a sequence of steps to reach a goal, just like a cooking recipe. A design pattern, on the other hand, is more of a blueprint: it provides the layout and major components of a solution but does not prescribe the building details, leaving you free to adapt it to your project's demands.

Inside a Design Pattern

A design pattern typically includes:

- Intent: What the pattern does and what it solves.
- Motivation: The reason for the pattern and the way it helps.
- Structure of classes: A schematic indicating how its parts communicate.
- Code example: Commonly provided in popular programming languages to facilitate comprehension.

Some descriptions also address when to use the pattern, how to apply it, and its interaction with other patterns, leaving you with a complete toolset for smarter coding.

Why Use Design Patterns?

Design patterns are a kind of secret toolset: they make solving common problems easier, and here's why embracing them can be a game-changer:

- Proven and ready-to-use solutions: Imagine owning a treasure chest of brilliant hacks already worked out by professional coders. That's what design patterns are: clever, immediately applicable, professional-quality solutions that let you solve problems quickly and correctly.
- Simplifying complexity: Great software is minimalistic in a sense. Design patterns help you split large, daunting problems into small, manageable chunks, making your code neater and your life simpler.
- Big-picture focus: Design patterns let you spend less time on code structure and more time on producing great features rather than struggling with the fundamentals.
- Common language: Design patterns give developers a shared vocabulary, so when you say, "Let's use a Singleton here," everyone gets it. This leads to more efficient work and less confusion.
- Reusability and maintainability: Design patterns encourage code reuse via inheritance and interfaces, which keeps classes adaptable and systems easy to maintain. This shortens development cycles and keeps systems robust over time.
- Improved scalability and flexibility: Patterns such as MVC enforce a clearer separation between the different parts of your code, making your system more flexible and able to grow with few adjustments.
- Boosted readability and understandability: Properly applied design patterns increase the readability of your code, making it easier for other people to understand and contribute without lengthy explanation.

In a nutshell, design patterns are about making coding more comfortable, efficient, and even entertaining. They let you work on extension rather than invention, improving the software without reinventing the wheel.

Navigating the Tricky Side of Design Patterns

Design patterns are secret ingredients that make writing code easier and more practical, but they are not ideal.
Here are a few things to be aware of:

- Not suitable for every programming language: A design pattern may be unnecessary in a language whose built-in features already solve the problem; a complex pattern can be redundant when a simple language feature does the job. It's like reaching for a sophisticated instrument when a simple one is sufficient.
- Being too rigid with patterns: Although design patterns are derived from best practices, strict adherence to them can cause undesirable results. It's similar to sticking to a recipe so rigidly that you never adjust it to your taste. At times, you need to modify a pattern to suit the particular requirements of your project.
- Overusing patterns: It's easy to lose control and believe that every problem can be addressed with a design pattern. Yet not all problems need one. It's akin to using a hammer for all tasks when, at times, a screwdriver is sufficient.
- Adding unnecessary complexity: Design patterns can introduce complexity to your code. If not handled with care, they can overcomplicate your project.

How To Avoid the Pitfalls

Despite these pitfalls, design patterns are still quite helpful. The key is to use them wisely:

- Choose the appropriate tool for the task: Not all problems need a design pattern. Sometimes, simpler is better.
- Adapt and customize: Don't be afraid to adjust a pattern to make it suit you better.
- Keep it simple: Do not make your code more complicated by using patterns that are not required.

In summary, design patterns are similar to spices in cooking: applied correctly, they can improve your dish (or project). Yet it's necessary to use them in moderation and not let them overpower the food.

Types of Design Patterns

Design patterns are useful techniques applied in software design. They facilitate code organization and management during the development and maintenance of applications. Think of them as clever construction techniques for your software projects. Let's quickly check out the three main types:

Creational Patterns: Building Blocks

Creational patterns are like picking the right LEGO blocks to begin your model. They focus on simplifying the process of creating objects or groups of objects, letting you build software flexibly and efficiently, as if picking out the LEGO pieces that fit your design.

Structural Patterns: Putting It All Together

Structural patterns are about how you assemble your LEGO bricks. They help you arrange the pieces (or objects) into larger structures, keeping everything neat and well-organized. It's akin to following a LEGO manual to guarantee your spaceship or castle will be sturdy.

Behavioral Patterns: Making It Work

Behavioral patterns are about making your LEGO creation do extraordinary things; think of making the wings of your LEGO spaceship move. In software, these patterns enable various program components to interact and cooperate, ensuring everything functions as intended.

Design patterns can be as simple as idioms that apply only in one programming language or as broad as architectural patterns that shape an entire application. They are tools in your toolkit, useful inside a small function and across the software's structure. Comprehending these patterns is like learning the tricks of constructing the most incredible LEGO sets.
They make you a software genius, and all your coding will seem relaxed and fun!

Conclusion

Our first module is over. It has been a fantastic trip into the principles behind design patterns and how they are leveraged in software engineering. Design patterns are not merely coding shortcuts but crystallized wisdom that provides reusable solutions for typical design issues. They simplify and speed up object-oriented programming, producing cleaner code. At the same time, they are not a silver bullet: it is essential to know when and how to use them appropriately. In closing this chapter, we invite you to browse the other parts of the "Mastering Object-Oriented Design Patterns" series. Each part reinforces your comprehension and skill, making you more confident when applying design patterns to your projects. Whether you want to develop your architectural skills, speed up your development process, or improve the quality of your code, this series is here to help.

References

- Design Patterns: Elements of Reusable Object-Oriented Software
- Head First Design Patterns
Flyway is a popular open-source tool for managing database migrations. It makes it easy to manage and version control the database schema for your application. Flyway supports almost all popular databases, including Oracle, SQL Server, DB2, MySQL, Amazon RDS, Aurora MySQL, MariaDB, PostgreSQL, and more. For the full list of supported databases, check the official documentation.

How Flyway Migrations Work

Any change to the database is called a migration. Flyway supports two types of migrations: versioned and repeatable.

Versioned migrations are the most common type; they are applied exactly once, in the order they appear. Versioned migrations are used for creating, altering, and dropping tables, indexes, or foreign keys. Their file names follow the convention [Prefix][Separator][Migration Description][Suffix], for example V1__add_user_table.sql and V2__alter_user_table.sql.

Repeatable migrations, on the other hand, are (re-)applied every time they change. They are useful for managing views, stored procedures, or bulk reference data updates where the latest version should simply replace the previous one without considering versioning. Repeatable migrations are always applied last, after all pending versioned migrations have been executed, and use names such as R__add_new_table.sql.

Migrations can be written in either SQL or Java. When we start the application against an empty database, Flyway first creates a schema history table (flyway_schema_history), which is used to track the state of the database. It then scans the classpath for migration files, sorts them by version number, and applies them in order. As each migration is applied, the schema history table is updated accordingly.

Integrating Flyway in Spring Boot

In this tutorial, we will create a Spring Boot application that manages MySQL 8 database migrations using Flyway. This example uses Java 17, Spring Boot 3.2.4, and MySQL 8.0.26. For database operations, we will use Spring Data JPA.

Install Flyway Dependencies

First, add the following dependencies to your pom.xml or build.gradle file. The spring-boot-starter-data-jpa dependency brings in Spring Data JPA with Hibernate. The mysql-connector-j artifact is the official JDBC driver for MySQL; it allows your Java application to connect to a MySQL database for operations such as creating, reading, updating, and deleting records. The flyway-core dependency is essential for integrating Flyway into your project, enabling migrations and version control for your database schema. The flyway-mysql dependency adds Flyway support for MySQL databases, providing MySQL-specific functionality and optimizations; it's necessary when your application uses Flyway to manage migrations on a MySQL database.
pom.xml

```xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>com.mysql</groupId>
        <artifactId>mysql-connector-j</artifactId>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.flywaydb</groupId>
        <artifactId>flyway-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.flywaydb</groupId>
        <artifactId>flyway-mysql</artifactId>
    </dependency>
    <!-- Other dependencies -->
</dependencies>
```

Configure the Database Connection

Now provide the database connection properties in your application.properties file:

```properties
# DB properties
spring.datasource.url=jdbc:mysql://localhost:3306/flyway_demo
spring.datasource.username=root
spring.datasource.password=Passw0rd
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver

# JPA
spring.jpa.show-sql=true
```

Create Database Changelog Files

Let us now create a few database migration files inside the resources/db/migrations directory.

V1__add_movies_table.sql

```sql
CREATE TABLE movie (
    id bigint NOT NULL AUTO_INCREMENT,
    title varchar(255) DEFAULT NULL,
    headline varchar(255) DEFAULT NULL,
    language varchar(255) DEFAULT NULL,
    region varchar(255) DEFAULT NULL,
    thumbnail varchar(255) DEFAULT NULL,
    rating enum('G','PG','PG13','R','NC17') DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```

V2__add_actor_table.sql

```sql
CREATE TABLE actor (
    id bigint NOT NULL AUTO_INCREMENT,
    first_name varchar(255) DEFAULT NULL,
    last_name varchar(255) DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```

V3__add_movie_actor_relations.sql

```sql
CREATE TABLE movie_actors (
    actors_id bigint NOT NULL,
    movie_id bigint NOT NULL,
    PRIMARY KEY (actors_id, movie_id),
    KEY fk_ref_movie (movie_id),
    CONSTRAINT fk_ref_movie FOREIGN KEY (movie_id) REFERENCES movie (id),
    CONSTRAINT fl_ref_actor FOREIGN KEY (actors_id) REFERENCES actor (id)
) ENGINE=InnoDB;
```

R__create_or_replace_movie_view.sql

```sql
CREATE OR REPLACE VIEW movie_view AS SELECT id, title FROM movie;
```

V4__insert_test_data.sql

```sql
INSERT INTO movie (title, headline, language, region, thumbnail, rating)
VALUES
('Inception', 'A thief who steals corporate secrets through the use of dream-sharing technology.', 'English', 'USA', 'inception.jpg', 'PG13'),
('The Godfather', 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.', 'English', 'USA', 'godfather.jpg', 'R'),
('Parasite', 'A poor family, the Kims, con their way into becoming the servants of a rich family, the Parks. But their easy life gets complicated when their deception is threatened with exposure.', 'Korean', 'South Korea', 'parasite.jpg', 'R'),
('Amélie', 'Amélie is an innocent and naive girl in Paris with her own sense of justice. She decides to help those around her and, along the way, discovers love.', 'French', 'France', 'amelie.jpg', 'R');

-- Inserting data into the 'actor' table
INSERT INTO actor (first_name, last_name)
VALUES
('Leonardo', 'DiCaprio'),
('Al', 'Pacino'),
('Song', 'Kang-ho'),
('Audrey', 'Tautou');

-- Leonardo DiCaprio in Inception
INSERT INTO movie_actors (actors_id, movie_id) VALUES (1, 1);
-- Al Pacino in The Godfather
INSERT INTO movie_actors (actors_id, movie_id) VALUES (2, 2);
-- Song Kang-ho in Parasite
INSERT INTO movie_actors (actors_id, movie_id) VALUES (3, 3);
-- Audrey Tautou in Amélie
INSERT INTO movie_actors (actors_id, movie_id) VALUES (4, 4);
```

These tables are mapped to the following entity classes.
Movie.java

```java
@Entity
@Data
public class Movie {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private String title;
    private String headline;
    private String thumbnail;
    private String language;
    private String region;
    @Enumerated(EnumType.STRING)
    private ContentRating rating;
    @ManyToMany
    Set<Actor> actors;
}

public enum ContentRating {
    G, PG, PG13, R, NC17
}
```

Actor.java

```java
@Entity
@Data
public class Actor {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;
    String firstName;
    String lastName;
}
```

Configure Flyway

We can control the migration process using the following properties in the application.properties file:

```properties
spring.flyway.enabled=true
spring.flyway.locations=classpath:db/migrations
spring.flyway.baseline-on-migrate=true
spring.flyway.validate-on-migrate=true
```

| Property | Use |
|---|---|
| spring.flyway.enabled=true | Enables or disables Flyway's migration functionality for your application. |
| spring.flyway.validate-on-migrate=true | When set to true, Flyway validates the applied migrations against the migration scripts every time it runs a migration, ensuring the migrations applied to the database match the ones available in the project. If validation fails, Flyway prevents the migration from running, which helps catch potential problems early. |
| spring.flyway.baseline-on-migrate=true | Used when you have an existing database that wasn't managed by Flyway and you want Flyway to start managing it. Setting this to true allows Flyway to baseline the existing database and manage subsequent migrations. |
| spring.flyway.locations | Specifies the locations of migration scripts within your project. |

Run the Migrations

When you start your Spring Boot application, Flyway automatically checks the db/migrations location for any new migrations that have not yet been applied to the database and applies them in version order.

```
./mvnw spring-boot:run
```

Reverse/Undo Migrations in Flyway

Flyway allows you to revert migrations that were applied to the database; however, this feature requires a Flyway Teams (commercial) license. If you're using the community/free version of Flyway, the workaround is to create a new migration changelog file that undoes the changes made by the previous migration and apply it. For example:

V5__delete_movie_actors_table.sql

```sql
DROP TABLE movie_actors;
```

Now run the application to apply the V5 migration changelog to your database.

Using the Flyway Maven Plugin

Flyway provides a Maven plugin to manage migrations from the command line. It provides 7 goals:

| Goal | Description |
|---|---|
| flyway:baseline | Baselines an existing database, excluding all migrations up to and including baselineVersion. |
| flyway:clean | Drops all database objects (tables, views, procedures, triggers, ...) in the configured schemas. The schemas are cleaned in the order specified by the schemas property. |
| flyway:info | Retrieves complete information about the migrations, including applied, pending, and current migrations with details and status. |
| flyway:migrate | Triggers the migration of the configured database to the latest version. |
| flyway:repair | Repairs the Flyway schema history table. This removes any failed migrations on databases without DDL transactions. |
| flyway:undo | Undoes the most recently applied versioned migration. Flyway Teams only. |
| flyway:validate | Validates applied migrations against the ones resolved on the classpath. This detects accidental changes that may prevent the schema(s) from being recreated exactly. |

To integrate the plugin into your Maven project, add flyway-maven-plugin to your pom.xml file.
```xml
<properties>
    <database.url>jdbc:mysql://localhost:3306/flyway_demo</database.url>
    <database.username>YOUR_DB_USER</database.username>
    <database.password>YOUR_DB_PASSWORD</database.password>
</properties>

<build>
    <plugins>
        <plugin>
            <groupId>org.flywaydb</groupId>
            <artifactId>flyway-maven-plugin</artifactId>
            <version>10.10.0</version>
            <configuration>
                <url>${database.url}</url>
                <user>${database.username}</user>
                <password>${database.password}</password>
            </configuration>
        </plugin>
        <!-- other plugins -->
    </plugins>
</build>
```

Now you can use the Maven goals:

```
./mvnw flyway:migrate
```

Maven allows you to define properties in the project's POM and override their values from the command line:

```
./mvnw -Ddatabase.username=root -Ddatabase.password=Passw0rd flyway:migrate
```
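As noted earlier, migrations can also be written in Java, which is handy when a change needs logic that is awkward to express in plain SQL. Below is a minimal sketch of what such a versioned Java migration might look like; the class name and the release_year column are hypothetical, and with this article's configuration the class would live in the db/migrations package (Flyway treats classpath locations as packages when scanning for Java migrations):

```java
package db.migrations;

import java.sql.Statement;
import org.flywaydb.core.api.migration.BaseJavaMigration;
import org.flywaydb.core.api.migration.Context;

// Hypothetical versioned migration: the class name (V6__...) encodes the
// version, just like the file names of the SQL migrations above.
public class V6__add_release_year_column extends BaseJavaMigration {
    @Override
    public void migrate(Context context) throws Exception {
        // Flyway supplies the JDBC connection; do not close it yourself.
        try (Statement statement = context.getConnection().createStatement()) {
            statement.execute("ALTER TABLE movie ADD COLUMN release_year int");
        }
    }
}
```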
Tech teams do their best to develop amazing software products. They spend countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products.

The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to:

- Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. This includes considerations for system uptime, fault tolerance, and recovery mechanisms.
- Performance: The system functions within acceptable speed and resource usage parameters and scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions.
- Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior.
- Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions.
- Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. This includes mechanisms for data backup, recovery, and rollback.
- Maintainability: The system is easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements.

This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted, with their pros and cons. A distinction between software and hardware failure models follows. Finally, reliability growth models are explored, including a list of factors for choosing the right model.

Mean Time Metrics

Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time between failures), MTTR (mean time to recovery, repair, respond, or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents.

The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties.

- Mean time to repair: The time it takes to fix a failed component.
- Mean time to recovery: The time to restore full functionality after a failure.
- Mean time to respond: The initial response time to acknowledge and investigate an incident.
- Mean time to resolve: The entire incident resolution process, including diagnosis, repair, and recovery.

While these metrics overlap, each provides a distinct perspective on how quickly a team resolves incidents.

MTTA, or mean time to acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness.
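Before moving on to MTBF and MTTF, here is a small sketch of how such averages can be computed from raw incident data; the Incident record and its timestamps are hypothetical, and MTTR is shown in its "recovery" flavor:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Function;

public class MeanTimeMetrics {

    // Hypothetical incident record: when the alert fired, when a human first
    // looked at it, and when full service was restored.
    record Incident(Instant alertTriggered, Instant acknowledged, Instant recovered) {}

    static Duration mean(List<Incident> incidents, Function<Incident, Duration> perIncident) {
        long totalMillis = incidents.stream()
                .mapToLong(i -> perIncident.apply(i).toMillis())
                .sum();
        return Duration.ofMillis(totalMillis / incidents.size());
    }

    public static void main(String[] args) {
        List<Incident> incidents = List.of(
                new Incident(Instant.parse("2024-05-01T10:00:00Z"),
                             Instant.parse("2024-05-01T10:04:00Z"),
                             Instant.parse("2024-05-01T10:30:00Z")),
                new Incident(Instant.parse("2024-05-07T22:00:00Z"),
                             Instant.parse("2024-05-07T22:02:00Z"),
                             Instant.parse("2024-05-07T23:00:00Z")));

        // MTTA: average of (acknowledged - alertTriggered).
        Duration mtta = mean(incidents, i -> Duration.between(i.alertTriggered(), i.acknowledged()));
        // MTTR (recovery): average of (recovered - alertTriggered).
        Duration mttr = mean(incidents, i -> Duration.between(i.alertTriggered(), i.recovered()));

        System.out.println("MTTA = " + mtta + ", MTTR = " + mttr);
    }
}
```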
MTBF, or mean time between failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, allocating resources, and predicting system uptime.

For a system that cannot or should not be repaired, MTTF, or mean time to failure, represents the average time the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing, which makes it particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity.

An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans:

- MTBF: The average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road.
- MTTF: The average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable.

Key Differentiators

| Feature | MTBF | MTTF |
|---|---|---|
| Repairable system | Yes | No |
| Repair time | Considered in the calculation | Not considered in the calculation |
| Failure focus | Time between subsequent failures | Time to the first failure |
| Application | Planning maintenance, resource allocation | Assessing inherent system reliability |

The Bigger Picture

MTTR, MTTA, MTTF, and MTBF can also be used together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues. Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system.

Probability Distributions for Reliability

The following probability distributions are commonly used in reliability engineering to model the time until failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time.

Exponential Distribution Model

This model assumes a constant failure rate over time, meaning the probability of a component failing is independent of its age or how long it has been operating.

Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited.

Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the exponential distribution model wouldn't capture.
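For reference, this is the standard textbook form of the exponential model, not something specific to any one tool: with a constant failure rate $\lambda$, the probability that a component survives beyond time $t$ is

$$R(t) = e^{-\lambda t}, \qquad \text{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}.$$

The constant hazard rate is exactly what the limitation above points to: the model cannot represent failure rates that rise as components wear out.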
Weibull Distribution Model

This model offers more flexibility by allowing dynamic failure rates. It can model situations where the probability of failure increases over time at an early stage (infant mortality failures) or at a later stage (wear-out failures).

- Infant mortality failures: New components with manufacturing defects that are more likely to fail early on.
- Wear-out failures: Components like mechanical parts that degrade with use and become more likely to fail as they age.

Applications: The Weibull distribution model is more versatile than the exponential distribution model. It's a good choice for analyzing a wider range of hardware components with varying failure patterns.

Limitations: The Weibull distribution model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the exponential distribution would suffice.

The Software vs. Hardware Distinction

The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors:

1. Root Cause of Failures

Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects.

Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time.

2. Failure Patterns

Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out.

Software: The behavior of software failures over time can be very tricky and usually depends on the evolution of the code, among other factors. A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running.

3. Failure Prediction, Prevention, and Repairs

Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components.

Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement.

4. Interdependence and Interaction Failures

The level of interdependence between software and hardware varies for different systems, domains, and applications.
Tight coupling between software and hardware may cause interaction failures: there can be software failures due to hardware and vice versa. Here's a table summarizing the key differences:

| Feature | Hardware Reliability Models | Software Reliability Models |
|---|---|---|
| Root cause of failures | Physical degradation, defects, environmental factors | Code defects, design flaws, external dependencies |
| Failure patterns | Time-dependent (infant mortality, wear-out) | Non-time-dependent (bugs remain until fixed) |
| Prediction focus | Average times between failures (MTBF, MTTF) | Number of remaining defects |
| Prevention strategies | Preventive maintenance schedules | Code review, testing, bug fixes |

By understanding the distinct characteristics of hardware and software failures, we can leverage tailored reliability models, where necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems.

Code Complexity

Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity:

- SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics
- Fortify: Provides static code analysis for security vulnerabilities and code complexity
- CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases
- PMD: An open-source tool for identifying common coding flaws and complexity metrics

Defect Density

Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product.

Reliability Growth Models

Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal.

Some models exhibit characteristics of exponential growth, others of power-law growth, and some exhibit both. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults. While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help further study. Traditional growth models encompass the commonly used and foundational models, the Bayesian approach represents a distinct methodology, and the advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative, not exhaustive.
Traditional Growth Models

- Musa-Okumoto Model: Assumes a logarithmic Poisson process for fault detection and removal, where the number of failures observed over time follows a logarithmic function of the number of initial faults.
- Jelinski-Moranda Model: Assumes a constant failure intensity over time and is based on the concept of error seeding. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system.
- Goel-Okumoto Model: Incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection.
- Non-Homogeneous Poisson Process (NHPP) Models: Assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time.

Bayesian Approach

- Wall and Ferguson Model: Combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth.

Advanced Growth Models

- Duane Model: Assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate, and it reflects how quickly the reliability of the system improves as testing and debugging occur.
- Coutinho Model: Based on the Duane model, it extends to the idea of an instantaneous failure rate involving the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth.
- Gooitzen Model: Incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing, providing a more realistic representation of the fault detection and removal process.
- Littlewood Model: Acknowledges that as system failures are discovered during testing, the underlying faults causing them are repaired, so the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth, when a software repair introduces further errors.
- Rayleigh Model: The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase, and estimates the number of future defects based on the observed data.

Choosing the Right Model

There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider:

- Specific objectives: Determine the goals of the reliability growth analysis. Whether the aim is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes.
- Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms. Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems.
- Development stage: Consider the stage of development the system is in. Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors.
- Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable.
- Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders.
- Assumptions and limitations: Understand the underlying assumptions and limitations of each model. Choose one whose assumptions align with the characteristics of the system and the available data.
- Predictive capability: Assess how accurately the model forecasts future reliability levels based on past data.
- Flexibility and adaptability: Consider the model's flexibility across different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts.
- Resource requirements: Evaluate the resources associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the capabilities of the organization.
- Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models. Models that have been validated against real-world data are more trustworthy.
- Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of model. Certain industries have specific guidelines or recommendations for reliability analysis that must be adhered to.
- Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved.

Wrapping Up

Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health. The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with your system's characteristics, you can create a comprehensive reliability assessment plan. This tailored approach allows us to identify potential weaknesses and prioritize improvement efforts.
Services, or servers, are software components or processes that execute operations on specified inputs, producing either actions or data depending on their purpose. The party making the request is the client, while the server manages the request. Typically, communication between client and server occurs over a network, using protocols such as HTTP for REST or gRPC. Services may include a user interface (UI) or function solely as backend processes. With this background, we can explore the steps and rationale behind developing a scalable service.

NOTE: This article does not provide instructions on service or UI development, leaving you the freedom to select the language or tech stack that suits your requirements. Instead, it offers a comprehensive perspective on constructing and expanding a service, reflecting what startups need to do in order to scale. It's also important to recognize that while this approach offers valuable insights into computing concepts, it's not the sole method for designing systems.

The Beginning: Version Control

Assuming clarity on the presence of a UI and the general purpose of the service, the initial step prior to development is implementing a source control/version control system to back the code. This typically entails using tools like Git, Mercurial, or others to back up the code and facilitate collaboration, especially as the number of contributors grows. It's common for startups to begin with Git as their version control system, often leveraging platforms like github.com for hosting Git repositories. An essential element of version control is the pull request, which facilitates peer review within your team. This process enhances code quality by allowing multiple individuals to review and approve proposed changes before integration. While I won't delve into specifics here, a quick online search will provide ample information on the topic.

Developing the Service

Once version control is established, the next step involves setting up a repository and initiating service development. This article adopts a language-agnostic approach, as delving into specific languages and optimal tech stacks for every service function would be overly detailed. For conciseness, let's focus on a service that executes functions based on inputs and requires backend storage (while remaining neutral on the storage solution, which will be discussed later).

As you commence service development, it's crucial to grasp how to run it locally on your laptop or in any developer environment. Consider this aspect carefully, as local testing plays a pivotal role in efficient development. While crafting the service, ensure that classes, functions, and other components are organized in a modular manner, split into separate files as necessary. This organizational approach promotes a structured repository and facilitates comprehensive unit test coverage.

Unit tests represent a critical aspect of testing that developers should rigorously prioritize; there should be no compromises in this regard. Countless incidents and production issues could have been averted with a few unit tests, and neglecting this practice can incur significant financial costs for a company. A minimal sketch of such a test appears at the end of this section. I won't delve into the specifics of integrating the gRPC framework, REST packages, or other communication protocols; you'll have the freedom to explore and implement these as you develop the service.
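Since the article is deliberately language-agnostic, here is one possible illustration in Java with JUnit 5, matching the other examples in this collection; PriceCalculator and its behavior are entirely hypothetical:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical unit under test: a small, modular class that is easy to test in isolation.
class PriceCalculator {
    long applyDiscount(long priceCents, int discountPercent) {
        if (discountPercent < 0 || discountPercent > 100) {
            throw new IllegalArgumentException("discount must be between 0 and 100");
        }
        return priceCents - (priceCents * discountPercent / 100);
    }
}

class PriceCalculatorTest {
    private final PriceCalculator calculator = new PriceCalculator();

    @Test
    void appliesDiscount() {
        assertEquals(900, calculator.applyDiscount(1000, 10));
    }

    @Test
    void rejectsInvalidDiscount() {
        // The failure path is exactly the kind of case that prevents production incidents.
        assertThrows(IllegalArgumentException.class,
                () -> calculator.applyDiscount(1000, 150));
    }
}
```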
Once the service is executable and tested through unit tests and basic manual testing, the next step is to make it "deployable."

Packaging the Service

Making the service "deployable" means having a manageable way to run the process. Now that we have a runnable process, who will initiate it, and where will it be executed? Addressing these questions is crucial, and we'll now proceed to answer them.

In my humble opinion, managing your own compute infrastructure is usually not the best approach; there are numerous intricacies involved in making a service accessible on the Internet. Opting for a cloud service provider (CSP) is a wiser choice, as they handle much of the complexity behind the scenes. For our purposes, any available cloud service provider will suffice.

Once a CSP is selected, the next consideration is how to manage the process. We want to avoid manual intervention every time the service crashes, especially without notification. The solution lies in containerization: creating a container image for our process, essentially a filesystem containing all necessary dependencies at the application layer. A Dockerfile specifies the steps for including the process and its dependencies in the container image (a concrete sketch appears shortly). Once the Dockerfile is written, the docker build CLI can generate an image with tags. This image is then stored locally or pushed to a container registry, a repository for container images that can later be pulled onto a compute instance. With these steps outlined, the next question is how containerization orchestrates our process, which the following section addresses.

Executing the Container

After building a container image, the subsequent step is its execution, which in turn starts the service we've developed. Various container runtimes, such as containerd, podman, and others, are available to facilitate this. In this context, we use the docker CLI to manage the container, which interacts with containerd in the background. Running a container is straightforward: docker run executes the container and, consequently, our process. You may observe logs in the terminal (if the container is not run as a daemon) or use docker logs to inspect service logs. Additionally, options like --restart can be included in the command to automatically restart the container (i.e., the process) in the event of a crash, allowing for customization as required.

At this stage, we have our process encapsulated within a container, ready for execution as required. While this setup is suitable for local testing, our next step is to deploy it on a basic compute instance within our chosen CSP.

Deploying the Container

Now that we have a container, it's advisable to publish it to a container registry. Numerous container registries are available, managed by CSPs or by Docker itself. Once the container is published, it becomes easily accessible from any CSP or platform: we can pull the image and run it on a compute instance, such as a virtual machine (VM), allocated within the CSP. Starting with this option is typically the most cost-effective and straightforward.
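To make the packaging and deployment flow concrete, here's a minimal sketch, assuming a Java service packaged as a jar; the image names, registry path, and port are hypothetical:

```dockerfile
# Dockerfile: a filesystem with all application-layer dependencies.
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY target/my-service.jar my-service.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "my-service.jar"]
```

And the corresponding build, publish, and run steps described above:

```
docker build -t registry.example.com/my-service:0.1.0 .
docker push registry.example.com/my-service:0.1.0
docker run -d --restart=always -p 8080:8080 registry.example.com/my-service:0.1.0
docker logs <container-id>
```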
While we briefly touch on other forms of compute infrastructure later in this article, deploying on a VM involves pulling a container image and running it, much like we did in our developer environment. Voila! Our service is deployed. However, ensuring accessibility to the world requires careful consideration. While directly exposing the VM's IP to the external world may seem tempting, it poses security risks. A better approach involves implementing TLS and using a reverse proxy to route requests to specific services. This ensures security and facilitates the deployment of multiple services on the same VM. To enable internet access to our service, we require a method for inbound traffic to reach our VM. An effective solution involves installing a reverse proxy like Nginx directly on the VM. This can be achieved by pulling the Nginx container image, typically tagged "nginx:latest". Before launching the container, it's necessary to configure Nginx settings such as servers, locations, and additional configurations. Security measures like TLS can also be implemented for enhanced protection. Once the Nginx configuration is established, it can be exposed to the container through volumes during container execution. This setup allows the reverse proxy to effectively route incoming requests to the container running on the same VM, using a specified port. One notable advantage is the ability to host multiple services within the VM, with routing efficiently managed by the reverse proxy. To finalize the setup, we must expose the VM's IP address and proxy port to the internet, with TLS encryption supported by the reverse proxy. This adjustment can typically be made through the CSP's settings. NOTE: The examples of solutions provided below may reference GCP as the CSP. This is solely for illustrative purposes and should not be interpreted as a recommendation; the intention is to convey concepts effectively. Consider the scenario where managing a single VM manually becomes laborious and lacks scalability. To address this challenge, CSPs offer solutions akin to managed instance groups, comprising multiple VMs configured identically. These groups often come with features like startup scripts, which execute upon VM initialization. All the configurations discussed earlier can be scripted into these startup scripts, simplifying the process of VM launch and enhancing scalability. This setup proves beneficial when multiple VMs are required to handle requests efficiently. Now, the question arises: when dealing with multiple VMs, how do we decide where to route requests? The solution is to employ a load balancer provided by the CSP, which selects one VM from the pool to handle each request. Additionally, we can streamline the setup with general load balancing: instead of running an individual reverse proxy on each VM, we can create an instance group for every service needed, fronted by a load balancer for each. The general load balancer exposes its IP with TLS configured and routes set up, so only the service containers need to run on the VMs. It's essential to ensure that VM IPs and ports are accessible solely by the load balancer in the ingress path, a task achievable through configurations provided by the CSP. At this juncture, we have a load balancer securely managing requests, directing them to the specific container service within a VM from a pool of VMs. This setup itself contributes to scaling our service.
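Returning to the single-VM setup described above, here is a minimal sketch of an Nginx reverse proxy configuration; the domain, certificate paths, route prefix, and upstream port are illustrative assumptions.

Nginx
# Terminate TLS and route a path prefix to a service container on the same VM
server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location /myservice/ {
        proxy_pass http://127.0.0.1:8080/;   # the service container's published port
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Shell
# Expose the configuration and certificates to the Nginx container through volumes
docker run -d -p 443:443 \
  -v /opt/nginx/default.conf:/etc/nginx/conf.d/default.conf \
  -v /opt/nginx/certs:/etc/nginx/certs \
  nginx:latest

Adding another location block per service is what lets multiple services share the same VM behind one proxy.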
To further enhance scalability and eliminate the need for continuous VM operation, we can opt for an autoscaler policy. This policy dynamically scales the VM group up or down based on parameters such as CPU, memory, or others provided by the CSP. Now, let's delve into the concept of Infrastructure as Code (IaC), which holds significant importance in efficiently managing CSP components that promote scale. Essentially, IaC involves describing CSP infrastructure components in configuration files, which an IaC tool (like Terraform) interprets to create and manage the infrastructure accordingly. For more detailed information, refer to the wiki. Datastore We've previously discussed scaling our service, but it's crucial to remember that there's typically a requirement to maintain state somewhere. This is where databases or datastores play a pivotal role. From experience, handling this aspect can be quite tricky, and I would once again advise against developing a custom solution. CSP solutions are ideally suited for this purpose. CSPs generally handle the complexity associated with managing databases, addressing concepts such as master-slave architecture, replica management, synchronous-asynchronous replication, backups/restores, consistency, and other intricate aspects more effectively. Managing a database yourself can be challenging due to concerns about data loss arising from improper configurations. Each CSP offers different database offerings, and it's essential to consider the specific use cases the service deals with to choose the appropriate offering. For instance, one may need to decide between using a relational database offering versus a NoSQL offering. This article does not delve into these differences. The database should be accessible from the VM group and serve as a central datastore for all instances where the state is shared. It's worth noting that the database or datastore should only be accessible within the VPC, and ideally, only from the VM group. This is crucial to avoid exposing an ingress IP for the database, ensuring security and data integrity. Queues In service design, we often encounter scenarios where certain tasks need to be performed asynchronously. This means that upon receiving a request, part of the processing can be deferred to a later time without blocking the response to the client. One common approach is to utilize databases as queues, where requests are ordered by some identifier. Alternatively, CSP services such as Amazon SQS or GCP Pub/Sub can be employed for this purpose. Messages published to the queue can then be retrieved for processing by a separate service that listens to the queue. However, we won't delve into the specifics here. Monitoring In addition to the VM-level monitoring typically provided by the CSP, there may be a need for more granular insights through service-level monitoring. For instance, one might require latency metrics for database requests, metrics based on queue interactions, or metrics for service CPU and memory utilization. These metrics should be collected and forwarded to a monitoring solution such as Datadog, Prometheus, or others. These solutions are typically backed by a time-series database (TSDB), allowing users to gain insights into the system's state over a specific period of time. This monitoring setup also facilitates debugging certain types of issues and can trigger alerts or alarms if configured to do so. Alternatively, you can set up your own Prometheus deployment, as it is an open-source solution.
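To tie the autoscaler and IaC ideas discussed above together, here is a minimal Terraform sketch for a GCP managed instance group with a CPU-based autoscaler; the resource names, zone, and thresholds are illustrative assumptions, and the referenced instance template is assumed to be defined elsewhere in the configuration.

Terraform
# Managed instance group of identically configured VMs running the service container
resource "google_compute_instance_group_manager" "service_mig" {
  name               = "myservice-mig"
  base_instance_name = "myservice"
  zone               = "us-central1-a"
  version {
    instance_template = google_compute_instance_template.service.id
  }
}

# Scale the group between 2 and 10 VMs, targeting 60% average CPU utilization
resource "google_compute_autoscaler" "service_autoscaler" {
  name   = "myservice-autoscaler"
  zone   = "us-central1-a"
  target = google_compute_instance_group_manager.service_mig.id
  autoscaling_policy {
    min_replicas = 2
    max_replicas = 10
    cpu_utilization {
      target = 0.6
    }
  }
}

Checking files like these into version control gives the team a reviewable, reproducible record of the infrastructure, which is the core benefit of IaC.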
With the aforementioned concepts, it should be feasible to deploy a scalable service. This level of scalability has proven sufficient for numerous startups that I have provided consultation for. Moving forward, we'll explore the utilization of a "container orchestrator" instead of deploying containers in VMs, as described earlier. In this article, we'll use Kubernetes (k8s) as an example to illustrate this transition. Container Orchestration: Enter Kubernetes (K8s) Having implemented the aforementioned design, we can effectively manage numerous requests to our service. Now, our objective is to achieve decoupling to further enhance scalability. This decoupling is crucial because a bug in any service within a VM could lead to the VM crashing, potentially causing the entire ecosystem to fail. Moreover, decoupled services can be scaled independently. For instance, one service may have sufficient scalability and effectively handle requests, while another may struggle with the load. Consider the example of a shopping website where the catalog may receive significantly more visits than the checkout page. Consequently, the scale of read requests may far exceed that of checkouts. In such cases, deploying multiple service containers into Kubernetes (K8s) as distinct services allows for independent scaling. Before delving into specifics, it's worth noting that CSPs offer Kubernetes as a compute platform option, which is essential for scaling to the next level. Kubernetes (K8s) We won't delve into the intricacies of Kubernetes controllers or other aspects in this article. The information provided here will suffice to deploy a service on Kubernetes. Kubernetes (K8s) serves as an abstraction over a cluster of nodes with storage and compute resources. Depending on where the service is scheduled, the node provides the necessary compute and storage capabilities. Having container images is essential for deploying a service on Kubernetes (K8s). Resources in K8s are represented by creating configurations, which can be in YAML or JSON format, and they define specific K8s objects. These objects belong to a particular "namespace" within the K8s cluster. The basic unit of compute within K8s is a "Pod," which can run one or more containers. Therefore, a config for a pod can be created, and the service can then be deployed onto a namespace using the K8s CLI, kubectl. Once the pod is created, your service is essentially running, and you can monitor its state using kubectl with the namespace as a parameter. To deploy multiple pods, a "deployment" is required. Kubernetes (K8s) offers various resources such as deployments, stateful sets, and daemon sets. The K8s documentation provides sufficient explanations for these abstractions, so we won't discuss each of them here. A deployment is essentially a resource designed to deploy multiple pods of a similar kind. This is achieved through the "replicas" option in the configuration, and you can also choose an update strategy according to your requirements. Selecting the appropriate update strategy is crucial to ensure there is no downtime during updates. Therefore, in our scenario, we would utilize a deployment for our service that scales to multiple pods. When employing a Deployment to oversee your application, Pods can be dynamically generated and terminated. Consequently, the count and identities of operational and healthy Pods may vary unpredictably.
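Here is a minimal sketch of such a Deployment configuration; the names, namespace, image, and replica count are illustrative assumptions.

YAML
# Deploy three replicas of the service with a rolling update strategy to avoid downtime
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice
  namespace: myteam
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myservice
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: myservice
    spec:
      containers:
        - name: myservice
          image: registry.example.com/myservice:v1
          ports:
            - containerPort: 8080

Applying it with kubectl apply -f deployment.yaml and then watching kubectl get pods -n myteam shows the replicas being created and replaced as needed.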
Kubernetes manages the creation and removal of Pods to sustain the desired state of your cluster, treating Pods as transient resources with no assured reliability or durability. Each Pod is assigned its own IP address, typically managed by network plugins in Kubernetes. As a result, the set of Pods linked with a Deployment can fluctuate over time, presenting a challenge for components within the cluster to consistently locate and communicate with specific Pods. This challenge is mitigated by employing a Service resource. After establishing a service object, the subsequent topic of discussion is Ingress. Ingress is responsible for routing to multiple services within the cluster. It facilitates the exposure of HTTP, HTTPS, or even gRPC routes from outside the cluster to services within it. Traffic routing is managed by rules specified on the Ingress resource, which is supported by a load balancer operating in the background. With all these components deployed, our service has attained a commendable level of scalability. It's worth noting that the concepts discussed prior to entering the Kubernetes realm are mirrored here in a way — we have load balancers, containers, and routes, albeit implemented differently. Additionally, there are other objects such as the Horizontal Pod Autoscaler (HPA) for scaling pods based on memory/CPU utilization, and storage constructs like Persistent Volumes (PV) or Persistent Volume Claims (PVC), which we won't delve into extensively. Feel free to explore these for a deeper understanding. CI/CD Lastly, I'd like to address an important aspect of enhancing developer efficiency: Continuous Integration/Deployment (CI/CD). Continuous Integration (CI) involves running automated tests (such as unit, end-to-end, or integration tests) on any developer pull request or check-in to the version control system, typically before merging. This helps identify regressions and bugs early in the development process. After merging, CI generates images and other artifacts required for service deployment. Tools like Jenkins (or Jenkins X), Tekton, GitHub Actions, and others facilitate CI processes. Continuous Deployment (CD) automates the deployment process across environments, such as development, staging, or production. Usually, the development environment is deployed first, followed by running several end-to-end tests to identify any issues. If everything functions correctly, CD proceeds to deploy to other environments. All the aforementioned tools also support CD functionalities. CI/CD tools significantly improve developer efficiency by reducing manual work. They are essential to ensure developers don't spend hours on manual tasks. Additionally, during manual deployments, it's crucial to ensure no one else is deploying to the same environment simultaneously to avoid conflicts, a concern that can be addressed effectively by our CD framework. There are other aspects, like dynamic config management, securely storing secrets/passwords, and logging systems; though we won't delve into the details, I would encourage readers to look into the links provided. Thank you for reading!
Filtering system calls is an essential component of many host-based runtime security products on Linux systems. There are many different techniques that can be used to monitor system calls, all of which have certain tradeoffs. Recently, kernel modules have become less popular in favor of user space runtime security agents due to portability and stability benefits. Unfortunately, it is possible to architect user space agents in such a way that they are susceptible to several attacks such as time-of-check to time-of-use (TOCTOU), agent tampering, and resource exhaustion. This article explains attacks that often affect user space security products and how popular technologies such as Seccomp and eBPF can be used in such a way that avoids these issues. Attacks Against User Space Agents User space agents are often susceptible to several attacks such as TOCTOU, tampering, and resource exhaustion. These attacks all take advantage of the fact that the user space agent must communicate with the kernel before it makes a decision about a system call or other action that occurs on the system. Generally, these attacks attempt to modify data passed in system calls in such a way that prevents a user space agent from detecting an attack, or they take advantage of the fact that the agent does not protect itself from tampering. TOCTOU vulnerabilities present a substantial risk to user space security agents running on the Linux kernel. These vulnerabilities arise when security decisions are based on data that can be altered by an attacker between the check and the subsequent use. For instance, a user space security agent might check the arguments of a system call before allowing a certain operation, but during the time gap before the operation is executed, an adversary could change the system call’s arguments. This manipulation could lead to a divergence between the state perceived by the security agent and the actual state, potentially resulting in security breaches. Addressing TOCTOU challenges in user space security agents requires careful consideration of synchronization mechanisms, ensuring that checks and corresponding actions are executed atomically to prevent exploitation. Resource exhaustion poses a notable threat to user space security agents operating on the Linux kernel, often through the execution of an excessive number of system calls. In this scenario, attackers exploit the agent's requirement to check system calls in a manner that is non-blocking. By initiating a barrage of system calls, such as file operations, network connections, or process creation, adversaries aim to overload the agent with benign events and exhaust the agent’s resources such as CPU, memory, or network bandwidth. User space security agents need to implement effective blocking mechanisms that enable them to perform a check on a system call before allowing the call to complete its execution. Tampering attacks are another common issue user space security agents must address. In these attacks, adversaries aim to manipulate the behavior or compromise the integrity of the user space security agent itself, rendering it ineffective or allowing it to be bypassed. Typically, tampering with the agent requires root-level access to the system, as most security agents run as root. Tampering can take various forms, including altering the configuration of the security agent, deleting or modifying the agent’s executable files on disk, injecting malicious code into its processes, and temporarily pausing or killing its processes with signals.
By subverting the user space security agent, attackers can disable critical security features and evade detection. User space security agents must be aware of these attacks and have the appropriate detection mechanisms built in. Seccomp for Kernel Filtering Seccomp, short for “Secure Computing”, is a Linux kernel feature designed to filter system calls made by a process thread. It allows user space security agents to define a restricted set of allowed system calls, reducing the attack surface of an application. Options for system calls that violate the filter include killing the application and notifying another user space process such as a user space security agent. Traditional seccomp operates by preventing all system calls except for read, write, exit, and sigreturn, which significantly restricts the system calls a thread may execute. Seccomp-BPF (Berkeley Packet Filter) is an evolution that provides a more flexible filtering mechanism compared to traditional seccomp. Unlike the original version, seccomp-BPF allows for the dynamic loading of custom Berkeley Packet Filter programs, enabling more fine-grained control over filtering criteria. Seccomp-BPF enables the restriction of specific system calls and enables inspection of system call parameters to inform filtering decisions. Seccomp-BPF cannot dereference pointers, so its system call argument analysis is focused on the value of the arguments themselves. By enforcing policies that exclude potentially risky system calls and interactions, seccomp-BPF contributes significantly to enhancing application security, offering a far more versatile approach to system call filtering than the traditional mode. Seccomp avoids the TOCTOU problem by evaluating system call arguments directly. Because seccomp inspects arguments by value, it is not possible for an attacker to alter them after the initial system call. Thus, the attacker does not have an opportunity to modify the data inspected by seccomp after the security check is performed. It is important to note that user space applications that need to dereference pointers to inspect data such as file paths must do so carefully, as this approach can potentially be manipulated by TOCTOU attacks if appropriate precautions are not taken. For example, a security agent could change the value of a pointer argument to a system call to a non-deterministic location and explicitly set the memory it points to. This approach makes TOCTOU attacks more challenging because it prevents another malicious thread in the monitored process from modifying memory pointed to by the original system call arguments. Seccomp is designed with tampering in mind. Both seccomp and seccomp-BPF are immutable: once a thread has seccomp enabled, it cannot be disabled. Similarly, seccomp-BPF filters are inherited by all child processes. If additional seccomp programs are added, they are executed in LIFO order. All seccomp-BPF filters that are loaded are executed, and the most restrictive result returned by the filters is enacted on the thread. Because seccomp settings and filters are immutable and inherited by child processes, it is not possible for an attacker to bypass their defenses without a kernel exploit. It is important that seccomp-BPF filters consider both 64-bit and 32-bit system calls, as one technique sometimes used to evade filtering is to change the ABI to 32-bit on a 64-bit operating system. Seccomp avoids resource exhaustion because all system call checks occur inline and before the system call is executed. Thus, the thread executing the system call is blocked while the filter is inspecting the system call arguments. This approach prevents the calling thread from executing additional system calls while the seccomp filter is operating. Because seccomp-BPF filters are pure functions, they cannot save data across executions, so it is not possible to cause them to run out of working memory by storing data about previously executed system calls. This design ensures seccomp itself cannot be driven to exhaust the system's memory. By avoiding TOCTOU, tampering, and resource consumption issues, seccomp provides a powerful mechanism for security teams and application developers to enhance their security posture. Seccomp provides a flexible approach to runtime detection and protection against various threats, from malware to exploitable vulnerabilities, and it works across Linux distributions. Thus, teams can use seccomp to enhance the security posture of their Linux workloads in the cloud, in the data center, and at the edge.
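As a concrete illustration, container runtimes accept seccomp-BPF policies expressed as JSON profiles. The following minimal sketch is an assumption-level example, not a production policy: it allows a handful of system calls for both the 64-bit and 32-bit ABIs (addressing the evasion technique mentioned above) and rejects everything else with an error.

JSON
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group", "rt_sigreturn"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Shell
# Apply the profile to a container; a real workload needs a much larger allowlist
# (execve, mmap, and so on), so this command is expected to fail loudly
docker run --security-opt seccomp=profile.json alpine echo hello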
eBPF for Kernel Filtering eBPF can mitigate TOCTOU vulnerabilities by executing filtering logic directly within the kernel, eliminating the need for transitions between user space and kernel space. This inline execution ensures that security decisions are made atomically, leaving no opportunity for attackers to manipulate the system state between security checks and system call execution. However, this also depends on where exactly the program hooks into the kernel. When hooking into system calls, the memory location with the pathname to be accessed belongs to user space, and user space can change it after the hook runs but before the pathname is used to perform the actual in-kernel open operation. This is depicted in the image below, where the BPF hook checks the “innocent” path, but the kernel operation actually happens with the “suspicious” path. Hooking into a kernel function that runs after the path is copied from user space to kernel space avoids this problem because the hook operates on memory that the user space application cannot modify. For example, in file integrity monitoring (FIM), instead of a system call we could hook into the security_file_permission function, which is called on every file access, or security_file_open, which is executed whenever a file is opened. By accessing system call arguments within the kernel context, eBPF programs can ensure that security decisions are based on consistent and verifiable information, effectively neutralizing TOCTOU attack vectors. It is impossible to do proper enforcement without in-kernel filtering because by the time the event has reached user space, the operation may already have been executed, and it is too late. eBPF also provides robust mechanisms for preventing tampering attacks by executing filtering logic within the kernel. Unlike user space agents, which may be susceptible to tampering attempts targeting their executable files, memory contents, or configuration settings, eBPF programs operate within the highly privileged kernel context, where access controls and integrity protections are strictly enforced. For instance, an eBPF program enforcing integrity checks on critical system files can maintain cryptographic hashes of file contents within kernel memory, ensuring that any unauthorized modifications are detected and prevented in real time. With eBPF, the state of what is watched can be updated in the kernel inline with the operations, while doing this in user space introduces race conditions.
Finally, eBPF addresses resource exhaustion attacks by implementing efficient event filtering and resource management strategies within the kernel. Unlike user space agents, which may be overwhelmed by excessive system call traffic, eBPF programs can leverage kernel-level optimizations to efficiently process and prioritize incoming events, ensuring optimal utilization of system resources. Deciding at the eBPF hook whether the event is of interest to the user means that no extraneous events will be generated and processed by the agent. The alternative, doing the filtering in user space, tends to induce significant overhead for events that happen very frequently in a system (such as file access or networking), which can lead to resource exhaustion. Low-overhead in-kernel filtering means security teams no longer have a resource concern driving decisions on how many files to monitor or whether to enable FIM on systems with extensive I/O operations, such as database servers. eBPF can filter out non-relevant events that are uninteresting to the policy, repetitive, or part of the normal expected behavior to minimize overhead. Thus, eBPF-based security agents can optimize resource utilization and ensure uninterrupted protection against resource exhaustion attacks. By leveraging eBPF's capabilities to mitigate TOCTOU vulnerabilities, prevent tampering attacks, and reduce resource exhaustion risks, security teams can develop runtime security solutions that effectively protect Linux systems against a wide range of threats.
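To make the hook placement concrete, here is a small bpftrace sketch that observes file opens at security_file_open, after the pathname has been copied into kernel memory. It assumes a BTF-enabled kernel and the bpftrace tool; the field access path is the commonly used pattern for this hook, so treat it as an illustration rather than a drop-in tool.

bpftrace
// Print the process and file name at security_file_open, past the
// syscall-argument TOCTOU window described above.
kprobe:security_file_open
{
  $f = (struct file *)arg0;
  printf("%s opened %s\n", comm, str($f->f_path.dentry->d_name.name));
}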
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices. Microservices-based applications are distributed in nature, and each service can run on a different machine or in a container. However, splitting business logic into smaller units and deploying them in a distributed manner is just the first step. We then must understand the best way to make them communicate with each other. Microservices Communication Challenges Communication between microservices should be robust and efficient. When several small microservices are interacting to complete a single business scenario, it can be a challenge. Here are some of the main challenges arising from microservice-to-microservice communication. Resiliency There may be multiple instances of microservices, and an instance may fail for several reasons — for example, it may crash or be overwhelmed with too many requests and thus unable to process requests. There are two design patterns that make communication between microservices more resilient: retry and circuit breaker. Retry In a microservices architecture, transient failures are unavoidable due to communication between multiple services within the application, especially on a cloud platform. These failures could occur due to various scenarios such as a momentary connection loss, response time-out, service unavailability, slow network connections, etc. (Shrivastava, Shrivastav 2022). Normally, these errors are resolved by retrying the request either immediately or after a delay, depending on the type of error that occurred. The retry is carried out a preconfigured number of times until it succeeds or times out. However, a point of note is that the operation must preserve logical consistency across attempts, producing repeatable responses and avoiding side effects outside of our expectations. Circuit Breaker In a microservices architecture, as discussed in the previous section, failures can occur due to several reasons and are typically self-resolving. However, this may not always be the case, since a situation of varying severity may arise where the errors take longer than estimated to be resolved or may not be resolved at all. The circuit breaker pattern, as the name implies, causes a break in a function operation when the errors reach a certain threshold. Usually, this break also triggers an alert that can be monitored. As opposed to the retry pattern, a circuit breaker prevents an operation that’s likely to result in failure from being performed. This prevents congestion due to failed requests and the escalation of failures downstream. Other operations can continue despite the persisting error, enabling the efficient use of computing resources. The error does not stall the completion of other operations that are using the same resource, which is inherently limited (Shrivastava, Shrivastav 2022).
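To illustrate both patterns, here is a minimal Java sketch using Resilience4j, one popular resilience library; the service name, thresholds, and the stubbed remote call are illustrative assumptions rather than recommended settings.

Java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class ResilientClient {
    public static void main(String[] args) {
        // Retry up to 3 times with a 500 ms pause between attempts
        Retry retry = Retry.of("notifService", RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofMillis(500))
                .build());

        // Open the circuit once 50% of recent calls have failed
        CircuitBreaker breaker = CircuitBreaker.of("notifService", CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build());

        // Wrap a hypothetical remote call with both patterns
        Supplier<String> decorated = Decorators.ofSupplier(() -> callRemoteService())
                .withCircuitBreaker(breaker)
                .withRetry(retry)
                .decorate();

        System.out.println(decorated.get());
    }

    static String callRemoteService() {
        return "response"; // placeholder for an HTTP or gRPC call
    }
}

Calls through the decorated supplier are retried up to three times; once half of the recent calls fail, the breaker opens and rejects calls for 30 seconds instead of letting failures pile up downstream.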
Distributed Tracing Modern-day microservices-architecture-based applications are made up of distributed systems that are exceedingly complex to design, and monitoring and debugging them becomes even more complicated. Due to the large number of microservices involved in an application that spans multiple development teams, systems, and infrastructures, even a single request involves a complex network of communication. While this complexity enables a scalable, efficient, and reliable architecture, it also makes system observability more challenging to achieve, thereby creating issues with troubleshooting. Distributed tracing helps us overcome this observability challenge by using a request-centric view. As a request is processed by the components of a distributed system, distributed tracing captures the detailed execution of the request and its causally related actions across the system's components (Shkuro 2019). Load Balancing Load balancing is the method used to utilize resources optimally and to ensure smooth operational performance. In order to be efficient and scalable, more than one instance of a service is used, and the incoming requests are distributed across these instances for a smooth process flow. In Kubernetes, load balancing algorithms are implemented in a more effective manner using a service mesh, which bases decisions on recorded metrics such as latency. Service meshes mainly manage the traffic between services on the network, ensuring that inter-service communications are safe and reliable by enabling the services to detect and communicate with each other. The use of a service mesh improves observability and aids in monitoring highly distributed systems. Security Each service must be secured individually, and the communication between services must be secure. In addition, there needs to be a centralized way to manage access controls and authentication across all services. One of the most popular ways to secure microservices is to use API gateways, which act as proxies between the clients and the microservices. API gateways can perform authentication and authorization checks, rate limiting, and traffic management. Service Versioning The deployment of a microservice version update often leads to unexpected issues and breaking errors between the new version of the microservice and other microservices in the system, or even external clients using that microservice. While the team deploying the new version attempts to mitigate and reduce these breaks, multiple versions of the same microservice can be run simultaneously, thereby allowing requests to be routed to the appropriate version of the microservice. This is done using API versioning for API contracts. Communication Patterns Communication between microservices can be designed by using two main patterns: synchronous and asynchronous. In Figure 1, we see a basic overview of these communication patterns along with their respective implementation styles and choices. Figure 1. Synchronous and asynchronous communication with common implementation technologies Synchronous Pattern Synchronous communication between microservices is one-to-one communication. The microservice that generates the request is blocked until a response is received from the other service. This is done using HTTP requests or gRPC — a high-performance remote procedure call (RPC) framework. In synchronous communication, the microservices are tightly coupled, which is advantageous for less distributed architectures where communication happens in real time, thereby reducing the complexity of debugging (Newman 2021). Figure 2. Synchronous communication depicting the request-response model The following table shows a comparison between technologies that are commonly used to implement the synchronous communication pattern.
Table 1. REST vs. gRPC vs. GraphQL. Architectural principles: REST uses a stateless client-server architecture and relies on URIs and HTTP methods for a layered system with a uniform interface; gRPC uses the client-server method of remote procedure call, where methods are directly called by the client and behave like local methods although they are on the server side; GraphQL uses client-driven architecture principles and relies on queries, mutations, and subscriptions via APIs to request, modify, and update data from/on the server. HTTP methods: REST uses POST, GET, PUT, and DELETE; gRPC uses custom methods; GraphQL uses POST. Payload data structure to send/receive data: REST uses JSON- and XML-based payloads; gRPC uses Protocol Buffers-based serialized payloads; GraphQL uses JSON-based payloads. Request/response caching: natively supported on the client and server side in REST; unsupported by default in gRPC; supported in GraphQL but complex, as all requests share a common endpoint. Code generation: natively unsupported in REST, requiring third-party tools like Swagger; natively supported in gRPC; natively unsupported in GraphQL, requiring third-party tools like GraphQL Code Generator. Asynchronous Pattern In asynchronous communication, as opposed to synchronous, the microservice that initiates the request is not blocked until the response is received. It can proceed with other processes without receiving a response from the microservice it sends the request to. In the case of a more complex distributed microservices architecture, where the services are not tightly coupled, asynchronous message-based communication is more advantageous as it improves scalability and enables continued background operations without affecting critical processes (Newman 2021). Figure 3. Asynchronous communication Event-Driven Communication The event-driven communication pattern leverages events to facilitate communication between microservices. Rather than sending a request, microservices generate events without any knowledge of the other microservices' intents. These events can then be used by other microservices as required. The event-driven pattern is a form of asynchronous communication, as the microservices listening to these events have their own processes to execute. The principle behind events is entirely different from the request-response model. The microservice emitting the event leaves the recipient fully responsible for handling the event, while the microservice itself has no idea about the consequences of the generated event. This approach enables loose coupling between microservices (Newman 2021). Figure 4. Producers emit events that some consumers subscribe to Common Data Communication through common data is asynchronous in nature and is achieved by having a microservice store data at a specific location where another microservice can then access that data. The data's location must be persistent storage, such as data lakes or data warehouses. Although common data is frequently used as a method of communication between microservices, it is often not considered a communication protocol because the coupling between microservices is not always observable when it is used. This communication style finds its best use case in situations that involve large volumes of data, as a common data location prevents redundancy, makes data processing more efficient, and is easily scalable (Newman 2021). Figure 5. An example of communication through common data Request-Response Communication The request-response communication model is similar to the synchronous communication that was previously discussed — where a microservice sends a request to another microservice and has to await a response.
Along with the previously discussed protocols (HTTP, gRPC, etc.), message queues are used as well. Request-response is implemented as one of the following two methods: Blocking synchronous – Microservice A opens a network connection and sends a request to Microservice B along this connection. The established connection stays open while Microservice A waits for Microservice B to respond. Non-blocking asynchronous – Microservice A sends a request to Microservice B, and Microservice B needs to know implicitly where to route the response. Also, message queues can be used; they provide an added benefit of buffering multiple requests in the queue to await processing. This method is helpful in situations where the rate of requests received exceeds the rate of handling these requests. Rather than trying to handle more requests than its capacity, the microservice can take its time generating a response before moving on to handle the next request (Newman 2021). Figure 6. An example of request-response non-blocking asynchronous communication Conclusion In recent years, we have observed a paradigm shift from designing large, clunky, monolithic applications that are complex to scale and maintain to using microservices-based architectures that enable the design of distributed applications — ones that can integrate multiple communication patterns and protocols across systems. These complex distributed systems can be developed, deployed, scaled, and maintained independently by different teams with fewer conflicts, resulting in a more robust, reliable, and resilient application. Using the most optimal communication pattern and protocol for the exact operation that a microservice must achieve is a crucial task and has a huge impact on the functionality and performance of an application. The aim is to make the communication between microservices as seamless as possible to establish an efficient system. In-depth knowledge of the available communication patterns and protocols is an essential aspect of modern-day cloud-based application design, a field that is not only dynamic but also highly competitive, with multiple contenders providing near-identical applications and services. Speed, scalability, efficiency, security, and other additional features are often crucial in determining the overall quality of an application, and proper microservices communication is the backbone to achieving those capabilities. References: Shrivastava, Saurabh, and Neelanjali Shrivastav. 2022. Solutions Architect's Handbook, 2nd Edition. Packt. Shkuro, Yuri. 2019. Mastering Distributed Tracing. Packt. Newman, Sam. 2021. Building Microservices, 2nd Edition. O'Reilly. This is an excerpt from DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices.
The slow Java startup problem is notorious in the Java community, but its meaning can confuse the observer. The slow startup problem relates to the process of starting a set of interconnected applications built on complex Java frameworks. Such a process might include starting several Spring Boot applications, each taking around 10 seconds, so the startup of the production system as a whole takes a minute, while the startup of a single JVM in this set takes around 50 milliseconds. Attaching the widespread phrase "slow Java startup" to this process is therefore not exactly accurate, as technically this is not a Java problem but a framework problem. The effect of slow startup and warmup is caused by the complex frameworks that we use and by dynamic features in the runtime. Java is unique in its functionality, and thanks to the power of its language and ecosystem, Java is very popular among enterprises. The same complexity, though, can make it clumsy in the cloud. Java application startup and warmup technically include several consecutive processes: JVM startup, application startup, and JVM warmup. In these processes, the JVM needs extra time before it can provide peak application performance. The warmup phase is the time the JVM takes to interpret, compile, and optimize the code; for large, complex applications this lasts substantially longer than the startup itself, taking up to several minutes. Every time you start your program, these processes begin from scratch. In practice, it means that we spend time running the application and use significant CPU and memory resources to ensure its performance at the startup point. Therefore, slow startup and warmup lead to extra resources being spent on the phase that prepares the application to run, on top of the resources required for its operation. Consequently, with slow startup and warmup, you get increased cloud costs and resource overutilization. Search for the Solutions There are several ways to deal with the issue. Java Optimization Migrating to a newer long-term support (LTS) version of Java can improve application performance slightly, bringing minor changes. Such optimization is a quick method, available immediately. GraalVM Using native images can be beneficial. However, using GraalVM may bring problems such as compilation difficulties, strange errors, and different flags, making it unsuitable for some projects. Project Leyden Its primary goal is to "improve the startup time, time to peak performance, and footprint of Java programs." This project is not yet complete, so we cannot yet evaluate the effect and possible difficulties of adoption. However, among all of these, Project Leyden is designed to solve the problem of slow startup, and we follow the news with great expectations for its results. Coordinated Restore at Checkpoint It is an OpenJDK project entirely focused on Java startup enhancement. The project's primary aim is to develop a new standard mechanism-agnostic API to notify Java programs about the checkpoint and restore events. Coordinated Restore at Checkpoint (CRaC) offers a checkpoint/restore API mechanism allowing the creation of an image of a running application at an arbitrary point in time (a "checkpoint") and then starting the image from the checkpoint file (snapshot). This process restores the state of an application from the point when the checkpoint was made.
Using the CRaC feature with a Java runtime enables you to pause the application and restart it from the moment it was paused and, in addition, gives the option to distribute numerous replicas of this file, which is especially relevant for deployment on multiple instances. Amazon Lambda Amazon Lambda is a standalone product whose SnapStart feature is built on a similar snapshot-and-restore idea and supports the CRaC runtime hooks API. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning, automatic scaling, and logging. Lambdas can be very convenient for your development goals, but they are also more expensive and less effective compared to running your own JVMs. The Effectiveness and Your Runtime Sustainability The slow startup problem impacts the overall performance of your runtime, and to make your application sustainable and performant, you need to use one of these solutions. Among those stated above, the CRaC solution is the most popular in the Java community today. CRaC, just like Project Leyden, is targeted at solving the issue of slow startup. We cannot evaluate and test Leyden's results fully yet. The project introduced Class Data Sharing plus AOT "on steroids," which looks very promising for delivering faster startup on the JVM. However, there are no ready-made solutions that can be deployed with Java yet. The advantage of the CRaC feature is that it is already available and spreading quickly. Today, you can get an OpenJDK runtime and even containers that support the CRaC API. These solutions are ready to install and allow immediate, significant improvements. OpenJDK runtimes and small containers with CRaC support will be especially relevant for Spring developers. Spring announced CRaC feature support in 2023, and their recommended runtime is Liberica JDK, which delivers a runtime version with CRaC. It should be noted that Native Image technology is also highly relevant for Spring users seeking faster startup of their applications. Native images can run with a smaller memory footprint and do not require a Java Virtual Machine for deployment. However, GraalVM requires individual research given the specifics of your Java application, and it will not always be suitable for resolving the issue. In the case of Amazon Lambdas, you should consider the costs of this product and its effectiveness, as it might ultimately deliver an extra financial burden. Its main advantage is convenience. The key CRaC advantage today is its availability and ease of use, combined with an instant effect on application performance and cloud costs. CRaC solves the problem immediately. An OpenJDK runtime with support for Coordinated Restore at Checkpoint advances your application with a feature to quickly create and restore images of a running application, reducing the startup and warmup times from minutes to milliseconds. Enhancing your application with Linux-based containers supported with CRaC strengthens its performance even further. CRaC lowers the load on the processor and memory at application startup, reducing cloud costs and improving application performance and sustainability.
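To make the workflow concrete, here is a minimal sketch of the CRaC checkpoint/restore cycle, assuming a CRaC-enabled JDK such as Liberica JDK with CRaC; the class name and checkpoint path are illustrative.

Java
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class CracDemo implements Resource {
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Close sockets, files, and pools that cannot survive a snapshot
        System.out.println("Checkpoint about to be taken");
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Reopen connections after the process is restored from the snapshot
        System.out.println("Restored from checkpoint");
    }

    public static void main(String[] args) throws Exception {
        CracDemo app = new CracDemo();
        Core.getGlobalContext().register(app); // receive checkpoint/restore notifications
        Thread.sleep(Long.MAX_VALUE);          // stands in for real application work
    }
}

Shell
# Start the app with a checkpoint directory, trigger a checkpoint, then restore
java -XX:CRaCCheckpointTo=/tmp/crac-demo -cp . CracDemo &
jcmd CracDemo JDK.checkpoint
java -XX:CRaCRestoreFrom=/tmp/crac-demo

The restore command starts from the snapshot in milliseconds, skipping the startup and warmup phases described above.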
What Is Multi-Tenancy? Multi-tenancy enables users to share cluster infrastructure among: Multiple teams within the organization Multiple customers of the organization Multiple environments of the application Shared clusters save costs and simplify administration. Security and isolation are key factors to consider when cluster resources are to be shared. Two prominent isolation models to achieve multi-tenancy are the hard and soft tenancy models. The key difference between these models lies in the level of isolation provided between tenants. Soft tenancy has a lower level of isolation and uses mechanisms like namespaces, quotas, and limits to restrict tenant access to resources and prevent tenants from interfering with each other, while hard tenancy provides stronger isolation, often involving separate clusters or virtual machines for each tenant with minimal shared resources. Kubernetes Native Services in Multi-Tenant Implementations Kubernetes has a built-in namespace model to create logical partitions of the cluster as isolated slices. Though basic levels of tenancy can be achieved, using namespaces has some limitations: Implementing advanced multi-tenancy scenarios, like Hierarchical Namespaces (HNS) or exposing Container as a Service (CaaS), becomes complicated because of the flat structure of Kubernetes namespaces. Namespaces have no common concept of ownership. Tracking and administration challenges persist if a team controls multiple namespaces. Enforcing resource quotas and limits fairly across all tenants requires additional effort. Only highly privileged users can create namespaces. This means that whenever a team wants a new namespace, they must raise a ticket to the cluster administrator. While this is probably acceptable for small organizations, it generates unnecessary toil as the organization grows. To solve this problem, the Kubernetes community provides the Hierarchical Namespace Controller (HNC), which allows the user to organize namespaces into hierarchies. Namespaces are organized in a tree structure, where child namespaces inherit resources and policies from parent namespaces. While HNC supports a soft-tenancy approach leveraging existing namespaces, it is a newer project still under incubation in the Kubernetes community. Other widely used projects that provide similar capabilities are Capsule, Rafay, Kiosk, etc. In this article series, we will discuss implementing multi-tenant solutions using the Capsule framework. Capsule is a commercially supported open-source project that implements multi-tenancy on top of a single shared control plane: rather than giving each tenant its own API server and etcd instance, it groups namespaces into a lightweight tenant abstraction and enforces isolation through policies. Capsule is one of the platforms recommended by the Kubernetes community for multi-tenancy. Major components of the Capsule framework include: Capsule controller: Aggregates multiple namespaces in a lightweight abstraction called Tenant. Capsule policy engine: Achieves tenant isolation through the various network and security policies, resource quotas, limit ranges, RBAC, and other policies defined at the tenant level. A user who owns the tenant is called a Tenant Owner. There is a small contrast between the roles of a tenant owner and a namespace administrator; the cluster admin, the tenant owner, and the namespace administrator each carry distinct roles and responsibilities.
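For reference, here is what plain-namespace soft tenancy looks like before a framework like Capsule is introduced; a minimal sketch in which the namespace name and quota values are illustrative assumptions.

YAML
# One namespace per tenant, with a ResourceQuota capping aggregate resource usage
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    pods: "20"
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi

Every new namespace and quota change still flows through the cluster administrator, which is exactly the toil that tenant-level abstractions like Capsule aim to remove.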
Install Capsule Framework We will use an AWS EKS cluster to perform the exercise. This article assumes you have already created an EKS cluster "eks-cluster1" and that the following software is already installed on your local machine: AWS CLI (version 2), kubectl (v1.21), curl (8.1.2), Helm (3.8.2), and Go (v1.20.6). Capsule can be installed in the two ways listed below. Using the YAML Installer PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/master/config/install.yaml If you face any error in applying the YAML file, re-running the same command should fix the problem. If you see the status of the pod as "ImagePullBackOff" or "ErrImagePull," delete the pod of the deployment (not the deployment itself). Using Helm As a cluster admin or root user, run the following commands to install using Helm. PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 helm repo add clastix https://clastix.github.io/charts helm install capsule clastix/capsule -n capsule-system --create-namespace Verify Capsule Installation What gets installed with the Capsule framework: Namespace: capsule-system Deployments in Namespace: capsule-controller-manager Services Exposed: capsule-controller-manager-metrics-service capsule-webhook-service Secrets in Namespace: capsule-ca capsule-tls Webhooks: In Kubernetes, webhooks are a mechanism for external services to interact with the Kubernetes API server during the lifecycle of API requests. They act like HTTP callbacks, triggered at specific points in the request flow. This allows external services to perform validations or modifications on resources before they are persisted in the cluster. There are two main types of webhooks used in Kubernetes for admission control: Mutating Admission Webhooks and Validating Admission Webhooks. The following webhooks are installed: capsule-mutating-webhook-configuration capsule-validating-webhook-configuration Custom Resource Definitions (CRDs): CRDs allow the user to extend the API and introduce new types of resources beyond the built-in ones. Imagine them as blueprints for creating your own custom resources that can be managed alongside familiar resources like Deployments and Pods. The CRDs below are installed: capsuleconfigurations.capsule.clastix.io globaltenantresources.capsule.clastix.io tenantresources.capsule.clastix.io tenants.capsule.clastix.io Cluster Roles capsule-namespace-deleter capsule-namespace-provisioner Cluster Role Bindings capsule-manager-rolebinding capsule-proxy-rolebinding Follow the below steps to check if Capsule is installed properly: Log in as a root user or cluster administrator and run the following commands; the output should list the 'capsule-system' namespace. PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 kubectl get ns Run the below commands to see the Capsule-related components. PowerShell kubectl -n capsule-system get deployments kubectl -n capsule-system get svc kubectl get mutatingwebhookconfigurations kubectl get validatingwebhookconfigurations List the Capsule CRDs that were installed. PowerShell kubectl get crds If any of the CRDs are missing, apply the respective kubectl command mentioned below. Please note the Capsule version in the URL; adjust it according to the version you are installing.
PowerShell kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/globaltenantresources-crd.yaml kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/tenant-crd.yaml kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/tenantresources-crd.yaml View the cluster roles and cluster role bindings by running the below commands. PowerShell kubectl get clusterrolebindings kubectl get clusterroles Verify the resource utilization of the framework. PowerShell kubectl -n capsule-system get pods kubectl top pod <pod-name> -n capsule-system --containers The Capsule framework creates one pod replica. The CPU (cores) should be around 3m and the memory (bytes) around 26Mi. Verify the tenants available by running the below command as cluster admin. The result should be "No resources found." PowerShell kubectl get tenants Summary In this part, we have covered what multi-tenancy is, the different types of tenant isolation models, the challenges with Kubernetes native services, and how to install the Capsule framework on AWS EKS. In the next part, we will dive deeper into creating tenants and policy management.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices. A recent conversation with a fellow staff engineer at a Top 20 technology company revealed that their underlying infrastructure is self-managed and does not leverage cloud-native infrastructure offered by major providers like Amazon, Google, or Microsoft. Hearing this information took me a minute to comprehend given how this conflicts with my core focus on leveraging frameworks, products, and services for everything that doesn't impact intellectual property value. While I understand the pride of a Top 20 technology company not wanting to contribute to the success of another leading technology company, I began to wonder just how successful they could be if they utilized a cloud-native approach. That also made me wonder how many other companies have yet to adopt a cloud-native approach… and the impact it is having on their APIs. Why Cloud? Why Now? For the last 10 years, I have been focused on delivering cloud-native API services for my projects. While cloud adoption continues to gain momentum, a decent percentage of corporations and technology providers still utilize traditional on-premises designs. According to The Cloud in 2021: Adoption Continues report by O'Reilly Media, Figure 1 provides a summary of the state of cloud adoption in December 2021. Figure 1. Cloud technology usage Image adapted from The Cloud in 2021: Adoption Continues, O'Reilly Media Since the total percentages noted in Figure 1 exceed 100%, the underlying assumption is that it is common for respondents to maintain both a cloud and on-premises design. However, for those who are late to enter the cloud native game, I wanted to touch on some common benefits that are recognized with cloud adoption: Focus on delivering or enhancing laser-focused APIs — stop worrying about and managing on-premises infrastructure. Scale your APIs up (and down) as needed to match demand — this is a primary use case for cloud adoption. Reduce risk by expanding your API presence — leverage availability zones, regions, and countries. Describe the supporting API infrastructure as code (IaC) — faster recovery and expandability into new target locations. Making the transition toward cloud native has become easier than ever, with the major providers offering free or discounted trial periods. Additionally, smaller platform-as-a-service (PaaS) providers like Heroku and Render provide solutions that allow teams to focus on their products and services and not worry about the underlying infrastructure design. The Cloud Native Impact on Your API Since this Trend Report is focused on modern API management, I wanted to focus on a few of the benefits that cloud native can have on APIs. Availability and Latency Objectives When providing APIs for your consumers to consume, the concept of service-level agreements (SLAs) is a common onboarding discussion topic. This is basically where expectations are put into easy-to-understand wording that becomes a binding contract between the API provider and the consumer. Failure to meet these expectations can result in fees and, in some cases, legal action. API service providers often take things a step further by establishing service-level objectives (SLOs) that are even more stringent. The goal here is to establish monitors and alerts to remediate issues before they breach contractual SLAs. 
But what happens when the SLOs and SLAs become difficult to meet? This is where the primary cloud-native use case can assist. If the increase in latency is due to hardware limitations, the service can be scaled vertically (by upgrading the hardware) or horizontally (by adding more instances). If the increase in latency is driven by geographical location, cloud providers can remedy the scenario by introducing service instances in regions closer to the consumer.

API Management

As your API infrastructure expands, a cloud-native design provides the necessary tooling to ease supportability and manageability efforts. From an infrastructure perspective, the underlying definition of the service is expressed using an IaC approach, allowing the service to be defined in a single location. As updates are made to that base design, the changes can be rolled out to each target service instance, avoiding any drift between instances. From an API management perspective, cloud-native providers include the tooling to manage APIs from a usage perspective. Here, API keys can be established, offering the ability to impose thresholds on the requests that can be made or on features that align with service subscription levels.

Cloud Native !== Utopia

While APIs flourish in cloud-native implementations, it is important to recognize that a cloud-native approach is not without its own set of challenges.

Cloud Cost Management

CloudZero's The State of Cloud Cost Intelligence 2022 report concluded that only 40% of respondents indicated their cloud costs were at an expected level, as noted in Figure 2.

Figure 2. Cloud native cost realities (image adapted from The State of Cloud Cost Intelligence, CloudZero)

This means that 60% of respondents are dealing with higher-than-expected cloud costs, which ultimately impacts an organization's ability to meet planned objectives. Cloud-native overspending can often be remediated by adopting the following strategies:

- Require team-based tags or cloud accounts to help understand spending at a finer grain.
- Focus on storage buckets and database backups to understand whether the cost is in line with the value.
- Engage a cloud business partner that specializes in cloud spending analysis.

Account Takeover

The concept of accounts becoming "hacked" is prevalent in social media. At times, I feel like my social media feed contains more "my account was hacked" messages than the casual updates I was tuning in to read. Believe it or not, account takeover is becoming a common fear for cloud-native adopters. Imagine starting your day only to realize you no longer have access to any of your cloud-native services. Soon thereafter, your customers begin to flood your support lines asking what is going on… and where the data they expected to see with each API call has gone. Another potential consequence is that the APIs are shut down completely, forcing customers to seek out competing APIs. Remember, your account protection is only as strong as your weakest link. Employ everything possible to protect your account and move beyond simple username-plus-password protection (for example, by enforcing multi-factor authentication).

Disaster Recovery

It is also important to recognize that cloud native is not a replacement for maintaining a strong disaster recovery posture. A sketch of the IaC side of this follows the list below.

- Understand the impact of availability zone and region-wide outages — both are expected to happen.
- Plan to implement immutable backups — avoid relying on traditional backups and snapshots.
- Leverage IaC to establish all aspects of cloud native — and test it often.
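As a sketch of what "the service defined in a single location" can look like in practice, the following AWS CDK snippet in Java pulls together three of the threads above: the API definition as IaC, a throttled usage plan that backs API keys, and a team tag for finer-grained cost attribution. All names, limits, and the mock integration are hypothetical, and the snippet assumes the aws-cdk-lib Java bindings; treat it as an illustration rather than a drop-in deployment.

Java
import software.amazon.awscdk.App;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.Tags;
import software.amazon.awscdk.services.apigateway.MockIntegration;
import software.amazon.awscdk.services.apigateway.RestApi;
import software.amazon.awscdk.services.apigateway.ThrottleSettings;
import software.amazon.awscdk.services.apigateway.UsagePlanProps;

// Illustrative CDK stack: the API's entire definition lives in one place,
// so every deployed instance is stamped out from the same source, avoiding drift.
public class ApiStack extends Stack {

    public ApiStack(final App scope, final String id) {
        super(scope, id);

        // The API itself; a mock integration stands in for real backends.
        RestApi api = new RestApi(this, "NotificationsApi");
        api.getRoot().addMethod("GET", new MockIntegration());

        // API management: keys and thresholds that align with subscription levels.
        api.addUsagePlan("BasicPlan", UsagePlanProps.builder()
                .name("basic")
                .throttle(ThrottleSettings.builder()
                        .rateLimit(100) // steady-state requests per second
                        .burstLimit(20) // short-burst allowance
                        .build())
                .build());
    }

    public static void main(final String[] args) {
        App app = new App();
        new ApiStack(app, "ApiStack");
        // Team-based tag to understand spending at a finer grain.
        Tags.of(app).add("team", "client-communications");
        app.synth();
    }
}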
Alternative Flows Exist

While a cloud-native approach provides an excellent landscape to help your business and partnerships succeed, some use cases present themselves as alternative flows to cloud-native adoption:

- Regulatory requirements can disqualify a given service from being a candidate for cloud-native adoption.
- Point-of-presence requirements can also become a blocker when the closest cloud-native location is not close enough to meet the established SLAs and SLOs.

On the Other Side of API Cloud Adoption

By adopting a cloud-native approach, it is possible to extend an API across multiple availability zones and geographical regions within a given point of presence.

Figure 3. Multi-region cloud native adoption

In Figure 3, the API service runs in three different geographical regions. Additionally, each region contains an API service instance running in three different availability zones — each with its own network and power source. In this example, there are nine distinct instances running across the United States. By introducing a global common name, consumers always receive a service response from the least-latent available service instance. This approach allows entire regions to be taken offline for disaster recovery validation without any interruption of service at the consumer level.

Conclusion

Readers familiar with my work may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional:

Focus your time on delivering features/functionality that extend the value of your intellectual property. Leverage frameworks, products, and services for everything else. —John Vester

When I think about my conversation with the staff engineer at the Top 20 tech company, I wonder how much more successful his team would be without having to worry about managing their on-premises infrastructure. While the other side of cloud native is not without challenges, it does adhere to my mission statement. As a result, the projects I have worked on for the last 10 years have been able to remain focused on meeting the needs of API consumers while staying in line with corporate objectives. From an API perspective, cloud native offers additional ways to adhere to my mission statement by describing everything related to the service using IaC and leveraging built-in tooling to manage the APIs across different availability zones and regions.

Have a really great day!

This is an excerpt from DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices.