Etsy Engineering | Code as Craft
The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.
One of our Guiding Principles at Etsy is that we "commit to our craft." This means we have a culture of learning, in which we're constantly looking for opportunities to improve and learn, adopt industry best practices, and share our findings with our colleagues and our community. As part of that process, Etsy recently adopted Jetpack Compose – Android's modern toolkit for defining native UIs – as our preferred means of building our Android app. The process of adoption consisted of a gradual expansion in the size and complexity of features built using Compose, eventually culminating in a full rewrite of one of the primary screens in the app. The results of that rewrite gave us the confidence to recommend Compose as the primary tool for our Android engineers to build UIs going forward.

Adoption

Our engineers are always investigating the latest industry trends and technologies, but in this case a more structured approach was warranted, given how central a UI toolkit is to the development process. Several engineers on the Android team were assigned to study the existing Compose documentation and examples we had used in prior builds, and then create a short curriculum based on what they learned. Over several months, the team held multiple information sessions with the entire Android group, showing how to use Compose to build simple versions of some of our real app screens.

Part of our in-house curriculum for learning Jetpack Compose via small modules. Each module built upon the previous one to create more complex versions of various features in our real app.

Next, our Design Systems team started creating Compose versions of our internal UI toolkit components, with the goal of having a complete Compose implementation of our design system before major adoption. Compose is designed for interoperability with our existing toolkit, XML Views, providing an uncomplicated migration path that let us start using these new toolkit components in our existing XML Views with minimal disruption. This was our first chance to validate that the performance of Compose would be as good as or better than our existing toolkit components. It also gave the wider Android community at Etsy a chance to start using Compose in their day-to-day work and get comfortable with the new patterns Compose introduced.

A partial list of the design system components our team was able to make available in Compose.

Our Design Systems team also made heavy use of one of Compose's most powerful features: Previews. Compose Previews allow a developer to visualize Composables in as many configurations as they want using arbitrary test data, all without having to run the app on a device. Every time the team made a change to a Design Systems Composable, they could validate the effect in a wide range of scenarios.
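As an illustration, a minimal Preview sketch might look like the following; the DesignSystemButton composable and its parameters here are hypothetical stand-ins rather than Etsy's actual design-system API.

import androidx.compose.foundation.layout.padding
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.tooling.preview.Preview
import androidx.compose.ui.unit.dp

// Hypothetical design-system component: a thin wrapper around Material's Button.
@Composable
fun DesignSystemButton(text: String, enabled: Boolean = true, onClick: () -> Unit) {
    Button(onClick = onClick, enabled = enabled, modifier = Modifier.padding(8.dp)) {
        Text(text)
    }
}

// Each @Preview renders the composable with arbitrary test data, without running the app.
@Preview(showBackground = true)
@Composable
fun DesignSystemButtonPreview() {
    DesignSystemButton(text = "Add to cart", onClick = {})
}

@Preview(showBackground = true, fontScale = 1.5f)
@Composable
fun DesignSystemButtonDisabledLargeFontPreview() {
    DesignSystemButton(text = "Add to cart", enabled = false, onClick = {})
}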
After a few months of building and adopting toolkit components in Compose, our team felt it was time for a more significant challenge: rebuilding an entire screen. To avoid inadvertently causing a disruption for buyers or sellers on Etsy, we chose a heavily used screen that is only available in our internal development builds. This step exposed us to a much wider scope of concerns: navigation, system UI, data fetching from our API using coroutines, and the orchestration of multiple Compose components interacting with each other. Using Kotlin Flows, we worked out how to structure our business and UI logic around a unidirectional data flow, a key unlock for future integration of Compose with Macramé, our standard architecture for use across all screens in the Etsy app.

With a full internal screen under our belts, it was time to put Compose in front of real users. A few complex bottom sheets were the next pieces of our app to get the Compose treatment. For the first time, we exposed a major part of our UI, now fully written in Compose, to buyers and sellers on Etsy. We also paired a simple version of our Macramé architecture with these bottom sheets to prove that the two were compatible.

A bottom sheet fully built with Compose, hosted inside a screen built using Views.

After successfully rolling out bottom sheets using Compose, we saw an opportunity to adopt Compose on a larger scale in the Shop screen. The existing Shop screen code was confusing to follow and very difficult to run experiments on, limiting our ability to help sellers improve their virtual storefronts. Compose and Macramé held the promise of addressing all of these concerns.

The Shop screen, fully built using Compose.

In just around three months, our small team completed the rebuild. Our first order of business was to run an A/B experiment on the Shop screen to compare old vs. new. The results gave Compose even better marks than we had hoped for. Initial screen rendering time improved by 5%, and subjective interactions with the Shop screen, like taps and scrolls, were quicker and more fluid. User analytics showed the new screen improved conversion rate, add-to-cart actions, checkout starts, shop favoriting, listing views, and more, meaning these changes made a tangible, positive impact for our sellers.

For the engineers tasked with coding the Shop screen, the results were just as impressive. An internal survey of engineers who had worked with the Shop screen before the rewrite showed a significant improvement in overall developer satisfaction. Building features required fewer lines of code, our respondents told us, and thanks to the Macramé architecture, testing was much easier, which enabled us to greatly increase test coverage of business logic. Similar to what we learned during the development of our Design System components, Compose Previews were called out as a superpower for covering edge cases, and engineers said they were excited to work in a codebase that now featured a modern toolkit.

Learnings

We've learned quite a lot about Compose on our path to adopting it:

- Because of the unidirectional data flow of our Macramé architecture and stateless components built with Compose, state is decoupled from the UI, and business logic is isolated and testable. The combination of Macramé and Compose has become the standard way we build features for our app.
- Colocation of layout and display logic allows for much easier manipulation of spacing, margins, and padding when working with complex display logic. Dynamic spacing is extremely difficult to do with XML layouts alone, and requires code in separate files to keep it all in sync.
- Creating previews of all possible Compose states using mock data has eliminated a large source of rework, bugs, and bad experiences for our buyers.
- Our team found it easier to build lazy-style lists in Compose than to manage all the pieces involved in using RecyclerView, especially for horizontal lazy lists.
- Interoperability between Compose and Views in both directions enabled a gradual adoption of Compose.
- Animation of Composables can be triggered automatically by data changes, with no extra code to start and stop the animations properly.

While no individual tool is perfect, we're excited about the opportunities and efficiencies Compose has unlocked for our teams. As with any new technology, there's a learning curve and some bumps along the way. One issue we found was in a third-party library we use. While the library has support for Compose, at the time of the Shop screen conversion that support was still in alpha. After extensive testing, we decided to move forward with the alpha version, but an incompatibility could have forced us to find an alternative solution. Another learning is that LazyRows and LazyColumns, while similar in some respects to RecyclerView, come with their own specific way of handling keys and item reuse. This new lazy list paradigm has occasionally tripped us up and resulted in some unexpected behavior.

Conclusion

We're thrilled with our team's progress and outcomes in adopting this new toolkit. We've now fully rewritten several key UI screens using Compose, including Listing, Favorites, Search, and Cart, with more to come. Compose has given us a set of tools that lets us be more productive when delivering new features to our buyers, and its gradual rollout in our codebase is a tangible example of the Etsy team's commitment to our craft.
At Etsy, we're focused on elevating the best of our marketplace to help creative entrepreneurs grow their businesses. We continue to invest in making Etsy a safe and trusted place to shop, so sellers' extraordinary items can shine. Today, there are more than 100 million unique items available for sale on our marketplace, and our vibrant global community is made up of over 90 million active buyers and 7 million active sellers, the majority of whom are women and sole owners of their creative businesses.

To support this growing community, our Trust & Safety team of Product, Engineering, Data, and Operations experts is dedicated to keeping Etsy's marketplace safe by enforcing our policies and removing potentially violating or infringing items at scale. For that, we make use of community reporting and automated controls for removing this potentially violating content. To continue to scale and enhance our detections through innovative products and technologies, we also leverage state-of-the-art machine learning solutions, which we have already used to identify and remove over 100,000 violations on our marketplace during the past year. In this article, we describe one of our systems for detecting policy violations. It relies on supervised learning, a family of algorithms that uses labeled data to train models to recognize patterns and predict outcomes.

Datasets

In machine learning, data is one of the variables we have the most control over. Extracting data and building trustworthy datasets is a crucial step in any learning problem. In Trust & Safety, we are determined to keep our marketplace and users safe by identifying violations of our policies. For that, we log and annotate potential violations, which lets us collect datasets reliably. In our approach, these annotations are translated into positives (listings that were indeed violations) and negatives (listings found not to be offending for a given policy). The latter are also known as hard negatives: they sit close to our positives and help the model learn how to separate the two sets. In addition, we add easy (or soft) negatives by adding random items to our datasets. This gives our models further general examples of listings that do not violate any policy, which is the majority of our marketplace, and improves generalizability. The number of easy negatives to add is a hyper-parameter to tune: more of them means higher training time and fewer positive representations.

For each training example, we extract multimodal signals, both text and imagery, from our listings. Then we split our datasets by time using progressive evaluation, to mimic our production use case and learn to adapt to recent behavior. These splits are: training, used to train our models and learn patterns; validation, used to fine-tune training hyper-parameters such as the learning rate and to evaluate over-fitting; and test, used to report our metrics in an unbiased manner.

Model Architecture

After the usual transformations and extraction of a set of offline features from our datasets, we are all set to start training our machine learning model. The goal is to predict whether a given listing violates any of a predefined set of policies or, in contrast, violates none of them. For that, we added a neutral class representing "no violation," into which the majority of our listings fall. This is a typical design pattern for these types of problems.
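Schematically (this notation is ours, not from the original post), the setup can be viewed as a (K+1)-way classifier over K policy classes plus the neutral class:

\[
P(y = k \mid x) = \mathrm{softmax}\big(f_\theta(x)\big)_k, \qquad k \in \{0, 1, \dots, K\},
\]

where $x$ is the multimodal input for a listing and $k = 0$ is the neutral "no violation" class. A listing would be surfaced as a potential violation of policy $k > 0$ when, for example, $P(y = k \mid x)$ exceeds a per-class threshold; the exact decision rule is an assumption here.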
Our model architecture includes a text encoder and an image encoder to learn representations (aka embeddings) for each modality.

Our text encoder currently employs a BERT-based architecture to extract contextual representations of our text inputs. To keep compute time down, we leverage ALBERT, a lighter BERT with roughly 90% fewer parameters, since its transformer blocks share parameters. Our initial lightweight representation used an in-house model trained for Search use cases, which allowed us to quickly start iterating and learning from this problem.

Our image encoder currently employs EfficientNet, a very efficient and accurate Convolutional Neural Network (CNN). Our initial lightweight representation used an in-house CNN model for category classification. We are experimenting with transformer-based architectures, similar to our text encoders, using vision transformers, but their performance has not been significantly better so far.

Inspired by EmbraceNet, our architecture then learns more constrained representations for the text and image embeddings separately, before they are concatenated to form a single multimodal representation. This is sent to a final softmax activation that maps logits to probabilities for our internal use. In addition, to address the imbalanced nature of this problem, we leverage focal loss, which penalizes hard, misclassified examples more heavily. Figure 1 shows our model architecture, with late concatenation of our text and image encoders and final output probabilities on an example.

Figure 1. Model architecture. Image obtained from @charlesdeluvio on Unsplash.

Model Evaluation

First, we experimented and iterated by training our model offline. To evaluate its performance, we established benchmarks based on the business goal of minimizing the impact on well-intentioned sellers while successfully detecting offending listings on the platform. This results in a typical trade-off between precision and recall: precision is the fraction of correct predictions over all predictions made, and recall is the fraction of actual violations that we catch. However, we faced the challenge that true recall is not possible to compute, as it's not feasible to manually review the millions of new listings per day, so we settle for a proxy for recall based on what has been annotated.

Once we had a viable candidate to test in production, we deployed our model as an endpoint and built a service, callable via an API, that performs pre-processing and post-processing steps before and after the call to that endpoint. Then we ran an A/B test to measure its performance in production using a canary release approach: slowly rolling out our new detection system to a small percentage of traffic, which we kept increasing as we validated an improvement in our metrics and the absence of unexpected compute overload. Afterwards, we kept iterating: every time we had a promising offline candidate (the challenger) that improved our offline performance metrics, we A/B tested it against our current model (the champion). We designed guidelines for model promotion to increase our metrics and our policy coverage. Now, we monitor our model predictions and trigger re-training when performance degrades.
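For reference, the precision/recall trade-off described in this section corresponds to the standard definitions (our notation):

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},
\]

where true positives (TP) are correctly flagged violations, false positives (FP) are non-violating listings that were flagged, and false negatives (FN) are violations that were missed. Because FN cannot be measured exhaustively over millions of daily listings, the recall tracked in practice is the proxy mentioned above, computed only over annotated listings.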
Results

Our supervised learning system has been continually learning as we train frequently, run experiments with new datasets and model architectures, A/B test them, and deploy them in production. We have also added more violation types as additional classes to our model. As a result, we have identified and removed more than 100,000 violations using these methodologies, in addition to the other tools and services that continue to detect and remove violations. This is one of several approaches we use to identify potentially offending content; others include explicitly using policy information and leveraging the latest in Large Language Models (LLMs) and Generative AI. Stay tuned!

"To infinity and beyond!" – Buzz Lightyear, Toy Story
In 2020, Etsy concluded its migration from an on-premise data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, dedicated to maximizing the organization's cloud value by fostering collaborations within and outside the organization, particularly with our cloud providers. Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.

Forecast Methodology

A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:

- Core Infrastructure: Includes the costs of infrastructure and services essential for operating the Etsy.com website.
- Machine Learning & Product Enablement: Encompasses costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.
- Data Enablement: Encompasses costs related to shared platforms for data collection, data processing, and workflow orchestration.
- Dev: Encompasses non-production resources.

The FinOps forecasting model relies on a trailing Cost Per Visit (CPV) metric. While CPV provides valuable insight into changes, it's not without limitations:

- A meaningful portion of web traffic to Etsy involves non-human activity, like web crawlers, which isn't accounted for in CPV.
- Some services have weaker correlations to user visits. Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments, or big data workflows.
- A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.

Periodically, we run regression tests to validate whether CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value), and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.

Monitoring and Readouts

We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week and month-over-month growth. Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances. While we track costs at the organization/cost-center level, we don't charge costs back to the teams. This both lowers our overhead and, more importantly, provides flexibility to make tradeoffs that enable Engineering velocity.
Cost Increase Detection & Mitigation

Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this, we follow these steps:

- Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e., variance in resource utilization).
- Collaboration: Engage relevant teams, sharing insights and seeking additional context.
- Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed.
- Mitigation: Unexpected increases undergo joint troubleshooting, where we outline and assign action items to owners until issues are resolved.
- Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast, once confirmed or resolved with teams and engineering leadership.

Cost Optimization Initiatives

Another side of maintaining a healthy CPV is cost optimization, offsetting increases from product launches. Ideas for cost savings come out of collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of a cost optimization rather than solely on savings, recognizing that inefficiencies often signal larger problems. Based on effort-vs.-value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakout of some of the major wins we have had in the last year or so:

- GCS Storage Optimization: In 2023 we stood up a squad focused on optimizing Etsy's use of GCS, as it has been one of our largest growth areas over the past few years. The squad delivered a number of improvements, including better monitoring of usage, automation features for data engineers, implementation of TTLs that match data access patterns and business needs, and the adoption of intelligent tiering. Thanks to these efforts, Etsy's GCS usage is now lower than it was two years ago.
- Compute Optimization: We migrated over 90% of Etsy's traffic-serving infrastructure to the latest and greatest CPU platform. This improved our serving latency while reducing cost.
- Increased Automation for Model Deployment: To improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for newly deployed models, which also ended up saving us money.
- Network Compression: Enabling network compression between our high-throughput services both improved the latency profile and drastically reduced networking cost.

What's Next

While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.
Etsy features a diverse marketplace of unique handmade and vintage items. It's a visually diverse marketplace as well, and computer vision has become increasingly important to Etsy as a way of enhancing our users' shopping experience. We've developed applications like visual search and visually similar recommendations that can offer buyers an additional path to find what they're looking for, powered by machine learning models that encode images as vector representations.

Figure 1. Visual representations power applications such as visual search and visually similar recommendations.

Learning expressive representations through deep neural networks, and being able to leverage them in downstream tasks at scale, is a costly technical challenge. The infrastructure required to train and serve large models is expensive, as is the iterative process that refines them and optimizes their performance. The solution is often to train deep learning architectures offline and use the pre-computed visual representations in downstream tasks served online. (We wrote about this in a previous blog post on personalization from real-time sequences and diversity of representations.) In any application where a query image representation is inferred online, low-latency, memory-aware models are essential. Efficiency becomes paramount to the success of these models in the product. We can think about efficiency in deep learning along multiple axes: efficiency in model architecture, model training, evaluation, and serving.

Model Architecture

The EfficientNet family of models features a convolutional neural network architecture that uniformly optimizes for network width, depth, and resolution using a fixed set of coefficients. By allowing practitioners to start from a limited resource budget and scale up for better accuracy as more resources are available, EfficientNet provides a great starting point for visual representations. We began our trials with EfficientNetB0, the smallest model in the EfficientNet family. We saw good performance and low latency with this model, but the industry and research community have touted Vision Transformers (ViT) as having better representations, so we decided to give that a try.

Transformers lack the spatial inductive biases of CNNs, but they outperform CNNs when trained on large enough datasets and may be more robust to domain shifts. ViT decomposes the image into a sequence of patches (16x16, for example) and applies a transformer architecture to incorporate more global information. However, due to the massive number of parameters and the compute-heavy attention mechanism, ViT-based architectures can be many times slower to train and run inference with than lightweight convolutional networks. Despite the challenges, more efficient ViT architectures have recently begun to emerge, featuring clever pooling, layer dropping, efficient normalization, and efficient attention or hybrid CNN-transformer designs.

We employ EfficientFormer-l3 to take advantage of these ViT improvements. The EfficientFormer architecture achieves efficiency by downsampling across multiple blocks and employing attention only in the last stage. The way we derive an image representation from it differs from the standard vision transformer, where embeddings are extracted from the first token of the output. Instead, we take the output of the last attention block across its eight heads and perform average pooling over the sequence.
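Schematically (our notation, glossing over the details of head concatenation), if $h_1, \dots, h_L \in \mathbb{R}^d$ denote the token outputs of that final attention stage, the image representation is the sequence average

\[
e \;=\; \frac{1}{L} \sum_{i=1}^{L} h_i,
\]

rather than the class-token embedding a standard ViT would use.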
In Figure 2 we illustrate these different attention weights with heat maps overlaid on an image, showing how each of the eight heads learns to focus on a different salient part.

Figure 2. Probing the EfficientFormer-l3 pre-trained visual representations through attention heat maps.

Model Training

Fine-Tuning

With our pre-trained backbones in place, we can gain further efficiencies via fine-tuning. For the EfficientNetB0 CNN, that means replacing the final convolutional layer and attaching a d-dimensional embedding layer followed by m classification heads, where m is the number of tasks. The embedding head consists of a new convolutional layer with the desired final representation dimension, followed by a batch normalization layer, a swish activation, and a global average pooling layer that aggregates the convolutional output into a single vector per example. To train EfficientNetB0, the newly attached layers are trained from scratch for one epoch with the backbone layers frozen, to avoid excessive computation and overfitting. We then unfreeze the top 75 layers of the backbone and fine-tune for nine additional epochs, for efficient learning. At inference time we remove the classification head and extract the output of the pooling layer as the final representation.

To fine-tune the EfficientFormer ViT we stick with the pretraining resolution of 224x224, since the longer sequences that come with the 384x384 resolution recommended for ViT lead to larger training budgets. To extract the embedding we average-pool the last hidden state. Classification heads are then added as with the CNN, with batch normalization swapped for layer normalization.

Multitask Learning

In a previous blog post we described how we built a multitask learning framework to generate visual representations for Etsy's search-by-image experience. The training architecture is shown in Figure 3.

Figure 3. A multitask training architecture for visual representations. The dataset sampler combines examples from an arbitrary number of datasets corresponding to respective classification heads. The embedding is extracted before the classification heads.

Multitask learning is an efficiency inducer. Representations encode commonalities, and they perform well in diverse downstream tasks when they are learned using common attributes as multiple supervision signals. A representation learned through single-task classification against the item's taxonomy, for example, will be unable to capture visual attributes: colors, shapes, materials. We employ four classification tasks: a top-level taxonomy task, with the 15 top-level categories of the Etsy taxonomy tree as labels; a fine-grained taxonomy task, with 1,000 fine-grained leaf-node item categories as labels; a primary color task; and a fine-grained taxonomy task on review photos, where each example is a buyer-uploaded review photo of a purchased item, with 100 labels sampled from fine-grained leaf-node item categories.

We are able to train both EfficientNetB0 and EfficientFormer-l3 on standard 16GB GPUs (we used two P100 GPUs). For comparison, a full-sized ViT requires a larger 40GB GPU such as an A100, which can increase training costs significantly. We provide detailed hyperparameter information for fine-tuning either backbone in our article.
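Schematically, the multitask objective described above sums a per-head classification loss over the shared embedding; this formulation, and in particular the task weights, are our own shorthand rather than details given in the post:

\[
\mathcal{L} \;=\; \sum_{m=1}^{M} \lambda_m \, \mathcal{L}_m\big(g_m(e(x)),\, y_m\big),
\]

where $e(x)$ is the shared image embedding, $g_m$ is the classification head for task $m$ (top-level taxonomy, fine-grained taxonomy, primary color, review-photo taxonomy), $\mathcal{L}_m$ is its classification loss, and $\lambda_m$ its weight.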
Evaluating Visual Representations

We define and implement an evaluation scheme for visual representations to track and guide model training, based on three nearest-neighbor retrieval tasks. After each training epoch, a callback is invoked to compute and log the recall for each retrieval task. Each retrieval dataset is split into two smaller datasets: "queries" and "candidates." The candidates dataset is used to construct a brute-force nearest-neighbor index, and the queries dataset is used to look up the index. The index is constructed on the fly after each epoch to accommodate embeddings changing between training epochs. Each lookup yields K nearest neighbors. We compute Recall@5 and Recall@10 using both historical implicit user interactions (such as visually similar ad clicks) and ground-truth datasets of product photos taken from the same listing ("intra-item"). The recall callbacks can also be used for early stopping of training to enhance efficiency.

The intra-item retrieval evaluation dataset consists of groups of seller-uploaded images of the same item. The query and candidate examples are randomly selected seller-uploaded images of an item. A candidate image is considered a positive example if it is associated with the same item as the query. In the "intra-item with reviews" dataset, the query image is a randomly selected buyer-uploaded review image of an item, with seller-uploaded images providing the candidate examples. The dataset of visually similar ad clicks associates seller-uploaded primary images with primary images of items that have been clicked on the visually similar surface on mobile. Here, a candidate image is considered a positive example for a query image if a user viewing the query image clicked it. Each evaluation dataset contains 15,000 records for building the index and 5,000 query images for the retrieval phase.

We also leverage generative AI for an experimental new evaluation scheme. From ample, multilingual historical text query logs, we build a new retrieval dataset that bridges the semantic gap between text-based queries and clicked image candidates. Text-to-image generative stable diffusion makes the information retrieval process language-agnostic, since an image is worth a thousand (multilingual) words. A stable diffusion model generates high-quality images which become image queries. The candidates are images from clicked items corresponding to the source text query in the logs. One caveat is that the dataset is biased toward the search-by-text production system that produced the logs; only a search-by-image-from-text system would produce truly relevant evaluation logs. The source-candidate image pairs form the new retrieval evaluation dataset, which is then used within a retrieval callback.

Of course, users entering the same text query may have very different ideas in mind of, say, the garment they're looking for. So for each query we generate several images: formally, a random sample of length n from the posterior distribution over all possible images that can be generated from the seed text query. We pre-condition our generation on a uniform "fashion style." In a real-world scenario, both the text-to-image query generation and the image query inference for retrieval happen in real time, which means efficient backbones are necessary. We randomly select one of the n generated images to replace the text query with an image query in the evaluation dataset. This is a hybrid evaluation method: the error inherent in the text-to-image diffusion model generation is encapsulated in the visually similar recommendation error rate.
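All of the retrieval evaluations above are summarized with Recall@K. Schematically (our notation), for a query set $Q$:

\[
\mathrm{Recall@}K \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\text{a positive candidate for } q \text{ appears among its top-}K \text{ retrieved neighbors}\right].
\]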
Future work may include prompt engineering to improve the text query prompt itself, which as input by the user can be short and lacking in detail. Large memory requirements and high inference latency are challenges in using text-to-image generative models at scale. We employ an open-source fast stable diffusion model that uses token merging and float16 inference. Compared to the standard stable diffusion implementation available at the time we built the system, this method speeds up inference by 50% with a 5x reduction in memory consumption, though results depend on the underlying patched model. We can generate 500 images per hour with one T4 GPU (no parallelism) using the patched stable diffusion pipeline; with parallelism we can achieve further speedup.

Figure 4 shows that for the English text query "black bohemian maxi dress with orange floral pattern" the efficient stable diffusion pipeline generates five image query candidates. The generated images include pleasant variations with some detail loss. Interestingly, it is mostly the facial details of the fashion model that are affected, while the garment pattern remains clear. In some cases degradation might prohibit display, but efficient generative technology is being perfected at a fast pace, and prompt engineering helps the generative process as well.

Figure 4. Text-to-image generation using a generative diffusion model, from equivalent queries in English and French.

Efficient Inference and Downstream Tasks

Especially when it comes to latency-sensitive applications like visually similar recommendations and search, efficient inference is paramount: otherwise, we risk loss of impressions and a poor user experience. We can think of inference along two axes: online inference of the image query, and efficient retrieval of the top-k most similar items via approximate nearest neighbors. The dimension of the learned visual representation impacts the efficient retrieval design as well, and the smaller 256-dimensional embedding derived from EfficientNetB0 presents an advantage.

EfficientNetB0 is hard to beat in terms of accuracy-to-latency trade-offs for online inference, with ~5M parameters and around 1.7ms latency on an iPhone 12. EfficientFormer-l3 has ~30M parameters and gets around 2.7ms latency on an iPhone 12 with higher accuracy (MobileViT-XS, for example, scores around 7ms with a third of the accuracy; very large ViTs are not considered, since their latencies are prohibitive). In offline evaluation, the EfficientFormer-l3-derived embedding achieves around a +5% lift in Intra-L Recall@5, +17% in Intra-R Recall@5, and +1.8% in visually similar ad clicks Recall@5.

We performed A/B testing on the EfficientNetB0 multitask variant across visual applications at Etsy with good results. Additionally, the EfficientFormer-l3 visual representations led to a +0.65% lift in CTR, and a similar lift in purchase rate, in a first visually-similar-ads experiment when compared to the production variant of EfficientNetB0. When included in sponsored search downstream rankers, the visual representations led to a +1.26% lift in post-click purchase rate. Including the efficient visual representation in Ads Information Retrieval (AIR), an embedding-based retrieval method used to retrieve similar item ad recommendations, caused an increase in click-recall@100 of 8%. And when we used these representations to compute image similarity and included that similarity directly in the last-pass ranking function, we saw a +6.25% lift in clicks.
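That last image-similarity feature can be as simple as a cosine similarity between the query and candidate embedding vectors. Here is a minimal, hypothetical sketch of such a computation, not Etsy's production code:

import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal dimension.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embeddings must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// e.g. a ranker feature scoring a candidate listing image against the query image:
// val visualScore = cosineSimilarity(queryEmbedding, candidateEmbedding)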
The first use of the EfficientNetB0 visual embeddings was in visually similar ad recommendations on mobile. This led to a +1.92% increase in ad return-on-spend on iOS and a +1.18% increase in post-click purchase rate on Android. The same efficient embedding model backed the first search-by-image shopping experience at Etsy, in which users search using photos taken with their mobile phone's camera and the query image embedding is inferred efficiently online; we discussed this in a previous blog post.

Learning visual representations is of paramount importance in visually rich e-commerce and online fashion recommendations. Learning them efficiently is a challenging goal made possible by advances in the field of efficient deep learning in computer vision. If you'd like a more in-depth discussion of this work, please see our full paper, accepted to the #fashionXrecsys workshop at the RecSys 2023 conference.
Easily the most important and complex screen in the Buy on Etsy Android app is the listing screen, where all the key information about an item for sale in the Etsy marketplace is displayed to buyers. Far from being just a title and description, a price and a few images, over the years the listing screen has come to aggregate ratings and reviews, seller, shipping, and stock information, and has gained a variety of personalization and recommendation features. As information-rich as it is, and as central as it is to the buying experience, for product teams the listing screen is an irresistible place to test out new methods and approaches. In just the last three years, apps teams have run nearly 200 experiments on it, often with multiple teams building and running experiments in parallel.

Eventually, with such a high velocity of experiment and code change, the listing screen started showing signs of stress. Its architecture was inconsistent and not meant to support a codebase expanding so much and so rapidly in size and complexity. Given the relative autonomy of Etsy app development teams, there ended up being a lot of reinventing the wheel, and lots of incompatible patterns getting layered atop one another; in short, the code resembled a giant plate of spaghetti. The main listing Fragment file alone had over 4,000 lines of code in it!

Code that isn't built for testability doesn't test well, and test coverage for the listing screen was low. VERY low. Our legacy architecture made it hard for developers to add tests for business logic, and the tests that did get written were complex and brittle, and often caused continuous integration failures for seemingly unrelated changes. Developers would skip tests when it seemed too costly to write and maintain them; those skipped tests made the codebase harder for new developers to onboard into or work with confidently; and the result was a vicious circle that led to even less test coverage.

Introducing Macramé

We decided that our new architecture for the listing screen, which we've named Macramé, would be based on immutable data propagated through a reactive UI. Reactive frameworks are widely deployed and well understood, and we could see a number of ways that reactivity would help us untangle the spaghetti. We chose to emulate architectures like Spotify's Mobius, molded to fit the shape of Etsy's codebase and its business requirements.

At the core of the architecture is an immutable State object that represents our data model. State for the listing screen is passed to the UI as a single data object via a StateFlow instance; each time a piece of the data model changes, the UI re-renders. Updates to State can be made either from a background thread or from the main UI thread, and using StateFlow ensures that all updates reach the main UI thread. When the data model for a screen is large, as it is for the listing screen, updating the UI from a single object makes things much simpler to test and reason about than if multiple separate models were making changes independently. And that simplicity lets us streamline the rest of the architecture.

When changes are made to the State, the monolithic data model gets transformed into a list of smaller models that represent what will actually be shown to the user, in vertical order on the screen. The code below shows an example of state held in the Buy Box section of the screen, along with its smaller Title sub-component.
data class BuyBox(
    val title: Title,
    val price: Price,
    val saleEndingSoonBadge: SaleEndingSoonBadge,
    val unitPricing: UnitPricing,
    val vatTaxDescription: VatTaxDescription,
    val transparentPricing: TransparentPricing,
    val firstVariation: Variation,
    val secondVariation: Variation,
    val klarnaInfo: KlarnaInfo,
    val freeShipping: FreeShipping,
    val estimatedDelivery: EstimatedDelivery,
    val quantity: Quantity,
    val personalization: Personalization,
    val expressCheckout: ExpressCheckout,
    val cartButton: CartButton,
    val termsAndConditions: TermsAndConditions,
    val ineligibleShipping: IneligibleShipping,
    val lottieNudge: LottieNudge,
    val listingSignalColumns: ListingSignalColumns,
    val shopBanner: ShopBanner,
)

data class Title(
    val text: String,
    val textInAlternateLanguage: String? = null,
    val isExpanded: Boolean = false,
) : ListingUiModel()

In our older architecture, the screen was based on a single scrollable View. All data was bound and rendered during the View's initial layout pass, which created a noticeable pause the first time the screen was loaded. In the new screen, a RecyclerView is backed by a ListAdapter, which allows for asynchronous diffs of the data changes, avoiding the need to rebind portions of the screen that aren't receiving updates. Each of the vertical elements on the screen (title, image gallery, price, etc.) is represented by its own ViewHolder, which binds whichever of the smaller data models the element relies on. In this code, the BuyBox is transformed into a vertical list of ListingUiModels to display in the RecyclerView:

fun BuyBox.toUiModels(): List<ListingUiModel> {
    return listOf(
        price,
        title,
        shopBanner,
        listingSignalColumns,
        unitPricing,
        vatTaxDescription,
        transparentPricing,
        klarnaInfo,
        estimatedDelivery,
        firstVariation,
        secondVariation,
        quantity,
        personalization,
        ineligibleShipping,
        cartButton,
        expressCheckout,
        termsAndConditions,
        lottieNudge,
    )
}

An Event dispatching system handles user actions, which are represented by a sealed Event class. The use of sealed classes for Events, coupled with Kotlin "when" statements mapping Events to Handlers, provides compile-time safety to ensure all of the pieces are in place to handle the Event properly. These Events are fed to a single Dispatcher queue, which is responsible for routing Events to the Handlers that are registered to receive them. Handlers perform a variety of tasks: starting asynchronous network calls, dispatching more Events, dispatching SideEffects, or updating State. We want to make it easy to reason about what Handlers are doing, so our architecture promotes keeping their scope of responsibility as small as possible. Simple Handlers are simple to write tests for, which leads to better test coverage and improved developer confidence. In the example below, a click handler on the listing title sets a State property that tells the UI to display an expanded title:

class TitleClickedHandler constructor() {
    fun handle(state: ListingViewState.Listing): ListingEventResult.StateChange {
        val buyBox = state.buyBox
        return ListingEventResult.StateChange(
            state = state.copy(
                buyBox = buyBox.copy(
                    title = buyBox.title.copy(isExpanded = true)
                )
            )
        )
    }
}

SideEffects are a special type of Event used to represent, typically, one-time operations that need to interact with the UI but aren't considered pure business logic: showing dialogs, logging events, performing navigation, or showing Snackbar messages. SideEffects end up being routed to the Fragment to be handled.
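To make the Event/SideEffect distinction concrete, here is a simplified, hypothetical sketch of what these sealed types might look like; Etsy's real classes are more extensive, but the exhaustive "when" shown below is what gives the compile-time safety described above.

// Hypothetical, trimmed-down Event and SideEffect types, for illustration only.
sealed class ListingEvent {
    object AddToCartClicked : ListingEvent()
    data class TitleClicked(val listingId: Long) : ListingEvent()
    data class CartUpdateSucceeded(val cartCount: Int) : ListingEvent()
    data class CartUpdateFailed(val error: Throwable) : ListingEvent()
}

sealed class ListingSideEffect {
    data class LogAnalyticsEvent(val name: String) : ListingSideEffect()
    data class ShowSnackbar(val message: String) : ListingSideEffect()
    data class Navigate(val destination: String) : ListingSideEffect()
}

// Because ListingEvent is sealed, this "when" expression must be exhaustive: adding a
// new Event without routing it to a Handler becomes a compile-time error.
fun route(event: ListingEvent) = when (event) {
    is ListingEvent.AddToCartClicked -> { /* hand off to an AddToCartClickedHandler */ }
    is ListingEvent.TitleClicked -> { /* hand off to TitleClickedHandler */ }
    is ListingEvent.CartUpdateSucceeded -> { /* a Handler updates State with the new count */ }
    is ListingEvent.CartUpdateFailed -> { /* a Handler dispatches a ShowSnackbar SideEffect */ }
}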
Take the scenario of a user clicking on a listing's Add to Cart button. The Handler for that Event might:

- dispatch a SideEffect to log the button click
- start an asynchronous network call to update the user's cart
- update the State to show a loading indicator while the cart update finishes

While the network call is running on a background thread, the Dispatcher is free to handle other Events that may be in the queue. When the network call completes in the background, a new Event will be dispatched with either a success or a failure result. A different Handler is then responsible for handling the success and failure Events. This diagram illustrates the flow of Events, SideEffects, and State through the architecture:

Figure 1. A flow chart illustrating system components (blue boxes) and how events and state changes (yellow boxes) flow between them.

Results

The rewrite process took five months, with as many as five Android developers working on the project at once. One challenge we faced along the way was keeping the new listing screen up to date with all of the experiments being run on the old listing screen while development was in progress. The team also had to create a suite of tests that could comprehensively cover the diversity of listings available on Etsy, to ensure that we didn't forget or break any features.

With the rewrite complete, the team ran an A/B experiment against the existing listing screen to test both performance and user behavior between the two versions. Though the new listing screen felt qualitatively quicker than the old one, we wanted to understand how users would react to subtle changes in the new experience. We instrumented both the old and the new listing screens to measure performance changes from the refactor.

The new screen performed even better than expected. Time to First Content decreased by 18%, going from 1585 ms down to 1298 ms. This speedup resulted in the average number of listings viewed by buyers increasing 2.4%, add to carts increasing 0.43%, searches increasing 2%, and buyer review photo views increasing 3.3%.

On the developer side, unit test coverage increased from single-digit percentages to a whopping 76% code coverage of business logic classes. This validates our decision to put nearly all business logic into Handler classes, each responsible for handling just a single Event at a time. We built a robust collection of tools for generating test States in a variety of common configurations, so writing unit tests for the Handlers is as simple as generating an input event and validating that the correct State and SideEffects are produced.
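A Handler test in this style might look something like the following sketch; the testListingState fixture is a hypothetical stand-in for the test tooling described above, and the assertion mirrors the TitleClickedHandler example earlier in this post.

import org.junit.Assert.assertTrue
import org.junit.Test

class TitleClickedHandlerTest {

    @Test
    fun `clicking the title expands it`() {
        // Hypothetical fixture that builds a ListingViewState.Listing with a collapsed title.
        val initialState = testListingState(titleExpanded = false)

        val result = TitleClickedHandler().handle(initialState)

        // The Handler should return a StateChange whose title is now expanded.
        assertTrue(result.state.buyBox.title.isExpanded)
    }
}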
Creating any new architecture involves making tradeoffs, and this project was no exception. Macramé is under active development, and we have a few pieces of feedback on our agenda to be addressed:

- There is some amount of boilerplate still needed to correctly wire up a new Event and Handler, and we'd like to make that go away.
- The ability of Handlers to dispatch their own Events sometimes makes debugging complex Handler interactions more difficult than previous formulations of the same business logic.
- On a relatively simple screen, the architecture can feel like overkill.

Still, adding new features correctly to the listing screen is now the easy thing to do. The dual benefit of increasing business metrics while also increasing developer productivity and satisfaction has resulted in the Android team expanding the usage of Macramé to two more key screens in the app (Cart and Shop), both of which completely rewrote their UIs using Jetpack Compose; but those are topics for future Code as Craft posts.
Balancing Engineering Ambition with Product Realism

Introduction

In July of 2023, Etsy's App Updates team, responsible for the Updates feed in Etsy's mobile apps, set off with an ambitious goal: to revamp the Updates tab to become Deals, a home for a shopper's coupons and sales, in time for Cyber Week 2023. The Updates tab had been around for years, and in the course of its evolution had ended up serving multiple purposes. It was a hub for updates about a user's favorite shops and listings, but it was also increasingly a place to start new shopping journeys. Not all updates were created equal, though. The most acted-upon updates in the tab were coupons offered for abandoned cart items, which shoppers loved. We spotted an opportunity to clarify intentions for our users: by refactoring favorite-based updates into the Favorites tab, and (more boldly) by recentering Updates and transforming it into a hub for a buyer's deals.

Technical Opportunity

While investigating the best way to move forward with the Deals implementation, iOS engineers on the team advocated for developing a new tab from the ground up. Although it meant greater initial design and architecture effort, an entirely new tab built on modern patterns would let us avoid relying on Objective C, as well as on internal frameworks like SDL (server-driven layout), which is present in many legacy Etsy app screens, comes with a variety of scalability and performance issues, and is in the process of being phased out. At the same time, we needed a shippable product by October. Black Friday and Cyber Week loomed on the horizon in November, and it would be a missed opportunity, for us and for our users, not to have the Deals tab ready to go. Our ambition to use modern, not-yet-road-tested technologies would have to balance with realism about the needs of the product, and we were conscious of maintaining that balance throughout the course of development.

In comes Swift UI and Tuist!

Two new frameworks were front of mind when starting this project: Swift UI and Tuist. Swift UI provides a clear, declarative framework for UI development, and makes it easy for engineers to break down views into small, reusable components. Maybe Swift UI's biggest benefit is its built-in view previews: in tandem with componentization, it becomes a very straightforward process to build a view out of smaller pieces and preview it at every step of the way. Our team had experimented with Swift UI in the past, but with scopes limited to small views, such as headers. Confident as we were about the framework, we expected that building out a whole screen in Swift UI would present us with some initial hurdles to overcome.

In fact, one hurdle presented itself right away. In a decade-old codebase, not everything is optimized for use with newer technologies. The build times we saw for our Swift UI previews, which were almost long enough to negate the framework's other benefits, testified to that fact. This is where Tuist comes in. Our App Enablement team, which has been hard at work over the past few years modernizing the Etsy codebase, has adopted Tuist as a way of taming the monolith by making it modular. Any engineer at Etsy can declare a Tuist module in their project and start working on it, importing parts of the larger codebase only as they need them. (For more on Etsy's usage of Tuist, check out this article by Mike Simons from the App Enablement team.)
Moving our work for the Deals tab into a Swift-based Tuist module gave us what it took to make a preview-driven development process practical: our previews build nearly instantly, and so long as we're only making changes in our framework, the app recompiles with very little delay.

Figure 1. A view of a goal end state of a modular Etsy codebase, with a first layer of core modules (in blue), and a second layer of client-facing modules that combine to build the Etsy app.

Our architecture

The Deals tab comprises a number of modules for any given Etsy user, including a Deals Just for You module with abandoned cart coupons, and a module that shows a user their favorite listings that are on sale. Since the screen is just a list of modules, the API returns them as an array of typed items with the following structure:

{
  "type": "<module_type>",
  "<module_type>": { ... }
}

Assigning each module a type enables us to parse it correctly on the client, and moves us away from the anonymous component-based API models we had used in the past. Many models are still used across modules, however. These include, but are not limited to, buttons, headers and footers, and listing cards. To parse a new module, we either have to build a new component if it doesn't exist yet, or reuse one that does. Adding a footer to a module, for example, can be as simple as:

// Model
{
  "type": "my_module",
  "my_module": {
    "target_listing": { },
    "recommended_listings": [ ],
    "footer": { } // Add footer here
  }
}

// View
var body: some View {
    VStack {
        ListingView(listing: targetListing)
        ListingCarouselView(listings: recommendedListings)
        MyFooterView(footer: footer) // Add footer here
    }
}

We also used Decodable implementations for our API parsing, leading to faster, clearer code and an easier way to handle optionals. With Etsy's internal APIv3 framework built on top of Apple's Decodable protocol, it is very straightforward to define a model, decide what is and isn't optional, and let the container handle the rest. For example, if the footer were optional but the target and recommended listings required, decoding would look like this:

init(from decoder: Decoder) throws {
    let container = try decoder.containerV3(keyedBy: CodingKeys.self)

    // These will throw if they aren't included in the response
    self.targetListing = try container.requireV3(forKey: .targetListing)
    self.recommendedListings = try container.requireV3(forKey: .recommendedListings)

    // Footer is optional
    self.footer = container.decodeV3(forKey: .footer)
}

As for laying out the view on the screen, we used a Swift UI List to make the most of the under-the-hood cell reuse that List provides.

Figure 2. On the left-hand side, a diagram of how the DealsUI view is embedded in the Etsy app. On the right-hand side, a diagram of how the DealsUI framework takes the API response and renders a list of module views with individual components.

Previews, previews, more previews

If we were going to maintain a good development pace, we needed to figure out a clean way to use Swift UI previews. Previewing a small component, like a header that takes a string, is simple enough: just initialize the header view with the header string. For more complex views, though, it gets cumbersome to build a mock API response every time you need to preview, and this complexity is only amplified when previewing an entire Deals module. To streamline the process, we decided to add a Previews enum to our more complex models. A good example of this is in the Deals Just for You coupon cards.
These cards display an image or an array of images, a few lines of custom text (depending on the coupon type), and a button. Our Previews enum for this API model looks like this:

// In an extension to DealsForYouCard
enum Previews {
    static var shopCouponThreeImage: ResponseModels.DealsForYouCard {
        let titleText = "IrvingtonWoodworksStudio"
        let images = [...] // Three images
        let button = ResponseModels.Button(
            buttonText: "10% off shop",
            action: .init(...)
        )
        return ResponseModels.DealsForYouCard(
            button: button,
            saleBadge: "20% off",
            titleText: titleText,
            subtitleText: "Favorited shop",
            action: .init(...),
            images: images
        )
    }

    static var listingCoupon: ResponseModels.DealsForYouCard { ... }
}

Then previewing a variety of coupon cards is as straightforward as:

#Preview {
    DealsForYouCardView(coupon: .Previews.listingCoupon)
}

#Preview {
    DealsForYouCardView(coupon: .Previews.shopCouponThreeImage)
}

The other perk of this architecture is that it makes it very easy to nest previews, for example when previewing an entire module. To build preview data for the Deals for You module, we can use our coupon examples this way:

// In an extension to DealsForYouModule
enum Previews {
    static var mockModule: ResponseModels.DealsForYouModule {
        let items: [ResponseModels.DealsForYouCard] = [
            .Previews.listingCoupon,
            .Previews.shopCouponThreeImage,
            .Previews.shopCouponTwoImage
        ]
        let header = ResponseModels.DealsForYouHeader(title: "Deals just for you")
        return .init(header: header, items: items)
    }
}

These enums are brief, clear, and easy to understand; they allow us to lean into the benefits of modularity. This architecture, along with our Decodable models, also enabled us to clear a roadblock that used to occur when our team had to wait for API support before we could build new modules. For example, both the Similar Items on Sale and Extra Special Deals modules in the Deals tab were built via previews, and were ready approximately two weeks before the corresponding API work was complete, helping us meet deadlines without having to wait for a new App Store release. By taking full advantage of Swift UI's modularity and previewability, not only were we able to prove out a set of new technologies, we also exceeded product expectations by significantly beating our deadlines, even with the initial overhead of adopting the framework.

Challenges: UIKit interoperability

Particularly when it came to tasks like navigation and favoriting, interfacing between our module and the Etsy codebase could pose challenges. An assumption we had as engineers going into this project was that the code to open a listing page, for example, would just be readily available to use; this was not the case, however. Most navigation code within the Etsy codebase is handled by an Objective C class called EtsyScreenController. While in the normal target it's as straightforward as calling a function to open a listing page, that functionality was not available to us in our Deals module. One option would have been to build our own navigation logic using Swift UI navigation stacks; we weren't trying to reinvent the wheel, however. To balance product deadlines and keep things as simple as possible, we decided not to be dogmatic, and to handle navigation outside of our framework. We did this by building a custom @Environment struct, called DealsAction, which passes off responsibility for navigation back to the main target, and uses the new Swift callAsFunction() feature so we can treat this struct like a function in our views.
We have a concept of a DealsAction type in our API response, which enables us to match an action with an actionable part of the screen. For example, a button response has an action that will be executed when a user taps the button. The DealsAction handler takes that action, and uses our existing UIKit code to perform it. The Deals tab is wrapped in a UIHostingController in the main Etsy target, so when setting up the Swift UI view, we also set the DealsAction environment object using a custom view modifier: let dealsView = DealsView() .handleDealsAction { [weak self] in self?.handleAction(action: $0) } ... func handleDealsAction(action: DealsAction) { // UIKit code to execute action } Then, when we need to perform an action from a Swift UI view, the action handler is available at any layer of the view hierarchy within the Deals tab. Performing the action is as simple as: @Environment(\.handleDealsAction) var handleDealsAction: DealsAction ... MyButton(title: buttonText, fillWidth: false) { handleDealsAction(model.button?.action) } We reused this pattern for other existing functionality that was only available in the main target. For example, we built environment objects for favoriting listings, for following a shop, and for logging performance metrics. This pattern allows us to include environment objects as needed, and it simplifies adding action handling to any view. Instead of rebuilding this functionality in our Tuist module in pure Swift, which could have taken multiple sprints, we struck a balance between modernization and the need to meet product deadlines. Challenges: Listing Cards The listing card view is a common component used across multiple screens within the Etsy app. This component was originally written in Objective-C and throughout the years has been enhanced to support multiple configurations and layouts, and to be available for A/B testing. It also has built-in functionality like favoriting, which engineers shouldn't have to reimplement each time they want to present a listing card. Figure 3. A diagram of how listing card views are conventionally built in UIKit, using configuration options and the analytics framework to combine various UIKit subviews. It's been our practice to reuse this same single component and make small modifications to support changes in the UI, as per product or experimentation requirements. This means that many functionalities, such as favoriting, long-press menus, and image manipulation, are heavily coupled with this single component, many parts of which are still written in Objective-C. Early in the process of developing the new tab, we decided to scope out a way of supporting conventional listing card designs—ones that matched existing cards elsewhere in the app—without having to rebuild the entire card component in Swift UI. We knew a rebuild would eventually be necessary, since we expected to have to support listing cards that differed significantly from the standard designs, but the scope of such a rebuild was a known unknown. To balance our deadlines, we decided to push this more ambitious goal back until we knew we had product bandwidth. Since the listing card view is heavily coupled with old parts of the codebase, however, it wasn't as simple as just typing import ListingCard and flying along. We faced two challenges: first, the API model for a listing card couldn't be imported into our module, and second, the view couldn't be imported for simple use in a UIViewRepresentable wrapper.
To address these, we deferred responsibility back up to the UIKit view controller. Our models for a listing card component look something like this: struct ListingCard { public let listingCardWrapper: ListingCardWrapper let listingCard: TypedListingCard } The model is parsed in two ways: as a wrapper, where it is parsed as an untyped dictionary that will eventually be used to initialize our legacy listing card model, and as a TypedListingCard, which is used only within the Deals tab module. Figure 4. A diagram of how a UIKit listing card builder is passed from the main target to the Deals framework for rendering listing cards. To build the listing card view, we pass in a view builder to the SwiftUI DealsView initializer in the hosting controller code. Here, we are in the full Etsy app codebase, meaning that we have access to the legacy listing card code. When we need to build a listing card, we use this view builder as follows: var body: some View { LazyVGrid(...) { ForEach(listings) { listing in cardViewBuilder(listing) // Returns a UIViewRepresentable } } } There was some initial overhead involved in getting these cards set up, but it was worth it to guarantee that engineering unknowns in a Swift UI rewrite wouldn’t block us and compromise our deadlines. Once built, the support for legacy cards enabled us to reuse them easily wherever they were needed. In fact, legacy support was one of the things that helped us move faster than we expected, and it became possible to stretch ourselves and build at least some listing cards in the Deals tab entirely in Swift UI. This meant that writing the wrapper ultimately gave us the space we needed to avoid having to rely solely on the wrapper! Conclusion After just three months of engineering work, the Deals tab was built and ready to go, even beating product deadlines. While it took some engineering effort to overcome initial hurdles, as well as the switch in context from working in UIKit in the main target to working in Swift UI in our own framework, once we had solutions to those challenges and could really take advantage of the new architecture, we saw a very substantial increase in our engineering velocity. Instead of taking multiple sprints to build, new modules could take just one sprint or less; front-end work was decoupled from API work using Previews, which meant we no longer had to wait for mock responses or even API support at all; and maybe most important, it was fun to use Swift UI’s clear and straightforward declarative UI building, and see our changes in real time! From a product perspective, the Deals tab was a great success: buyers converted their sessions more frequently, and we saw an increase in visits to the Etsy app. The tab was rolled out to all users in mid October, and has seen significant engagement, particularly during Black Friday and Cyber Monday. By being bold and by diving confidently into new frameworks that we expected to see benefits from, we improved engineer experience and not just met but beat our product deadlines. More teams at Etsy are using Swift UI and Tuist in their product work now, thanks to the success of our undertaking, sometimes using our patterns to work through hurdles, sometimes creating their own. We expect to see more of this kind of modernization start to make its way into the codebase. As we iterate on the Deals tab over the next year, and make it even easier for buyers to find the deals that mean the most to them, we look forward to continuing to work in the same spirit. 
Special thanks to Vangeli Ontiveros for the diagrams in this article, and a huge shoutout to the whole App Deals team for their hard work on this project!
In the past, sellers were responsible for managing and fulfilling their own tax obligations. However, more and more jurisdictions are now requiring marketplaces such as Etsy to collect the tax from buyers and remit the tax to the relevant authorities. Etsy now plays an active role in collecting tax from buyers and remitting it all over the world. In this post, I will walk you through our tax calculation infrastructure and how we adapted to the ongoing increase in traffic and business needs over the years. The tax calculation workflow We determine tax whenever a buyer adds an item to their Etsy shopping cart. The tax determination is based on buyer and seller location and product category, and a set of tax rules and mappings. To handle the details of these calculations we partner with Vertex, and issue a call to their tax engine via the Quotation Request API to get the right amount to show in our buyer's cart. Vertex ensures accurate and efficient tax management and continuously updates the tax rules and rates for jurisdictions around the world. The two main API calls we use are the Quotation Request and DistributeTaxRequest SOAP calls. When the buyer proceeds to payment, an order is created, and we call back to Vertex with a DistributeTaxRequest sending the order information and tax details. We sync information with Vertex through the order fulfillment lifecycle. To keep things up to date in case an order is canceled or a refund needs to be issued later on, we send the details of cancellations and refunds to the tax engine via DistributeTaxRequest. This ensures that when Vertex generates tax reports for us they will be based on a complete record of all the relevant transactions. Etsy collects the tax from the buyers and remits that tax to the taxing authority when required. Generating tax details for reporting and audit purposes Vertex comes with a variety of report formats out of the box, and gives us tools to define our own. When Etsy calls the Distribute Tax API, Vertex saves the information we pass to it as raw metadata in its tax journal database. A daily cron job in Vertex then moves this data to the transaction detail table, populating it with tax info. When reports and audit data are generated, we download these reports and import them into Etsy's big data environment, completing the workflow. Mapping the Etsy taxonomy to tax categories Etsy maintains product categories to help our buyers find exactly the items they're looking for. To determine whether transactions are taxed or exempt it's not enough to know item prices and buyer locations: we have to map our product categories to Vertex's rule drivers. That was an effort involving not just engineering but also our tax and analytics teams, and with the wide range of Etsy taxonomy categories it was no small task. Handling increased API traffic Coping with the continuous increase in traffic while maintaining a fast, delay-free checkout experience has been an ongoing challenge. Of the various upgrades we made, the most important were switching to multiple Vertex instances and shadow logging. Multiple instance upgrade In our initial integration, we were using the same Vertex instance for Quotation and Distribute calls, and that same instance was also responsible for generating reports. This report generation started to affect our checkout experience. Reports are generally used by our tax team and they run them on a regular basis.
But on top of that, we also run daily reports to feed the data captured by Vertex back into our own system for analytics purposes. We solved this by routing quotation calls to one instance and distribute calls to another. This maintained a clear separation of responsibilities and avoided interference between the two processes. We had to align the configurations between the instances as well. Splitting up the quotation and distribute calls also opened the door to horizontal scaling: we can now add as many instances of each type as we need and load balance requests between them. For example, when a request type lists multiple instances, we load balance between them using the cart_id for quotations and the receipt_id for distributes (e.g. cart_id % quotation_instance_count). Shadow logging Shadow logging the requests helped us simulate stress on Vertex and monitor the checkout experience. We have used this technique multiple times in the past. Whenever we faced a situation like adding five hundred thousand more listings whose taxes would be passed through the Vertex engine, we were concerned that the increase in traffic might impact the buyer experience. To ensure it wouldn't, we tested for a period of time by slowly ramping up shadow requests to Vertex. "Shadow requests" are test requests that we send to Vertex from real orders, without applying the calculated tax details to buyers' carts. This simulates the load on Vertex while letting us monitor the cart checkout experience. Once we had done the shadowing and seen how well Vertex handled the increased traffic, we were confident that releasing the feature would not have any performance implications. Conclusion Given the ever-increasing traffic and the volume of data involved, we will have to keep improving our design to support them. We've also had to address analytics, reporting, configuration sync, and more in designing the system, but we'll leave that story for next time.
A little while ago, Etsy introduced a new feature in its iOS app that could place Etsy sellers' artwork on a user's wall using Apple's Augmented Reality (AR) tools. It let them visualize how a piece would look in their space, and even gave them an idea of its size options. When we launched the feature as a beta, it was only available in "wall art"-related categories, and after the initial rollout we were eager to expand it to work with more categories. What differentiates Etsy is the nature of our sellers’ unique items. Our sellers create offerings that can be personalized in numbers of ways, and they often hand-make orders based on demand. Taking the same approach we did with wall art and attempting to show 3D models of millions of Etsy items – many of which could be further customized – would be a huge undertaking. Nevertheless, with inspiration from Etsy's Guiding Principles, we decided to dig deeper into the feature. What could we improve in the way it worked behind the scenes? What about it would make for a compelling extension into the rest of our vast marketplace? We took steps to improve how we parse seller-provided data, and we used this data with Apple’s AR technology to make it easy for Etsy users to understand the size and scale of an object they might want to buy. We decided we could make tape measures obsolete (or at least not quite as essential) for our home-decor shoppers by building an AR tool to let them visualize–conveniently, accurately, and with minimal effort–how an item would fit in their space. Improving dimension parsing In our original post on the wall art experience, we mentioned the complexity involved in doing things like inferring an item's dimensions from text in its description. Etsy allows sellers to add data about dimensions in a structured way when they create a listing, but that wasn't always the case, and some sellers still provide those details in places like the description or the item's title. The solution was to create a regex-based parser in the iOS App that would glean dimensions (width and height) by sifting through a small number of free-form fields–title, description, customization information, overview–looking for specific patterns. We were satisfied being able to catch most of the formats in which our sellers reported dimensions, handling variable positions of values and units (3 in x 5 in vs 3 x 5 in), different long and short names of units, special unit characters (‘, “), and so on, in all the different languages that Etsy supports. Migrating our parsing functionality to the API backend was a first step towards making the AR measuring tool platform-independent, so we could bring it to our Android App as well. It would also be a help in development, since we could iterate improvements to our regex patterns faster outside the app release schedule. And we’d get more consistent dimensions because we'd be able to cache the results instead of having to parse them live on the client at each visit. We knew that an extended AR experience would need to reliably show our users size options for items that had them, so we prioritized the effort to parse out dimensions from variations in listings. We sanitized free-form text input fields that might contain dimensions—inputs like title or description—so that we could catch a wider range of formats. (Several different characters can be used to write quotation marks, used as shorthand for inches and feet, and we needed to handle special characters for new lines, fraction ligatures like ½ or ¼, etc.) 
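To make the parsing approach concrete, here is a deliberately simplified Python sketch of this kind of regex-based dimension extraction. It is illustrative only: the pattern and helper names are ours, and the handful of formats it recognizes is a small subset of what the production parser handles (localization, fraction ligatures, unit conversions in parentheses, and so on).

```python
import re

# Simplified, illustrative patterns -- not Etsy's production parser.
UNIT = r"(?:inches|inch|in|feet|foot|ft|centimeters|cm|meters|m|['\"])"
NUMBER = r"\d+(?:[.,]\d+)?"

# Matches forms like "3 x 5 in", "3 in x 5 in", or 12" x 16".
DIMENSIONS = re.compile(
    rf"(?P<w>{NUMBER})\s*(?P<w_unit>{UNIT})?\s*[xX×]\s*(?P<h>{NUMBER})\s*(?P<h_unit>{UNIT})",
    re.IGNORECASE,
)

def parse_dimensions(text: str):
    """Return (width, height, unit) for the first dimension-like pattern found, or None."""
    match = DIMENSIONS.search(text)
    if not match:
        return None
    unit = match.group("h_unit") or match.group("w_unit")
    width = float(match.group("w").replace(",", "."))
    height = float(match.group("h").replace(",", "."))
    return width, height, unit

print(parse_dimensions("Handmade print, 3 x 5 in, ships free"))   # (3.0, 5.0, 'in')
print(parse_dimensions('Poster 12" x 16", matte finish'))         # (12.0, 16.0, '"')
```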
Our regex pattern was split and updated so it could detect: Measurement units in plural forms (inches, feet, etc.); Incorrect spellings (e.g. "foots"); Localization of measurement units in the languages spoken by Etsy’s users ("meters", "metros", and "mètres" in English, Spanish, and French, respectively); Other formats in which dimensions are captured by sellers like dimensions with unit conversions in parentheses (e.g. 12 in x 12 in (30 cm x 30 cm)) or with complex measurements in the imperial system (3’6”). Making our dimension parsing more robust and bringing it server-side had several ancillary benefits. We were able to maintain the functionality of our iOS app while removing a lot of client-side code, even in Etsy’s App Clip, where size is a matter of utmost importance. And though regex processing isn’t that processor-intensive, the view feature performed better once we implemented server-side caching of parsed dimensions. We figured we could even take the parsing offline (rather than parsing every listing on every visit) by running a backfill process to store dimensions in our database and deliver them to the App along with item details. We found, thanks to decoupling our parser work from the App release cycle, that we were able to test hypotheses faster and iterate at a quicker pace. So we could proceed to some improvements that would have been quite resource-intensive if we had tried to implement them on the native app side. Sellers often provide dimensions in inconsistent units, for instance, or they might add the same data multiple times in different fields, when there are variations in properties like material or color. We worked out ways to de-duplicate this data during parsing, to minimize the number of size options we show users. (Though where we find dimensions that are specifically associated with variations, we make sure to retain them, since those will more directly correlate with offering prices.) And we made it possible to prioritize structured dimension data, where sellers have captured it in dedicated fields, as a more reliable source of truth than free-form parsing. Measuring in 3D The box With this new and improved dimension data coming to us from the server, we had to figure out the right way to present it in 3D in iOS. The display needed to be intuitive, so our users would know more or less at a glance what the tool was and how to interact with it. Ultimately, we decided to present a rectangular prism-type object scaled to the parsed dimensions we have for a given listing. Apple's SceneKit framework – specifically its SCNBox class – is what creates this box, which of course we style with the Etsy Orange look. So that users understand the box's purpose, we make sure to display the length on each side. We use SceneKit's SCNNode class to create the pills displaying our measurements. Users drag or tap the measuring box to move it around, and it can rotate on all axes – all made possible by having a different animation for each type of rotation using SCNActions. Rotation is a must-have feature: when we place the measuring box in a user's space, we may not always be able to get the orientation correct. We might, as in the illustration below, place a side table vertically on the floor instead of horizontally. Our users would have a poor experience of the measuring tool if they couldn't adjust for that. 
(Note that you may see some blinking yellow dots when you try out the AR experience: these are called feature points, and they're useful for debugging, to give us an idea of what surfaces are successfully being detected.) Environment occlusion In addition to ensuring the box would be scaled correctly, we wanted it to "sit" as realistically as possible in the real world, so we built in scene occlusion. When a user places the measuring box in a room with other furniture, scene occlusion lets it interact with real-life objects as if the box were actually there. Users get valuable information this way about how an item will fit in their space. Will that end table go between the wall and couch? Will it be tall enough to be visible from behind the couch? (As demonstrated below, the table will indeed be tall enough.) Environment occlusion became a possibility with LiDAR, a method of determining depth using laser light. Although LiDAR has been around for a few decades, used to map everything from archeological sites to agricultural fields, Apple only included LiDAR scanners in iPhones and iPads beginning in 2020, with the 4th-generation iPad Pro and the iPhone 12 Pro. An iPhone's LiDAR scanner retrieves depth information from the area it scans and converts it into a series of vertices which connect to form a mesh (or a surface). To add occlusion to our SceneKit-backed AR feature, we convert the mesh into a 3D object and place it (invisibly to the user) in the environment shown on their phone. As the LiDAR scanner measures more of the environment, we have more meshes to convert into objects and place in 3D. The video below shows an AR session where, for debugging purposes, we assign a random color to the detected mesh objects. Each different colored outline shown over a real-world object represents a different mesh. Notice how, as we scan more of the room, the device adds more mesh objects as it continues drawing out the environment. The user's device uses these mesh objects to know when and how to occlude the measuring box. Essentially, these mesh objects help determine where the measuring box is relative to all the real-world items and surfaces it should respect. Taking advantage of occlusion gives our users an especially realistic AR experience. In the side-by-side comparison below, the video on the left shows how mesh objects found in the environment determine what part of the measuring box will be hidden as the camera moves in front of the desk. The video on the right shows the exact same thing, but with the mesh objects hidden. (Left: mesh objects are visible. Right: mesh objects are hidden.) Closing thoughts This project took a 2D concept, our Wall View experience, and literally extended it into 3-dimensional space using Apple's newest AR tools. And though the preparatory work we did improving our dimension parser may not be anything to look at, without the consistency and accuracy of that parsed information this newly realistic and interactive tool would not have been possible. Nearly a million Etsy items now have real-size AR functionality added to them, viewed by tens of thousands of Etsy users every week. As our marketplace evolves and devices become more powerful, working on features like this only increases our appetite for more and brings us closer to providing our users with a marketplace that lets them make the most informed decision about their purchases effortlessly. Special shoutout to Jacob Van Order and Siri McClean as well as the rest of our team for their work on this.
Introduction Each year, Etsy hosts an event known as “CodeMosaic” - an internal hackathon in which Etsy admin propose and build bold advances quickly in our technology across a number of different themes. People across Etsy source ideas, organize into teams, and then have 2-3 days to build innovative proofs-of-concept that might deliver big wins for Etsy’s buyers and sellers, or improve internal engineering systems and workflows. Besides being a ton of fun, CodeMosaic is a time for engineers to pilot novel ideas. Our team’s project this year was extremely ambitious - we wanted to build a system for stateful machine learning (ML) model training and online machine learning. While our ML pipelines are no stranger to streaming data, we currently don’t have any models that learn in an online context - that is, that can have their weights updated in near-real time. Stateful training updates an already-trained ML model artifact incrementally, sparing the cost of retraining models from scratch. Online learning updates model weights in production rather than via batch processes. Combined, the two approaches can be extremely powerful. A study conducted by Grubhub in 2021 reported that a shift to stateful online learning saw up to a 45x reduction in costs with a 20% increase in metrics, and I’m all about saving money to make money. Day 1 - Planning Of course, building such a complex system would be no easy task. The ML pipelines we use to generate training data from user actions require a number of offline, scheduled batch jobs. As a result it takes quite a while, 40 hours at a minimum, for user actions to be reflected in a model’s weights. To make this project a success over the course of three days, we needed to scope our work tightly across three streams: Real-time training data - the task here was to circumvent the batch jobs responsible for our current training data and get attributions (user actions) right from the source. A service to consume the data stream and learn incrementally - today, we heavily leverage TensorFlow for model training. We needed to be able to load a model's weights into memory, read data from a stream, update that model, and incrementally push it out to be served online. Evaluation - we'd have to make a case for our approach by validating its performance benefits over our current batch processes. No matter how much we limited the scope it wasn't going to be easy, but we broke into three subteams reflecting each track of work and began moving towards implementation. Day 2 - Implementation The real-time training data team began by looking far upstream of the batch jobs that compute training data - at Etsy’s Beacon Main Kafka stream, which contains bot-filtered events. By using Kafka SQL and some real-time calls to our streaming feature platform, Rivulet, we figured we could put together a realistic approach to solving this part of the problem. Of course, as with all hackathon ideas it was easier said than done. Much of our feature data uses the binary avro data format for serialization, and finding the proper schema for deserializing and joining this data was troublesome. The team spent most of the second day munging the data in an attempt to join all the proper sources across platforms. And though we weren't able to write the output to a new topic, the team actually did manage to join multiple data sources in a way that generated real-time training data! 
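The second work stream's goal (load a trained model into memory, read from a stream, update the weights incrementally, and push new artifacts out) can be sketched roughly as follows. This is a minimal illustration under our own assumptions, with a mocked event stream and hypothetical model paths; it is not the code the team actually wrote.

```python
import time
import numpy as np
import tensorflow as tf

def mock_event_stream(batch_size=256, n_features=32):
    """Stand-in for a Kafka consumer: yields (features, labels) mini-batches."""
    while True:
        x = np.random.rand(batch_size, n_features).astype("float32")
        y = (np.random.rand(batch_size, 1) > 0.5).astype("float32")
        yield x, y

# Stateful training: start from the latest artifact instead of retraining from scratch.
# Assumes the artifact was saved with its optimizer state so training can resume.
model_path = "gs://models/ads-ranker/latest"  # hypothetical location
model = tf.keras.models.load_model(model_path)

PUBLISH_EVERY = 100  # push a fresh artifact out every N mini-batches
for step, (x, y) in enumerate(mock_event_stream(), start=1):
    loss = model.train_on_batch(x, y)  # incremental weight update from streamed data
    if step % PUBLISH_EVERY == 0:
        model.save(f"gs://models/ads-ranker/{int(time.time())}")  # hypothetical path
```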
Meanwhile the team focused on building the consumer service that would actually do the incremental learning faced a different kind of challenge: decision making. What type of model were we going to use? Knowing we weren't going to be able to use the actual training data stream yet - how would we mock it? Where and how often should we push new model artifacts out? After significant discussion, we decided to try using an Ad Ranking model as we had an Ads ML engineer in our group and the Ads models take a long time to train - meaning we could squeeze a lot of benefit out of them by implementing continuous training. The engineers in the group began to structure code that pulled an older Ads model into memory and made incremental updates to the weights to satisfy the second requirement. That meant that all we had left to handle was the most challenging task - evaluation. None of this architecture would mean anything if a model that was trained online performed worse than the model retrained daily in batch. Evaluating a model with more training periods is also more difficult, as each period we'd need to run the model on some held-out data in order to get an accurate reading without data leakage. Instead of performing an extremely laborious and time-intensive evaluation for continuous training like the one outlined above, we chose to have a bit more fun with it. After all, it was a hackathon! What if we made it a competition? Pick a single high-performing Etsy ad and see which surfaced it first, our continuously trained model or the boring old batch-trained one? We figured if we could get a continuously trained model to recommend a high-performing ad sooner, we'd have done the job! So we set about searching for a high-performing Etsy ad and training data that would allow us to validate our work. Of course, by the time we were even deciding on an appropriate advertised listing, it was the end of day two, and it was pretty clear the idea wasn't going to play out before it was time for presentations. But still a fun thought, right? Presentation takeaways and impact Day 3 gives you a small window for tidying up work and slides, followed by team presentations. At this point, we loosely had these three things: training data from much earlier in our batch processing pipelines; a Kafka consumer that could almost update a TensorFlow model incrementally; and a few click attributions and data for a specific listing. In the hackathon spirit, we phoned it in and pivoted towards focusing on the theoretical impact of what we'd been able to achieve! The first important potential area of impact was cost savings. We estimated that removing the daily "cold-start" training and replacing it with continuous training would save about $212K annually in Google Cloud costs for the four models in Ads alone. This is a huge potential win - especially when coupled with the likely metrics gains coming from more reactive models. After all, if we were able to get events to models 40 hours earlier, who knows how much better our ranking could get! Future directions and conclusion Like many hackathon projects, there's no shortage of hurdles getting this work into a production state. Aside from the infrastructure required to actually architect a continuous-training pipeline, we'd need a significant number of high-quality checks and balances to ensure that updating models in real-time didn't lead to sudden degradations in performance.
The amount of development, number of parties involved, and the breadth of expertise to get this into production would surely be extensive. However, as ML continues to mature, we should be able to enable more complex architectures with less overhead.
Introduction Personalization is vital to connect our unique marketplace to the right buyer at the right time. Etsy has recently introduced a novel, general approach to personalizing ML models based on encoding and learning from short-term (one-hour) sequences of user actions through a reusable three-component deep learning module, the adSformer Diversifiable Personalization Module (ADPM). We describe in detail our method in our recent paper, with an emphasis on personalizing the CTR (clickthrough rate) and PCCVR (post-click conversion rate) ranking models we use in Etsy Ads. Here, we'd like to present a brief overview. Etsy offers its sellers the opportunity to place sponsored listings as a supplement to the organic results returned by Etsy search. For sellers and buyers alike, it’s important that those sponsored listings be as relevant to the user’s intent as possible. As Figure 1 suggests, when it comes to search, a “jacket” isn't always just any jacket: Figure 1: Ad results for the query jacket for a user who has recently interacted with mens leather jackets. In the top row, the results without personalized ranking; in the bottom row, the results with session personalization. For ads to be relevant, they need to be personalized. If we define a “session” as a one-hour shopping window, and make a histogram of the total number of listings viewed across a sample of sessions (Fig. 2), we see that a power law distribution emerges. The vast majority of users interact with only a small number of listings before leaving their sessions. Figure 2: A histogram of listing views in a user session. Most users see fewer than ten listings in a one-hour shopping window. Understood simply in terms of listing views, it might seem that session personalization would be an insurmountable challenge. To overcome this challenge we leverage a rich stream of user actions surrounding those views and communicating intent, for example: search queries, item favorites, views, add-to-carts, and purchases. Our rankers can optimize the shopping experience in the moment by utilizing streaming features being made available within seconds of these user actions. Consider a hypothetical sequence of lamps viewed by a buyer within the last hour. Figure 3: An example of a user session with the sequence of items viewed over time. 70s orange lamp ---> retro table lamp --> vintage mushroom lamp Not only is the buyer looking within a particular set of lamps (orange, mushroom-shaped), but they arrived at these lamps through a sequence of query refinements. The search content itself contains information about the visual and textual similarities between the listings, and the order in which the queries occur adds another dimension of information. The content and the sequence of events can be used together to infer what is driving the user’s current interest in lamps. adSformer Diversifiable Personalization Module The adSformer Diversifiable Personalization Module (ADPM), illustrated on the left hand side of Figure 4, is Etsy's solution for using temporal and content signals for session personalization. A dynamic representation of the user is generated from a sequence of the user's most recent streamed actions. The input sequence contains item IDs, queries issued and categories viewed. We consider the item IDs, queries, and categories as “entities” that have recent interactions within the session. 
For each of these entities we consider different types of actions within a user session–views, recent cart-adds, favorites, and purchases–and we encode each type of entity/action pair separately. This lets us capture fine-grained information about the user's interests in their current session. Figure 4: On the left, a stack representing the ADPM architecture. The right part of the figure is a blown-out illustration of the adSformer Encoder component. Through ablation studies we found that ADPM's three components work together symbiotically to outperform experiments where each component is considered independently. Furthermore, in deployed applications, the diversity of learned signals improves robustness to input distribution shifts. It also leads to more relevant personalized results, because we understand the user from multiple perspectives. Here is how the three components operate: Component One: The adSformer Encoder The adSformer encoder component uses one or more custom adSformer blocks illustrated in the right panel of Figure 4. This component learns a deep, expressive representation of the one-hour input sequence. The adSformer block modifies the standard transformer block in the attention literature by adding a final global max pooling layer. The pooling layer downsamples the block's outputs by extracting the most salient signals from the sequence representation instead of outputting the fully concatenated standard transformer output. Formally, for a user's one-hour sequence S of viewed item IDs, the adSformer encoder is defined as the output of a stack of layers g(x), where x is the output of the previous layer and the final layer's output is the component's output o1. The first layer is an embedding of item ID and position. Component Two: Pretrained Representations Component two employs pretrained embeddings of the item IDs that users have interacted with, together with average pooling, to encode the one-hour sequence of user actions. Depending on downstream performance and availability, we choose from multimodal (AIR) representations and visual representations. Thus component two encodes rich image, text and multimodal signals from all the items in the sequence. The advantage of leveraging pretrained item embeddings is that these rich representations are learned efficiently offline using complex deep learning architectures that would not be feasible online in real time. Formally, for a given one-hour sequence of m_1hr item IDs with pretrained d-dimensional embedding vectors e, we compute the sequence representation o2 by average pooling the embeddings. Component Three: Representations Learned "On the Fly" The third component of ADPM introduces representations learned for each sequence from scratch in its own vector space as part of the downstream models. This component learns lightweight representations for many different sequences for which we do not have pretrained representations available, for example sequences of favorited shop IDs. Formally, for z one-hour sequences S of entities acted upon, we learn embeddings for each entity from scratch, encode each sequence in its own vector space, and combine the results into the component's output o3. The intermediary outputs of the three components are concatenated to form the final ADPM output, the dynamic user representation u. This user representation is then concatenated to the input vector in various rankers or recommenders that we want to personalize in real time.
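The exact formulas for these outputs are given in the paper; as a rough reconstruction from the descriptions above (our own simplified notation, not necessarily the paper's), the three component outputs and the final ADPM output look like this:

```latex
% Simplified reconstruction of the ADPM outputs -- our notation;
% see the ADPM paper for the exact definitions.
\begin{align*}
  o_1 &= g_k\bigl(g_{k-1}(\dots g_1(E(S))\dots)\bigr)
        && \text{stacked adSformer blocks over the embedded sequence } S \\
  o_2 &= \frac{1}{m_{1hr}} \sum_{i=1}^{m_{1hr}} e_i
        && \text{average pooling of the pretrained item embeddings } e_i \\
  o_3 &= \bigl[\,\mathrm{pool}(S_1)\,;\;\dots\,;\;\mathrm{pool}(S_z)\,\bigr]
        && \text{embeddings learned from scratch for each of the } z \text{ sequences} \\
  u   &= [\,o_1\,;\;o_2\,;\;o_3\,]
        && \text{dynamic user representation passed to downstream models}
\end{align*}
```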
Formally, for one-hour variable-length sequences of user actions S, the final ADPM output u is the concatenation of the three components' outputs o1, o2, and o3, as sketched above. From a software perspective, the module is implemented as a TensorFlow Keras module which can easily be employed in downstream models through a simple import statement. Pretrained Representation Learning The second component of the ADPM includes pretrained representations. We rely on several pretrained representations: image embeddings, text embeddings, and multimodal item representations. Visual Representations In Etsy Ads, we employ image signals across a variety of tasks, such as visually similar candidate generation, search by image, as inputs for learning other pretrained representations, and in the ADPM's second component. To effectively leverage the rich signal encoded in Etsy Ads images we train image embeddings in a multitask classification learning paradigm. By using multiple classification heads, such as taxonomy, color, and material, our representations are able to capture more diverse information about the image. So far we have derived great benefit from our multitask visual embeddings, trained using a lightweight EfficientNetB0 architecture with ImageNet-pretrained weights as the backbone. We replaced the final layer with a 256-dimensional convolutional block, which becomes the output embedding. We apply random image rotation, translation, zoom, and a color contrast transformation to augment the dataset during training. We are currently in the process of updating the backbone architectures to efficient vision transformers to further improve the quality of the image representations and the benefits derived in downstream applications, including the ADPM. Ads Information Retrieval Representations Ads Information Retrieval (AIR) item representations encode an item ID through a metric learning approach, which aims to learn a distance function or similarity metric between two items. Standard approaches to metric learning include siamese networks, contrastive loss, and triplet loss. However, we found more interpretable results using a sampled in-batch softmax loss. This method treats each batch as a classification problem, pairing all the items in a batch that were co-clicked. A pseudo-two-tower architecture is used to encode source items and candidate items in separate towers that share all trainable weights. Each item tower captures and encodes information about an item's title, image, primary color, attributes, category, etc. This information diversity is key to our personalization outcomes. By leveraging a variety of data sources, the system can identify patterns and insights that would be missed by a more limited set of inputs. ADPM-Personalized Sponsored Search ADPM's effectiveness and generality are demonstrated in the way we use it to personalize the CTR prediction model in EtsyAds' Sponsored Search. The ADPM encodes reverse-chronological sequences of recent user actions (in the sliding one-hour window we've discussed), anywhere on Etsy, for both logged-in and logged-out users. We concatenate ADPM's output, the dynamic user representation, to the rest of the wide input vector in the CTR model. To fully leverage this even wider input vector, a deep and cross (DCN) interaction module is included in the overall CTR architecture. If we remove the DCN module, the CTR model's ROC-AUC drops by 1.17%. The architecture of the ADPM-personalized CTR prediction model employed by EtsyAds in sponsored search is given in Figure 5.
(We also employ the ADPM to personalize the PCCVR model with a similar architecture, which naturally led to ensembling the two models in a multitask architecture, a topic beyond the scope of this blog post.) Figure 5: An example of how the ADPM is used in a downstream ranking model The ADPM-personalized CTR and PCCVR models outperformed the CTR and PCCVR non-personalized production baselines by +2.66% and +2.42%, respectively, in offline Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following the robust online gains in A/B tests, we deployed the ADPM-personalized sponsored search system to 100% of traffic. Conclusion The adSformer diversifiable personalization module (ADPM) is a scalable, general approach to model personalization from short-term sequences of recent user actions. Its use in sponsored search to personalize our ranking and bidding models is a milestone for EtsyAds, and is delivering greater relevance in sponsored placements for the millions of buyers and sellers that Etsy's marketplace brings together. If you would like more details about ADPM, please see our paper.
Introduction The Feature Systems team at Etsy is responsible for the platform and services through which machine learning (ML) practitioners create, manage and consume feature data for their machine learning models. We recently made new real-time features available through our streaming feature platform, Rivulet, where we return things like “most recent add-to-carts.” While timeseries data itself wasn’t new to our system, these newer features from our streaming feature service would be the first timeseries inputs to be supplied to our ML models themselves to inform search, ads, and recommendations use cases. Not too long after we made these features available to users for ML model training, we received a message from Harshal, an ML practitioner on Recommendations, warning us of "major problems" lying in wait. Figure 1. A user message alerting us to the possibility of "major problems for downstream ML models" in our use of the timestamp datatype. Harshal told us our choice to export real-time features using a timestamp datatype was going to cause problems in downstream models. The training data that comes from our offline feature store uses the binary Avro file format, which has a logical type called timestamp we used to store these features, with an annotation specifying that they should be at the millisecond precision. The problem, we were being informed, is that this Avro logical type would be interpreted differently in different frameworks. Pandas, NumPy, and Spark would read our timestamps, served with millisecond precision, as datetime objects with nanosecond precision - creating the possibility of a training/serving skew. In order to prevent mismatches, and the risk they posed of silent failures in production, the recommendation was that we avoid the timestamp type entirely and serve our features as a more basic numeric data type, such as Longs. Getting to the root of the issue We started the way software engineers usually do, attempting to break down the problem and get to root causes. Before changing data types, we wanted to understand if the misinterpretation of the precision of the timestamp was an issue with Python, Spark, or even a misuse of the Avro timestamp annotation that we were using to specify the millisecond precision. We were hesitant to alter the data type of the feature without an in-depth investigation. After all, timestamp and datetime objects are typically passed around between systems precisely to resolve inconsistencies and improve communication. We started by attempting to put together a diagram of all the different ways that timestamp features were represented across our systems. The result was a diagram like this: Figure 2. A diagram of all the objects/interpretations of timestamp features across our systems. Though the user only ever sees microseconds, between system domains we see a diversity of representations. While it was clear Spark and other frameworks weren’t respecting the timestamp annotation that specified millisecond precision, we began to realize that that problem was actually a symptom of a larger issue for our ML practitioners. Timestamp features can take a number of different forms before finally being passed into a model. In itself this isn't really surprising. Every type is language-specific in one way or another - the diagram would look similar if we were going to be serializing integers in Scala and deserializing integers in Python. 
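To make the precision hazard concrete, here is a small standard-library sketch (with a made-up epoch value) of how the same integer reads very differently depending on which unit a framework assumes:

```python
from datetime import datetime, timezone

# Hypothetical feature value: an epoch timestamp stored with millisecond precision.
ts_ms = 1_697_040_000_123

# Interpreted correctly as milliseconds:
as_ms = datetime.fromtimestamp(ts_ms / 1_000, tz=timezone.utc)
print(as_ms)  # 2023-10-11 16:00:00.123000+00:00

# Silently interpreted as nanoseconds (divide by 1e9 to get seconds):
as_ns = datetime.fromtimestamp(ts_ms / 1_000_000_000, tz=timezone.utc)
print(as_ns)  # 1970-01-01 00:28:17.040000+00:00 -- wildly wrong, and no error raised
```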
However, the overall disparity between objects is much greater for complex datetime objects than it is for basic data types. There is simply more room for interpretation with datetime objects, and less certainty about how they translate across system boundaries, and for our use case of training ML models, uncertainty was exactly what we did not want. As we dug deeper into the question, it started to become clear that we weren't trying to resolve a specific bug or issue, but to reduce the amount of toil for ML practitioners who would be consuming timestamp features long-term. While the ISO-8601 format is massively helpful for sharing datetime and timestamp objects across different systems, it's less helpful when all you're looking for is an integer representation at a specific precision. Since these timestamps were features of a machine learning model, we didn't need all the complexity that datetime objects and timestamp types offered across systems. The use case for this information was to be fed as an integer of a specific precision into an ML model, and nothing more. Storing timestamps as logical types increased cognitive overhead for ML practitioners and introduced additional risk that training with the wrong precision could degrade model quality during inference. Takeaways This small request bubbled into a much larger discussion during one of our organization's architecture working group meetings. Although folks were initially hesitant to change the type of these features, by the end of the meeting there was a broad consensus that it would be desirable to represent datetime features in our system as a primitive numeric type (Unix timestamps with millisecond precision) to promote consistency between model training and inference. Given the wide range of training contexts that all of our features are used in, we decided it was a good idea to promote consistency between training and inference by standardizing on primitive types more generally. Members of the Feature Systems team also expressed a desire to improve documentation around how features are transformed end-to-end throughout the current system to make things easier for customers in the future. We designed our ML features with abstraction and interoperability in mind, as software engineers do. It's not that ML isn't a software engineering practice, but that it's a domain in which the specific needs of ML software didn't match our mental model of best practices for the system. Although ML has been around for some time, the rapidly-changing nature of the space means the nuances of many ML-specific guidelines are still ill-defined. I imagine this small cross-section of difficulty applying software practices to ML practices will be the first of many as ML continues its trajectory through software systems of all shapes and sizes.
Etsy announced the arrival of a powerful new image-based discovery tool on Etsy’s mobile apps. The ability to search by image gives buyers the opportunity to search the Etsy marketplace using their own photos as a reference. Tap the camera icon in the search bar to take a picture, and in a fraction of a second we’ll surface visually similar results from our inventory of nearly 100 million listings. Searching by image is being rapidly adopted throughout the e-commerce industry, and nowhere does it make more sense than on Etsy, where the uniqueness of our sellers’ creations can’t always be expressed easily with words. In this post we’ll give you a look at the machine-learning architecture behind our search by image feature and the work we did to evolve it. Overview In order to search a dataset of images using another image as the query, we first need to convert all those images into a searchable representation. Such a representation is called an embedding, which is a dense vector existing in some relatively low n-dimensional shared space. Once we have the embedding of our query image, and given the precomputed embeddings for our dataset of listing images, we can use any geometric distance metric to look up the closest set of listings to our query. This type of search algorithm is often referred to as a nearest-neighbor search. Figure 1. A plot that shows embeddings from a random sample of a thousand Etsy images. The embeddings have been reduced to three dimensions so that they can be plotted. In this embedding space, bags and purses are embedded near each other. Separate clusters form around craft supply images on the left and home goods images on the right. At a high level, the visual retrieval system works by using a machine learning model to convert every listing’s image into an embedding. The embeddings are then indexed into an approximate nearest-neighbor (ANN) system which scores a query image for similarity against Etsy's image embeddings in a matter of milliseconds. Multitask Vision Model To convert images to embeddings we use a convolutional neural network (CNN) that has been trained on Etsy data. We can break our approach into three components: the model architecture, the learning objective, and the dataset. Model Architecture Training the entire CNN from scratch can be costly. It is also unnecessary as the early layers of a pretrained CNN can be shared and reused across new model tasks. We leveraged a pre-trained model and applied transfer learning to fine-tune it on Etsy data. Our approach was to download pre-trained weights into the model and replace the “head” of the model with one for our specific task. During training, we then “freeze” most of the pre-trained weights, and only optimize those for the new classification head as well as for the last few layers of the CNN. The particular pre-trained model we used is called EfficientNet: a family of CNNs that have been tuned in terms of width, depth, and resolution, all to achieve optimal tradeoffs between accuracy and efficiency. Learning Objective A proven approach to learning useful embeddings is to train a model on a classification task as a proxy. Then, at prediction time, extracting the penultimate layer just before the classification head produces an embedding instead of a classification probability. Our first attempt at learning image embeddings was to train a model to categorize product images. Not surprisingly, our tests showed that these embeddings were particularly useful in surfacing listings from the same taxonomy. 
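Stepping back to the architecture for a moment, the frozen-backbone setup described above might look roughly like this in Keras. The head, input size, and hyperparameters here are hypothetical; the production model also fine-tunes the last few backbone layers rather than freezing everything.

```python
import tensorflow as tf

NUM_CATEGORIES = 1000  # hypothetical number of taxonomy classes

# Pretrained EfficientNet backbone with the original classification head removed.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg"
)
backbone.trainable = False  # freeze the pre-trained weights; only the new head is optimized here

inputs = tf.keras.Input(shape=(224, 224, 3))
embedding = backbone(inputs)  # pooled feature vector, extracted later as the image embedding
outputs = tf.keras.layers.Dense(NUM_CATEGORIES, activation="softmax", name="taxonomy")(embedding)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```

At prediction time, the penultimate (pooled) layer output is used as the embedding instead of the softmax probabilities, as described above.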
Often though the results were not “visually cohesive”: items were surfaced that didn't match well with the query image in color, material or pattern. To mitigate this problem we switched to a deep metric learning approach utilizing triplet loss. In this approach, the model is trained on triplets of examples where each triplet consists of an anchor, a positive example, and a negative example. After generating an embedding for each of the three examples, the triplet loss function tries to push the anchor and positive examples closer together, while pushing the negative example farther away. In our case, we used pairs of images from the same item as the anchor and positive examples, and an image from a different listing as the negative example. With these triplet embeddings, tests showed that our listings were now visually cohesive, displaying similar colors and patterns. Figure 2. The top row shows a query image. Middle row is a sample of nearest neighbors from image classification learned embeddings. Bottom row is a sample of nearest neighbors from triplet embeddings. The triplet embeddings show improved color and pattern consistency over image classification. But these embeddings lacked categorical accuracy compared to the classification approach. And the training metrics for the triplet approach offered less observability, which made it harder for us to assess the model's learning progress than with classification. Figure 3. While triplet metrics mostly revolve around the change in distances between the anchor and positive/negative examples, classification provides accuracy metrics that can be a proxy for how well the model fares in the task’s specific domain, and are simpler to reason about Taxonomy is not something we can tolerate our model being sloppy about. Since classification had already proven its ability to retrieve items of the same type, we decided to see if a multitask classification approach could be made to produce visually consistent results. Instead of having a single classification head on top of the pre-trained model, we attach separate heads for multiple categorization tasks: Item category (e.g. accessories, home & living), fine-grained item category (belt buckles, dining chairs), primary color, and other item attributes. Loss and training/evaluation metrics are then computed for each task individually while the embedding weights are shared across all of them. One challenge we faced in implementing this approach was that some optional seller input attributes, such as item color and material, can be sparse. The more tasks we added, the harder it was for us to sample training data equally across classes. To overcome this limitation we implemented a data sampler that reads from entirely disjoint datasets, one for each task, and each with its own unique set of labels. At training time, the sampler combines an equal number of examples from each dataset into every minibatch. All examples then pass through the model, but loss from each classification head is calculated only for examples from the head’s corresponding source dataset. Figure 4. Multitask learning architecture Returning to the classification paradigm meant that we could once again rely on accuracy metrics as a proxy for gauging and comparing models’ understanding of the domain of each task. This greatly simplifies the process of iterating and improving the model. Dataset The embeddings produced by multitask classification were now encapsulating more information about the visual attributes we added tasks for. 
This meant that when we searched using an embedding of some Etsy listing image, our results were both categorically and visually cohesive. However, when we talk about search by image, we're not expecting users to search from Etsy listing photos - the whole point of this is that the user is holding a camera. Photos uploaded by sellers are typically high quality, have professional lighting, and are taken over a white or premeditated background. But photos from a user's phone might be blurry, or poorly lit, or on a diversity of backgrounds that distract from the object the user is searching for. Deep learning is a powerful and useful tool, but training deep learning models is highly susceptible to biases in the data distribution, and training on seller-provided product images was biasing us away from user photos. Fortunately, Etsy allows users to post reviews of items they’ve purchased, and those reviews can have photos attached: photos taken by buyers, often with their phone cameras, very much like the images we expect to see used when searching by image. So we plugged in an additional classification task using the dataset of review photos, expanding the distribution of images our model learns about in training. And indeed with this new component in place we saw significant improvement in the model’s ability to surface visually relevant results. Figure 5. Multitask learning architecture with the added review photos dataset and classification head Inference Pipeline and Serving Our inference pipeline is an orchestrated ensemble of data processing jobs that turns the entire Etsy inventory of nearly 100M active listings into a searchable framework. We construct an approximate nearest neighbor (ANN) index using an inverted file (IVF) algorithm. The IVF algorithm divides the embedding space into clusters of listings. Later, at query time, we only look at the nearest subset of clusters to the query embedding, which greatly reduces search latency while only marginally impairing accuracy. Figure 6. DAG tasks to generate ANN index While the listing images are indexed in batch offline, the query photo is taken by users on the fly, so we have to infer those in real time - and fast. Due to the size of CNN models, it can take a long time to inference on a CPU. To overcome this hurdle we partnered with Etsy’s ML platform team to bring the first use case of real-time GPU inferencing at Etsy. Figure 7. Request flow We hope this feature gives even more of our buyers a new way to find exactly what they’re looking for on Etsy. So, the next time you come across something you love, snap a photo and search by image on Etsy! Acknowledgements This project started as part of CodeMosaic, Etsy’s annual “hackathon” week, where engineers can practice new skills and drive projects not necessarily related to their day-to-day work. We’re proud that we were able to take a proof-of-concept hackathon project, and turn it into a production feature to help make the millions of unique and special items on Etsy more discoverable for buyers. In particular, we’d like to thank the App Core Experience team for taking a chance and prioritizing this project. We could not have done it without the buy-in of leadership and help from our engineering enablement teams. More on this in our next article!
There are more than 100 million unique listings on Etsy, so we provide buyers recommendations to help them find that one special item that stands out to them. Recommendations are ubiquitous across Etsy, tailored for different stages of a user's shopping mission. We call each recommendation set a module, and there are hundreds of them both on the web and on mobile apps. These help users find trending items, pick up shopping from where they left off, or discover new content and interests based on their prior activity. Modules in an enterprise-scale recommendation system usually work in two phases: candidate set selection and candidate set ranking. In the candidate set selection phase, the objective is to retrieve a small set of relevant items out of the entire inventory, as quickly as possible. The second phase then ranks the items in the candidate set using a more sophisticated machine learning model, typically with an emphasis on the user's current shopping mission, and decides on the best few items to offer as recommendations. We call these models rankers and that's the focus of this post. Figure 1. Two recommendation modules sharing the same item page. The "more from this shop" module recommends similar items from the shop the user is currently looking at; "you may also like" finds relevant items from across Etsy shops. Rankers score candidate items for relevance based on both contextual attributes, such as the user's recent purchases and most clicked categories, as well as item attributes, such as an item's title and taxonomy. (In machine learning we refer to such attributes as features.) Rankers are optimized against a specific user engagement metric–clickthrough rate, for example, or conversion rate–and trained on users' interactions with the recommendation module, which is known as implicit feedback. Etsy has historically powered its recommendation modules on a one-to-one basis: one ranker for each module, trained exclusively on data collected from that module. This approach made it easy to recommend relevant items for different business purposes, but as we got into the hundreds of modules it became burdensome. On the engineering side, the cost of maintaining and iterating on so many rankers, running hundreds of daily pipelines in the process, is prohibitive. And as it becomes harder to iterate, we could lose opportunities to incorporate new features and best practices in our rankers. Without a solution, eventually the quality of our recommendations could degrade and actually do harm to the user experience. To address this potential problem, we pivoted to what we call canonical rankers. As with single-purpose rankers, these are models optimized for a particular user-engagement metric, but the intention is to train them so that they can power multiple modules. We expect these rankers to perform at least on par with module-specific rankers, while at the same time being more efficient computationally, and less costly to train and maintain. A Canonical Frequency Ranker We want Etsy to be not just an occasional stop for our users but a go-to destination, and that means paying attention to what will inspire future shopping missions after a user is finished with their current one. Our first canonical ranker was focused on visit frequency. 
We wanted to be able to identify latent user interests and surface recommendations that could expose a user to the breadth of inventory on Etsy at moments that might impact a decision to return to the site: for example, showing them complementary items right after their purchase, to suggest a new shopping journey ahead. Data and goal of the model Compared to metrics like conversion rate, revisit frequency is difficult to optimize for: there are no direct and immediate signals within a given visit session to indicate that a buyer is likely to return. There are, however, a multitude of ways for an Etsy user to interact with one of our items: they can click it, favorite it, add it to a collection or to a shopping cart, and of course they can purchase it. Of all of these, data analysis suggests that favoriting is most closely related to a user's intention to come back to the site, so we decided that our ranker would optimize on favorite rate as the best available surrogate for revisit frequency. Favoriting doesn't always follow the same pattern as purchasing, though. And we needed to be wary of the possibility that favorites-based recommendations, not being closely enough related to what a user wanted to buy in their current session, might create a distraction and could actually jeopardize sales. It would be important for us to keep an eye on purchase rate as we developed our frequency ranker. We also had to find appropriate modules to provide training data for the new ranker. There are a lot of them, on both mobile and web, and they appear in a lot of different user contexts. Most mobile app users are signed in, for example, while a large proportion of desktop users aren't. Interaction patterns are different for different modules: some users land on an item page via external search, or from a Google ad, and they tend to look for specific items, whereas habitual mobile app users are more exploratory. It was important for us that the few modules we trained the frequency ranker on should be as representative as possible of the data generated by these many different modules occurring on many different pages and platforms. We wanted to be confident that our ranker would really be canonical, and able to generalize from its training set to power a much wider range of modules. Model structure The requirement that our ranker should optimize for favorites, while at the same time not reducing purchase rate, naturally lent itself to a multi-task learning framework. For a given item we want to predict both the probability of the item being favorited and that of it being purchased. The two scores are then combined to produce the final ranking score. This sort of framework is not directly supported by the tree-based models that have been powering Etsy's recommendations in the past. However, neural models have many advantages over tree-based models and one of them is their ability to handle multi-task architectures. So it was a natural call to build our frequency ranker on a neural model. Favoriting and purchasing are obviously not completely unrelated tasks, so it is reasonable to assume that they share some common factors. This assumption suggests a shared-bottom structure, as illustrated on the right in the figure below: what the tasks have in common is expressed in shared layers at the bottom of the network, and they diverge into separate layers towards the top. A challenge that arises is to balance the two tasks. 
Both favorites and purchases are positive events, so sample weights can be assigned in both the loss function computation and the final score step. We devoted a lot of manual effort to finding the optimal weights for each task, and this structure became our first milestone. The simple shared-bottom structure proved to be an efficient benchmark model. To improve the model's performance, we moved forward with adding an expert layer following the Multi-gate Mixture Model of Experts (MMOE) framework proposed by Zhao et al. In this framework, favorites and purchases have more flexibility to learn different representations from the embeddings, which leads to more relevant recommendations with little extra computation cost. Figure 2. Model input and structure. The table on the left illustrates how training data is formatted, with samples from multiple modules concatenated and passed in to the ranker. On the right, a workflow that describes the multi-task model structure. A module name is appended on each layer. Building a canonical ranker In addition to using data from multiple modules, we also took training data and model structure into account when developing the canonical ranker. We did offline tests of how a naive multi-task model performed on eight different modules, where the training data was extracted from only a subset of them. Performance varied a lot, and on several modules we could not achieve parity against the existing production ranker. As expected, modules that were not included in the training data had the worst performance. We also observed that adding features or implementing an architectural change often led to opposite results on different modules. We took several steps to address these problems: We added a feature, module_name, representing which module each sample comes from. Given how much patterns can vary across different modules, we believe this is a critical feature, and we manually stack it in each layer of the neural net. It’s possible that the module_name passed to the ranker during inference is not one that it saw during training. (Remember that we only train the model on data from a subset of modules.) We account for this by randomly sampling 10% of the training data and replacing the module_name feature with a dummy name, which we can use in inference to cover that training gap when it occurs. User segments distribution and user behaviors vary a lot across modules, so it’s important to keep a balance of training data across different user segments. We account for this during training data sampling. We assign different weights to different interaction types, (e.g., impressions, clicks, favorites, and purchases), and the weights may vary depending on module. The intuition is that the correlation between interactions may be different across modules. For example, click may show the same pattern as favorite on module X, but a different pattern on module Y. To help ensure that the canonical ranker can perform well on all modules, we carefully tune the weights to achieve a balance. Launching experiments After several iterations, including pruning features to meet latency requirements and standardizing the front-end code that powers recommendation modules, in Q2 of 2022 we launched the first milestone ranker on an item page module and a homepage module. We observed as much as a 12.5% improvement on module-based favorite NDCG and significant improvements on the Etsy-wide favorite rate. 
And though our main concern was simply to not negatively impact purchases, we were pleased to observe significant improvements on purchase metrics as well as other engagement metrics. We also launched experiments to test our ranker on a few other modules, whose data are not used during training, and have observed that our ranker outperformed the module-specific rankers in production. These observations suggest that the model is in fact a successful canonical ranker. In Q3, we launched our second milestone model, which proved to be better than the first one and improved engagement even further. As of now, the ranker is powering multiple modules on both web and app, and we anticipate that it will be applied in more places. For machine learning at Etsy, the frequency ranker marks a paradigm shift in how we build recommendations. From the buyer's perspective, not only does the ranker provide more personalized recommendations, but employing the same ranker across multiple modules and platforms also helps guarantee a more consistent user experience. Moving forward, we’ll continue iterating on this ranker to improve our target metrics, making the ranker more contextual and testing other novel model architectures. Acknowledgements Thanks Davis Kim and Litao Xu for engineering support of the work and Murium Iqbal for internal review of this post. Special thanks to folks in recs-platform, feature-system, ml-platform and search-ranking.
Machine learning (ML) model deployment is one of the most common topics of discussion in the industry. That’s because deployment represents a meeting of two related but dramatically different domains, ML practice and software development. ML work is experimental: practitioners iterate on model features and parameters, and tune various aspects of their models to achieve a desired performance. The work demands flexibility and a readiness to change. But when they’re deployed, models become software: they become subject to the rigorous engineering constraints that govern the workings of production systems. The process can frequently be slow and awkward, and the interfaces through which we turn models into deployed software are something we devote a lot of attention to, looking for ways to save time and reduce risk. At Etsy, we’ve been developing ML deployment tools since 2017. Barista, the ML Model Serving team’s flagship product, manages lifecycles for all types of models - from Recommendations and Vision to Ads and Search. The Barista interface has evolved dramatically alongside the significant evolution in the scope and range of our ML practice. In this post we’ll walk you through the story of where we started with this interface, where we ended up, and where we intend to keep going. Arc 1: Managing Deployed Model Configs as Code Like many companies deploying ML models to production, Etsy’s ML platform team uses Kubernetes to help scale and orchestrate our services. At its core, Barista itself is a Python+Flask-based application that utilizes the Kubernetes Python API to create all the Kubernetes objects necessary for a scalable ML deployment (Deployment, Service, Ingress, Horizontal Pod Autoscaler, etc.). Barista takes a model deployment configuration specified by users and performs CRUD operations on model deployments in our Kubernetes cluster. The Barista UI in 2020 In the initial version of Barista, the configurations for these deployments were managed as code, via a simple, table-based, read-only UI. Tight coupling of configuration with code is typically frowned upon, but on the scale of our 2017-era ML practice it made a lot of sense. Submitting config changes to the Barista codebase as PRs, which required review and approval, made it easy for us to oversee and audit those changes. It was no real inconvenience for us to require our relatively small corps of ML practitioners to be capable of adding valid configurations to the codebase, especially if it meant we always knew the who, what, and when of any action that affected production. Our old Python file where infrastructure configurations for model deployments were stored Updating configurations in this tightly coupled system required rebuilding and deploying the entire Barista codebase. That process took 20-30 minutes to complete, and it could be further blocked by unrelated code errors or by issues in the build pipeline itself. As ML efforts at Etsy began to ramp up and our team grew, we found ourselves working with an increasing number of model configurations, hundreds of them ultimately, defined in a single large Python file thousands of lines long. With more ML practitioners making more frequent changes, a build process that had been merely time-consuming was becoming a bottleneck. And that meant we were starting to lose the advantage in visibility that had justified the configuration-as-code model for us in the first place. Arc 2: Decoupling Configs with a Database By 2021, we knew we had to make changes. 
Working with Kubernetes was becoming particularly worrisome. We had no safe way to quickly change Kubernetes settings on models in production. Small changes like raising the minimum replicas or CPU requests of a single deployment required users to push Barista code, seek PR approval, then merge and push that code to production. Even though in an emergency ML platform team members could use tools like kubectl or the Google Cloud Console to directly edit deployment manifests, that didn't make for a scalable or secure practice. And in general, supporting our change process was costing us significant developer hours. So we decoupled. We designed a simple CRUD workflow backed by a CloudSQL database that would allow users to make instantaneous changes through a Barista-provided CLI. Early architecture diagrams from the design doc The new system gave us a huge boost in productivity. ML practitioners no longer had to change configs in code, or develop in our codebases at all. They could perform simple CRUD operations against our API that were DB-backed and reflected by the deployments in our cluster. By appropriately storing both the live configuration of models and an audit log of operations performed against production model settings, we maintained the auditability we had in the PR review process and unblocked our ML practitioners to deploy and manage models faster. We designed the CLI to be simple to use, but it still required a certain degree of developer setup and acumen that was inconvenient for many of our ML practitioners. Even simple CLIs have their quirks, and they can be intimidating to people who don't routinely work on the command line. Increasingly the platform team was being called on to help with understanding and running CLI commands and fixing bash issues. It began to look as if we'd traded one support burden for another, and might see our productivity gains start to erode Arc 3: A UI Built Mostly by Non-UI People We’d always had an API, and now we had a database backing that API. And we had the command line: but what we needed, if we wanted wide adoption of our platform across an increasing user base, was a product. A purpose-built, user-friendly Web interface atop our API would let ML practitioners manage their model deployments directly from the browser, making CLI support unnecessary, and could compound the hours we'd saved moving to the CRUD workflow. So, in the summer of 2021 we started building it. The vaunted Barista web app Now this is certainly not the most aesthetic web app ever built – none of us on the ML platform team are front-end developers. What it is, though, is a fairly robust and complete tool for managing ML deployments on Kubernetes. We've given our users the ability to update anything to do with their models, from application settings to model artifacts to Kubernetes settings like CPU resources or replicas. The UI provides useful integrations to the Google Kubernetes Engine (GKE) console to render information about Pods in a deployment, and even integrates with the API of our cost tool so practitioners can understand the cost of serving their models live in production. In 2017, Barista was an HTML table and Python config file. Now, in 2023, it’s fully functioning web interface that integrates with multiple internal and third-party APIs to render useful information about models, and gives users complete and immediate control over their model deployments. 
Small changes that might have taken hours can happen in seconds now, unblocking users and reducing the toll of unnecessary workflows. Arc 4: Security and Cost The Barista UI made it so much easier to serve models at Etsy that it helped drive up the rate of experimentation and the number of live models. Suddenly, over the course of a few months, Barista was seeing hundreds of models both in production and on dev servers. And while we were thrilled about the product’s success, it also raised some concerns: specifically, that the simplicity and user-friendliness of the process might lead to spikes in cloud cost and an increase in misconfigurations. Serving a model on Barista accrues cost from Google Cloud, Etsy’s cloud computing service partner. This cost can range anywhere from a few hundred to thousands of dollars per month, depending on the model. While the costs were justified in production, in the development system we were seeing high daily CPU costs with relatively low usage of the environment, which was something that needed a remedy. Unfortunately, by default the Kubernetes Horizontal Pod Autoscaler, which we were using to manage our replicas, doesn't let you scale down below 1. With the increased flow of models through Barista, it became harder for ML practitioners to remember to remove resources when they were no longer needed–and unless a deployment was completely deleted and recreated, we were going to keep incurring costs for it. To mitigate the issue we added the Kube Downscaler to our development cluster. This allowed us to scale deployment replicas to zero both off-hours and on the weekends, saving us about $4k per week. We still had deployments sitting around unused on weekdays, though, so we decided to build Kube Downscaler functionality directly into Barista. This is a safety feature that pauses model deployments: by automatically scaling models in development to zero replicas after three hours, or on user demand. We're now seeing savings of up to 82% in dev cost during periods of inactivity. And we've avoided the runaway cost scenario (where models are not scaled down after test), which could have resulted in annualized costs in excess of $500k. How to use the Barista UI From a certain angle, this has mostly been an article about how it took us three years to write a pretty standard web app. The app isn’t the point, though. The story is really about paying attention to the needs of our ML users and to the scale of their work within Etsy, and above all about resisting premature optimization. Over the years we’ve moved from meeting the basic technical requirements of an ML platform to building a complete product for it. But we never aimed at a product, not till it became necessary, and so we were able to avoid getting ahead of ourselves. The goals of the ML platform team have broadened now that Barista is where it is. We continue to try to enable faster experimentation and easier deployment. The consensus of our ML practitioners is that they have the tools they need, but that our greatest area of opportunity is to improve in the cohesiveness of our suite of services - most importantly, automation. And that’s where we’re investing now: in tooling to provide optimal infrastructure settings for model deployments, for instance, so we can reduce tuning time in serving them and further minimize our cloud costs. Our ML practice continues to grow, both in number of models and team size, and our platform is going to have to continue to grow to keep up with them.
The question of documentation—not just formats and standards, but the tools and processes that can make documenting code a normal and repeatable part of development workflows—has been a live one in the software industry for a long time. The docs-as-code approach has emerged as a way of integrating documentation into development on the principle that the same sets of tools and procedures should be used for both. This might mean, for instance, versioning documentation the same way code is versioned, formatting it using plain-text markup like Markdown or reStructuredText, and employing automation tools to build and deploy documentation. At Etsy, we've developed an internal tool called Docsbuilder to implement our own docs-as-code approach. It's our primary tool for maintaining and discovering software documentation at Etsy, and in this post we'll give you an overview of it. What is Docs-as-code? Docs-as-code aims to manage software documentation with the same level of rigor and attention typically brought to managing software source code. The goal of the methodology is to make the work of authoring, versioning, reviewing and publishing documentation smoother, more efficient, more repeatable and reliable. The main principles of docs-as-code are: Documentation is a first-class citizen in the software development process As such, it should be versioned and stored in source control repositories such as Git Documentation should be written in plain-text formats like Markdown, to support versioning and collaboration Documentation should be reviewed and tested before being published The documentation workflow should be automated where possible to improve the reliability and user experience of publishing sites Fundamentally, docs-as-code encourages an approach that balances documentation writing with coding. Developers who tag and release documentation the same way they do source code, with the same tools and procedures, will tend to be more organized and structured about handling their documentation. And the documentation itself becomes easier to audit for quality, and to update and maintain. Docs-as-code at Etsy Etsy's Docsbuilder tool is a collection of bin scripts, pipelines and processes that let developers create and maintain software documentation. Docsbuilder is Markdown-based; we use Docusaurus, a Facebook-created tool for generating project documentation websites, to convert Markdown files to HTML. Each project team at Etsy owns its own Docsbuilder repository, a mix of Markdown files, configurations and NPM dependencies, and we use a secure GitOps-based workflow to review and merge changes to these repositories. To simplify the process of creating new sites, we developed a "docsbuilder-create" bin script. The script checks to make sure a supplied site name is valid and unique, then it bootstraps a Docusaurus site and installs all necessary Etsy-specific plugins. Once the site is created, users can push documentation to their Git branch and open a pull request. To validate and test the PR, we use Google Cloud Build to clone a preconfigured Node.js container, which builds the entire documentation website and runs some integration tests to make sure the site is working properly. When the tests are completed and the PR is approved, we merge the code into the main branch of the repository and another Cloud Build job then builds and deploys the site to production. 
To make it easier for users within Etsy to find documentation, we developed a docs search engine that scans and indexes all Docsbuilder sites and all available pages. To simplify navigation, a React component displays a list of all important and frequently used sites on the homepage of our documentation hub. In total, we currently host about 6.2k pages on 150 Docsbuilder project sites. What’s Next Discoverability and ease of use are important factors in making documentation more available, and that’s an area where we’re aiming to improve. Better links between doc sites and related topics will surface documentation better, as will improvements to the docs search engine. With the same idea in mind, we want to focus on creating more organized and navigable documentation pages. And there’s no reason not to make the content itself more engaging: we’re looking into adding more support for screenshots and other imagery, and integrating with React plugins to draw diagrams (e.g. flowcharts) and other visuals that can illustrate key concepts and ideas.
Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. This is part 3 in our series on Sharding Payments with Vitess. In the first post, we focused on challenges in the application and data model. In part two, we discussed the challenges of cutting over a crucial high traffic system, and in this final post we will discuss different classes of errors that might crop up when cutting over traffic from an unsharded keyspace to a sharded keyspace. As an engineer on Etsy’s Data Access Platform team, my role bridges the gap between product engineering and infrastructure. My team builds and maintains Etsy’s in-house ORM, and we are experts in both software engineering and the database software that our code relies on. Myself and members of my team have upstreamed over 30 patches to Vitess main. We also maintain an internal fork of Vitess, containing a handful of patches necessary to adapt Vitess to the specifics of our infrastructure. In our sharding project, my role was to ensure that the queries our ORM generates would be compatible with vindexes and to ensure that we configured Vitess correctly to preserve all of the data guarantees that our application expects. Cutover risks Vitess does an excellent job of abstracting the concept of sharding. At a high level, Vitess allows your application to interact with a sharded keyspace much the same as it would with an unsharded one. However, in our experience sharding with Vitess, we found a few classes of errors in which things could break in new ways after cutting over to the sharded keyspace. Below, we will discuss these classes of errors and how to reduce the risk of encountering them. Transaction mode errors When sharding data with Vitess, your choice of transaction_mode is important, to the extent that you care about atomicity. The transaction_mode setting is a VTGate flag. The default value is multi, which allows for transactions to span multiple shards. The documentation notes that “partial commits are possible” when using multi. In particular, when a transaction fails in multi mode, writes may still be persisted on one or more shards. Thus, the usual atomicity guarantees of database transactions change significantly: typically when transactions fail, we expect no writes to be persisted. Using Vitess’s twopc (two-phase commit) transaction mode solves this atomicity problem, but as Vitess documentation notes, “2PC is an experimental feature and is likely not robust enough to be considered production-ready.” “Transaction commit is much slower when using 2PC," say the docs. "The authors of Vitess recommend that you design your VSchema so that cross-shard updates (and 2PC) are not required.” As such, we did not seriously consider using it. Given the shortcomings of the multi and twopc transaction modes, we opted for single transaction mode instead. With single mode, all of the usual transactional guarantees are maintained, but if you attempt to query more than one shard within a database transaction, the transaction will fail with an error. 
We decided that the semantics of single as compared with multi would be easier to communicate to product developers who are not database experts, less surprising, and provide more useful guarantees. Using the single transaction mode is not all fun and games, however. There are times when it can seem to get in the way. For instance, a single UPDATE statement which touches records on multiple shards will fail in single mode, even if the UPDATE statement is run with autocommit. While this is understandable, sometimes it is useful to have an escape hatch. Vitess provides one, with an API for changing the transaction mode per connection at runtime via SET transaction_mode=? statements. Your choice of transaction mode is not very important when using Vitess with an unsharded keyspace – since there is only a single database, it would be impossible for a transaction to span multiple databases. In other words, all queries will satisfy the single transaction mode in an unsharded keyspace. But when using Vitess with a sharded keyspace, the choice of transaction mode becomes relevant, and your transactions could start failing if they issue queries to multiple shards. To minimize our chances of getting a flood of transaction mode errors at our cutover, we exhaustively audited all of the callsites in our codebase that used database transactions. We logged the SQL statements that were executed and pored over them manually to determine which shards they would be routed to. Luckily, our codebase uses transactions relatively infrequently, since we can get by with autocommit in most cases. Still it was a painstaking process. In the end, our hard work paid off: we did not observe any transaction mode-related errors at our production cutover. Reverse VReplication breaking Pre-cutover, VReplication copies data from the original unsharded keyspace to the new sharded keyspace, to keep the data in sync. Post-cutover, VReplication switches directions: it copies data back from the new sharded keyspace to the original unsharded keyspace. We’ll refer to this as reverse VReplication. This ensures that if the cutover needs to be reversed, the original keyspace is kept in sync with any writes that were sent to the sharded keyspace. If reverse VReplication breaks, a reversal of the cutover may not be possible. Reverse VReplication broke several times in our development environment due to enforcement of MySQL unique keys. In an unsharded keyspace, a unique key in your MySQL schema enforces global uniqueness. In a sharded keyspace, unique keys can only enforce per-shard uniqueness. It is perfectly possible for two shards to share the same unique key for a given row. When reverse VReplication attempts to write such rows back to the unsharded database, one of those writes will fail, and reverse VReplication will grind to a halt. This form of broken VReplication can be fixed in one of two ways: Delete the row corresponding to the write that succeeded in the unsharded keyspace. This will allow the subsequent row to reverse vreplicate without violating the unique-key constraint. Skip the problematic row by manually updating the Pos column in Vitess’s internal _vt.vreplication table. It was important to us that reverse VReplication be rock solid for our production cutover. We didn’t want to be in a situation where we would be unable to reverse the cutover in the event of an unexpected problem. Thus, before our production cutover, we created alerts that would page us if reverse VReplication broke. 
Furthermore, we had a runbook that we would use to fix any issues with reverse VReplication. In the end, reverse VReplication never broke in production. The reason it broke in our development environment, it turned out, was due to a workflow specific to that environment. As an aside, we later discovered that Vitess does in fact provide a mechanism for enforcing global uniqueness on a column in a sharded keyspace. Scatter queries In a sharded keyspace, if you forget to include the sharding key (or another vindexed column) in your query’s WHERE clause, Vitess will default to sending the query to all shards. This is known as a scatter query. Unfortunately, it can be easy to overlook adding the sharding key to one or more of your queries before the cutover, especially if you have a large codebase with many types of queries. The situation might only become obvious to you after cutover to the sharded cluster. If you start seeing a much higher than expected volume of queries post-cutover, scatter queries are likely the cause. See part 2 in this series of posts for an example of how we were impacted by scatter queries in one cutover attempt. Having scatter queries throw a wrench in one of our earlier cutover attempts got us thinking about how we could identify them more easily. Vitess provides several tools that can show its execution plan for a given query, including whether or not it will scatter: vtexplain, EXPLAIN FORMAT=vitess …, and EXPLAIN FORMAT=vtexplain. At the time of our earlier cutover, we had not been in the habit of regularly analyzing our queries with these tools. We used them before the next cutover attempt, though, and made sure all the accidental scatter queries got fixed. Useful as Vitess query analysis is, there is always some chance an accidental scatter query will slip through and surprise you during a production cutover. Scatter queries have a multiplicative impact: they are executed on every shard in your keyspace, and at a high enough volume can push the aggregate query volume in your sharded keyspace to be several times what you would see in an unsharded one. If query volume is sufficient to overwhelm all shards after the cutover, accidental scatter queries can result in a total keyspace outage. It seemed to us it would limit the problem if Vitess had a feature to prevent scatter queries by default. Instead of potentially taking down the entire sharded keyspace in a surge of query traffic, with a no-scatter default only accidental scatter queries would fail. At our request, PlanetScale later implemented a feature to prevent scatter queries. The feature is enabled by starting VTGate with the --no_scatter flag. Scatter queries are still allowed if a comment directive is included in the query: /*vt+ ALLOW_SCATTER */. While this feature was not yet available during our earlier cutover attempts, we have since incorporated it into our Vitess configuration. Incompatible queries Some types of SQL queries that work on unsharded keyspaces are incompatible with sharded keyspaces. Those queries can start failing after cutover. We ran into a handful of them when we tested queries in our development environment. One such class of queries were scatter queries with a GROUP BY on a VARCHAR column. 
As an example, consider a table in a sharded keyspace with the following schema: CREATE TABLE `users` ( `user_id` int(10) unsigned NOT NULL, `username` varchar(32) NOT NULL, `status` varchar(32) NOT NULL, PRIMARY KEY (`user_id`) ) Assume the users table has a primary vindex on the column user_id and no additional vindexes. On such a table in a sharded keyspace, the following query will be a scatter query: SELECT status, COUNT(*) FROM users GROUP BY status; This query failed with an error on our version of Vitess: ERROR 2013 (HY000): Lost connection to MySQL server during query. After diving into Vitess’s code to determine the cause of the error, we came up with a workaround: we could CAST the VARCHAR column to a BINARY string: > SELECT CAST(status AS BINARY) AS status, COUNT(*) FROM users GROUP BY status; +----------+----------+ | status | count(*) | +----------+----------+ | active | 10 | | inactive | 2 | +----------+----------+ We occasionally ran into edge cases like this where Vitess’s query planner did not support specific query constructs. But Vitess is constantly improving its SQL compatibility – in fact, this specific bug was fixed in later versions of Vitess. In our development environment, we exhaustively tested all queries generated by our application for compatibility with the new sharded keyspace. Thanks to this careful testing, we managed to identify and fix all incompatible queries before the production cutover. Conclusion There are several classes of errors that might only start appearing after the cutover from an unsharded keyspace to a sharded keyspace. This makes cutover a risky process. Although cutovers can generally be easily reversed, so long as reverse VReplication doesn’t break, the impact of even a short disruption can be large, depending on the importance of the data being migrated. Through careful testing before cutover, you can reduce your exposure to these classes of errors and guarantee yourself a much more peaceful cutover. This is part 3 in our series on Sharding Payments with Vitess. Check out part 1 to read about the challenges of migrating our data models and part 2 to read about the challenges of cutting over a crucial high-traffic system.
Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. This is part 2 of our series on Sharding Payments with Vitess. In the first post, we focused on making application data models available for sharding. In this part, we’ll discuss what it takes to cut over a crucial high traffic system, and part 3 will go into detail about reducing the risks that emerge during the cutover. Migrating Data Once we had chosen a sharding method, our payments data had to be redistributed to forty new shards, the data verified as complete, reads and writes switched over from existing production databases, and any additional secret sauce applied. Considering this was only the payments path for all of Etsy.com, the pressure was on to get it right. The only way to make things work would be to test, and test again. And while some of that could be done on an ad-hoc basis by individual engineers, to reorganize a system as critical as payments there is no substitute for a production-like environment. So we created a staging infrastructure where we could operationally test Vitess's internal tooling against snapshots of our production MySQL data, make whatever mistakes we needed to make, and then rebuild the environment, as often as necessary. Staging was as real as we could make it: we had a clone of the production dataset, replicas and all, and were equipped with a full destination set of forty Vitess shards. Redistributing the data in this playground, seeing how the process behaved, throwing it all away and doing it again (and again), we built confidence that we could safely wrangle the running production infrastructure. We discovered, and documented, barriers and unknowns that had to be adjusted for (trying to shard on a nullable column, for instance). We learned to use VDiff to confirm data consistency and verify performance of the tooling, we learned to run test queries through Vitess proper, checking behavior and estimating workload, and we discovered various secondary indexing methods and processes (using CreateLookupVindex and ExternalizeVindex) to overcome some of these barriers. One of the key things we found out about was making the read/write switch. Vitess's VReplication feature is the star performer here. Via the MoveTables workflow, VReplication sets up streams that replicate all writes in whichever direction you're currently operating. If you're writing to the source side, VReplication has streams to replicate those writes into the sharded destination hosts, properly distributing them to the correct shard. When you switch writes with Vitess, those VReplication streams are also reversed, now writing from the destination shards to the source database(s). Knowing we had this as an option was a significant confidence booster. If the first switch wasn't perfect, we could switch back and try again: both sides would stay consistent and in sync. Nothing would get lost in between. Top: The MoveTables workflow has VReplication setup to move existing data to the new destination shards and to propagate new writes. Bottom: After SwitchWrites the Vreplication stream reverses to the source database, keeping the origin data in sync with new writes to the sharded setup. 
Cutting Over Production VReplication paved most of the road for us to cut over production, but we did have our share of “whoa, I didn’t expect that” moments. As powerful as the Vitess workflow system is, it does have some limitations when attempting to move a substantial dataset, and that requires finding additional solutions for the same problem. We were traversing new ground here for Etsy, so even as we pushed onward with the migration we were still learning how to operate within Vitess. In one phase of the project we switched traffic and within a couple of minutes saw a massive 40x jump in query volume (from ~5k/second to ~200k/second) on the same workload. What is going on here? After some quick investigation we found that a query gathering data formerly colocated on a monolithic database now required that data from some portion of the new shard set. The problem was, Vitess didn’t know exactly which of our forty shards that data existed on, since it was being requested via a column that was not part of the sharding column. An explosion: This massive increase in query volume surprised us initially. A dive into what was generating this volume quickly revealed scatter queries were at the root of it. Enter the story's second superhero: CreateLookupVindex. While we switched traffic back and headed to our staging environment to come up with a solution, we quickly realized the 40x increase was due to Vitess using a scatter gather process, sending queries to all shards in an effort to see which of them had pieces of the data. Besides those scatter queries being inefficient, they were also bombarding the databases with requests that returned empty result sets. To combat this, we created Secondary Vindexes to tell Vitess where the data was located using a different column than the Primary Vindex. These additional Vindexes allowed the VTGates to use a lookup table to identify the shards housing the relevant data and make sure they were the only ones receiving queries. A solution: Application of Lookup Vindexes quelled the rampant scatter queries immediately, returning query volume to manageable levels. Knowing there were additional datasets that would behave similarly, we were able to find those callsites and query patterns early, and apply those Secondary Vindexes while the MoveTables workflow was in progress. This prevented the same overwhelming traffic pattern from recurring and kept the final transitions from being disruptively resource-intensive. (For more information on the scatter traffic pattern, part 3 of this series has additional details.) As we grew more comfortable with our command of the migration process, we decided that since Vitess supported multiple workflows per keyspace, we could get both the MoveTables and the Secondary Vindexes deployed simultaneously. However, there's a caveat: due to the way Secondary Vindexes are maintained by VReplication streams, the Vindexes cannot be externalized (made available to the system) until after the switching of writes is complete. These indexes work by inserting rows into a lookup table as writes are made to the owner table, keeping the lookup table current. While the MoveTables is in motion, the Secondary Vindex is doing its job inserting while the data is being sharded via the vReplication streams. And there's the rub: if you externalize the Vindex before writes are switched, there aren’t any writes being made to the destination shards, and you are going to miss all of those lookup records coming from the vReplication stream. 
Taking all this into account, we performed the switch writes, externalized the two Secondary Vindexes we were creating, and found the machines we were running on couldn’t handle the query load effectively. No problem: we'd switch the writes back to the source database. Oops! We just broke our fancy new Vindexes because CreateLookupVindex was no longer running! While it wasn’t the end of the world, it did mean we had to remove those Secondary Vindexes, remove all of their artifacts from the sharded dataset (drop the lookup tables), then rebuild them a second time. During the small window this work created we raised the specs for the destination cluster, then did the dance again. Switch writes, externalize the Vindexes, and roll forward. This time, happily, roll forward is exactly what things did. Conclusion Thanks to extensive testing on a full production copy in our staging environment, Vitess tooling being as powerful and performing as well as we expected, and meticulous precursor work on the data models, we encountered no disruption, no downtime, and no impact to normal operation throughout the course of these migrations. While there was tolerance for the idea of some disruption, the most important thing to our team was that our processes should run transparently. Completing this monumental project on a highly sensitive path without anyone noticing (inside Etsy or out) was very satisfying. This is part 2 in our series on Sharding Payments with Vitess. Check out part 1 to read about the challenges of migrating our data models, and part 3 to read about reducing cutover risks.
At the end of 2020, Etsy’s Payments databases were in urgent need of scaling. Specifically, two of our databases were no longer vertically scalable — they were using the highest resource tier offered on the Google Cloud platform. These databases were crucial to our day-to-day payment processing, so it was a high-risk situation: spikes in traffic could have led to performance issues or even loss of transactions. Our sellers depend on our payments system to get paid for their hard work, making this reliability effort even more critical. A stable and scalable payments platform provides the best possible experience for our sellers. Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. We did this work in two phases. First we migrated our seller ledger infrastructure, a contained domain that determines seller bills and payouts. For the second phase, we worked to reduce load on our primary payments database, which houses transaction data, payment data and much more. This database has been around for over a decade and hosted nearly 90 tables before the migration. To cut down the scope of the project we strategically chose to migrate just a subset of tables with the highest query volume, adding in others related to those high-volume tables as needed. In the end, even operating on just that subset, we were able to reduce load by 60%, giving us room to scale for many years down the road. Throughout this project, we encountered challenges across the stack. This is the first in a series of posts in which we’ll share how we approached those challenges, both in application and infrastructure. Here we’ll be focusing on making application data models available for sharding. In part 2, we’ll discuss what it takes to cut over a crucial high-traffic system, and part 3 will go into detail about reducing the risks that emerge during the cutover. An Ideal Sharding Model: Migrating the Ledger A data model’s shape impacts how easy it is to shard, and its resiliency when sharded. Data models that are ideal for sharding are shallow, with a single root entity that all other entities reference via foreign key. For example, here is a generic data model for a system with Users, and some data directly related to Users: This type of data model allows all tables in the domain to share a shardifier (in this example, user_id), meaning that related records are colocated. Even with Vitess handling them, cross-shard operations can be inefficient and difficult to reason about; colocated data makes it possible for operations to take place on a single shard. Colocation can also reduce how many shards a given request depends on, mitigating user impacts from a shard going down or having performance issues. Etsy's payments ledger, our first sharding migration, was close to this ideal shape. Each Etsy shop has a ledger of payments activity, and all entities in this domain could be related to a single root Ledger entity. The business rule that maintains one ledger to a shop means that ledger_id and shop_id would both be good candidates for a shardifier. Both would keep all shop-related data on a single shard, isolating shard outages to a minimal number of shops. 
We went with shop_id because it's already in use as a shardifier elsewhere at Etsy, and we wanted to stay consistent. It also future-proofs us in case we ever want to allow a shop to have multiple ledgers. This "ideal" model may have been conceptually simple, but migrating it was no small task. We needed to add a shop_id column to tables and to modify constraints such as primary keys, unique keys, and other indexes, all while the database was resource-constrained. Then we had to backfill values to billions of existing rows, while–again–the database was resource-constrained. (We came up with a real-time tunable script that could backfill at up to 60x faster using INSERT … ON DUPLICATE KEY UPDATE statements.) And when we had our new shardifier in place there were hundreds of existing queries to update with it, so Vitess would know how to route to the appropriate shard, and hundreds of tests whose test fixture data had to be updated. Non-ideal Models: Sharding Payments For our second phase of work, which would reduce load on our primary payments database, the data model was less straightforward. The tables we were migrating have grown and evolved over a decade-plus of changing requirements, new features, changes in technology and tight deadlines. As such, the data model was not a simple hierarchy that could lend itself to a standard Etsy sharding scheme. Here’s a simplified diagram of the payments model: Each purchase can relate to multiple shops or other entities. Payments and PaymentAdjustments are related to multiple types of transactions, CreditCardTransactions and PayPalTransations. Payments are also related to many ShopPayments with different shop_ids, so sharding on that familiar Etsy key would spread data related to a single payment across many shards. PaymentAdjustments meanwhile are related to Payment by payment_id, and to the two transaction types by reference_id (which maps to payment_adjustment_id). This is a much more complex case than the Ledger model, and to handle it we considered two approaches, described below. As with any technical decision, there were tradeoffs to be negotiated. To evaluate our options, we spiked out changes, diagrammed existing and possible data models, and dug into production to understand how existing queries were using the data. Option 1: Remodel the core The first approach we considered was to modify the structure of our data model with sharding in mind. The simplified diagram below illustrates the approach: We’ve renamed our existing “Payment” models “Purchase,” to better reflect what they represent in the system. We’ve created a new entity called Payment that groups together all entities related to any kind of payment that happens on Etsy. With this model we move closer to an “ideal” sharding model, where all records related to a single payment live on the same shard. We can shard everything by payment_id and enable colocation of all these related entities, with the attendant benefits of resilience and predictability that we've already noted. Introducing a consequential change to an important data model is costly. It would require sweeping changes to core application models and business logic and engineers would have to learn the new model. Etsy Payments is a large and complex codebase, and integrating it with this a new built-to-shard model would lead to a scope of work well beyond our goal of scaling the database. 
Option 2: Find shardifiers where they live The second approach was to shard smaller hierarchies using primary and foreign keys already in our data model, as illustrated below: Here we’ve sharded sub-hierarchies in the data together, using reference_id as a polymorphic shardifier in the transaction model so we can collocate transactions with their related Payment or PaymentAdjustment entities. (The downside of this approach is that PaymentAdjustments are also related to Payments, and those two models will not be colocated.) Considering how urgent it was that we move to a scalable infrastructure, and the importance of keeping Etsy Payments reliable in the meantime, this more modest approach is the one we opted for. As discussed above, most of the effort in the Ledger phase of the project went towards adding columns, modifying primary keys to existing data tables, backfilling data, and modifying queries to add a new shardifier. In contrast, taking established primary and foreign keys as shardifiers whenever possible would cut out nearly all of that effort from the Payments work, giving us a much shorter timeline to completion while still achieving horizontal scalability. Without having to manage the transition to a new data model, we could focus on scaling with Vitess. As it happens, Vitess has re-sharding features that give us flexibility to change our shard design in future, if we see the need; sharding on the basis of the legacy payments model was not a once-and-forever decision. Vitess can even overcome some of the limitations of our "non-ideal" data model using features such as secondary indexes: lookup tables that Vitess maintains to allow shard targeting even when the shardifier is not in a query. This was part 1 of our series on Sharding Payments with Vitess. We'll have a lot more to say about our real-world experience of working with Vitess in part 2 and part 3 of this series. Please join us for those discussions.
The first time I performed a live upgrade of Etsy's Kafka brokers, it felt exciting. There was a bit of an illicit thrill in taking down a whole production system and watching how all the indicators would react. The second time I did the upgrade, I was mostly just bored. There’s only so much excitement to be had staring at graphs and looking at patterns you've already become familiar with. Platform upgrades made for a tedious workday, not to mention a full one. I work on what Etsy calls an “Enablement” team -- we spend a lot of time thinking about how to make working with streaming data a delightful experience for other engineers at Etsy. So it was somewhat ironic that we would spend an entire day staring at graphs during this change, a developer experience we would not want for the end users of our team’s products. We hosted our Kafka brokers in the cloud on Google's managed Kubernetes. The brokers were deployed in the cluster as a StatefulSet, and we applied changes to them using the RollingUpdate strategy. The whole upgrade sequence for a broker went like this: Kubernetes sends a SIGTERM signal to a broker-pod-n The Kafka process shuts down cleanly, and Kubernetes deletes the broker-pod Kubernetes starts a new broker-pod with any changes applied The Kafka process starts on the new broker-pod, which begins recovering by (a) rebuilding its indexes, and (b) catching up on any replication lag Once recovered, a configured readiness probe marks the new broker as Ready, signaling Kubernetes to take down broker-pod-(n-1). Kafka is an important part of Etsy's data ecosystem, and broker upgrades have to happen with no downtime. Topics in our Kafka cluster are configured to have three replicas each to provide us with redundancy. At least two of them have to be available for Kafka to consider the topic “online.” We could see to it that these replicas were distributed evenly across the cluster in terms of data volume, but there were no other guarantees around where replicas for a topic would live. If we took down two brokers at once, we'd have no way to be sure we weren't violating Kafka's two-replica-minimum rule. To ensure availability for the cluster, we had to roll out changes one broker at a time, waiting for recovery between each restart. Each broker took about nine minutes to recover. With 48 Kafka brokers in our cluster, that meant seven hours of mostly waiting. Entering the Multizone In the fall of 2021, Andrey Polyakov led an exciting project that changed the design of our Kafka cluster to make it resilient to a full zonal outage. Where Kafka had formerly been deployed completely in Google’s us-central1-a region, we now distributed the brokers across three zones, ensuring that an outage in any one region would not make the entire cluster unavailable. In order to ensure zonal resilience, we had to move to a predictable distribution of replicas across the cluster. If a replica for a topic was on a broker located in zone A, we could be sure the second replica was on a broker in zone C with the third replica in zone F. Figure 1: Distribution of replicas for a topic are now spread across three GCP zones. Taking down only brokers in Zone A is guaranteed to leave topic partitions in Zone C and Zone F available. This new multizone Kafka architecture wiped out the one-broker-at-a-time limitation on our upgrade process. We could now take down every single broker in the same zone simultaneously without affecting availability, meaning we could upgrade as many as twelve brokers at once. 
We just needed to find a way to restart the correct brokers. Kubernetes natively provides a means to take down multiple pods in a StatefulSet at once, using the partitioned rolling update strategy. However, that requires pods to come down in sequential order by pod number, and our multizonal architecture placed every third broker in the same zone. We could have reassigned partitions across brokers, but that would have been a lengthy and manual process, so we opted instead to write some custom logic to control the updates.

Figure 2. Limitations of Kubernetes sequential ordering in a multizone architecture: we can concurrently update pods 11 and 10, for example, but not pods 11 and 8.

We changed the update strategy on the Kafka StatefulSet object to OnDelete. This means that once an update is applied to a StatefulSet, Kubernetes will not automatically start rolling out changes to the pods, but will expect users to explicitly delete the pods to roll out their changes. The main loop of our program essentially finds some pods in a zone that haven't been updated and updates them. It waits until the cluster has recovered, and then moves on to the next batch.

Figure 3. Polling for pods that still need updates.

We didn't want to have to run something like this from our local machines (what if someone has a flaky internet connection?), so we decided to dockerize the process and run it as a Kubernetes batch job. We set up a small make target to deploy the upgrade script, with logic that would prevent engineers from accidentally deploying two versions of it at once.

Performance Improvements

Figure 4. Visualization of broker upgrades

We tested our logic out in production, and with a parallelism of three we were able to finish upgrades in a little over two hours. In theory we could go further and restart all the Kafka brokers in a zone en masse, but we have shied away from that. Part of broker recovery involves catching up with replication lag: i.e., reading data from the brokers that have been continuing to serve traffic. A restart of an entire zone would mean increased load on all the remaining brokers in the cluster as they saw their number of client connections jump -- we would find ourselves essentially simulating a full zonal outage every time we updated.

It's pretty easy just to look at the reduced duration of our upgrades and call this project a success -- we went from spending seven hours rolling out changes to about two. And in terms of our total investment of time -- time coding the new process vs. time saved on upgrades -- I suspect by now, eight months in, we've probably about broken even. But my personal favorite way to measure the success of this project is centered around toil -- and every upgrade I've performed these last eight months has been quick, peaceful, and over by lunchtime.
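Our in-cluster updater isn't open source, but here is a minimal sketch of the polling loop described under Figure 3 above, written as if it shelled out to kubectl. The namespace, labels, batch size, and health check are illustrative assumptions rather than our real configuration; in particular, the zone label on broker pods and the metrics lookup are hypothetical.

```kotlin
import java.util.concurrent.TimeUnit

// Illustrative constants -- not our real configuration.
const val NAMESPACE = "kafka"
const val ZONE_LABEL = "topology.kubernetes.io/zone"  // assumes the broker pods carry a zone label
const val BATCH_SIZE = 3

fun sh(vararg cmd: String): String {
    val proc = ProcessBuilder(*cmd).redirectErrorStream(true).start()
    val out = proc.inputStream.bufferedReader().readText()
    check(proc.waitFor(10, TimeUnit.MINUTES) && proc.exitValue() == 0) {
        "command failed: ${cmd.joinToString(" ")}\n$out"
    }
    return out
}

// The StatefulSet's target revision: pods not yet on it still need the update.
fun targetRevision(): String =
    sh("kubectl", "-n", NAMESPACE, "get", "statefulset", "kafka",
        "-o", "jsonpath={.status.updateRevision}").trim()

// Broker pods in the given zone whose controller-revision-hash label lags the target revision.
fun podsNeedingUpdate(zone: String, revision: String): List<String> =
    sh("kubectl", "-n", NAMESPACE, "get", "pods", "-l", "app=kafka,$ZONE_LABEL=$zone",
        "-o", "jsonpath={range .items[*]}{.metadata.name} {.metadata.labels.controller-revision-hash}{\"\\n\"}{end}")
        .lines()
        .map { it.trim().split(" ", limit = 2) }
        .filter { it.size == 2 && it[1] != revision }
        .map { it[0] }

// Hypothetical health check: in reality we watch under-replicated partitions and other broker metrics.
fun clusterHealthy(): Boolean = TODO("query under-replicated partition count from the metrics system")

// With updateStrategy OnDelete, deleting a pod is what rolls the change out to it.
fun rollZone(zone: String) {
    val revision = targetRevision()
    while (true) {
        val batch = podsNeedingUpdate(zone, revision).take(BATCH_SIZE)
        if (batch.isEmpty()) return
        batch.forEach { sh("kubectl", "-n", NAMESPACE, "delete", "pod", it) }
        do { Thread.sleep(30_000) } while (!clusterHealthy())
    }
}
```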
In 2018, having chosen Google Cloud Platform as its provider, Etsy began a long project of migration to the cloud. This wasn't just a question of moving out of our data center. Things like on-demand capacity scaling and multi-zone/region resilience don't just magically happen once you're "in the cloud." To get the full benefits, we undertook a major redesign to host our Kafka brokers (and clients) on Google's managed Kubernetes (GKE). Fast forward a few years: we now have many more Kafka clients with streaming applications that power critical user-facing features, like search indexing. Our initial architecture, which saved on cost by operating (mostly) in a single availability zone, was beginning to look shaky. The consequences of a Kafka outage had grown, resulting, for example, in stale search results. This would have a negative impact on our buyers and sellers alike, and would mean a direct revenue loss to Etsy. We decided it was time to reevaluate. With some research and thinking, and a good bit of experimentation, we put together a plan to make our Kafka cluster resilient to zonal failures. For such an important production service, the migration had to be accomplished with zero downtime in production. This post will discuss how we accomplished that feat, and where we're looking to optimize costs following a successful rollout.

Single-Zone Kafka

Inter-zone network costs in Google Cloud can add up surprisingly quickly, even exceeding the costs of VMs and storage. Mindful of this, our original design operated the Kafka cluster in one zone only, as illustrated below:

Figure 1. An illustration of our original single-zone architecture. The Kafka cluster operates entirely within zone "a". Only a few critically important components, such as Zookeeper in this example, run in multiple zones.

Those with keen eyes might have noticed the Persistent Disks drawn across a zone boundary. This is to indicate that we're using Google's Regional PDs, replicated to zones "a" and "b". (Regular PDs, by contrast, are zonal resources, and can only be accessed from a GCE instance in the same zone.) Even though we weren't willing to pay the network costs for operating Kafka in multiple zones at the time, we wanted at least some capability to handle Google Cloud zonal outages. The worst-case scenario in the design above is a "zone-a" outage taking down the entire Kafka cluster. Until the cluster came back up, consumers would be dealing with delayed data, and afterward they'd have to spend time processing the backlog. More concerning is that with the cluster unavailable, producers would have nowhere to send their data. Our primary producer, Franz, has an in-memory buffer and unlimited retries as a hedge against the problem, but producer memory buffers aren't unlimited. In the event of a sustained zone outage, our runbook documented a plan to have a team member manually relocate the Kafka brokers to zone-b, where disks and historical data have been safely stored. A quick response would be critical to prevent data loss, which might be a lot to ask of a possibly sleepy human.

Multizone Kafka

As Kafka's importance to the business grew -- and even though the single-zone architecture hadn't yet suffered an outage -- so did our discomfort with the limitations of manual zone evacuation. So we worked out a design that would give the Kafka cluster zonal resilience:

Figure 2. A multizone design for Kafka on GKE. Notice that persistent disks are no longer drawn across zone boundaries.
The most crucial change is that our Kafka brokers are now running in three different zones. The GKE cluster was already regional, so we applied Kubernetes Pod Topology Spread Constraints with topologyKey: topology.kubernetes.io/zone to ensure an even distribution across zones. Less apparent is the fact that topic partition replicas also need to be evenly distributed. We achieve this by setting Kafka's broker.rack configuration based on the zone where a broker is running. This way, in the event of an outage of any one zone, two out of three partition replicas are still available. With this physical data layout, we no longer need persistent disks to be regional, since Kafka is providing inter-zone replication for us.

Zero-Downtime Migration

Satisfied with the design, we still had the challenge of applying the changes to our production Kafka cluster without data loss, downtime, or negative impact to client applications. The first task was to move broker Pods to their correct zones. Simply applying the Topology Spread Constraints resulted in Pods stuck in a "Pending" state when they were recreated with the new configs. The problem was that the disks and PVCs for the pods are zonal resources that can only be accessed locally (and even regional disks are limited to just two zones). So we had to move the disks first; then the Pods could follow. One way to accomplish the move is by actually deleting and recreating the disks. Since we have a replication factor of three on all our topics, this is safe if we do it one disk at a time, and Kafka will re-replicate the missing data onto the blank disk. Testing showed, however, that completing the procedure for all brokers would take an unacceptably long time, on the order of weeks. Instead, we took advantage of Google's disk snapshotting feature. Automated with some scripting, the main loop performs roughly the following steps:

1. Create a "base" snapshot of the Kafka broker disk, while the broker is still up and running.
2. Halt the broker: kubectl delete statefulset --cascade=orphan, then kubectl delete pod.
3. Create the "final" snapshot of the same Kafka broker disk, referencing the base snapshot.
4. Create a brand new disk from the final snapshot, in the correct zone.
5. Delete the original disk.
6. Recreate the StatefulSet, which recreates the Pod and starts up the broker.
7. Wait until cluster health returns to normal (under-replicated partitions is zero).
8. Repeat for each broker.

The two-step base/final snapshot process is just an optimization: the second snapshot is much faster than the first, which minimizes broker downtime, and also reduces the time for partition replication to catch up afterwards. Now that we've got brokers running in the correct zones, what about partitions? Kafka doesn't provide automatic partition relocation, and the broker.rack configuration only applies to newly created partitions. So this was a do-it-yourself situation, which at a high level involved:

- Generating a list of topic-partitions needing relocation, based on the requirement that replicas need to be distributed evenly across all zones. After some scripting, this list contained 90% of the cluster's partitions.
- Generating a new partition assignment plan in JSON form. Kafka provides some CLI tools for the job, but we used the SiftScience/kafka-assigner tool instead (with a few of the open PRs applied). This allowed us to minimize the amount of data movement, saving time and reducing load on the cluster.
- Applying partition assignments using the official "kafka-reassign-partitions" CLI tool.
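For illustration, the JSON plan that kafka-reassign-partitions consumes is simple enough to render directly. The sketch below shows its shape; the topic name, broker numbering, and replica placement are made up for the example (our real plan came from the kafka-assigner tool).

```kotlin
// One entry per partition to move, listing the target (rack-aware) replica set.
data class Reassignment(val topic: String, val partition: Int, val replicas: List<Int>)

// Renders the {"version":1,"partitions":[...]} document expected by
// kafka-reassign-partitions --reassignment-json-file plan.json --execute
fun toPlanJson(moves: List<Reassignment>): String =
    moves.joinToString(prefix = """{"version":1,"partitions":[""", separator = ",", postfix = "]}") {
        """{"topic":"${it.topic}","partition":${it.partition},"replicas":[${it.replicas.joinToString(",")}]}"""
    }

fun main() {
    // Illustrative numbering: one replica per zone (brokers 0-15 in zone a, 16-31 in b, 32-47 in c).
    val plan = toPlanJson(listOf(
        Reassignment("search-indexing-events", partition = 0, replicas = listOf(3, 19, 35)),
        Reassignment("search-indexing-events", partition = 1, replicas = listOf(4, 20, 36)),
    ))
    println(plan)
}
```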
To prevent overwhelming the cluster, we throttled the data migration, going in small batches rather than all at once (we had something like 10k partitions to move), and grouping migrations by class of topic: large, small, or empty (with the empty ones probably available for deletion). It was delicate work that took days of manual effort and babysitting to complete, but the result was a completely successful, zero-downtime partition migration.

Post Migration

In 2021, a company-wide initiative to test and understand zonal resilience in a large number of Etsy's systems, led by Jeremy Tinley, gave us a perfect opportunity to put our multizone Kafka design through its paces. We performed our testing in production, like many other teams (staging environments not being 100% representative), and brought down an entire zone, a third of the Kafka cluster. As partition leaders and replicas in that zone became unavailable, leadership moved to the surviving replicas, client requests automatically switched to still-available brokers, and any impact turned out to be minimal and temporary.

Some napkin math at the time of the redesign led us to believe that we would see only minimal cost increases from our multizone setup. In particular, eliminating regional disks (the most expensive Google Cloud SKU in our single-zone design) in favor of Kafka replication would halve our significant storage expense. By current measurements, though, we've ended up with a roughly 20% increase in cost since migration, largely due to higher-than-anticipated inter-zone network costs. We expected some increase, of course: we wanted an increase, since the whole point of the new architecture is to make data available across zones. Ideally, we would only be traversing zone boundaries for replication, but in practice that ideal is hard to achieve.

Kafka's follower fetching feature has helped us make progress on this front. By default, consumers read from the leader replica of a partition, where records are directly produced; but if you're willing to accept some replication latency (well within our application SLOs), follower fetching lets you consume data from same-zone replica partitions, eliminating extra hops across zone boundaries. The feature is enabled by specifying the client.rack configuration on consumers, and the RackAwareReplicaSelector class for the broker-side replica.selector.class config. This isn't the only source of inter-zone traffic, however: many of our client applications are not pure consumers but also produce data themselves, and their writes go back to the Kafka cluster across zone boundaries. (We also have some consumers in different Google Cloud projects outside our team's control that we haven't been able to update yet.)

Arguably, increased network costs are worth it to be insured against catastrophic zone failures. (Certainly there are some possibly sleepy humans who are glad they won't be called upon to manually evacuate a bad zone.) We think that with continued work we can bring our costs more in line with initial expectations. But even as things stand, the benefits of automatic and fast tolerance to zone outages are significant, and we'll pay for them happily.
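For reference, here's roughly what the consumer side of follower fetching looks like in code. The bootstrap address, group id, and hard-coded zone string are placeholders for illustration; in practice the zone is discovered from the environment, and the broker-side settings live in the broker configuration alongside broker.rack.

```kotlin
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

fun zoneAwareConsumer(zone: String): KafkaConsumer<String, String> {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")   // placeholder address
        put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer")      // placeholder group
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        // KIP-392: tell brokers which "rack" (zone) this consumer runs in, so fetches can be
        // served by a same-zone follower replica instead of crossing zones to reach the leader.
        put("client.rack", zone)  // e.g. "us-central1-a"
    }
    return KafkaConsumer(props)
}

// Broker side (paired with broker.rack set per zone):
//   broker.rack=us-central1-a
//   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
```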
Using Kotlin Flows, we worked out how to structure our business and UI logic around a unidirectional data flow, a key unlock for future integration of Compose with Macramé – our standard architecture for use across all screens in the Etsy app. With a full internal screen under our belts, it was time to put Compose in front of real users. A few complex bottom sheets were the next pieces of our app to get the Compose treatment. For the first time, we exposed a major part of our UI, now fully written in Compose, to buyers and sellers on Etsy. We also paired a simple version of our Macramé architecture with these bottom sheets to prove that the two were compatible.

A bottom sheet fully using Compose hosted inside of a screen built using Views.

After successfully rolling out bottom sheets using Compose, we saw an opportunity to adopt Compose on a larger scale in the Shop screen. The existing Shop screen code was confusing to follow and very difficult to run experiments on – limiting our ability to help sellers improve their virtual storefronts. Compose and Macramé held the promise of addressing all these concerns.

The Shop screen, fully built using Compose.

In just around three months, our small team completed the rebuild. Our first order of business was to run an A/B experiment on the Shop screen to compare old vs. new. The results gave Compose even better marks than we had hoped for. Initial screen rendering time improved by 5%, and subjective interactions with the Shop screen, like taps and scrolls, were quicker and more fluid. User analytics showed the new screen improved conversion rate, add to cart actions, checkout starts, shop favoriting, listing views, and more – meaning these changes made a tangible, positive impact for our sellers. For the engineers tasked with coding the Shop screen, the results were just as impressive. An internal survey of engineers who had worked with the Shop screen before the rewrite showed a significant improvement in overall developer satisfaction. Building features required fewer lines of code, our respondents told us, and thanks to the Macramé architecture, testing was much easier and enabled us to greatly increase test coverage of business logic. Similar to what we learned during the development of our Design System components, Compose Previews were called out as a superpower for covering edge cases, and engineers said they were excited to work in a codebase that now featured a modern toolkit.

Learnings

We've learned quite a lot about Compose on our path to adopting it:

- Because of the unidirectional data flow of our Macramé architecture and stateless components built with Compose, state is decoupled from the UI and business logic is isolated and testable. The combination of Macramé and Compose has become the standard way we build features for our app.
- Colocation of layout and display logic allows for much easier manipulation of spacing, margins, and padding when working with complex display logic. Dynamic spacing is extremely difficult to do with XML layouts alone, and requires code in separate files to keep it all in sync.
- Creating previews of all possible Compose states using mock data has eliminated a large source of rework, bugs, and bad experiences for our buyers.
- Our team found it easier to build lazy-style lists in Compose compared to managing all the pieces involved with using RecyclerView, especially horizontal lazy lists.
- Interoperability between Compose and Views in both directions enabled a gradual adoption of Compose.
- Animation of Composables can be triggered automatically by data changes -- no extra code to start and stop the animations properly.

While no individual tool is perfect, we're excited about the opportunities and efficiencies Compose has unlocked for our teams. As with any new technology, there's a learning curve, and some bumps along the way. One issue we found was in a third-party library we use. While the library has support for Compose, at the time of the Shop screen conversion that support was still in alpha. After extensive testing, we decided to move forward with the alpha version, though an incompatibility could have forced us to find an alternative solution. Another learning is that LazyRows and LazyColumns, while similar in some respects to RecyclerView, come with their own specific way of handling keys and item reuse. This new lazy list paradigm has occasionally tripped us up and resulted in some unexpected behavior.

Conclusion

We're thrilled with our team's progress and outcomes in adopting this new toolkit. We've now fully rewritten several key UI screens, including Listing, Favorites, Search, and Cart, using Compose, with more to come. Compose has given us a set of tools that lets us be more productive when delivering new features to our buyers, and its gradual rollout in our codebase is a tangible example of the Etsy team's commitment to our craft.
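To make the preview-driven workflow from the learnings above concrete, here is a minimal sketch. The composable and its UI model are invented for this example rather than taken from our design system, but the pattern is the same: a stateless composable rendered as a pure function of its input, with Previews exercising edge cases on mock data.

```kotlin
import androidx.compose.foundation.layout.Column
import androidx.compose.material.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.tooling.preview.Preview

// Illustrative UI model: all state lives outside the composable.
data class ReviewUiModel(val author: String, val rating: Int, val text: String?)

@Composable
fun ReviewCard(review: ReviewUiModel) {
    Column {
        Text("${review.author} · ${"★".repeat(review.rating)}")
        review.text?.let { Text(it) }
    }
}

// Previews render in the IDE with arbitrary mock data -- no device or app run required.
@Preview(showBackground = true)
@Composable
fun ReviewCardPreview() {
    ReviewCard(ReviewUiModel(author = "Ada", rating = 5, text = "Beautiful craftsmanship, arrived quickly!"))
}

@Preview(showBackground = true)
@Composable
fun ReviewCardNoTextPreview() {
    ReviewCard(ReviewUiModel(author = "Grace", rating = 4, text = null))
}
```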
At Etsy, we're focused on elevating the best of our marketplace to help creative entrepreneurs grow their businesses. We continue to invest in making Etsy a safe and trusted place to shop, so sellers' extraordinary items can shine. Today, there are more than 100 million unique items available for sale on our marketplace, and our vibrant global community is made up of over 90 million active buyers and 7 million active sellers, the majority of whom are women and sole owners of their creative businesses. To support this growing community, our Trust & Safety team of Product, Engineering, Data, and Operations experts is dedicated to keeping Etsy's marketplace safe by enforcing our policies and removing potentially violating or infringing items at scale. For that, we make use of community reporting and automated controls for removing this potentially violating content. In order to continue to scale and enhance our detections through innovative products and technologies, we also leverage state-of-the-art Machine Learning solutions, which we have already used to identify and remove over 100,000 violations on our marketplace during the past year. In this article, we are going to describe one of our systems for detecting policy violations, which utilizes supervised learning: a family of algorithms that uses data to train models to recognize patterns and predict outcomes.

Datasets

In Machine Learning, data is one of the variables we have the most control over. Extracting data and building trustworthy datasets is a crucial step in any learning problem. In Trust & Safety, we are determined to keep our marketplace and users safe by identifying violations of our policies. For that, we log and annotate potential violations, which enables us to collect datasets reliably. In our approach, these annotations are translated into positives (listings that were indeed violations) and negatives (listings that were found not to be offending for a given policy). The latter are also known as hard negatives, as they are close to our positives and can help us better learn how to partition these two sets. In addition, we also add easy (or soft) negatives by adding random items to our datasets. This gives our models further general examples of listings that do not violate any policy -- the majority of our marketplace -- and improves generalizability. The number of easy negatives to add is a hyper-parameter to tune: more of them means longer training time and a smaller share of positive examples. For each training example, we extract multimodal signals, both textual and imagery, from our listings. Then, we split our datasets by time using progressive evaluation, to mimic our production use case and learn to adapt to recent behavior. These are split into training (used to train our models and learn patterns), validation (used to fine-tune training hyper-parameters such as the learning rate and to evaluate over-fitting), and test (used to report our metrics in an unbiased manner).

Model Architecture

After the usual transformations and extraction of a set of offline features from our datasets, we are all set to start training our Machine Learning model. The goal is to predict whether a given listing violates any of a predefined set of policies, or none of them. For the latter, we added a neutral class representing "no violation," into which the majority of our listings fall. This is a typical design pattern for these types of problems.
Our model architecture includes a text encoder and an image encoder to learn representations (aka embeddings) for each modality. Our text encoder currently employs a BERT-based architecture to extract context-rich representations of our text inputs. To reduce compute time, we leverage ALBERT, a lighter BERT with roughly 90% fewer parameters, since its transformer blocks share weights. Our initial lightweight representation used an in-house model trained for Search use cases, which allowed us to quickly start iterating and learning from this problem. Our image encoder currently employs EfficientNet, a very efficient and accurate Convolutional Neural Network (CNN). Our initial lightweight representation here used an in-house CNN model for category classification. We are also experimenting with transformer-based image architectures -- vision transformers, similar in spirit to our text encoders -- but so far their performance has not been a significant improvement. Inspired by EmbraceNet, our architecture then learns further, more constrained representations for the text and image embeddings separately, before they are concatenated to form a single multimodal representation. This is then sent to a final softmax activation that maps logits to probabilities for our internal use. In addition, to address the imbalanced nature of this problem, we leverage focal loss, which puts more weight on hard, misclassified examples. Figure 1 shows our model architecture, with late concatenation of our text and image encoders and final output probabilities on an example.

Model Architecture. Image obtained from @charlesdeluvio on Unsplash.

Model Evaluation

First, we experimented and iterated by training our model offline. To evaluate its performance, we established certain benchmarks, based on the business goal of minimizing the impact on well-intentioned sellers while successfully detecting offending listings on the platform. This results in a typical evaluation trade-off between precision and recall: precision being the fraction of listings we flag that are actual violations, and recall being the fraction of all actual violations that we flag. However, we faced the challenge that true recall is not possible to compute, as it's not feasible to manually review the millions of new listings per day, so we had to settle for a proxy for recall based on what has been annotated. Once we had a viable candidate to test in production, we deployed our model as an endpoint and built a service, callable via an API, that performs pre-processing and post-processing steps around the call to that endpoint. Then, we ran an A/B test to measure its performance in production using a canary release approach, slowly rolling out our new detection system to a small percentage of traffic and increasing that percentage as we validated an improvement in our metrics and no unexpected computational load. Afterwards, we iterated: every time we had a promising offline candidate (the "challenger") that improved our offline performance metrics, we A/B tested it against our current model (the "champion"). We designed guidelines for model promotion to increase our metrics and our policy coverage. Now, we monitor and observe our model predictions and trigger re-training when performance degrades.

Results

Our supervised learning system has been continually learning as we train frequently, run experiments with new datasets and model architectures, A/B test them, and deploy them to production.
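As an aside, the focal loss mentioned in the Model Architecture section has a simple closed form. Here is an illustrative single-example version (our real training code lives in our ML stack, and the α and γ values shown are just common defaults, not our tuned settings):

```kotlin
import kotlin.math.ln
import kotlin.math.pow

/**
 * Focal loss for a single multi-class example.
 * probs: softmax probabilities over classes (including the neutral class); trueClass: label index.
 * gamma down-weights easy, well-classified examples; alpha balances class frequencies.
 */
fun focalLoss(probs: DoubleArray, trueClass: Int, gamma: Double = 2.0, alpha: Double = 0.25): Double {
    val pt = probs[trueClass].coerceIn(1e-7, 1.0)  // clamp for numerical stability
    return -alpha * (1 - pt).pow(gamma) * ln(pt)
}
```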
We have added new violation types as additional classes to our model. As a result, we have identified and removed more than 100,000 violations using these methodologies, in addition to other tools and services that continue to detect and remove violations. This is one of several approaches we use to identify potentially offending content; others include explicitly using policy information and leveraging the latest in Large Language Models (LLMs) and Generative AI. Stay tuned!

"To infinity and beyond!" –Buzz Lightyear, Toy Story
In 2020, Etsy concluded its migration from an on-premise data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, focused on maximizing the organization's cloud value by fostering collaborations within and outside the organization, particularly with our Cloud Providers. Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.

Forecast Methodology

A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:

- Core Infrastructure: Includes the costs of infrastructure and services essential for operating the Etsy.com website.
- Machine Learning & Product Enablement: Encompasses costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.
- Data Enablement: Encompasses costs related to shared platforms for data collection, data processing and workflow orchestration.
- Dev: Encompasses non-production resources.

The FinOps forecasting model relies on a trailing Cost Per Visit (CPV) metric. While CPV provides valuable insights into changes, it's not without limitations:

- A meaningful portion of web traffic to Etsy involves non-human activity, like web crawlers, which isn't accounted for in CPV.
- Some services have weaker correlations to user visits.
- Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments or big data workflows.
- A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.

Periodically, we run regression tests to validate whether CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value) and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.

Monitoring and Readouts

We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week/month-over-month growth. Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances. While we track costs at the organization/cost center level, we don't charge costs back to the teams. This both lowers our overhead and, more importantly, provides flexibility to make tradeoffs that enable Engineering velocity.
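As a simplified sketch of how a trailing-CPV baseline can flag deviations (the four-week window and 5% tolerance here are made up for illustration, not our actual alerting thresholds):

```kotlin
// Expected spend = forecast visits x trailing cost-per-visit; flag when actuals drift past a tolerance.
data class WeeklyActuals(val visits: Long, val cost: Double)

fun trailingCpv(history: List<WeeklyActuals>, weeks: Int = 4): Double {
    val window = history.takeLast(weeks)
    return window.sumOf { it.cost } / window.sumOf { it.visits }
}

fun exceedsForecast(forecastVisits: Long, actualCost: Double, cpv: Double, tolerance: Double = 0.05): Boolean {
    val expectedCost = forecastVisits * cpv
    return actualCost > expectedCost * (1 + tolerance)
}
```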
Cost Increase Detection & Mitigation

Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this, we:

- Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e., variance in resource utilization).
- Collaboration: Engage relevant teams, sharing insights and seeking additional context.
- Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed.
- Mitigation: Unexpected increases undergo joint troubleshooting, where we outline and assign action items to owners, until issues are resolved.
- Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast, post-confirmation or resolution with teams and engineering leadership.

Cost Optimization Initiatives

Another side of maintaining a healthy CPV involves cost optimization, offsetting increases from product launches. Ideas for cost-saving come as a result of collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of a cost optimization rather than solely on savings, recognizing that inefficiencies often signal larger problems. Based on effort vs. value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakout of some of the major wins we have had in the last year or so.

- GCS Storage Optimization: In 2023 we stood up a squad focused on optimizing Etsy's use of GCS, as it has been one of the largest growth areas for us over the past few years. The squad delivered a number of improvements, including improved monitoring of usage, automation features for Data engineers, implementation of TTLs that match data access patterns and business needs, and the adoption of intelligent tiering. Due to these efforts, Etsy's GCS usage is now less than it was two years ago.
- Compute Optimization: Migrated over 90% of Etsy's traffic-serving infrastructure to the latest and greatest CPU platform. This improved our serving latency while reducing cost.
- Increased Automation for Model Deployment: In an effort to improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for new models being deployed, which also ended up saving us money.
- Network Compression: Enabling network compression between our high-throughput services both improved the latency profile and drastically reduced the networking cost.

What's Next

While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.
Etsy features a diverse marketplace of unique handmade and vintage items. It's a visually diverse marketplace as well, and computer vision has become increasingly important to Etsy as a way of enhancing our users' shopping experience. We've developed applications like visual search and visually similar recommendations that can offer buyers an additional path to find what they're looking for, powered by machine learning models that encode images as vector representations.

Figure 1. Visual representations power applications such as visual search and visually similar recommendations.

Learning expressive representations through deep neural networks, and being able to leverage them in downstream tasks at scale, is a costly technical challenge. The infrastructure required to train and serve large models is expensive, as is the iterative process that refines them and optimizes their performance. The solution is often to train deep learning architectures offline and use the precomputed visual representations in downstream tasks served online. (We wrote about this in a previous blog post on personalization from real-time sequences and diversity of representations.) In any application where a query image representation is inferred online, low-latency, memory-aware models are essential. Efficiency becomes paramount to the success of these models in the product. We can think about efficiency in deep learning along multiple axes: efficiency in model architecture, model training, evaluation and serving.

Model Architecture

The EfficientNet family of models features a convolutional neural network architecture. It uniformly optimizes for network width, depth, and resolution using a fixed set of coefficients. By allowing practitioners to start from a limited resource budget and scale up for better accuracy as more resources are available, EfficientNet provides a great starting point for visual representations. We began our trials with EfficientNetB0, the smallest model in the EfficientNet family. We saw good performance and low latency with this model, but the industry and research community have touted Vision Transformers (ViT) as having better representations, so we decided to give that a try. Transformers lack the spatial inductive biases of CNNs, but they outperform CNNs when trained on large enough datasets and may be more robust to domain shifts. ViT decomposes the image into a sequence of patches (16x16, for example) and applies a transformer architecture to incorporate more global information. However, due to the massive number of parameters and the compute-heavy attention mechanism, ViT-based architectures can be many times slower to train and run inference with than lightweight convolutional networks. Despite the challenges, more efficient ViT architectures have recently begun to emerge, featuring clever pooling, layer dropping, efficient normalization, and efficient attention or hybrid CNN-transformer designs. We employ EfficientFormer-l3 to take advantage of these ViT improvements. The EfficientFormer architecture achieves efficiency through downsampling in multiple blocks and employing attention only in the last stage. Its derived image representation mechanism differs from the standard vision transformer, where embeddings are extracted from the first token of the output. Instead, we extract the attention output from the last block for the eight heads and perform average pooling over the sequence.
In Figure 2 we illustrate these different attention weights with heat maps overlaid on an image, showing how each of the eight heads learns to focus on a different salient part.

Figure 2. Probing the EfficientFormer-l3 pre-trained visual representations through attention heat maps.

Model Training

Fine-Tuning

With our pre-trained backbones in place, we can gain further efficiencies via fine-tuning. For the EfficientNetB0 CNN, that means replacing the final convolutional layer and attaching a d-dimensional embedding layer followed by m classification heads, where m is the number of tasks. The embedding head consists of a new convolutional layer with the desired final representation dimension, followed by a batch normalization layer, a swish activation and a global average pooling layer to aggregate the convolutional output into a single vector per example. To train EfficientNetB0, the newly attached layers are trained from scratch for one epoch with the backbone layers frozen, to avoid excessive computation and overfitting. We then unfreeze 75 layers from the top of the backbone and fine-tune for nine additional epochs, for efficient learning. At inference time we remove the classification head and extract the output of the pooling layer as the final representation. To fine-tune the EfficientFormer ViT we stick with the pretraining resolution of 224x224, since using longer sequences, such as the 384x384 resolution often recommended for ViT, leads to larger training budgets. To extract the embedding we average-pool the last hidden state. Then classification heads are added as with the CNN, with batch normalization swapped for layer normalization.

Multitask Learning

In a previous blog post we described how we built a multitask learning framework to generate visual representations for Etsy's search-by-image experience. The training architecture is shown in Figure 3.

Figure 3. A multitask training architecture for visual representations. The dataset sampler combines examples from an arbitrary number of datasets corresponding to respective classification heads. The embedding is extracted before the classification heads.

Multitask learning is an efficiency inducer. Representations encode commonalities, and they perform well in diverse downstream tasks when those are learned using common attributes as multiple supervision signals. A representation learned in single-task classification against the item's taxonomy, for example, will be unable to capture visual attributes: colors, shapes, materials. We employ four classification tasks: a top-level taxonomy task with 15 top-level categories of the Etsy taxonomy tree as labels; a fine-grained taxonomy task, with 1000 fine-grained leaf node item categories as labels; a primary color task; and a fine-grained taxonomy task (review photos), where each example is a buyer-uploaded review photo of a purchased item, with 100 labels sampled from fine-grained leaf node item categories. We are able to train both EfficientNetB0 and EfficientFormer-l3 on standard 16GB GPUs (we used two P100 GPUs). For comparison, a full-sized ViT requires a larger 40GB GPU such as an A100, which can increase training costs significantly. We provide detailed hyperparameter information for fine-tuning either backbone in our article.

Evaluating Visual Representations

We define and implement an evaluation scheme for visual representations to track and guide model training, based on three nearest neighbor retrieval tasks.
After each training epoch, a callback is invoked to compute and log the recall for each retrieval task. Each retrieval dataset is split into two smaller datasets: “queries” and “candidates.” The candidates dataset is used to construct a brute-force nearest neighbor index, and the queries dataset is used to look up the index. The index is constructed on the fly after each epoch to accommodate for embeddings changing between training epochs. Each lookup yields K nearest neighbors. We compute Recall@5 and @10 using both historical implicit user interactions (such as “visually-similar ad clicks”) and ground truth datasets of product photos taken from the same listing (“intra-item”). The recall callbacks can also be used for early stopping of training to enhance efficiency. The intra-item retrieval evaluation dataset consists of groups of seller-uploaded images of the same item. The query and candidate examples are randomly selected seller-uploaded images of an item. A candidate image is considered a positive example if it is associated with the same item as the query. In the “intra-item with reviews” dataset, the query image is a randomly selected buyer-uploaded review image of an item, with seller-uploaded images providing candidate examples. The dataset of visually similar ad clicks associates seller-uploaded primary images with primary images of items that have been clicked in the visually similar surface on mobile. Here, a candidate image is considered a positive example for some query image if a user viewing the query image has clicked it. Each evaluation dataset contains 15,000 records for building the index and 5,000 query images for the retrieval phase. We also leverage generative AI for an experimental new evaluation scheme. From ample, multilingual historical text query logs, we build a new retrieval dataset that bridges the semantic gap between text-based queries and clicked image candidates. Text-to-image generative stable diffusion makes the information retrieval process language-agnostic, since an image is worth a thousand (multilingual) words. A stable diffusion model generates high-quality images which become image queries. The candidates are images from clicked items corresponding to the source text query in the logs. One caveat is that the dataset is biased toward the search-by-text production system that produced the logs; only a search-by-image-from-text system would produce truly relevant evaluation logs. The source-candidate image pairs form the new retrieval evaluation dataset which is then used within a retrieval callback. Of course, users entering the same text query may have very different ideas in mind of, say, the garment they’re looking for. So for each query we generate several images: formally, a random sample of length 𝑛 from the posterior distribution over all possible images that can be generated from the seed text query. We pre-condition our generation on a uniform “fashion style.” In a real-world scenario, both the text-to-image query generation and the image query inference for retrieval happen in real time, which means efficient backbones are necessary. We randomly select one of the 𝑛 generated images to replace the text query with an image query in the evaluation dataset. This is a hybrid evaluation method: the error inherent in the text-to-image diffusion model generation is encapsulated in the visually similar recommendation error rate. 
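A minimal sketch of the recall computation that such a callback performs, using brute-force cosine similarity (the embedding arrays and the notion of which candidates count as positives come from the datasets described above; a real implementation would batch and vectorize this):

```kotlin
import kotlin.math.sqrt

fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

/**
 * Recall@K over a query set and a candidate "index".
 * positives[q] holds the candidate indices that count as correct for query q
 * (same-listing images, or clicked items, depending on the evaluation dataset).
 */
fun recallAtK(
    queries: List<FloatArray>,
    candidates: List<FloatArray>,
    positives: List<Set<Int>>,
    k: Int,
): Double {
    val hits = queries.indices.count { q ->
        candidates.indices
            .sortedByDescending { c -> cosine(queries[q], candidates[c]) }
            .take(k)
            .any { it in positives[q] }
    }
    return hits.toDouble() / queries.size
}
```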
Future work may include prompt engineering to improve the text query prompt itself, which as input by the user can be short and lacking in detail. Large memory requirements and high inference latency are challenges in using text-to-image generative models at scale. We employ an open source fast stable diffusion pipeline that uses token merging and float16 inference. Compared to the standard stable diffusion implementation available at the time we built the system, this method speeds up inference by 50% with a 5x reduction in memory consumption, though results depend on the underlying patched model. We can generate 500 images per hour with one T4 GPU (no parallelism) using the patched stable diffusion pipeline; with parallelism we can achieve further speedup. Figure 4 shows that for the English text query "black bohemian maxi dress with orange floral pattern" the efficient stable diffusion pipeline generates five image query candidates. The generated images include pleasant variations with some detail loss. Interestingly, it is mostly the facial details of the fashion model that are affected, while the garment pattern remains clear. In some cases degradation might prohibit display, but efficient generative technology is being perfected at a fast pace, and prompt engineering helps the generative process as well.

Figure 4. Text-to-image generation using a generative diffusion model, from equivalent queries in English and French.

Efficient Inference and Downstream Tasks

Especially when it comes to latency-sensitive applications like visually similar recommendations and search, efficient inference is paramount: otherwise, we risk loss of impressions and a poor user experience. We can think of inference along two axes: online inference of the image query, and efficient retrieval of the top-k most similar items via approximate nearest neighbors. The dimension of the learned visual representation impacts the efficient retrieval design as well, and the smaller 256-dimensional embedding derived from EfficientNetB0 presents an advantage. EfficientNetB0 is hard to beat in terms of accuracy-to-latency trade-offs for online inference, with ~5M parameters and around 1.7ms latency on an iPhone 12. EfficientFormer-l3 has ~30M parameters and gets around 2.7ms latency on an iPhone 12 with higher accuracy (while, for example, MobileViT-XS scores around 7ms with a third of the accuracy; very large ViTs are not considered, since their latencies are prohibitive). In offline evaluation, the EfficientFormer-l3-derived embedding achieves around a +5% lift in the Intra-L Recall@5 evaluation, +17% in Intra-R Recall@5, and +1.8% in Visually Similar Ad Clicks Recall@5. We performed A/B testing on the EfficientNetB0 multitask variant across visual applications at Etsy with good results. Additionally, the EfficientFormer-l3 visual representations led to a +0.65% lift in CTR, and a similar lift in purchase rate, in a first visually-similar-ads experiment when compared to the production variant of EfficientNetB0. When included in sponsored search downstream rankers, the visual representations led to a +1.26% lift in post-click purchase rate. Including the efficient visual representation in Ads Information Retrieval (AIR), an embedding-based retrieval method used to retrieve similar item ad recommendations, caused an increase in click-recall@100 of 8%. And when we used these representations to compute image similarity and included them directly in the last-pass ranking function, we saw a +6.25% lift in clicks.
The first use of EfficientNetB0 visual embeddings was in visually similar ad recommendations on mobile. This led to a +1.92% increase in ad return-on-spend on iOS and a +1.18% increase in post-click purchase rate on Android. The same efficient embedding model backed the first search-by-image shopping experience at Etsy, where users search using photos taken with their mobile phone's camera and the query image embedding is inferred efficiently online, which we discussed in a previous blog post. Learning visual representations is of paramount importance in visually rich e-commerce and online fashion recommendations. Learning them efficiently is a challenging goal made possible by advances in the field of efficient deep learning for computer vision. If you'd like a more in-depth discussion of this work, please see our full paper, accepted to the #fashionXrecsys workshop at the RecSys 2023 conference.
Easily the most important and complex screen in the Buy on Etsy Android app is the listing screen, where all key information about an item for sale in the Etsy marketplace is displayed to buyers. Far from showing just a title and description, a price and a few images, over the years the listing screen has come to aggregate ratings and reviews, seller, shipping, and stock information, and a variety of personalization and recommendation features. As information-rich as it is, as central as it is to the buying experience, for product teams the listing screen is an irresistible place to test out new methods and approaches. In just the last three years, apps teams have run nearly 200 experiments on it, often with multiple teams building and running experiments in parallel. Eventually, with such a high velocity of experiment and code change, the listing screen started showing signs of stress. Its architecture was inconsistent and not meant to support a codebase expanding so much and so rapidly in size and complexity. Given the relative autonomy of Etsy app development teams, there ended up being a lot of reinventing the wheel, lots of incompatible patterns getting layered atop one another; in short, the code resembled a giant plate of spaghetti. The main listing Fragment file alone had over 4000 lines of code in it!

Code that isn't built for testability doesn't test well, and test coverage for the listing screen was low. VERY low. Our legacy architecture made it hard for developers to add tests for business logic, and the tests that did get written were complex and brittle, and often caused continuous integration failures for seemingly unrelated changes. Developers would skip tests when it seemed too costly to write and maintain them; those skipped tests made the codebase harder for new developers to onboard into or work with confidently; and the result was a vicious circle that would lead to even less test coverage.

Introducing Macramé

We decided that our new architecture for the listing screen, which we've named Macramé, would be based on immutable data propagated through a reactive UI. Reactive frameworks are widely deployed and well understood, and we could see a number of ways that reactivity would help us untangle the spaghetti. We chose to emulate architectures like Spotify's Mobius, molded to fit the shape of Etsy's codebase and its business requirements. At the core of the architecture is an immutable State object that represents our data model. State for the listing screen is passed to the UI as a single data object via a StateFlow instance; each time a piece of the data model changes, the UI re-renders. Updates to State can be made either from a background thread or from the main UI thread, and using StateFlow ensures that all updates reach the main UI thread. When the data model for a screen is large, as it is for the listing screen, updating the UI from a single object makes things much simpler to test and reason about than if multiple separate models are making changes independently. And that simplicity lets us streamline the rest of the architecture. When changes are made to the State, the monolithic data model gets transformed into a list of smaller models that represent what will actually be shown to the user, in vertical order on the screen. The code below shows an example of state held in the Buy Box section of the screen, along with its smaller Title sub-component.
```kotlin
data class BuyBox(
    val title: Title,
    val price: Price,
    val saleEndingSoonBadge: SaleEndingSoonBadge,
    val unitPricing: UnitPricing,
    val vatTaxDescription: VatTaxDescription,
    val transparentPricing: TransparentPricing,
    val firstVariation: Variation,
    val secondVariation: Variation,
    val klarnaInfo: KlarnaInfo,
    val freeShipping: FreeShipping,
    val estimatedDelivery: EstimatedDelivery,
    val quantity: Quantity,
    val personalization: Personalization,
    val expressCheckout: ExpressCheckout,
    val cartButton: CartButton,
    val termsAndConditions: TermsAndConditions,
    val ineligibleShipping: IneligibleShipping,
    val lottieNudge: LottieNudge,
    val listingSignalColumns: ListingSignalColumns,
    val shopBanner: ShopBanner,
)

data class Title(
    val text: String,
    val textInAlternateLanguage: String? = null,
    val isExpanded: Boolean = false,
) : ListingUiModel()
```

In our older architecture, the screen was based on a single scrollable View. All data was bound and rendered during the View's initial layout pass, which created a noticeable pause the first time the screen was loaded. In the new screen, a RecyclerView is backed by a ListAdapter, which allows for asynchronous diffs of the data changes, avoiding the need to rebind portions of the screen that aren't receiving updates. Each of the vertical elements on the screen (title, image gallery, price, etc.) is represented by its own ViewHolder, which binds whichever of the smaller data models the element relies on. In this code, the BuyBox is transformed into a vertical list of ListingUiModels to display in the RecyclerView.

```kotlin
fun BuyBox.toUiModels(): List<ListingUiModel> {
    return listOf(
        price,
        title,
        shopBanner,
        listingSignalColumns,
        unitPricing,
        vatTaxDescription,
        transparentPricing,
        klarnaInfo,
        estimatedDelivery,
        firstVariation,
        secondVariation,
        quantity,
        personalization,
        ineligibleShipping,
        cartButton,
        expressCheckout,
        termsAndConditions,
        lottieNudge,
    )
}
```

An Event dispatching system handles user actions, which are represented by a sealed Event class. The use of sealed classes for Events, coupled with Kotlin "when" statements mapping Events to Handlers, provides compile-time safety to ensure all of the pieces are in place to handle the Event properly. These Events are fed to a single Dispatcher queue, which is responsible for routing Events to the Handlers that are registered to receive them. Handlers perform a variety of tasks: starting asynchronous network calls, dispatching more Events, dispatching SideEffects, or updating State. We want to make it easy to reason about what Handlers are doing, so our architecture promotes keeping their scope of responsibility as small as possible. Simple Handlers are simple to write tests for, which leads to better test coverage and improved developer confidence. In the example below, a click handler on the listing title sets a State property that tells the UI to display an expanded title:

```kotlin
class TitleClickedHandler constructor() {
    fun handle(state: ListingViewState.Listing): ListingEventResult.StateChange {
        val buyBox = state.buyBox
        return ListingEventResult.StateChange(
            state = state.copy(
                buyBox = buyBox.copy(
                    title = buyBox.title.copy(isExpanded = true)
                )
            )
        )
    }
}
```

SideEffects are a special type of Event used to represent, typically, one-time operations that need to interact with the UI but aren't considered pure business logic: showing dialogs, logging events, performing navigation or showing Snackbar messages. SideEffects end up being routed to the Fragment to be handled. Take the scenario of a user clicking on a listing's Add to Cart button.
The Handler for that Event might:

- dispatch a SideEffect to log the button click
- start an asynchronous network call to update the user's cart
- update the State to show a loading indicator while the cart update finishes

While the network call is running on a background thread, the Dispatcher is free to handle other Events that may be in the queue. When the network call completes in the background, a new Event will be dispatched with either a success or failure result. A different Handler is then responsible for handling both the success and failure Events. This diagram illustrates the flow of Events, SideEffects, and State through the architecture:

Figure 1. A flow chart illustrating system components (blue boxes) and how events and state changes (yellow boxes) flow between them.

Results

The rewrite process took five months, with as many as five Android developers working on the project at once. One challenge we faced along the way was keeping the new listing screen up to date with all of the experiments being run on the old listing screen while development was in progress. The team also had to create a suite of tests that could comprehensively cover the diversity of listings available on Etsy, to ensure that we didn't forget or break any features.

With the rewrite complete, the team ran an A/B experiment against the existing listing screen to test both performance and user behavior between the two versions. Though the new listing screen felt qualitatively quicker than the old listing screen, we wanted to understand how users would react to subtle changes in the new experience. We instrumented both the old and the new listing screens to measure performance changes from the refactor.

The new screen performed even better than expected. Time to First Content decreased by 18%, going from 1585 ms down to 1298 ms. This speedup resulted in the average number of listings viewed by buyers increasing 2.4%, add to carts increasing 0.43%, searches increasing by 2%, and buyer review photo views increasing by 3.3%.

On the developer side, unit test coverage increased from single-digit percentages to a whopping 76% code coverage of business logic classes. This strongly validates our decision to put nearly all business logic into Handler classes, each responsible for handling just a single Event at a time. We built a robust collection of tools for generating testing States in a variety of common configurations, so writing unit tests for the Handlers is as simple as generating an input event and validating that the correct State and SideEffects are produced.

Creating any new architecture involves making tradeoffs, and this project was no exception. Macramé is under active development, and we have a few pieces of feedback on our agenda to be addressed:

- There is some amount of boilerplate still needed to correctly wire up a new Event and Handler, and we'd like to make that go away.
- The ability of Handlers to dispatch their own Events sometimes makes debugging complex Handler interactions more difficult than previous formulations of the same business logic.
- On a relatively simple screen, the architecture can feel like overkill.

Adding new features correctly to the listing screen is now the easy thing to do.
The dual benefit of increasing business metrics while also increasing developer productivity and satisfaction has resulted in the Android team expanding the usage of Macramé to two more key screens in the app (Cart and Shop), both of which completely rewrote their UI using Jetpack Compose. But those are topics for future Code as Craft posts.
Balancing Engineering Ambition with Product Realism

Introduction

In July of 2023, Etsy's App Updates team, responsible for the Updates feed in Etsy's mobile apps, set off with an ambitious goal: to revamp the Updates tab to become Deals, a home for a shopper's coupons and sales, in time for Cyber Week 2023.

The Updates tab had been around for years, and in the course of its evolution ended up serving multiple purposes. It was a hub for updates about a user's favorite shops and listings, but it was also increasingly a place to start new shopping journeys. Not all updates were created equal. The most acted-upon updates in the tab were coupons offered for abandoned cart items, which shoppers loved. We spotted an opportunity to clarify intentions for our users: by refactoring favorite-based updates into the Favorites tab, and (more boldly) by recentering Updates and transforming it into a hub for a buyer's deals.

Technical Opportunity

While investigating the best way to move forward with the Deals implementation, iOS engineers on the team advocated for developing a new tab from the ground up. Although it meant greater initial design and architecture effort, an entirely new tab built on modern patterns would let us avoid relying on Objective-C, as well as on internal frameworks like SDL (server-driven layout), which is present in many legacy Etsy app screens, comes with a variety of scalability and performance issues, and is in the process of being phased out.

At the same time, we needed a shippable product by October. Black Friday and Cyber Week loomed on the horizon in November, and it would be a missed opportunity, for us and for our users, not to have the Deals tab ready to go. Our ambition to use modern, not yet road-tested technologies would have to be balanced with realism about the needs of the product, and we were conscious of maintaining that balance throughout the course of development.

In come SwiftUI and Tuist!

Two new frameworks were front of mind when starting this project: SwiftUI and Tuist. SwiftUI provides a clear, declarative framework for UI development, and makes it easy for engineers to break down views into small, reusable components. Maybe SwiftUI's biggest benefit is its built-in view previews: in tandem with componentization, it becomes a very straightforward process to build a view out of smaller pieces and preview it at every step of the way. Our team had experimented with SwiftUI in the past, but with scopes limited to small views, such as headers. Confident as we were about the framework, we expected that building out a whole screen in SwiftUI would present us with some initial hurdles to overcome.

In fact, one hurdle presented itself right away. In a decade-old codebase, not everything is optimized for use with newer technologies. The build times we saw for our SwiftUI previews, which were almost long enough to negate the framework's other benefits, testified to that fact. This is where Tuist comes in. Our App Enablement team, which has been hard at work over the past few years modernizing the Etsy codebase, has adopted Tuist as a way of taming the monolith by making it modular. Any engineer at Etsy can declare a Tuist module in their project and start working on it, importing parts of the larger codebase only as they need them. (For more on Etsy's usage of Tuist, check out this article by Mike Simons from the App Enablement team.)
Moving our work for the Deals tab into a Swift-based Tuist module gave us what it took to make a preview-driven development process practical: our previews build nearly instantly, and so long as we're only making changes in our framework the app recompiles with very little delay.

Figure 1. A view of a goal end state of a modular Etsy codebase, with a first layer of core modules (in blue), and a second layer of client-facing modules that combine to build the Etsy app.

Our architecture

The Deals tab comprises a number of modules for any given Etsy user, including a Deals Just for You module with abandoned cart coupons, and a module that shows a user their favorite listings that are on sale. Since the screen is just a list of modules, the API returns them as an array of typed items with the following structure:

{
    "type": "",
    "": { ... }
}

Assigning each module a type enables us to parse it correctly on the client, and moves us away from the anonymous component-based API models we had used in the past. Many models are still used across modules, however. These include, but are not limited to, buttons, headers and footers, and listing cards. To parse a new module, we either build a new component if one doesn't exist yet, or reuse one that does. Adding a footer to a module, for example, can be as simple as:

// Model
{
    "type": "my_module",
    "my_module": {
        "target_listing": { },
        "recommended_listings": [ ],
        "footer": { } // Add footer here
    }
}

// View
var body: some View {
    VStack {
        ListingView(listing: targetListing)
        ListingCarouselView(listings: recommendedListings)
        MyFooterView(footer: footer) // Add footer here
    }
}

We also used Decodable implementations for our API parsing, leading to faster, clearer code and an easier way to handle optionals. With Etsy's internal APIv3 framework built on top of Apple's Decodable protocol, it is very straightforward to define a model, decide what is and isn't optional, and let the container handle the rest. For example, if the footer is optional but the target and recommended listings are required, decoding looks like this:

init(from decoder: Decoder) throws {
    let container = try decoder.containerV3(keyedBy: CodingKeys.self)

    // These will throw if they aren't included in the response
    self.targetListing = try container.requireV3(forKey: .targetListing)
    self.recommendedListings = try container.requireV3(forKey: .recommendedListings)

    // Footer is optional
    self.footer = container.decodeV3(forKey: .footer)
}

As for laying out the view on the screen, we used a SwiftUI List to make the most of the under-the-hood cell reuse that List provides.

Figure 2. On the left-hand side, a diagram of how the DealsUI view is embedded in the Etsy app. On the right-hand side, a diagram of how the DealsUI framework takes the API response and renders a list of module views with individual components.

Previews, previews, more previews

If we were going to maintain a good development pace, we needed to figure out a clean way to use SwiftUI previews. Previewing a small component, like a header that takes a string, is simple enough: just initialize the header view with the header string. For more complex views, though, it gets cumbersome to build a mock API response every time you need to preview. This complexity is only amplified when previewing an entire Deals module. To streamline the process, we decided to add a Previews enum to our more complex models. A good example of this is in the Deals Just for You coupon cards.
These cards display an image or an array of images, a few lines of custom text (depending on the coupon type), and a button. Our previews enum for this API model looks like this:

// In an extension to DealsForYouCard
enum Previews {
    static var shopCouponThreeImage: ResponseModels.DealsForYouCard {
        let titleText = "IrvingtonWoodworksStudio"
        let images = [...] // Three images
        let button = ResponseModels.Button(
            buttonText: "10% off shop",
            action: .init(...)
        )
        return ResponseModels.DealsForYouCard(
            button: button,
            saleBadge: "20% off",
            titleText: titleText,
            subtitleText: "Favorited shop",
            action: .init(...),
            images: images
        )
    }

    static var listingCoupon: ResponseModels.DealsForYouCard { ... }
}

Then previewing a variety of coupon cards is as straightforward as:

#Preview {
    DealsForYouCardView(coupon: .Previews.listingCoupon)
}

#Preview {
    DealsForYouCardView(coupon: .Previews.shopCouponThreeImage)
}

The other perk of this architecture is that it makes it very easy to nest previews, for example when previewing an entire module. To build preview data for the Deals for You module, we can use our coupon examples this way:

// In an extension to DealsForYouModule
enum Previews {
    static var mockModule: ResponseModels.DealsForYouModule {
        let items: [ResponseModels.DealsForYouCard] = [
            .Previews.listingCoupon,
            .Previews.shopCouponThreeImage,
            .Previews.shopCouponTwoImage
        ]
        let header = ResponseModels.DealsForYouHeader(title: "Deals just for you")
        return .init(header: header, items: items)
    }
}

These enums are brief, clear, and easy to understand; they allow us to lean into the benefits of modularity. This architecture, along with our Decodable models, also enabled us to clear a roadblock that used to occur when our team had to wait for API support before we could build new modules. For example, both the Similar Items on Sale and Extra Special Deals modules in the Deals tab were built via previews, and were ready approximately two weeks before the corresponding API work was complete, helping us meet deadlines and not have to wait for a new App Store release. By taking full advantage of SwiftUI's modularity and previewability, not only were we able to prove out a set of new technologies, we also exceeded product expectations by significantly beating our deadlines, even with the initial overhead of adopting the framework.

Challenges: UIKit interoperability

Particularly when it came to tasks like navigation and favoriting, interfacing between our module and the Etsy codebase could pose challenges. An assumption we had as engineers going into this project was that the code to open a listing page, for example, would just be readily available to use; this was not the case, however. Most navigation code within the Etsy codebase is handled by an Objective-C class called EtsyScreenController. While in the main target it's as straightforward as calling a function to open a listing page, that functionality was not available to us in our Deals module. One option would have been to build our own navigation logic using SwiftUI navigation stacks; we weren't trying to reinvent the wheel, however. To balance product deadlines and keep things as simple as possible, we decided not to be dogmatic, and to handle navigation outside of our framework. We did this by building a custom @Environment struct, called DealsAction, which passes off responsibility for navigation back to the main target, and uses Swift's callAsFunction() feature so we can treat this struct like a function in our views.
We have a concept of a DealsAction type in our API response, which enables us to match an action with an actionable part of the screen. For example, a button response has an action that will be executed when a user taps the button. The DealsAction handler takes that action, and uses our existing UIKit code to perform it. The Deals tab is wrapped in a UIHostingController in the main Etsy target, so when setting up the SwiftUI view, we also set the DealsAction environment object using a custom view modifier:

let dealsView = DealsView()
    .handleDealsAction { [weak self] in
        self?.handleAction(action: $0)
    }

...

func handleDealsAction(action: DealsAction) {
    // UIKit code to execute action
}

Then, when we need to perform an action from a SwiftUI view, the action handler is present at any layer of the view hierarchy within the Deals tab. Performing the action is as simple as:

@Environment(\.handleDealsAction) var handleDealsAction: DealsAction

...

MyButton(title: buttonText, fillWidth: false) {
    handleDealsAction(model.button?.action)
}

We reused this pattern for other existing functionality that was only available in the main target: for example, we built environment objects for favoriting listings, following a shop, and logging performance metrics. This pattern allows us to include environment objects as needed, and it simplifies adding action handling to any view. Instead of rebuilding this functionality in our Tuist module in pure Swift, which could have taken multiple sprints, we struck a balance between modernization and the need to meet product deadlines.

Challenges: Listing Cards

The listing card view is a common component used across multiple screens within the Etsy app. This component was originally written in Objective-C and throughout the years has been enhanced to support multiple configurations and layouts, and to be available for A/B testing. It also has built-in functionality like favoriting, which engineers shouldn't have to reimplement each time they want to present a listing card.

Figure 3. A diagram of how listing card views are conventionally built in UIKit, using configuration options and the analytics framework to combine various UIKit subviews.

It's been our practice to reuse this same single component and make small modifications to support changes in the UI, as per product or experimentation requirements. This means that many functionalities, such as favoriting, long-press menus, and image manipulation, are heavily coupled with this single component, many parts of which are still written in Objective-C. Early in the process of developing the new tab, we decided to scope out a way of supporting conventional listing card designs (ones that matched existing cards elsewhere in the app) without having to rebuild the entire card component in SwiftUI. We knew a rebuild would eventually be necessary, since we expected to have to support listing cards that differed significantly from the standard designs, but the scope of such a rebuild was a known unknown. To balance our deadlines, we decided to push this more ambitious goal back until we knew we had product bandwidth.

Since the listing card view is heavily coupled with old parts of the codebase, however, it wasn't as simple as just typing import ListingCard and flying along. We faced two challenges: first, the API model for a listing card couldn't be imported into our module, and second, the view couldn't be imported for simple use in a UIViewRepresentable wrapper.
To address these, we deferred responsibility back up to the UIKit view controller. Our models for a listing card component look something like this:

struct ListingCard {
    public let listingCardWrapper: ListingCardWrapper
    let listingCard: TypedListingCard
}

The model is parsed in two ways: as a wrapper, where it is parsed as an untyped dictionary that will eventually be used to initialize our legacy listing card model, and as a TypedListingCard, which is used only within the Deals tab module.

Figure 4. A diagram of how a UIKit listing card builder is passed from the main target to the Deals framework for rendering listing cards.

To build the listing card view, we pass a view builder to the SwiftUI DealsView initializer in the hosting controller code. Here, we are in the full Etsy app codebase, meaning that we have access to the legacy listing card code. When we need to build a listing card, we use this view builder as follows:

var body: some View {
    LazyVGrid(...) {
        ForEach(listings) { listing in
            cardViewBuilder(listing) // Returns a UIViewRepresentable
        }
    }
}

There was some initial overhead involved in getting these cards set up, but it was worth it to guarantee that engineering unknowns in a SwiftUI rewrite wouldn't block us and compromise our deadlines. Once built, the support for legacy cards enabled us to reuse them easily wherever they were needed. In fact, legacy support was one of the things that helped us move faster than we expected, and it became possible to stretch ourselves and build at least some listing cards in the Deals tab entirely in SwiftUI. Writing the wrapper ultimately gave us the space we needed to avoid having to rely solely on the wrapper!

Conclusion

After just three months of engineering work, the Deals tab was built and ready to go, even beating product deadlines. It took some engineering effort to overcome the initial hurdles, as well as the context switch from working in UIKit in the main target to working in SwiftUI in our own framework. But once we had solutions to those challenges and could really take advantage of the new architecture, we saw a very substantial increase in our engineering velocity. Instead of taking multiple sprints to build, new modules could take just one sprint or less; front-end work was decoupled from API work using Previews, which meant we no longer had to wait for mock responses or even API support at all; and maybe most important, it was fun to use SwiftUI's clear and straightforward declarative UI building, and to see our changes in real time!

From a product perspective, the Deals tab was a great success: buyers converted their sessions more frequently, and we saw an increase in visits to the Etsy app. The tab was rolled out to all users in mid-October, and has seen significant engagement, particularly during Black Friday and Cyber Monday. By being bold and diving confidently into new frameworks that we expected to see benefits from, we improved the engineering experience and not just met but beat our product deadlines. More teams at Etsy are using SwiftUI and Tuist in their product work now, thanks to the success of our undertaking, sometimes using our patterns to work through hurdles, sometimes creating their own. We expect to see more of this kind of modernization make its way into the codebase. As we iterate on the Deals tab over the next year, and make it even easier for buyers to find the deals that mean the most to them, we look forward to continuing to work in the same spirit.
Special thanks to Vangeli Ontiveros for the diagrams in this article, and a huge shoutout to the whole App Deals team for their hard work on this project!
In the past, sellers were responsible for managing and fulfilling their own tax obligations. However, more and more jurisdictions are now requiring marketplaces such as Etsy to collect tax from buyers and remit it to the relevant authorities. Etsy now plays an active role in collecting tax from buyers and remitting it all over the world. In this post, I will walk you through our tax calculation infrastructure and how we have adapted to the ongoing increase in traffic and business needs over the years.

The tax calculation workflow

We determine tax whenever a buyer adds an item to their Etsy shopping cart. The tax determination is based on buyer and seller location, product category, and a set of tax rules and mappings. To handle the details of these calculations we partner with Vertex, and issue a call to their tax engine via the Quotation Request API to get the right amount to show in our buyer's cart. Vertex ensures accurate and efficient tax management and continuously updates the tax rules and rates for jurisdictions around the world. The two main calls we use are the Quotation Request and DistributeTaxRequest SOAP calls.

When the buyer proceeds to payment, an order is created, and we call back to Vertex with a DistributeTaxRequest, sending the order information and tax details. We sync information with Vertex through the order fulfillment lifecycle. To keep things up to date in case an order is canceled or a refund needs to be issued later on, we send the details of the cancellation or refund to the tax engine via DistributeTaxRequest. This ensures that when Vertex generates tax reports for us they will be based on a complete record of all the relevant transactions. Etsy collects the tax from buyers and remits that tax to the taxing authority, when required.

Generating tax details for reporting and audit purposes

Vertex comes with a variety of report formats out of the box, and gives us tools to define our own. When Etsy calls the Distribute Tax API, Vertex saves the information we pass to it as raw metadata in its tax journal database. A daily cron job in Vertex then moves this data to the transaction detail table, populating it with tax info. When reports and audit data are generated, we download these reports and import them into Etsy's big data environment, and the workflow completes.

Mapping the Etsy taxonomy to tax categories

Etsy maintains product categories to help our buyers find exactly the items they're looking for. To determine whether transactions are taxed or exempt it's not enough to know item prices and buyer locations: we have to map our product categories to Vertex's rule drivers. That was an effort involving not just engineering but also our tax and analytics teams, and with the wide range of Etsy taxonomy categories it was no small task.

Handling increased API traffic

Coping with the continuous increase in traffic and maintaining the best checkout experience without delays has been an ongoing challenge. Of the different upgrades we made, the most important were switching to multiple instances for Vertex calls, and shadow logging.

Multiple Instance upgrade

In our initial integration, we were using the same Vertex instance for Quotation and Distribute calls. And the same instance was responsible for generating the reports. This report generation started to affect our checkout experience. Reports are generally used by our tax team, who run them on a regular basis.
But on top of that, we also run daily reports to feed the data captured by Vertex back into our own systems for analytics purposes. We solved this by routing the Quotation calls to one instance and the Distribute calls to another. This helped maintain a clear separation of functionality and avoided interference between the two processes. We had to align the configurations between the instances as well. Splitting up the quotation and distribution calls also opened the door to horizontal scaling: we can now add as many instances of each type as we need and load balance requests between them. For example, when a request type has multiple instances available, we load balance between them using the cart_id for quotations and the receipt_id for distributes (i.e. cart_id % quotation_instance_count selects the quotation instance).

Shadow logging

Shadow logging requests helped us simulate stress on Vertex and monitor the checkout experience. We have used this technique multiple times in the past. Whenever we had situations like, for example, adding five hundred thousand more listings whose taxes would be passed through the Vertex engine, we were concerned that the increase in traffic might impact the buyer experience. To ensure it wouldn't, we tested for a period of time by slowly ramping up shadow requests to Vertex. "Shadow requests" are test requests that we send to Vertex from orders, but without applying the calculated tax details to buyers' carts. This simulates the load on Vertex while we monitor the cart checkout experience. Once we had done the shadowing and seen how well Vertex handled the increased traffic, we were confident that releasing the feature would not have any performance implications.

Conclusion

Given the increasing volume of traffic and data involved, we will have to keep improving our design to support it. We've also had to address analytics, reporting, configuration sync, and more in designing the system, but we'll leave that story for next time.
A little while ago, Etsy introduced a new feature in its iOS app that could place Etsy sellers' artwork on a user's wall using Apple's Augmented Reality (AR) tools. It let them visualize how a piece would look in their space, and even gave them an idea of its size options. When we launched the feature as a beta, it was only available in "wall art"-related categories, and after the initial rollout we were eager to expand it to work with more categories.

What differentiates Etsy is the nature of our sellers' unique items. Our sellers create offerings that can be personalized in any number of ways, and they often hand-make orders based on demand. Taking the same approach we did with wall art and attempting to show 3D models of millions of Etsy items – many of which could be further customized – would be a huge undertaking. Nevertheless, with inspiration from Etsy's Guiding Principles, we decided to dig deeper into the feature. What could we improve in the way it worked behind the scenes? What about it would make for a compelling extension into the rest of our vast marketplace?

We took steps to improve how we parse seller-provided data, and we used this data with Apple's AR technology to make it easy for Etsy users to understand the size and scale of an object they might want to buy. We decided we could make tape measures obsolete (or at least not quite as essential) for our home-decor shoppers by building an AR tool to let them visualize, conveniently, accurately, and with minimal effort, how an item would fit in their space.

Improving dimension parsing

In our original post on the wall art experience, we mentioned the complexity involved in doing things like inferring an item's dimensions from text in its description. Etsy allows sellers to add data about dimensions in a structured way when they create a listing, but that wasn't always the case, and some sellers still provide those details in places like the description or the item's title. The solution was to create a regex-based parser in the iOS App that would glean dimensions (width and height) by sifting through a small number of free-form fields (title, description, customization information, overview) looking for specific patterns. We were satisfied being able to catch most of the formats in which our sellers reported dimensions, handling variable positions of values and units (3 in x 5 in vs 3 x 5 in), different long and short names of units, special unit characters (', "), and so on, in all the different languages that Etsy supports.

Migrating our parsing functionality to the API backend was a first step towards making the AR measuring tool platform-independent, so we could bring it to our Android App as well. It would also be a help in development, since we could iterate improvements to our regex patterns faster, outside the app release schedule. And we'd get more consistent dimensions, because we'd be able to cache the results instead of having to parse them live on the client at each visit.

We knew that an extended AR experience would need to reliably show our users size options for items that had them, so we prioritized the effort to parse out dimensions from variations in listings. We sanitized free-form text input fields that might contain dimensions (inputs like title or description) so that we could catch a wider range of formats. (Several different characters can be used to write quotation marks, used as shorthand for inches and feet, and we needed to handle special characters for new lines, fraction ligatures like ½ or ¼, etc.)
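As a rough illustration of the kind of pattern matching involved (a simplified Python sketch, not Etsy's production parser; the unit vocabulary and pattern are hypothetical), something along these lines can pull a width and height out of a couple of the formats just mentioned:

import re

# Hypothetical, simplified unit vocabulary; the real parser covers many more
# spellings, languages, and special characters.
UNIT = r'(?:in(?:ch(?:es)?)?|cm|centimeters?|ft|feet|"|\')'
DIMENSIONS = re.compile(
    rf'(\d+(?:[.,]\d+)?)\s*({UNIT})?\s*x\s*(\d+(?:[.,]\d+)?)\s*({UNIT})',
    re.IGNORECASE,
)

for text in ["Poster, 3 x 5 in", "Print 12 in x 12 in (30 cm x 30 cm)"]:
    match = DIMENSIONS.search(text)
    if match:
        width, width_unit, height, height_unit = match.groups()
        # Fall back to the trailing unit when the first value has none ("3 x 5 in")
        print(width, width_unit or height_unit, "x", height, height_unit)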
Our regex pattern was split and updated so it could detect:

- Measurement units in plural forms (inches, feet, etc.);
- Incorrect spellings (e.g. "foots");
- Localization of measurement units in the languages spoken by Etsy's users ("meters", "metros", and "mètres" in English, Spanish, and French, respectively);
- Other formats in which dimensions are captured by sellers, like dimensions with unit conversions in parentheses (e.g. 12 in x 12 in (30 cm x 30 cm)) or with complex measurements in the imperial system (3'6").

Making our dimension parsing more robust and bringing it server-side had several ancillary benefits. We were able to maintain the functionality of our iOS app while removing a lot of client-side code, even in Etsy's App Clip, where size is a matter of utmost importance. And though regex processing isn't that processor-intensive, the view feature performed better once we implemented server-side caching of parsed dimensions. We figured we could even take the parsing offline (rather than parsing every listing on every visit) by running a backfill process to store dimensions in our database and deliver them to the App along with item details.

We found, thanks to decoupling our parser work from the App release cycle, that we were able to test hypotheses faster and iterate at a quicker pace. So we could proceed to some improvements that would have been quite resource-intensive if we had tried to implement them on the native app side. Sellers often provide dimensions in inconsistent units, for instance, or they might add the same data multiple times in different fields, when there are variations in properties like material or color. We worked out ways to de-duplicate this data during parsing, to minimize the number of size options we show users. (Though where we find dimensions that are specifically associated with variations, we make sure to retain them, since those will more directly correlate with offering prices.) And we made it possible to prioritize structured dimension data, where sellers have captured it in dedicated fields, as a more reliable source of truth than free-form parsing.

Measuring in 3D

The box

With this new and improved dimension data coming to us from the server, we had to figure out the right way to present it in 3D in iOS. The display needed to be intuitive, so our users would know more or less at a glance what the tool was and how to interact with it. Ultimately, we decided to present a rectangular prism-type object scaled to the parsed dimensions we have for a given listing. Apple's SceneKit framework – specifically its SCNBox class – is what creates this box, which of course we style with the Etsy Orange look. So that users understand the box's purpose, we make sure to display the length on each side. We use SceneKit's SCNNode class to create the pills displaying our measurements. Users drag or tap the measuring box to move it around, and it can rotate on all axes – all made possible by having a different animation for each type of rotation using SCNActions. Rotation is a must-have feature: when we place the measuring box in a user's space, we may not always be able to get the orientation correct. We might, as in the illustration below, place a side table vertically on the floor instead of horizontally. Our users would have a poor experience of the measuring tool if they couldn't adjust for that.
(Note that you may see some blinking yellow dots when you try out the AR experience: these are called feature points, and they're useful for debugging, to give us an idea of what surfaces are successfully being detected.)

Environment occlusion

In addition to ensuring the box would be scaled correctly, we wanted it to "sit" as realistically as possible in the real world, so we built in scene occlusion. When a user places the measuring box in a room with other furniture, scene occlusion lets it interact with real-life objects as if the box were actually there. Users get valuable information this way about how an item will fit in their space. Will that end table go between the wall and couch? Will it be tall enough to be visible from behind the couch? (As demonstrated below, the table will indeed be tall enough.)

Environment occlusion became a possibility with LiDAR, a method of determining depth using laser light. Although LiDAR has been around for a few decades, used to map everything from archeological sites to agricultural fields, Apple only included LiDAR scanners in iPhones and iPads beginning in 2020, with the 4th-generation iPad Pro and the iPhone 12 Pro. An iPhone's LiDAR scanner retrieves depth information from the area it scans and converts it into a series of vertices which connect to form a mesh (or surface). To add occlusion to our SceneKit-backed AR feature, we convert the mesh into a 3D object and place it (invisibly to the user) in the environment shown on their phone. As the LiDAR scanner measures more of the environment, we have more meshes to convert into objects and place in 3D.

The video below shows an AR session where, for debugging purposes, we assign a random color to the detected mesh objects. Each different colored outline shown over a real-world object represents a different mesh. Notice how, as we scan more of the room, the device adds more mesh objects as it continues drawing out the environment. The user's device uses these mesh objects to know when and how to occlude the measuring box. Essentially, these mesh objects help determine where the measuring box is relative to all the real-world items and surfaces it should respect.

Taking advantage of occlusion gives our users an especially realistic AR experience. In the side-by-side comparison below, the video on the left shows how mesh objects found in the environment determine what part of the measuring box will be hidden as the camera moves in front of the desk. The video on the right shows the exact same thing, but with the mesh objects hidden. (Left: mesh objects are visible. Right: mesh objects are hidden.)

Closing thoughts

This project took a 2D concept, our Wall View experience, and literally extended it into 3-dimensional space using Apple's newest AR tools. And though the preparatory work we did improving our dimension parser may not be anything to look at, without the consistency and accuracy of that parsed information this newly realistic and interactive tool would not have been possible. Nearly a million Etsy items now have real-size AR functionality added to them, viewed by tens of thousands of Etsy users every week. As our marketplace evolves and devices become more powerful, working on features like this only increases our appetite for more, and brings us closer to a marketplace that lets our users make the most informed decisions about their purchases, effortlessly. Special shoutout to Jacob Van Order and Siri McClean as well as the rest of our team for their work on this.
Introduction

Each year, Etsy hosts an event known as "CodeMosaic" - an internal hackathon in which Etsy admin propose and quickly build bold advances in our technology across a number of different themes. People across Etsy source ideas, organize into teams, and then have 2-3 days to build innovative proofs-of-concept that might deliver big wins for Etsy's buyers and sellers, or improve internal engineering systems and workflows. Besides being a ton of fun, CodeMosaic is a time for engineers to pilot novel ideas.

Our team's project this year was extremely ambitious - we wanted to build a system for stateful machine learning (ML) model training and online machine learning. While our ML pipelines are no stranger to streaming data, we currently don't have any models that learn in an online context - that is, that can have their weights updated in near-real time. Stateful training updates an already-trained ML model artifact incrementally, sparing the cost of retraining models from scratch. Online learning updates model weights in production rather than via batch processes. Combined, the two approaches can be extremely powerful. A study conducted by Grubhub in 2021 reported that a shift to stateful online learning saw up to a 45x reduction in costs with a 20% increase in metrics, and I'm all about saving money to make money.

Day 1 - Planning

Of course, building such a complex system would be no easy task. The ML pipelines we use to generate training data from user actions require a number of offline, scheduled batch jobs. As a result it takes quite a while, 40 hours at a minimum, for user actions to be reflected in a model's weights. To make this project a success over the course of three days, we needed to scope our work tightly across three streams:

- Real-time training data - the task here was to circumvent the batch jobs responsible for our current training data and get attributions (user actions) right from the source.
- A service to consume the data stream and learn incrementally - today, we heavily leverage TensorFlow for model training. We needed to be able to load a model's weights into memory, read data from a stream, update that model, and incrementally push it out to be served online (sketched below).
- Evaluation - we'd have to make a case for our approach by validating its performance benefits over our current batch processes.

No matter how much we limited the scope it wasn't going to be easy, but we broke into three subteams reflecting each track of work and began moving towards implementation.

Day 2 - Implementation

The real-time training data team began by looking far upstream of the batch jobs that compute training data - at Etsy's Beacon Main Kafka stream, which contains bot-filtered events. By using Kafka SQL and some real-time calls to our streaming feature platform, Rivulet, we figured we could put together a realistic approach to solving this part of the problem. Of course, as with all hackathon ideas it was easier said than done. Much of our feature data uses the binary Avro data format for serialization, and finding the proper schema for deserializing and joining this data was troublesome. The team spent most of the second day munging the data in an attempt to join all the proper sources across platforms. And though we weren't able to write the output to a new topic, the team actually did manage to join multiple data sources in a way that generated real-time training data!
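The second work stream is the kind of consumer loop sketched here. This is a minimal illustration only - the topic name, model paths, and record format are hypothetical, and it assumes the kafka-python client and an already-compiled Keras model rather than anything from the actual hackathon code:

import json
import tensorflow as tf
from kafka import KafkaConsumer  # kafka-python client

# Warm-start from the latest batch-trained artifact (hypothetical path).
model = tf.keras.models.load_model("models/ads_ranker_latest.keras")

consumer = KafkaConsumer(
    "realtime-training-data",                      # hypothetical topic name
    value_deserializer=lambda v: json.loads(v),
)

for i, record in enumerate(consumer):
    features = tf.constant([record.value["features"]])
    label = tf.constant([record.value["label"]])
    model.train_on_batch(features, label)          # stateful, incremental update
    if i % 10_000 == 0:
        model.save("models/ads_ranker_online.keras")  # periodically push the updated artifact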
Meanwhile the team focused on building the consumer service that would actually update the model faced a different kind of challenge: decision making. What type of model were we going to use? Knowing we weren't going to be able to use the actual training data stream yet, how would we mock it? Where and how often should we push new model artifacts out? After significant discussion, we decided to try using an Ad Ranking model, as we had an Ads ML engineer in our group and the Ads models take a long time to train - meaning we could squeeze a lot of benefit out of them by implementing continuous training. The engineers in the group began to structure code that pulled an older Ads model into memory and made incremental updates to the weights, satisfying the second requirement.

That meant that all we had left to handle was the most challenging task - evaluation. None of this architecture would mean anything if a model that was trained online performed worse than the model retrained daily in batch. Evaluating a model with more training periods is also more difficult, as each period we'd need to run the model on some held-out data in order to get an accurate reading without data leakage. Instead of performing an extremely laborious and time-intensive evaluation for continuous training like the one outlined above, we chose to have a bit more fun with it. After all, it was a hackathon! What if we made it a competition? Pick a single high-performing Etsy ad and see which model surfaced it first, our continuously trained model or the boring old batch-trained one? We figured if we could get a continuously trained model to recommend a high-performing ad sooner, we'd have done the job! So we set about searching for a high-performing Etsy ad and training data that would allow us to validate our work. Of course, by the time we were even deciding on an appropriate advertised listing, it was the end of day two, and it was pretty clear the idea wasn't going to play out before it was time for presentations. But still a fun thought, right?

Presentation takeaways and impact

Day 3 gives you a small window for tidying up work and slides, followed by team presentations. At this point, we loosely had these three things:

- Training data from much earlier in our batch processing pipelines
- A Kafka consumer that could almost update a TensorFlow model incrementally
- A few click attributions and data for a specific listing

In the hackathon spirit, we phoned it in and pivoted towards focusing on the theoretical impact of what we'd been able to achieve! The first important potential area of impact was cost savings. We estimated that removing the daily "cold-start" training and replacing it with continuous training would save about $212K annually in Google Cloud costs for the four models in Ads alone. This is a huge potential win - especially when coupled with the likely metrics gains coming from more reactive models. After all, if we were able to get events to models 40 hours earlier, who knows how much better our ranking could get!

Future directions and conclusion

Like many hackathon projects, this one faces no shortage of hurdles on the way to a production state. Aside from the infrastructure required to actually architect a continuous-training pipeline, we'd need a significant number of quality checks and balances to ensure that updating models in real time didn't lead to sudden degradations in performance.
The amount of development, the number of parties involved, and the breadth of expertise required to get this into production would surely be extensive. However, as ML continues to mature, we should be able to enable more complex architectures with less overhead.
Introduction Personalization is vital to connect our unique marketplace to the right buyer at the right time. Etsy has recently introduced a novel, general approach to personalizing ML models based on encoding and learning from short-term (one-hour) sequences of user actions through a reusable three-component deep learning module, the adSformer Diversifiable Personalization Module (ADPM). We describe in detail our method in our recent paper, with an emphasis on personalizing the CTR (clickthrough rate) and PCCVR (post-click conversion rate) ranking models we use in Etsy Ads. Here, we'd like to present a brief overview. Etsy offers its sellers the opportunity to place sponsored listings as a supplement to the organic results returned by Etsy search. For sellers and buyers alike, it’s important that those sponsored listings be as relevant to the user’s intent as possible. As Figure 1 suggests, when it comes to search, a “jacket” isn't always just any jacket: Figure 1: Ad results for the query jacket for a user who has recently interacted with mens leather jackets. In the top row, the results without personalized ranking; in the bottom row, the results with session personalization. For ads to be relevant, they need to be personalized. If we define a “session” as a one-hour shopping window, and make a histogram of the total number of listings viewed across a sample of sessions (Fig. 2), we see that a power law distribution emerges. The vast majority of users interact with only a small number of listings before leaving their sessions. Figure 2: A histogram of listing views in a user session. Most users see fewer than ten listings in a one-hour shopping window. Understood simply in terms of listing views, it might seem that session personalization would be an insurmountable challenge. To overcome this challenge we leverage a rich stream of user actions surrounding those views and communicating intent, for example: search queries, item favorites, views, add-to-carts, and purchases. Our rankers can optimize the shopping experience in the moment by utilizing streaming features being made available within seconds of these user actions. Consider a hypothetical sequence of lamps viewed by a buyer within the last hour. Figure 3: An example of a user session with the sequence of items viewed over time. 70s orange lamp ---> retro table lamp --> vintage mushroom lamp Not only is the buyer looking within a particular set of lamps (orange, mushroom-shaped), but they arrived at these lamps through a sequence of query refinements. The search content itself contains information about the visual and textual similarities between the listings, and the order in which the queries occur adds another dimension of information. The content and the sequence of events can be used together to infer what is driving the user’s current interest in lamps. adSformer Diversifiable Personalization Module The adSformer Diversifiable Personalization Module (ADPM), illustrated on the left hand side of Figure 4, is Etsy's solution for using temporal and content signals for session personalization. A dynamic representation of the user is generated from a sequence of the user's most recent streamed actions. The input sequence contains item IDs, queries issued and categories viewed. We consider the item IDs, queries, and categories as “entities” that have recent interactions within the session. 
For each of these entities we consider different types of actions within a user session (views, recent cart-adds, favorites, and purchases), and we encode each type of entity/action pair separately. This lets us capture fine-grained information about the user's interests in their current session.

Figure 4: On the left, a stack representing the ADPM architecture. The right part of the figure is a blown-out illustration of the adSformer Encoder component.

Through ablation studies we found that ADPM's three components work together symbiotically, outperforming experiments where each component is considered independently. Furthermore, in deployed applications, the diversity of learned signals improves robustness to input distribution shifts. It also leads to more relevant personalized results, because we understand the user from multiple perspectives. Here is how the three components operate:

Component One: The adSformer Encoder

The adSformer encoder component uses one or more custom adSformer blocks, illustrated in the right panel of Figure 4. This component learns a deep, expressive representation of the one-hour input sequence. The adSformer block modifies the standard transformer block from the attention literature by adding a final global max pooling layer. The pooling layer downsamples the block's outputs by extracting the most salient signals from the sequence representation, instead of outputting the fully concatenated standard transformer output. Formally, for a user and a one-hour sequence S of viewed item IDs, the adSformer encoder is defined as a stack of layers g(x), where x is the output of each previous layer and o1 is the component's output. The first layer is an embedding of item and position.

Component Two: Pretrained Representations

Component two employs pretrained embeddings of the item IDs that users have interacted with, together with average pooling, to encode the one-hour sequence of user actions. Depending on downstream performance and availability, we choose from multimodal (AIR) representations and visual representations. Thus component two encodes rich image, text, and multimodal signals from all the items in the sequence. The advantage of leveraging pretrained item embeddings is that these rich representations are learned efficiently offline, using complex deep learning architectures that would not be feasible online in real time. Formally, for a one-hour sequence of m1hr item IDs with pretrained d-dimensional embedding vectors e, we compute the sequence representation o2 as the average of those embedding vectors.

Component Three: Representations Learned "On the Fly"

The third component of ADPM introduces representations learned for each sequence from scratch, in its own vector space, as part of the downstream models. This component learns lightweight representations for many different sequences for which we do not have pretrained representations available, for example sequences of favorited shop IDs. Formally, for z one-hour sequences of entities acted upon, we learn embeddings for each entity and sequence in its own vector space, and pool them to obtain the component's output o3.

The intermediary outputs of the three components are concatenated to form the final ADPM output, the dynamic user representation u. This user representation is then concatenated to the input vector in various rankers or recommenders we want to personalize in real time.
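In rough notation (a reconstruction from the descriptions above; the paper gives the exact formulation), the second component's average pooling and the final concatenation can be written as:

o_2 = \frac{1}{m_{1hr}} \sum_{i=1}^{m_{1hr}} e_i, \qquad u = [\, o_1 \,;\, o_2 \,;\, o_3 \,]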
Formally, then, for one-hour variable-length sequences of user actions S, ADPM's component outputs o1, o2, and o3 are concatenated to produce the dynamic user representation u. From a software perspective, the module is implemented as a TensorFlow Keras module which can easily be employed in downstream models through a simple import statement.

Pretrained Representation Learning

The second component of the ADPM includes pretrained representations. We rely on several: image embeddings, text embeddings, and multimodal item representations.

Visual Representations

In Etsy Ads, we employ image signals across a variety of tasks, such as visually similar candidate generation, search by image, as inputs for learning other pretrained representations, and in the ADPM's second component. To effectively leverage the rich signal encoded in Etsy Ads images, we train image embeddings in a multitask classification learning paradigm. By using multiple classification heads, such as taxonomy, color, and material, our representations are able to capture more diverse information about the image. So far we have derived great benefit from our multitask visual embeddings, trained using a lightweight EfficientNetB0 architecture with ImageNet-pretrained weights as the backbone. We replaced the final layer with a 256-dimensional convolutional block, which becomes the output embedding. We apply random rotation, translation, zoom, and a color contrast transformation to augment the image dataset during training. We are currently in the process of updating the backbone architectures to efficient vision transformers to further improve the quality of the image representations and the benefits derived in downstream applications, including the ADPM.

Ads Information Retrieval Representations

Ads Information Retrieval (AIR) item representations encode an item ID through a metric learning approach, which aims to learn a distance function or similarity metric between two items. Standard approaches to metric learning include siamese networks, contrastive loss, and triplet loss. However, we found more interpretable results using a sampled in-batch softmax loss. This method treats each batch as a classification problem, pairing all the items in a batch that were co-clicked. A pseudo-two-tower architecture is used to encode the source items and candidate items, with all trainable weights shared across both towers. Each item tower captures and encodes information about an item's title, image, primary color, attributes, category, etc. This information diversity is key to our personalization outcomes. By leveraging a variety of data sources, the system can identify patterns and insights that would be missed by a more limited set of inputs.

ADPM-Personalized Sponsored Search

ADPM's effectiveness and generality are demonstrated in the way we use it to personalize the CTR prediction model in EtsyAds' Sponsored Search. The ADPM encodes reverse-chronological sequences of recent user actions (in the sliding one-hour window we've discussed), anywhere on Etsy, for both logged-in and logged-out users. We concatenate ADPM's output, the dynamic user representation, to the rest of the wide input vector in the CTR model. To fully leverage this even wider input vector, a deep and cross (DCN) interaction module is included in the overall CTR architecture. If we remove the DCN module, the CTR model's ROC-AUC drops by 1.17%. The architecture of the ADPM-personalized CTR prediction model employed by EtsyAds in sponsored search is given in Figure 5.
(We also employ the ADPM to personalize the PCCVR model with a similar architecture, which naturally led to ensembling the two models in a multitask architecture, a topic beyond the scope of this blog post.) Figure 5: An example of how the ADPM is used in a downstream ranking model The ADPM-personalized CTR and PCCVR models outperformed the CTR and PCCVR non-personalized production baselines by +2.66% and +2.42%, respectively, in offline Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following the robust online gains in A/B tests, we deployed the ADPM-personalized sponsored search system to 100% of traffic. Conclusion The adSformer diversifiable personalization module (ADPM) is a scalable, general approach to model personalization from short-term sequences of recent user actions. Its use in sponsored search to personalize our ranking and bidding models is a milestone for EtsyAds, and is delivering greater relevance in sponsored placements for the millions of buyers and sellers that Etsy's marketplace brings together. If you would like more details about ADPM, please see our paper.
Introduction The Feature Systems team at Etsy is responsible for the platform and services through which machine learning (ML) practitioners create, manage and consume feature data for their machine learning models. We recently made new real-time features available through our streaming feature platform, Rivulet, where we return things like “most recent add-to-carts.” While timeseries data itself wasn’t new to our system, these newer features from our streaming feature service would be the first timeseries inputs to be supplied to our ML models themselves to inform search, ads, and recommendations use cases. Not too long after we made these features available to users for ML model training, we received a message from Harshal, an ML practitioner on Recommendations, warning us of "major problems" lying in wait. Figure 1. A user message alerting us to the possibility of "major problems for downstream ML models" in our use of the timestamp datatype. Harshal told us our choice to export real-time features using a timestamp datatype was going to cause problems in downstream models. The training data that comes from our offline feature store uses the binary Avro file format, which has a logical type called timestamp we used to store these features, with an annotation specifying that they should be at the millisecond precision. The problem, we were being informed, is that this Avro logical type would be interpreted differently in different frameworks. Pandas, NumPy, and Spark would read our timestamps, served with millisecond precision, as datetime objects with nanosecond precision - creating the possibility of a training/serving skew. In order to prevent mismatches, and the risk they posed of silent failures in production, the recommendation was that we avoid the timestamp type entirely and serve our features as a more basic numeric data type, such as Longs. Getting to the root of the issue We started the way software engineers usually do, attempting to break down the problem and get to root causes. Before changing data types, we wanted to understand if the misinterpretation of the precision of the timestamp was an issue with Python, Spark, or even a misuse of the Avro timestamp annotation that we were using to specify the millisecond precision. We were hesitant to alter the data type of the feature without an in-depth investigation. After all, timestamp and datetime objects are typically passed around between systems precisely to resolve inconsistencies and improve communication. We started by attempting to put together a diagram of all the different ways that timestamp features were represented across our systems. The result was a diagram like this: Figure 2. A diagram of all the objects/interpretations of timestamp features across our systems. Though the user only ever sees microseconds, between system domains we see a diversity of representations. While it was clear Spark and other frameworks weren’t respecting the timestamp annotation that specified millisecond precision, we began to realize that that problem was actually a symptom of a larger issue for our ML practitioners. Timestamp features can take a number of different forms before finally being passed into a model. In itself this isn't really surprising. Every type is language-specific in one way or another - the diagram would look similar if we were going to be serializing integers in Scala and deserializing integers in Python. 
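To make the original concern concrete before going further, here is a toy illustration (not our pipeline code): a value served as milliseconds since the epoch becomes a nanosecond-resolution datetime the moment it passes through pandas.

import pandas as pd

served_value = 1_700_000_000_123                    # unix timestamp, millisecond precision
as_datetime = pd.to_datetime(served_value, unit="ms")

print(as_datetime)        # 2023-11-14 22:13:20.123000
print(as_datetime.value)  # 1700000000123000000 -- stored internally as nanoseconds

# Serving the raw Long sidesteps the ambiguity: training and inference both see
# exactly 1_700_000_000_123, with no implicit unit conversion.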
However, the overall disparity between objects is much greater for complex datetime objects than it is for basic data types. There is simply more room for interpretation with datetime objects, and less certainty about how they translate across system boundaries, and for our use case in training ML models, uncertainty was exactly what we did not want. As we dug deeper into the question, it started to become clear that we weren’t trying to resolve a specific bug or issue, but reduce the amount of toil for ML practitioners who would be consuming timestamp features long-term. While the ISO-8601 format is massively helpful for sharing datetime and timestamp objects across different systems, it’s less helpful when all you’re looking for is an integer representation at a specific precision. Since these timestamps were features of a machine learning model, we didn’t need all the complexity that datetime objects and timestamp types offered across systems. The use case for this information was to be fed as an integer of a specific precision into an ML model, and nothing more. Storing timestamps as logical types increased cognitive overhead for ML practitioners and introduced additional risk that training with the wrong precision could degrade model quality during inference. Takeaways This small request bubbled into a much larger discussion during one of our organization’s architecture working group meetings. Although folks were initially hesitant to change the type of these features, by the end of the meeting there was a broad consensus that it would be desirable to represent datetime features in our system as a primitive numeric type (unix timestamps with millisecond precision) to promote consistency between model training and inference. Given the wide range of training contexts that all of our features are used in, we decided to standardize on primitive types more generally. Members of the Feature Systems team also expressed a desire to improve documentation around how features are transformed end-to-end throughout the current system to make things easier for customers in the future. We designed our ML features with abstraction and interoperability in mind, as software engineers do. It’s not that ML isn’t a software engineering practice, but that it’s a domain in which the specific needs of ML software didn’t match our mental model of best practices for the system. Although ML has been around for some time, the rapidly-changing nature of the space means the nuances of many ML-specific guidelines are still ill-defined. I imagine this small cross-section of difficulty in applying software engineering practices to ML will be the first of many as ML continues its trajectory through software systems of all shapes and sizes.
Etsy announced the arrival of a powerful new image-based discovery tool on Etsy’s mobile apps. The ability to search by image gives buyers the opportunity to search the Etsy marketplace using their own photos as a reference. Tap the camera icon in the search bar to take a picture, and in a fraction of a second we’ll surface visually similar results from our inventory of nearly 100 million listings. Searching by image is being rapidly adopted throughout the e-commerce industry, and nowhere does it make more sense than on Etsy, where the uniqueness of our sellers’ creations can’t always be expressed easily with words. In this post we’ll give you a look at the machine-learning architecture behind our search by image feature and the work we did to evolve it. Overview In order to search a dataset of images using another image as the query, we first need to convert all those images into a searchable representation. Such a representation is called an embedding, which is a dense vector existing in some relatively low n-dimensional shared space. Once we have the embedding of our query image, and given the precomputed embeddings for our dataset of listing images, we can use any geometric distance metric to look up the closest set of listings to our query. This type of search algorithm is often referred to as a nearest-neighbor search. Figure 1. A plot that shows embeddings from a random sample of a thousand Etsy images. The embeddings have been reduced to three dimensions so that they can be plotted. In this embedding space, bags and purses are embedded near each other. Separate clusters form around craft supply images on the left and home goods images on the right. At a high level, the visual retrieval system works by using a machine learning model to convert every listing’s image into an embedding. The embeddings are then indexed into an approximate nearest-neighbor (ANN) system which scores a query image for similarity against Etsy's image embeddings in a matter of milliseconds. Multitask Vision Model To convert images to embeddings we use a convolutional neural network (CNN) that has been trained on Etsy data. We can break our approach into three components: the model architecture, the learning objective, and the dataset. Model Architecture Training the entire CNN from scratch can be costly. It is also unnecessary as the early layers of a pretrained CNN can be shared and reused across new model tasks. We leveraged a pre-trained model and applied transfer learning to fine-tune it on Etsy data. Our approach was to download pre-trained weights into the model and replace the “head” of the model with one for our specific task. During training, we then “freeze” most of the pre-trained weights, and only optimize those for the new classification head as well as for the last few layers of the CNN. The particular pre-trained model we used is called EfficientNet: a family of CNNs that have been tuned in terms of width, depth, and resolution, all to achieve optimal tradeoffs between accuracy and efficiency. Learning Objective A proven approach to learning useful embeddings is to train a model on a classification task as a proxy. Then, at prediction time, extracting the penultimate layer just before the classification head produces an embedding instead of a classification probability. Our first attempt at learning image embeddings was to train a model to categorize product images. Not surprisingly, our tests showed that these embeddings were particularly useful in surfacing listings from the same taxonomy. 
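In code, this classification-as-proxy setup, a mostly frozen pre-trained backbone with a new head, looks roughly like the Keras sketch below. The input size, layer counts and head size are illustrative rather than our production configuration; the embedder model is what produces the embeddings evaluated above.

```python
import tensorflow as tf

NUM_CATEGORIES = 1000  # illustrative, not Etsy's real taxonomy size

# Pre-trained EfficientNet backbone, minus its original classification head
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg"
)

# Freeze most of the pre-trained weights; leave the last few layers trainable
for layer in base.layers[:-20]:
    layer.trainable = False

inputs = tf.keras.Input(shape=(224, 224, 3))
embedding = base(inputs)  # pooled penultimate activations: this is the embedding we keep
outputs = tf.keras.layers.Dense(NUM_CATEGORIES, activation="softmax")(embedding)

classifier = tf.keras.Model(inputs, outputs)  # trained on the proxy classification task
embedder = tf.keras.Model(inputs, embedding)  # used at prediction time to emit embeddings
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```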
Often though the results were not “visually cohesive”: items were surfaced that didn't match well with the query image in color, material or pattern. To mitigate this problem we switched to a deep metric learning approach utilizing triplet loss. In this approach, the model is trained on triplets of examples where each triplet consists of an anchor, a positive example, and a negative example. After generating an embedding for each of the three examples, the triplet loss function tries to push the anchor and positive examples closer together, while pushing the negative example farther away. In our case, we used pairs of images from the same item as the anchor and positive examples, and an image from a different listing as the negative example. With these triplet embeddings, tests showed that our listings were now visually cohesive, displaying similar colors and patterns. Figure 2. The top row shows a query image. Middle row is a sample of nearest neighbors from image classification learned embeddings. Bottom row is a sample of nearest neighbors from triplet embeddings. The triplet embeddings show improved color and pattern consistency over image classification. But these embeddings lacked categorical accuracy compared to the classification approach. And the training metrics for the triplet approach offered less observability, which made it harder for us to assess the model's learning progress than with classification. Figure 3. While triplet metrics mostly revolve around the change in distances between the anchor and positive/negative examples, classification provides accuracy metrics that can be a proxy for how well the model fares in the task’s specific domain, and are simpler to reason about Taxonomy is not something we can tolerate our model being sloppy about. Since classification had already proven its ability to retrieve items of the same type, we decided to see if a multitask classification approach could be made to produce visually consistent results. Instead of having a single classification head on top of the pre-trained model, we attach separate heads for multiple categorization tasks: Item category (e.g. accessories, home & living), fine-grained item category (belt buckles, dining chairs), primary color, and other item attributes. Loss and training/evaluation metrics are then computed for each task individually while the embedding weights are shared across all of them. One challenge we faced in implementing this approach was that some optional seller input attributes, such as item color and material, can be sparse. The more tasks we added, the harder it was for us to sample training data equally across classes. To overcome this limitation we implemented a data sampler that reads from entirely disjoint datasets, one for each task, and each with its own unique set of labels. At training time, the sampler combines an equal number of examples from each dataset into every minibatch. All examples then pass through the model, but loss from each classification head is calculated only for examples from the head’s corresponding source dataset. Figure 4. Multitask learning architecture Returning to the classification paradigm meant that we could once again rely on accuracy metrics as a proxy for gauging and comparing models’ understanding of the domain of each task. This greatly simplifies the process of iterating and improving the model. Dataset The embeddings produced by multitask classification were now encapsulating more information about the visual attributes we added tasks for. 
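Stepping back to the sampler for a moment, the per-head loss masking it relies on can be sketched in a few lines. The task names and shapes here are illustrative, not our production code:

```python
import tensorflow as tf

TASKS = ["category", "fine_category", "color", "review_photo"]  # illustrative head names

def multitask_loss(labels, logits_per_task, task_ids):
    """Sum of per-head losses, where each head only sees examples drawn from
    its own source dataset (task_ids records which dataset each example came from)."""
    total = 0.0
    for i, task in enumerate(TASKS):
        is_this_task = tf.equal(task_ids, i)
        mask = tf.cast(is_this_task, tf.float32)
        # Labels from other tasks may be out of range for this head, so swap in 0;
        # the mask zeroes out their contribution to the loss anyway.
        safe_labels = tf.where(is_this_task, labels, tf.zeros_like(labels))
        per_example = tf.keras.losses.sparse_categorical_crossentropy(
            safe_labels, logits_per_task[task], from_logits=True
        )
        total += tf.reduce_sum(per_example * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
    return total
```

Every example still flows through the shared backbone, so all heads shape the one embedding, but no head is penalized on labels it never had.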
With the embeddings now encapsulating those attributes, when we searched using an embedding of some Etsy listing image, our results were both categorically and visually cohesive. However, when we talk about search by image, we're not expecting users to search from Etsy listing photos - the whole point of this is that the user is holding a camera. Photos uploaded by sellers are typically high quality, have professional lighting, and are taken against a white or deliberately staged background. But photos from a user's phone might be blurry, or poorly lit, or on a diversity of backgrounds that distract from the object the user is searching for. Deep learning is a powerful and useful tool, but training deep learning models is highly susceptible to biases in the data distribution, and training on seller-provided product images was biasing us away from user photos. Fortunately, Etsy allows users to post reviews of items they’ve purchased, and those reviews can have photos attached: photos taken by buyers, often with their phone cameras, very much like the images we expect to see used when searching by image. So we plugged in an additional classification task using the dataset of review photos, expanding the distribution of images our model learns about in training. And indeed with this new component in place we saw significant improvement in the model’s ability to surface visually relevant results. Figure 5. Multitask learning architecture with the added review photos dataset and classification head Inference Pipeline and Serving Our inference pipeline is an orchestrated ensemble of data processing jobs that turns the entire Etsy inventory of nearly 100M active listings into a searchable index. We construct an approximate nearest neighbor (ANN) index using an inverted file (IVF) algorithm. The IVF algorithm divides the embedding space into clusters of listings. Later, at query time, we only look at the nearest subset of clusters to the query embedding, which greatly reduces search latency while only marginally impairing accuracy. Figure 6. DAG tasks to generate ANN index While the listing images are indexed in batch offline, the query photo is taken by users on the fly, so we have to run inference on it in real time - and fast. Due to the size of CNN models, inference can take a long time on a CPU. To overcome this hurdle we partnered with Etsy’s ML platform team to bring the first use case of real-time GPU inferencing at Etsy. Figure 7. Request flow We hope this feature gives even more of our buyers a new way to find exactly what they’re looking for on Etsy. So, the next time you come across something you love, snap a photo and search by image on Etsy! Acknowledgements This project started as part of CodeMosaic, Etsy’s annual “hackathon” week, where engineers can practice new skills and drive projects not necessarily related to their day-to-day work. We’re proud that we were able to take a proof-of-concept hackathon project, and turn it into a production feature to help make the millions of unique and special items on Etsy more discoverable for buyers. In particular, we’d like to thank the App Core Experience team for taking a chance and prioritizing this project. We could not have done it without the buy-in of leadership and help from our engineering enablement teams. More on this in our next article!
There are more than 100 million unique listings on Etsy, so we provide buyers recommendations to help them find that one special item that stands out to them. Recommendations are ubiquitous across Etsy, tailored for different stages of a user's shopping mission. We call each recommendation set a module, and there are hundreds of them both on the web and on mobile apps. These help users find trending items, pick up shopping from where they left off, or discover new content and interests based on their prior activity. Modules in an enterprise-scale recommendation system usually work in two phases: candidate set selection and candidate set ranking. In the candidate set selection phase, the objective is to retrieve a small set of relevant items out of the entire inventory, as quickly as possible. The second phase then ranks the items in the candidate set using a more sophisticated machine learning model, typically with an emphasis on the user's current shopping mission, and decides on the best few items to offer as recommendations. We call these models rankers and that's the focus of this post. Figure 1. Two recommendation modules sharing the same item page. The "more from this shop" module recommends similar items from the shop the user is currently looking at; "you may also like" finds relevant items from across Etsy shops. Rankers score candidate items for relevance based on both contextual attributes, such as the user's recent purchases and most clicked categories, as well as item attributes, such as an item's title and taxonomy. (In machine learning we refer to such attributes as features.) Rankers are optimized against a specific user engagement metric–clickthrough rate, for example, or conversion rate–and trained on users' interactions with the recommendation module, which is known as implicit feedback. Etsy has historically powered its recommendation modules on a one-to-one basis: one ranker for each module, trained exclusively on data collected from that module. This approach made it easy to recommend relevant items for different business purposes, but as we got into the hundreds of modules it became burdensome. On the engineering side, the cost of maintaining and iterating on so many rankers, running hundreds of daily pipelines in the process, is prohibitive. And as it becomes harder to iterate, we could lose opportunities to incorporate new features and best practices in our rankers. Without a solution, eventually the quality of our recommendations could degrade and actually do harm to the user experience. To address this potential problem, we pivoted to what we call canonical rankers. As with single-purpose rankers, these are models optimized for a particular user-engagement metric, but the intention is to train them so that they can power multiple modules. We expect these rankers to perform at least on par with module-specific rankers, while at the same time being more efficient computationally, and less costly to train and maintain. A Canonical Frequency Ranker We want Etsy to be not just an occasional stop for our users but a go-to destination, and that means paying attention to what will inspire future shopping missions after a user is finished with their current one. Our first canonical ranker was focused on visit frequency. 
We wanted to be able to identify latent user interests and surface recommendations that could expose a user to the breadth of inventory on Etsy at moments that might impact a decision to return to the site: for example, showing them complementary items right after their purchase, to suggest a new shopping journey ahead. Data and goal of the model Compared to metrics like conversion rate, revisit frequency is difficult to optimize for: there are no direct and immediate signals within a given visit session to indicate that a buyer is likely to return. There are, however, a multitude of ways for an Etsy user to interact with one of our items: they can click it, favorite it, add it to a collection or to a shopping cart, and of course they can purchase it. Of all of these, data analysis suggests that favoriting is most closely related to a user's intention to come back to the site, so we decided that our ranker would optimize on favorite rate as the best available surrogate for revisit frequency. Favoriting doesn't always follow the same pattern as purchasing, though. And we needed to be wary of the possibility that favorites-based recommendations, not being closely enough related to what a user wanted to buy in their current session, might create a distraction and could actually jeopardize sales. It would be important for us to keep an eye on purchase rate as we developed our frequency ranker. We also had to find appropriate modules to provide training data for the new ranker. There are a lot of them, on both mobile and web, and they appear in a lot of different user contexts. Most mobile app users are signed in, for example, while a large proportion of desktop users aren't. Interaction patterns are different for different modules: some users land on an item page via external search, or from a Google ad, and they tend to look for specific items, whereas habitual mobile app users are more exploratory. It was important for us that the few modules we trained the frequency ranker on should be as representative as possible of the data generated by these many different modules occurring on many different pages and platforms. We wanted to be confident that our ranker would really be canonical, and able to generalize from its training set to power a much wider range of modules. Model structure The requirement that our ranker should optimize for favorites, while at the same time not reducing purchase rate, naturally lent itself to a multi-task learning framework. For a given item we want to predict both the probability of the item being favorited and that of it being purchased. The two scores are then combined to produce the final ranking score. This sort of framework is not directly supported by the tree-based models that have been powering Etsy's recommendations in the past. However, neural models have many advantages over tree-based models and one of them is their ability to handle multi-task architectures. So it was a natural call to build our frequency ranker on a neural model. Favoriting and purchasing are obviously not completely unrelated tasks, so it is reasonable to assume that they share some common factors. This assumption suggests a shared-bottom structure, as illustrated on the right in the figure below: what the tasks have in common is expressed in shared layers at the bottom of the network, and they diverge into separate layers towards the top. A challenge that arises is to balance the two tasks. 
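Before getting into how we balanced the two tasks, here is a rough Keras sketch of the shared-bottom idea. The layer sizes and task weights are illustrative only, not our production configuration:

```python
import tensorflow as tf

NUM_FEATURES = 256     # illustrative: concatenated user, item and context features
FAVORITE_WEIGHT = 0.7  # illustrative task weights; in practice these are tuned carefully
PURCHASE_WEIGHT = 0.3

features = tf.keras.Input(shape=(NUM_FEATURES,))

# Shared bottom: layers common to both tasks
shared = tf.keras.layers.Dense(128, activation="relu")(features)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

# Task-specific towers diverge toward the top of the network
p_favorite = tf.keras.layers.Dense(1, activation="sigmoid", name="favorite")(
    tf.keras.layers.Dense(32, activation="relu")(shared))
p_purchase = tf.keras.layers.Dense(1, activation="sigmoid", name="purchase")(
    tf.keras.layers.Dense(32, activation="relu")(shared))

model = tf.keras.Model(features, [p_favorite, p_purchase])
model.compile(
    optimizer="adam",
    loss={"favorite": "binary_crossentropy", "purchase": "binary_crossentropy"},
    loss_weights={"favorite": FAVORITE_WEIGHT, "purchase": PURCHASE_WEIGHT},
)

def ranking_score(p_fav, p_pur):
    # The two predicted probabilities are combined into a single ranking score
    return FAVORITE_WEIGHT * p_fav + PURCHASE_WEIGHT * p_pur
```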
Both favorites and purchases are positive events, so sample weights can be assigned in both the loss function computation and the final score step. We devoted a lot of manual effort to finding the optimal weights for each task, and this structure became our first milestone. The simple shared-bottom structure proved to be an efficient benchmark model. To improve the model's performance, we moved forward with adding an expert layer following the Multi-gate Mixture-of-Experts (MMoE) framework proposed by Zhao et al. In this framework, favorites and purchases have more flexibility to learn different representations from the embeddings, which leads to more relevant recommendations with little extra computation cost. Figure 2. Model input and structure. The table on the left illustrates how training data is formatted, with samples from multiple modules concatenated and passed in to the ranker. On the right, a workflow that describes the multi-task model structure. A module name is appended on each layer. Building a canonical ranker In addition to using data from multiple modules, we also took training data and model structure into account when developing the canonical ranker. We did offline tests of how a naive multi-task model performed on eight different modules, where the training data was extracted from only a subset of them. Performance varied a lot, and on several modules we could not achieve parity against the existing production ranker. As expected, modules that were not included in the training data had the worst performance. We also observed that adding features or implementing an architectural change often led to opposite results on different modules. We took several steps to address these problems:
- We added a feature, module_name, representing which module each sample comes from. Given how much patterns can vary across different modules, we believe this is a critical feature, and we manually stack it in each layer of the neural net. It’s possible that the module_name passed to the ranker during inference is not one that it saw during training. (Remember that we only train the model on data from a subset of modules.) We account for this by randomly sampling 10% of the training data and replacing the module_name feature with a dummy name, which we can use in inference to cover that training gap when it occurs.
- User segment distributions and user behaviors vary a lot across modules, so it’s important to keep a balance of training data across different user segments. We account for this during training data sampling.
- We assign different weights to different interaction types (e.g., impressions, clicks, favorites, and purchases), and the weights may vary depending on the module. The intuition is that the correlation between interactions may be different across modules. For example, click may show the same pattern as favorite on module X, but a different pattern on module Y. To help ensure that the canonical ranker can perform well on all modules, we carefully tune the weights to achieve a balance.
Launching experiments After several iterations, including pruning features to meet latency requirements and standardizing the front-end code that powers recommendation modules, in Q2 of 2022 we launched the first milestone ranker on an item page module and a homepage module. We observed as much as a 12.5% improvement on module-based favorite NDCG and significant improvements on the Etsy-wide favorite rate.
And though our main concern was simply to not negatively impact purchases, we were pleased to observe significant improvements on purchase metrics as well as other engagement metrics. We also launched experiments to test our ranker on a few other modules, whose data are not used during training, and have observed that our ranker outperformed the module-specific rankers in production. These observations suggest that the model is in fact a successful canonical ranker. In Q3, we launched our second milestone model, which proved to be better than the first one and improved engagement even further. As of now, the ranker is powering multiple modules on both web and app, and we anticipate that it will be applied in more places. For machine learning at Etsy, the frequency ranker marks a paradigm shift in how we build recommendations. From the buyer's perspective, not only does the ranker provide more personalized recommendations, but employing the same ranker across multiple modules and platforms also helps guarantee a more consistent user experience. Moving forward, we’ll continue iterating on this ranker to improve our target metrics, making the ranker more contextual and testing other novel model architectures. Acknowledgements Thanks Davis Kim and Litao Xu for engineering support of the work and Murium Iqbal for internal review of this post. Special thanks to folks in recs-platform, feature-system, ml-platform and search-ranking.
Machine learning (ML) model deployment is one of the most common topics of discussion in the industry. That’s because deployment represents a meeting of two related but dramatically different domains, ML practice and software development. ML work is experimental: practitioners iterate on model features and parameters, and tune various aspects of their models to achieve a desired performance. The work demands flexibility and a readiness to change. But when they’re deployed, models become software: they become subject to the rigorous engineering constraints that govern the workings of production systems. The process can frequently be slow and awkward, and the interfaces through which we turn models into deployed software are something we devote a lot of attention to, looking for ways to save time and reduce risk. At Etsy, we’ve been developing ML deployment tools since 2017. Barista, the ML Model Serving team’s flagship product, manages lifecycles for all types of models - from Recommendations and Vision to Ads and Search. The Barista interface has evolved dramatically alongside the significant evolution in the scope and range of our ML practice. In this post we’ll walk you through the story of where we started with this interface, where we ended up, and where we intend to keep going. Arc 1: Managing Deployed Model Configs as Code Like many companies deploying ML models to production, Etsy’s ML platform team uses Kubernetes to help scale and orchestrate our services. At its core, Barista itself is a Python+Flask-based application that utilizes the Kubernetes Python API to create all the Kubernetes objects necessary for a scalable ML deployment (Deployment, Service, Ingress, Horizontal Pod Autoscaler, etc.). Barista takes a model deployment configuration specified by users and performs CRUD operations on model deployments in our Kubernetes cluster. The Barista UI in 2020 In the initial version of Barista, the configurations for these deployments were managed as code, via a simple, table-based, read-only UI. Tight coupling of configuration with code is typically frowned upon, but on the scale of our 2017-era ML practice it made a lot of sense. Submitting config changes to the Barista codebase as PRs, which required review and approval, made it easy for us to oversee and audit those changes. It was no real inconvenience for us to require our relatively small corps of ML practitioners to be capable of adding valid configurations to the codebase, especially if it meant we always knew the who, what, and when of any action that affected production. Our old Python file where infrastructure configurations for model deployments were stored Updating configurations in this tightly coupled system required rebuilding and deploying the entire Barista codebase. That process took 20-30 minutes to complete, and it could be further blocked by unrelated code errors or by issues in the build pipeline itself. As ML efforts at Etsy began to ramp up and our team grew, we found ourselves working with an increasing number of model configurations, hundreds of them ultimately, defined in a single large Python file thousands of lines long. With more ML practitioners making more frequent changes, a build process that had been merely time-consuming was becoming a bottleneck. And that meant we were starting to lose the advantage in visibility that had justified the configuration-as-code model for us in the first place. Arc 2: Decoupling Configs with a Database By 2021, we knew we had to make changes. 
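Looking back at that first arc for a moment: under the hood, a config entry gets translated into Kubernetes objects through the Python client. The sketch below is simplified, with hypothetical config fields and a hypothetical namespace rather than Barista's actual code, but it captures the shape of the work:

```python
from kubernetes import client, config

# Hypothetical config entry of the kind that used to live in one large Python file
MODEL_CONFIGS = {
    "recs-ranker": {"image": "gcr.io/example/recs-ranker:v42", "replicas": 3, "cpu": "2"},
}

def deploy(name: str) -> None:
    """Turn a config entry into a Kubernetes Deployment (Service, Ingress, HPA omitted)."""
    cfg = MODEL_CONFIGS[name]
    config.load_kube_config()  # or load_incluster_config() when running inside the cluster
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name, labels={"app": name}),
        spec=client.V1DeploymentSpec(
            replicas=cfg["replicas"],
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": name}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name=name,
                        image=cfg["image"],
                        resources=client.V1ResourceRequirements(requests={"cpu": cfg["cpu"]}),
                    )
                ]),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="ml-serving", body=deployment)
```

Every change to a dict like MODEL_CONFIGS meant rebuilding and redeploying Barista itself, which is the 20-30 minute cycle described above.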
Working with Kubernetes was becoming particularly worrisome. We had no safe way to quickly change Kubernetes settings on models in production. Small changes like raising the minimum replicas or CPU requests of a single deployment required users to push Barista code, seek PR approval, then merge and push that code to production. Even though in an emergency ML platform team members could use tools like kubectl or the Google Cloud Console to directly edit deployment manifests, that didn't make for a scalable or secure practice. And in general, supporting our change process was costing us significant developer hours. So we decoupled. We designed a simple CRUD workflow backed by a CloudSQL database that would allow users to make instantaneous changes through a Barista-provided CLI. Early architecture diagrams from the design doc The new system gave us a huge boost in productivity. ML practitioners no longer had to change configs in code, or develop in our codebases at all. They could perform simple CRUD operations against our API that were DB-backed and reflected by the deployments in our cluster. By appropriately storing both the live configuration of models and an audit log of operations performed against production model settings, we maintained the auditability we had in the PR review process and unblocked our ML practitioners to deploy and manage models faster. We designed the CLI to be simple to use, but it still required a certain degree of developer setup and acumen that was inconvenient for many of our ML practitioners. Even simple CLIs have their quirks, and they can be intimidating to people who don't routinely work on the command line. Increasingly the platform team was being called on to help with understanding and running CLI commands and fixing bash issues. It began to look as if we'd traded one support burden for another, and might see our productivity gains start to erode. Arc 3: A UI Built Mostly by Non-UI People We’d always had an API, and now we had a database backing that API. And we had the command line: but what we needed, if we wanted wide adoption of our platform across an increasing user base, was a product. A purpose-built, user-friendly Web interface atop our API would let ML practitioners manage their model deployments directly from the browser, making CLI support unnecessary, and could compound the hours we'd saved moving to the CRUD workflow. So, in the summer of 2021 we started building it. The vaunted Barista web app Now this is certainly not the most aesthetic web app ever built – none of us on the ML platform team are front-end developers. What it is, though, is a fairly robust and complete tool for managing ML deployments on Kubernetes. We've given our users the ability to update anything to do with their models, from application settings to model artifacts to Kubernetes settings like CPU resources or replicas. The UI provides useful integrations to the Google Kubernetes Engine (GKE) console to render information about Pods in a deployment, and even integrates with the API of our cost tool so practitioners can understand the cost of serving their models live in production. In 2017, Barista was an HTML table and a Python config file. Now, in 2023, it’s a fully functioning web interface that integrates with multiple internal and third-party APIs to render useful information about models, and gives users complete and immediate control over their model deployments.
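For a flavor of what that looks like in practice, here is a purely hypothetical example; Barista's API is internal, so the endpoint and field names below are made up, but the shape of the interaction is the point:

```python
import requests

# Hypothetical: bump a model deployment's replica floor through Barista's CRUD API,
# the same operation the web UI performs under the hood.
resp = requests.patch(
    "https://barista.internal.example.com/api/v1/deployments/recs-ranker",
    json={"min_replicas": 4, "cpu_request": "2"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # the updated config, which is also recorded in the audit log
```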
Small changes that might have taken hours can happen in seconds now, unblocking users and reducing the toll of unnecessary workflows. Arc 4: Security and Cost The Barista UI made it so much easier to serve models at Etsy that it helped drive up the rate of experimentation and the number of live models. Suddenly, over the course of a few months, Barista was seeing hundreds of models both in production and on dev servers. And while we were thrilled about the product’s success, it also raised some concerns: specifically, that the simplicity and user-friendliness of the process might lead to spikes in cloud cost and an increase in misconfigurations. Serving a model on Barista accrues cost from Google Cloud, Etsy’s cloud computing service partner. This cost can range anywhere from a few hundred to thousands of dollars per month, depending on the model. While the costs were justified in production, in the development system we were seeing high daily CPU costs with relatively low usage of the environment, which was something that needed a remedy. Unfortunately, by default the Kubernetes Horizontal Pod Autoscaler, which we were using to manage our replicas, doesn't let you scale down below 1. With the increased flow of models through Barista, it became harder for ML practitioners to remember to remove resources when they were no longer needed–and unless a deployment was completely deleted and recreated, we were going to keep incurring costs for it. To mitigate the issue we added the Kube Downscaler to our development cluster. This allowed us to scale deployment replicas to zero both off-hours and on the weekends, saving us about $4k per week. We still had deployments sitting around unused on weekdays, though, so we decided to build Kube Downscaler functionality directly into Barista. This is a safety feature that pauses model deployments: by automatically scaling models in development to zero replicas after three hours, or on user demand. We're now seeing savings of up to 82% in dev cost during periods of inactivity. And we've avoided the runaway cost scenario (where models are not scaled down after test), which could have resulted in annualized costs in excess of $500k. How to use the Barista UI From a certain angle, this has mostly been an article about how it took us three years to write a pretty standard web app. The app isn’t the point, though. The story is really about paying attention to the needs of our ML users and to the scale of their work within Etsy, and above all about resisting premature optimization. Over the years we’ve moved from meeting the basic technical requirements of an ML platform to building a complete product for it. But we never aimed at a product, not till it became necessary, and so we were able to avoid getting ahead of ourselves. The goals of the ML platform team have broadened now that Barista is where it is. We continue to try to enable faster experimentation and easier deployment. The consensus of our ML practitioners is that they have the tools they need, but that our greatest area of opportunity is to improve in the cohesiveness of our suite of services - most importantly, automation. And that’s where we’re investing now: in tooling to provide optimal infrastructure settings for model deployments, for instance, so we can reduce tuning time in serving them and further minimize our cloud costs. Our ML practice continues to grow, both in number of models and team size, and our platform is going to have to continue to grow to keep up with them.
The question of documentation—not just formats and standards, but the tools and processes that can make documenting code a normal and repeatable part of development workflows—has been a live one in the software industry for a long time. The docs-as-code approach has emerged as a way of integrating documentation into development on the principle that the same sets of tools and procedures should be used for both. This might mean, for instance, versioning documentation the same way code is versioned, formatting it using plain-text markup like Markdown or reStructuredText, and employing automation tools to build and deploy documentation. At Etsy, we've developed an internal tool called Docsbuilder to implement our own docs-as-code approach. It's our primary tool for maintaining and discovering software documentation at Etsy, and in this post we'll give you an overview of it. What is Docs-as-code? Docs-as-code aims to manage software documentation with the same level of rigor and attention typically brought to managing software source code. The goal of the methodology is to make the work of authoring, versioning, reviewing and publishing documentation smoother, more efficient, more repeatable and reliable. The main principles of docs-as-code are:
- Documentation is a first-class citizen in the software development process
- As such, it should be versioned and stored in source control repositories such as Git
- Documentation should be written in plain-text formats like Markdown, to support versioning and collaboration
- Documentation should be reviewed and tested before being published
- The documentation workflow should be automated where possible to improve the reliability and user experience of publishing sites
Fundamentally, docs-as-code encourages an approach that balances documentation writing with coding. Developers who tag and release documentation the same way they do source code, with the same tools and procedures, will tend to be more organized and structured about handling their documentation. And the documentation itself becomes easier to audit for quality, and to update and maintain. Docs-as-code at Etsy Etsy's Docsbuilder tool is a collection of bin scripts, pipelines and processes that let developers create and maintain software documentation. Docsbuilder is Markdown-based; we use Docusaurus, a Facebook-created tool for generating project documentation websites, to convert Markdown files to HTML. Each project team at Etsy owns its own Docsbuilder repository, a mix of Markdown files, configurations and NPM dependencies, and we use a secure GitOps-based workflow to review and merge changes to these repositories. To simplify the process of creating new sites, we developed a "docsbuilder-create" bin script. The script checks to make sure a supplied site name is valid and unique, then it bootstraps a Docusaurus site and installs all necessary Etsy-specific plugins. Once the site is created, users can push documentation to their Git branch and open a pull request. To validate and test the PR, we use Google Cloud Build to clone a preconfigured Node.js container, which builds the entire documentation website and runs some integration tests to make sure the site is working properly. When the tests are completed and the PR is approved, we merge the code into the main branch of the repository and another Cloud Build job then builds and deploys the site to production.
To make it easier for users within Etsy to find documentation, we developed a docs search engine that scans and indexes all Docsbuilder sites and all available pages. To simplify navigation, a React component displays a list of all important and frequently used sites on the homepage of our documentation hub. In total, we currently host about 6.2k pages on 150 Docsbuilder project sites. What’s Next Discoverability and ease of use are important factors in making documentation more available, and that’s an area where we’re aiming to improve. Better links between doc sites and related topics will surface documentation better, as will improvements to the docs search engine. With the same idea in mind, we want to focus on creating more organized and navigable documentation pages. And there’s no reason not to make the content itself more engaging: we’re looking into adding more support for screenshots and other imagery, and integrating with React plugins to draw diagrams (e.g. flowcharts) and other visuals that can illustrate key concepts and ideas.
Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. This is part 3 in our series on Sharding Payments with Vitess. In the first post, we focused on challenges in the application and data model. In part two, we discussed the challenges of cutting over a crucial high-traffic system, and in this final post we will discuss different classes of errors that might crop up when cutting over traffic from an unsharded keyspace to a sharded keyspace. As an engineer on Etsy’s Data Access Platform team, my role bridges the gap between product engineering and infrastructure. My team builds and maintains Etsy’s in-house ORM, and we are experts in both software engineering and the database software that our code relies on. Members of my team and I have upstreamed over 30 patches to Vitess main. We also maintain an internal fork of Vitess, containing a handful of patches necessary to adapt Vitess to the specifics of our infrastructure. In our sharding project, my role was to ensure that the queries our ORM generates would be compatible with vindexes and to ensure that we configured Vitess correctly to preserve all of the data guarantees that our application expects. Cutover risks Vitess does an excellent job of abstracting the concept of sharding. At a high level, Vitess allows your application to interact with a sharded keyspace much the same as it would with an unsharded one. However, in our experience sharding with Vitess, we found a few classes of errors in which things could break in new ways after cutting over to the sharded keyspace. Below, we will discuss these classes of errors and how to reduce the risk of encountering them. Transaction mode errors When sharding data with Vitess, your choice of transaction_mode is important, to the extent that you care about atomicity. The transaction_mode setting is a VTGate flag. The default value is multi, which allows for transactions to span multiple shards. The documentation notes that “partial commits are possible” when using multi. In particular, when a transaction fails in multi mode, writes may still be persisted on one or more shards. Thus, the usual atomicity guarantees of database transactions change significantly: typically when transactions fail, we expect no writes to be persisted. Using Vitess’s twopc (two-phase commit) transaction mode solves this atomicity problem, but as Vitess documentation notes, “2PC is an experimental feature and is likely not robust enough to be considered production-ready.” “Transaction commit is much slower when using 2PC," say the docs. "The authors of Vitess recommend that you design your VSchema so that cross-shard updates (and 2PC) are not required.” As such, we did not seriously consider using it. Given the shortcomings of the multi and twopc transaction modes, we opted for single transaction mode instead. With single mode, all of the usual transactional guarantees are maintained, but if you attempt to query more than one shard within a database transaction, the transaction will fail with an error.
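A minimal sketch of how that surfaces in application code, assuming a MySQL client pointed at VTGate; the table, columns and connection details are illustrative, not our real payments schema:

```python
import pymysql

# Illustrative only: two rows whose sharding keys live on different shards
conn = pymysql.connect(host="vtgate.internal.example.com", port=15306,
                       user="app", password="...", database="payments_sharded")

try:
    with conn.cursor() as cur:
        conn.begin()
        cur.execute("UPDATE ledger_entries SET note = %s WHERE shop_id = %s", ("a", 101))
        # In single mode, touching a second shard inside the same transaction fails here:
        cur.execute("UPDATE ledger_entries SET note = %s WHERE shop_id = %s", ("b", 202))
    conn.commit()
except pymysql.MySQLError:
    conn.rollback()  # VTGate rejects the multi-shard transaction
```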
We decided that the semantics of single as compared with multi would be easier to communicate to product developers who are not database experts, less surprising, and provide more useful guarantees. Using the single transaction mode is not all fun and games, however. There are times when it can seem to get in the way. For instance, a single UPDATE statement which touches records on multiple shards will fail in single mode, even if the UPDATE statement is run with autocommit. While this is understandable, sometimes it is useful to have an escape hatch. Vitess provides one, with an API for changing the transaction mode per connection at runtime via SET transaction_mode=? statements. Your choice of transaction mode is not very important when using Vitess with an unsharded keyspace – since there is only a single database, it would be impossible for a transaction to span multiple databases. In other words, all queries will satisfy the single transaction mode in an unsharded keyspace. But when using Vitess with a sharded keyspace, the choice of transaction mode becomes relevant, and your transactions could start failing if they issue queries to multiple shards. To minimize our chances of getting a flood of transaction mode errors at our cutover, we exhaustively audited all of the callsites in our codebase that used database transactions. We logged the SQL statements that were executed and pored over them manually to determine which shards they would be routed to. Luckily, our codebase uses transactions relatively infrequently, since we can get by with autocommit in most cases. Still it was a painstaking process. In the end, our hard work paid off: we did not observe any transaction mode-related errors at our production cutover. Reverse VReplication breaking Pre-cutover, VReplication copies data from the original unsharded keyspace to the new sharded keyspace, to keep the data in sync. Post-cutover, VReplication switches directions: it copies data back from the new sharded keyspace to the original unsharded keyspace. We’ll refer to this as reverse VReplication. This ensures that if the cutover needs to be reversed, the original keyspace is kept in sync with any writes that were sent to the sharded keyspace. If reverse VReplication breaks, a reversal of the cutover may not be possible. Reverse VReplication broke several times in our development environment due to enforcement of MySQL unique keys. In an unsharded keyspace, a unique key in your MySQL schema enforces global uniqueness. In a sharded keyspace, unique keys can only enforce per-shard uniqueness. It is perfectly possible for two shards to share the same unique key for a given row. When reverse VReplication attempts to write such rows back to the unsharded database, one of those writes will fail, and reverse VReplication will grind to a halt. This form of broken VReplication can be fixed in one of two ways:
1. Delete the row corresponding to the write that succeeded in the unsharded keyspace. This will allow the subsequent row to reverse vreplicate without violating the unique-key constraint.
2. Skip the problematic row by manually updating the Pos column in Vitess’s internal _vt.vreplication table.
It was important to us that reverse VReplication be rock solid for our production cutover. We didn’t want to be in a situation where we would be unable to reverse the cutover in the event of an unexpected problem. Thus, before our production cutover, we created alerts that would page us if reverse VReplication broke.
Furthermore, we had a runbook that we would use to fix any issues with reverse VReplication. In the end, reverse VReplication never broke in production. The reason it broke in our development environment, it turned out, was due to a workflow specific to that environment. As an aside, we later discovered that Vitess does in fact provide a mechanism for enforcing global uniqueness on a column in a sharded keyspace. Scatter queries In a sharded keyspace, if you forget to include the sharding key (or another vindexed column) in your query’s WHERE clause, Vitess will default to sending the query to all shards. This is known as a scatter query. Unfortunately, it can be easy to overlook adding the sharding key to one or more of your queries before the cutover, especially if you have a large codebase with many types of queries. The situation might only become obvious to you after cutover to the sharded cluster. If you start seeing a much higher than expected volume of queries post-cutover, scatter queries are likely the cause. See part 2 in this series of posts for an example of how we were impacted by scatter queries in one cutover attempt. Having scatter queries throw a wrench in one of our earlier cutover attempts got us thinking about how we could identify them more easily. Vitess provides several tools that can show its execution plan for a given query, including whether or not it will scatter: vtexplain, EXPLAIN FORMAT=vitess …, and EXPLAIN FORMAT=vtexplain. At the time of our earlier cutover, we had not been in the habit of regularly analyzing our queries with these tools. We used them before the next cutover attempt, though, and made sure all the accidental scatter queries got fixed. Useful as Vitess query analysis is, there is always some chance an accidental scatter query will slip through and surprise you during a production cutover. Scatter queries have a multiplicative impact: they are executed on every shard in your keyspace, and at a high enough volume can push the aggregate query volume in your sharded keyspace to be several times what you would see in an unsharded one. If query volume is sufficient to overwhelm all shards after the cutover, accidental scatter queries can result in a total keyspace outage. It seemed to us it would limit the problem if Vitess had a feature to prevent scatter queries by default. Instead of potentially taking down the entire sharded keyspace in a surge of query traffic, with a no-scatter default only accidental scatter queries would fail. At our request, PlanetScale later implemented a feature to prevent scatter queries. The feature is enabled by starting VTGate with the --no_scatter flag. Scatter queries are still allowed if a comment directive is included in the query: /*vt+ ALLOW_SCATTER */. While this feature was not yet available during our earlier cutover attempts, we have since incorporated it into our Vitess configuration. Incompatible queries Some types of SQL queries that work on unsharded keyspaces are incompatible with sharded keyspaces. Those queries can start failing after cutover. We ran into a handful of them when we tested queries in our development environment. One such class of queries were scatter queries with a GROUP BY on a VARCHAR column. 
As an example, consider a table in a sharded keyspace with the following schema:

CREATE TABLE `users` (
  `user_id` int(10) unsigned NOT NULL,
  `username` varchar(32) NOT NULL,
  `status` varchar(32) NOT NULL,
  PRIMARY KEY (`user_id`)
)

Assume the users table has a primary vindex on the column user_id and no additional vindexes. On such a table in a sharded keyspace, the following query will be a scatter query:

SELECT status, COUNT(*) FROM users GROUP BY status;

This query failed with an error on our version of Vitess:

ERROR 2013 (HY000): Lost connection to MySQL server during query

After diving into Vitess’s code to determine the cause of the error, we came up with a workaround: we could CAST the VARCHAR column to a BINARY string:

> SELECT CAST(status AS BINARY) AS status, COUNT(*) FROM users GROUP BY status;
+----------+----------+
| status   | count(*) |
+----------+----------+
| active   |       10 |
| inactive |        2 |
+----------+----------+

We occasionally ran into edge cases like this where Vitess’s query planner did not support specific query constructs. But Vitess is constantly improving its SQL compatibility – in fact, this specific bug was fixed in later versions of Vitess. In our development environment, we exhaustively tested all queries generated by our application for compatibility with the new sharded keyspace. Thanks to this careful testing, we managed to identify and fix all incompatible queries before the production cutover. Conclusion There are several classes of errors that might only start appearing after the cutover from an unsharded keyspace to a sharded keyspace. This makes cutover a risky process. Although cutovers can generally be easily reversed, so long as reverse VReplication doesn’t break, the impact of even a short disruption can be large, depending on the importance of the data being migrated. Through careful testing before cutover, you can reduce your exposure to these classes of errors and guarantee yourself a much more peaceful cutover. This is part 3 in our series on Sharding Payments with Vitess. Check out part 1 to read about the challenges of migrating our data models and part 2 to read about the challenges of cutting over a crucial high-traffic system.
Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. This is part 2 of our series on Sharding Payments with Vitess. In the first post, we focused on making application data models available for sharding. In this part, we’ll discuss what it takes to cut over a crucial high traffic system, and part 3 will go into detail about reducing the risks that emerge during the cutover. Migrating Data Once we had chosen a sharding method, our payments data had to be redistributed to forty new shards, the data verified as complete, reads and writes switched over from existing production databases, and any additional secret sauce applied. Considering this was only the payments path for all of Etsy.com, the pressure was on to get it right. The only way to make things work would be to test, and test again. And while some of that could be done on an ad-hoc basis by individual engineers, to reorganize a system as critical as payments there is no substitute for a production-like environment. So we created a staging infrastructure where we could operationally test Vitess's internal tooling against snapshots of our production MySQL data, make whatever mistakes we needed to make, and then rebuild the environment, as often as necessary. Staging was as real as we could make it: we had a clone of the production dataset, replicas and all, and were equipped with a full destination set of forty Vitess shards. Redistributing the data in this playground, seeing how the process behaved, throwing it all away and doing it again (and again), we built confidence that we could safely wrangle the running production infrastructure. We discovered, and documented, barriers and unknowns that had to be adjusted for (trying to shard on a nullable column, for instance). We learned to use VDiff to confirm data consistency and verify performance of the tooling, we learned to run test queries through Vitess proper, checking behavior and estimating workload, and we discovered various secondary indexing methods and processes (using CreateLookupVindex and ExternalizeVindex) to overcome some of these barriers. One of the key things we found out about was making the read/write switch. Vitess's VReplication feature is the star performer here. Via the MoveTables workflow, VReplication sets up streams that replicate all writes in whichever direction you're currently operating. If you're writing to the source side, VReplication has streams to replicate those writes into the sharded destination hosts, properly distributing them to the correct shard. When you switch writes with Vitess, those VReplication streams are also reversed, now writing from the destination shards to the source database(s). Knowing we had this as an option was a significant confidence booster. If the first switch wasn't perfect, we could switch back and try again: both sides would stay consistent and in sync. Nothing would get lost in between. Top: The MoveTables workflow has VReplication setup to move existing data to the new destination shards and to propagate new writes. Bottom: After SwitchWrites the Vreplication stream reverses to the source database, keeping the origin data in sync with new writes to the sharded setup. 
Cutting Over Production VReplication paved most of the road for us to cut over production, but we did have our share of “whoa, I didn’t expect that” moments. As powerful as the Vitess workflow system is, it does have some limitations when attempting to move a substantial dataset, and that requires finding additional solutions for the same problem. We were traversing new ground here for Etsy, so even as we pushed onward with the migration we were still learning how to operate within Vitess. In one phase of the project we switched traffic and within a couple of minutes saw a massive 40x jump in query volume (from ~5k/second to ~200k/second) on the same workload. What is going on here? After some quick investigation we found that a query gathering data formerly colocated on a monolithic database now required that data from some portion of the new shard set. The problem was, Vitess didn’t know exactly which of our forty shards that data existed on, since it was being requested via a column that was not part of the sharding column. An explosion: This massive increase in query volume surprised us initially. A dive into what was generating this volume quickly revealed scatter queries were at the root of it. Enter the story's second superhero: CreateLookupVindex. While we switched traffic back and headed to our staging environment to come up with a solution, we quickly realized the 40x increase was due to Vitess using a scatter gather process, sending queries to all shards in an effort to see which of them had pieces of the data. Besides those scatter queries being inefficient, they were also bombarding the databases with requests that returned empty result sets. To combat this, we created Secondary Vindexes to tell Vitess where the data was located using a different column than the Primary Vindex. These additional Vindexes allowed the VTGates to use a lookup table to identify the shards housing the relevant data and make sure they were the only ones receiving queries. A solution: Application of Lookup Vindexes quelled the rampant scatter queries immediately, returning query volume to manageable levels. Knowing there were additional datasets that would behave similarly, we were able to find those callsites and query patterns early, and apply those Secondary Vindexes while the MoveTables workflow was in progress. This prevented the same overwhelming traffic pattern from recurring and kept the final transitions from being disruptively resource-intensive. (For more information on the scatter traffic pattern, part 3 of this series has additional details.) As we grew more comfortable with our command of the migration process, we decided that since Vitess supported multiple workflows per keyspace, we could get both the MoveTables and the Secondary Vindexes deployed simultaneously. However, there's a caveat: due to the way Secondary Vindexes are maintained by VReplication streams, the Vindexes cannot be externalized (made available to the system) until after the switching of writes is complete. These indexes work by inserting rows into a lookup table as writes are made to the owner table, keeping the lookup table current. While the MoveTables is in motion, the Secondary Vindex is doing its job inserting while the data is being sharded via the vReplication streams. And there's the rub: if you externalize the Vindex before writes are switched, there aren’t any writes being made to the destination shards, and you are going to miss all of those lookup records coming from the vReplication stream. 
Taking all this into account, we switched writes, externalized the two Secondary Vindexes we were creating, and found the machines we were running on couldn't handle the query load effectively. No problem: we'd switch the writes back to the source database. Oops! We just broke our fancy new Vindexes, because CreateLookupVindex was no longer running! While it wasn't the end of the world, it did mean we had to remove those Secondary Vindexes, remove all of their artifacts from the sharded dataset (drop the lookup tables), then rebuild them a second time. During the small window this work created, we raised the specs for the destination cluster, then did the dance again: switch writes, externalize the Vindexes, and roll forward. This time, happily, roll forward is exactly what things did.

Conclusion

Thanks to extensive testing on a full production copy in our staging environment, Vitess tooling that was as powerful and performant as we expected, and meticulous precursor work on the data models, we encountered no disruption, no downtime, and no impact to normal operation throughout the course of these migrations. While there was tolerance for the idea of some disruption, the most important thing to our team was that our processes should run transparently. Completing this monumental project on a highly sensitive path without anyone noticing (inside Etsy or out) was very satisfying.

This is part 2 in our series on Sharding Payments with Vitess. Check out part 1 to read about the challenges of migrating our data models, and part 3 to read about reducing cutover risks.
At the end of 2020, Etsy’s Payments databases were in urgent need of scaling. Specifically, two of our databases were no longer vertically scalable — they were using the highest resource tier offered on the Google Cloud platform. These databases were crucial to our day-to-day payment processing, so it was a high-risk situation: spikes in traffic could have led to performance issues or even loss of transactions. Our sellers depend on our payments system to get paid for their hard work, making this reliability effort even more critical. A stable and scalable payments platform provides the best possible experience for our sellers. Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data. We did this work in two phases. First we migrated our seller ledger infrastructure, a contained domain that determines seller bills and payouts. For the second phase, we worked to reduce load on our primary payments database, which houses transaction data, payment data and much more. This database has been around for over a decade and hosted nearly 90 tables before the migration. To cut down the scope of the project we strategically chose to migrate just a subset of tables with the highest query volume, adding in others related to those high-volume tables as needed. In the end, even operating on just that subset, we were able to reduce load by 60%, giving us room to scale for many years down the road. Throughout this project, we encountered challenges across the stack. This is the first in a series of posts in which we’ll share how we approached those challenges, both in application and infrastructure. Here we’ll be focusing on making application data models available for sharding. In part 2, we’ll discuss what it takes to cut over a crucial high-traffic system, and part 3 will go into detail about reducing the risks that emerge during the cutover. An Ideal Sharding Model: Migrating the Ledger A data model’s shape impacts how easy it is to shard, and its resiliency when sharded. Data models that are ideal for sharding are shallow, with a single root entity that all other entities reference via foreign key. For example, here is a generic data model for a system with Users, and some data directly related to Users: This type of data model allows all tables in the domain to share a shardifier (in this example, user_id), meaning that related records are colocated. Even with Vitess handling them, cross-shard operations can be inefficient and difficult to reason about; colocated data makes it possible for operations to take place on a single shard. Colocation can also reduce how many shards a given request depends on, mitigating user impacts from a shard going down or having performance issues. Etsy's payments ledger, our first sharding migration, was close to this ideal shape. Each Etsy shop has a ledger of payments activity, and all entities in this domain could be related to a single root Ledger entity. The business rule that maintains one ledger to a shop means that ledger_id and shop_id would both be good candidates for a shardifier. Both would keep all shop-related data on a single shard, isolating shard outages to a minimal number of shops. 
We went with shop_id because it's already in use as a shardifier elsewhere at Etsy, and we wanted to stay consistent. It also future-proofs us in case we ever want to allow a shop to have multiple ledgers. This "ideal" model may have been conceptually simple, but migrating it was no small task. We needed to add a shop_id column to tables and to modify constraints such as primary keys, unique keys, and other indexes, all while the database was resource-constrained. Then we had to backfill values to billions of existing rows, while–again–the database was resource-constrained. (We came up with a real-time tunable script that could backfill up to 60x faster using INSERT … ON DUPLICATE KEY UPDATE statements.) And when we had our new shardifier in place there were hundreds of existing queries to update with it, so Vitess would know how to route to the appropriate shard, and hundreds of tests whose test fixture data had to be updated.

Non-ideal Models: Sharding Payments

For our second phase of work, which would reduce load on our primary payments database, the data model was less straightforward. The tables we were migrating have grown and evolved over a decade-plus of changing requirements, new features, changes in technology and tight deadlines. As such, the data model was not a simple hierarchy that could lend itself to a standard Etsy sharding scheme. Here's a simplified diagram of the payments model:

Each purchase can relate to multiple shops or other entities. Payments and PaymentAdjustments are related to multiple types of transactions, CreditCardTransactions and PayPalTransactions. Payments are also related to many ShopPayments with different shop_ids, so sharding on that familiar Etsy key would spread data related to a single payment across many shards. PaymentAdjustments, meanwhile, are related to Payments by payment_id, and to the two transaction types by reference_id (which maps to payment_adjustment_id). This is a much more complex case than the Ledger model, and to handle it we considered two approaches, described below. As with any technical decision, there were tradeoffs to be negotiated. To evaluate our options, we spiked out changes, diagrammed existing and possible data models, and dug into production to understand how existing queries were using the data.

Option 1: Remodel the core

The first approach we considered was to modify the structure of our data model with sharding in mind. The simplified diagram below illustrates the approach. We've renamed our existing "Payment" models "Purchase," to better reflect what they represent in the system. We've created a new entity called Payment that groups together all entities related to any kind of payment that happens on Etsy. With this model we move closer to an "ideal" sharding model, where all records related to a single payment live on the same shard. We can shard everything by payment_id and enable colocation of all these related entities, with the attendant benefits of resilience and predictability that we've already noted.

Introducing a consequential change to an important data model is costly. It would require sweeping changes to core application models and business logic, and engineers would have to learn the new model. Etsy Payments is a large and complex codebase, and integrating it with a new built-to-shard model would lead to a scope of work well beyond our goal of scaling the database.
Option 2: Find shardifiers where they live

The second approach was to shard smaller hierarchies using primary and foreign keys already in our data model, as illustrated below. Here we've sharded sub-hierarchies in the data together, using reference_id as a polymorphic shardifier in the transaction model so we can colocate transactions with their related Payment or PaymentAdjustment entities. (The downside of this approach is that PaymentAdjustments are also related to Payments, and those two models will not be colocated.)

Considering how urgent it was that we move to a scalable infrastructure, and the importance of keeping Etsy Payments reliable in the meantime, this more modest approach is the one we opted for. As discussed above, most of the effort in the Ledger phase of the project went towards adding columns, modifying primary keys on existing data tables, backfilling data, and modifying queries to add a new shardifier. In contrast, taking established primary and foreign keys as shardifiers wherever possible would cut out nearly all of that effort from the Payments work, giving us a much shorter timeline to completion while still achieving horizontal scalability.

Without having to manage the transition to a new data model, we could focus on scaling with Vitess. As it happens, Vitess has resharding features that give us the flexibility to change our shard design in the future if we see the need; sharding on the basis of the legacy payments model was not a once-and-forever decision. Vitess can even overcome some of the limitations of our "non-ideal" data model using features such as secondary indexes: lookup tables that Vitess maintains to allow shard targeting even when the shardifier is not in a query.

This was part 1 of our series on Sharding Payments with Vitess. We'll have a lot more to say about our real-world experience of working with Vitess in part 2 and part 3 of this series. Please join us for those discussions.
The first time I performed a live upgrade of Etsy's Kafka brokers, it felt exciting. There was a bit of an illicit thrill in taking down a whole production system and watching how all the indicators would react. The second time I did the upgrade, I was mostly just bored. There's only so much excitement to be had staring at graphs and looking at patterns you've already become familiar with. Platform upgrades made for a tedious workday, not to mention a full one.

I work on what Etsy calls an "Enablement" team -- we spend a lot of time thinking about how to make working with streaming data a delightful experience for other engineers at Etsy. So it was somewhat ironic that we would spend an entire day staring at graphs during this change, a developer experience we would not want for the end users of our team's products.

We hosted our Kafka brokers in the cloud on Google's managed Kubernetes. The brokers were deployed in the cluster as a StatefulSet, and we applied changes to them using the RollingUpdate strategy. The whole upgrade sequence for a broker went like this:

1. Kubernetes sends a SIGTERM signal to broker-pod-n
2. The Kafka process shuts down cleanly, and Kubernetes deletes the broker-pod
3. Kubernetes starts a new broker-pod with any changes applied
4. The Kafka process starts on the new broker-pod, which begins recovering by (a) rebuilding its indexes, and (b) catching up on any replication lag
5. Once recovered, a configured readiness probe marks the new broker as Ready, signaling Kubernetes to take down broker-pod-(n-1)

Kafka is an important part of Etsy's data ecosystem, and broker upgrades have to happen with no downtime. Topics in our Kafka cluster are configured to have three replicas each to provide us with redundancy. At least two of them have to be available for Kafka to consider the topic "online." We could see to it that these replicas were distributed evenly across the cluster in terms of data volume, but there were no other guarantees around where replicas for a topic would live. If we took down two brokers at once, we'd have no way to be sure we weren't violating Kafka's two-replica-minimum rule. To ensure availability for the cluster, we had to roll out changes one broker at a time, waiting for recovery between each restart. Each broker took about nine minutes to recover. With 48 Kafka brokers in our cluster, that meant seven hours of mostly waiting.

Entering the Multizone

In the fall of 2021, Andrey Polyakov led an exciting project that changed the design of our Kafka cluster to make it resilient to a full zonal outage. Where Kafka had formerly been deployed entirely in Google's us-central1-a zone, we now distributed the brokers across three zones, ensuring that an outage in any one zone would not make the entire cluster unavailable. In order to ensure zonal resilience, we had to move to a predictable distribution of replicas across the cluster. If a replica for a topic was on a broker located in zone A, we could be sure the second replica was on a broker in zone C, with the third replica in zone F.

Figure 1: Distribution of replicas for a topic is now spread across three GCP zones. Taking down only brokers in Zone A is guaranteed to leave topic partitions in Zone C and Zone F available.

This new multizone Kafka architecture wiped out the one-broker-at-a-time limitation on our upgrade process. We could now take down every single broker in the same zone simultaneously without affecting availability, meaning we could upgrade as many as twelve brokers at once.
We just needed to find a way to restart the correct brokers. Kubernetes natively provides a means to take down multiple pods in a StatefulSet at once, using the partitioned rolling update strategy. However, that requires pods to come down in sequential order by pod number, and our multizonal architecture placed every third broker in the same zone. We could have reassigned partitions across brokers, but that would have been a lengthy and manual process, so we opted instead to write some custom logic to control the updates.

Figure 2. Limitations of Kubernetes sequential ordering in a multizone architecture: we can concurrently update pods 11 and 10, for example, but not pods 11 and 8.

We changed the update strategy on the StatefulSet Kubernetes object for Kafka to OnDelete. This means that once an update is applied to a StatefulSet, Kubernetes will not automatically start rolling out changes to the pods, but will expect users to explicitly delete the pods to roll out their changes. The main loop of our program essentially finds some pods in a zone that haven't been updated and updates them. It waits until the cluster has recovered, and then moves on to the next batch. (A simplified sketch of this loop appears at the end of this post.)

Figure 3. Polling for pods that still need updates.

We didn't want to have to run something like this from our local machines (what if someone has a flaky internet connection?), so we decided to dockerize the process and run it as a Kubernetes batch job. We set up a small make target to deploy the upgrade script, with logic that would prevent engineers from accidentally deploying two versions of it at once.

Performance Improvements

Figure 4. Visualization of broker upgrades.

We tested our logic out in production, and with a parallelism of three we were able to finish upgrades in a little over two hours. In theory we could go further and restart all the Kafka brokers in a zone en masse, but we have shied away from that. Part of broker recovery involves catching up with replication lag: i.e., reading data from the brokers that have been continuing to serve traffic. A restart of an entire zone would mean increased load on all the remaining brokers in the cluster as they saw their number of client connections jump -- we would find ourselves essentially simulating a full zonal outage every time we updated.

It's pretty easy just to look at the reduced duration of our upgrades and call this project a success -- we went from spending seven hours rolling out changes to about two. And in terms of our total investment of time—time coding the new process vs. time saved on upgrades—I suspect by now, eight months in, we've probably about broken even. But my personal favorite way to measure the success of this project is centered around toil -- and every upgrade I've performed these last eight months has been quick, peaceful, and over by lunchtime.
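The sketch below shows roughly what that main loop looks like, using the official Kubernetes Python client. It is illustrative rather than our actual job: the label selector and batch size are assumptions, and the recovery check is stubbed out (in practice it watches under-replicated partitions).

```python
"""
Hedged sketch of an OnDelete, zone-at-a-time StatefulSet update loop: find
broker pods in one zone still on the old revision, delete a small batch so
Kubernetes recreates them with the new spec, wait for recovery, repeat.
"""
import time
from kubernetes import client, config

NAMESPACE, STS_NAME, BATCH = "kafka", "kafka", 3

def pods_needing_update(core, apps, zone):
    """Yield names of pods in `zone` whose revision differs from the StatefulSet's."""
    sts = apps.read_namespaced_stateful_set(STS_NAME, NAMESPACE)
    target = sts.status.update_revision
    node_zone = {n.metadata.name: n.metadata.labels.get("topology.kubernetes.io/zone")
                 for n in core.list_node().items}
    for pod in core.list_namespaced_pod(NAMESPACE, label_selector="app=kafka").items:
        stale = pod.metadata.labels.get("controller-revision-hash") != target
        if stale and node_zone.get(pod.spec.node_name) == zone:
            yield pod.metadata.name

def cluster_recovered() -> bool:
    # Stub: poll broker metrics until under-replicated partitions returns to zero.
    return True

def roll_zone(zone: str) -> None:
    config.load_incluster_config()              # runs inside the cluster as a Job
    core, apps = client.CoreV1Api(), client.AppsV1Api()
    while True:
        batch = list(pods_needing_update(core, apps, zone))[:BATCH]
        if not batch:
            break
        for name in batch:
            # With the OnDelete strategy, deleting the pod triggers the update.
            core.delete_namespaced_pod(name, NAMESPACE)
        while not cluster_recovered():
            time.sleep(30)
```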
In 2018, with its decision to choose Google Cloud Platform as its provider, Etsy began a long project of migration to the cloud. This wasn't just a question of moving out of our data center. Things like on-demand capacity scaling and multi-zone/region resilience don't just magically happen once you’re "in the cloud." To get the full benefits, we undertook a major redesign to host our Kafka brokers (and clients) on Google’s managed Kubernetes (GKE). Fast forward a few years: we now have many more Kafka clients with streaming applications that power critical user-facing features, like search indexing. Our initial architecture, which saved on cost by operating (mostly) in a single availability zone, was beginning to look shaky. The consequences of a Kafka outage had grown, for example resulting in stale search results. This would have a negative impact on our buyers and sellers alike, and would mean a direct revenue loss to Etsy. We decided it was time to reevaluate. With some research and thinking, and a good bit of experimentation, we put together a plan to make our Kafka cluster resilient to zonal failures. For such an important production service, the migration had to be accomplished with zero downtime in production. This post will discuss how we accomplished that feat, and where we're looking to optimize costs following a successful rollout. Single-Zone Kafka Inter-zone network costs in Google Cloud can add up surprisingly quickly, even exceeding the costs of VMs and storage. Mindful of this, our original design operated the Kafka cluster in one zone only, as illustrated below: Figure 1. An illustration of our original single-zone architecture. The Kafka cluster operates entirely within zone “a”. Only a few critically important components, such as Zookeeper in this example, run in multiple zones. Those with keen eyes might have noticed the Persistent Disks drawn across a zone boundary. This is to indicate that we’re using Google’s Regional PDs, replicated to zones “a” and “b”. (Regular PDs, by contrast, are zonal resources, and can only be accessed from a GCE instance in the same zone.) Even though we weren’t willing to pay the network costs for operating Kafka in multiple zones at the time, we wanted at least some capability to handle Google Cloud zonal outages. The worst-case scenario in the design above is a “zone-a” outage taking down the entire Kafka cluster. Until the cluster came back up, consumers would be dealing with delayed data, and afterward they'd have to spend time processing the backlog. More concerning is that with the cluster unavailable producers would have nowhere to send their data. Our primary producer, Franz, has an in-memory buffer and unlimited retries as a hedge against the problem, but producer memory buffers aren't unlimited. In the event of a sustained zone outage our runbook documented a plan to have a team member manually relocate the Kafka brokers to zone-b, where disks and historical data have been safely stored. A quick response would be critical to prevent data loss, which might be a lot to ask of a possibly sleepy human. Multizone Kafka As Kafka's importance to the business grew–and even though the single-zone architecture hadn’t suffered an outage yet–so did our discomfort with the limitations of manual zone evacuation. So we worked out a design that would give the Kafka cluster zonal resilience: Figure 2. A multizone design for Kafka on GKE. Notice that persistent disks are no longer drawn across zone boundaries. 
The most crucial change is that our Kafka brokers are now running in three different zones. The GKE cluster was already regional, so we applied Kubernetes Pod Topology Spread Constraints with topologyKey: topology.kubernetes.io/zone to ensure an even distribution across zones. Less apparent is the fact that topic partition replicas also need to be evenly distributed. We achieve this by setting Kafka's broker.rack configuration based on the zone where a broker is running. This way, in the event of an outage of any one zone, two out of three partition replicas are still available. With this physical data layout, we no longer need persistent disks to be regional, since Kafka is providing inter-zone replication for us.

Zero-Downtime Migration

Satisfied with the design, we still had the challenge of applying the changes to our production Kafka cluster without data loss, downtime, or negative impact to client applications. The first task was to move broker Pods to their correct zones. Simply applying the Topology Constraints alone resulted in Pods stuck in "Pending" state when they were recreated with the new configs. The problem was that the disks and PVCs for the Pods are zonal resources that can only be accessed locally (and even regional disks are limited to just two zones). So we had to move the disks first; then the Pods could follow.

One way to accomplish the move is by actually deleting and recreating the disks. Since we have a replication factor of three on all our topics, this is safe if we do it one disk at a time, and Kafka will re-replicate the missing data onto the blank disk. Testing showed, however, that completing the procedure for all brokers would take an unacceptably long time, on the order of weeks. Instead, we took advantage of Google's disk snapshotting feature. Automated with some scripting, the main loop performs roughly the following steps:

1. Create a "base" snapshot of the Kafka broker disk, while the broker is still up and running
2. Halt the broker: kubectl delete statefulset --cascade=orphan, then kubectl delete pod
3. Create the "final" snapshot of the same Kafka broker disk, referencing the base snapshot
4. Create a brand new disk from the final snapshot, in the correct zone
5. Delete the original disk
6. Recreate the StatefulSet, which recreates the Pod and starts up the broker
7. Wait until the cluster health returns to normal (under-replicated partitions is zero)
8. Repeat for each broker

The two-step base/final snapshot process is just an optimization. The second snapshot is much faster than the first, which minimizes broker downtime, and also reduces the time for partition replication to catch up afterwards.

Now that we've got brokers running in the correct zone location, what about partitions? Kafka doesn't provide automatic partition relocation, and the broker.rack configuration only applies to newly created partitions. So this was a do-it-yourself situation, which at a high level involved:

1. Generating a list of topic-partitions needing relocation, based on the requirement that replicas need to be distributed evenly across all zones. After some scripting, this list contained 90% of the cluster's partitions.
2. Generating a new partition assignment plan in JSON form. Kafka provides some CLI tools for the job, but we used the SiftScience/kafka-assigner tool instead (with a few of the open PRs applied). This allowed us to minimize the amount of data movement, saving time and reducing load on the cluster.
3. Applying partition assignments using the official kafka-reassign-partitions CLI tool.
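To make that final step concrete, here is a hedged sketch of applying one small batch of reassignments with the stock kafka-reassign-partitions tool, driven from Python. Topic names, replica lists and the throttle value are examples only, not our real plan.

```python
"""
Illustrative only: apply one batch of partition reassignments with a
replication throttle, then verify completion (which also clears the throttle).
"""
import json
import subprocess

BOOTSTRAP = "kafka-0.kafka:9092"   # hypothetical bootstrap address

batch = {
    "version": 1,
    "partitions": [
        # Target replica lists chosen so each partition keeps one replica per zone.
        {"topic": "search-indexing", "partition": 0, "replicas": [0, 1, 2]},
        {"topic": "search-indexing", "partition": 1, "replicas": [3, 4, 5]},
    ],
}

with open("reassignment.json", "w") as f:
    json.dump(batch, f)

# Start the reassignment, capping replication traffic at ~50 MB/s.
subprocess.run([
    "kafka-reassign-partitions.sh", "--bootstrap-server", BOOTSTRAP,
    "--reassignment-json-file", "reassignment.json",
    "--throttle", "50000000", "--execute",
], check=True)

# Poll with --verify until the batch completes before moving to the next one.
subprocess.run([
    "kafka-reassign-partitions.sh", "--bootstrap-server", BOOTSTRAP,
    "--reassignment-json-file", "reassignment.json", "--verify",
], check=True)
```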
To prevent overwhelming the cluster, we throttled the data migration, going in small batches rather than all at once (we had something like 10k partitions to move), and grouping migrations by class of topic: large, small, or empty (with the empty ones probably available for deletion). It was delicate work that took days of manual effort and babysitting to complete, but the result was a completely successful, zero-downtime partition migration.

Post Migration

In 2021, a company-wide initiative to test and understand zonal resilience in a large number of Etsy's systems, led by Jeremy Tinley, gave us a perfect opportunity to put our multizone Kafka design through its paces. We performed our testing in production, like many other teams (staging environments not being 100% representative), and brought down an entire zone, a third of the Kafka cluster. As partition leaders and replicas became unavailable, client requests automatically switched to still-available brokers, and any impact turned out to be minimal and temporary.

Some napkin math at the time of the redesign led us to believe that we would see only minimal cost increases from our multizone setup. In particular, eliminating regional disks (the most expensive Google Cloud SKU in our single-zone design) in favor of Kafka replication would halve our significant storage expense. By current measurements, though, we've ended up with a roughly 20% increase in cost since migration, largely due to higher-than-anticipated inter-zone network costs. We expected some increase, of course: we wanted an increase, since the whole point of the new architecture is to make data available across zones. Ideally, we would only be traversing zone boundaries for replication, but in practice that ideal is hard to achieve.

Kafka's follower fetching feature has helped us make progress on this front. By default, consumers read from the leader replica of a partition, where records are directly produced: but if you're willing to accept some replication latency (well within our application SLOs), follower fetching lets you consume data from same-zone replica partitions, eliminating extra hops across boundaries. The feature is enabled by specifying the client.rack configuration for consumers, and the RackAwareReplicaSelector class for the broker-side replica.selector.class config. (A configuration sketch appears at the end of this post.) This isn't the only source of inter-zone traffic, however: many of our client applications are not pure consumers but also produce data themselves, and their data is written back to the Kafka cluster across zones. (We also have some consumers in different Google Cloud projects outside our team's control that we haven't been able to update yet.)

Arguably, increased network costs are worth it to be insured against catastrophic zone failures. (Certainly there are some possibly sleepy humans who are glad they won't be called upon to manually evacuate a bad zone.) We think that with continued work we can bring our costs more in line with initial expectations. But even as things stand, the benefits of automatic and fast tolerance to zone outages are significant, and we'll pay for them happily.
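As a closing footnote to the follower-fetching discussion above, the change boils down to two settings. The sketch below shows them as they might appear for one broker and one consumer; the zone values and consumer group are examples, and the consumer dict is in confluent-kafka/librdkafka style.

```python
"""
Hedged illustration of rack-aware follower fetching. The broker settings
normally live in server.properties; they are shown here as a dict for clarity.
"""
# Broker side: advertise the broker's zone and allow clients to fetch from followers.
broker_overrides = {
    "broker.rack": "us-central1-a",  # set per broker, derived from its GCP zone
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer side: declare the client's zone so fetches are served by the
# same-zone replica instead of crossing a zone boundary.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "search-indexer",       # hypothetical consumer group
    "client.rack": "us-central1-a",
})
```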
Using Kotlin Flows, we worked out how to structure our business and UI logic around a unidirectional data flow, a key unlock for future integration of Compose with Macramé – our standard architecture for use across all screens in the Etsy app. With a full internal screen under our belts, it was time to put Compose in front of real users. A few complex bottom sheets were the next pieces of our app to get the Compose treatment. For the first time, we exposed a major part of our UI, now fully written in Compose, to buyers and sellers on Etsy. We also paired a simple version of our Macramé architecture with these bottom sheets to prove that the two were compatible. A bottom sheet fully using Compose hosted inside of a screen built using Views. After successfully rolling out bottom sheets using Compose, we saw an opportunity to adopt Compose on a larger scale in the Shop screen. The existing Shop screen code was confusing to follow and very difficult to run experiments on – limiting our ability to help sellers improve their virtual storefronts. Compose and Macramé held the promise of addressing all these concerns. The Shop screen, fully built using Compose. In just around three months, our small team completed the rebuild. Our first order of business was to run an A/B experiment on the Shop screen to compare old vs. new. The results gave Compose even better marks than we had hoped for. Initial screen rendering time improved by 5%, and subjective interactions with the Shop screen, like taps and scrolls, were quicker and more fluid. User analytics showed the new screen improved conversion rate, add to cart actions, checkout starts, shop favoriting, listing views, and more – meaning these changes made a tangible, positive impact for our sellers. For the engineers tasked with coding the Shop screen, the results were just as impressive. An internal survey of engineers who had worked with the Shop screen before the rewrite showed a significant improvement in overall developer satisfaction. Building features required fewer lines of code, our respondents told us, and thanks to the Macramé architecture, testing was much easier and enabled us to greatly increase test coverage of business logic. Similar to what we learned during the development of our Design System components, Compose Previews were called out as a superpower for covering edge cases, and engineers said they were excited to work in a codebase that now featured a modern toolkit. Learnings We've learned quite a lot about Compose on our path to adopting it: Because of the unidirectional data flow of our Macramé architecture and stateless components built with Compose, state is decoupled from the UI and business logic is isolated and testable. The combination of Macramé and Compose has become the standard way we build features for our app. Colocation of layout and display logic allows for much easier manipulation of spacing, margins, and padding when working with complex display logic. Dynamic spacing is extremely difficult to do with XML layouts alone, and requires code in separate files to keep it all in sync. Creating previews of all possible Compose states using mock data has eliminated a large source of rework, bugs, and bad experiences for our buyers. Our team found it easier to build lazy-style lists in Compose compared to managing all the pieces involved with using RecyclerView, especially horizontal lazy lists. Interoperability between Compose and Views in both directions enabled a gradual adoption of Compose. 
Animation of Composables can be triggered automatically by data changes – no writing extra code to start and stop the animations properly. While no individual tool is perfect, we're excited about the opportunities and efficiencies Compose has unlocked for our teams. As with any new technology, there's a learning curve, and some bumps along the way. One issue we found was in a third-party library we use. While the library has support for Compose, at the time of the Shop screen conversion that support was still in alpha. After extensive testing, we decided to move forward using the alpha version, but a serious incompatibility could have forced us to find an alternative solution. Another learning is that LazyRows and LazyColumns, while similar in some respects to RecyclerView, come with their own specific way of handling keys and item reuse. This new lazy list paradigm has occasionally tripped us up and resulted in some unexpected behavior.

Conclusion

We're thrilled with our team's progress and outcomes in adopting this new toolkit. We've now fully rewritten several key UI screens, including Listing, Favorites, Search, and Cart, using Compose, with more to come. Compose has given us a set of tools that lets us be more productive when delivering new features to our buyers, and its gradual rollout in our codebase is a tangible example of the Etsy team's commitment to our craft.
At Etsy, we're focused on elevating the best of our marketplace to help creative entrepreneurs grow their businesses. We continue to invest in making Etsy a safe and trusted place to shop, so sellers' extraordinary items can shine. Today, there are more than 100 million unique items available for sale on our marketplace, and our vibrant global community is made up of over 90 million active buyers and 7 million active sellers, the majority of whom are women and sole owners of their creative businesses. To support this growing community, our Trust & Safety team of Product, Engineering, Data, and Operations experts is dedicated to keeping Etsy's marketplace safe by enforcing our policies and removing potentially violating or infringing items at scale. For that, we make use of community reporting and automated controls for removing this potentially violating content. In order to continue to scale and enhance our detections through innovative products and technologies, we also leverage state-of-the-art Machine Learning solutions, which we have already used to identify and remove over 100,000 violations on our marketplace during the past year. In this article, we are going to describe one of our systems for detecting policy violations, which utilizes supervised learning: a family of algorithms that uses data to train models to recognize patterns and predict outcomes.

Datasets

In Machine Learning, data is one of the variables we have the most control over. Extracting data and building trustworthy datasets is a crucial step in any learning problem. In Trust & Safety, we are determined to keep our marketplace and users safe by identifying violations of our policies. For that, we log and annotate potential violations, which enables us to collect datasets reliably. In our approach, these annotations are translated into positives (listings that were indeed violations) and negatives (listings that were found not to be offending for a given policy). The latter are also known as hard negatives, as they are close to our positives and can help us better learn how to partition these two sets. In addition, we also add easy (or soft) negatives by adding random items to our datasets. This gives our models further general examples of listings that do not violate any policy (the majority in our marketplace) and improves generalizability. The number of easy negatives to add is a hyper-parameter to tune: more will mean higher training time and fewer positive representations.

For each training example, we extract multimodal signals, both text and imagery, from our listings. Then we split our datasets by time using progressive evaluation, to mimic our production use case and learn to adapt to recent behavior. The data is split into training, used to train our models and learn patterns; validation, used to fine-tune training hyper-parameters such as learning rate and to evaluate over-fitting; and test, used to report our metrics in an unbiased manner.

Model Architecture

After the usual transformations and extraction of a set of offline features from our datasets, we are all set to start training our Machine Learning model. The goal is to predict whether a given listing violates any of our predefined set of policies, or, in contrast, doesn't violate any of them. For that, we added a neutral class that depicts the no-violation case, where the majority of our listings fall. This is a typical design pattern for these types of problems.
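Stepping back to the dataset assembly just described, here is a simplified sketch of the idea: annotated positives and hard negatives, a tunable number of randomly sampled easy negatives, and a time-based split. Column names and split ratios are assumptions, not our production pipeline.

```python
"""
Hedged sketch of dataset assembly for the supervised violation detector:
positives + hard negatives come from annotations, easy negatives are sampled
at random, and splits follow annotation time (progressive evaluation).
"""
import pandas as pd

EASY_NEGATIVE_RATIO = 2.0   # hyper-parameter: easy negatives per positive example

def build_dataset(annotations: pd.DataFrame, random_listings: pd.DataFrame):
    positives = annotations[annotations.label != "no_violation"]
    hard_negatives = annotations[annotations.label == "no_violation"]

    n_easy = int(len(positives) * EASY_NEGATIVE_RATIO)
    easy_negatives = (random_listings.sample(n_easy, random_state=7)
                                     .assign(label="no_violation"))

    data = pd.concat([positives, hard_negatives, easy_negatives])
    data = data.sort_values("annotated_at")      # oldest first

    # Older examples train the model; the most recent tune and test it.
    n = len(data)
    train = data.iloc[: int(0.8 * n)]
    valid = data.iloc[int(0.8 * n): int(0.9 * n)]
    test = data.iloc[int(0.9 * n):]
    return train, valid, test
```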
Our model architecture includes a text encoder and an image encoder to learn representations (aka embeddings) for each modality. Our text encoder currently employs a BERT-based architecture to extract context-rich representations of our text inputs. To alleviate compute time, we leverage ALBERT, a lighter BERT with 90% fewer parameters, since its transformer blocks share parameters. Our initial lightweight representation used an in-house model trained for Search use cases, which allowed us to quickly start iterating and learning from this problem. Our image encoder currently employs EfficientNet, a very efficient and accurate Convolutional Neural Network (CNN). Our initial lightweight representation here used an in-house CNN model for category classification. We are experimenting with transformer-based architectures similar to our text encoders, using vision transformers, but so far their performance has not been a significant improvement.

Inspired by EmbraceNet, our architecture then further learns more constrained representations for the text and image embeddings separately, before they are concatenated to form a single multimodal representation. This is then sent to a final softmax activation that maps logits to probabilities for our internal use. In addition, to address the imbalanced nature of this problem, we leverage focal loss, which penalizes hard, misclassified examples more heavily. Figure 1 shows our model architecture, with late concatenation of our text and image encoders and final output probabilities on an example.

Figure 1. Model Architecture. Image is obtained from @charlesdeluvio on Unsplash.

Model Evaluation

First, we experimented and iterated by training our model offline. To evaluate its performance, we established certain benchmarks, based on the business goal of minimizing the impact on well-intentioned sellers while successfully detecting offending listings on the platform. This results in a typical evaluation trade-off between precision and recall: precision being the fraction of flagged listings that are true violations, and recall being the fraction of true violations that we flag. However, we faced the challenge that true recall is not possible to compute, as it's not feasible to manually review the millions and millions of new listings per day, so we had to settle for a proxy for recall based on what has been annotated.

Once we had a viable candidate to test in production, we deployed our model as an endpoint and built a service, callable via an API, to perform pre-processing and post-processing steps before and after the call to that endpoint. Then we ran an A/B test to measure its performance in production using a canary release approach: slowly rolling out our new detection system to a small percentage of traffic, which we kept increasing as we validated an increase in our metrics and no unexpected computation overload. Afterwards, we iterated: every time we had a promising offline candidate (the challenger) that improved our offline performance metrics, we A/B tested it against our current model (the champion). We designed guidelines for model promotion to increase our metrics and our policy coverage. Now, we monitor and observe our model predictions and trigger re-training when performance degrades.

Results

Our supervised learning system has been continually learning as we train frequently, run experiments with new datasets and model architectures, A/B test them and deploy them in production.
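Going back to the architecture in Figure 1, here is a condensed Keras sketch of the late-fusion design: per-modality projections of precomputed embeddings, concatenation, a softmax that includes the neutral class, and a focal loss. The dimensions, class count and loss parameters are illustrative assumptions, not our production values.

```python
"""
Hedged sketch of a late-fusion multimodal classifier with focal loss.
Inputs are assumed to be precomputed ALBERT and EfficientNet embeddings.
"""
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 8          # policy classes + 1 neutral "no violation" class (example)

def focal_loss(gamma=2.0, alpha=0.25):
    """Down-weights easy examples so hard misclassifications dominate the loss."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        weight = alpha * tf.pow(1.0 - y_pred, gamma)
        return -tf.reduce_sum(y_true * weight * tf.math.log(y_pred), axis=-1)
    return loss

text_in = layers.Input(shape=(768,), name="albert_embedding")
image_in = layers.Input(shape=(1280,), name="efficientnet_embedding")

# EmbraceNet-inspired step: learn a constrained representation per modality
# before fusing them into a single multimodal vector.
text_h = layers.Dense(256, activation="relu")(text_in)
image_h = layers.Dense(256, activation="relu")(image_in)
fused = layers.Concatenate()([text_h, image_h])

out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model([text_in, image_in], out)
model.compile(optimizer="adam", loss=focal_loss(), metrics=["accuracy"])
```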
We have added further violation types as additional classes to our model. As a result, we have identified and removed more than 100,000 violations using these methodologies, in addition to other tools and services that continue to detect and remove violations. This is one of our approaches to identifying potentially offending content, alongside others such as explicitly using policy information and leveraging the latest in Large Language Models (LLMs) and Generative AI. Stay tuned!

"To infinity and beyond!" –Buzz Lightyear, Toy Story
In 2020, Etsy concluded its migration from an on-premise data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, dedicated to maximizing the organization's cloud value by fostering collaborations within and outside the organization, particularly with our Cloud Providers. Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.

Forecast Methodology

A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:

- Core Infrastructure: the costs of infrastructure and services essential for operating the Etsy.com website.
- Machine Learning & Product Enablement: costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.
- Data Enablement: costs related to shared platforms for data collection, data processing and workflow orchestration.
- Dev: non-production resources.

The FinOps forecasting model relies on a trailing Cost Per Visit (CPV) metric. While CPV provides valuable insights into changes, it's not without limitations:

- A meaningful portion of web traffic to Etsy involves non-human activity, like web crawlers, that isn't accounted for in CPV.
- Some services have weaker correlations to user visits.
- Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments or big data workflows.
- A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.

Periodically, we run regression tests to validate that CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value) and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.

Monitoring and Readouts

We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week/month-over-month growth. Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances. While we track costs at the organization/cost center level, we don't charge costs back to the teams. This both lowers our overhead and, more importantly, provides flexibility to make tradeoffs that enable Engineering velocity.
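As a toy illustration of the trailing-CPV baseline described under Forecast Methodology (not our actual model), the core arithmetic is a rolling ratio of cost to visits projected onto forecasted visits:

```python
"""
Toy sketch: compute a trailing cost-per-visit and use it to project an
expected-spend baseline from a visit forecast. Window size and column names
are assumptions.
"""
import pandas as pd

def forecast_spend(daily: pd.DataFrame, visit_forecast: pd.Series,
                   window_days: int = 28) -> pd.Series:
    """`daily` has columns ['cost', 'visits'] and a DatetimeIndex."""
    trailing = daily.rolling(f"{window_days}D").sum()
    cpv = (trailing["cost"] / trailing["visits"]).iloc[-1]  # latest trailing CPV
    return visit_forecast * cpv   # expected spend per forecasted day
```

Actual spend is then tracked against this baseline, and sustained deviations trigger the detection-and-mitigation loop described in the next section.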
Cost Increase Detection & Mitigation

Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this we follow a standard sequence:

- Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e. variance in resource utilization).
- Collaboration: Engage relevant teams, sharing insights and seeking additional context.
- Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed.
- Mitigation: Unexpected increases undergo joint troubleshooting, where we outline and assign action items to owners until issues are resolved.
- Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast, post-confirmation or resolution with teams and engineering leadership.

Cost Optimization Initiatives

Another side of maintaining a healthy CPV involves cost optimization, offsetting increases from product launches. Ideas for cost savings come as a result of collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of a cost optimization rather than solely on the savings, recognizing that inefficiencies often signal larger problems. Based on effort vs. value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakout of some of the major wins we have had in the last year or so.

- GCS Storage Optimization: In 2023 we stood up a squad focused on optimizing Etsy's use of GCS, as it has been one of the largest growth areas for us over the past few years. The squad delivered a number of improvements, including improved monitoring of usage, automation features for Data engineers, implementation of TTLs that match data access patterns/business needs, and the adoption of Intelligent tiering. Due to these efforts, Etsy's GCS usage is now less than it was two years ago.
- Compute Optimization: Migrated over 90% of Etsy infrastructure that is serving traffic to the latest and greatest CPU platform. This improved our serving latency while reducing cost.
- Increased Automation for model deployment: In an effort to improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for new models being deployed, which also ended up saving us money.
- Network Compression: Enabling network compression between our high-throughput services both improved the latency profile and drastically reduced the networking cost.

What's Next

While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.
Etsy features a diverse marketplace of unique handmade and vintage items. It’s a visually diverse marketplace as well, and computer vision has become increasingly important to Etsy as a way of enhancing our users’ shopping experience. We’ve developed applications like visual search and visually similar recommendations that can offer buyers an additional path to find what they’re looking for, powered by machine learning models that encode images as vector representations. Figure 1. Visual representations power applications such as visual search and visually similar recommendations Learning expressive representations through deep neural networks, and being able to leverage them in downstream tasks at scale, is a costly technical challenge. The infrastructure required to train and serve large models is expensive, as is the iterative process that refines them and optimizes their performance. The solution is often to train deep learning architectures offline and use the pre-computed pretrained visual representations in downstream tasks served online. (We wrote about this in a previous blog post on personalization from real-time sequences and diversity of representations.) In any application where a query image representation is inferred online, it's important that you have low latency, memory-aware models. Efficiency becomes paramount to the success of these models in the product. We can think about efficiency in deep learning along multiple axes: efficiency in model architecture, model training, evaluation and serving. Model Architecture The EfficientNet family of models features a convolutional neural network architecture. It uniformly optimizes for network width, depth, and resolution using a fixed set of coefficients. By allowing practitioners to start from a limited resource budget and scale up for better accuracy as more resources are available, EfficientNet provides a great starting point for visual representations. We began our trials with EfficientNetB0, the smallest size model in the EfficientNet family. We saw good performance and low latency with this model, but the industry and research community have touted Vision Transformers (ViT) as having better representations. We decided to give that a try. Transformers lack the spatial inductive biases of CNN, but they outperform CNN when trained on large enough datasets and may be more robust to domain shifts. ViT decomposes the image into a sequence of patches (16X16 for example) and applies a transformer architecture to incorporate more global information. However, due to the massive number of parameters and compute-heavy attention mechanism, ViT-based architectures can be many times slower to train and inference than lightweight Convolutional Networks. Despite the challenges, more efficient ViT architectures have recently begun to emerge, featuring clever pooling, layer dropping, efficient normalization, and efficient attention or hybrid CNN-transformer designs. We employ the EfficientFormer-l3 to take advantage of these ViT improvements. The EfficientFormer architecture achieves efficiency through downsampling multiple blocks and employing attention only in the last stage. This derived image representation mechanism differs from the standard vision transformer, where embeddings are extracted from the first token of the output. Instead, we extract the attention from the last block for the eight heads and perform average pooling over the sequence. 
In Figure 2 we illustrate these different attention weights with heat maps overlaid on an image, showing how each of the eight heads learns to focus on a different salient part. Figure 2. Probing the EfficientFormer-l3 pre-trained visual representations through attention heat maps. Model Training Fine-Tuning With our pre-trained backbones in place, we can gain further efficiencies via fine tuning. For the EfficientNetB0 CNN, that means replacing the final convolutional layer and attaching a d-dimensional embedding layer followed by m classification heads, where m is the number of tasks. The embedding head consists of a new convolutional layer with the desired final representation dimension, followed by a batch normalization layer, a swish activation and a global average pooling layer to aggregate the convolutional output into a single vector per example. To train EfficientNetB0, new attached layers are trained from scratch for one epoch with the backbone layers frozen, to avoid excessive computation and overfitting. We then unfreeze 75 layers from the top of the backbone and finetune for nine additional epochs, for efficient learning. At inference time we remove the classification head and extract the output of the pooling layer as the final representation. To fine-tune the EfficientFormer ViT we stick with the pretraining resolution of 224X224, since using sequences longer than the recommended 384X384 in ViT leads to larger training budgets. To extract the embedding we average pool the last hidden state. Then classification heads are added as with the CNN, with batch normalization being swapped for layer normalization. Multitask Learning In a previous blog post we described how we built a multitask learning framework to generate visual representations for Etsy's search-by-image experience. The training architecture is shown in Figure 3. Figure 3. A multitask training architecture for visual representations. The dataset sampler combines examples from an arbitrary number of datasets corresponding to respective classification heads. The embedding is extracted before the classification heads. Multitask learning is an efficiency inducer. Representations encode commonalities, and they perform well in diverse downstream tasks when those are learned using common attributes as multiple supervision signals. A representation learned in single-task classification to the item’s taxonomy, for example, will be unable to capture visual attributes: colors, shapes, materials. We employ four classification tasks: a top-level taxonomy task with 15 top-level categories of the Etsy taxonomy tree as labels; a fine-grained taxonomy task, with 1000 fine-grained leaf node item categories as labels; a primary color task; and a fine-grained taxonomy task (review photos), where each example is a buyer-uploaded review photo of a purchased item with 100 labels sampled from fine-grained leaf node item categories. We are able to train both EfficientNetB0 and EfficientFormer-l3 on standard 16GB GPUs (we used two P100 GPUs). For comparison, a full sized ViT requires a larger 40GB RAM GPU such as an A100, which can increase training costs significantly. We provide detailed hyperparameter information for fine-tuning either backbone in our article. Evaluating Visual Representations We define and implement an evaluation scheme for visual representations to track and guide model training, on three nearest neighbor retrieval tasks. 
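Before going further, here is a trimmed-down Keras sketch of the fine-tuning recipe described above for the EfficientNetB0 variant: a convolutional embedding head with batch normalization, swish and global average pooling, one classification head per task, a frozen first epoch, then unfreezing the top 75 layers. Task label counts come from the text where given; the color-class count and other details are assumptions.

```python
"""
Hedged sketch of multitask fine-tuning on an EfficientNetB0 backbone.
The dataset sampler that feeds the four heads is omitted.
"""
import tensorflow as tf
from tensorflow.keras import layers, Model

EMBED_DIM = 256
TASKS = {"top_taxonomy": 15, "fine_taxonomy": 1000,
         "primary_color": 20, "review_taxonomy": 100}   # color count is a guess

backbone = tf.keras.applications.EfficientNetB0(include_top=False,
                                                input_shape=(224, 224, 3))
backbone.trainable = False                # phase 1: train only the new layers

# Embedding head: 1x1 conv to the target dimension, batch norm, swish, pooling.
x = layers.Conv2D(EMBED_DIM, 1)(backbone.output)
x = layers.BatchNormalization()(x)
x = layers.Activation("swish")(x)
embedding = layers.GlobalAveragePooling2D(name="embedding")(x)

# One classification head per task, all sharing the embedding.
outputs = {name: layers.Dense(n, activation="softmax", name=name)(embedding)
           for name, n in TASKS.items()}
model = Model(backbone.input, outputs)
model.compile(optimizer="adam",
              loss={name: "sparse_categorical_crossentropy" for name in TASKS})
# model.fit(...) for one epoch with the backbone frozen, then:

backbone.trainable = True                 # phase 2: unfreeze the top 75 layers
for layer in backbone.layers[:-75]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss={name: "sparse_categorical_crossentropy" for name in TASKS})
# model.fit(...) for nine more epochs. At inference time the heads are dropped:
embedding_model = Model(backbone.input, embedding)   # 256-d representation
```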
After each training epoch, a callback is invoked to compute and log the recall for each retrieval task. Each retrieval dataset is split into two smaller datasets: “queries” and “candidates.” The candidates dataset is used to construct a brute-force nearest neighbor index, and the queries dataset is used to look up the index. The index is constructed on the fly after each epoch to accommodate for embeddings changing between training epochs. Each lookup yields K nearest neighbors. We compute Recall@5 and @10 using both historical implicit user interactions (such as “visually-similar ad clicks”) and ground truth datasets of product photos taken from the same listing (“intra-item”). The recall callbacks can also be used for early stopping of training to enhance efficiency. The intra-item retrieval evaluation dataset consists of groups of seller-uploaded images of the same item. The query and candidate examples are randomly selected seller-uploaded images of an item. A candidate image is considered a positive example if it is associated with the same item as the query. In the “intra-item with reviews” dataset, the query image is a randomly selected buyer-uploaded review image of an item, with seller-uploaded images providing candidate examples. The dataset of visually similar ad clicks associates seller-uploaded primary images with primary images of items that have been clicked in the visually similar surface on mobile. Here, a candidate image is considered a positive example for some query image if a user viewing the query image has clicked it. Each evaluation dataset contains 15,000 records for building the index and 5,000 query images for the retrieval phase. We also leverage generative AI for an experimental new evaluation scheme. From ample, multilingual historical text query logs, we build a new retrieval dataset that bridges the semantic gap between text-based queries and clicked image candidates. Text-to-image generative stable diffusion makes the information retrieval process language-agnostic, since an image is worth a thousand (multilingual) words. A stable diffusion model generates high-quality images which become image queries. The candidates are images from clicked items corresponding to the source text query in the logs. One caveat is that the dataset is biased toward the search-by-text production system that produced the logs; only a search-by-image-from-text system would produce truly relevant evaluation logs. The source-candidate image pairs form the new retrieval evaluation dataset which is then used within a retrieval callback. Of course, users entering the same text query may have very different ideas in mind of, say, the garment they’re looking for. So for each query we generate several images: formally, a random sample of length 𝑛 from the posterior distribution over all possible images that can be generated from the seed text query. We pre-condition our generation on a uniform “fashion style.” In a real-world scenario, both the text-to-image query generation and the image query inference for retrieval happen in real time, which means efficient backbones are necessary. We randomly select one of the 𝑛 generated images to replace the text query with an image query in the evaluation dataset. This is a hybrid evaluation method: the error inherent in the text-to-image diffusion model generation is encapsulated in the visually similar recommendation error rate. 
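Returning to the retrieval callback described at the start of this section, the sketch below shows one way to implement it, with a brute-force index rebuilt after every epoch. scikit-learn stands in for the exact nearest-neighbour search; the embedding-model handle and the id arrays are assumptions.

```python
"""
Hedged sketch of a per-epoch Recall@K callback: embed candidates and queries,
build an exact nearest-neighbour index on the fly, and log recall.
"""
import numpy as np
import tensorflow as tf
from sklearn.neighbors import NearestNeighbors

class RecallCallback(tf.keras.callbacks.Callback):
    def __init__(self, embedding_model, query_imgs, query_ids,
                 cand_imgs, cand_ids, k=10):
        super().__init__()
        self.embedding_model = embedding_model
        self.query_imgs, self.query_ids = query_imgs, np.asarray(query_ids)
        self.cand_imgs, self.cand_ids = cand_imgs, np.asarray(cand_ids)
        self.k = k

    def on_epoch_end(self, epoch, logs=None):
        # Embeddings move every epoch, so the index is rebuilt each time.
        cand_emb = self.embedding_model.predict(self.cand_imgs, verbose=0)
        query_emb = self.embedding_model.predict(self.query_imgs, verbose=0)
        index = NearestNeighbors(n_neighbors=self.k).fit(cand_emb)
        _, nn_idx = index.kneighbors(query_emb)

        # A hit: at least one of the K neighbours shares the query's item id
        # (or was clicked for that query, depending on the dataset).
        hits = [qid in self.cand_ids[neighbors]
                for qid, neighbors in zip(self.query_ids, nn_idx)]
        recall = float(np.mean(hits))
        if logs is not None:
            logs[f"recall_at_{self.k}"] = recall   # also usable for early stopping
        print(f"epoch {epoch}: recall@{self.k} = {recall:.3f}")
```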
Future work may include prompt engineering to improve the text query prompt itself, which as input by the user can be short and lacking in detail. Large memory requirements and high inference latency are challenges in using text-to-image generative models at scale. We employ an open source fast stable diffusion model through token merging and float 16 inference. Compared to the standard stable diffusion implementation available at the time we built the system, this method speeds up inference by 50% with a 5x reduction in memory consumption, though results depend on the underlying patched model. We can generate 500 images per hour with one T4 GPU (no parallelism) using the patched stable diffusion pipeline. With parallelism we can achieve further speedup. Figure 4 shows that for the English text query “black bohemian maxi dress with orange floral pattern” the efficient stable diffusion pipeline generates five image query candidates. The generated images include pleasant variations with some detail loss. Interestingly, mostly the facial details of the fashion model are affected, while the garment pattern remains clear. In some cases degradation might prohibit display, but efficient generative technology is being perfected at a fast pace, and prompt engineering helps the generative process as well. Figure 4. Text-to-image generation using a generative diffusion model, from equivalent queries in English and French Efficient Inference and Downstream Tasks Especially when it comes to latency-sensitive applications like visually similar recommendations and search, efficient inference is paramount: otherwise, we risk loss of impressions and a poor user experience. We can think of inference along two axes: online inference of the image query and efficient retrieval of top-k most similar items via approximate nearest neighbors. The dimension of the learned visual representation impacts the efficient retrieval design as well, and the smaller 256d derived from the EfficientNetB0 presents an advantage. EfficientNet B0 is hard to beat in terms of accuracy-to-latency trade-offs for online inference, with ~5M parameters and around 1.7ms latency on iPhone 12. The EfficientFormer-l3 has ~30M parameters and gets around 2.7ms latency on iPhone 12 with higher accuracy (while for example MobileViT-XS scores around 7ms with a third of accuracy; very large ViT are not considered since latencies are prohibitive). In offline evaluation, the EfficientFormer-l3-derived embedding achieves around +5% lift in the Intra-L Recall@5 evaluation, a +17% in Intra-R Recall@5, and a +1.8% in Visually Similar Ad clicks Recall@5. We performed A/B testing on the EfficientNetB0 multitask variant across visual applications at Etsy with good results. Additionally, the EfficientFormer-l3 visual representations led to a +0.65% lift in CTR, and a similar lift in purchase rate in a first visually-similar-ads experiment when compared to the production variant of EfficientNetB0. When included in sponsored search downstream rankers, the visual representations led to a +1.26% lift in post-click purchase rate. Including the efficient visual representation in Ads Information Retrieval (AIR), an embedding-based retrieval method used to retrieve similar item ad recommendations caused an increase in click-recall@100 of 8%. And when we used these representations to compute image similarity and included them directly in the last-pass ranking function, we saw a +6.25% lift in clicks. 
The first use of EfficientNetB0 visual embeddings was in visually similar ad recommendations on mobile. This led to a +1.92% increase in ad return-on-spend on iOS and a +1.18% increase in post-click purchase rate on Android. The same efficient embedding model backed the first search-by-image shopping experience at Etsy. Users search using photos taken with their mobile phone’s camera and the query image embedding is inferred efficiently online, which we discussed in a previous blog post. Learning visual representations is of paramount importance in visually rich e-commerce and online fashion recommendations. Learning them efficiently is a challenging goal made possible by advances in the field of efficient deep learning in computer vision. If you'd like a more in-depth discussion of this work, please see our full accepted paper to the #fashionXrecsys workshop at the Recsys 2023 conference.
Easily the most important and complex screen in the Buy on Etsy Android app is the listing screen, where all key information about an item for sale in the Etsy marketplace is displayed to buyers. Far from just a title and description, a price and a few images, over the years the listing screen has come to aggregate ratings and reviews, seller and shipping and stock information, and gained a variety of personalization and recommendation features. As information-rich as it is, as central as it is to the buying experience, for product teams the listing screen is an irresistible place to test out new methods and approaches. In just the last three years, apps teams have run nearly 200 experiments on it, often with multiple teams building and running experiments in parallel. Eventually, with such a high velocity of experiment and code change, the listing screen started showing signs of stress. Its architecture was inconsistent and not meant to support a codebase expanding so much and so rapidly in size and complexity. Given the relative autonomy of Etsy app development teams, there ended up being a lot of reinventing the wheel, lots of incompatible patterns getting layered atop one another; in short the code resembled a giant plate of spaghetti. The main listing Fragment file alone had over 4000 lines of code in it! Code that isn’t built for testability doesn’t test well, and test coverage for the listing screen was low. VERY low. Our legacy architecture made it hard for developers to add tests for business logic, and the tests that did get written were complex and brittle, and often caused continuous integration failures for seemingly unrelated changes. Developers would skip tests when it seemed too costly to write and maintain them, those skipped tests made the codebase harder for new developers to onboard into or work with confidently, and the result was a vicious circle that would lead to even less test coverage. Introducing Macramé We decided that our new architecture for the listing screen, which we’ve named Macramé, would be based on immutable data propagated through a reactive UI. Reactive frameworks are widely deployed and well understood, and we could see a number of ways that reactivity would help us untangle the spaghetti. We chose to emulate architectures like Spotify’s Mobius, molded to fit the shape of Etsy’s codebase and its business requirements. At the core of the architecture is an immutable State object that represents our data model. State for the listing screen is passed to the UI as a single data object via a StateFlow instance; each time a piece of the data model changes the UI re-renders. Updates to State can be made either from a background thread or from the main UI thread, and using StateFlow ensures that all updates reach the main UI thread. When the data model for a screen is large, as it is for the listing screen, updating the UI from a single object makes things much simpler to test and reason about than if multiple separate models are making changes independently. And that simplicity lets us streamline the rest of the architecture. When changes are made to the State, the monolithic data model gets transformed into a list of smaller models that represent what will actually be shown to the user, in vertical order on the screen. The code below shows an example of state held in the Buy Box section of the screen, along with its smaller Title sub-component. 
data class BuyBox( val title: Title, val price: Price, val saleEndingSoonBadge: SaleEndingSoonBadge, val unitPricing: UnitPricing, val vatTaxDescription: VatTaxDescription, val transparentPricing: TransparentPricing, val firstVariation: Variation, val secondVariation: Variation, val klarnaInfo: KlarnaInfo, val freeShipping: FreeShipping, val estimatedDelivery: EstimatedDelivery, val quantity: Quantity, val personalization: Personalization, val expressCheckout: ExpressCheckout, val cartButton: CartButton, val termsAndConditions: TermsAndConditions, val ineligibleShipping: IneligibleShipping, val lottieNudge: LottieNudge, val listingSignalColumns: ListingSignalColumns, val shopBanner: ShopBanner, ) data class Title( val text: String, val textInAlternateLanguage: String? = null, val isExpanded: Boolean = false, ) : ListingUiModel() In our older architecture, the screen was based on a single scrollable View. All data was bound and rendered during the View's initial layout pass, which created a noticeable pause the first time the screen was loaded. In the new screen, a RecyclerView is backed by a ListAdapter, which allows for asynchronous diffs of the data changes, avoiding the need to rebind portions of the screen that aren't receiving updates. Each of the vertical elements on the screen (title, image gallery, price, etc.) is represented by its own ViewHolder, which binds whichever of the smaller data models the element relies on. In this code, the BuyBox is transformed into a vertical list of ListingUiModels to display in the RecyclerView. fun BuyBox.toUiModels(): List { return listOf( price, title, shopBanner, listingSignalColumns, unitPricing, vatTaxDescription, transparentPricing, klarnaInfo, estimatedDelivery, firstVariation, secondVariation, quantity, personalization, ineligibleShipping, cartButton, expressCheckout, termsAndConditions, lottieNudge, ) } An Event dispatching system handles user actions, which are represented by a sealed Event class. The use of sealed classes for Events, coupled with Kotlin "when" statements mapping Events to Handlers, provides compile-time safety to ensure all of the pieces are in place to handle the Event properly. These Events are fed to a single Dispatcher queue, which is responsible for routing Events to the Handlers that are registered to receive them. Handlers perform a variety of tasks: starting asynchronous network calls, dispatching more Events, dispatching SideEffects, or updating State. We want to make it easy to reason about what Handlers are doing, so our architecture promotes keeping their scope of responsibility as small as possible. Simple Handlers are simple to write tests for, which leads to better test coverage and improved developer confidence. In the example below, a click handler on the listing title sets a State property that tells the UI to display an expanded title: class TitleClickedHandler constructor() { fun handle(state: ListingViewState.Listing): ListingEventResult.StateChange { val buyBox = state.buyBox return ListingEventResult.StateChange( state = state.copy( buyBox = buyBox.copy( title = title.copy(isExpanded = true) ) ) ) } } SideEffects are a special type of Event used to represent, typically, one-time operations that need to interact with the UI but aren’t considered pure business logic: showing dialogs, logging events, performing navigation or showing Snackbar messages. SideEffects end up being routed to the Fragment to be handled. Take the scenario of a user clicking on a listing's Add to Cart button. 
The Handler for that Event might: dispatch a SideEffect to log the button click start an asynchronous network call to update the user’s cart update the State to show a loading indicator while the cart update finishes While the network call is running on a background thread, the Dispatcher is free to handle other Events that may be in the queue. When the network call completes in the background, a new Event will be dispatched with either a success or failure result. A different Handler is then responsible for handling both the success and failure Events. This diagram illustrates the flow of Events, SideEffects, and State through the architecture: Figure 1. A flow chart illustrating system components (blue boxes) and how events and state changes (yellow boxes) flow between them. Results The rewrite process took five months, with as many as five Android developers working on the project at once. One challenge we faced along the way was keeping the new listing screen up to date with all of the experiments being run on the old listing screen while development was in progress. The team also had to create a suite of tests that could comprehensively cover the diversity of listings available on Etsy, to ensure that we didn’t forget any features or break any. With the rewrite complete, the team ran an A/B experiment against the existing listing screen to test both performance and user behavior between the two versions. Though the new listing screen felt qualitatively quicker than the old listing screen, we wanted to understand how users would react to subtle changes in the new experience. We instrumented both the old and the new listing screens to measure performance changes from the refactor. The new screen performed even better than expected. Time to First Content was decreased by 18%, going from 1585 ms down to 1298 ms. This speedup resulted in the average number of listings viewed by buyers increasing 2.4%, add to carts increasing 0.43%, searches increasing by 2%, and buyer review photo views increasing by 3.3%. On the developer side, unit test coverage increased from single digit percentages to a whopping 76% code coverage of business logic classes. This significantly validates our decision to put nearly all business logic into Handler classes, each responsible for handling just a single Event at a time. We built a robust collection of tools for generating testing States in a variety of common configurations, so writing unit tests for the Handlers is as simple as generating an input event and validating that the correct State and SideEffects are produced. Creating any new architecture involves making tradeoffs, and this project was no exception. Macramé is under active development, and we have a few pieces of feedback on our agenda to be addressed: There is some amount of boilerplate still needed to correctly wire up a new Event and Handler, and we'd like to make that go away. The ability of Handlers to dispatch their own Events sometimes makes debugging complex Handler interactions more difficult than previous formulations of the same business logic. On a relatively simple screen, the architecture can feel like overkill. Adding new features correctly to the listing screen is now the easy thing to do. 
The dual benefit of increasing business metrics while also increasing developer productivity and satisfaction has resulted in the Android team expanding the usage of Macramé to two more of the key screens in the app (Cart and Shop), both of which completely rewrote their UI using Jetpack Compose: but those are topics for future Code as Craft posts.
Balancing Engineering Ambition with Product Realism Introduction In July of 2023, Etsy’s App Updates team, responsible for the Updates feed in Etsy’s mobile apps, set off with an ambitious goal: to revamp the Updates tab to become Deals, a home for a shopper’s coupons and sales, in time for Cyber Week 2023. The Updates tab had been around for years, and in the course of its evolution ended up serving multiple purposes. It was a hub for updates about a user’s favorite shops and listings, but it was also increasingly a place to start new shopping journeys. Not all updates were created equal. The most acted-upon updates in the tab were coupons offered for abandoned cart items, which shoppers loved. We spotted an opportunity to clarify intentions for our users: by refactoring favorite-based updates into the Favorites tab, and (more boldly), by recentering Updates and transforming it into a hub for a buyer’s deals. Technical Opportunity While investigating the best way to move forward with the Deals implementation, iOS engineers on the team advocated for developing a new tab from the ground up. Although it meant greater initial design and architecture effort, an entirely new tab built on modern patterns would let us avoid relying on Objective C, as well as internal frameworks like SDL (server-driven layout), which is present in many legacy Etsy app screens and comes with a variety of scalability and performance issues, and is in the process of being phased out. At the same time, we needed a shippable product by October. Black Friday and Cyber Week loomed on the horizon in November, and it would be a missed opportunity, for us and for our users, not to have the Deals tab ready to go. Our ambition to use modern, not yet road-tested technologies would have to balance with realism about the needs of the product, and we were conscious of maintaining that balance throughout the course of development. In comes Swift UI and Tuist! Two new frameworks were front of mind when starting this project: Swift UI and Tuist. Swift UI provides a clear, declarative framework for UI development, and makes it easy for engineers to break down views into small, reusable components. Maybe Swift UI’s biggest benefit is its built-in view previews: in tandem with componentization, it becomes a very straightforward process to build a view out of smaller pieces and preview at every step of the way. Our team had experimented with Swift UI in the past, but with scopes limited to small views, such as headers. Confident as we were about the framework, we expected that building out a whole screen in Swift UI would present us some initial hurdles to overcome. In fact, one hurdle presented itself right away. In a decade-old codebase, not everything is optimized for use with newer technologies. The build times we saw for our Swift UI previews, which were almost long enough to negate the framework’s other benefits, testified to that fact. This is where Tuist comes in. Our App Enablement team, which has been hard at work over the past few years modernizing the Etsy codebase, has adopted Tuist as a way of taming the monolith making it modular. Any engineer at Etsy can declare a Tuist module in their project and start working on it, importing parts of the larger codebase only as they need them. (For more on Etsy’s usage of Tuist, check out this article by Mike Simons from the App Enablement team.) 
Moving our work for the Deals tab into a Swift-based Tuist module gave us what it took to make a preview-driven development process practical: our previews build nearly instantly, and so long as we’re only making changes in our framework the app recompiles with very little delay. Figure 1. A view of a goal end state of a modular Etsy codebase, with a first layer of core modules (in blue), and a second layer of client-facing modules that combine to build the Etsy app. Our architecture The Deals tab comprises a number of modules for any given Etsy user, including a Deals Just for You module with abandoned cart coupons, and a module that shows a user their favorite listings that are on sale. Since the screen is just a list of modules, the API returns them as an array of typed items with the following structure: { "type": "", "": { ... } } Assigning each module a type enables us to parse it correctly on the client, and moves us away from the anonymous component-based API models we had used in the past. Many models are still used across modules, however. These include, but are not limited to, buttons, headers and footers, and listing cards. To parse a new module, we either have to build a new component if it doesn't exist yet, or reuse one that does. Adding a footer to a module, for example, can be as simple as: // Model { "type": "my_module", "my_module": { "target_listing": { }", "recommended_listings": [ ], "footer": { } // Add footer here } } // View var body: some View { VStack { ListingView(listing: targetListing) ListingCarouselView(listings: recommendedListings) MyFooterView(footer: footer) // Add footer here } } We also used Decodable implementations for our API parsing, leading to faster, clearer code and an easier way to handle optionals. With Etsy’s internal APIv3 framework built on top of Apple’s Decodable protocol, it is very straightforward to define a model and decide what is and isn’t optional, and let the container handle the rest. For example, if the footer was optional, but the target and recommended listings are required, decoding would look like this: init(from decoder: Decoder) throws { let container = try decoder.containerV3(keyedBy: CodingKeys.self) // These will throw if they aren't included in the response self.targetListing = try container.requireV3(forKey: .targetListing) self.recommendedListings = try container.requireV3(forKey: .recommendedListings) // Footer is optional self.footer = container.decodeV3(forKey: .footer) } As for laying out the view on the screen, we used a Swift UI List to make the most of the under-the-hood cell reuse that List uses. Figure 2. On the left-hand side, a diagram of how the DealsUI view is embedded in the Etsy app. On the right-hand side, a diagram of how the DeasUI framework takes the API response and renders a list of module views with individual components. Previews, previews, more previews If we were going to maintain a good development pace, we needed to figure out a clean way to use Swift previews. Previewing a small component, like a header that takes a string, is simple enough: just initialize the header view with the header string. For more complex views, though, it gets cumbersome to build a mock API response every time you need to preview. This complexity is only amplified when previewing an entire Deals module. To streamline the process, we decided to add a Previews enum to our more complex models. A good example of this is in the Deals Just for You coupon cards. 
These cards display an image or an array of images, a few lines of custom text (depending on the coupon type), and a button. Our previews enum for this API model looks like this: // In an extension to DealsForYouCard enum Previews { static var shopCouponThreeImage: ResponseModels.DealsForYouCard { let titleText = "IrvingtonWoodworksStudio" let images = [...] // Three images let button = ResponseModels.Button( buttonText: "10% off shop", action: .init(...) ) return ResponseModels.DealsForYouCard( button: button, saleBadge: "20% off", titleText: titleText, subtitleText: "Favorited shop", action: .init(...), images: images ) } static var listingCoupon: ResponseModels.DealsForYouCard { ... } } Then, previewing a variety of coupon cards, it’s as straightforward as: #Preview { DealsForYouCardView(coupon: .Previews.listingCoupon) } #Preview { DealsForYouCardView(coupon: .Previews.shopCouponThreeImage) } The other perk of this architecture is that it makes it very easy to nest previews, for example when previewing an entire module. To build preview data for the Deals for You module, we can use our coupon examples this way: // In an extension to DealsForYouModule enum Previews { static var mockModule: ResponseModels.DealsForYouModule { let items: [ResponseModels.DealsForYouCard] = [.Previews.listingCoupon, .Previews.shopCouponThreeImage, .Previews.shopCouponTwoImage] let header = ResponseModels.DealsForYouHeader(title: "Deals just for you") return .init(header: header, items: items) } } These enums are brief, clear, and easy to understand; they allow us to lean into the benefits of modularity. This architecture, along with our Decodable models, also enabled us to clear a roadblock that used to occur when our team had to wait for API support before we could build new modules. For example, both the Similar Items on Sale and Extra Special Deals modules in the Deals tab were built via previews, and were ready approximately two weeks before the corresponding API work was complete, helping us meet deadlines and not have to wait for a new App Store release. By taking full advantage of Swift UI's modularity and previewability, not only were we able to prove out a set of new technologies, we also exceeded product expectations by significantly beating our deadlines even with the initial overhead of adopting the framework. Challenges: UIKit interoperability Particularly when it came to tasks like navigation and favoriting, interfacing between our module and the Etsy codebase could pose challenges. An assumption that we had as engineers going into this project was that the code to open a listing page, for example, would just be readily available to use; this was not the case, however. Most navigation code within the Etsy codebase is handled by an Objective C class called EtsyScreenController. While in the normal target, it’s as straightforward as calling a function to open a listing page, that functionality was not available to us in our Deals module. One option would have been to build our own navigation logic using Swift UI Navigation stacks; we weren’t trying to reinvent the wheel, however. To balance product deadlines and keep things as simple as possible, we decided not to be dogmatic, and to handle navigation outside of our framework. We did this by building a custom @Environment struct, called DealsAction, which passes off responsibility for navigation back to the main target, and uses the new Swift callAsFunction() feature so we can treat this struct like a function in our views. 
We have a concept of a DealsAction type in our API response, which enables us to match an action with an actionable part of the screen. For example, a button response has an action that will be executed when a user taps the button. The DealsAction handler takes that action, and uses our existing UIKit code to perform it. The Deals tab is wrapped in a UIHostingController in the main Etsy target, so when setting up the Swift UI view, we also set the DealsAction environment object using a custom view modifier: let dealsView = DealsView() .handleDealsAction { [weak self] in self?.handleAction(action: $0) } ... func handleDealsAction(action: DealsAction) { // UIKit code to execute action } Then, when we need to perform an action from a Swift UI view, the action handler is present at any layer within the view hierarchy within the Deals tab. Performing the action is as simple as: @Environment(\.handleDealsAction) var handleDealsAction: DealsAction ... MyButton(title: buttonText, fillWidth: false) { handleDealsAction(model.button?.action) } We reused this pattern for other existing functionality that was only available in the main target. For example, we built an environment object for favoriting listings, or for following a shop, and for logging performance metrics. This pattern allows us to include environment objects as needed, and it simplifies adding action handling to any view. Instead of rebuilding this functionality in our Tuist module in pure Swift, which could have taken multiple sprints, we struck a balance between modernization and the need to meet product deadlines. Challenges: Listing Cards The listing card view is a common component used across multiple screens within the Etsy app. This component was originally written in Objective-C and throughout the years has been enhanced to support multiple configurations and layouts, and to be available for A/B testing. It also has built-in functionality like favoriting, which engineers shouldn't have to reimplement each time they want to present a listing card. Figure 3. A diagram of how listing card views are conventionally built in UIKit, using configuration options and the analytics framework to combine various UIKit subviews. It's been our practice to reuse this same single component and make small modifications to support changes in the UI, as per product or experimentation requirements. This means that many functionalities, such as favoriting, long-press menus, and image manipulation, are heavily coupled with this single component, many parts of which are still written in Objective C. Early in the process of developing the new tab, we decided to scope out a way of supporting conventional listing card designs—ones that matched existing cards elsewhere in the app—without having to rebuild the entire card component in Swift UI. We knew a rebuild would eventually be necessary, since we expected to have to support listing cards that differed significantly from the standard designs, but the scope of such a rebuild was a known unknown. To balance our deadlines, we decided to push this more ambitious goal until we knew we had product bandwidth. Since the listing card view is heavily coupled with old parts of the codebase, however, it wasn’t as simple as just typing import ListingCard and flying along. We faced two challenges: first, the API model for a listing card couldn’t be imported into our module, and second the view couldn’t be imported for simple use in a UIViewRepresentable wrapper. 
To address these, we deferred responsibility back up to the UIKit view controller. Our models for a listing card component look something like this: struct ListingCard { public let listingCardWrapper: ListingCardWrapper let listingCard: TypedListingCard } The model is parsed in two ways: as a wrapper, where it is parsed as an untyped dictionary that will eventually be used to initialize our legacy listing card model, and as a TypedListingCard, which is used only within the Deals tab module. Figure 4. A diagram of how a UIKit listing card builder is passed from the main target to the Deals framework for rendering listing cards. To build the listing card view, we pass in a view builder to the SwiftUI DealsView initializer in the hosting controller code. Here, we are in the full Etsy app codebase, meaning that we have access to the legacy listing card code. When we need to build a listing card, we use this view builder as follows: var body: some View { LazyVGrid(...) { ForEach(listings) { listing in cardViewBuilder(listing) // Returns a UIViewRepresentable } } } There was some initial overhead involved in getting these cards set up, but it was worth it to guarantee that engineering unknowns in a Swift UI rewrite wouldn’t block us and compromise our deadlines. Once built, the support for legacy cards enabled us to reuse them easily wherever they were needed. In fact, legacy support was one of the things that helped us move faster than we expected, and it became possible to stretch ourselves and build at least some listing cards in the Deals tab entirely in Swift UI. This meant that writing the wrapper ultimately gave us the space we needed to avoid having to rely solely on the wrapper! Conclusion After just three months of engineering work, the Deals tab was built and ready to go, even beating product deadlines. While it took some engineering effort to overcome initial hurdles, as well as the switch in context from working in UIKit in the main target to working in Swift UI in our own framework, once we had solutions to those challenges and could really take advantage of the new architecture, we saw a very substantial increase in our engineering velocity. Instead of taking multiple sprints to build, new modules could take just one sprint or less; front-end work was decoupled from API work using Previews, which meant we no longer had to wait for mock responses or even API support at all; and maybe most important, it was fun to use Swift UI’s clear and straightforward declarative UI building, and see our changes in real time! From a product perspective, the Deals tab was a great success: buyers converted their sessions more frequently, and we saw an increase in visits to the Etsy app. The tab was rolled out to all users in mid October, and has seen significant engagement, particularly during Black Friday and Cyber Monday. By being bold and by diving confidently into new frameworks that we expected to see benefits from, we improved engineer experience and not just met but beat our product deadlines. More teams at Etsy are using Swift UI and Tuist in their product work now, thanks to the success of our undertaking, sometimes using our patterns to work through hurdles, sometimes creating their own. We expect to see more of this kind of modernization start to make its way into the codebase. As we iterate on the Deals tab over the next year, and make it even easier for buyers to find the deals that mean the most to them, we look forward to continuing to work in the same spirit. 
Special thanks to Vangeli Ontiveros for the diagrams in this article, and a huge shoutout to the whole App Deals team for their hard work on this project!
In the past, sellers were responsible for managing and fulfilling their own tax obligations. However, more and more jurisdictions are now requiring marketplaces such as Etsy to collect the tax from buyers and remit the tax to the relevant authorities. Etsy now plays an active role in collecting tax from buyers and remitting it all over the world. In this post, I will walk you through our tax calculation infrastructure and how we adapted to the ongoing increase in traffic and business needs over the years. The tax calculation workflow We determine tax whenever a buyer adds an item to their Etsy shopping cart. The tax determination is based on buyer and seller location and product category, and a set of tax rules and mappings. To handle the details of these calculations we partner with Vertex, and issue a call to their tax engine via the Quotation Request API to get the right amount to show in our buyer's cart. Vertex ensures accurate and efficient tax management and continuously updates the tax rules and rates for jurisdictions around the world. The two main API calls we use are Quotation Request and DistributeTaxRequest SOAP calls. When the buyer proceeds to payment, an order is created, and we call back to Vertex with a DistributeTaxRequest sending the order information and tax details. We sync information with Vertex through the order fulfillment lifecycle. To keep things up to date in case an order is canceled or a refund needs to be issued later on, we inform the details of the cancellation and refunds to the tax engine via DistributeTaxRequest. This ensures that when Vertex generates tax reports for us they will be based on a complete record of all the relevant transactions. Etsy collects the tax from the buyers and remits that tax to the taxing authority, when required. Generate tax details for reporting and audit purpose Vertex comes with a variety of report formats out of the box, and gives us tools to define our own. When Etsy calls the Distribute Tax API, Vertex saves the information we pass to it as raw metadata in its tax journal database. A daily cron job in Vertex then moves this data to the transaction detail table, populating it with tax info. When reports and audit data are generated, we download these reports and import to Etsy’s bigdata and the workflow completes. Mapping the Etsy taxonomy to tax categories Etsy maintains product categories to help our buyers find exactly the items they're looking for. To determine whether transactions are taxed or exempt it's not enough to know item prices and buyer locations: we have to map our product categories to Vertex's rule drivers. That was an effort involving not just engineering but also our tax and analytics teams, and with the wide range of Etsy taxonomy categories it was no small task. Handling increased API traffic Coping with the continuous increase in traffic and maintaining the best checkout experience without delays has been a challenge all the time. Out of the different upgrades we did, the most important ones were to switch to multiple instances for vertex calls and shadowing. Multiple Instance upgrade In our initial integration, we were using the same vertex instance for Quotation and Distribute calls. And the same instance was responsible for generating the reports. This report generation started to affect our checkout experience. Reports are generally used by our tax team and they run them on a regular basis. 
But on top of that, we also run daily reports to feed the data captured by Vertex back into our own system for analytics purposes. We solved this by routing the quotation calls to one instance and then distributing them to the other. This helped in maintaining a clear separation of functionalities, and avoided interference between the two processes. We had to align the configurations between the instances as well. Splitting up the quotation and distribution calls opened up the door to horizontal scaling, now we can add as many instances of each type and load balance the requests between instances. Eg: When a request type lists multiple instances, we load balance between the instances by using the cart_id for quotations and receipt_ids for distributes I.e. cart_id % quotation_instance_count Shadow logging Shadow logging the requests helped us to simulate the stress on Vertex and monitor the checkout experience. We used this technique multiple times in the past. Whenever we had situations like, for example, adding five hundred thousand more listings whose taxes would be passed through the Vertex engine, we were concerned that the increase in traffic might impact buyer experience. To ensure it wouldn't, we tested for a period of time by slowly ramping shadow requests to Vertex: "Shadow requests" are test requests that we send to Vertex from orders, but without applying the calculated tax details to buyers' carts. This will simulate the load on vertex and we can monitor the cart checkout experience. Once we have done shadowing and seen how well Vertex handled the increased traffic, we are confident that releasing the features ensures it would not have any performance implications. Conclusion Given the volume of increasing traffic and the data involved, we will have to keep improving our design to support those. We've also had to address analytics, reporting, configuration sync and many more in designing the system, but we'll leave that story for next time.
A little while ago, Etsy introduced a new feature in its iOS app that could place Etsy sellers' artwork on a user's wall using Apple's Augmented Reality (AR) tools. It let them visualize how a piece would look in their space, and even gave them an idea of its size options. When we launched the feature as a beta, it was only available in "wall art"-related categories, and after the initial rollout we were eager to expand it to work with more categories. What differentiates Etsy is the nature of our sellers’ unique items. Our sellers create offerings that can be personalized in numbers of ways, and they often hand-make orders based on demand. Taking the same approach we did with wall art and attempting to show 3D models of millions of Etsy items – many of which could be further customized – would be a huge undertaking. Nevertheless, with inspiration from Etsy's Guiding Principles, we decided to dig deeper into the feature. What could we improve in the way it worked behind the scenes? What about it would make for a compelling extension into the rest of our vast marketplace? We took steps to improve how we parse seller-provided data, and we used this data with Apple’s AR technology to make it easy for Etsy users to understand the size and scale of an object they might want to buy. We decided we could make tape measures obsolete (or at least not quite as essential) for our home-decor shoppers by building an AR tool to let them visualize–conveniently, accurately, and with minimal effort–how an item would fit in their space. Improving dimension parsing In our original post on the wall art experience, we mentioned the complexity involved in doing things like inferring an item's dimensions from text in its description. Etsy allows sellers to add data about dimensions in a structured way when they create a listing, but that wasn't always the case, and some sellers still provide those details in places like the description or the item's title. The solution was to create a regex-based parser in the iOS App that would glean dimensions (width and height) by sifting through a small number of free-form fields–title, description, customization information, overview–looking for specific patterns. We were satisfied being able to catch most of the formats in which our sellers reported dimensions, handling variable positions of values and units (3 in x 5 in vs 3 x 5 in), different long and short names of units, special unit characters (‘, “), and so on, in all the different languages that Etsy supports. Migrating our parsing functionality to the API backend was a first step towards making the AR measuring tool platform-independent, so we could bring it to our Android App as well. It would also be a help in development, since we could iterate improvements to our regex patterns faster outside the app release schedule. And we’d get more consistent dimensions because we'd be able to cache the results instead of having to parse them live on the client at each visit. We knew that an extended AR experience would need to reliably show our users size options for items that had them, so we prioritized the effort to parse out dimensions from variations in listings. We sanitized free-form text input fields that might contain dimensions—inputs like title or description—so that we could catch a wider range of formats. (Several different characters can be used to write quotation marks, used as shorthand for inches and feet, and we needed to handle special characters for new lines, fraction ligatures like ½ or ¼, etc.) 
Our regex pattern was split and updated so it could detect: Measurement units in plural forms (inches, feet, etc.); Incorrect spellings (e.g. "foots"); Localization of measurement units in the languages spoken by Etsy’s users ("meters", "metros", and "mètres" in English, Spanish, and French, respectively); Other formats in which dimensions are captured by sellers like dimensions with unit conversions in parentheses (e.g. 12 in x 12 in (30 cm x 30 cm)) or with complex measurements in the imperial system (3’6”). Making our dimension parsing more robust and bringing it server-side had several ancillary benefits. We were able to maintain the functionality of our iOS app while removing a lot of client-side code, even in Etsy’s App Clip, where size is a matter of utmost importance. And though regex processing isn’t that processor-intensive, the view feature performed better once we implemented server-side caching of parsed dimensions. We figured we could even take the parsing offline (rather than parsing every listing on every visit) by running a backfill process to store dimensions in our database and deliver them to the App along with item details. We found, thanks to decoupling our parser work from the App release cycle, that we were able to test hypotheses faster and iterate at a quicker pace. So we could proceed to some improvements that would have been quite resource-intensive if we had tried to implement them on the native app side. Sellers often provide dimensions in inconsistent units, for instance, or they might add the same data multiple times in different fields, when there are variations in properties like material or color. We worked out ways to de-duplicate this data during parsing, to minimize the number of size options we show users. (Though where we find dimensions that are specifically associated with variations, we make sure to retain them, since those will more directly correlate with offering prices.) And we made it possible to prioritize structured dimension data, where sellers have captured it in dedicated fields, as a more reliable source of truth than free-form parsing. Measuring in 3D The box With this new and improved dimension data coming to us from the server, we had to figure out the right way to present it in 3D in iOS. The display needed to be intuitive, so our users would know more or less at a glance what the tool was and how to interact with it. Ultimately, we decided to present a rectangular prism-type object scaled to the parsed dimensions we have for a given listing. Apple's SceneKit framework – specifically its SCNBox class – is what creates this box, which of course we style with the Etsy Orange look. So that users understand the box's purpose, we make sure to display the length on each side. We use SceneKit's SCNNode class to create the pills displaying our measurements. Users drag or tap the measuring box to move it around, and it can rotate on all axes – all made possible by having a different animation for each type of rotation using SCNActions. Rotation is a must-have feature: when we place the measuring box in a user's space, we may not always be able to get the orientation correct. We might, as in the illustration below, place a side table vertically on the floor instead of horizontally. Our users would have a poor experience of the measuring tool if they couldn't adjust for that. 
(Note that you may see some blinking yellow dots when you try out the AR experience: these are called feature points, and they're useful for debugging, to give us an idea of what surfaces are successfully being detected.) Environment occlusion In addition to ensuring the box would be scaled correctly, we wanted it to "sit" as realistically as possible in the real world, so we built in scene occlusion. When a user places the measuring box in a room with other furniture, scene occlusion lets it interact with real-life objects as if the box were actually there. Users get valuable information this way about how an item will fit in their space. Will that end table go between the wall and couch? Will it be tall enough to be visible from behind the couch? (As demonstrated below, the table will indeed be tall enough.) Environment occlusion became a possibility with LiDAR, a method of determining depth using laser light. Although LiDAR has been around for a few decades, used to map everything from archeological sites to agricultural fields, Apple only included LiDAR scanners in iPhones and iPads beginning in 2020, with the 4th-generation iPad Pro and the iPhone 12 Pro. An iPhone’s LiDAR scanner retrieves depth information from the area it scans and converts it into a series of vertices which connect to form a mesh (or a surface). To add occlusion to our SpriteKit-backed AR feature, we convert the mesh into a 3D object and place it (invisibly to the user) in the environment shown on their phone. As the LiDAR scanner measures more of the environment, we have more meshes to convert into objects and place in 3D. The video below shows an AR session where for debugging purposes we assign a random color to the detected mesh objects. Each different colored outline shown over a real-world object represents a different mesh. Notice how, as we scan more of the room, the device adds more mesh objects as it continues drawing out the environment. The user's device uses these mesh objects to know when and how to occlude the measuring box. Essentially, these mesh objects help determine where the measuring box is relative to all the real-world items and surfaces it should respect. Taking advantage of occlusion gives our users an especially realistic AR experience. In the side-by-side comparison below, the video on the left shows how mesh objects found in the environment determine what part of the measuring box will be hidden as the camera moves in front of the desk. The video on the right shows the exact same thing, but with the mesh objects hidden. Mesh objects are visible Mesh objects are hidden Closing thoughts This project took a 2D concept, our Wall View experience, and literally extended it into 3-dimensional space using Apple’s newest AR tools. And though the preparatory work we did improving our dimension parser may not be anything to look at, without the consistency and accuracy of that parsed information this newly realistic and interactive tool would not have been possible. Nearly a million Etsy items now have real-size AR functionality added to them, viewed by tens of thousands of Etsy users every week. As our marketplace evolves and devices become more powerful, working on features like this only increases our appetite for more and brings us closer to providing our users with a marketplace that lets them make the most informed decision about their purchases effortlessly. Special shoutout to Jacob Van Order and Siri McClean as well as the rest of our team for their work on this.
Introduction Each year, Etsy hosts an event known as “CodeMosaic” - an internal hackathon in which Etsy admin propose and build bold advances quickly in our technology across a number of different themes. People across Etsy source ideas, organize into teams, and then have 2-3 days to build innovative proofs-of-concept that might deliver big wins for Etsy’s buyers and sellers, or improve internal engineering systems and workflows. Besides being a ton of fun, CodeMosaic is a time for engineers to pilot novel ideas. Our team’s project this year was extremely ambitious - we wanted to build a system for stateful machine learning (ML) model training and online machine learning. While our ML pipelines are no stranger to streaming data, we currently don’t have any models that learn in an online context - that is, that can have their weights updated in near-real time. Stateful training updates an already-trained ML model artifact incrementally, sparing the cost of retraining models from scratch. Online learning updates model weights in production rather than via batch processes. Combined, the two approaches can be extremely powerful. A study conducted by Grubhub in 2021 reported that a shift to stateful online learning saw up to a 45x reduction in costs with a 20% increase in metrics, and I’m all about saving money to make money. Day 1 - Planning Of course, building such a complex system would be no easy task. The ML pipelines we use to generate training data from user actions require a number of offline, scheduled batch jobs. As a result it takes quite a while, 40 hours at a minimum, for user actions to be reflected in a model’s weights. To make this project a success over the course of three days, we needed to scope our work tightly across three streams: Real-time training data - the task here was to circumvent the batch jobs responsible for our current training data and get attributions (user actions) right from the source. A service to consume the data stream and learn incrementally - today, we heavily leverage TensorFlow for model training. We needed to be able to load a model's weights into memory, read data from a stream, update that model, and incrementally push it out to be served online. Evaluation - we'd have to make a case for our approach by validating its performance benefits over our current batch processes. No matter how much we limited the scope it wasn't going to be easy, but we broke into three subteams reflecting each track of work and began moving towards implementation. Day 2 - Implementation The real-time training data team began by looking far upstream of the batch jobs that compute training data - at Etsy’s Beacon Main Kafka stream, which contains bot-filtered events. By using Kafka SQL and some real-time calls to our streaming feature platform, Rivulet, we figured we could put together a realistic approach to solving this part of the problem. Of course, as with all hackathon ideas it was easier said than done. Much of our feature data uses the binary avro data format for serialization, and finding the proper schema for deserializing and joining this data was troublesome. The team spent most of the second day munging the data in an attempt to join all the proper sources across platforms. And though we weren't able to write the output to a new topic, the team actually did manage to join multiple data sources in a way that generated real-time training data! 
Meanwhile the team focusing on building the consumer service to actually learn from the model faced a different kind of challenge: decision making. What type of model were we going to use? Knowing we weren’t going to be able to use the actual training data stream yet - how would we mock it? Where and how often should we push new model artifacts out? After significant discussion, we decided to try using an Ad Ranking model as we had an Ads ML engineer in our group and the Ads models take a long time to train - meaning we could squeeze a lot of benefit out of them by implementing continuous training. The engineers in the group began to structure code that pulled an older Ads model into memory and made incremental updates to the weights to satisfy the second requirement. That meant that all we had left to handle was the most challenging task - evaluation. None of this architecture would mean anything if a model that was trained online performed worse than the model retrained daily in batch. Evaluating a model with more training training periods is also more difficult, as each period we’d need to run the model on some held-out data in order to get an accurate reading without data leakage. Instead of performing an extremely laborious and time-intensive evaluation for continuous training like the one outlined above, we chose to have a bit more fun with it. After all, it was a hackathon! What if we made it a competition? Pick a single high-performing Etsy ad and see which surfaced it first, our continuously trained model or the boring old batch-trained one? We figured if we could get a continuously trained model to recommend a high-performing ad sooner, we’d have done the job! So we set about searching for a high-performing Etsy ad and training data that would allow us to validate our work. Of course, by the time we were even deciding on an appropriate advertised listing, it was the end of day two, and it was pretty clear the idea wasn’t going to play out before it was time for presentations. But still a fun thought, right? Presentation takeaways and impact Day 3 gives you a small window for tidying up work and slides, followed by team presentations. At this point, we loosely had these three things: Training data from much earlier in our batch processing pipelines A Kafka consumer that could almost update a TensorFlow model incrementally A few click attributions and data for a specific listing In the hackathon spirit, we phoned it in and pivoted towards focusing on the theoretical of what we’d been able to achieve! The 1st important potential area of impact was cost savings. We estimated that removing the daily “cold-start” training and replacing it with continuous training would save about $212K annually in Google Cloud costs for the 4 models in ads alone. This is a huge potential win - especially when coupled with the likely metrics gains coming from more reactive models. After all, if we were able to get events to models 40 hours earlier, who knows how much better our ranking could get! Future directions and conclusion Like many hackathon projects, there's no shortage of hurdles getting this work into a production state. Aside from the infrastructure required to actually architect a continuous-training pipeline, we’d need a significant number of high-quality checks and balances to ensure that updating models in real-time didn’t lead to sudden degradations in performance. 
The amount of development, number of parties involved, and the breadth of expertise to get this into production would surely be extensive. However, as ML continues to mature, we should be able to enable more complex architectures with less overhead.
Introduction Personalization is vital to connect our unique marketplace to the right buyer at the right time. Etsy has recently introduced a novel, general approach to personalizing ML models based on encoding and learning from short-term (one-hour) sequences of user actions through a reusable three-component deep learning module, the adSformer Diversifiable Personalization Module (ADPM). We describe in detail our method in our recent paper, with an emphasis on personalizing the CTR (clickthrough rate) and PCCVR (post-click conversion rate) ranking models we use in Etsy Ads. Here, we'd like to present a brief overview. Etsy offers its sellers the opportunity to place sponsored listings as a supplement to the organic results returned by Etsy search. For sellers and buyers alike, it’s important that those sponsored listings be as relevant to the user’s intent as possible. As Figure 1 suggests, when it comes to search, a “jacket” isn't always just any jacket: Figure 1: Ad results for the query jacket for a user who has recently interacted with mens leather jackets. In the top row, the results without personalized ranking; in the bottom row, the results with session personalization. For ads to be relevant, they need to be personalized. If we define a “session” as a one-hour shopping window, and make a histogram of the total number of listings viewed across a sample of sessions (Fig. 2), we see that a power law distribution emerges. The vast majority of users interact with only a small number of listings before leaving their sessions. Figure 2: A histogram of listing views in a user session. Most users see fewer than ten listings in a one-hour shopping window. Understood simply in terms of listing views, it might seem that session personalization would be an insurmountable challenge. To overcome this challenge we leverage a rich stream of user actions surrounding those views and communicating intent, for example: search queries, item favorites, views, add-to-carts, and purchases. Our rankers can optimize the shopping experience in the moment by utilizing streaming features being made available within seconds of these user actions. Consider a hypothetical sequence of lamps viewed by a buyer within the last hour. Figure 3: An example of a user session with the sequence of items viewed over time. 70s orange lamp ---> retro table lamp --> vintage mushroom lamp Not only is the buyer looking within a particular set of lamps (orange, mushroom-shaped), but they arrived at these lamps through a sequence of query refinements. The search content itself contains information about the visual and textual similarities between the listings, and the order in which the queries occur adds another dimension of information. The content and the sequence of events can be used together to infer what is driving the user’s current interest in lamps. adSformer Diversifiable Personalization Module The adSformer Diversifiable Personalization Module (ADPM), illustrated on the left hand side of Figure 4, is Etsy's solution for using temporal and content signals for session personalization. A dynamic representation of the user is generated from a sequence of the user's most recent streamed actions. The input sequence contains item IDs, queries issued and categories viewed. We consider the item IDs, queries, and categories as “entities” that have recent interactions within the session. 
Figure 4: On the left, a stack representing the ADPM architecture. On the right, an expanded illustration of the adSformer encoder component.

Through ablation studies we found that the ADPM's three components work together symbiotically, outperforming any single component used on its own. Furthermore, in deployed applications, the diversity of learned signals improves robustness to input distribution shifts. It also leads to more relevant personalized results, because we understand the user from multiple perspectives. Here is how the three components operate:

Component One: The adSformer Encoder

The adSformer encoder component uses one or more custom adSformer blocks, illustrated in the right panel of Figure 4. This component learns a deep, expressive representation of the one-hour input sequence. The adSformer block modifies the standard transformer block from the attention literature by adding a final global max-pooling layer. The pooling layer downsamples the block's outputs by extracting the most salient signals from the sequence representation instead of outputting the fully concatenated standard transformer output. Formally, for a user's one-hour sequence S of viewed item IDs, the adSformer encoder is defined as the output of a stack of layers g(x), where each layer's input x is the output of the layer before it and o1 is the component's final output; the first layer is an embedding of item ID and position.

Component Two: Pretrained Representations

Component two employs pretrained embeddings of the item IDs the user has interacted with, combined with average pooling, to encode the one-hour sequence of user actions. Depending on downstream performance and availability, we choose from multimodal (AIR) representations and visual representations. Thus component two encodes rich image, text, and multimodal signals from all the items in the sequence. The advantage of leveraging pretrained item embeddings is that these rich representations are learned efficiently offline using complex deep learning architectures that would not be feasible to run online in real time. Formally, for a one-hour sequence of m_1hr item IDs with pretrained d-dimensional embedding vectors e_i, we compute the sequence representation o2 as the average of those embeddings, o2 = (1 / m_1hr) Σ e_i.

Component Three: Representations Learned "On the Fly"

The third component of the ADPM introduces representations learned for each sequence from scratch, in its own vector space, as part of the downstream models. This component learns lightweight representations for the many sequences for which no pretrained representations are available, for example sequences of favorited shop IDs. Formally, for z one-hour sequences of entities acted upon, we learn embeddings for each entity and each sequence, which are pooled to obtain the component's output o3.

The intermediary outputs of the three components are concatenated to form the final ADPM output, the dynamic user representation u. This user representation is then concatenated to the input vector of the various rankers or recommenders we want to personalize in real time. Formally, for one-hour, variable-length sequences of user actions and component outputs o1, o2, and o3, the dynamic user representation is u = [o1; o2; o3]. From a software perspective, the module is implemented as a TensorFlow Keras module that downstream models can employ through a simple import statement.
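To make the three-component structure concrete, here is a minimal sketch of the module written as a Keras layer: an item-plus-position embedding feeding a transformer block with global max pooling, average pooling over pretrained item embeddings, and an embedding learned on the fly for a sequence such as favorited shop IDs. Every dimension, vocabulary size, and name here is an illustrative assumption, not Etsy's production implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers


class ADPMSketch(layers.Layer):
    """Illustrative three-component module; sizes and names are assumptions."""

    def __init__(self, item_vocab=100_000, shop_vocab=50_000,
                 d_model=64, max_len=50, num_heads=2, **kwargs):
        super().__init__(**kwargs)
        # Component one: item + position embeddings feeding a transformer block
        # whose output is globally max-pooled (the "adSformer block").
        self.item_emb = layers.Embedding(item_vocab, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            layers.Dense(4 * d_model, activation="relu"),
            layers.Dense(d_model),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.max_pool = layers.GlobalMaxPooling1D()
        # Component three: a lightweight embedding learned from scratch for a
        # sequence with no pretrained representation (e.g. favorited shop IDs).
        self.shop_emb = layers.Embedding(shop_vocab, d_model)
        self.avg_pool = layers.GlobalAveragePooling1D()

    def call(self, item_ids, pretrained_item_seq, shop_ids):
        # Component one: adSformer encoder over the one-hour item-ID sequence.
        positions = tf.range(tf.shape(item_ids)[1])
        x = self.item_emb(item_ids) + self.pos_emb(positions)
        x = self.norm1(x + self.attn(x, x))
        x = self.norm2(x + self.ffn(x))
        o1 = self.max_pool(x)
        # Component two: average-pool the pretrained item representations
        # (e.g. AIR or visual embeddings), passed in as a [batch, seq, d] tensor.
        o2 = self.avg_pool(pretrained_item_seq)
        # Component three: average-pool the embeddings learned on the fly.
        o3 = self.avg_pool(self.shop_emb(shop_ids))
        # The dynamic user representation u is the concatenation of all three.
        return tf.concat([o1, o2, o3], axis=-1)
```

A fuller version would give each entity/action pair described above its own encoder of the appropriate type; the pretrained inputs would come from the representation-learning work described next.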
Pretrained Representation Learning

The second component of the ADPM uses pretrained representations. We rely on several: image embeddings, text embeddings, and multimodal item representations.

Visual Representations

In Etsy Ads, we employ image signals across a variety of tasks: visually similar candidate generation, search by image, inputs for learning other pretrained representations, and the ADPM's second component. To effectively leverage the rich signal encoded in Etsy Ads images, we train image embeddings in a multitask classification learning paradigm. By using multiple classification heads, such as taxonomy, color, and material, our representations are able to capture more diverse information about the image.

So far we have derived great benefit from our multitask visual embeddings, trained using a lightweight EfficientNetB0 architecture with ImageNet-pretrained weights as the backbone. We replaced the final layer with a 256-dimensional convolutional block, which becomes the output embedding. We apply random rotation, translation, zoom, and a color-contrast transformation to augment the image dataset during training. We are currently updating the backbone architectures to efficient vision transformers to further improve the quality of the image representations and the benefits they bring to downstream applications, including the ADPM.

Ads Information Retrieval Representations

Ads Information Retrieval (AIR) item representations encode an item ID through a metric learning approach, which aims to learn a distance function or similarity metric between two items. Standard approaches to metric learning include siamese networks, contrastive loss, and triplet loss; however, we found more interpretable results using a sampled in-batch softmax loss. This method treats each batch as a classification problem that pairs the co-clicked items in the batch. A pseudo-two-tower architecture encodes source items and candidate items in separate towers that share all trainable weights. Each item tower captures and encodes information about an item's title, image, primary color, attributes, category, and so on. This diversity of information is key to our personalization outcomes: by leveraging a variety of data sources, the system can identify patterns and insights that a more limited set of inputs would miss.

ADPM-Personalized Sponsored Search

The ADPM's effectiveness and generality are demonstrated in the way we use it to personalize the CTR prediction model in Etsy Ads' sponsored search. The ADPM encodes reverse-chronological sequences of recent user actions (in the sliding one-hour window we've discussed), anywhere on Etsy, for both logged-in and logged-out users. We concatenate the ADPM's output, the dynamic user representation, to the rest of the wide input vector in the CTR model. To fully leverage this even wider input vector, a deep and cross network (DCN) interaction module is included in the overall CTR architecture; if we remove the DCN module, the CTR model's ROC-AUC drops by 1.17%. The architecture of the ADPM-personalized CTR prediction model employed by Etsy Ads in sponsored search is given in Figure 5. (We also employ the ADPM to personalize the PCCVR model with a similar architecture, which naturally led to ensembling the two models in a multitask architecture, a topic beyond the scope of this blog post.)

Figure 5: An example of how the ADPM is used in a downstream ranking model.
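As a rough illustration of that combination step, the sketch below concatenates a hypothetical dynamic user representation with the ranker's other inputs and passes the result through a couple of DCN-style cross layers followed by a small deep tower. The dimensions, layer sizes, and input names are assumptions for illustration, not the production CTR model.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_ctr_head(wide_dim=512, user_rep_dim=192, num_cross=2):
    """Illustrative CTR head: the ADPM output u widens the input before a deep & cross stack."""
    wide = tf.keras.Input(shape=(wide_dim,), name="wide_features")
    u = tf.keras.Input(shape=(user_rep_dim,), name="adpm_user_representation")

    # Concatenating u to the rest of the input vector widens it, as described above.
    x0 = layers.Concatenate()([wide, u])
    dim = wide_dim + user_rep_dim

    # DCN-style cross layers: x <- x0 * (W x + b) + x, learning explicit feature crosses.
    x = x0
    for _ in range(num_cross):
        x = x0 * layers.Dense(dim)(x) + x

    # A small deep tower on top of the crossed features, ending in a pCTR estimate.
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    pctr = layers.Dense(1, activation="sigmoid", name="pctr")(x)
    return tf.keras.Model(inputs=[wide, u], outputs=pctr)
```

The production model naturally involves many more feature groups; the relevant pattern here is widening the input with u and then letting the cross layers interact with it explicitly.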
The ADPM-personalized CTR and PCCVR models outperformed the non-personalized CTR and PCCVR production baselines by +2.66% and +2.42%, respectively, in offline Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following robust online gains in A/B tests, we deployed the ADPM-personalized sponsored search system to 100% of traffic.

Conclusion

The adSformer Diversifiable Personalization Module (ADPM) is a scalable, general approach to model personalization from short-term sequences of recent user actions. Its use in sponsored search to personalize our ranking and bidding models is a milestone for Etsy Ads, and it is delivering greater relevance in sponsored placements for the millions of buyers and sellers that Etsy's marketplace brings together. If you would like more details about the ADPM, please see our paper.