In great app design, the UI is more than just a skin: it's the primary bridge between a user's needs and a digital solution. Great apps often begin as a collaborative spark, sometimes as nothing more than sketches on a whiteboard during a brainstorming session. However, the path from these initial creative bursts to a functional prototype is often where momentum dies. And for developers whose primary computing device is a smartphone, creating and revising an XML-based UI, even with an IDE that offers a visual layout editor, can be slow.
Code on the Go’s new Sketch to UI feature is designed to close that gap. Draw a rough layout on paper (yes, even on a napkin), photograph it with your phone, and then the app generates the XML layout that you can connect to your app. All of this happens entirely on the phone, without needing internet access or external APIs. Here’s how we did it.
Challenges we faced
The most obvious approach to recognizing elements in a sketch is to run the image through a general-purpose vision model or send it to a remote API, but we ruled both of those out right away.
First, Code on the Go is designed to work without requiring an internet connection, which means the entire recognition pipeline has to run locally on a single Android device, including mid-range and older hardware (such as 32-bit ARM devices). That rules out heavy computational models as well as models that require any kind of server round-trip.
The second constraint is signal quality. Hand-drawn sketches tend to be messy, with irregular strokes and lines that don't always cross in the same place, or that cross where they shouldn't. Handwritten letters vary from stroke to stroke, and some are easily confused with other figures and shapes. Standard optical character recognition (OCR) often performs poorly on imprecise handwriting, especially when the handwriting sits near drawings that represent the visual part of the UI.
The third constraint involves the metadata gap. A hand-drawn rectangle doesn’t always tell you whether it’s a button, a TextView, or an ImageView, and it may not have a clear reference ID, size, or color. For the system to generate useful XML, it would need to not only recognize shapes, but also understand what properties the developer may intend for each one.
Code on the Go’s “three-zone” approach
The solution to all three problems starts with structure. Instead of trying to interpret a sketch as a single unstructured image, Sketch to UI asks developers to organize their sketches into three zones:
- The canvas (center): where you draw the UI widgets themselves
- The left and right margins: where you write the metadata for each widget (type, properties, and values)
- Tags: small labels that link each widget drawing on the canvas to its corresponding metadata in the margins
The image below shows an example of a hand-drawn UI sketch containing the canvas area and two margin areas on the left and right.
This is similar to the way that engineers annotate technical drawings. The drawing communicates shape and position, and the annotations in the margin communicate specification. By separating these concerns spatially, the system can apply different recognition strategies to each zone rather than trying to do everything at once.
Finding the boundaries
The system’s first task is to identify where the canvas ends and the margins begin. That boundary can vary from sketch to sketch, depending on how much the creator draws and how much they write.
Sketch to UI relies on vertical projection analysis to find the boundaries. The algorithm scans the image column by column, looking for vertical bands of white space where the canvas and margin naturally separate. The boundary is set at the midpoint of the largest gap on each side.
Because camera distortion and paper edges tend to create false signals, the outer 5% of the image on each side is ignored to make boundary detection more reliable. Vertical Sobel filtering is applied to emphasize vertical lines while suppressing horizontal ruled lines, which would otherwise confuse the analysis. And if the handwriting sits too close to the drawings for a clean gap to appear, the algorithm adjusts its threshold rather than simply failing. The result is a boundary detector that adapts to each sketch instead of requiring the developer to draw within a fixed, predefined template.
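The boundary search can be sketched in a few lines of Python. This is an illustrative reconstruction, not the app's actual implementation: it assumes a binarized image (1 for ink, 0 for paper), omits the Sobel pre-filtering step, and shows only the left boundary (the right side is symmetric). The function name and thresholds are invented for the demo.

```python
# Sketch of the vertical-projection boundary search (illustrative).
# Assumes a binarized image: img[row][col] is 1 for ink, 0 for paper.

def find_left_boundary(img, margin_frac=0.05, blank_max_ink=0):
    """Return the column where the left margin ends and the canvas begins."""
    height, width = len(img), len(img[0])
    # Per-column ink counts: the vertical projection profile.
    profile = [sum(img[r][c] for r in range(height)) for c in range(width)]

    # Ignore the outer 5% on each side, where paper edges and camera
    # distortion tend to create false signals.
    lo, hi = int(width * margin_frac), int(width * (1 - margin_frac))

    # Find the widest run of blank columns in the left half of the image.
    best_start, best_len = lo, 0
    run_start, run_len = None, 0
    for c in range(lo, (lo + hi) // 2):
        if profile[c] <= blank_max_ink:
            if run_start is None:
                run_start = c
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start, run_len = None, 0

    # The boundary sits at the midpoint of the largest gap.
    return best_start + best_len // 2
```

In a production version, `blank_max_ink` would be raised adaptively when no clean gap is found, which is the threshold adjustment described above.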
Recognizing the widgets: YOLO on Android
Training the YOLOv8 model required solving a data problem first. Because different contributors draw widgets differently, existing public datasets for hand-drawn UI sketches use inconsistent visual conventions. As a result, training on those datasets produced a model that generalized poorly.
The solution was synthetic data generation. Rather than collecting and hand-annotating real sketches, the team built a code-driven pipeline that programmatically generates artificial sketches. Each widget type has its own generative script that produces varied but controlled examples of how that widget might be drawn. Because the same code that generates the images also emits the labels, every training image comes with “exact ground truth” (the verified and unambiguous correct answer) at essentially zero annotation cost.
This approach allowed rapid iteration. When the model struggled with a particular widget type, the team could adjust and retrain the corresponding generative script instead of collecting and annotating more real-world examples.
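To make the idea concrete, here is a toy version of one such generative script. Everything in it, the class id, the jitter ranges, and the plain-raster image format, is an assumption for illustration; the real pipeline presumably renders richer images. The key property it demonstrates is that the code that draws the widget also emits its exact YOLO-format label.

```python
# Illustrative per-widget generative script: the same code that draws a
# synthetic "button" also emits its exact YOLO label, so annotation is free.
import random

def make_button_sample(width=256, height=256, class_id=0, rng=None):
    rng = rng or random.Random()
    # Pick a plausible button box with randomized size and position.
    w = rng.randint(60, 160)
    h = rng.randint(24, 48)
    x = rng.randint(0, width - w)
    y = rng.randint(0, height - h)

    # "Draw" the outline onto a blank raster, jittering each edge pixel
    # slightly to imitate a wobbly hand-drawn stroke.
    img = [[0] * width for _ in range(height)]
    for c in range(x, x + w):
        for edge in (y, y + h - 1):
            r = min(max(edge + rng.randint(-2, 2), 0), height - 1)
            img[r][c] = 1
    for r in range(y, y + h):
        for edge in (x, x + w - 1):
            c = min(max(edge + rng.randint(-2, 2), 0), width - 1)
            img[r][c] = 1

    # YOLO label: class x_center y_center box_w box_h, all normalized to [0, 1].
    label = (f"{class_id} {(x + w / 2) / width:.6f} {(y + h / 2) / height:.6f} "
             f"{w / width:.6f} {h / height:.6f}")
    return img, label
```

When the model struggled with a widget type, tuning a script like this one (more jitter, new stroke styles) and regenerating is far cheaper than collecting and hand-annotating new real sketches.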
The current model performs well on common widgets like buttons and image placeholders, but more visually ambiguous widgets (like sliders, dropdown menus, and text entry boxes) remain more challenging. The primary active development focus is to improve precision and recall across those harder cases, with a target of above 90% accuracy.
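For readers less familiar with the metrics: precision is the fraction of predicted boxes that are correct, and recall is the fraction of real widgets that were actually found. A toy calculation with made-up counts:

```python
# Precision/recall from detection counts (numbers here are illustrative).
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # correct / predicted
    recall = true_pos / (true_pos + false_neg)     # correct / actual widgets
    return precision, recall

# e.g. 45 correct detections, 5 spurious boxes, 10 missed widgets:
p, r = precision_recall(45, 5, 10)  # p = 0.9, r ≈ 0.818
```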
Reading the margins: OCR and fuzzy matching
Once the widget positions are identified from the canvas, the system turns to the margins to extract their metadata (types, IDs, and property values).
OCR is applied only to the margin zones, not the full image, because running OCR on the entire image produced too much noise from the sketch content. By limiting OCR to the margins, where only text metadata should appear, we significantly improved accuracy. When text-like content is detected inside widget bounding boxes on the canvas, OCR is run on those specific regions rather than the full canvas area.
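The fuzzy-matching step that follows OCR can be sketched with Python's standard-library `difflib`; the vocabulary and similarity cutoff below are assumptions for illustration, and the real matcher may work differently:

```python
# Snap noisy OCR tokens from the margins to the nearest known widget name.
import difflib

WIDGET_VOCAB = ["Button", "TextView", "ImageView", "EditText",
                "CheckBox", "Switch", "SeekBar", "Spinner"]

def resolve_token(ocr_token, vocab=WIDGET_VOCAB, cutoff=0.6):
    """Map a possibly misread OCR token to the closest known name, or None."""
    matches = difflib.get_close_matches(ocr_token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Slightly misread tokens still resolve to valid widget types:
print(resolve_token("TextVew"))  # TextView
print(resolve_token("Buttn"))    # Button
```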
Recognized tokens are then fuzzy-matched against the vocabulary of known widget types and property names, so that a slightly misread word still resolves to, say, TextView in the output rather than an unrecognized token that breaks the generated XML.
What the output looks like
The pipeline produces valid Android XML layout code that can be built directly in Code on the Go. A sketch of a screen with a top app bar, a scrollable list, and a floating action button in the corner becomes a runnable scaffold in a few seconds, without typing a single line of layout code.
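As an illustration only (this exact markup is not taken from the app), the generated output for a simple sketch with a title, a list, and a button might resemble a standard Android layout like this:

```xml
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical">

    <TextView
        android:id="@+id/title"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:text="Title" />

    <ListView
        android:id="@+id/items"
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:layout_weight="1" />

    <Button
        android:id="@+id/submit"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_gravity="end"
        android:text="Submit" />
</LinearLayout>
```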
Why this matters for the global developer
Instead of typing <Button android:id="@+id/submit" ... /> by hand, the developer simply draws a rectangle, writes "B-1" next to it, and writes "B-1: submit" in the margin. Code on the Go handles the rest. This system supports rapid prototyping in environments where traditional tools simply won't run, creating a streamlined creative workflow regardless of developer constraints.
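To make the tag-to-metadata link concrete, here is a hypothetical parser for margin lines like "B-1: submit". The letter-to-type table and the fallback to a generic View are assumptions for illustration, not the app's documented format:

```python
# Hypothetical parser for margin annotations of the form "B-1: submit".
TYPE_LETTERS = {"B": "Button", "T": "TextView", "I": "ImageView"}

def parse_margin_line(line):
    """Parse 'B-1: submit' into (tag, widget_type, widget_id)."""
    tag, _, value = (part.strip() for part in line.partition(":"))
    letter = tag.split("-")[0]          # "B-1" -> "B"
    widget_type = TYPE_LETTERS.get(letter, "View")
    return tag, widget_type, value

print(parse_margin_line("B-1: submit"))
# ('B-1', 'Button', 'submit')
```

The returned tag is what links this metadata back to the matching "B-1" label drawn next to a rectangle on the canvas.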
What’s coming next
Our goal is to reach a precision score of over 90% for all standard Android widgets. We are also working on features that remove even more friction, such as:
- Direct image import: If the system detects an “image placeholder” in your sketch, you can tap the box on your screen to immediately select a photo from your phone’s gallery to fill it.
- Custom models: We want to allow power users to upload their own YOLO models into Code on the Go, enabling the community to extend the system to support even more specialized UI components.
The future of software development isn't just about bigger screens and more cloud power. It's about creating tools that work wherever the developer is, even if their "office" is just a piece of paper, a pen, and a phone.