> "One way we explored approaching this was using puppeteer to automate opening websites in a web browser, taking a screenshot of the site, and traversing the HTML to find the img tags.
> We then used the location of the images as the output data and the screenshot of the webpage as the input data. And now we have exactly what we need — a source image and coordinates of where all the sub-images are to train this AI model."
I don't quite understand this part. How does this lead to a model that can generate code from a UI?
If I'm understanding correctly, they are talking about how they are solving very specific problems with their models.
In this case, if you look two images up you will see an e-commerce image with many images composited into one image/layer. How will their system automatically decide whether all those should be separate images/layers or one composited image? To do so they trained a model that examines web pages, finds the <img> tags, and records their locations. Basically, they are assuming that their data reflects good decisions, so the model can learn in which cases people use multiple images vs. one.
They have a known system, puppeteer (Chromium), that can go from specified coordinates to rendered images, so they can run it on lots of websites to generate [coordinates, output image] pairs to use as training data. In general, if you have a transform and input data, you can use them to train a model that learns the reverse transform.
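The collection step they describe can be sketched roughly like this. This is a hedged illustration, not their actual pipeline: it assumes puppeteer is installed (`npm install puppeteer`), and the viewport size, the `collectPair` function name, and the normalized-coordinate output format are all my own choices for the example.

```javascript
// Convert absolute pixel boxes to viewport-relative coordinates, so the
// training target does not depend on the screenshot resolution.
function normalizeBoxes(boxes, viewportWidth, viewportHeight) {
  return boxes.map(({ x, y, width, height }) => ({
    x: x / viewportWidth,
    y: y / viewportHeight,
    width: width / viewportWidth,
    height: height / viewportHeight,
  }));
}

// Hypothetical collector: render one URL, return a (screenshot, boxes)
// training pair. Requires puppeteer (assumption: `npm install puppeteer`).
async function collectPair(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Input side of the pair: the rendered screenshot.
  const screenshot = await page.screenshot();

  // Output side: bounding boxes of every <img> element on the page.
  const boxes = await page.$$eval('img', (imgs) =>
    imgs.map((img) => {
      const r = img.getBoundingClientRect();
      return { x: r.x, y: r.y, width: r.width, height: r.height };
    })
  );

  await browser.close();
  return { input: screenshot, target: normalizeBoxes(boxes, 1280, 800) };
}
```

Run `collectPair` over a large list of URLs and you get exactly the dataset described: screenshots as inputs and image coordinates as outputs.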