Every year, Apple holds its Worldwide Developers Conference, which this year gathered 6000 attendees from more than 70 countries. Since I am attending, I am using this blog to publish my completely disorganized and not-for-public-consumption notes.

Create ML for Object Detection and Sound

This is a tutorial about the Create ML app.

Image classification: a model that provides a simple description of the elements of an image and aims to determine, based on this description, to which category the image belongs.

Object detection: analyzes an image to provide information such as label, size, and coordinates of objects in an image. Training data for object detectors consists of annotated images including bounding boxes for the objects of interest together with labels. The training data is organized in terms of images and a JSON file describing the data. For dice, for example, there are six different classes of data, pertaining to the six sides of a die. If we want to exclude certain confounding objects, we need to take pictures that include the objects of interest as well as the objects that we want to exclude and only annotate the former.
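For reference, the annotations JSON file has roughly this shape (the file name, label, and numbers below are made up; x and y are the center of the bounding box, in pixels):

[
  {
    "image": "dice_001.jpg",
    "annotations": [
      { "label": "three", "coordinates": { "x": 160, "y": 120, "width": 80, "height": 80 } }
    ]
  }
]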

Some principles to be followed: (i) balanced number of images in each class; (ii) at least 30 images per class; (iii) multiple angles; (iv) variations of background; (v) different lighting conditions; (vi) other objects in images.

After the model is built, its functionality can be used in an app through either Vision or Core ML.
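A minimal sketch of the Vision route, assuming the compiled model file and the function name below (both made up); the completion handler receives the detected objects with their labels and bounding boxes:

import CoreGraphics
import CoreML
import Vision

// Run a Create ML-trained object detector on a CGImage via Vision and print
// the label, confidence, and bounding box of each detection.
func detectObjects(in image: CGImage, modelURL: URL) throws {
    let coreMLModel = try MLModel(contentsOf: modelURL)   // e.g., a compiled DiceDetector.mlmodelc (hypothetical)
    let visionModel = try VNCoreMLModel(for: coreMLModel)

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Object detectors produce VNRecognizedObjectObservation results.
        let observations = request.results as? [VNRecognizedObjectObservation] ?? []
        for observation in observations {
            let best = observation.labels.first
            print(best?.identifier ?? "?", best?.confidence ?? 0, observation.boundingBox)
        }
    }

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
}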

Sound classification: the task of taking sound and putting it into one of many categories. An application: creating a baby monitor. To determine ground truth, sound files can be put into different directories corresponding to the classifier’s labels.

Considerations for audio files: (i) data should match real-world scenarios; (ii) microphones matter; (iii) model architecture matters: for example, what works for sound classification would not be appropriate for speech recognition.

New framework for analyzing sound: Sound Analysis.
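A sketch of what using it can look like, assuming a sound classifier model trained with Create ML (the observer and function names below are made up):

import CoreML
import SoundAnalysis

// Receives classification results as the analyzer produces them.
final class ResultsObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else { return }
        print("\(top.identifier): \(top.confidence)")
    }
}

// Classify the contents of an audio file with a Create ML sound classifier.
func classify(fileAt url: URL, with model: MLModel) throws {
    let analyzer = try SNAudioFileAnalyzer(url: url)
    let request = try SNClassifySoundRequest(mlModel: model)
    let observer = ResultsObserver()
    try analyzer.add(request, withObserver: observer)
    let finished = analyzer.analyze()   // synchronous for file analyzers
    print("Analysis finished:", finished)
}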

Understanding Images in Vision Framework

Saliency: the first thing one notices when looking at an image. Attention-based saliency models are trained on where humans looked first when looking at an image. Objectness-based saliency depends on the features of the image, such as contrast between subjects and backgrounds. For attention-based saliency, the faces are the most salient; for objectness-based, the entire body. Saliency detection can produce a heatmap that highlights salient parts of the image or bounding boxes for the salient parts.
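A sketch of requesting attention-based saliency for a CGImage (the function name is made up); the objectness-based variant just swaps in VNGenerateObjectnessBasedSaliencyImageRequest:

import CoreGraphics
import Vision

// Produce a saliency heat map plus bounding boxes for the salient regions of an image.
func attentionSaliency(for image: CGImage) throws {
    let request = VNGenerateAttentionBasedSaliencyImageRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    guard let observation = request.results?.first as? VNSaliencyImageObservation else { return }
    let heatMap = observation.pixelBuffer          // low-resolution saliency heat map
    let boxes = observation.salientObjects ?? []   // bounding boxes, in normalized coordinates
    print(heatMap, boxes.map { $0.boundingBox })
}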

Saliency can be combined with image classification technologies, for example, to determine what the salient objects are. Large-scale image classification is hard: it requires lots of annotated data and computing power/time.

The Taxonomy of the classifiers that exist within the Vision framework includes more than 1000 classes. It is a hierarchical classification that includes classes that are visually identifiable. It avoids professions because these are very hard to classify. The classification is multilabel and returns probabilities that each element of the Taxonomy is in the image. Determining whether an element is correctly classified depends on user-set thresholds for precision and recall.
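A sketch of querying the built-in classifier and filtering by such thresholds (the 0.01/0.9 values below are arbitrary, app-specific choices):

import CoreGraphics
import Vision

// Return the identifiers of taxonomy classes whose confidence clears a recall
// threshold at the chosen precision level.
func classify(image: CGImage) throws -> [String] {
    let request = VNClassifyImageRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    let observations = request.results as? [VNClassificationObservation] ?? []
    return observations
        .filter { $0.hasMinimumRecall(0.01, forPrecision: 0.9) }
        .map { $0.identifier }
}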

Image similarity: a method to describe two or more images and compare the similarities of those descriptions. Using just the pixels is not enough because altering basic properties of an image, such as its contrast, or inverting it, can lead to low similarity when it should be high. Descriptions can be based on word vectors. The downside in this case is that it may lead to images that produce the same words but are substantially different. This is not a problem if one is interested in semantic similarity instead of just visual similarity.
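A sketch of comparing two images via Vision’s feature prints, where a smaller distance means more similar (the function names are made up):

import CoreGraphics
import Vision

// Compute the feature print (the description) of a single image.
func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// Compare two images by the distance between their feature prints.
func distance(between a: CGImage, and b: CGImage) throws -> Float {
    guard let printA = try featurePrint(for: a),
          let printB = try featurePrint(for: b) else { return .infinity }
    var distance = Float(0)
    try printA.computeDistance(&distance, to: printB)
    return distance
}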

Face detection now uses 76 points to detect face landmarks. It used to be 65. The set of points that are used to detect face landmarks is called “constellation”. In particular, pupil detection is much better.

Basic template for using the Vision Framework: create a request object (e.g., VNDetectFaceLandmarksRequest), pass it to a VNImageRequestHandler (for a given image), perform the request, and collect the results (casting the result objects to the appropriate type).
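A sketch of that template, here with face landmarks:

import CoreGraphics
import Vision

func detectFaceLandmarks(in image: CGImage) throws {
    let request = VNDetectFaceLandmarksRequest()                      // 1. create a request
    let handler = VNImageRequestHandler(cgImage: image, options: [:]) // 2. create a handler for the image
    try handler.perform([request])                                    // 3. perform the request
    let faces = request.results as? [VNFaceObservation] ?? []         // 4. cast the results
    for face in faces {
        print(face.boundingBox, face.landmarks?.allPoints?.pointCount ?? 0)
    }
}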

Face capture quality: lighting, blur, occlusion, expression… All these factors are taken into account to produce a floating point number that corresponds to a score for the image. This is useful to determine which image is better, i.e., which one to choose among a set of similar images. Face capture quality should not be compared against a threshold. Instead, face capture qualities for different images of the same subject should be compared against each other, as a ranking score.
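A sketch of using face capture quality as a ranking score across a set of shots of the same subject (the function name is made up):

import CoreGraphics
import Vision

// Pick the image with the highest face capture quality; the scores are only
// meaningful relative to each other, not against a fixed threshold.
func bestImage(among images: [CGImage]) throws -> CGImage? {
    var best: (image: CGImage, score: Float)?
    for image in images {
        let request = VNDetectFaceCaptureQualityRequest()
        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
        let score = (request.results?.first as? VNFaceObservation)?.faceCaptureQuality ?? 0
        if score > (best?.score ?? -1) {
            best = (image, score)
        }
    }
    return best?.image
}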

New detectors in the Vision framework: human, cat, and dog detectors. They produce bounding boxes, amongst other features, for detected elements.
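A sketch of the animal detector (the human detector, VNDetectHumanRectanglesRequest, follows the same pattern):

import CoreGraphics
import Vision

// Detect cats and dogs, printing a label and a bounding box for each.
func detectAnimals(in image: CGImage) throws {
    let request = VNRecognizeAnimalsRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    let animals = request.results as? [VNRecognizedObjectObservation] ?? []
    for animal in animals {
        print(animal.labels.first?.identifier ?? "?", animal.boundingBox)
    }
}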

Object trackers are useful to keep track of objects in videos. To use this kind of feature, the template is similar but not identical. Basically, it is necessary to process the video frame by frame explicitly, for example, within a loop.
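A sketch of that loop, assuming an initial observation for the object to track and a sequence of frames (the function name is made up):

import CoreVideo
import Vision

// Track an object across frames, feeding the latest observation into the next request.
func track(initialObservation: VNDetectedObjectObservation, frames: [CVPixelBuffer]) throws {
    let sequenceHandler = VNSequenceRequestHandler()
    var observation = initialObservation
    for frame in frames {
        let request = VNTrackObjectRequest(detectedObjectObservation: observation)
        try sequenceHandler.perform([request], on: frame)
        if let tracked = request.results?.first as? VNDetectedObjectObservation {
            observation = tracked
            print(tracked.boundingBox)
        }
    }
}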

SwiftUI

Container views: as the name implies, they encompass other views. The closure parameter passed to container views, e.g., VStack, is annotated with @ViewBuilder. The closure is of type () -> Content. Many controls in SwiftUI are containers: for example, Button, Toggle, and Stepper are also containers.
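A sketch of a custom container that takes a @ViewBuilder closure, just like VStack does (“Card” is a made-up name):

import SwiftUI

// A container view: whatever the closure builds becomes this view's content.
struct Card<Content: View>: View {
    let content: Content

    init(@ViewBuilder content: () -> Content) {
        self.content = content()
    }

    var body: some View {
        content
            .padding()
            .background(Color.gray.opacity(0.2))
    }
}

struct CardExample: View {
    var body: some View {
        Card {
            Text("Title").font(.headline)
            Text("Details")
        }
    }
}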

A binding is a managed reference: $order.includeSalt is a binding. Basically, the value of order.includeSalt is modified based on what happens while the user interacts with a specific view element, for example, a toggle.
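A sketch of that binding in action (“Order” and includeSalt echo the example above; the rest is made up):

import SwiftUI

struct Order {
    var includeSalt = false
}

struct OrderView: View {
    @State private var order = Order()

    var body: some View {
        // The Toggle reads and writes order.includeSalt through the $order.includeSalt binding.
        Toggle(isOn: $order.includeSalt) {
            Text("Include salt")
        }
    }
}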

Text("ok").font(.title) => the font method is a modifier: it just creates a view based on another view. Each modifier adds another object to the view hierarchy which wraps the one received as a parameter. The example with padding and background is very illustrative: the fact the new views are added to the hierarchy explicitly creates a notion of order among the modifiers. Therefore, applying a background before padding will make the background not encompass the padding whereas inverting the order among them (with the background being added last) will extend the background to encompass the padding. In addition, this organization means that properties that are not relevant for a specific view (for example, it does not require opacity) do not need to be included in the corresponding object hierarchy. Thus, views become more lightweight and View can be a protocol, instead of a class.

TableView is now a List.

Text, Color, Spacer, Shape, Divider, Image are primitive views. They don’t have a body property.

A ForEach view just adds its own contents to its container.
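A small sketch of ForEach feeding rows to a List (the data is made up):

import SwiftUI

struct Fruit: Identifiable {
    let id = UUID()
    let name: String
}

struct FruitList: View {
    let fruits = [Fruit(name: "Apple"), Fruit(name: "Banana"), Fruit(name: "Cherry")]

    var body: some View {
        List {
            // ForEach adds one row per element to the enclosing List.
            ForEach(fruits) { fruit in
                Text(fruit.name)
            }
        }
    }
}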

A Form is similar to a VStack, but is meant for displaying heterogeneous controls, organizing them consistently across different presentations, and grouping them into sections.

Button has an action and a label. It is a container and the label can be an arbitrarily complex view.

Toggle takes a binding to a Bool: Binding<Bool>. It is also a container.

The .accessibility modifier, as the name suggests, adds accessibility information, such as text to be read by VoiceOver.

Picker has a selection (the currently selected option), a label, and a set of options provided by a closure. The style of picker (whether it is a pop-up or radio buttons) is defined by the .pickerStyle modifier.
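A sketch tying several of these pieces together in a Form (all names and values are made up):

import SwiftUI

struct OrderForm: View {
    @State private var size = "Medium"
    @State private var includeSalt = false
    private let sizes = ["Small", "Medium", "Large"]

    var body: some View {
        Form {
            Section(header: Text("Options")) {
                Picker("Size", selection: $size) {
                    ForEach(sizes, id: \.self) { option in
                        Text(option).tag(option)
                    }
                }

                Toggle("Include salt", isOn: $includeSalt)
            }

            Section {
                // Button is a container: its label can be an arbitrarily complex view.
                Button(action: placeOrder) {
                    Text("Place Order").bold()
                }
            }
        }
    }

    private func placeOrder() {
        print("Ordered a \(size) portion, salt: \(includeSalt)")
    }
}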

Adopting Swift Packages in Xcode

Part of the open source Swift toolchain. It is now integrated into Xcode. A package is a directory that contains a Package.swift file that defines a manifest for the package. It also includes Sources and Tests directories, with a subdirectory of Sources for each of the buildable targets of the package.
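A minimal Package.swift manifest looks roughly like this (“MyLibrary” is a made-up name):

// swift-tools-version:5.1
import PackageDescription

let package = Package(
    name: "MyLibrary",
    products: [
        .library(name: "MyLibrary", targets: ["MyLibrary"]),
    ],
    dependencies: [
        // External packages would be declared here.
    ],
    targets: [
        .target(name: "MyLibrary", dependencies: []),
        .testTarget(name: "MyLibraryTests", dependencies: ["MyLibrary"]),
    ]
)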

What’s new in Clang and LLVM

Xcode now has a new optimization level, -Oz, which aims to reduce code size more aggressively. Function outlining is a code optimization: basically, it performs the extract-method refactoring for duplicated code at the machine IR level. This optimization, based on Apple’s tests, reduces code size by up to 25%. The duplicate code does not stem from application-level duplicated code: function prologues and epilogues, which are inserted by the compiler, are good candidates for function outlining. Outlining modifies program control flow, since it introduces additional function calls. Moreover, it may make the program take longer to run (because of the extra function calls).

Options -O2 and -O3 prioritize speed; -Os and -Oz prioritize size. -Os is the default optimization level. The compiler offers additional options, such as PGO (profile-guided optimization), which collects profiling info about execution at runtime; this info can then be used to guide optimization. LTO waits until link time to optimize. It is possible to go to Build Phases > Compile Sources > Compiler Flags to configure which options will be used.

Static analyzer checks. The static analyzer not only finds bugs, but it can also show the sequence of steps that leads to the occurrence of the bug.