I agree, the discussion is biased toward supervised learning, and I’m not sure how it generalizes.
With regard to data… It seems to me that the selection of data and its correctness is not that different from selecting whom to interview and how to elicit true opinions from interviewees. With ML, we derive the specification from data (in an automated fashion), much as a requirements engineer would synthesize the specification from interviews (in a manual and less deterministic process).
Neither interviews nor data are a ground truth. They may not represent what one really wants of the system (or what all kinds of different stakeholders might want).
My pre-Dagstuhl view was to see the data as similar to code or a DSL that is processed by a compiler. If the model is wrong, we don’t blame the compiler or the binary, but the incorrect code used as input. Similarly, the model is derived (more or less deterministically) from the data (plus hyperparameters), hence we shouldn’t blame the model, but the data.

Still, this leaves the problem of what a bug really would be. Incorrectly labeled data is an easy case, but what about a biased sample — is that a bug? The challenge with finding bugs would be origin tracking: for a crashing binary we can often track back to the line of code that caused the crash, whereas a wrong prediction is influenced by many model parameters, which are in turn influenced by most input values, making origin tracking very hard (though some work exists along these lines).

None of this resolves my issue with the bug terminology: a specification would be needed, which we don’t have. I find the requirements engineering view more helpful for getting the concepts straight in my head than the compiler analogy.
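The compiler analogy can be made concrete with a toy sketch (entirely hypothetical, not from any real system): a deterministic learner acts as the “compiler”, turning a dataset into a model. Flipping one label — a data bug — changes the resulting model’s prediction, even though the training procedure itself is unchanged and blameless.

```python
def train_1nn(data):
    """Trivial 1-nearest-neighbour 'compiler': the learned model *is* the data."""
    def model(x):
        # Predict the label of the closest training point.
        nearest = min(data, key=lambda d: abs(d[0] - x))
        return nearest[1]
    return model

# Clean dataset vs. one with a single mislabeled point (the "data bug").
clean = [(1.0, "neg"), (2.0, "neg"), (8.0, "pos"), (9.0, "pos")]
buggy = [(1.0, "neg"), (2.0, "neg"), (8.0, "neg"), (9.0, "pos")]  # 8.0 mislabeled

clean_model = train_1nn(clean)
buggy_model = train_1nn(buggy)

print(clean_model(7.5))  # "pos" -- correct
print(buggy_model(7.5))  # "neg" -- wrong, traceable to the mislabeled datum
```

Here origin tracking is trivial (one neighbour determines the prediction); with millions of parameters, each shaped by most of the training set, the same back-tracing becomes the hard problem described above.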