
Lightweight Conjecture Records for Research Teamwork

Intended for Data Science and AI/ML research teams, but generally applicable.

Slug

conjecture-records-improve-research-teamwork

Context

Data Science is now a first-class citizen of the technical world, but that is a recent development and the field still lags behind hardware and software in terms of ecosystem maturity. One area that remains behind the curve is teamwork and the pursuit of large-scale objectives. From Fred Brooks to Michael Nygard, the challenge in software and system architecture has always been the same: how best to communicate the solution in your head. So, following in the footsteps of LADR (Lightweight Architecture Decision Record) files, and in the style of the original post...

Conjecture

We posit that keeping a collection of "domain significant" conjectures will improve research teamwork; these conjectures put forward experimental thinking that affects dimensionality, data characteristics, pre-processing options, calibrations, qualitative analysis, underlying factors, and more.

A conjecture record is a short text file (commonly in markdown format). Each record describes a step of thinking that puts forward a conjecture based on the forces involved in the research domain. Although each conjecture may be a single vector of thinking, many conjectures may tackle the same problem space, and conjectures may apply to different areas of the problem domain.

We will keep conjecture records in a project repository under doc/conj/<slug>.md (or .txt, .rst).

We should use a lightweight formatting language like markdown or reStructuredText.

Each conjecture record will be identified by a canonical slug that serves as its citation reference for other areas of the project. Conjectures from other projects can be referenced by a broader Clean URL (e.g. https://mycompany.com/otherproject/<slug>).

If a conjecture is disproved, we will mark it as disproven (but we will keep it around, because knowing that it has been disproven is itself useful). We may choose to move disproven conjectures to an archive folder to de-clutter the current conjecture space.
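
As an aside, this layout lends itself to simple tooling. The following is a minimal sketch, not part of the convention itself, that assumes the doc/conj/<slug>.md layout and the "## Status" heading used in the examples further down; it sweeps disproven records into a hypothetical archive sub-folder.

```python
from pathlib import Path

CONJ_DIR = Path("doc/conj")
ARCHIVE_DIR = CONJ_DIR / "archive"  # hypothetical archive location


def status_of(record: Path) -> str:
    """Return the value found under the '## Status' heading of a record."""
    lines = record.read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "## status" and i + 1 < len(lines):
            return lines[i + 1].strip().lower()
    return "unknown"


def archive_disproven() -> None:
    """Move records whose status is 'disproven' into the archive folder."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for record in CONJ_DIR.glob("*.md"):
        if status_of(record) == "disproven":
            record.rename(ARCHIVE_DIR / record.name)


if __name__ == "__main__":
    archive_disproven()
```

Whether or how to automate this is a team choice; the convention only asks that disproven records remain citable.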

We will use a format with just a few parts, so each document is easy to digest. The format shall be:
  • Title - a short readable title for the conjecture
  • Slug - the short form of the conjecture that serves as its citation string; research isn't linear, so we don't need to use a numerical sequence
  • Context - describe the problem space from which the conjecture emanates and the provenance of the idea behind it
  • Conjecture - the text that describes the conjecture in a complete but concise way. It is stated in full sentences, with active voice. "We posit that ..." (feel free to use wording you are comfortable with)
  • Status - proposed, explorable, refined, proven, disproven; typically a conjecture will be proposed by one member of the team before being reviewed and deemed explorable. The exploration will eventually lead to the conjecture being either proven or disproven, and commonly also refined into a more qualified form (referenced under Related)
  • Impact - describe the indicators that support and potentially refute the conjecture, and build up any evidential case from related work; as more is discovered through exploration of the conjecture, this section should be expanded to capture that knowledge. Where possible, experimental results should be provided as links.
  • Related - list the other conjectures that define a refined form of this conjecture, provide complementary support for it, or conflict with it
The whole document should be one or two pages long. We will write each conjecture as if it were a conversation with a future researcher. This requires a good writing style, with full sentences organized into paragraphs.

To make full use of the project repository's pull request and review functionality (such as that provided by github.com), we will have a long-lived branch called conjecture-changes that is regularly reviewed under a PR and merged into the master/main branch periodically. Researchers will commit new and edited conjecture files to this branch so that they go through a review round with the whole team before being accepted into the core conjecture space.
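
Because the records have a predictable shape, the review round can be backed by a small check. Below is a hypothetical sketch, assuming the headings and status values described above; it could be run by hand or in CI against the conjecture-changes branch before the PR is merged, but it is not part of the convention.

```python
import sys
from pathlib import Path

CONJ_DIR = Path("doc/conj")
REQUIRED_HEADINGS = ["## Slug", "## Context", "## Conjecture",
                     "## Status", "## Impact", "## Related"]
ALLOWED_STATUSES = {"proposed", "explorable", "refined", "proven", "disproven"}


def check(record: Path) -> list:
    """Return a list of problems found in a single conjecture record."""
    text = record.read_text(encoding="utf-8")
    problems = [f"missing '{h}' section" for h in REQUIRED_HEADINGS if h not in text]

    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "## Slug" and i + 1 < len(lines):
            if lines[i + 1].strip() != record.stem:
                problems.append("slug does not match the file name")
        if line.strip() == "## Status" and i + 1 < len(lines):
            if lines[i + 1].strip().lower() not in ALLOWED_STATUSES:
                problems.append(f"unexpected status '{lines[i + 1].strip()}'")
    return problems


if __name__ == "__main__":
    failed = False
    for record in sorted(CONJ_DIR.glob("*.md")):
        for problem in check(record):
            print(f"{record}: {problem}")
            failed = True
    sys.exit(1 if failed else 0)
```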

Status

explorable

Impact

So far we are seeing:
  • New researchers get a broad overview of the current research vectors that the team is involved in and often generate new proposed conjectures as they bring a fresh mind to problems.
  • Agile research is simpler to manage because it is clearer what each team member is working towards, and research sprints become shorter because the scope of each conjecture is well defined.
  • The process of starting with a conjecture helps researchers build explainability into their experiments (but we are still worried about confirmation bias).
  • Having the set of proven conjectures available encourages better data hygiene practices among junior researchers.
  • Writing down the conjectures in a project repository means that distributed researchers have more opportunity to contribute.
  • The review of impact changes to conjectures provides a good way to show progress to the team.
  • The concept of "proven" still needs more validation; here we have used the term in its broadest sense, but we still need to be careful about treating proven conjectures as theories and laws of our domain.

Related

None (if you have something then get in touch)

Conclusion

In keeping with the original LADR post, this post is itself laid out as a Conjecture Record.
While its status is marked as explorable, early indications are good enough for the technique to merit consideration for adoption.
We are going to keep using them until something better comes along.

Examples

The following are simple examples of conjecture records:

/doc/conj/rolling-average-filters-smooth-out-noise-in-feature-x.md

# Rolling Average Filters Smooth Out Noise in Feature X
## Slug
rolling-average-filters-smooth-out-noise-in-feature-x
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
## Conjecture
We posit that applying a rolling average of up to N values to feature X gives us X' which can be used for derivative calculations.
## Status
refined
## Impact
This holds true while the ground truth remains stable, and it produces a very accurate X'.
The conjecture breaks down when the ground truth of X is changing rapidly; as a result, X' fails to reflect the change for too many cycles.
## Related
Refined to rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
 
/doc/conj/rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m.md

# Rolling Average Filters Smooth Out Noise in Feature X While Y is Below M
## Slug
rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
Feature Y has been shown to have an impact on the validity of the rolling average calculation.
## Conjecture
We posit that applying a rolling average of up to N values to feature X while Y < M yields X'. When Y >= M the rolling average values are invalid and must be discarded. X' can be used for derivative calculations when valid.
## Status
proven
## Impact
The conjecture holds true in all cases so far tested.
## Related
Refined from rolling-average-filters-smooth-out-noise-in-feature-x

/doc/conj/kalman-filters-smooth-out-noise-in-feature-x.md

# Kalman Filters Smooth Out Noise in Feature X
## Slug
kalman-filters-smooth-out-noise-in-feature-x
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
Simple averaging filters do not respond well when the ground truth is changing rapidly.
## Conjecture
We posit that applying a Kalman filter to feature X with ABC configuration will yield an X' that closely tracks the ground truth.
## Status
explorable
## Impact
In Progress
## Related
See rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
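
For concreteness, here is a minimal sketch of the kind of transform the rolling-average records above describe. It is illustrative only: feature X, feature Y, N and M are the placeholders from the examples, and clearing the buffer whenever Y >= M is just one reading of "the rolling average values are invalid and must be discarded".

```python
import numpy as np


def smooth_x(x: np.ndarray, y: np.ndarray, n: int, m: float) -> np.ndarray:
    """Rolling average of up to n values of x, treated as valid only while y < m."""
    x_prime = np.full(len(x), np.nan)
    buffer = []
    for i in range(len(x)):
        if y[i] >= m:
            buffer.clear()   # Y >= M: discard the accumulated rolling-average values
            continue         # X' is left undefined (NaN) at this point
        buffer.append(float(x[i]))
        if len(buffer) > n:
            buffer.pop(0)    # keep at most the last n values
        x_prime[i] = sum(buffer) / len(buffer)
    return x_prime


# X' can then feed a derivative calculation wherever it is valid, e.g.:
# dx = np.gradient(smooth_x(x, y, n=5, m=0.8))
```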

Author: Hugh Reid, Infer Systems Ltd.
Thanks to Michael Nygard, Philippe Kruchten et al.
