
Lightweight Conjecture Records for Research Teamwork

Intended for Data Science and AI/ML research teams, but generally applicable.

Slug

conjecture-records-improve-research-teamwork

Context

Data Science is now a first-class citizen of the technical world, but that is a recent development and the field still lags behind hardware and software in terms of ecosystem maturity. One area that remains behind the curve is teamwork and the pursuit of large-scale objectives. From Fred Brooks to Michael Nygard, the challenge in software and system architecture has always been the same: how best to communicate the solution in your head. So, following in the footsteps of LADR (Lightweight Architecture Decision Record) files, and in the style of the original post...

Conjecture

We posit that keeping a collection of "domain significant" conjectures will improve research teamwork; these conjectures put forward experimental thinking that affects dimensionality, data characteristics, pre-processing options, calibrations, qualitative analysis, underlying factors, and more.

A conjecture record is a short text file (commonly in markdown format). Each record describes a step of thinking that puts forward a conjecture based on the forces involved in the research domain. Although each conjecture may be a single vector of thinking, many conjectures may tackle the same problem space, and conjectures may apply to different areas of the problem domain.

We will keep conjecture records in a project repository under doc/conj/<slug>.md (or .txt, .rst).

We should use a lightweight formatting language like markdown or reStructuredText.

Each conjecture record will be identified by a canonical slug that serves as its citation reference for other areas of the project. Conjectures from other projects can be referenced by a broader Clean URL (e.g. https://mycompany.com/otherproject/<slug>).

If a conjecture is disproved, we will mark it as disproven (but we will keep it around, because knowing that it has been disproven is itself useful). We may choose to move disproven conjectures to an archive folder to de-clutter the current conjecture space.
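
As an aside, this layout lends itself to simple tooling. The following is a minimal sketch, not part of the convention itself, that assumes the doc/conj/<slug>.md layout and the "## Status" heading used in the examples further down; it sweeps disproven records into a hypothetical archive sub-folder.

```python
from pathlib import Path

CONJ_DIR = Path("doc/conj")
ARCHIVE_DIR = CONJ_DIR / "archive"  # hypothetical archive location


def status_of(record: Path) -> str:
    """Return the value found under the '## Status' heading of a record."""
    lines = record.read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "## status" and i + 1 < len(lines):
            return lines[i + 1].strip().lower()
    return "unknown"


def archive_disproven() -> None:
    """Move records whose status is 'disproven' into the archive folder."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for record in CONJ_DIR.glob("*.md"):
        if status_of(record) == "disproven":
            record.rename(ARCHIVE_DIR / record.name)


if __name__ == "__main__":
    archive_disproven()
```

Whether or how to automate this is a team choice; the convention only asks that disproven records remain citable.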

We will use a format with just a few parts, so each document is easy to digest. The format shall be:
  • Title - a short readable title for the conjecture
  • Slug - the short form of the conjecture that serves as its citation string; research isn't linear, so we don't need to use a numerical sequence
  • Context - describe the problem space from which the conjecture emanates and the provenance of the idea behind it
  • Conjecture - the text that describes the conjecture in a complete but concise way. It is stated in full sentences, with active voice. "We posit that ..." (feel free to use wording you are comfortable with)
  • Status - proposed, explorable, refined, proven, disproven; typically a conjecture will be proposed by one member of the team before being reviewed and deemed explorable. The exploration will eventually lead to the conjecture being either proven or disproven, and commonly also refined into a more qualified form (referenced under Related)
  • Impact - describe the indicators that support and potentially refute the conjecture, and build up any evidential case from related work; as more is discovered through exploration of the conjecture, this section should be expanded to capture that knowledge. Where possible, experimental results should be provided as links.
  • Related - list the other conjectures that define a refined form of this conjecture, provide complementary support for it, or conflict with it
The whole document should be one or two pages long. We will write each conjecture as if it were a conversation with a future researcher. This requires a good writing style, with full sentences organized into paragraphs.

To make full use of the project repository's pull request and review functionality (such as that provided by github.com), we will have a long-lived branch called conjecture-changes that is regularly reviewed under a PR and merged into the master/main branch periodically. Researchers will commit new and edited conjecture files to this branch so that they go through a review round with the whole team before being accepted into the core conjecture space.
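
Because the records have a predictable shape, the review round can be backed by a small check. Below is a hypothetical sketch, assuming the headings and status values described above; it could be run by hand or in CI against the conjecture-changes branch before the PR is merged, but it is not part of the convention.

```python
import sys
from pathlib import Path

CONJ_DIR = Path("doc/conj")
REQUIRED_HEADINGS = ["## Slug", "## Context", "## Conjecture",
                     "## Status", "## Impact", "## Related"]
ALLOWED_STATUSES = {"proposed", "explorable", "refined", "proven", "disproven"}


def check(record: Path) -> list:
    """Return a list of problems found in a single conjecture record."""
    text = record.read_text(encoding="utf-8")
    problems = [f"missing '{h}' section" for h in REQUIRED_HEADINGS if h not in text]

    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "## Slug" and i + 1 < len(lines):
            if lines[i + 1].strip() != record.stem:
                problems.append("slug does not match the file name")
        if line.strip() == "## Status" and i + 1 < len(lines):
            if lines[i + 1].strip().lower() not in ALLOWED_STATUSES:
                problems.append(f"unexpected status '{lines[i + 1].strip()}'")
    return problems


if __name__ == "__main__":
    failed = False
    for record in sorted(CONJ_DIR.glob("*.md")):
        for problem in check(record):
            print(f"{record}: {problem}")
            failed = True
    sys.exit(1 if failed else 0)
```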

Status

explorable

Impact

So far we are seeing:
  • New researchers get a broad overview of the current research vectors that the team is involved in and often generate new proposed conjectures as they bring a fresh mind to problems.
  • Agile research is simpler to manage because it is clearer what each team member is working towards, and research sprints become shorter because the scope of each conjecture is well defined.
  • The process of starting with a conjecture helps researchers build explainability into their experiments (but we are still worried about confirmation bias).
  • Having the set of proven conjectures available encourages better data hygiene practices among junior researchers.
  • Writing down the conjectures in a project repository means that distributed researchers have more opportunity to contribute.
  • The review of impact changes to conjectures provides a good way to show progress to the team.
  • The concept of "proven" still needs more validation; here we have used the term in its broadest sense, but we still need to be careful about treating proven conjectures as theories and laws of our domain.

Related

None (if you have something then get in touch)

Conclusion

In keeping with the original LADR post, this post is itself laid out as a Conjecture Record.
While its status is marked as explorable, early indications are good enough for the technique to merit consideration for adoption.
We are going to keep using them until something better comes along.

Examples

The following are simple examples of conjecture records:

/doc/conj/rolling-average-filters-smooth-out-noise-in-feature-x.md

# Rolling Average Filters Smooth Out Noise in Feature X
## Slug
rolling-average-filters-smooth-out-noise-in-feature-x
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
## Conjecture
We posit that applying a rolling average of up to N values to feature X gives us X' which can be used for derivative calculations.
## Status
refined
## Impact
This holds true while the ground truth remains stable, and it produces a very accurate X'.
The conjecture breaks down when the ground truth of X is changing rapidly; as a result, X' fails to reflect the change for too many cycles.
## Related
Refined to rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
 
/doc/conj/rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m.md

# Rolling Average Filters Smooth Out Noise in Feature X While Y is Below M
## Slug
rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
Feature Y has been shown to have an impact on the validity of the rolling average calculation.
## Conjecture
We posit that applying a rolling average of up to N values to feature X while Y < M yields X'. When Y >= M the rolling average values are invalid and must be discarded. X' can be used for derivative calculations when valid.
## Status
proven
## Impact
The conjecture holds true in all cases so far tested.
## Related
Refined from rolling-average-filters-smooth-out-noise-in-feature-x

/doc/conj/kalman-filters-smooth-out-noise-in-feature-x.md

# Kalman Filters Smooth Out Noise in Feature X
## Slug
kalman-filters-smooth-out-noise-in-feature-x
## Context
Feature X is noisy because of the way sensor Z functions, and that noise is a problem for the derivative calculation.
Simple averaging filters do not respond well when the ground truth is changing rapidly.
## Conjecture
We posit that applying a Kalman filter to feature X with ABC configuration will yield an X' that closely tracks the ground truth.
## Status
explorable
## Impact
In Progress
## Related
See rolling-average-filters-smooth-out-noise-in-feature-x-while-y-is-below-m
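
For concreteness, here is a minimal sketch of the kind of transform the rolling-average records above describe. It is illustrative only: feature X, feature Y, N and M are the placeholders from the examples, and clearing the buffer whenever Y >= M is just one reading of "the rolling average values are invalid and must be discarded".

```python
import numpy as np


def smooth_x(x: np.ndarray, y: np.ndarray, n: int, m: float) -> np.ndarray:
    """Rolling average of up to n values of x, treated as valid only while y < m."""
    x_prime = np.full(len(x), np.nan)
    buffer = []
    for i in range(len(x)):
        if y[i] >= m:
            buffer.clear()   # Y >= M: discard the accumulated rolling-average values
            continue         # X' is left undefined (NaN) at this point
        buffer.append(float(x[i]))
        if len(buffer) > n:
            buffer.pop(0)    # keep at most the last n values
        x_prime[i] = sum(buffer) / len(buffer)
    return x_prime


# X' can then feed a derivative calculation wherever it is valid, e.g.:
# dx = np.gradient(smooth_x(x, y, n=5, m=0.8))
```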

Author: Hugh Reid, Infer Systems Ltd.
Thanks to Michael Nygard, Philippe Kruchten et al.
