The Importance of Documentation in Data Science
Why Document?
GitHub ran its Open Source Survey in 2017 to assess the common problems encountered when developing open-source software. One of its key findings was that the most common problem developers reported was “incomplete or confusing documentation.”
While documentation is often overlooked, it is critical in projects because documentation can tell us:
- what the project is for
- what it does
- how it works
- how to use it
- issues that were encountered
- insights gleaned
- how to contribute to it
- which datasets were used
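One way to make this list actionable is to treat it as a checklist for a project README. The sketch below is only an illustration: the section names are hypothetical labels derived from the list above, and it assumes the README uses Markdown-style headings that begin with `#`.

```python
# Hypothetical section names derived from the list above; rename to fit your project.
REQUIRED_SECTIONS = [
    "Purpose",
    "What it does",
    "How it works",
    "Usage",
    "Known issues",
    "Insights",
    "Contributing",
    "Datasets",
]


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required section names that do not appear as README headings.

    Assumption: a heading is any line that starts with '#'.
    Matching is case-insensitive and requires the full heading text to match.
    """
    headings = [
        line.lstrip("#").strip().lower()
        for line in readme_text.splitlines()
        if line.startswith("#")
    ]
    return [name for name in required if name.lower() not in headings]
```

Running `missing_sections` on the text of a README returns the sections that still need to be written, which could serve as a quick check before a project is handed over.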
Daniele Procida, a software developer and documentation manager at Divio, supports the sentiment above, saying,
“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.
Even if for some reason they have to use it because they have no choice, without good documentation, they won’t use it effectively or the way you’d like them to.”
Aside from informing users and collaborators, documentation allows for project reproducibility and can help us avoid redundancies in our work.
Let’s say we want to reproduce a past project, either to implement it as is or to modify it for a new use case. Now imagine that the prior project has no documentation. What would the process look like?
It would probably go like this: the new team asks who led the previous project. Then they ask what the datasets were, what they mean, and where to find the ETL, modeling, and deployment code. Then they may ask what issues were encountered, what insights were gleaned, and so on. Imagine the time spent just finding answers to those questions, all before any work on the new project has started.
If the code itself is not documented and shared, for example if it was never uploaded to a Git repository, we may find ourselves writing the same code over and over. Likewise, if ETL code is not documented and shared, we risk rebuilding the same ETL from scratch in every new project.
At other times, different people end up writing different code for the same purpose. Documenting a standard way of doing the task, such as a standard query for a commonly used table, helps us avoid this and makes workflows more efficient.
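As a rough illustration of what a documented, shared query could look like, here is a minimal sketch using Python’s built-in sqlite3 module. The `sales` table, its `sold_on` and `amount` columns, and the `monthly_sales` function are all hypothetical; the point is only that the docstring records what the query expects and returns, so teammates can reuse it instead of writing their own version.

```python
import sqlite3


def monthly_sales(conn, year):
    """Return total sales per month for one year.

    Hypothetical team-standard query for the `sales` table.

    Parameters
    ----------
    conn : sqlite3.Connection
        Open connection to a database with a `sales` table that has
        `sold_on` (text, 'YYYY-MM-DD') and `amount` (integer) columns.
    year : int
        Calendar year to summarize.

    Returns
    -------
    list of (month, total) tuples ordered by month, where month is a
    two-digit string such as '01'.
    """
    query = """
        SELECT strftime('%m', sold_on) AS month, SUM(amount) AS total
        FROM sales
        WHERE strftime('%Y', sold_on) = ?
        GROUP BY month
        ORDER BY month
    """
    # The year is passed as a bound parameter rather than formatted into the SQL.
    return conn.execute(query, (str(year),)).fetchall()
```

Anyone who needs monthly totals can call `monthly_sales(conn, 2023)` and read the docstring rather than rediscovering the table layout or writing a slightly different query.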
Having a written record of what happened before, during, and after the project can prevent the issues described above.
If one thinks about it, scientists are known to carry a notebook where they write down their findings, methods, and other things that may come to mind.
Here’s a photo of Marie Curie’s notebook. Careful records like these underpinned her work on isolating radium, which contributed to her winning the Nobel Prize.
Here’s an excerpt from one of Charles Darwin’s letters. It reads “But I am very poorly today and very stupid and hate everybody and everything.”
Scientists constantly document their work. In the case of Charles Darwin, even his frustrations. And just like these scientists, data scientists (there’s a “scientist” in the job title “data scientist” after all) should integrate documentation into their workflows.
Documentation is not as sexy as other parts of the data science workflow, like modeling or data analysis, but it has to become a core part of how we work. We may see documentation as a slight hurdle in the short term, but it can save us a lot of time and effort in the long term.