This is my second year at my current job. Instead of teaching people R and machine learning algorithms, I started building my own models and deploying them to production. However, I found it very hard to revisit my old projects: different versions of data, different versions of models, and tons of PowerPoint decks and related Excel sheets. If my boss asked me to use an old version of an old model, I would spend the rest of the day reproducing the model and generating a new report for him.
To be more specific? Hmm… the directory would look like this picture:
I realized that organizing code correctly and responsibly is so important! I read several papers and posts, like A Quick Guide to Organizing Computational Biology Projects and Cookiecutter Data Science. All these amazing materials shaped my daily project workflow and structure. This blog is more of a reading note and my personal understanding of how to organize code.
Directory structure
Roughly speaking, the structure should keep a record of the data-processing workflow, keep every iteration on record, and make it easy to trace back to old projects.
Here is my directory structure and some related notes:
The most important parts of my structure are data, src, bin, and results. Models and reports are added based on my job routine so that I can revisit my old model results quickly.
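As a sketch of the skeleton (the folder roles are summarized from the notes in this post; anything more specific is omitted):

```
project/
├── data/
│   ├── raw/        # original, immutable data
│   ├── interim/    # intermediate versions, kept in chronological order
│   └── processed/  # final, canonical data for modeling
├── src/            # source code
├── bin/            # scripts
├── results/        # experiment outputs and the log
├── models/         # saved model versions
└── reports/        # generated reports
```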
An example
To illustrate, I also drew an example. You can zoom in to see the details.
Note:
- Orange cells mark items that contain important information
- Red cells mark items that are immutable
- Dates should be formatted as <year>-<month>-<day>, so that they sort in chronological order.
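This works because the fields run from most significant to least, so an alphabetical sort of the names is also a chronological sort. A one-line Python sketch:

```python
from datetime import date

# e.g. "2019-01-15"; alphabetical order of such names is chronological order
folder_name = date.today().strftime("%Y-%m-%d")
```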
Some details
1. Data folder
- Raw: The original dataset, which is immutable.
- Interim: Intermediate data is saved in the “interim” folder. Since we do different rounds of data cleaning and create new features along the way, interim contains multiple versions of processed data, saved in chronological order.
- Processed: The final, canonical data for modeling.
2. README file:
- Under the raw data folder:
- Under the interim data folder:
3. Log
The log resides in the root of the results directory and records your progress in detail.
Entries in the log should be dated and contain:
- Input: data source, output location, purpose
- Process: a description of what you did
- Result: links, images, or tables displaying the results of the experiment
- Output and findings: your observations, conclusions, and ideas for the next move
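A hypothetical entry following this template (all details invented for illustration):

```
2019-01-15
Input: data/raw/sales.csv -> results/2019-01-15/; test a new feature set
Process: added lag features and retrained the model
Result: results/2019-01-15/metrics.csv and roc.png
Output and findings: lag features helped; try longer lags in the next run
```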
Note:
- It’s important to document why an experiment failed!
- It’s also valuable to transcribe notes from conversations, emails, and texts into the log.
- Storing the log on a wiki-based system or on a blog site may be appealing!
4. Summarize & runall (driver script)
The log file contains a prose description of each experiment, while the runall script provides all the gory details.
Sometimes we are required to output results as well as some analysis of those results. The best approach is to separate the analysis from the driver script, so we can edit the analysis without opening the whole driver script.
So we create one script to run the experiment (runall) and one to summarize and analyze the results (summarize). The final line of runall calls summarize, which outputs the tables and plots.
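A minimal sketch of this split (the experiment steps are elided; summarize.py is the analysis script described above):

```python
# runall.py -- runs the experiment, then hands off to the analysis script
import subprocess

# ... experiment steps: load data, fit the model, write outputs ...

# Final line: regenerate tables and plots. Since the analysis lives in its
# own script, it can be edited and rerun without rerunning the experiment.
subprocess.run(["python", "summarize.py"], check=True)
```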
Important Principles of a Reasonable Project Directory
- Top-level organization should be logical, with chronological organization at the next level and logical organization below that.
- Comment generously. Someone should be able to understand what you are doing solely from reading the comments.
- Avoid editing intermediate files by hand. This one is too important to ignore.
- Some useful commands for this are:
- Unix: sed, awk, grep, head, tail, sort, cut, and paste
- Python:

```python
import os
from os.path import join

# Write the result; if the output directory does not exist yet,
# create it and retry.
try:
    result.to_csv(join(output_dir, "result.csv"), index=False)
except FileNotFoundError:
    os.makedirs(output_dir)
    result.to_csv(join(output_dir, "result.csv"), index=False)
```
5. In parallel with this chronological directory structure, it is useful to maintain a chronologically organized log (covered in detail in the Log section above).
6. Record every operation that you perform in the chronologically organized log.
7. README file:
- Why did you do this exercise?
- What did you do in this exercise? (you can save the email/dialogue here)
- How did you transfer the data? (record the commands)
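A hypothetical README answering those three questions might be as short as:

```
Why:  test whether adding lag features improves the model
What: cleaned the raw extract and retrained; saved the related email thread
How:  pulled the extract from the reporting server, then ran src/clean.py
```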
8. A driver script, like runall.py (holding data locations, header file locations, and parameters)
- Store all file and directory names in this driver script.
- Forcing all of the file and directory names to reside in one place makes it much easier to keep track of, and to modify, the organization of your output files.
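For example (a sketch; every name below is hypothetical), the top of runall.py might collect everything in one block:

```python
# runall.py (top) -- every file and directory name resides in one place,
# so reorganizing the outputs means editing only this block.
RAW_DATA    = "data/raw/sales.csv"
INTERIM_DIR = "data/interim"
RESULTS_DIR = "results"
N_TREES     = 100  # example model parameter
```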
9. Use relative pathnames to access other files within the same project. Beyond relative paths, we should also make file names correspond to the date and other run details, for example by using .format().
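A small sketch of this naming pattern (the path and model name are hypothetical):

```python
from datetime import date

# Relative path whose file name encodes the run date and the model name
path = "results/{d}/{model}-metrics.csv".format(
    d=date.today().strftime("%Y-%m-%d"),
    model="gbm",
)
```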
10. Make the script restartable. It's useful to embed each long-running step of the experiment in a check of the form
if <output file does not exist> then <perform operation>
so we can rerun selected parts of the experiment just by deleting the corresponding output files, with no need to split up your Python script or manually rerun selected cells in a notebook.
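A minimal sketch of a restartable step (the file name and operation are hypothetical):

```python
import os

OUTPUT = "features.csv"  # in a real project this would sit under data/interim

def build_features(path):
    # placeholder for a long-running operation
    with open(path, "w") as f:
        f.write("feature_1,feature_2\n")

# Skip the step if its output exists; delete the file to force a rerun.
if not os.path.exists(OUTPUT):
    build_features(OUTPUT)
```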
Other Notes
Three suggestions for handling and preventing errors:
- Write robust code to detect errors.
- When an error does occur, abort. Use prints or asserts in the middle of the program to check that it is working as expected.
- Whenever possible, create each output file using a temporary name and then rename the file after it is completed.
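A minimal sketch of the temporary-name trick (file names hypothetical); os.replace swaps the name in a single step, so a crash mid-write never leaves a half-finished file under the final name:

```python
import os

tmp_path = "summary.csv.tmp"  # temporary name while writing
final_path = "summary.csv"    # final name appears only when complete

with open(tmp_path, "w") as f:
    f.write("metric,value\n")  # placeholder content

os.replace(tmp_path, final_path)  # rename after the write has finished
```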
References:
A Quick Guide to Organizing Computational Biology Projects https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709440/
Cookiecutter Data Science https://drivendata.github.io/cookiecutter-data-science/
Designing projects https://nicercode.github.io/blog/2013-04-05-projects/