The general workflow for report creation outlined here can be leveraged in other projects.
0. Make it an R package
Even though it seems more complicated at the outset, creating your report workflow as an R package will save time and streamline your final product in the long run. When your reporting ecosystem inhabits its own R package, it can be easily accessed by other people (including future you). To create an R package, start a new R project and then run `usethis::create_package()`.
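For example, a minimal sketch of scaffolding a new reporting package (the package name `myreports` is a hypothetical placeholder):

```r
# a minimal sketch: scaffold a new reporting package
# ("myreports" is a hypothetical package name)
install.packages("usethis")
usethis::create_package("myreports")

# optional but useful follow-up setup
usethis::use_readme_md()    # add a README
usethis::use_mit_license()  # add a license
```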
1. Data curation
The first step is to locate data. Data can come from a variety of sources, including public websites, private databases, personal correspondence, or existing R packages. Current data sources for the `NEesp2` package include the `assessmentdata` package, the `survdat` package (`svdbs` database), and MRIP recreational catch data.
After data has been located, it must be formatted correctly. The data may already be in the correct format, but in many cases some cleaning will be required. It is strongly suggested to put the data in a tidy format (one observation per row, one variable per column).
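As an illustration, here is a minimal sketch of reshaping a wide table into tidy format with tidyr (the object and column names are hypothetical):

```r
# a minimal sketch: reshape wide survey data into tidy format
# (object and column names are hypothetical)
library(tidyr)
library(dplyr)

wide_data <- tibble::tibble(
  species = c("cod", "haddock"),
  `2020`  = c(1.2, 3.4),
  `2021`  = c(1.5, 3.1)
)

tidy_data <- wide_data |>
  pivot_longer(cols = -species, names_to = "year", values_to = "biomass") |>
  mutate(year = as.integer(year))
```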
Depending on your data source, the data may or may not already have species names and stock regions assigned. For example, some survey data may report strata rather than stock regions, or scientific names rather than common names. Format all of your data to conform to standard species and region naming conventions.
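One hedged way to standardize regions is a lookup-table join (all strata and region values below are hypothetical):

```r
# a hedged sketch: map survey strata to stock regions with a lookup table
# (all values are hypothetical)
library(dplyr)

survey_data <- tibble::tibble(
  stratum = c(1010, 1020, 1030),
  biomass = c(2.1, 0.8, 1.4)
)

strata_key <- tibble::tibble(
  stratum = c(1010, 1020, 1030),
  region  = c("Georges Bank", "Georges Bank", "Gulf of Maine")
)

survey_with_regions <- survey_data |>
  left_join(strata_key, by = "stratum")
```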
Finally, once your data is properly formatted with standard species and region names, you should save it in your own R package. After your data exists as an R object, you can save it to the package with `usethis::use_data({your data name})`. This will allow you to access your data quickly and easily, without having to repeat any data curation steps when you need to use it.
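For instance, continuing with the hypothetical object from the previous sketch:

```r
# a minimal sketch: store a curated object as package data
# ("survey_with_regions" is the hypothetical object from the previous sketch)
usethis::use_data(survey_with_regions, overwrite = TRUE)

# after the package is installed, the data can be accessed as
# mypackage::survey_with_regions ("mypackage" is a hypothetical name)
```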
2. Write functions
After you have your data, the next step is to begin visualizing and displaying it. I like to begin by creating a plot or table for one species (or stock) only. Once I am happy with the look of the plot or table, I generalize the creation code into a function that can be applied to any input data. Finally, document the function with the `roxygen2` package and save it into your package.
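A hedged sketch of what such a generalized, documented function might look like (the function and column names are hypothetical):

```r
# a hedged sketch of a generalized, roxygen2-documented plotting function
# (function and column names are hypothetical)

#' Plot annual biomass for one stock
#'
#' @param data A tidy data frame with `year` and `biomass` columns.
#' @param species_name Common name of the species, used as the plot title.
#'
#' @return A ggplot object.
#' @export
plot_biomass <- function(data, species_name) {
  ggplot2::ggplot(data, ggplot2::aes(x = year, y = biomass)) +
    ggplot2::geom_line() +
    ggplot2::labs(title = species_name, x = "Year", y = "Biomass")
}
```

Running `devtools::document()` then generates the help files and `NAMESPACE` entries from these comments.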
3. Make your report template
You can create your report template at the same time as you create your functions. I recommend using the `bookdown` package to create your template, because you can create each section as its own RMarkdown file, which helps with organization.
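For example, a hypothetical template layout (all file names are placeholders):

```
inst/report-template/
├── index.Rmd        # YAML header, parameters, and setup chunk
├── 01-catch.Rmd     # one section per RMarkdown file
├── 02-survey.Rmd
└── _bookdown.yml    # lists the .Rmd files and their order
```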
I recommend that you give your template the ability to export and save all of the data and figures used and produced by your report. This isn’t required, but it makes it easier for the information from your report to be used in other contexts. Data can be saved with a simple call to `write.csv`, or to `saveRDS` for large or unwieldy data sets. Figures can be saved by setting the figure path in the global chunk options, e.g., by running `knitr::opts_chunk$set(fig.path = "...")` in your setup chunk. End the path with a forward slash (/), or else the figure names will be appended to the folder name rather than placed in the folder. Alternatively, you can use `png(...); ...; dev.off()` to save a figure. If desired, you can create an RMarkdown parameter to toggle data/figure saving on and off.
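A minimal sketch of a setup chunk that does this (the parameter name `save_output` is hypothetical and would be declared in the YAML header):

```r
# a minimal sketch of a report setup chunk
# ("save_output" is a hypothetical RMarkdown parameter)
knitr::opts_chunk$set(fig.path = "figures/")  # trailing slash: figures land inside figures/

# later in the report, save data only when the toggle is on
if (isTRUE(params$save_output)) {
  saveRDS(report_data, "output/report_data.rds")  # "report_data" is hypothetical
}
```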
If you find that your workflow necessitates frequently repeating the same code chunk(s) or RMarkdown section(s) with different inputs, consider making a child document that you can call instead. This will streamline your code and reduce copy-and-paste errors. For example, the `render_indicator` function used in the regression reports calls a child document that is rendered with different indicator inputs each time (inputs are assigned to generic names before `render_indicator` is called).
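A hedged sketch of this pattern with `knitr::knit_child()` (the object names and child file are hypothetical):

```r
# a hedged sketch: render a child document once per indicator
# ("indicators" and "_indicator-child.Rmd" are hypothetical)
out <- character()
for (i in seq_along(indicators)) {
  indicator_data <- indicators[[i]]      # generic name the child document expects
  indicator_name <- names(indicators)[i]
  out <- c(out, knitr::knit_child("_indicator-child.Rmd", quiet = TRUE))
}
cat(out, sep = "\n")  # run in a chunk with results = "asis"
```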
Although it is not required, I strongly recommend that you create a custom rendering function for your report template. This will allow you to set default values and make the parameters that should be changed more visible.
Your template should be saved in the `inst` folder of your package (in its own folder, if it’s a bookdown). Files in the `inst` folder are made available in the root directory of an installed R package, so your package code can access the template files by calling `system.file("{your template name}", package = "{your package name}")`.
4. Update your package and iterate
As you work on your report, update package functions and the report template as needed. You will likely run into edge cases that have to be handled, come up with more efficient code, and add new data as it becomes available, among other updates.
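The typical devtools iteration loop applies here:

```r
# a typical devtools iteration loop
devtools::document()  # regenerate roxygen2 documentation and NAMESPACE
devtools::load_all()  # reload package code for interactive testing
devtools::check()     # run R CMD check before committing changes
```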
5. Deploy a suite of reports using GitHub Actions
If you are using your workflow to generate a large number of reports, it is probably useful to offload the report generation onto a remote runner with GitHub Actions. When you generate your reports with GitHub Actions, you do not have to devote memory and time on your local machine to report generation. However, there will likely be some hurdles in getting GitHub Actions set up successfully. Here is a GitHub Actions workflow that successfully renders reports and deploys them to a different repository. The general steps in the workflow are outlined below.
GitHub Actions runs your code in a remote environment, which does not have anything pre-installed. That means you have to install R, pandoc, and any packages you need. I have found that I also have to configure some settings on the command line in bash to get all of the packages I need to install correctly. You also have to check out your repo to the remote environment.
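A hedged sketch of these setup steps using the maintained `r-lib/actions` (the workflow name and rendering script are hypothetical):

```yaml
# a hedged sketch of the environment setup steps
# (the workflow name and render script are hypothetical)
name: render-reports
on: workflow_dispatch

jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4            # check out the repo
      - uses: r-lib/actions/setup-r@v2       # install R
      - uses: r-lib/actions/setup-pandoc@v2  # install pandoc
      - name: Install packages
        run: Rscript -e 'install.packages("remotes"); remotes::install_deps(dependencies = TRUE)'
      - name: Render reports
        run: Rscript render-reports.R        # hypothetical rendering script
```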
Depending on the complexity of your reports, you may need several dozen libraries, and the time it takes to install them adds up. To avoid re-installing every library each time you run your GitHub Action, you can cache your packages by adding a cache step to your workflow. The cache step saves the libraries in the remote environment so they can be loaded without re-installing. As long as the cache is accessed at least once a week, it will continue to live in the remote environment. Therefore, I suggest making a cache-maintaining GitHub Actions workflow that pings the cache on a regular schedule, so you don’t have to re-create the cache every time you go a few weeks without creating reports.
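A hedged sketch of a cache step with `actions/cache` (this assumes the library path is in the `R_LIBS_USER` environment variable, which the `setup-r` action sets):

```yaml
# a hedged sketch: cache installed R packages between runs
- name: Cache R packages
  uses: actions/cache@v4
  with:
    path: ${{ env.R_LIBS_USER }}
    key: ${{ runner.os }}-r-${{ hashFiles('DESCRIPTION') }}
    restore-keys: ${{ runner.os }}-r-
```

Place the cache step before the package installation step, so that a cache hit skips the installs.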
After rendering the reports in the remote environment, you must deploy the reports to a repository on github.com, or else the GitHub Action will end, the remote environment will be closed, and the reports will be gone. Before you can successfully deploy reports with GitHub Actions, you must do some housekeeping to set yourself up with the correct permissions. First, create a personal access token by going to Settings -> Developer settings -> Personal access tokens in your GitHub account. Write this token down and save it somewhere, but do not share it with anyone. Next, go to the repo where you are setting up the GitHub Actions workflow, go to Settings -> Secrets, and add the token as a secret (the workflow will reference it by this name). You must add the personal access token as a secret in every repo that deploys with a GitHub Action.
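As one possible approach, here is a hedged sketch of a deploy step using the third-party `peaceiris/actions-gh-pages` action (the secret name, target repository, and folder are hypothetical, and the original workflow may use a different action):

```yaml
# a hedged sketch: deploy rendered reports to a different repository
# (secret name, target repo, and folder are hypothetical)
- name: Deploy reports
  uses: peaceiris/actions-gh-pages@v4
  with:
    personal_token: ${{ secrets.REPORT_DEPLOY_TOKEN }}
    external_repository: my-org/report-site
    publish_dir: ./reports
```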
One issue with deploying reports on GitHub Actions is that if one step fails, all following steps in the workflow are canceled. This means that if your report rendering script breaks, no reports will be deployed, even if the script breaks on the last report. Depending on the complexity of your report, it may be common for an untested edge case to break the code. Fortunately, there is a work-around: if you wrap your render function in `try()`, the script will execute even if some reports are not properly created. Of course, the reports that fail to render will not be updated. You can create a report generation script that integrates error logging code to output information about the reports that failed.
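A minimal sketch of this pattern, reusing the hypothetical `render_report()` function from earlier:

```r
# a minimal sketch: keep rendering when individual reports fail, and log failures
# ("species_list" and "render_report" are hypothetical, from the earlier sketch)
failed <- character()
for (sp in species_list) {
  result <- try(render_report(sp), silent = TRUE)
  if (inherits(result, "try-error")) {
    failed <- c(failed, sp)
    message("Report failed for ", sp, ": ",
            conditionMessage(attr(result, "condition")))
  }
}
writeLines(failed, "failed-reports.txt")  # simple log of reports that did not render
```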