Built 2023-12-04 using NMdata 0.1.3.902.
This vignette is still under development. Please make sure to see latest version available here.
This vignettes aims at enabling you at
Using NMdata’s data preparation tools to assist building your data set
mergeCheck to automatically check merge
results, ensuring rows do not get lost or duplicated
Assigning exclusion flags and obtain a table summary counting data exclusions from source data to analysis data set
Easily and consistently order data columns using
NMcheckData to perform a extensive data check
before exporting for NONMEM
Writing the prepared data to file ensuring compatibility with
NONMEM and for post-processing in R using
Updating multiple NONMEM control streams to read the updated data
file using one simple call of the
Only basic R knowledge should be required to follow the instructions.
Getting data ready for modeling is a crucial and often underestimated task. Mistakes during the process of combining data sets, defining time variables etc. can lead to difficulties during modeling, need for revisiting data set preparation, and in worst case wasted time working with an erroneos data set. Avoiding those mistakes by integrating checks into the data preparation process is a key element in an efficient and reliable data preparation work flow.
Furthermore, NONMEM has a number of restrictions on the format of the
input data, and problems with the data set is a common reason for NONMEM
not to behave as expected. When this happens, debugging can be
NMdata includes some simple functions to
prevent these situations.
This vignette uses
data.table syntax for the little bit
of data manipulation performed. However, you don’t need to use
data.table at all to use these or any tool in
NMdata. The data set is a
pk <- readRDS(file = system.file("examples/data/xgxr2.rds", package = "NMdata")) class(pk) #>  "data.table" "data.frame"
If you are not familiar with
data.table, you can still
keep reading this vignette and learn what
NMdata can do.
data.table is a powerful enhancement to the
data.frame class, and the syntax is a little different from
data.frame. The few places where this affects the examples
provided here, explanations will be given. You can replace all use of
data.table in this vignette with base R functions,
tidyverse functions or whatever you prefer.
When stacking (
rbind) and merging, it is most often
necessary to check if two or more data sets are compatible for the
compareCols compares columns across two or more
To illustrate the output of
compareCols, a slightly
modified version of the
pk dataset has been created. One
CYCLE) has been removed, and
been re-coded to character.
compareCols tells us about
exactly these two differences:
compareCols(pk, pk.reduced) #> Dimensions: #> data nrows ncols #> 1: pk 1502 24 #> 2: pk.reduced 751 23 #> #> Columns that differ: #> column pk pk.reduced #> 1: CYCLE integer <NA> #> 2: AMT integer character #> #> Columns where no differences were found: BLQ, CMT, DOSE, DV, EVENTU, EVID, #> FLAG, ID, NAME, NOMTIME, PART, PROFDAY, PROFTIME, ROW, STUDY, TIME, TIMEUNIT, #> TRTACT, WEIGHTB, eff0, flag, trtact.
Before merging or stacking, we may want to re-code
in one of the datasets to get the class we need, and decide what to do
CYCLE column which is missing in one of the
datasets (add information or fill with
When stacking data sets we often know what columns we are looking to
obtain in the final data. We may already have defined that early in our
data preparation script, and
compareCols can use this to
highlight these columns.
cc is a shorthand function to
create character vectors without quoting the elements.
special.columns <- cc(ID, TIME, CYCLE, STUDY, BW) compareCols(pk, pk.reduced, cols.wanted = special.columns) #> Dimensions: #> data nrows ncols #> 1: pk 1502 24 #> 2: pk.reduced 751 23 #> #> Columns that differ: #> column pk pk.reduced #> 1: *ID integer integer #> 2: *TIME numeric numeric #> 3: *CYCLE integer <NA> #> 4: *STUDY integer integer #> 5: *BW <NA> <NA> #> 6: AMT integer character #> #> Columns where no differences were found: BLQ, CMT, DOSE, DV, EVENTU, EVID, #> FLAG, *ID, NAME, NOMTIME, PART, PROFDAY, PROFTIME, ROW, *STUDY, *TIME, #> TIMEUNIT, TRTACT, WEIGHTB, eff0, flag, trtact.
In this case, we may want to add
diff.only=FALSE to see
if other columns could hold the information we are missing for
The model estimation step is heavily dependent (and in NONMEM almost entirely based) on numeric data values. The source data will often contain character variables, i.e. columns with non-numeric data values.
If the column names reflect whether the values are numeric,
double-checking can be avoided.
columns if a function of their contents returns
pk.renamed <- renameByContents(data = pktmp, fun.test = NMisNumeric, fun.rename = tolower, invert.test = TRUE)
We make use of the function
NMisNumeric which tests if
NONMEM can interpret the contents as numeric. If say the subject ID is
of character class, it can be valid to NONMEM. Subject ID
"1039" will be a numeric in NONMEM,
NMisNumeric will return
TRUE if and
only if all elements are either missing or interpretable as numeric. We
invert the condition (
invert.test=TRUE), and the
names of the columns that NONMEM cannot interpret as numeric become
lowercase. We use
compareCols to illustrate that three
columns were renamed:
compareCols(pktmp, pk.renamed) #> Dimensions: #> data nrows ncols #> 1: pktmp 1502 23 #> 2: pk.renamed 1502 23 #> #> Columns that differ: #> column pktmp pk.renamed #> 1: EVENTU character <NA> #> 2: NAME character <NA> #> 3: TIMEUNIT character <NA> #> 4: eventu <NA> character #> 5: name <NA> character #> 6: timeunit <NA> character #> #> Columns where no differences were found: AMT, BLQ, CMT, CYCLE, DOSE, DV, EVID, #> FLAG, ID, NOMTIME, PART, PROFDAY, PROFTIME, ROW, STUDY, TIME, WEIGHTB, eff0, #> flag, trtact.
We can now easily see that if we wish to include the information
pk.renamed, we have to modify or translate their contents
Merge or join operations are a very powerful data preparation tool. But they are also a very common source of bugs. Most of us know too well how merges can leave us with unexpected rows or make rows disappear. However, most often we can impose restrictions on the merge operation that allows for automated validation of the results.
Imagine the very common example that we have a longitudinal PK data
pk), and we want to add subject-level
covariates from a secondary data set (
dt.cov). We want to
ID, and all we can allow to happen is columns to
be added to
dt.cov. If rows disappear
or get repeated, or if columns get renamed, it’s unintended and should
return an error. That is what
mergeCheck is for.
Often people check the dimensions of the result to make sure nothing
unintended happened. The following example shows that this is not
enough, and that
mergeCheck works differently. After
merging the two data sets the check of the dimensions raises no alarm -
the number of rows is unchanged from
pk2, and one of two columns in
dims is just a
dim-like function that
can compare multiple data sets - handy for interactive analysis.
What we didn’t realize is that we now have twice as many rows for subject 31.
pk[ID == 31, .N] #>  10 pk2[ID == 31, .N] #>  20
If we instead use
mergeCheck, we get an error. This is
mergeCheck compares the actual rows going in and
out of the merge and not just the dimensions.
mergeCheck(pk, dt.cov, by = "ID") #> Rows disappeared during merge. #> Rows duplicated during merge. #> Overview of dimensions of input and output data: #> nrows ncols #> 1: 1502 24 #> 2: 150 2 #> 3: 1502 25 #> Overview of values of by where number of rows in x changes: #> ID N.x N.result #> 1: 31 10 20 #> 2: 180 10 0 #> Error in mergeCheck(pk, dt.cov, by = "ID"): Merge added and/or removed rows.
mergeCheck tells us for which values of
by argument which can be of length
>1) the input and output differ so we can quickly look into the data
sets and make a decision how we want to handle this. In this case we
discard the covariate value for subject 31 and use
all.x=TRUE argument to get
NA for subjects 31
dt.cov2 <- dt.cov[ID != 31] pk2.check <- mergeCheck(pk, dt.cov2, by = "ID", all.x = TRUE) #> Column(s) added: COV
To ensure the consistency of rows before and after the merge, you
merge(...,all.x=TRUE) and then check dimensions
before and after (yes, both
all.x=TRUE and the dimension
check are necessary). This is not needed if you use
mergeCheck does not try to reimplement merging. Under
the hood, the merge is performed by
data.table::merge.data.table to which most arguments are
mergeCheck does is to add the checks that the
results are consistent with the criteria outlined above.
data.table::merge.data.table is generally very fast, and
even if there is a bit of extra calculations in
it should never be slow.
mergeCheck verifies that the rows that
result from the merge are the exact same as in one of the existing
datasets, only columns added from the second input dataset. You may
think that this will limit your merges, and that you need merges for
inner and outer joins etc. You are exactly right -
mergeCheck is not intended for those merges and does not
support them. When that is said, the kind of merges that are supported
mergeCheck are indeed very common. All merges in the
NMdata package are performed with
Another problem the programmer may not realize during a merge is when
column names are shared across
addition to columns that are being merged by). This will silently create
column names like
col.y in the
mergeCheck will by default give a warning if that
happens (can be modified using the
argument). Also, there is an optional argument to tell mergeCheck how
many columns are expected to be added by the merge, and
mergeCheck will fail if another number of columns are
added. This can be useful for programming.
The row order of the first data set is by default maintained by
mergeCheck. Apart from this, there is only one difference
from the behavior of the
merge.data.frame function syntax,
being that either the
by argument or
by.y must always be supplied to
Default behavior of
merge.data.frame is to merge by all
common column names, but for coding transparency, this is intentionally
not allowed by
In addition to stacking doses and concentration data and merging in
covariates, we often need to derive time since previous dose, we may
want to cumulatively count the number and amounts of drug administered,
keep track of previous dose amount and most recent dosing time. These
are all within at least the subject subject.
these really easily.
By default, the column names shown above are used.
addTAPD takes arguments to customize the generated column
names and of course to indicate what columns are store the used
information, such as dose amounts, time etc. By default,
addTAPD adds the five columns listed above. The following
example derives time since previous dose based on nominal time, uses
customized names for derived column names, and skips derivation of
cumulated dose amount.
pk.tapd2 <- addTAPD(pk, col.time = "NOMTIME", col.tapd = "NTAPD", col.tpdos = "NTPDOS", col.ndoses = "NOMNDOSES", col.doscuma = NULL) #> col.ndoses is a deprecated argument. Please use col.doscumn instead. cnames.tapd2 <- colnames(pk.tapd2) ## These are the columns added by addTAPD setdiff(cnames.tapd2, cnames.1) #>  "DOSCUMN" "NTPDOS" "NTAPD" "PDOSAMT"
addTAPD uses information in
AMT (names of
columns holding this information can be speified using arguments). It
respects repeated dosing defined in
II. Under the hood,
NMexpandDoses is used to
achieve this but the returned data will have the exact same rows as the
input data (i.e. if doses are expanded, it is only for internal
calculations on the existing rows).
There is no way around excluding some of the events in data due to
various reasons. We need to be able to answer to why we excluded each of
the points, and to how many points were excluded due to which criteria.
NMdata provides two functions to handle this -
flagsAssign assigns exclusion flags to data records (rows),
flagsCount summarizes the number of discarded rows and
This implementation makes it easy to keep the rows flagged for exclusion in the dataset and ignore them in NONMEM. Or if you prefer, you can remove the rows after generating an overview of the exclusion counts for your report.
flagsCount are based on
sequential evaulation of exclusion criteria. This means we can summarize
how many records and subjects were excluded from the analysis due to the
different criteria. The information is represented in one numerical
column for NONMEM, and one (value-to-value corresponding) character
column for the rest of us in the resulting data.
For use in NONMEM’s
IGNORE feature, the easiest is that
inclusion/exclusion is determined by a single column in data - we call
FLAG here, but any column name can be used.
FLAG obviously draws on information from other columns such
DV, and many others, depending on
your dataset and your way of working.
The function that applies exclusion rules is called
flagsAssign, and it takes a dataset and a data.frame with
rules as arguments. In this example we consider four different reasons
to exclude samples - and only samples (keeping all doses in the
analysis). We exclude all pre-dose samples. We also exclude samples with
missing time, missing value, and we exclude those below LLOQ. The
data.frame with these rules looks like this
dt.flags <- fread(text = "FLAG, flag, condition 40, Pre-dose sample, !is.na(TIME) & TIME<0 30, Missing time, is.na(TIME) 20, Missing value, is.na(DV) 10, Below LLOQ, BLQ==1") dt.flags #> FLAG flag condition #> 1: 40 Pre-dose sample !is.na(TIME) & TIME<0 #> 2: 30 Missing time is.na(TIME) #> 3: 20 Missing value is.na(DV) #> 4: 10 Below LLOQ BLQ==1
fread is used to create a data.table (like
read.csv to create a data.frame) for readability, one line
for each row in the data.table created. Notice how
numeric and interpretable by NONMEM,
flag is descriptions
interpretable by humans, and
condition is expressions
interpretable by R.
pk <- flagsAssign(pk, tab.flags = dt.flags, subset.data = "EVID==0") #> Coding FLAG = 40, flag = Pre-dose sample #> Coding FLAG = 30, flag = Missing time #> Coding FLAG = 20, flag = Missing value #> Coding FLAG = 10, flag = Below LLOQ
flagsAssign applies the conditions sequentially and by
decreasing value of
FLAG=0 means that
the observation is included in the analysis. You can use any expression
that can be evaluated within the data.frame. In this case, numeric
BLQ culomns must
Finally, flags are assigned to
EVID==1 rows. Here, no
flag table is used. This means that all
EVID==1 rows will
flag="Dose". You can use a
separate data.frame of flags for dosing records as needed.
pk <- flagsAssign(pk, subset.data = "EVID==1", flagc.0 = "Dose")
Again, the omission will be attributed to the first condition
matched. Default is to apply the conditions by the order of decreasing
numerical flag value. Use
flags.increasing=TRUE if you
prefer the opposite. However, what cannot be modified is that 0 is the
numerical value for rows that are not matched by any conditions.
In NONMEM, we can now include
$INFILE. NMwriteData (see later in
this vignette) will by default look for
FLAG and suggest an
IGNORE statement for
What rows to omit from a data set can vary from one analysis to
another. Hence, the aim with the chosen design is that the inclusion
criteria can be changed and applied to overwrite an existing
inclusion/exclusion selection. For another analysis we want to include
the observations below LLOQ. We have two options. Either we simply
IGNORE statement given above to
IGNORE=(FLAG.LT.10), or you create a different exclusion
flag for that one. If you prefer to create a new set of exclusion flags,
just use new names for the numerical and the character flag columns so
you don’t overwrite the old ones. See help of
flagsCount for how - arguments are called
An overview of the number of observations disregarded due to the
different conditions is then obtained using
we see from the
names call below, both discarded,
cumulative discarded, and observations left after application of the
respective criterion are available. Choose the ones you prefer - here we
show how many observations and subjects were matched by each criterion
and how many were left after application of each criterion.
tab.count <- flagsCount(data = pk[EVID == 0], tab.flags = dt.flags) names(tab.count) #>  "flag" "N.left" "Nobs.left" "N.discard" #>  "N.disc.cum" "Nobs.discard" "Nobs.disc.cum" tab.count[, .(`Data cleaning step` = flag, N.discard, Nobs.discard, N.left, Nobs.left)] |> kable()
|Data cleaning step||N.discard||Nobs.discard||N.left||Nobs.left|
|All available data||NA||NA||150||1352|
Notice that each row in the summary table does not describe how many
observations matched the criterion, but how many observations
were excluded due to the criterion. For instance, two samples
are excluded due to values below LLOQ. All the predose samples may also
be below LLOQ. By the order of the
FLAG values however, we
decided that we wanted to exclude this samples no matter of their
values. Hence they are counted in that and only that bin.
flagsCount includes a
file argument to save
the the table as a csv right away.
Once the dataset is in place,
NMdata provides a few
useful functions to ensure the formatting of the written data is
compatible with NONMEM. These functions include checks that NONMEM will
be able to interpret the data as intended, and more features are under
development in this area.
The order of columns in NONMEM is important for two reasons. One is
that a character in a variable read into NONMEM will make the run fail.
The other is that there are restrictions on the number of variables you
can read into NONMEM, depending on the version.
NMorderColumns tries to put the used columns first, and
other or maybe even unusable columns in the back of the dataset. It does
so by a mix of recognition of column names and analysis of the column
Columns that cannot be converted to numeric are put in the back,
while column bearing standard NONMEM variable names like
EVID etc. will be
pulled up front. You can of course add column names to prioritize to
first) or back (
?NMorderColumns for more options.
pk <- NMorderColumns(pk)
One trick is worth mentioning here. If you are adding variables to a
data set after having started to model with NONMEM, you may not want to
have to update and rerun your NONMEM models right away.
NMorderColumns has options for putting some variable last
(to the right) in data. That argument is called
has several other options to tweak how the columns are ordered so you
can hopefully get the order you want.
Before we save the data and go to model estimation,
NMdata offers a quite extensive and automated function to
check data for consistency and compatibility with NONMEM.
NMcheckData checks all the standard NONMEM columns
against the NONMEM requirements and looks for other common data issues.
The list is quite long. Please see
?NMcheckData for a list
of performed checks.
We can add subject level covariates, and subject-occasion covariates to be checked for whether they are non-missing, numeric and not varying with subject or subject-occasion. We can also add other numeric variables to use in NONMEM to check for missing values.
findings <- NMcheckData(pk, covs = c("DOSE", "WEIGHTB")) #> column check N Nid #> EVID Subject has no obs 19 19 #> MDV Column not found 1 0 #> WEIGHTB Cov not unique within ID 1 1 #> AMT Non-positive dose amounts 1 1 #> EVID EVID not in 0:4 1 1
Let’s look at these findings:
findings #> row ID column check level ROW #> 1 NA 31 EVID Subject has no obs ID NA #> 2 NA 32 EVID Subject has no obs ID NA #> 3 NA 33 EVID Subject has no obs ID NA #> 4 NA 34 EVID Subject has no obs ID NA #> 5 NA 36 EVID Subject has no obs ID NA #> 6 NA 37 EVID Subject has no obs ID NA #> 7 NA 38 EVID Subject has no obs ID NA #> 8 NA 39 EVID Subject has no obs ID NA #> 9 NA 42 EVID Subject has no obs ID NA #> 10 NA 44 EVID Subject has no obs ID NA #> 11 NA 45 EVID Subject has no obs ID NA #> 12 NA 46 EVID Subject has no obs ID NA #> 13 NA 49 EVID Subject has no obs ID NA #> 14 NA 52 EVID Subject has no obs ID NA #> 15 NA 53 EVID Subject has no obs ID NA #> 16 NA 56 EVID Subject has no obs ID NA #> 17 NA 57 EVID Subject has no obs ID NA #> 18 NA 58 EVID Subject has no obs ID NA #> 19 NA 60 EVID Subject has no obs ID NA #> 20 NA NA MDV Column not found column NA #> 21 NA 180 WEIGHTB Cov not unique within ID ID NA #> 22 1403 171 AMT Non-positive dose amounts row 1403 #> 23 1480 178 EVID EVID not in 0:4 row 1480
level we can now take a look at single
rows, data from a subject or fix a column to address these. The fact
that some subjects are missing observations in this case is not
necessarily an error (they are in this case all BLQ), but
WEIGHTB has to be constant within subjects, and for NONMEM
to even run,
EVID must be in
0:4. So those
have to be fixed. For the rest of the vignette, assume we fixed those
pk <- copy(pk.copy)
For the final step of writing the dataset,
is provided. Most importantly, it writes a csv file with appropriate
options for NONMEM to read it as well as possible. It can also write an
rds for R with equal contents (or RData if you prefer), but with the rds
including all information (such as factor levels) which cannot be saved
in csv. If you should use
NMscanData to read NONMEM
results, this information can be used automatically.
NMwriteData also by default calls
which provides a proposal for text to include in the
$DATA sections of the NONMEM control streams. There are
several arguments that will affect the proposed text for the NONMEM run,
?NMwriteData and especially
Let’s include the origin script of the data as meta data.
write.csv=TRUE is default but included here because we
often want to use something like
writeOutput is a switching variable we set to
FALSE in the initialization section of
text.nm <- NMwriteData(pk, file = "derived/pkdata.csv", script = "DataPrepare.Rmd", write.csv = TRUE, args.stamp = list(Description = "PK data for the Data Preparation vignette.")) #> arguments in the format write.xxx are deprecated. Use the `formats.write` argument instead. Example: formats.write=c("csv","rds") #> For NONMEM: #> $INPUT ROW ID NOMTIME TIME EVID CMT AMT DV FLAG STUDY BLQ CYCLE DOSE #> PART PROFDAY PROFTIME WEIGHTB eff0 #> $DATA derived/pkdata.csv #> IGN=@ #> IGNORE=(FLAG.NE.0) #> Data written to file(s): #> derived/pkdata.csv #> derived/pkdata.rds
We are being told that two files were saved, and then we get some
text to use in the NONMEM control streams.
detected the exclusion flag and suggests to include it in
Let’s take a look at what was saved:
list.files("derived") #>  "pkdata_meta.txt" "pkdata.csv" "pkdata.rds"
There is a metadata file which
automatically recognize if found. The metadata becomes accessible using
dat.inp <- NMreadCsv("derived/pkdata.csv") NMinfo(dat.inp) #> $dataCreate #> $dataCreate$DataCreateScript #>  "DataPrepare.Rmd" #> #> $dataCreate$CreationTime #>  "2023-12-04 19:19:20.346164" #> #> $dataCreate$writtenTo #>  "derived/pkdata.rds derived/pkdata.csv" #> #> $dataCreate$Description #>  "PK data for the Data Preparation vignette."
With the flexibility of the
rds format, we don’t need
such an additional file. Only difference on the metadata for the rds
file is the filename:
dat.inp.rds <- readRDS("derived/pkdata.rds") NMinfo(dat.inp.rds) #> $dataCreate #> $dataCreate$DataCreateScript #>  "DataPrepare.Rmd" #> #> $dataCreate$CreationTime #>  "2023-12-04 19:19:20 UTC" #> #> $dataCreate$writtenTo #>  "derived/pkdata.rds" "derived/pkdata.csv" #> #> $dataCreate$Description #>  "PK data for the Data Preparation vignette."
If we have to update the input data file, the NONMEM
$INPUT sections no longer match the input data. We saw in
NMorderColumns how we can use the
argument to get columns pushed towards the back so the NONMEM runs
should still work. But maybe you need the column in your nonmem runs,
and so you have no way around updating the control streams. And that can
be quite a lot of control streams. With
NMdata that is
NMdata has a couple of functions to extract and write
sections to NONMEM control streams called
NMwriteSection. Those functions are very flexible for
updating NONMEM control streams, and we will not go into detail with
them, but let’s stick to the example above. We can do
NMwriteSection(dir = "nonmem", file.pattern = "run1.*\\.mod", list.sections = text.nm["INPUT"])
This updates the INPUT section (and not DATA) for all control streams
in directory “nonmem” which file names start with “run1” and end in
“.mod” (say “run101.mod” to “run199.mod”). If we had done simply
list.sections=text.nm instead of
list.sections=text.nm["INPUT"], it would have replaced the
$DATA section too. However, the
rarely needs update following an update of the input data file, and
$DATA can vary among control streams that use
the same input data (some models may be estimated on a smaller subset of
data), so be careful with that.
NMwriteSection has the argument
to further limit the scope of files to update based on what data file
the control streams use. It only makes sense to use the auto-generated
text for control streams that use this data set.
The text for NONMEM can be generated without saving data using
NMgenText. You can tailor the generation of the text to
DV instead of
CONC) and more.
We saw how
NMwriteDatat saves metadata automatically.
NMwriteData can actually be used as a simple rds
writer that adds meta data the same way, we may want to save data or any
R object using
saveRDS. In that case, use
NMstamp (which is also what
script argument is recognized by
NMstamp, but you can add anything to this. We want to keep
descriptive note too. Another often useful piece of information is what
source data files were read in order to generate the saved data.
Source.Files are only examples
- any name can be used.
pk <- NMstamp(pk, script = "vignettes/DataCreate.Rmd", Description = "A PK dataset used for examples.", Source.Files = "/path/to/adpc.sas7bdat,/path/to/adsl.sas7bdat") NMinfo(pk) #> $dataCreate #> $dataCreate$DataCreateScript #>  "vignettes/DataCreate.Rmd" #> #> $dataCreate$CreationTime #>  "2023-12-04 19:19:20 UTC" #> #> $dataCreate$Description #>  "A PK dataset used for examples." #> #> $dataCreate$Source.Files #>  "/path/to/adpc.sas7bdat,/path/to/adsl.sas7bdat"
These are very simple functions. But hopefully they will help you avoid sitting with a data set trying to guess which script generated it.
Again, when using
NMwriteData, you don’t have to call
NMstamp explicitly. Just pass the
NMstamp will be