||||
|\/|
|/\|
||||

SIB-PAIR 1.00b (08 May 2008)

by

David L. Duffy

(1995-2008)

A program for elementary genetical analyses

David L. Duffy, MBBS PhD.
Queensland Institute of Medical Research,
300 Herston Road,
Herston, Queensland 4029, Australia.
Email: davidD@qimr.edu.au

CONTENTS

INTRODUCTION

Program Sib-pair performs a number of simple analyses of family data that tend to be "nonparametric" or "robust" in nature. It is modelled to some extent on the Genetic Analysis System [Young, 1995] in terms of the command language and types of analysis. Included are routines for:

An example of use

Sib-pair is command line oriented, and writes output to the standard output (the screen if you are using it interactively). Output can be saved to a file by diverting it to a file using the "out" command (in which case you will no longer see it coming to the screen). Alternatively, you may use another program that can copy from the standard output, such as tee, emacs (the powerful editor usually found on Unix systems), syn (a powerful Windows editor) or the Tcl/Tk based GUI program isp which can found on the same site as Sib-pair itself.

Bearing this in mind, here is a sample interactive session. We start at the command line of your operating system shell:

> sib-pair

|||| SIB-PAIR: A program for simple genetic analysis
|\/| Version : Version 1.00a16 (01-Jul-2006)
|/\| Author : David L Duffy (c) 1995-2006
|||| Job run : Thu Jun 1 15:28:03 2006


>>

A double arrow command prompt appears. We read in a previously prepared script:

>> include ex.in

The contents of ex.in are a series of Sib-pair commands:

# declare four loci
set loc a affection
set loc b quantitative
set loc m1 marker 0.0 cM
set loc m2 marker 5.1 cM
# read the pedigree data
read ped inline
ex1 1a x x m n x 1 3 1 2
ex1 1b x x f n x 1 2 3 4
ex1 2a 1a 1b m n 3.5 1 2 1 3
ex1 2b x x f n 1.1 2 2 2 3
ex1 3a 2a 2b m y 4.3 1 2 1 2
ex1 3b 2a 2b m n 2.0 2 2 2 3
ex1 3c 2a 2b f n 0.8 2 2 3 3
ex1 4a 3c 3d f y x 1 2 2 3
ex1 3d x x m n x x x x x
ex1 4b 3b 3e m y 4.7 1 2 3 4
ex1 4c 3b 3f m n 1.6 2 2 1 3
;;;;
# The four semicolons ends the in-line data
run

The "run" command actually starts the initial processing of the pedigree.


NOTE:  Imputation level 1.  Imputing untyped parental
        genotypes where unequivocal.

Pedigree file        = inline.ped                                                                                                                                      
Number of loci       =     4

Locus      Type Position
---------- ---- --------
a           a     6
b           q     7
m1          m     8--  9
m2          m    10-- 11

Number of marker loci=    2
Bonferroni corr. 5%  =    0.025321
Bonferroni corr. 1%  =    0.005013
Bonferroni corr. 0.1%=    0.000500

NOTE:  Creating dummy record for ex1-3e.
NOTE:  Creating dummy record for ex1-3f.

NOTE: Father and mother of person ex1-4a appear to be swapped around.  Reordering.
NOTE:  Person ex1-3e appears as a mother and sex was unspecified.
       Setting sex to female.
NOTE:  Person ex1-3f appears as a mother and sex was unspecified.
       Setting sex to female.
Pedigree: ex1        No. members:   13 No. founders:    6 No. sibships:    5

Total number of pedigrees  =     1
Number with only 1 member  =     0
Total number of sibships   =     5
Total number of subjects   =    13
Total subjects genotyped   =    10 ( 76.9%)
Total number of genotypes  =     20
Largest pedigree (members) =    13 (Pedigree ex1)
Deepest pedigree (genrtns) =     4 (Pedigree ex1)

Mean size of pedigrees     =    13.0
Mean pedigree depth        =     4.0
Mean size where >1 members =    13.0
Mean depth where >1 geners =     4.0
Number of imputed genotypes=     0

Number of pedigree errors  =     0
Number of deleted records  =     0

We obtain a few routine messages and summary statistics. The small table of Bonferroni corrections is a reminder that Sib-pair does not usually correct P-values for multiple tests; it is up to the user to decide what the appropriate significance thresholds are.

Blank records are created for named but missing parents. This is necessary, as a formal pedigree should have neither or both parents present for each person.

>> ls

a* b* m1 m2

The "ls" commands shows trait loci (with an asterisk appended), and marker loci.

>> drop $m
>> ls
>> undrop

a* b* (m1) (m2)

The "drop" commands drops loci from the scope of subsequent commands, while the "undrop" command returns access to all loci.

>> describe m1


------------------------------------------------
Allele frequencies for locus "m1        "
------------------------------------------------
   Allele  Frequency   Count  Histogram
       1     0.3000        6  ******
       2     0.6500       13  *************
       3     0.0500        1  *

Number of alleles    =    3
Heterozygosity (Hu)  =    0.5105
Poly. Inf. Content   =    0.4064
Number persons typed =   10 ( 76.9%)

>> describe snp


Marker         NAll  Allele(s)    Freq   Het    Ntyped
-------------- ---- -----------   ------ ------ ------
m1                3    1 ..   3    -     0.5105    10
m2                4    1 ..   4    -     0.7211    10

The "describe" command gives summary information about loci. For marker loci, it tabulates simple allele counts and proportions in the dataset. Given the keyword "snp", it gives a summary for all markers, one line per locus. For traits, "describe" gives familial correlations or recurrence risks. The "tabulate" gives simpler tables:

>> tab a


a          x:   2      y:   3      n:   8

>> tab a m1





------------------------------------------------
Cross-tabulation of "a         " ... "m1        "
------------------------------------------------

                      m1        
a                 1/2          1/3          2/2   
           ---------------------------------------
     n         2 (.286)     1 (.143)     4 (.571)
     y         3 (1.00)     0 (.000)     0 (.000)

    No. complete observations =    10
    LR contingency chi-square =   5.5
           Degrees of freedom =   2
                      P-value =0.0643

There are a few other simple descriptive commands. The "count" command gives information about families and individuals:

>> count b>1

Count where "b > 1":
 
Pedigree   Con=T   Num  ASPs Trios    4+
---------- ----- ----- ----- ----- -----
ex1            6    13     1     0     0
Total          6    13     1     0     0

The "select" command selects pedigrees containing a specified number (or one or more) individuals meeting the criterion. The "print" command is individual oriented:

>> print b>1

Print where "b > 1":

ped=ex1 id=2b fa=x mo=x  sex=f b=1.1000 m1=2/2 m2=2/3
ped=ex1 id=4c fa=3b mo=3f sex=m b=1.6000 m1=2/2 m2=1/3
ped=ex1 id=4b fa=3b mo=3e sex=m b=4.7000 m1=1/2 m2=3/4
ped=ex1 id=3a fa=2a mo=2b sex=m b=4.3000 m1=1/2 m2=1/2
ped=ex1 id=3b fa=2a mo=2b sex=m b=2.0000 m1=2/2 m2=2/3
ped=ex1 id=2a fa=1a mo=1b sex=m b=3.5000 m1=1/2 m2=1/3

Number of matched persons   =    6 out of    13 ( 46.2%)
Number of matched pedigrees =    1 out of     1 (100.0%)

If we had instead issued:

>> keep $m
>> print b>1

we would obtain:

Print where "b > 1":

ped=ex1 id=2b fa=x mo=x  sex=f m1=2/2 m2=2/3
ped=ex1 id=4c fa=3b mo=3f sex=m m1=2/2 m2=1/3
...

As of 2006-Mar-01, we can write expressions containing genotypes:

>> undrop
>> print m2=="3/4"

Print where "m2 = = 3/4":

ped=ex1 id=1b fa=x mo=x  sex=f a=n b=x m1=1/2 m2=3/4
ped=ex1 id=4b fa=3b mo=3e sex=m a=y b=4.7000 m1=1/2 m2=3/4

The "associate" command gives results from family based tests of allelic association for a trait versus all active marker loci:

>> ass a

--------------------------------------------------
Allelic association testing for trait "a         "
--------------------------------------------------

Marker     Typed  Allels Chi-square Asy P  Emp P  Iters
---------- ------ ------ ---------- ------ ------ ------
m1             10      3        1.9 0.3930 0.1575    127 AssX2-HWE .  
m1              1      3        1.1 0.5978 0.5978     23 RC-TDT    .  
m2             10      4        0.9 0.8192 0.7692     26 AssX2-HWE .  
m2              1      4        2.1 0.4471 0.4471    160 RC-TDT    .  

The "sibpair" command gives results from regression based tests of linkage for a trait versus all active marker loci. The P-values for these tests can be "empirically" estimated by gene-dropping marker alleles under the null hypothesis of no linkage:

>> sib b sim

---------------------------------------------------------
Sham S+D H-E for trait "b         " v. all markers
---------------------------------------------------------

Marker     FSibs  HSibs  t-value    Asy P  Emp P  Iters
---------- ------ ------ ---------- ------ ------ ------
m1              3      1        2.6 0.1180 0.0288    695 H-E +  
m2              3      1        1.5 0.1908 0.2128     94 H-E .  

Finally, we log transform the quantitative trait values and write out a Genehunter type pedigree file, so we can further examine our data using a multipoint program:

>> b=log(b+1)
>> keep b -- m2
>> write locus merlin merlin.dat
>> write merlin merlin.ped
>> write map merlin merlin.map

The "dummy" keyword adds a dummy binary trait variable to the Genehunter pedigree file, so that program will calculate ibds for us. If we like, Merlin can be run from within Sib-pair:

>> $ merlin --vc

The "$" command shells out to run another program. When we exit from Merlin, we will be returned to Sib-pair:

>> quit

This job took  1.8 minutes

There are a number of operations useful for manipulating pedigree data prior to analysis. These allow you to prune ("prune") pedigrees down to selected individuals and only those relatives needed to connect the index people, to reduce families to unrelated cases and controls ("case"), to break up large pedigrees into component nuclear families ("nuclear") or into unrelated cliques ("subped"), if the pedigree file does not specify all the connecting relatives (between different branches say). One can also select ("select" and "unselect") particular groups of pedigrees, and specify different values of a variable in each group:

# Select out ASP nuclear families
>> nuclear
>> select containing exactly 2 where dementia and isnon and anytyp
# or EDAC nuclear families
>> unselect
>> select containing 1 where IgE>1000 and isnon and anytyp
>> select containing 1 where IgE<50 and isnon and anytyp

Sib-pair also can be used as a calculator, and has a few genetics utilities, notably the "sml" and "grr" commands which gives expected recurrence risks, ibd's and genotype frequencies for a specified diallelic model:

>> sml 0.01 0.5 0.2 0.1

------------------------------------------------
Single Major Locus Recurrence Risk Calculation
------------------------------------------------

 Frequency(A): 0.010000; Pen(AA): 0.500; Pen(AB): 0.200; Pen(BB): 0.100
 Trait Prev  : 0.102020; Pop AR:   2.0%; Var(Add): 0.000206; Var(Dom): 0.000004

 Measure      MZ Twin       Sib-Sib        Par-Off      Second    
----------   ----------    ----------     ---------    ----------
 Rec risk         0.104         0.103         0.103         0.103
 Rel risk         1.023         1.011         1.011         1.006
 Odds rat         1.025         1.012         1.012         1.006
 PRR              1.020         1.010         1.010         1.005
 ibd|A-A          1.000         0.502         0.500         0.251
 ibd|A-U          1.000         0.500         0.500         0.250

 Freq of A if Affected: 0.019898 (0.000,0.039,0.961) 
 Freq of A if Unaffctd: 0.008875 (0.000,0.018,0.982)

 Mating       Proportion    Risk to offspring
----------   -----------   ------------------ 
UnA x UnA         0.806         0.102
Aff x UnA         0.183         0.103
Aff x Aff         0.010         0.104

Finally, most commands can be interrupted by typing "ctrl-C" (that is holding the control and C keys down simultaneously). This returns you to the Sib-pair prompt, after finishing the current command as quickly as possible. To interrupt the program completely, you need to strike "ctrl-C" six times.

METHODS

Analytic Methods

Imputation of unobserved genotypes. This is performed using the algorithm described by Lange & Goradia [1987]. Firstly, (0) A phenoset (all possible genotypes) for one locus is generated for each individual in the pedigree. Then, iterate by nuclear families, repeating the next two steps until no further updates: (1) Parental genotypes inconsistent with their offspring are removed; (2) child genotypes inconsistent with their parents are removed. Finally, (3) If zero genotypes remain, report an inconsistency; if one genotype remains, this becomes the imputed genotype; if the joint spouse genotype is unambiguous, but the specific genotype each spouse carries is ambiguous, if requested, randomly assign a genotype to each parent. These latter genotypes might be used only for calculation of statistics for offspring of the pair, but not for the parents themselves. A further extension is to sequentially (founders then nonfounders) impute the remaining missing genotypes as the most likely member of the phenoset. This or a faster randomised algorithm is always run (unless the imputation flag is set to "-1", see below) to give starting values for the MCMC methods, but these simulated genotypes will not be saved unless the imputation level is set to "3" or "full".

Sex imputation. Likelihood of observed homozygosity at multiple sex-linked loci is calculated under hypothesis of male and female sex, assuming 0.1% of male and female genotypes are miscalled as heterozygotes.

Allele frequencies. These are for codominant systems only. Either a straight allele count is used, or the contribution of each pedigree is weighted by the number of founders it contains. Alternatively, the imputed and observed genotypes in the founders can be counted, or the MLE of the founder allele frequencies calculated by MCMC (see below).

Hardy-Weinberg proportions. These are tested by a Pearson chi-square, with the P-value estimated via "gene-dropping" (Monte-Carlo,MC) simulation. These are based on the genotypes in the founders of that pedigree, where typed, or on the gene frequencies in the total sample where the founder is untyped, and the structure of the pedigree. For diallelic markers, the exact test of Hardy-Weinberg equilibrium (assuming all individuals is unrelated) is also calculated.

Segregation ratios. The default analysis assumes the pedigrees are unascertained, and gives the naive estimators. The ascertainment corrected analysis is for nuclear families and follows Davie [1979]: p=(r-j)/(t-j), where p is the risk, r is the total number of affected children, j, the number of sibships containing exactly one proband, t, the number of children. A fairly efficient approximate sampling variance is also given.

Haplotype reconstruction. This routine is not currently available in the Fortran 95 version of Sib-pair. This is performed on a nuclear family by family basis, though incorporating grandparental information where available. Initial reconstruction is performed using a simulated annealing algorithm that maximizes a sharing based criterion based on length of runs of the same alleles on a putative chromosome among sib pairs, parent-offspring pairs, and grandparent-child pairs. The order of loci in the pedigree file is treated as the linkage order, and map distance information is not used. Missing parental haplotypes are filled in using the childrens' haplotypes in a simple fashion. Mendelian inconsistencies are flagged in the printout.

The second algorithm is more ambitious and attempts to construct recombination minimized haplotypes, again on a nuclear family by family basis, and dealing more intelligently with missing data. A simulated annealing algorithm using multiple restarts is used. Recombination events and Mendelian errors are flagged in the output.

Admixture analysis. This refers to testing for a mixture of specified distributions in the empirical distribution of a quantitative trait, usually a mixture of normals, though Sib-pair also offers mixtures of exponential and Poisson distributions. Information from the relationship between family members is not utilised. The usual EM approach is used.

Test for normality. The Filliben correlation [Filliben 1975] is calculated as a test for normality. This is the correlation between the observed data and the rankits expected under the assumption of normality. The P-value for this statistic is approximate, and is produced using an approach modelled on that used by Royston [1993] for the related Shapiro-Francia W':

log(1-rf) ~ N(m,s)
m:=1.0402 (log(log(n))-log(n)) - 1.99196
s:=0.788392/log(n)) + 0.31293

where n is the sample size. This performs reasonably well versus the empirical percentiles:
Coverage
n 5 10 20 50 100 500 1000 5000
Nominal P=0.050.0210.0440.0550.0570.0590.0520.0450.033
Nominal P=0.010.00 0.0060.0090.0140.0190.0070.0070.007

A common use of the Filliben correlation is as the criterion for selecting an optimal transformation of the data.

The J0.02 statistic is a variance-corrected order-statistic based skewness measure:

[(P0.02+P0.98)/2-P0.50]/[P0.75-P0.25]

[David & Johnson 1956]. A test of normality can be constructed, using the standard error of J0.02. This is based on interpolation of results from Monte-Carlo simulations (SE(J0.02):=1.36/sqrt(N) cf Resek [1974]).

Sibship variance test. This is the linear model suggested by Fain [1977] for the detection of the phenotypic effects of quantitative trait loci. Briefly, if parental trait values are at the extreme of the population distribution, then they will be carrying multiple increasing or decreasing alleles at the QTLs. As a result, the trait variance among their offspring is decreased compared to sibships whose parents have trait values close to the population mean. This U-shaped relationship between midparent value and sibship variance can be detected by fitting linear or quadratic curves.

Variance components analysis. This is the usual mixed effects analysis of quantitative traits assuming multivariate normality. The log-likelihood:

LL = -0.5 [log(det(S)) + (y-u)' inv(S) (y-u)]

is maximized, where S is the variance-covariance matrix for the trait values for each phenotyped pedigree member, and y and u are the trait values and their expected values:

u= B'X,

reducing to the grand mean in the absence of covariates. The main diagonal elements of S takes the value:

VA+VQ+VD+VE,

and the off-diagonal elements:

Ri,jVA+ibdi,jVQ+Ki,jVD,

where,

Ri,j is the coefficient of relationship for the i-jth pair of relatives
K is the coefficient of fraternity
ibd is the average ibd sharing at the marker location being tested for linkage to a QTL.

The maximization is performed using AS319 (variable metric minimizer with numerically estimated gradients). The phenometric models (ADE, AE, E) are fitted to the intact entire pedigree, but the models including VQ can be fitted to the entire pedigree, or to sibships only.

Categorical trait association analysis. This is the Pearson goodness-of-fit based test for equality of allelic gene frequencies at a marker locus in individuals expressing different trait values (RxC table). Both the nominal (ignoring relatedness of the sample) and empirical P-values for the test are output. The empiric P-value is estimated via gene-dropping simulation. These are based on the gene frequencies in the total sample, and the structure of the pedigree, and are conditional on the observed occurrence of the binary trait in the families. The gene-dropping can also be further conditioned on identity by descent at another marker (which may in fact be the marker being tested for association). This therefore gives a "model-free" test of association conditional on linkage.

Formerly, a sibship permutation based test was provided, calculating the same chi-square statistic for members of sibships that contain at least one affected and one unaffected typed individual, and generating a (within-sibship) permutation P-value. This is now replaced by a more powerful score (FBAT) test combining the within-sibship association and transmission-disequilibrium tests after Knapp [1999] and Laird et al [2000]. In this approach, the complete or partial genotype of untyped parents is reconstructed from the genotypes of the affected and unaffected children. The transmission of alleles from these reconstructed parents is conditioned on the genotypes of the children used in the reconstruction. The appropriate conditional distribution under the null hypothesis is approximated via Monte Carlo simulation and rejection sampling.

In the case of multiple generation pedigrees, Sib-pair scans the pedigree upwards nuclear family by nuclear family, saving imputed genotypes, so that these contribute as children to subsequent nuclear families. I don't think this will be biased, but have yet to check. This behaviour can be turned off.

Population genetic F statistics (FIS, FIT and FST) are also calculated for each marker assuming individuals with different indicator trait values come from separate related demes. These are estimated following the approach of Pons and Chaouche [1995], as described by Excoffier [2001]. Results for each locus are combined to give multilocus summaries for the sample as:

F*IS = (Σ HS - Σ HO)/Σ HS
F*IT = (Σ HT - Σ HO)/Σ HT
F*ST = (Σ HT - Σ HS)/Σ HT

Quantitative trait association analysis. This fits an additive (allelic means) model predicting an individual's trait value from his/her genotype at a marker locus. The residual sum-of-squares is compared to those obtained via a gene-dropping simulation of the pedigrees, giving an empiric P-value.

Two-locus and N-locus linkage disequilibrium estimation. One algorithm finds informative founder matings, or informative matings where all the grandparents are untyped, and imputes the two-locus haplotypes transmitted to the offspring. The loci are assumed to be tightly linked, so that four parental haplotypes are counted, as opposed to the more usual two haplotypes from the child. This is the default for markers where there are more than 6 alleles per marker. Both D and D' measures are calculated.

For two locus analyses, the second algorithm combines phased and unphased genotype data in an EM fitted log-linear model. This can handle X-linked loci. For more than two markers, and for testing haplotype association to a trait, phase information is discarded (ie all individuals are treated as unrelated), and only autosomal markers can be used.

Homozygosity analysis. This tests for an increase in observed homozygosity at a marker locus in individuals expressing a binary trait, comparing this to the predicted homozygosity based on the allele frequencies in the total sample. This may occur in the presence of allelic association with a recessive trait locus, and/or deletional loss of heterozygosity (the parents would be untyped for such an individual not to have been flagged as a Mendelian inconsistency of course). A one-tailed empiric P-value is estimated via "gene-dropping" simulation, based on the gene frequencies in the total sample, and the structure of the pedigree. The multipoint homozygosity analysis uses the mean maximum marker homozygosity run length in the set of cases. Again, gene dropping is used to produce the distribution of this statistic under the null given the marker map, allele frequencies and observed pedigrees.

Transmission-disequilibrium test. The original formulation of the TDT is for a diallelic marker [Spielman et al 1993]. The TDT statistic calculated by the program is the Pearson goodness-of-fit based test of symmetry in the square table of transmitted versus nontransmitted alleles to each affected child [Haberman, 1979]. Empiric P-values are produced by randomization of the table. This global allelic form does not correct for the (usually small) correlation in parental genotypes induced by linkage disequilibrium (absent of course under the null hypothesis). Another allelic test provided is the marginal allelic test suggested by Spielman and Ewens [1996], which is probably slightly more powerful than the global symmetry test [Kaplan et al 1997].

The genotypic TDT P-value is estimated via gene-dropping based on the genotypes of typed ancestors of probands (where both parents of the proband and all antecedents must be typed), and the structure of the pedigree. The test statistic compares the observed number of each genotype transmitted with the number expected based on the parental genotypes. Pairs of cells whose total count is less than a given cutoff may be excluded from the analysis to increase power. The P-value for the TDT testing each allele versus all others in turn ("allele-by-allele") is the exact two-sided binomial probability (via the beta distribution). When the plevel is zero, only the best of the allele-by-allele test results are printed, and the P-value is Bonferroni corrected for the number of alleles at that marker.

Note that the default option is to use probands for the TDT only one parent is typed. For the diallelic marker case with unequal allele frequencies, using one parent families does lead to biased results. The unified transmission test available via the "assoc" command does not suffer from this problem (see above).

The Schaid and Sommer [1993] genotypic risk ratio tests for familial association under the assumption of Hardy-Weinberg equilibrium, or conditional on parental genotypes, is also offered. This uses log-linear modelling (implemented as iteratively reweighted least squares) of a biallelic locus with both parents genotyped. The attributable risk is also produced.

Haplotype Relative Risk analysis. This is the original familial association statistic comparing transmitted and nontransmitted allele frequencies unmatched on family (Falk and Rubinstein 1987; Knapp et al 1993). As usual, an empiric P-value is estimated via "gene-dropping" simulation, based on the gene frequencies in the total sample, and the structure of the pedigree.

Sib-pair analysis. Identity by descent estimation is based on the sib pair and parental genotypes when available. In the case of untyped parents, the full-sib sharing is the sum of sharing for each possible set of parental genotypes weighted by their likelihood based on all children in the sibship. Half-sib sharing is estimated based only on known genotypes, whether observed or unequivocally imputed. The effective degrees of freedom for the t-test of the slope of the regression line is given as the number of individuals in the sample (counted once only) who are in a sib pair where both members are typed at trait and marker, minus two. For a sample made up of nuclear families (no halfsibs), this will be equivalent to the Σ(sibship size-1)-2 value used by SIBPAL 2.6, and originally suggested by Hodge [1984]. For binary traits, the same ordinary-least-squares analysis is performed -- the t-statistics from these results are only really applicable to large samples, and tend to be too liberal. The quantity regressed is not the usual Haseman-Elston squared trait difference, but a function of the squared trait sums and differences following Sham and Purcell [2001]. This approach is supposed to approach the power of the variance components approach according to those authors, and gives appropriate Type 1 error rates.

For the Fulker & Cardon methods, the expected ibd through the interval between the two markers is estimated using the equation given in Olson [1995]. Haseman-Elston regressions are performed at a series of points across the interval using the ibd sharing of the two flanking markers, and the given size of the interval. The Haldane mapping function is used.

Affected sib pair analysis. This is the original IBS based approach described by Lange [1986a], extended to half-sibs as per Bishop and Williamson [1990]. No correction for sibship size is made -- that is all possible pairs are treated as independent. The usual two d.f. chi-square is calculated, with expected counts being calculated based on the observed gene frequencies in the total sample. The IBD based mean test is also calculated.

Affected Pedigree Method. This uses the measures of genetic similarity described by Weeks and Lange [1988; see also, Lange, 1986a, 1986b], Whittemore and Halpern [1994], and Ward [1993, 1995]. The expected mean and variance for each pedigree is estimated via gene-dropping simulation. These are based on the observed gene frequencies in the total sample, and the structure of the pedigree. Both ibs and ibd based family scores can be estimated. The original APM statistics, the APM statistic of Whittemore and Halpern and the T(AB) and GPM statistics of Ward are calculated.

Multilocus IBS sharing statistics. These are used to confirm the pedigree structure using marker data. One approach calculates the overall probability of sharing two alleles IBS for full and half sibs, summing over all loci, and ignoring any linkage between markers. The second approach uses gene dropping based on the given marker map. Two lists are generated, one by individual, the other by relative pair.

Martingale residuals. The elegant approach of Commenges [1994] to genetic analysis of age-at-onset is to analyse the residuals obtained from a nonparametric or semiparametric survival analysis. Sib-pair implements the former, calculating the martingale residuals using the Nelson-Aalen estimator for the integrated hazard [eg Andersen et al 1993], which are then transformed following Therneau et al [1990] to give a more symmetrical distribution.

Generalized linear models and survival regression. These are the usual IRLS algorithms (using AS 164, [Stirling 1981]). The exponential and Weibull regressions are implemented as Poisson regressions (with log time as offset) as per Aitken and Clayton [1980].

Other standard statistical tests. A number of classical tests for independent data (eg unrelated cases) are implemented, such as contingency chi-square test (with Monte Carlo "exact" P-values, see below), ordinal by ordinal trend test for contingency tables [Yates 1948], and nonparametric one way analysis of variance via the Kruskal-Wallis test. Sib-pair can calculate the Pearson correlation coefficient for between-trait association, the Kaplan-Meier estimator for the survivor function, and Nelson-Aalen estimator for the hazard function, and measures of agreement for contingency tables.

Deterministic Estimation of Pedigree Likelihoods

Rather than the original Elston-Stewart algorithm [Elston and Stewart 1971] augmented by pedigree traversal, Sib-pair uses the iterative peeling approach described by Wang et al [1996] (following Janss et al [1992]). This has the advantage of "automatically" dealing with loops, although the resulting likelihood in that case is only approximate: this approximation can be improved by several methods, but these are not currently implemented. The algorithm follows the detailed description in Chapter 2.1 of Schelling [2004], though at present only allows evaluation for a single codominant locus.

Briefly, the iterative approach peels up and down simultaneously by calculating anterior and posterior values for each individual, where the anterior and posterior values represent the scaled likelihood contributions for ancestors and siblings, and mates and descendants respectively. The values for any individual rely on those for the other relatives, so these are reciprocally updated over multiple iterations until they converge. Usually a maximum of 10-20 iterations suffices, and if an ideal ordering of evaluations is used, a single iteration in unlooped pedigrees is sufficient ie it is equivalent to the usual pedigree traversal algorithms eg Lange and Boehnke [1983]; Wang et al [1996], Fernandez and Fernando [2001].

Monte Carlo Algorithms

Gene-dropping. The "unconditional" algorithm producing null distributions is as follows. Repeat the following 1-3 steps a large number of times. (1) Founder genotypes are assigned using the allele frequencies in the observed sample, assuming panmixia and Hardy- Weinberg equilibrium (HWE). Iterate, until all genotypes are assigned: (2) If both parental genotypes are nonmissing, randomly assign the index a genotype based on Mendelian autosomal inheritance (ie if parental genotypes are {1/2} and {3/4}, a child's genotype is randomly selected from {{1/3}, {1/4}, {2/3}, {2/4}}, with each genotype having a probability of selection of 0.25). Once complete, (3) calculate the test statistic based on the family's simulated genotype. Following completion of the outer loop, (4) summarize the distribution of the resulting test statistic. This procedure is used to generate null distributions for the association Pearson chi-square.

For the share command, this also allows for recombination between markers based on the given linkage map

The null distributions for the genotypic marginal TDT is generated using a "conditional on founders" algorithm, that takes observed founder/ancestor genotypes as given. Only typed nonfounders genotypes where both parents were typed are simulated.

ibd estimation. The modification to calculate ibd distributions gives each founder (two) unique alleles in his/her "typing genotype". A simple gene-drop gives the null distributions for the ibs and ibd statistics of the APM method.

I based the "conditional on observed genotypes" (gene-drop with rejection sampling) algorithm for calculating ibd on that described by Blangero et al [1995]. As before, each founder is assigned a typing genotype made up of two unique alleles. Offspring are only assigned ibd-typing genotypes that are consistent with the observed genotype at the observed locus of interest. For example, say the observed genotypes in the parents are 100/102 and 100/102, and the typing genotypes associated with these are set to {1/2} and {3/4} respectively. If the child is 100/102, assignment of typing genotypes {1/3} and {2/4} will be rejected. This is equivalent to a child's genotype being randomly selected from {{1/4}, {2/3}}, with each genotype having a probability of selection of 0.5. The resulting ibd statistics based on the typing genotype will approximate ibd for the marker locus of interest.

Multipoint ibd estimation. This is a simplification of the Cardon-Fulker approach as extended by Almasy and Blangero [1998]. The ibd estimates from sets of markers in complete linkage are combined by calculating the variance-weighted mean ibd for each pair of relatives over the set. The estimates of the ibd variances for each marker arise as a byproduct of the MCMC algorithm.

Gene-dropping conditional on ibd. The previous algorithms can be combined by gene dropping a marker conditional on "ibd indicator" genotypes that represent segregation at either that same locus, or a neighbouring more informative marker. This allows us to simulate the null distribution assuming linkage but no association.

Missing genotype simulation by Monte-Carlo Markov Chain (MCMC). The calculation of ibd in the presence of missing genotypes is performed via a Metropolis algorithm. This algorithm is a multiallelic extension of that described by Lange and Matthysse [1989]. One iteration of the generation of a legal constellation of imputed and observed genotypes is produced by:

(1) Perform (a) (b) (c) or (d):

(a) Simulate ibd, then "mutating" up to four imputed founder alleles. These propagate through the pedigree using the current pattern of ibd transmission as indicated by the ibd-typing alleles, and are rejected and resimulated if an (unordered) inconsistency with an observed genotype occurs.

(b) Simulate ibd, then switch the parent of origin for an individual heterozygote. Propagate this change up through the pedigree to the originating founder(s), but not below the chosen "pivot" individual.

(c) A "conditional on observed genotypes" dropping of ibd-typing alleles, with the refinement that this ignores imputed genotypes. Inconsistencies thus generated for imputed genotypes are resolved by changing the imputed genotypes. This procedure will be slow in the presence of a large number of untyped nonfounders.

For (a)-(c), additional local proposals (as below) are compounded to these, and the resulting proposed constellation is accepted or rejected via the Metropolis criterion.

(d) Resimulate all untyped x untyped founder mating joint genotypes conditional on their offspring and other spouses, then other pedigree members singly, again conditional on surrounding genotypes. This is a simple Gibbs sampler, and is more efficient than the above when there are many missing genotypes in larger pedigrees.

MCMC burn-in. In releases prior to version 0.96.0, there was no burn-in for this Metropolis algorithm, as preliminary empirical tests had found the results from this program agreed well with "exact" results from programs such as GENEHUNTER. Subsequently, I have found some pedigrees where using the starting genotypes from the Lange-Goradia approach does lead to biased ibd estimates for certain pairs of relatives. Therefore, the program now performs a number of burn-in iterations (default 100) prior to those used to estimate ibd. The required number of such iterations depends on the number of missing genotypes in the pedigree.

Metropolis generalized linear mixed model and finite polygenic model sampler. This is either a "standard" or "slice" Metropolis sampler, where the simulated variables include diallelic QTL genotypes, Gaussian breeding values, a single QTL allele frequency (shared by all QTLs in the FPM), up to three genotypic means (shared by all QTLs in the FPM), polygenic and environmental variances (including pedigree ("VC") and maternally-derived sibship ("VS") variances).

The trait model can be gaussian, binomial (with identity, probit or logit link), poisson (including log link), weibull or MFT.

Proposals for diallelic QTLs genotypes are straightforward to generate, and do not usually give rise to noncommunication between sets of legal genotype proposals. Proposals for continuous variables are generated from random normal deviates, and a tuning parameter can be set that alters the variance of these proposal distributions.

The likelihood contribution from the ith individual to the Metropolis criterion for these models is (see for example, Guo and Thompson [1993]):

LL = F*Σ(log(P(Gj))) + F*log(f(a|VA)) + (1-F)*log(f(a|aFA, aMO)) + log(c|VA) + I*log(f(y|G1,...Gj, a, c, VE))

where,
P(x) denotes the probability of x,
f(x) denotes the density of x,
y is the trait value,
a is the breeding value,
c is the pedigree-specific intercept,
Gj is the genotype at the jth QTL,
VA is the additive polygenic variance,
VC is the familial environmental variance,
VE is the error variance,
F=1 when a founder, 0 when a nonfounder
I=1 when phenotype observed, 0 when unobserved.

The conditional density for the breeding values of offspring includes the correction for inbreeding (the segregation variance being 1-0.5*(FFA+FMO). The random effects are modelled as zero-mean gaussian.

The realizations of the parameters are summarized as means, and approximate standard errors produced by batching (default B=iter1/2 [Jones et al 2005]). The interbatch lag-1 serial correlation is calculated as a diagnostic for the appropriate number of values to simulate [Ripley 1987].

The implementation of the generalized linear mixed models is quite straightforward in the chosen Metropolis paradigm (it would be more work to produce a Gibbs sampler, I believe), but for the "standard" sampler, it is important to check that the proposal acceptance rates are in the optimum range (usually stated as 0.2-0.6, Ripley 1987). This is less critical for the slice sampler, where the tabulated acceptance rates are actually the ratio of accepted proposals to the number of function evaluations (and so are just a measure of algorithm efficiency). Models fitting VC and VS or VA are two-level GLMMs and so I have fitted a number of test datasets from the literature. There are surprising differences between results from standard software for some of these datasets, so although Sib-pair sometimes does not give identical results to that from non-simulation-based maximum likelihood methods, this may reflect approximations used by other programs.

Increasing the number of random effects chains is realized by duplicating the families the appropriate number of times and correcting the likelihood and standard errors. One is essentially averaging over multiple estimates of the random effects for each individual, as global parameters such the fixed effects regression coefficients and overall variances are the same over the replicate chains at that iteration. This seems to reduce bias in the estimation of the random effects, but with the side effect of increasing the between-batch correlation, and so slowing estimation. The tabulated results below generally used 4 chains run for 10000 iterations after a 1000 iteration burnin.

Binomial GLMM analysis of seed germination dataset of Crowder et al (1978) using different approaches. PQL1 is the penalised quasilikelihood approach implemented as glmmPQL() in the MASS package [Venables and Ripley 2002], while PQL2, AGQ are results from lmer() in the lme4 package of Bates and Sarkar [2005] using penalized quasilikelihood, adaptive Gaussian Quadrature respectively. The BUGS results are from the BUGS Examples manual.
Parameter Estimate (SE)
Method Sib-pair AGQ PQL1 PQL2 BUGS
SD of Plate Effect 0.28 (0.17) 0.24 (0.09) 0.23 0.24 0.29 (0.15)
Intercept -0.51 (0.13) -0.54 (0.17) -0.54 (0.17) -0.54 (0.16) -0.51
Seed 0.06 (0.32) 0.10 (0.28) 0.09 (0.27) -0.09 (0.28)
Root 1.31 (0.26) 1.33 (0.24) 1.32 (0.23) 1.32 (0.24)
Seed x Root -0.79 (0.44) -0.81 (0.38) -0.81 (0.38) -0.81 (0.38)

Binomial GLMM analysis of "bacteria" dataset from the R MASS package [Venables and Ripley 2002] using 5 different approaches. PQL1 is the penalised quasilikelihood approach implemented as glmmPQL() by Ripley in the MASS package, while PQL2, AGQ and Laplace are results from lmer() in the lme4 package of Bates and Sarkar [2005] using penalized quasilikelihood, adaptive Gaussian Quadrature and the Laplace approximation respectively.
Parameter Estimate (SE)
Method Sib-pair AGQ PQL1 PQL2 Laplace
RE Variance 2.05 (1.23) 1.70 (1.05) 1.98 3.27 1.66
Intercept 3.70 (0.76) 2.86 (0.48) 2.74 (0.38) 2.75 (0.48) 2.81 (0.48)
Low dose -1.46 (0.74) -1.36 (0.82) -1.25 (0.64) -1.25 (0.82) -1.35 (0.82)
High dose 0.60 (0.74) 0.58 (0.85) 0.49 (0.67) 0.49 (0.85) 0.58 (0.85)
Week>2 -1.66 (0.51) -1.63 (0.46) -1.61 (0.36) -1.61 (0.46) -1.57 (0.46)

Binomial GLMM analysis of contraception usage data from the 1988 Bangladesh Fertility Survey [Steele et al 1996].
Parameter Estimate (SE)
Method Sib-pair PQL
RE Variance 0.25 (0.06) 0.22
Intercept -1.67 (0.16) -1.66 (0.15)
Age -0.03 (0.01) -0.03 (0.01)
Urban 0.72 (0.07) 0.72 (0.12)
1 child 1.10 (0.12) 1.09 (0.16)
2 children 1.36 (0.15) 1.35 (0.17)
3+ children 1.32 (0.17) 1.32 (0.18)

Poisson GLMM analysis of epileptic seizure count data of Thall and Vail [1990] using different approaches. PQL1 is the penalised quasilikelihood approach implemented as glmmPQL() in the MASS package [Ripley and Venables 2002], while PQL2, AGQ are results from lmer() in the lme4 package of Bates and Sarkar (2005).
Parameter Estimate (SE)
Method Sib-pair AGQ PQL1 PQL2
RE variance 0.268 (0.101) 0.252 0.197 0.101
Intercept 1.834 (0.112) 1.833 (0.074) 1.870 (0.106) 1.870 (0.074)
Progabide -0.346 (0.152) -0.334 (0.105) -0.310 (0.149) -0.309 (0.105)
log basal rate 0.861 (0.119) 0.883 (0.091) 0.882 (0.129) 0.882 (0.091)
Base:therapy 0.394 (0.157) 0.339 (0.143) 0.342 (0.203) 0.342 (0.143)
log age 0.513 (0.330) 0.481 (0.244) 0.534 (0.346) 0.533 (0.244)
Period 4 -0.159 (0.054) -0.160 (0.055) -0.160 (0.077) -0.160 (0.143)

Poisson GLMM analysis of European male melanoma death rate dataset of Langford et al (1998) using different approaches. PQL1 is the penalised quasilikelihood approach implemented as glmmPQL() by Ripley and Venables [2002] in the MASS package, while PQL2, AGQ are results from lmer() in the lme4 package of Bates and Sarkar (2005). The STATA result used the xtpois command, and comes from the review article at http://www.mlwin.com/softrev/revstata.html.
Parameter Estimate (SE)
Method Sib-pair AGQ PQL1 PQL2 STATA
Region variance 0.188 (0.037) 0.170 (-) 0.161 0.125 0.102 (-)
Intercept -0.151 (0.058) -0.139 (0.043) -0.129 (0.049) -0.129 (0.043) -0.138 (0.017)
UVB insolation -0.035 (0.011) -0.034 (0.009) -0.038 (0.010) -0.038 (0.009) -0.056 (0.004)

The final results are sometimes sensitive to the choice of starting values for the random effects (the fixed effects are started automatically from the marginal model parameter estimates using "reg"), and to the proposal step size. Because of the correlation between random and fixed effects in GLMM's other than Gaussian (since the intercept affects variance), differences in the estimated random effect size do alter the fixed effects regression coefficients.

The output (with plevel set to 1) allows plotting of parameter estimates to assess convergence of the chain.

Randomized TDT. The randomization test for the global allelic TDT permutes the transmission table by randomly selecting a single proband-parent pair and reversing the transmitted and nontransitted alleles. One "shuffle" of the table involves N such permutations, where N is the number of such informative parent-proband pairs in the observed pedigrees (this reduces the correlation between successive tables in the random walk to close to zero).

Sequential empiric P-values. The Monte-Carlo P-values provided for the various MC-based tests are produced using the sequential approach described by Besag and Clifford [1991]. In this refinement, we only generate as many pseudosamples as is necessary to give a P-value numerator of size mincount; the denominator is the number of pseudosamples. The practical effect of this procedure is that if the true P-value is large, then relatively few pseudosamples are generated to give a less precise estimate of this uninteresting value. Besag and Clifford suggest a value for mincount of 10-20. It is necessary to set a maximum denominator to avoid excessive computation for "highly significant" results.

Other empiric P-values. An exception to this is the algorithm used for empirical P-values for the APM. Here, a P-value for each family is simulated at the same time as the mean and variance. These P-values were previously combined using the procedure due to Fisher [Hedges and Olkin 1985], that is, twice the sum of the natural logarithms of the P-values was treated as a chi-square variate with 2*N degrees of freedom, where N is the number of contributing families. This does not seem to be particularly powerful, so each P-value is now inverse-normal transformed to a Z-score, and these combined in an unweighted fashion [Hedges and Olkin 1985].

USAGE

The program reads commands from standard input, and writes results to standard output. Therefore, the program can be run interactively, or if a series of commands is to be found in a file, in batch mode. If the input file was test.in, entering "sib-pair <test.in >test.out" would perform the commands in test.in, and write results to test.out.

A command is a single line of keywords, locus names and/or variable values. If a "\" character is the last word of a line, the next line is interpreted as a continuation of the previous command. Sib-pair is case-sensitive, so that the keyword "READ" is not equivalent to "read". Commands are either global, which can be entered at any time; descriptive (set impute, set locus, read pedigree), which must precede the run statement; the run statement, that causes the dataset to be read and processed; or analytic, which act only after the run statement.

One command, set plevel, controls verbosity of output. Some useful descriptive tables are only printed if plevel is at least 1.

Sib-pair's parser can evaluate simple algebraic and logical expressions for each record in a datafile, but does not directly allow complex programming. It does offer simple macro facilities including a loop construct. The eval command accesses a Scheme interpreter that can carry out more sophisticated computing.

There are command line options that may save a few keystrokes. Running "sib-pair -i test.in" starts Sib-pair and "includes" the contents of test.in. The command "sib-pair -l test.in" starts Sib-pair and runs the "locus" command on the contents of test.in. The command "sib-pair -h" (or --help) starts Sib-pair and runs the help command. If the program has been compiled with g95, the --g95 flag prints useful information about environmental variables that can be tweaked to alter performance, and run time error messages.

Global commands

  1. !|#. The rest of the line is a comment, and is echoed to standard output.
  2. echo. Prints the rest of the line to standard output.
  3. $ <operating system command>. The rest of the line (up to position 80) is a command, and is passed to the shell for execution.
  4. dir [<OS specific args>]. Prints list of files in current directory.
  5. pwd [<new directory>]. Prints name of current directory, or changes directory to that specified.
  6. file delete|rename <name> [<new_name>]. Deletes or renames a file.
  7. clear. Restarts the program, closing all workfiles and zeroing all arrays.
  8. help [All|Examples|Globals|Operators|Data| Analysis|<search_string>]. Prints a brief description of the commands -- either all, a subset, or all matching the search string.
  9. info. Information about program settings and the current dataset. For the latter, gives counts of active and inactive pedigrees, individuals, and loci and a table of numbers observed for every trait and marker.
  10. list|ls|which [<loc1> [[to] <locN>]] [$(a|q|m|h|x)[m|r]]. List of loci in current analysis. The "$<letter>" keyword denotes the list of of one class of locus, which can be reordered according to genetic map position (m) or in reverse (r) to the dataset ordering. The which command gives the position/rank of each locus argument in the total list of loci.
  11. show loci. List of active loci along with table of numbers available (as per info).
  12. show pedigrees. Same as gener, with print level 0.
  13. show ids. List of indidivual IDs tabulated versus pedigree(s).
  14. show map. Shows the current marker map.
  15. show macros. Shows the current macro definitions.
  16. time. Print time elapsed since start of the program.
  17. set timer [on|off]. Show time taken by each command.
  18. include <command_file>. Read in Sib-pair commands from a file. If the command file is not specified, a directory browser will start up.
  19. output [<output_file>]. Divert Sib-pair output from standard output (the screen) to a file. Issuing the output command a second time closes the output file. If the command file is not specified on the first call, a directory browser will start up.
  20. macro <macro_name> [ (= <macro value>) | (<- all <marker> | bur | che | epo | imp | ite | ls | min | ple | pwd| sex | twi)]. Create a macro variable or function. If an equals sign is present, the rest of the line is taken as the macro variable body. If the <- (setting) sign is present, the macro variable takes the current value of the named program setting (or list of alleles or markers).

    Otherwise, multiple lines of the body of a macro function are expected to follow on subsequent lines, terminating with an empty line or ";;;;". The function body contains Sib-pair commands and parameters for substitution, represented as %1, %2 etc. The macro is called either as a function or a variable.

    To call a macro as a function, the entire macro name must be the first word on the command line, followed by any required parameters (which can be any legal string). The function body is evaluated, and the resulting Sib-pair command(s) is then acted upon. For example:

    # Draw a pedigree (need to have dot and gv):
    # First set up macro
    >> macro draw
    draw> select pedigree %1
    draw> write dot %%.dot %2
    draw> $ dot -Tps -o %%.ps %%.dot
    draw> $ gv --scale=-4 %%.ps
    draw> $ rm %%.dot %%.ps
    draw> unselect
    draw>
    # Call macro
    >> draw ped1 trait

    The %% keyword is replaced by the macro call ID, a unique 5 character string. The %0 keyword is replaced by all the macro parameters, while %+2, %+3... is replaced by that parameter and all subsequent parameters.

    To call the macro as a variable, the entire macro name is prepended by the "%" character. It can then appear anywhere in a command. The macro variable is replaced by the macro body, but there is no evaluation of macro function parameters or %%. The variable name can be delimited by brackets, so that it can be embedded in another string. The resulting command is then parsed:

    >> macro a=1
    >> macro b=+
    >> macro b1=2
    >> %a%b%a
    => 2.
    >> 1%b1
    => 12.
    >> 1%(b)1
    => 2.
    >> macro a=D14S52 D14S43
    >> tab %a
    ------------------------------------------------
    Cross-tabulation of "D14S52" ... "D14S43"
    ------------------------------------------------
    [...]

  21. eval [<Scheme expression>]. Accesses the Scheme read-eval-print-loop. If called without arguments, presents the Scheme prompt "%%", otherwise evaluates the rest of the line as if it were a Scheme expression. Sib-pair macro variables and functions are stored within the Scheme environment, and so can read and written to. The Sib-pair specific extensions to Scheme are "(nloci)", "(ls)", "(loc)", "(loctyp)", "(locord)", "(locnotes)", "(read-line)", "(run)" and "(help)".

    (ls ["m" | "x" | "h" | "a" | "q" | "d[mxhqa]"])List locus names
    (nloci)returns total number of loci
    (loc )returns locus at that position in the locus list
    (locord <loc>)returns position of a locus in the locus list
    (locnotes <loc>)returns notes for a locus
    (loctyp <loc>)evaluates type of a locus ("adhmqx")
    (read-line <port>)Read next line from port as a string
    (run "<Sib-pair command>")Run a Sib-pair command
    (string-split <string> [<sep>]Convert from string to list of words split on white-space (or a specified character separator)
    (system "<operating system command>")Run an OS command
    (help)Information about this Scheme implementation

    See the appendix for more details.

  22. last [<line_number>]. With no argument, displays the command history, otherwise submits that line of the history for reevaluation. A negative line number counts backwards from the current line. The command history is saved to a file "sib-pair.log".
  23. quit|bye. Halts the program.
  24. set prompt [on|off]. Displays a prompt, and activates/resets the command line history.
  25. set gui [on|off]. Activates or stops the GUI. The GUI is either Java based (using the japi library to interface with the Java AWT library), or GTK2.0 based (using the pilib library to interface), depending on the compile time choice.
  26. set ndecimal_points [<nwid>] <ndec>. The total width (number of characters) of a quantitative variable written to a new pedigree file defaults to 9 (and is fixed to 8 for some files, notably MENDEL and FISHER) but can be set as high as 20 for GAS and LINKAGE format files. The number of decimal places can be set to ndec.
  27. set epoch [iso|jul|<epoch>]. Set or show the epoch used for julian dates. Defaults to "iso" epoch of 1970-01-01.
  28. set out|plevel <level>|verbose|on|off. Increasing the print level causes more information to be printed by almost all procedures. Print level 1 prints out the identities and genotypes of parents imputed where the genotype was missing, raw counts of genotypes for the hwe procedure, expected ibs probabilities for the asp procedure etc. Print level 2 (or verbose) writes out the statistics for each simulated dataset for the MC based procedures, the intrapair variance and ibd sharing for each pair in the sib pair analysis, etc. Print level -1 omits outputting a list of pedigrees.
  29. set weight founders|imputed. Weights contribution of each pedigree to the allele frequencies by the number of typed founders, or alternatively gives the count of the founder alleles, observed and imputed.
  30. set analysis [imputed | observed]. Includes imputed genotypes in generalized linear models fitted using the regress command. Genotypes are automatically reimputed each time regress is called. The idea of this is to allow multiple imputation association analysis. Imputation is performed using Sib-pair's single locus genotype MCMC sampler. The regress command with imputed genotypes respects the set frequencies command, so that the population allele frequencies can be specified in advance.
  31. set burn-in <number of iterations>. Controls the number of Markov Chain Monte-Carlo iterations used by the apm algorithm discarded before estimation commences. Default is 100 iterations. Setting bur to zero means no burn-in is performed (the old default).
  32. set iteration<number of iterations>. Controls the number of iterations used by the various Monte Carlo algorithms. Default is 200 iterations. Setting ite to zero means the Monte Carlo procedures are not performed.
  33. set emit <number of iterations>. Controls the number of (Monte-Carlo) Expectation Maximization iterations used by the mcf algorithm.
  34. set batch <number of batches>. Controls the number of batches used by the fpm algorithm for the estimation of parameter standard errors.
  35. set chain <number of MCMC chains>. Controls the number of chains used by the fpm algorithm.
  36. set tune <MCMC tuning parameter>. Controls the multiplier for the MCMC proposal step size used by the fpm algorithm. The base step size is usually the fixed effects model standard error for that parameter, and tune defaults to 0.3.
  37. set mincount <minimum numerator of P-value>. Controls the number used for Monte Carlo simulation of a P-value. Default is 20 pseudosamples with a test statistic more extreme than that for the observed statistic. Set mincount equal to iter if this is not desired.
  38. set seeds <seed1> <seed2>< seed3>. Initializes random number generator seeds to given values, rather than via system time.
  39. set fbatimpute on|off. Enables (default) or disables imputation of missing child alleles based on their own offspring.
  40. set tdt bot[h parents]|one [parent]|first. Limits TDT statistic to cases where either both parents or at least one parent is typed, or one proband per family where both parents typed.
  41. set hre zero|children. Assume zero recombinants between markers for dis command where parents genotyped, thus counting four imputed parental haplotypes. Alternatively, only utilize two haplotypes from children.
  42. set map function kosambi|haldane. Set the mapping function used by multipoint analytic and locus file output routines.
  43. set sml <Frequency of A allele> <Penetrance of AA genotype> <Penetrance of AB genotype> <Penetrance of BB genotype>. Set default single major locus model written to locus files (usually p=0.05, pen=(0.5, 0.5, 0.05)).
  44. set liability <binary trait> <liability class indicator> <number of classes>. Declare a quantitative trait to indicate liability class for the named binary trait. Used to generate Linkage format locus and pedigree files.
  45. pchisq <chi-square> <degrees of freedom> [<df2>]. Calculate P-value for central Chi-square distribution (or F-distribution).
  46. chisq <nrows> <ncols>. Calculate contingency chi-square and permutation P for flat table entered via keyboard.
  47. proportion <numerator> <denominator> <confidence interval width>. Calculate accurate confidence interval following Wilson (as described by Agresti and Coull) for a proportion.
  48. sml <Frequency of A allele> <Penetrance of AA genotype> <Penetrance of AB genotype> <Penetrance of BB genotype>. Calculates recurrence risks and segregation ratios under a specified diallelic generalized single major locus model.
  49. sml <Frequency of A allele> <Mean for AA genotype> <Mean for AB genotype> <Mean for BB genotype> <standard deviation for AA genotype>. [<AB SD> [<BB SD>]]. Calculates mean, variance components and parent-offspring regression results under a specified diallelic generalized single major locus model.
  50. grr <trait prevalence> <Frequency of A allele> <genotypic risk ratio> [add|dom|rec]. Calculates recurrence risks and segregation ratios under a diallelic generalized single major locus model specified via trait prevalence, ratio of penetrances and pattern of inheritance (codominant multiplicative, dominant or recessive).
  51. ito <Frequency of A allele> [<Penetrance of AA genotype> <Penetrance of AB genotype> <Penetrance of BB genotype>]. Calculates conditional genotype frequencies for a relative of a proband of given genotype (if only allele frequency provided) or affection status under the specified diallelic generalized single major locus model (if penetrances were also specified). ITO refers to the matrices used to perform the calculation for pairs of relatives.
  52. Algebraic operators and functions

  53. "<allele1>/<allele2>". Double quotes mark the contained text for special evaluation by the parser. A constant genotype is written as two numbers (1-999) or letters (a-zA-Z) separated by a slash and surrounded by quotes. Other quoted items are passed intact to be read, either as a reserved command or as a single Fortran real, so "1+3" is evaluated as 1000, and "1 1" as 11.
  54. (<value>|<locus>) *|/|+|-|^ (<value>|<locus>). Arithmetic operations combining numerical constants and/or trait values. The result of an operation involving constants is a single constant, but an operation involving a trait value results in nobs results (where nobs is the number of individuals in the pedigree file.
  55. (<value>|<locus>) <|>|=<|=>|ge|le| eq|==|ne|^=|and|or (<variable>|<locus>). Logical operations comparing numerical constants and/or trait values. when operating on genotypes, the equality and inequality operators require both pairs of alleles to meet the criterion, but the comparison operators test true if either pair of alleles meets the criterion. That is "2/2">"1/3" evaluates to True, but "1/2"=="2/2" evaluates to False.
  56. if <logical expression> then <action> [else if <logical expression> then <action>]... [else <action>]. Conditional evaluation of expressions. The if statements can be nested.
  57. log|log10|sqrt|exp|sin|cos|tan|asin|acos|atan|abs|int|round (<variable>|<locus>). Functions acting on numerical constants and/or trait values.
  58. rand|rnorm. Produce a random value from U(0..1) or N(0,1).
  59. istyp|untyp <marker>. Test if genotyped at given marker. Necessary since if imputation is higher than -1, all untyped individuals have a genotype containing negative allele numbers (used to start MCMC algorithm).
  60. ishom|ishet <marker>. Test if homozygous (or heterozygous) at given marker.
  61. alla|allb <marker>. Return the first or second allele for each individual at the given marker.
  62. marcom. Show the maximum of the number of markers an individual and any of his relatives are both genotyped at.
  63. numtyp|anytyp|alltyp. Show number of markers an individual is genotyped at, or indicate whether genotyped at any one or all marker loci.
  64. male|female|isfou|isnon. Test sex and founder status of individual.
  65. num|nfound. Number of members and number of founders of the pedigree containing an individual.
  66. famnum|index. Position of pedigree and of individual in the active dataset.
  67. <expr> : <expr> [:...]. The colon separates the members of a sequence of algebraic expressions. Its main function is to allow multiple expressions in the branches of an if-then-else statement, but it will minimize overheads in repetitive calculations. If multiple bracket-enclosed expressions are encountered, then a : is automatically inserted:

    >> (a=rnorm) (b=a^2)

    Command iteration

  68. ... { (<val1> <val2> ...<valN>)|($[mxqa]) } .... Repeat the associated command, placing each value of the iterator list into the position the list currently occupies. There may be multiple iterators, and iterator lists may be nested. This macro extension allows much of the functionality of proper computer languages:
    Iteration over a range
    of numerical values of a function
    sml {0.1 0.3 0.5} 1 0 0
    Iteration over a range
    of loci of a function
    tab a1 {m1 m2}
    Iteration over a range
    of functions
    {tdt ass} a1
    Combinatorial generation
    of strings eg locus names
    set loc a{1 2 3} aff
    Compound statementsif (male) then m{1 2}=x

    Data Declaration commands

  69. set datadirectory <pathname>. Sets directory to be searched for pedigree files.
  70. set workdirectory <pathname>. Sets directory to which temporary files are written.
  71. set impute off|on|full|sequential|lange-goradia|nil|<level>. Toggles imputation routine.
  72. set errordrop off|on|<level>. Toggles automatic deletion of genotypes that give rise to a Mendelian inconsistency, either an entire nuclear family (level 1), or an entire pedigree (level 2, the default).
  73. set checking off|on. Toggles the first level testing for Mendelian inconsistencies within nuclear families.
  74. set locus <locus name> <locus type> [<map position> [<description...>]]. Declares position (by order within list), name and type of locus within pedigree file. Locus type may be either:

    marker an autosomal (fully) codominant marker
    xmarker X-linked codominant marker
    haploid Y or mitochondrial codominant marker
    quantitative quantitative (or interval or ordinal) trait
    affectionbinary trait


    It is best to avoid a locus name containing reserved characters (eg "+-*/()^"), if algebraic manipulation of that variable will be required (otherwise quotation of the name is required). Names identical to commands also cause trouble unless protected by brackets.

    The fifth column (optionally) contains the genetic map position. All subsequent words (up to a total of 40 characters) are stored as an annotation. The annotation is appended to the long form of output of some commands (eg show loci or list), and is searchable by some commands (currently keep|drop where). If you wish to annotate, but do not have a map position, a "." for the fifth column is unobtrusive and accepted by Sib-pair.

  75. declare loci <number>(m|x|q|a) [<number>(m|x|q|a)... ]. Declare a batch of loci, automatically generating names: "trait1", "trait2"..."traitN", "mar1", "mar2"..."marN" as appropriate.
  76. rename <locus name> [to] <new name>. Change name of previously declared locus.
  77. loci <command file>. Read in Sib-pair locus and pedigree file declarations from a file. If the command file is not specified, a directory browser will start up.
  78. read locus linkage <locus file name>. Read locus names, types and map positions from a Linkage-format locus (.dat) file. Does not recognise factor coding of genotypes, but does create a new quantitative trait for liability class
  79. read locus merlin <locus file name> [xli]. Read locus names, types from a Merlin-format locus (.dat) file. Markers are treated as sex-linked when the xli modifier is used.
  80. read pedigree <pedigree file name>|inline. [skip <lines_at_beginning>]. Reads a GAS type pedigree file either from an external file, or inline following the command. The inline data is terminated by a line containing ";;;;". The skip keyword leads to the skipping past that number of lines at the beginning of the file.
  81. read linkage <pedigree file name>|inline. Reads a pre-MAKEPED LINKAGE type pedigree file.
  82. read ppd <pedigree file name>|inline. Reads a post-MAKEPED LINKAGE type pedigree file.
  83. read cases <data file name>|inline [sex] [skip <lines_at_beginning>]. Reads a data file containing unrelated individuals. The individual ID is the first column of data, which is followed by all the phenotypes. If the sex keyword is present then the second column of data is expected to be the sex. The skip keyword leads to the skipping past that number of lines at the beginning of the file.
  84. set sex marker <locus name>. Declares a sex-informative marker, such as Amelogenin.
  85. set twin <quantitative trait>. Declares a variable to identify monozygotic (MZ) twins. All individuals within a pedigree with the same nonzero value of this variable are taken to be part of the same MZ sibship (twin pair or higher order multiple). Different values indicate different MZ pairs within the same family. This information is used to write out MZ twin indicators to the pedigree files used by MENDEL, MERLIN and SOLAR, and to test for marker inconsistencies.
  86. set skiplines <slines>. Skip first slines lines in pedigree file) when reading in.
  87. order <loc1>...[<locB> to <locC>]... [$(m|x|q|a)[r|m]]...<locN>. Set order of loci. Addition of r to a class eg $mr, reverses the order of all members of that class, while the m modifier causes the order to be the genetic map order. You may have to revise the genetic map order (by set map or set dist to get sensible export files for some programs such as Linkage (Sib-pair assumes a map position lower than the preceding position implies the markers are unlinked).
  88. set map <pos1>...<posN>. Set map positions for the marker loci. This will overwrite any original map positions.
  89. set distances <dis1_2> <dis2_3>...<posN-1_N>. Set interlocus map distances map positions for the marker loci. Distances are in centiMorgans. This will overwrite any original map positions.
  90. read map <map file name> [[k]bp]. Read in map positions for loci from a file, matching via names of previously declared markers. Should recognize most formats of map file automatically eg Merlin, Mendel, Solar. Tests number of columns and whether column contents are numeric or alphabetical, skipping first row as possible header. The bp (kbp) modifier tells Sib-pair to divide the positions by 106 (103); thus the map distances become Mbp, and remain readable.
  91. set frequencies <marker> [<allele_freq1>...<allele_freqN>]. Sets the "population" allele frequencies for a marker to be used by MCMC procedures that simulate missing genotypes for that marker. This currently only affects the set analysis and gpe commands. Only one marker at a time can have specified frequencies. To free the frequencies (allow calculation from the dataset), call the command without specifying any frequencies after the marker name.
  92. run. Reads in pedigree file and creates working pedigree file. Imputes genotypes if requested.
  93. Analysis and data manipulation commands

  94. keep|drop <loc1>...[<locB> to <locC>]... [$(m|x|q|a)]...<locN>. Retain or exclude loci for subsequent analysis. Consecutive loci can be summarized as a range, as can all members of a particular class of locus type (marker, quantitative, affection) via a class ($type) token. Note that dropped variables can still be used in algebraic and logical expressions.
  95. keep|drop where (monomorphic | max <frequency> | number_typed <ntyp> | distance <smallest_gap> | | r2 <smallest_r2> | every number_skipped | position <start_position> <end_position>; | hwe_p [<critical_P>] | test_p [<critical_P>] | search_string>). Retain or exclude loci for analysis. Note that dropped variables can still be used in algebraic and logical expressions. The where condition can be used to match the set of loci meeting that condition. Available conditions are: test that a marker is monomorphic, that the commonest marker allele frequency exceeds a threshold, the number of individuals typed falls below a threshold, the marker is closer than a set amount to the last included marker (map distance), marker is in too high linkage disequilibrium (r2) with the last included marker, every Nth marker in list, within an interval on the genetic map, the HWE test or most recently applied test P-value is smaller than the critical level (defaults to 0.05/Nmarkers), or the marker annotation matches the search string.
  96. undrop [(<loc1>...[<locB> to <locC>]... [$(m|x|q|a) ] ...<locN>]) | where <search string>]. Return previously dropped loci to analysis. Default is to undrop all dropped loci. Loci can be selected on name (including wild cards), class, or annotation (where). This is not the reverse of the delete command.
  97. select [containing|exactly <nprobands>] [where] <a logical expression>. Select pedigrees containing one or more individuals with a trait value meeting the criterion.
  98. select pedigree|id [[not] in] <ped1>...<pedN>. Select pedigrees included or excluded from a list of pedigree or individual names. The names can contain wildcard characters: "." (match any character in the target at that position in the search string) and "*" (match any characters zero or more times in the target at that position in the search string).
  99. unselect [<Nth>]. Returns all pedigrees excluded by a select command back to the analysis. If an integer argument is given, this gives how many select statement subsettings to roll back. The argument can be negative (reversing the effect of a previous targetted unselect).
  100. pack loci|pedigrees. Permanently delete all loci currently excluded by a drop command, or all pedigrees currently excluded by a select command from the work file.
  101. edit <pedigree> <person>|all <trait> to <value> [<new value>]. Allows editing of trait values or genotypes. The all keyword performs the action on all members of that pedigree: since wildcards can now be used, an equivalent is "edit <ped> *".
  102. delete <pedigree> <person>|all Sets all data to missing for a specified individual. The all keyword performs the action on all members of that pedigree.
  103. delete [<locus1>...<locusN>] [whe|where] <a logical expression>. Sets specified data to missing for all individuals meeting particular criteria.
  104. get <relationship> <statistic> <trait> [<newtrait>]

    getall mean<trait>[<newtrait>]
    offspring|children minimum
    sons maximum
    daughters sum
    parents count
    father sample
    mother
    spouse
    husband
    wife
    siblings
    brothers
    sisters

    Summarizes trait values of all relatives of the specified class of all individuals, saving the result to a trait locus if requested. The sample option samples with replacement unless the all option is used, when the sampling is without replacement.

  105. recode (<marker>|$(m|x)) [frequencies]. Recodes alleles at that marker or set of markers to 1..N, where the ordering defaults to the allele size (or collation order for letter alleles). If the freq modifier is present, the numbering is by ascending allele frequency. If the let modifier is present, the numbering is "1..4" for "ACGT", and the reverse for num.
  106. recode <marker> <all1|value1>...<allN|valueN> to <new allele|new value> [...<newN>]. Allows pooling and/or recoding of marker alleles prior to subsequent analysis. If there are fewer new values than old values, the last new value is recycled.
  107. combine <marker1> [...<markerN>] [<threshold>]. Pool rare alleles for a marker into one new allele. "Rare" defaults to a frequency of 5%, but can be changed via the last parameter on the command line.
  108. flip <marker> [to <marker2> ] [...<markerN>]. Recode SNP marker alleles to the complementary strand coding ie "[ACGT]" -> "[TGCA]".
  109. date <quantitative_trait> [julian|gregorian|year]. Convert a numeric variable from Julian to Gregorian, Gregorian to Julian, or Gregorian to "decimal" year. The "chronological" Julian date is the number of days since the epoch, usually 1970-01-01 or -4712-01-01. Gregorian dates are represented as 8 (or 9) digit integers of the form of (-)YYYYMMDD. The decimal years are YYYY.x, where the decimal part is the day of year number (from 1...366) divided by the length of that year (365 or 366).
  110. date (<yyyymmdd> julian)| (<juldate> gregorian). Convert a single date from Julian to Gregorian or Gregorian to Julian.
  111. test dob <DOB_variable> [gregorian] [<threshold>]. Test that parent and offspring DOBs are consistent. The threshold controls the minimum age of parents at birth of offspring, and defaults to 4380 days, if gregorian has been used to declare the DOB variable to represent Gregorian dates.
  112. test age <age_variable> [<threshold>]. Test that parent and offspring ages are consistent. The threshold controls the minimum age of parents at birth of offspring, and defaults to zero, since the units are unknown.
  113. test sex. Uses X-chromosome and/or Amelogenin to test the putative sexes of genotyped individuals.
  114. test haploid [mitochondrial]. Uses Y-chromosome or mitochondrial markers to test the putative relationships of genotyped individuals.
  115. transform <xtrait> <divisor> <subtractand> <power> <lower threshold> <higher threshold>. This transform the quantitative trait xtrait as:
    boxcox({xtrait-subtractand}/divisor)

    where boxcox() is (a slightly altered) Box-Cox transformation, so that:

    The resulting transformed value can then be truncated above or below using a specified low or high threshold.
  116. standardize <trait> [familywise]. Replace each trait value with its Z-score, ie (x-xbar)/sd, where xbar is the total sample mean, and sd the total sample standard deviation. This can also be performed using the individual's family mean and standard deviation, if the fam keyword is included.
  117. adjust <ytrait> on <xtrait> [to <adjustment value of xtrait|m|f>]. Performs linear regression of quantitative trait ytrait on quantitative or binary trait xtrait (or sex, if sex is set to on), calculates residuals, and adds adjustment value or, if not specified, the mean value of xtrait. The residuals then replace the original values of ytrait. A multiple regressive adjustment of Y on X1 and X2 requires sequential adjustment of Y on X2,X1 on X2, and then Y on the adjusted X1. Has been superceded by the more powerful residuals command.
  118. residuals <ytrait> on <loc1>...[to]...<locN> [complete_obs]. Replace quantitative trait with the residuals from the multivariate regression on the list of predictors (which may include the average allele length of a marker locus). The com option means only individuals with no missing values for any of the listed traits will be updated. Otherwise, missing values are replaced with the sample mean for that phenotype when calculating the predicted value.
  119. predict <ytrait> on <loc1>...[to]...<locN> [complete_obs] Replace quantitative trait with the predicted value from the multivariate regression on the list of predictors (which may include the average allele length of a marker locus). The com option means only individuals with no missing values for any of the listed traits will be updated. Otherwise, missing values are replaced with the sample mean for that phenotype. when calculating the predicted value
  120. impute <ytrait>. Replace missing quantitative trait values with the predicted values from a regression on the spouse, sibling and offspring observed values. Designed mainly for imputing missing age or date of birth.
  121. impute <ytrait> on <loc1>...[to]...<locN> [complete_obs] Replace missing quantitative trait values with the predicted value from the multivariate regression on the list of predictors (which may include the average allele length of a marker locus). The com option means only individuals with no missing values for any of the listed predictor traits will be updated. Otherwise, missing values are replaced with the sample mean for that phenotype when calculating the predicted value.
  122. kaplan-meier <age-at-onset> <censor> [residuals]. Prints the product-limit estimator for the survivor function for the quantitative trait age-at-onset, where censor is the binary outcome trait, which is affected when age-at-onset represents the age at which the individual first expressed the trait. The age-at-onset is replaced by a nonparametric residual when requested. If affected, this is:
    sgn(1-H(t)).(-2(1-H(t)+log(H(t))))1/2

    If unaffected:
    -(-2H(t))1/2

    where H(t) is the Nelson-Aalen estimate of the integrated hazard function at that age t.
  123. lifetable <start> <end> <censor> [<cohort stratum width> [<period stratum width>]] [time] [covariate <covariate>]. Prints the life table for the quantitative trait end, where censor is the binary outcome trait, which is affected when end represents the age at which the individual first expressed the trait. Start represents the time of entry into the study of the individual. The <cohort stratum width> is the bin width used to divide up the entry times into the study. The <period stratum width> is the bin width used to divide up person time of followup (and defaults to years). The <time> modifier causes the start and end values to be treated as a continous measure, rather than a Julian date (the default). The covariate keyword declares the following named variable to be a categorical covariate which will be used to stratify the dataset.
  124. rank <trait> <rank>. Write the ranks of a quantitative trait to the quantitative variable rank.
  125. blom <trait> <blom_score>. Write the approximate inverse normal scores for a quantitative trait to the quantitative variable blom_score.
  126. simulate <trait> [<h2> [<linked extant marker>]]. The data for the named trait is replaced by simulated data for a trait under the control of a QTL (or polygenes) with a total heritability of h2 (defaulting to 50%). If a second marker name is given, the controlling QTL is simulated as being completely linked to the second marker.
  127. simulate <marker> [<linked extant marker>] [<number of equifrequent marker alleles> | <allele 1 frequency>...<allele N frequency>]. The data for the named autosomal marker is replaced by simulated data. If a second marker name is given, the new marker is simulated as being completely linked to the second marker. Either a set of allele frequencies, or the number of (equifrequent) alleles, can be given for the simulation. If the sum of the given allele frequencies is less than 1, an extra allele will be added automatically.
  128. simulate pedigrees [<nped> [<ngen> [<min_number_of offspring> [<max_number_of_offpsring> ]]]]. Generate a set of nped (default 100) random pedigrees, each of ngen (default 2) generations. The component nuclear families each contain between min_number_of offspring and max_number_of_offpsring offspring (defaulting to a range 0-2). The generated pedigrees are each descended from a single founder couple (with marry-ins).
  129. nuclear [maxsibs] [grandparents]. Split pedigrees into component nuclear families, duplicating individuals as necessary. If maxsibs is set, then sibships containing more than maxsibs members are truncated. The gra option includes the grandparents as well.
  130. subpedigrees. Split nominal pedigrees into component true pedigrees. Sib-pair normally can analyse a group of individuals with the same pedigree ID, even if they are not all related. This command splits such groups into uniquely named formal pedigrees.
  131. join <pedname1> [<pedname2>...<pednameN>] Join or rejoin pedigrees into a single pedigree, appropriately dealing with shared individual IDs and their associated data. Note that Sib-pair allows noncontiguous pedigree blocks to use the same pedigree name, so multiple pedigrees of the same name will be collected up.
  132. prune [<binary trait> [|<quantitative trait> over|under <threshold>]]. Reduce pedigree to contain probands and minimum number of connecting relatives.
  133. cases <locus>. Reduce pedigree to unrelated individuals with non-missing values at the trait i.e. the informative founders, and any informative nonfounders who are not directly related to any individuals already selected.
  134. unique_id [sequential]. Generate unique consecutive (within family) numerical IDs for all individuals (as well as new numeric pedigree IDs). The sequential gives IDs from 1...total_records, instead of 10001, 10002...20001...
  135. print [where] <a logical expression>. Print trait values for individuals, with a combination of trait values meeting the criterion.
  136. print ped <Ped1>...<PedN> [id <Id1>...<IdN>] Print trait values for individuals, with specified combination of pedigree and individuals IDs. The pedigree and ID names can contain wildcard characters: "." (match any character at that position in the search string) and "*" (match zero or more characters).
  137. write [<pedigree file name>]. Writes a GAS type pedigree file from the current dataset. Default is to screen.
  138. head|tail [<nrec>| (<skip> <nrec>)]. Writes the first or last nrec records (default 20) of a pedigree file to the screen. If two arguments are present, then the first represents the number of records to skip over before writing nrec records.
  139. more [<nrec>]. Pages through the current dataset, nrec records (default 20) per page. Allows paging backwards and forwards by full or half pages.
  140. write pap. Writes the required pedigree files trip.dat and phen.dat (note that you may have to sort trip.dat).
  141. write <format> <pedigree file name> <modifier>

    writepedigree|gas <pedigree file name>
    arl [par|all]
    asp|tcl
    beagle [fou]
    crimap|tcl
    csv
    dot
    fisher
    gda [all]
    haploview
    linkage|ppd|gh[dummy] [numbered_alleles]
    mendel
    merlin
    phe
    sage
    solar [phe] [nopedigreeID]
    structure [fou]
    Use of the keywords pedigree or gas writes a GAS type pedigree file from the current dataset. Quantitative values are written as F9.x or F8.4. The keyword gda writes a GDA Nexus datafile containing all current marker genotypes for founders. If the keyword all is added, nonfounders will be included as well, but the "gdatype" format will not differentiate between relatives. Similarly, arlequin writes a data file for the program Arlequin containing haplotypes from one informative child per family, or two parents of such a child if the par keyword is added. Only if the all keyword is added will all genotyped individuals be printed. The keywords linkage and ppd write a pedigree file from the current dataset suitable for use by the LINKAGE (and FASTLINK) programs, the former type requires preprocessing by the Makeped program (note that if a quantitative trait value is zero -- that is nonmissing -- it is recoded to 0.0001); aspex (or tcl) writes a linkage style pedigree file but with the marker locus names as the first line, as the ASPEX programs prefer; gh writes a linkage style pedigree file with a dummy affection trait as the first trait and all the quantitative traits last, with "-" for missing quantitative trait values. The dummy option added to linkage or gh writes a dummy affection locus as the the first trait (everybody affected). The numbered_allele option skips recoding alleles to numbered alleles. The haploview option is a linkage style file with recoding of letter alleles from "ACGT" to "1234". The sage keyword writes a pedigree file from the current dataset suitable for use by the program FSP included in the SAGE package; mendel writes a pedigree file from the current dataset suitable for use by the programs MENDEL or SIMWALK; merlin writes a pedigree file suitable for Merlin -- actually a LINKAGE "pre" format file with zygosity included as the first trait, if the "set twin" command has been previously issued; fisher writes a pedigree file from the current dataset suitable for use by the program FISHER; phe writes the "pheno.dat" style file required by Mapmaker-Sibs; both csv and solar give a comma delimited file, with header naming columns, from which the pedigree ID column can be dropped via the nop option, and the SOLAR phenotype (or genotype, depending on a prior keep statement) file written by the phe option. The SOLAR pedigree file has two additional columns: MZ twin indicator (requiring a previous "set twin") and a household (actually pedigree) indicator. The structure and beagle commands write genotype data files for Structure and Beagle respectively

    (and can be restricted to writing just founder data using the founders option).
  142. write map mendel|merlin|loki <map file name>. Writes out the map file required by MENDEL 4.0, MERLIN or LOKI.
  143. write locus pap. Writes the required locus files header.dat and popln.dat.
  144. write locusaspex|tcl <locus file name>
    fisher
    gas
    haploview
    linkage|gh[dummy] [xlinked]
    loki
    mendel [trait]
    merlin
    relpair
    sage
    sib-pair [<pedigree file name>]
    structure <pedigree file name>
    superlink [dummy] [xlinked]
    Use of the keyword gas writes a GAS type locus file from the current dataset; haploview writes an "info" file, giving the marker position as the first word of the annotation if it is numerical, or the map position multiplied by 106; linkage writes a locus file from the current dataset suitable for use by the LINKAGE (and FASTLINK) programs; gh writes the same as linkage save that map distances are in cM. The dummy option is used when the first trait is a dummy trait generated by write linkage <file> dummy, while the xlinked option declares the markers to be all X-linked. The keyword loki writes a control file for LOKI's prep program; sage writes a locus file from the current dataset suitable for use by the program FSP included in the SAGE package; mendel writes a locus file from the current dataset suitable for use by the programs MENDEL or SIMWALK (with binary traits defaulting to a factor, but given as a diallelic locus if the trait modifier is present); fisher writes a locus file from the current dataset suitable for use by the program FISHER; merlin for MERLIN; tcl or aspex writes the tcl command file required by ASPEX programs such as SIB_PHASE; sib-pair writes a Sib-pair style script; superlink writes a LINKAGE style locus file modified for the SUPERGH option of SUPERLINK; relpair writes the RELPAIR format (modified from that for MENDEL): it infers chromosome number from the map position (as multiples of 1000) or from the locus name (if it takes the form of "DxSxxx").
  145. write var [mendel] <var file name>. Writes out the var file (list of quantitative traits) required by MENDEL.
  146. generations [<quantitative trait> [reverse]. List founders/marry-ins and sibships by generation number for all pedigrees, (over)writing the generation number to a quantitative trait if requested. If the reverse modifier is present, the generation number counts up from the bottom of the pedigree, rather than from the top.
  147. gpe <codominant marker> [mcmc] [<allele dose estimate>]. Gives iterative peeling or MCMC (Monte-Carlo Markov Chain) estimates of the genotype probability estimates for the given marker for each individual: a vector of probabilities corresponding to the possible genotypes at that marker. For an observed genotype, this is 100% for the observed value and 0% for all other possible genotype values. For an unobserved genotype, it gives the probability distribution of possible genotypes conditional on the sample allele frequencies (assuming Hardy-Weinberg Equilibrium) and the observed genotypes in the individual's relatives. If the name of an extant quantitative trait is appended to the command, the expected gene dose for the first (ie lowest in the collation order) allele will be written to this variable. The gpe command respects the set frequencies command, so that the population allele frequencies can be specified in advance.
  148. peel <codominant marker>. Calculate the pedigree likelihood for the given marker.
  149. haplotypes <marker1> <marker2> <newmarker> [<threshold>]. This infers phased genotypes when the two markers are in complete (or near-complete) LD. The threshold sets the maximum number of the rare haplotype that is acceptable when LD is not complete.
  150. triads This routine lists haplotypes inferred from fully typed parent-offspring triads, along with counts of obligate recombinants.
  151. relatives <ped> <id>. This routine lists relatives of an index individual: parents, sibs, spouses, offpsring and descendants.
  152. ancestors <binary trait> |(<quantitative trait> >|>=|<|<=|==|^= < threshold>). This prints the IDs of the ancestor (and ancestral mating) shared by the greatest possible number of probands in a family. The mean intrafamilial inbreeding coefficient for the probands is also output.
  153. frequencies|describe [[<codominant marker>| <binary trait>| <quantitative trait>]...[to]...<trait>] | [snp] . Print allele frequencies for marker loci, segregation ratios for binary trait, or means, variances, familial correlations and a sibship variance test for a quantitative trait. Default is to describe all loci. The snp option prints minor allele frequencies and number typed for all diallelic marker loci.
  154. mcfrequency <codominant marker> | $m. Print MCEM (Monte-Carlo Expectation-Maximization) estimates of the founder allele frequencies for marker loci. A fixed number of EM iterations are carried out, usually 20. This can be set higher if desired using the set emit command.
  155. count [where] <a logical expression>. Count individuals, and sibships and pedigrees containing such individuals, with a combination of trait values meeting the criterion.
  156. print [where] <a logical expression>. Print phenotype data for individuals with a combination of trait values meeting the criterion.
  157. plot <quantitative trait> <quantitative trait> [<file>]. Produce an Encapsulated Postscript scatterplot for two quantitative traits. Graphic file name defaults to "sib-pair.eps".
  158. tab <trait 1>...[<trait N>]. Print contingency table for one, two or N traits, along with contingency chi-squares, Kruskal-Wallis test or odds ratio if appropriate. For RxC contingency tables where the second variable is a diallelic marker locus, allele frequencies and exact P-values testing Hardy-Weinberg Equilibrium are printed for each level of the first trait.
  159. llm <trait 1> [[+] <trait 2>...] [[+] <trait 1> * <trait 2>...] [-1]. Carry out log-linear modelling of a multidimensional contingency table under the specified model. The "lrt" command can be used to compare sequentially fitted models.
  160. kruskal-wallis <quantitative trait> <trait>. Print table of means for the quantitative trait for each level of factor, along with the Kruskal-Wallis chi-square.
  161. regress <ytrait> = <x1>...[to]... <xN> [offset <offset>] [poisson|(exponential|weibull [<censoring_trait>)] [shape <shape>] [sim] [rep <nreplicates>]. Performs linear or logistic or poisson or weibull regression of trait ytrait on set of loci x1...xN. If an x variable is a marker genotype, that independent variable is the mean allele size in the genotype, with the exception of the first marker locus encountered in the list, which is fully allelic effect coded. The offset option reads an offset for the linear predictor from the specified trait. Addition of a binary trait name to the end of the keyword list when the regression is weibull or exponential declares this as the censoring indicator. The shape keyword declares a starting value for the solution of the Weibull distribution shape parameter. The sim keyword gives a gene-dropped P-value for the first marker locus in the list. The rep keyword specifies a number of replicates for multiple imputation of the test marker locus genotypes, and is usually used when set analysis imputed has already been issued.
  162. mixture <quantitative trait> [<Number of distributions> [normal|pooled_normal|exponential|poisson]]. Estimate mixing proportions, means and standard deviations for a 1..5 component mixture model describing the specified quantititative trait. The default is a mixture of Normal (Gaussian) distributions with different means and variances, but a common variance can alternatively be specified. Other distributions available are the exponential and Poisson. A line-printer type histogram is produced.
  163. kinship [inbreeding [mc]|pairwise|<binary trait> [|<quantitative trait> [>|>=|<|<=|==|^= <threshold>]]. Write the numerator relationship matrix (matrix of coefficients of relationship) for each pedigree in a lower triangular form or as a list of pairs (in the latter case, the coefficient of fraternity is also printed). Alternatively, if requested, print a list of individuals with a non-zero inbreeding coefficient, using a Monte Carlo estimator if mc modifier is present. If a binary trait is specified, the NRM is only for the affecteds if plevel=1; for plevel=0, only a summary for each pedigree is printed: number of affecteds, number of "sporadic" cases ie cases unrelated to any other affected family members (eg marry-ins with no affected descendants), mean coefficients of relationship for affected relative pairs and of inbreeding for cases.
  164. ibd <codominant marker> [pairwise]. Write the estimated mean identity-by-descent sharing at a marker for all relative pairs in a pedigree as a lower triangular matrix or a list of pairs.
  165. ibs <codominant marker> [pairwise]. Write the estimated mean identity-by-state sharing at a marker for all relative pairs in a pedigree as a lower triangular matrix or a list of pairs.
  166. hwe [founders] [<mar1> ..[to].. <marN>] [$(m | x)]. Prints chi-square statistic for Hardy-Weinberg equilibrium for all marker loci. Analysis may be restricted to founders, and if the marker is diallelic, an exact test is carried out. If nonfounders are included, then a gene-dropping simulated P-value is produced. The mean IBS sharing for all typed matings is also calculated, and compared to its expected value. This latter test may allow detection of homogamy or assortment.
  167. cksib. Lists all sib pairs, and the mean of IBS at all marker loci where both members of the pair are typed at the marker, comparing this to that expected if related as specified by the pedigree structure. The output is to the standard output.
  168. share [pairs]. Lists all relative pairs, and the mean of IBS at all marker loci where both members of the pair are typed at the marker, comparing this to that expected if related as specified by the pedigree structure, allele frequencies and linkage map. The output is to the standard output. The default lists individuals whose Z score measuring deviation from expected exceeds 1.65 with any other relative. The pairs option prints the statistic for each deviant pair, or all pairs if output is set to verbose.
  169. mztwin <monozygosity_indicator> |(<zygosity_score> even|odd|(>|>=|<|<=|==|^= <threshold>)) [clean|delete]. Using a binary or quantitative trait which indicates which sib pairs are monozygotic twin pairs, list markers at which the twins carry discordant genotypes. Gives proportion discordance for each marker. This is useful for estimating genotyping error rates. The clean option deletes genotypes for pairs where there is an inconsistency, and fills in missing genotypes where that for the cotwin is available. The delete option drops the member of the pair with the fewest nonmissing phenotypes, and averages (across the pair) quantitative phenotypes where both are observed. The even and odd zygosity score test require the score to be positive, and are designed for use with MERLIN's coding system.
  170. davie <binary trait> [<proband indicator>]. Print segregation ratios and standard errors for a binary trait, adjusted for the ascertainment scheme. Probands are indicated by being affected at the proband indicator "locus". If no proband indicator variable is given, complete ascertainment is assumed (equivalent to "davie trait trait").
  171. twin <trait> [<zygosity_score> even | odd | ( >|>=|<|<=|==|^= <threshold>)]. Calculate, and test for equality of, MZ and DZ twin correlations or concordances. If a zygosity indicator is not specified, the indicator declared by the "set twin" command is used to divide siblings into MZ twins and non-MZ full siblings (DZ twins and "singletons"). If a zygosity indicator is specified, then this is used to define three groups: MZ twins (indicator expression TRUE or greater than zero), DZ twins (indicator FALSE or equal to zero), and nontwin siblings (indicator value is missing).
  172. varcomp <quantitative trait> [ae] [ce] [ace] [ade] [covariate <covariate trait 1> ... <covariate trait N>]. Performs MVN variance components analysis for a quantitative trait using all phenotyped individuals. Currently fits ADE, ACE, AE and CE models. The ae option fits AE and E only, and so forth. Multiple covariates can be included in the fixed effects part of the model via adding the cov keyword and a list of covariates at the end of the command line. Only the first marker locus in the list of covariates is fully allelic effect coded, with subsequent markers included as the mean of their allele values (i.e. 1="1/1", 1.5="1/2", 2="2/2" for a diallelic marker).
  173. lrt. Constructs likelihood ratio test comparing the last two variance components (var) or generalized linear (reg) or generalized linear mixed (fpm) models fitted. The compared likelihood statistics from fpm are actually the mean model loglikelihoods. This allows one to carry out tests of linkage after conditioning out the effects of genotype at a candidate polymorphism, and traditional "measured genotype" association analysis in pedigrees, including non-normal data (currently binomial and poisson distibuted traits).
  174. blup <quantitative trait> <h2>. Calculates BLUPs (best linear unbiased predictions of the breeding value) for a quantitative trait assuming the given heritability.
  175. fpm <quantitative trait> [>|>=|<|<=|==|^= <threshold>]|<binary trait>
    [nqtl <number simulated QTLs>]
    [link logit|probit|mft|ln]
    [likelihood_family gaussian|binomial|poisson]
    [p|fre] [a|va] [d|vd] [g|vg|h2] [c|vc] [s|vs]
    [fix a|c|d|e|g|mu|s|var]
    [aval|cval|dval|eval|gval|mu|pval|sval| var|AA|AB|BB|SD <value>]
    [cov <x1> [+ <x2>...]].
    Uses a Monte Carlo Markov Chain (Metropolis) algorithm to perform Generalized Linear Mixed Model analysis (when nqtl set to zero), complex segregation analysis (nqtl set to one), "genetic" mixed model analysis (nqtl set to one, g estimated), or a finite polygenic model analysis for a quantitative or binary trait using all phenotyped individuals. The a, d, g, c and s options respectively include additive QTL effects, dominance QTL effects, additive Gaussian polygenic random effects, pedigree environmental effects and maternally derived sibship effects in the model. The fix option allows that variable to be held fixed. The aval (dval etc) keyword allows the starting value of that parameter to be set. Either a Gaussian, Binomial, Weibull or Poisson likelihood can be used. In addition to the canonical links for these three types, one can also fit the multifactorial threshold model to binary data. Because the identity link function for a binomial likelihood is of genetic interest, notably in the case of the single major locus model, special checks to reject models where predictions lie outside 0-1 are made, so that this model can be successfully fitted.
  176. dis [<marker locus 1> [<marker locus 2> [.. <marker locus N> [<trait>]]]. Estimates frequencies of haplotypes based on individuals with two informative parents or no informative parents. For pairs of multiallelic markers (6 or more alleles), only the former group is used. In the informative parent-offspring triads, the loci are assumed to be tightly linked, so that four parental haplotypes may be inferred, but this maybe turned off. If no markers are specified, the sequence of pairs of markers in the list of loci is produced (ie marker1 with marker2, marker2 with marker3...). If one marker is specified, then the pairing of all other markers with this index marker is analysed. For pairs of markers with few alleles, information from both phased and unphased genotypes is combined using an EM-fitted log-linear model, which can also deal with X-linked data. If three or more markers are specified, or two or more markers along with a trait, the genotypes are all treated as if unphased (and X-linked data is not handled). The same log-linear modelling framework is used. When a trait is specified, the haplotype frequencies within each trait category (two if a binary trait, as many levels as observed for a quantitative trait) are calculated, along with a chi-square testing equality of marker haplotype frequencies across categories. If exactly two numbers are given, then these are assumed to be the number of alleles at two loci, and genotype counts may be entered at the command line.
  177. assoc (<binary trait> |<quantitative trait> [(>|>=|<|<=|==|^= <threshold>]) | categorical) [founders] [covariate <covariate>] [genotypic] [ibd <ibd_marker>]. For a binary or dichotomised quantitative trait, prints chi-square statistics for association for all marker loci versus the trait, either affected versus unaffected if the trait is binary, or above or below the threshold if the trait is quantitative. If the categorical modifier is present, then each unique value of a quantitative trait is taken to represent a category of a multinomial trait. For dichotomous traits, a second table in the output shows the results from the reconstruction-combined TDT within informative sibships. Also prints F statistics assuming trait is indicator of membership of different subpopulations. For a quantitative trait, prints the model and residual sums-of-squares and allelic means with naive standard errors from an additive allelic ANOVA model. Monte-Carlo empiric P-values are produced for either analysis via gene-dropping, which can either be unconditional, or completely linked to that of another marker. Analysis may be restricted to founders. Genotypic rather than