Sham P.C.1,2,3,*, Ao, S.I.4, Kwan J.S.H.2, Kao P.2, Cheung F.2, Fong P.Y.2, Ng, M.K.5
(1)
Department
of Psychiatry, University of Hong Kong
(2) Genome Research Centre, University of Hong Kong
(3) SGDP Centre, Institute of Psychiatry, King's College London
(4) Department of Mathematics, University of Hong Kong
(5)
Department
of Mathematics, Hong Kong Baptist University
There are two main approaches to selecting genetic markers in association studies of complex diseases. The first is a direct, functional, or gene-based approach, in which polymorphisms are selected if they cause a change in the amino acid sequence or expression of candidate genes. The second is an indirect approach in which markers in a region or the whole genome are systematically screened, on the basis that they may be in linkage disequilibrium (LD) with disease-related functional variants. For the second approach, efficiency can be improved by recognizing the redundancy between near-by markers through the presence of LD, and the selection of a subset of informative SNPs, called tag SNPs, for analysis (Johnson et al, 2001). A large number of methods for selecting tag-SNPs have been proposed (Halldόrsson et al, 2005).
In this report, we propose a modification of tag SNP selection algorithms that takes account of functional as well as LD information. More importance is attached to some SNPs than others, based on their positions within the coding or regulatory regions or splice sites. We also describe further modifications to address other practical issues: some SNPs may be more readily assayed than others under the proposed genotyping platform, and some SNPs may have been already genotyped in the sample. We have modified our recently described tag-SNP selection method (Ao et al, 2005) to incorporate functional and other information.
Our method for tag-SNP selection is based on agglomerative hierarchical
clustering, which starts with a square matrix of pair-wise distances between
the objects to be clustered. The two clusters with the smallest inter-cluster
distance are successively merged until all the objects have been merged into
a single cluster. For two SNPs, an appropriate measure of distance is 1-R^2,
where R^2 is the squared correlation between the SNPs. Different forms of agglomerative
clustering differ in the definition of the distance between two clusters, each
of which may contain more than one object. We previously proposed the following
definition for inter-cluster distance:
For each SNP belonging to either cluster, find the maximum distance (i.e.
1-R^2) from it to all the other SNPs in the two clusters
The smallest of these maximum distances is defined as the distance between
the two clusters.
The corresponding SNP is defined as the tag-SNP of the newly merged cluster
In this method, called minimax clustering, setting a cutoff merging distance
of C for terminating the algorithm would ensure that no SNP is further than
C away from the tag-SNP in its cluster. This method was implemented in the CLUSTAG
program. Also implemented in this program are two other tag SNP selection procedures,
a complete linkage clustering method (Byng et al, 2003) and a set-cover algorithm
similar to the method described in Carlson et al (2004). Our previous results
show that complete linkage clustering results in a greater number of clusters
and is therefore less efficient, while the set-cover method is similar to minima
clustering in terms of the number of tag SNPs but produces less compact clusters
(as measured by the average of the distances,1-R^2 between all SNPs and their
assigned tag SNPs).
One complication of this modification is the asymmetry between two SNPs with
different values of C. For example, if one SNP is coding and therefore given
a C of 0.8, and the other is non-coding and therefore given a C of 0.4, and
the R2 between the two SNPs is 0.6, then the first SNP is able to serve as tag-SNP
for the second SNP, but not the reverse. The modification of the algorithm necessary
to deal with this complication is simplified by the ability of the clustering
program to handle an asymmetric distance matrix, in which the distance from
object i to object j can differ from the distance from object j to object i.
Because of this, the desired extension can be achieved by the following modifications
to the clustering algorithm:
For each marker, a user-defined value of C is provided in an input file
The distance from marker i to marker j is defined as C[j]-R[ij] where C[j] is
the value of C specified for marker j. If C[j]-R[ij]<0, then marker i can
serve as a tag SNP for marker j.
This asymmetric distance matrix is subjected to the minimax clustering method
with the cut-off merging distance set at 0. In order words, a cluster is formed
if there is a tag SNP which has distance 0 or less with every member of the
cluster.
Other modifications of this algorithm can be used to ensure the inclusion of SNPs that have been genotyped, and the exclusion of those that cannot be assayed. These modifications involve changing the values of the certain elements of the correlation matrix [R[ij]]. Thus, if marker t has been already genotyped, then all elements of column t of the matrix can be set to 0, except the diagonal element which remains 1. This setting ensures that marker t cannot be tagged by any marker except itself, so that it must be included as a tag SNP in the clustering algorithm. Similarly, if marker t is problematic for assay design, then all the elements of row t of the matrix can be set to 0. This setting ensures that marker t can never serve as a tag SNP. As the clustering proceeds, some of these SNPs will be merged into clusters, but others may remain untagged when the algorithm is terminated. At this stage it is necessary to check each of the remaining markers whether they have a potential tag SNP, and to include these in the list of tag SNPs.
And the details of how to handle the cases of some already-genotyped SNPs and some non-desirable SNPs are available here with MS Word format or here with pdf format.
The same principles of weighting can be applied also to the set-cover algorithm �V marker i can serve can tag SNP for marker j if the condition C[j]-R[ij]<0 is fulfilled. The algorithm would initially select all SNPs that have been already genotyped, and remove the markers tagged by these SNPs. Then the greedy algorithm proceeds as usual, except the exclusion of SNPs that have problems with assay design from the set of possible tag SNPs. As with the clustering algorithm, some "non-designable" SNPs may be left untagged at the end, and these are checked to see if they have potential tag SNPs.
We
have made the necessary modifications to implement the weighting in all two tag-SNP
selection algorithms in the WCLUSTAG program. To illustrate the performance of
the new algorithms, they were applied to the CEPH sample genotype data from the
International Haplotype Map Project. The ENCODE regions were selected because
genotyping were undertaken for all known SNPs in these regions. Intragenic
regions were identified from the start and end points of the coding sequences
for the 33K Ensembl genes in NCBI build 34.
Intragenic SNPs are given a C
weighting of 0.8, and other SNPs 0.4. The compression ratios (the number of tag-SNPs
over the total number of SNPs) of the various ENCODE regions are compared with
the original procedure which used a uniform C
value of 0.8. Our results show that there can be a further 35.2% saving with our
weighted minimax algorithm, and 35.9% with the set cover method (Table 1). We
also explored the impact of using different weighting schemes. Some additional
saving can be obtaining by lowering the weights for either intragenic or other
SNPs, although the compression ratios remain in the region of 0.2 (Table 2).
The method and program described here allow the user to prioritize different SNPs and genomic regions in a systematic association screen, depending on prior genomic and disease data available either through public databases or recent experiments. Any weight values below 1 can be specified for any SNP or region, allowing the user to explore different designs under a fixed budget. The utility of the program will be greatly enhanced when detailed SNP maps are available from the International Haplotype Map Project, and progressively more detailed functional information becomes available in public genome databases.
2A (519) |
0.277 |
0.245 |
0.247 |
|
0.104 |
0.104 |
2B (595) |
0.291 |
0.255 |
0.261 |
|
0.197 |
0.198 |
4 (665) |
0.242 |
0.211 |
0.209 |
|
0.089 |
0.089 |
7A (417) |
0.314 |
0.281 |
0.281 |
|
0.149 |
0.139 |
7B (463) |
0.186 |
0.166 |
0.171 |
|
0.114 |
0.114 |
7C (433) |
0.240 |
0.217 |
0.215 |
|
0.189 |
0.185 |
8A (364) |
0.269 |
0.245 |
0.245 |
|
0.190 |
0.190 |
9 (258) |
0.360 |
0.318 |
0.314 |
|
0.221 |
0.225 |
12 (454) |
0.260 |
0.227 |
0.227 |
|
0.167 |
0.163 |
18 (350) |
0.283 |
0.254 |
0.254 |
|
0.186 |
0.189 |
Average |
0,267 |
2.237 |
0.238 |
|
0.154 |
0.153 |
Additional
saving |
--- |
--- |
--- |
|
35.2% |
35.9% |
Table 2. Effect of weighting scheme (intragenic versus other SNPs) on the comparison ratios for tag-SNP selection algorithms in the Chromosome 9 Encode data.
0.8 : 0.4 |
|
0.221 |
0.225 |
0.8 : 0.3 |
|
0.198 |
0.198 |
0.8 : 0.5 |
|
0.240 |
0.244 |
0.7 : 0.4 |
|
0.217 |
0.221 |
0.9 : 0.4 |
|
0.240 |
0.244 |
0.8 : 0.8 |
|
0.318 |
0.314 |
Table 3. The number of SNPs in the intragenic regions and the other regions. The average ratio of the SNPs in the intragenic regions to the overall SNPs is 32.3%.
SNPs no. | SNPs in intragenic regions | SNPs in other regions |
chr2A | 0 | 519 |
chr2B | 273 | 322 |
chr4 | 0 | 665 |
chr7A | 21 | 396 |
chr7B | 159 | 304 |
chr7C | 299 | 134 |
chr8A | 203 | 161 |
chr9 | 66 | 192 |
chr12 | 180 | 274 |
chr18 | 167 | 183 |
The program WCLUSTAG is available for free download. .
Package v1 (slim version): click here for downloading the zipped file (package v1) of the WCLUSTAG with the sample files, the necessary directories, the DOS script and the C script for setting the weights of the SNPs, so that the user can run the program after unzipping this file. The C script change_chr9_gene.cpp (can be renamed to .c file if one like) can be complied with any C complier in Windows or Unix environment. The names of the input files for this script can be changed to any name one likes in the declaration section of the script. The default ones are the chr9_raw_processed.txt for the SNP position and MAF information, the chr9_LD_unweighted.txt for the LD correlation data, and the chr9_gene_processed.txt for the gene region and non-gene region information. The default setting for the weights are 0.8 for the gene region and 0.4 for the non-gene region. Users can change these values in the C script and then set the threshold of the WCLUSTAG script to the weight value of the gene region.
Package v2 (full version): click here for downloading the zipped file (package v2) of the WCLUSTAG with the sample files, the necessary directories, the DOS script for WCLUSTAG and the C script for setting the weights of the SNPs. There is also a sample all-in-one DOS script (execute_all.bat) of calling the WCLUSTAG script, handling different input file names and calling the CLUSTAG in one command. The differences between package v1 and package v2 is that, in package v2, the user can set different weights for each individual SNPs as they like. Also, there are the capacities for handling the cases with some SNPs data already done. Also, it can handle the case with some SNPs that are set as undersirable.
Sample program output for the minimax clustering are available here: the WCLUSTAG output.
Note: The java program TaggingSetChooser.jar
here is based on our previous CLUSTAG program, but it is enhanced with the
capability of handling the asymmetric correlation matrix.
The sample DOS
script (for the computer with the DOS
environment like the MS Windows DOS command window) is available for download
and the script for Mac and Linux for download.
A sample C script for assigning the
different weights to the intragenic regions and other regions can be downloaded
here. The path "data" and "results" can be changed. In case of no change, then
the user should create these two folders and the "temp" folder under the directory
containing the program file. The file names *.txt, *.members and *.out can also
be set by the users. The parameters "threshold" and the "scale" can be set by
the users. The "threshold" is for the threshold value used by the algorithms
and the "scale" is for the scale of the html output.
The program WCLUSTAG (package v1) requires two input files, one for the LD relationship between
each SNPs pair and the other for each SNP's position and MAF data. The data columns in the input files are separated by one space " ". And the
LD data can be generated, for example, with the Haploview. The sample
LD file and the sample MAF file (in ascending order of the SNP position) are
contained in the above zipped file. The C script can assign different
weights to the SNPs. And the source code of this script is provided so that the
user can set different rules for deciding the weights.
There are more input files for the
package v2. They are the ld.txt, maf.txt, weight.txt, done.txt, undesirable.txt,
and SNP_names.data files. The done.txt file contains the names of the SNPs that
have been tested in previous experiments, and the undesirable.txt file contains
the names of the SNPs that are set as undesirable. Both of these files can be
set to be empty if there are not such SNPs. The SNP_names.data contains the
names of the SNPs that appears in the maf file in the ascending order of their
position. The file can also be produced with the CLUSTAG, where it is produced
under the "data" directory. The output file ld_processsed_temp.txt
contains the information of the SNPs that are undersirable and that can (can
not) be tagged by other SNPs. Also, it has the weighted correlation matrix
displayed in a matrix format. The ld_processed.txt and maf_processed.txt are
files that are to be used as the input files for the script execute_test.bat and
they should be moved to the "data" directory. The variable max_ld in
the C script refers to the maximum value of the weights of the SNPs. The
threshold value in the execute_test.bat should be set to this value too. The C
script can handle a total of 2000 SNPs in each run by default. This total number
can be changed by replacing the matrix size variables 2000 in the declaration
section to another values.
The ENCODE genotype
data were downloaded from HAPMAP's site www.hapmap.org
on Jun 30, 2004 and is based on NCBI build 34. The gene coordinates data were for Ensembl
genes from NCBI build 34, downloaded from the UCSC Genome Browser http://genome-test.cse.ucsc.edu/cgi-bin/hgTables
choosing human genome and July 1993 assembly. The work is supported by small project grant to P.C.
Sham from The University of Hong Kong. Ao, I., et al. (2005) CLUSTAG: Hierarchical
Clustering and Graph Methods for Selecting Tag SNPs. Bioinformatics. In Press. Barrett,
J.C., et al. (2004)
Haploview: Analysis and Visualization of LD and Haplotype Maps. Bioinformatics,
Advance Access. Byng, M., et al. (2003) SNP Subset Selection
for Genetic Association Studies. Annals. Of Human Genetics, 67, 543-556. Carlson, C., et al. (2004) Selecting a
Maximally Informative Set of Single-Nucleotide Polymorphisms for Association
Analyses Using Linkage Disequilibrium. Am. J. Hum. Genet. 74:106-120.
Note on Clustering Methods:
Two more papers of the
clustering methods and their comparison for the gene expression data are
listed below:
Mcshane L.M., Radmacher M.D., Freidlin B., Yu R., Li M.C., and Simon R.
(2002) Methods for assessing reproducibility of clustering patterns observed in
analyses of microarray data. Bioinformatics. v11, 1462-9. Datta
S. and Data S. (2003) Comparisons and validation of statistical clustering
techniques for microarray gene expression data. Bioinformatics. v19, 459-466.
Note
on the minimax algorithm:
It is pointed out that there exists a parallel from topology with the
minimax algorithm. It is the Hausdorff distance, which measures the maximum
distance of a set to the nearest point in the other set. More details can be
found at: Rucklidge W. (1996) Efficient visual recognition
using the Hausdorff distance. Springer.
Note
on Run Time:
The
complexity of the clustering methods are of order O(n2). With the run
time information in our table of several hundred SNPs and this complexity
information, the users can estimate roughly the expected run time for their
samples before the program's execution. The run time will not be an issue for data
of several hundred to a hundred thousand SNPs. But, it will be a constraint when
we are studying the whole genome at one time, when the size may be of several
million SNPs. This is an area of further work as the HAPMAP project is producing
the whole genome haplotype information.
Note on Graph-Method Algorithm: Two more papers from
different journals are cited, which is about the graph-method algorithm. The one
by Johnson (1973) studied the error bound of the algorithm and the one by Fujito
(2001) studied the case of weighted edges. Johnson D. (1973) Approximation Algorithms for
Combinatorial Problems. Annual ACM Symposium on Theory of Computing. 38-49. Note on the CLUSTAG:
The result is published on: Ao, S.I., Kevin Yip, Michael K. Ng, David Cheung, Pui-Yee
Fong, Ian Melhado and Pak C Sham (2005) CLUSTAG:
Hierarchical Clustering and Graph Methods for Selecting Tag SNPs. Bioinformatics.
v21(8), 1735-1736. Note on the C script of WCLUSTAG:
The script in our download section
will set the new correlation c'[i] to C[max] - ( C[j]-R[ij]
),
where C[max] is the maximum of the C's in the chromosome region. And, the
threshold in the WCLUSTAG script is set at C[max] too. How to use it
Acknowledgements
References
Supplementary Information
Fujito T. (2001) On approximability of the independent/connected edge dominating
set problems. Information Processing Letters. v79, 261-266.*To whom correspondence should be addressed.