CIMMYT Molecular Characterization Data for Maize and Wheat
Marilyn Warburton, Jesper Norgaard, Juan Carlos Alarcón, Carlos G. López, Guy Davenport, Miguel Anducho-Reyes, Maria Luz George, Ellen Regalado and Sussane Dreisigacker



CIMMYT is pleased to present the Fingerprinting Databases, including all publicly released maize and wheat fingerprinting data (in two volumes). This database includes data from many CIMMYT and collaborating scientists, and we feel that this wide range of data make these databases so useful.  In order to use these ACCESS databases, you will of course need Microsoft ACCESS installed on the computer where you are working. If you have ACCESS, you are ready to proceed by following the instructions, provided in this manual.

First copy the Access database to your hard disk, and open it from there. If you open it from the CD, it will be read-only, and some operations and queries will not work.

 

To download the database you are interested in, click title below for:

Maize: "CIMMYT_FPDB_public_maize" (WinZip File 4.49Mb) or

Wheat: "CIMMYT_FPDB_public_wheat" (WinZip File 4.16Mb)

You will notice several tables and forms, which make up the fingerprinting database. Click first on “tables” under the object list to see a list of tables in this database.  The following tables are primary data tables, and I include an explanation of each (all other tables are generated from the primary data, and not important here):

Genotypes:

Include all the information on each genotype, individual, or population in the study.This includes species, name, number, alternate name, and pedigree. I have not been putting pedigree, because I hope to be able to link one day this database to ICIS/IWIS and have that part done automatically. There is also a column for Genotype ID number.  I try to use the ICIS/IWIS number when possible, if not, I assign it a random number that must be unique.  This is the number all the other tables use to refer to this germplasm. You can use the ID numbers of your germplasm if it exists, but if not, assign it one using some consistent schema such as, country abbreviation and number combination or “AMB” and a number.You can put numbers and letters in this field. (actually, in all the fields of this table, you can put numbers and letters). The column individual number is only when we genotyped many individuals from the same population; if it is an inbred line, leave it blank. Entering data into this table may be done manually, or may be imported from excel. Please see the table, RajabGenosIN, as an example of a table that was imported from excel. If the genotype is already in the database, it does not have to be imported again, as long as you use the exact same name.

Marker Data, Inds:

This is the actual data for the individuals or the populations genotyped one individual at a time.  Data entered here include the data ID, which is a number that Access assigns to the data as it is entered (so don’t include this when importing data); the Genotype ID number, from the Genotypes table; the Study ID number from the Studies table; the Marker Name from the Markers table, and Allele 1 and Allele 2.  Allele 1 and Allele 2 are the alleles that the genotype had for that marker; if there is only one, put the value in the column for Allele 1 and put the marker name (with no allele specification) in the column for Allele 2.  If there is missing data for that marker, put the marker name in both Allele 1 and Allele 2.  The alleles are named: marker name, followed by a dash, followed by the size in base pairs of the allele.  No spaces.  Data is entered into this table by importing from Excel; an example table of data imported from excel is found in the table MarkerDataIN

Marker Data, Pops:

This table is the same as Marker Data, Inds, except that instead of only having two allele columns, you may put as many as you have alleles for in the population, and following each Allele column is a Frequency column, where you enter the frequency of each allele in that population.  You must enter a 0 for every allele not present in a population, and if new data is to be entered where that allele was never even scored, you must enter the zeros manually before importing the data.  Missing data is a period in the case of this table ONLY; do not use periods in the Marker Data, Inds table.  Data is imported in the same way as with the Marker Data, Inds table, except that you must have the same number of columns as you have alleles.

Markers:

This table includes the marker name (must be a unique name for each marker, and please make sure your markers all have the exact name as the markers already in the table, unless they are new markers not already in the database); marker type (maize or wheat), marker source, and location, meaning genetic location.  I haven’t entered that, either, because I hope to link this table to Graingenes and MaizeGDB, where the data is already stored, plus much more.  Data is generally entered into this table manually, or may be imported from excel.  If the marker is already in the database, it does not have to be imported again, as long as you use the exact same name.

Studies:

The Studies table is a brief description of each study, which will allow people to know what data may be combined with what other data.   This table includes Study ID, which has to be a number (just use the next number in line); study description (a short text identifying the study); investigator (name of the person who did most of the work, or the Service Lab, etc); date (could be date the analysis completed, or date the data was input into the database); If the data is available for public release (yes or no); conditions, which is a memo field describing the lab conditions, and can be longer than the other columns; error rate within the study, if calculated (for example, by repeating the same genotype or genotypes more than once for every marker); and control genotypes, if used.  Data is generally entered into this table manually; each study must be unique.

Menu system

The menu system will allow you to access different functionalities of the database. You open it by clicking on “Forms” and then double click on f_main. The following screen appears:

 Top

From here you can do the following:

  • Create a 0, 1 matrix of study.  This is a common input format for some data analysis programs. You have to select a study from the “Select Study” box and then click on the “Create matrix of study” button which will generate the matrix of the study based on the data in the table Marker_Data_Inds. This can take some time depending on the size of the study.
  • Show table matrix. Once you have generated the matrix with the button above, this button will open one or several tables of “0” and “1” values (or “.” for missing data) which represents the study.
  • Merge studies to new study. This will allow you to combine all the data from two or more studies into a new study, from which you can generate and show matrix etc. You must type the study numbers separated by blanks in an input box, and then give a new name to the study which will be the merger of the data for the given study numbers. Once the data has been merged, you can generate the matrix of the new study, and show its table matrix. This will only show markers that both (or all) studies you wish to merge have in common. This might be time-consuming.
  • Import error file for selected study. This functionality also works with the clipboard, like the Import Clipboard functionality. Your data has to be in the below format, with the Error ControlName in the cell A1 (of the Excel spreadsheet), the titles “Marker”, “allele1”, “allele2”, “allele1”, “allele2”, … in the next line, and the data corresponding to it in the subsequent lines.  The headings “allele1”, “allele2” appear as many times as you have the control data repeated for the same marker in the study, which may only be once (in which case, a within-study error cannot be generated).
  • Generate Error Report for selected study. This will generate a report reporting the error rate within the study for each individual marker in an error file, and the total percentage of errors in the file.
  • Export as GCP template. This will export all the data of a selected study in an Excel file in a format with an Excel file with 5 sheets, experiment, conditions, data_list, marker and accessions.

 

  • Import from Clipboard will import the data to table Marker_Data_Inds which has to be in an Excel file or similar with the right format, see below for the details. In Excel you will highlight the data you want to import and click “copy”, then click the “Import from Clipboard” button from the Access database. This will import the study directly into the “Marker data, Inds.” table, and might be time-consuming.
  • Query error table data. This will permit you to select certain items like markers, alleles, genotype IDs, genotypes names, studies from a number of select lists, to be able to generate a report with the data that corresponds to your criteria.

Import Error File

Here is screenshot of the Import Error file functionality

This is the general format of the above example of Error File:

 

<ControlName>

 

 

 

Marker

Allele1

Allele2

Allele1

<marker1>

<value>

<value>

<value>

<marker2>

<value>

<value>

<value>

<marker3>

<value>

.<value>

<value>

 

To import the error file, you have to select the whole area from Excel that you want to import (it has to have the above format, e.g. “Allele” in the first cell) and then click CTRL+C or right click and then from the menu click “copy”.

Top

Generate Error Report for selected study

Now you can go to the main menu of the Fingerprint database, select the corresponding study,  and click  “Import error file for selected study”. When this import is complete, you can use the next button “Generate Error Report for selected study” and you will get a report similar to this:

The above is a screenshot of the Error Report. Note that you can generate combined error report for several studies if you type in the other study numbers (apart from the selected study) in this message box (here we have selected studies 2, 6 and 15):

If you don’t want to combine error reports for several studies, just click OK in the above message box without filling in the input line.

Query data

From the main menu you can click “Query data” which will show the following selection screen for querying the error data. There are lists of alleles, genotypes, genotype names, markers and studies. Each list is present so you can select the specific values you want. You can single click on each element you want selected for the criteria. Note that if you haven’t selected any items in a list, all items will be considered in the query.

From the Query menu you can query the error tables that have been imported. You can build the criteria from the menu by selecting specific alleles, genotypes, genotype names, markers and/or studies that you are interested in. Example would be to select from above the allele “dp023-228” and “dp036-163” from the Alleles column, and markers “dp023” and dp036”, and then study 1, 5 and 16. You can click the “Criteria summary” button and it will give you a summary of the selection criteria like this:

Alleles must be one of the following:

   dp023-228

   dp036-163

Markers must be one of the following:

   dp023

   dp036

Studies must be one of the following:

  (1)   Omani Wheat Landraces

  (5)   M. Reynolds Wheat lines

  (16)  MegaEnvironment Wheats

 

Note that the meaning of the above criteria is “Alleles must be <dp023-228 or dp036-163> AND Markers must be <dp023 or dp036> AND Studies must be <1 or 5 or 16>.

Here is the corresponding output after clicking the “Generate Report” button:

Above each list in the query screen is a checkbox that has to be selected if you want that entity in the output. As an example, if we had deselected the “Alleles” checkbox in the Query screen, then in the above output the columns “Allele1” and “Allele2” would not be present.

Note that it is possible to perform criteria for columns that you don’t include in the output, for instance if you want data only from the study 16, you might want to deselect columns Study ID and Study Name, to exclude these two from the output, otherwise each record would have Study ID 16 and Study Name “MegaEnvironment Wheats”.

Below each list is a “Clean” button which can be used to deselect all the selected values in that particular list, also the ones that are not visible, since the lists might have more values than can be seen on the screen (scroll up and down with the scroll bar in that list). And note that to view which values have been selected in all of the list boxes, it is convenient to use the “Criteria Summary” button.

Export Template

When generating an Excel file with “Export Template” from the Main window, you have to select a study first from the study select box, and then click “Export Template”. This will generate an Excel file which will look like this:

Note that there are data in each of the sheets below “experiment”, “conditions”, “data_list”, “marker” and “accessions”.

Common Alleles within studies

If you click the “Common alleles within studies” from the query screen, you will get the alleles that are present in *all* of the studies that you have selected. For instance, if you have studies 2, 5, and 11 selected, and allele dp023-228 was present in study 2 and 5, but not 11, then the allele dp023-228 would *not* appear in the output.

If you want the alleles that are present in *at least one* of the studies you have selected, you could get that by selecting the studies you want, select the alleles checkbox only, and then deselect the checkboxes for genotypes, genotype names, markers, study ID and study Name. In the above example, allele dp023-228 *would* appear in the output.

Example: Querying Markers that are present in at least one study

If you want a list of all of the markers that are present in at least one study, deselect all other checkboxes than Markers, make sure all selections are clean (e.g. with the respective clean buttons, or by checking the criteria summary) and then click “Generate Report”.

Importing and exporting data:

To import data into Access from excel, there are two possibilities.  You may either import an external file that is already formatted exactly the same as the table in the FP database you are importing into, or you may use the “Import Clipboard” function from the main menu, which will import data in 0/1 format.

To import an external file, close all tables, and use File, Get External Data, Import.  A window will open where you will browse for your excel file.  Change the “Files of type” list to “Microsoft Excel”.  Double click the file you wish to import, and an import wizard will open; Click the box “first row contains column headings” (note, your first row must have the EXACT same column headings names as the table in access you are importing into).  Click Next. Check the box, “In an existing table”, and use the pull down menu to choose the table you are adding data to.  Click Next.  Click finish; if no error messages are displayed, you did the import correctly!  Open the table in Access to find your data added; if correct, you may save Access, and if not, exit without saving, or delete the rows you added to the table and try again.

To import data into the Marker Data, Ind. Table that is in a 0/1 format using f_main:

You must have an Excel file with “Allele” in cell A1, and then the Genotypes in each row, and the Markers in each column, like this: 

Allele

AMB-01

AMB-02

AMB-03

nc130-139

0

0

1

nc130-142

1

1

0

nc130-145

0

0

1

nc130-148

0

.

9

 

A “0” in the grid means that this marker was not present for this genotype, and a “1” means that the marker was present for this genotype. We can see above that marker AMB-01 and AMB-02 were not present for genotype nc130-139, and that marker AMB-03 was present for genotype nc130-139.

A dot “.” or a “9” means missing data, it wasn’t measured. You can see that markers AMB-02 and AMB-03 were not measured for genotype nc130-148.

Note that importing will insert the new genotypes and markers if they didn’t exist already, but without the additional data like Species etc. That means that it is better to define manually the genotypes and markers with its respective data (species etc.) in tables Genotypes and Markers before importing the 0/1 data.

You must use the f_main form for importing from the clipboard.  Click “forms” under the Objects menu, and then double click f_main.  Before proceeding any further, open Excel, and select the data you wish to import.  Using the example table, dataIN, select all data, or only a subset, if you wish.  From excel, click “copy” in order to put the selection into the clipboard.  Then, return to Access and click the box, “import from clipboard”.  This may take a few minutes.  When finished, go to the Marker Data, Inds table to confirm that your data was entered correctly.

If you are not very familiar with how to select an area in Excel, you could follow these steps. Suppose we want to select all the data in the above Excel file (cells A1 to D5). Click on the A1 cell with “Allele”. Now hold down the shift button and simultaneously click first with the right arrow key (which will select the line with “Allele” to “AMB-03” values) and then the down key (selecting the whole area from “Allele” to “9”). Now click CTRL+C. Note that this method will work even if you a much bigger area in your Excel file, that can only be seen by scrolling. When the selected area has passed to the Clipboard, the area in Excel will be blinking. Whenever the selected area in Excel is *not* blinking anymore, then it is no longer present in the clipboard. This would happen if you copied something else to the clipboard for instance from another application like Notepad.

To output your data into a 0/1 table:

You must use the f_main form for this.  Click “forms” under the Objects menu, and then double click f_main.  Select the study you wish to view as a 0/1 table, then click “Create matrix of study”.  When this is finished (it may take quite awhile, in the larger studies), click “show table matrix”.  Another matrix will open, which you can then save as a table in Access, or Save As Excel file to another file.

Clean up datasets before importing!

If you have multiple bands at an allele, you can remove some of them if you believe they are a stutters, shadow bands, plus-A bands, and other PCR and electrophoresis artifacts.  The software for scoring most SSR data from an automatic DNA sequencer often have functions that allow you to do this automatically.  If not, you can remove them manually if the extra bands are of lower intensity, characteristic of stutters for example (that is, one strong band, followed by one or more less intense bands, each one is one repeat unit smaller than the one before it.  All stutter bands should be removed).  Even if they do not behave like a stutter, if only one or two bands are intense, and all other bands much less intense, the other bands can be removed, if you know you are working with diploid individuals.

You may also choose to remove bands that do not follow the repeat type, within a certain error, which is usually:

for 2 base repeats bands are every two base pairs, plus or minus 0.75 – 0.95 bases.

for 3 base repeats, bands are every three base pairs, plus or minus 1.25 – 1.45 bases.

for 4 base repeats, bands are every four base pairs, plus or minus 1.5 bases - 1.75 bases).

for 5 base repeats, bands are every five base pairs, plus or minus 2 bases.

for 6 base repeats, bands are every six base pairs, plus or minus 2.5 bases. 

Note the error, in relation to the separation between two alleles, get smaller as the repeat unit gets larger, thus increasing your confidence (and one main reason we do not recommend using 2 base repeats at all).  Caution is warranted in assuming all polymorphisms follow the repeat type, however, as some polymorphisms in SSRs have been shown to fall outside the repeat unit, and therefore not follow any pattern at all.

Delete Study

This functionality you can use by selecting a study from the study listbox and then clicking on the wastebasket icon on the right of the study listbox:

Since this may delete a lot of data, you have to confirm a warning message that will also tell you how many records in the tables Marker_Data_Inds, Marker_Data_Pops and ErrorCheck will be deleted, to be able to make an informed decision. If you click “Yes” the study will be deleted, if you click “No” nothing will happen and the delete request is cancelled.

Useful queries

Some of the queries can be used to check if your data is correct. These queries are documented below. The rest of the queries you should not try to run yourself, they were created for the import functions only. To run a query, open the “Queries” tab and then double click on the query name for the query you want.

qry_check_missing_genotypes

This query will give you a list of all the genotypes that have been used in table Marker Data, Inds but has not been defined in table Genotypes. There must be matching records from field Genotype ID Number in table Genotypes to field Genotype ID Number in table Marker Data, Inds.

qry_check_missing_markers

This query will similarly give you a list of all the Markers that have been used in table Marker Data, Inds but have not been defined in table Markers. The matching fields in the two tables have the name Marker Name.

If you don’t have any missing markers, the output from the query will look like this:

If you had missing markers, they would be present in the records in the output screen above. Your job would then be to add these to the Markers table, so that data consistency is restored. Remember to fill out all fields in the Markers table, for instance the Forward primer, the Reverse primer, the Marker location etc.

Top

March , 2006