------------------------------------------------------------------
CHROMOSOME DATA EXPORTED FROM M R C HUMAN GENETICS UNIT, EDINBURGH
------------------------------------------------------------------

You may obtain data from any one or more of 6 different data sets,
as follows:

Cph.data    "Copenhagen" 180 cell data set (Lundsteen and Granum)
Edi.data    "Edinburgh" 125 cell data set (Piper)
Phi.data    "Philadelphia" 160 cell data set
RBA.data    R-band 53 cell data set (courtesy Applied Imaging)
600.data    600-band 136 cell data set (courtesy Applied Imaging)
CPR.data    Routine amniotic 2668 cell data set (courtesy C. Lundsteen)

Generally, the data in each directory is organised as follows:

a/...                     Half the data, Woolz format (*)
b/...                     The other half, Woolz format (*)
sa/...                    The same as a, but rectangular format (see below)
sb/...                    The same as b, but rectangular format
fa/...                    Symbolic feature data in individual file per metaphase
fb/...                    The same as fa, other half of data.
profsymb/...              Symbolic profiles from a and b (see below)
homsymb/...               Profile similarity matrices derived by cross-correlation
aa/...                    Half the likelihood data
bb/...                    The other half
ANYTHING IN UPPER CASE    shell scripts for this and that

(*) In the Cph.data these are subdivided according to Granum and
Lundsteen's original sets. In my work, set A was all data in
subdirectories whose name commences with l (learning) and set B was
from directories with name commencing t (test).

Files whose name has the suffix ".Z" have been compressed by the Unix
"compress" utility in order to save space; "uncompress" will restore
them.

"Rectangular" data format:
-------------------------
We usually store our chromosomes in a run-coded ("Woolz") format. To
export this data to non-Woolz sites, I have made each chromosome a
rectangular array of pixels and have filled in the "background" with
pixels whose value is zero.
Each data file is a binary file containing a sequence of chromosome
data structures in the following format:

Byte 0:            1 if chromosome segmented automatically
                   2 if chromosome segmented manually
                   255 if end-of-file
Byte 1:            Class (0=unclassified, 1 - 22, X=23, Y=24, 25=abnormal)
Bytes 2-5:         first line of chromosome (32-bit integer in Sun format,
                   i.e. first byte on tape is most significant in word).
Bytes 6-9:         last line of chromosome (32-bit integer)
Bytes 10-13:       first column of chromosome (32-bit integer)
Bytes 14-17:       last column of chromosome (32-bit integer)
Bytes 18-(17+n):   Chromosome pixels, where
                   n = number of lines * number of columns
Byte (18+n):       Start of next chromosome (etc., etc.).

Symbolic chromosome profiles:
----------------------------
The data is stored as a separate file for each cell. Within each cell,
each profile is formatted as follows:

    () ...

 takes a value in range 1 - 24 (X=23, Y=24).
 is machine-found centromere position
 is true centromere position (-1 except in Cph data).

The orientation of the profiles is decided by fully automatic location
of the centromere. If this is in fact correct, then has value 1, and -1
if it is incorrect. (not necessarily reliable in CPR.data).

No smoothing or normalisation has been employed. The machine-found axis
has been used without correction. Only those rare chromosomes
unclassifiable by a human operator have been omitted.

Feature data:
------------
Each line of the file consists of the chromosome class followed by the
30 values of the features described below (normalised cell-wise). The
normal classes are 1 - 24 (X = 23, Y = 24). If the data set contains
abnormal or unclassifiable chromosomes then these will have class
number >=25. At the end of each cell there is a line of data with
class -1 and every feature value 0.0.

The 30 features measured by the MRC Edinburgh chromosome system:
---------------------------------------------------------------
This note refers to data extracted after August 1991.
A few features (notably number 7) previously had different
interpretations.

0.      Normalised area
1.      Size = normalised area + k * normalised length
2.      Relative density = mass / area
3.      Area centromeric index (scaled)
4.      Mass centromeric index (was RATIO massci / areaci)
5.      CVDD
6.      NSSD
7.      Normalised length (previously various things either highly
        correlated with other features or with no discrimination
        at all!)
8.      WDD-1
9.      WDD-2
10.     WDD-3
11.     WDD-4
12.     WDD-2p
13.     WDD-6
14-19.  MWDD-1 -- MWDD-6
20-25.  GWDD-1 -- GWDD-6
26.     Length centromeric index (scaled)
27.     Normalised convex hull perimeter
28.     Number of density profile maxima (number of bands)
29.     NBINDEX (number of bands in half profile / number of bands)

For more details of features 0-27, see J. Piper and E. Granum (1989)
"On fully automatic feature measurement for banded chromosome
classification". Cytometry 10:242-255.

Queries (and acknowledgements!) to:
----------------------------------
Jim Piper                  Phone: +44 31 332 2471
MRC Human Genetics Unit    Fax:   +44 31 343 2620
Edinburgh                  Email: jimp@hgu.mrc.ac.uk
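
Reading the files (illustrative sketch):
---------------------------------------
The following is a minimal Python sketch, not part of the original
distribution, of how the "rectangular" files and the feature files
described above might be read. The names read_chromosomes,
parse_feature_cells and Chromosome are hypothetical. It rests on
several assumptions the text does not settle: that the 32-bit integers
are signed, that first/last line and column are inclusive (so the
number of lines is last - first + 1), that pixels are stored row by
row, that the end-of-file marker may be a single byte 255, and that
fields in a feature file are separated by white space.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, List, NamedTuple, Tuple

END_OF_FILE = 255
# Big-endian ("Sun format"), no padding: 2 single bytes + 4 32-bit integers = 18 bytes.
HEADER = struct.Struct(">BB4i")  # assumption: the integers are signed


class Chromosome(NamedTuple):
    segmentation: int  # 1 = automatic, 2 = manual
    klass: int         # 0 = unclassified, 1-22, X = 23, Y = 24, 25 = abnormal
    first_line: int
    last_line: int
    first_col: int
    last_col: int
    rows: List[bytes]  # one bytes object per line; background pixels are 0


def read_chromosomes(stream: BinaryIO) -> Iterator[Chromosome]:
    """Yield chromosomes from an uncompressed rectangular-format file."""
    while True:
        first = stream.read(1)
        if not first or first[0] == END_OF_FILE:
            return  # end-of-file marker, or the physical end of the data
        rest = stream.read(HEADER.size - 1)
        if len(rest) != HEADER.size - 1:
            raise ValueError("truncated chromosome header")
        seg, klass, l0, l1, c0, c1 = HEADER.unpack(first + rest)
        # Assumption: both bounds are inclusive.
        n_lines, n_cols = l1 - l0 + 1, c1 - c0 + 1
        if n_lines <= 0 or n_cols <= 0:
            raise ValueError("non-positive chromosome dimensions")
        data = stream.read(n_lines * n_cols)  # one byte per pixel
        if len(data) != n_lines * n_cols:
            raise ValueError("truncated pixel data")
        # Assumption: pixels are stored line by line (row-major).
        rows = [data[r * n_cols:(r + 1) * n_cols] for r in range(n_lines)]
        yield Chromosome(seg, klass, l0, l1, c0, c1, rows)


def parse_feature_cells(
    lines: Iterable[str],
) -> List[List[Tuple[int, List[float]]]]:
    """Group feature lines into cells; a line with class -1 ends a cell."""
    cells: List[List[Tuple[int, List[float]]]] = []
    current: List[Tuple[int, List[float]]] = []
    for line in lines:
        fields = line.split()  # assumption: white-space separated fields
        if not fields:
            continue
        klass = int(float(fields[0]))
        if klass == -1:
            cells.append(current)
            current = []
        else:
            current.append((klass, [float(v) for v in fields[1:]]))
    if current:  # tolerate a final cell with no terminator line
        cells.append(current)
    return cells
```

A file with the ".Z" suffix would need to be restored with
"uncompress" first; the resulting file can then be opened in binary
mode and passed to read_chromosomes, and an open feature file can be
passed directly to parse_feature_cells.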