------------------------------------------------------------------
CHROMOSOME DATA EXPORTED FROM M R C HUMAN GENETICS UNIT, EDINBURGH
------------------------------------------------------------------

You may obtain data from any one or more of 6 different data sets, as
follows:

Cph.data	"Copenhagen" 180 cell data set (Lundsteen and Granum)
Edi.data	"Edinburgh" 125 cell data set (Piper)
Phi.data	"Philadelphia" 160 cell data set
RBA.data	R-band 53 cell data set (courtesy Applied Imaging)
600.data	600-band 136 cell data set (courtesy Applied Imaging)
CPR.data	Routine amniotic 2668 cell data set (courtesy C. Lundsteen)

Generally, the data in each directory is organised as follows:
	a/...	Half the data, Woolz format (*)
	b/...	The other half, Woolz format (*)
	sa/...	The same as a, but rectangular format (see below)
	sb/...	The same as b, but rectangular format
	fa/...	Symbolic feature data in individual file per metaphase
	fb/...	The same as fa, other half of data.
	profsymb/...	Symbolic profiles from a and b (see below)
	homsymb/...	Profile similarity matrices derived by
				cross-correlation
	aa/...	Half the likelihood data
	bb/...	The other half
	ANYTHING IN UPPER CASE	Assorted shell scripts

(*) In Cph.data these are subdivided according to Granum and
Lundsteen's original sets.  In my work, set A was all data in
subdirectories whose names begin with "l" (learning) and set B was all
data in subdirectories whose names begin with "t" (test).
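
As an illustration of the set A / set B split described in the note
above, here is a minimal Python sketch.  The function name
split_cph_sets and the idea of passing in a single parent directory are
assumptions made for the example; where exactly the "l" and "t"
subdirectories sit on your copy of Cph.data is not stated here, so
check before relying on it.

```python
from pathlib import Path


def split_cph_sets(root):
    """Partition the immediate subdirectories of `root` by first letter.

    Names beginning with "l" go to set A (learning), names beginning
    with "t" go to set B (test).  Anything else is ignored.
    """
    set_a, set_b = [], []
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue  # plain files are not part of either set
        if sub.name.startswith("l"):
            set_a.append(sub)
        elif sub.name.startswith("t"):
            set_b.append(sub)
    return set_a, set_b
```

Calling split_cph_sets on the relevant directory returns two sorted
lists of paths, one per set.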

Files whose name has the suffix ".Z" have been compressed by the Unix
"compress" utility in order to save space; "uncompress" will restore them.

"Rectangular" data format:
-------------------------

We usually store our chromosomes in a run-coded ("Woolz") format.  To
export this data to non-Woolz sites, I have made each chromosome a
rectangular array of pixels and have filled in the "background" with
pixels whose value is zero.

Each data file is a binary file containing a sequence of chromosome
data structures in the following format:

Byte 0:		1 if chromosome segmented automatically
		2 if chromosome segmented manually
		255 if end-of-file
Byte 1:		Class (0=unclassified, 1 - 22, X=23, Y=24, 25=abnormal)
Bytes 2-5:	first line of chromosome (32-bit integer in Sun format,
		i.e. big-endian: first byte on tape is most significant
		in word).
Bytes 6-9:	last line of chromosome (32-bit integer)
Bytes 10-13:	first column of chromosome (32-bit integer)
Bytes 14-17:	last column of chromosome (32-bit integer)
Bytes 18-(17+n):	Chromosome pixels, where n = number of lines * number of columns

Byte (18+n):	Start of next chromosome (etc., etc.).
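
The record layout above can be read with a short Python sketch such as
the following.  It is not part of the distributed software, and it
rests on assumptions the table does not spell out: that the first/last
line and column values are inclusive (so number of lines = last - first
+ 1), that each pixel occupies one byte, and that pixels are stored
line by line.  The function name read_chromosomes and the dictionary
keys are invented for the example.

```python
import struct

# 18-byte header: segmentation flag, class, then four big-endian
# ("Sun format") 32-bit integers.  ">" disables padding.
HEADER = struct.Struct(">BB4i")


def read_chromosomes(stream):
    """Yield one dict per chromosome record from a binary stream."""
    while True:
        first = stream.read(1)
        if not first or first[0] == 255:  # 255 marks end-of-file
            return
        rest = stream.read(HEADER.size - 1)
        if len(rest) != HEADER.size - 1:
            raise ValueError("truncated chromosome header")
        seg, cls, line0, line1, col0, col1 = HEADER.unpack(first + rest)
        # ASSUMPTION: bounds are inclusive.
        rows = line1 - line0 + 1
        cols = col1 - col0 + 1
        if rows <= 0 or cols <= 0:
            raise ValueError("non-positive chromosome dimensions")
        data = stream.read(rows * cols)  # ASSUMPTION: one byte per pixel
        if len(data) != rows * cols:
            raise ValueError("truncated pixel data")
        # ASSUMPTION: pixels are stored line by line (row-major).
        pixels = [data[r * cols:(r + 1) * cols] for r in range(rows)]
        yield {
            "segmentation": seg,  # 1 = automatic, 2 = manual
            "class": cls,
            "first_line": line0,
            "first_column": col0,
            "rows": rows,
            "cols": cols,
            "pixels": pixels,     # list of bytes objects, one per line
        }
```

Open the file in binary mode, e.g. open(name, "rb"), and iterate over
read_chromosomes(f).  Background pixels appear as zero bytes.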

Symbolic chromosome profiles:
----------------------------

The data is stored as a separate file for each cell.

Within each cell, each profile is formatted as follows:

	<true class> <number of points> (<orientation flag>) <Cpos> <CCpos>
	<point1> <point2> ... <pointn>

<true class> takes a value in the range 1 - 24 (X=23, Y=24).
<Cpos> is the machine-found centromere position.
<CCpos> is the true centromere position (-1 except in Cph data).

The orientation of the profiles is decided by fully automatic location
of the centromere.  <orientation flag> has value 1 if this orientation
is in fact correct and -1 if it is incorrect (the flag is not
necessarily reliable in CPR.data).

No smoothing or normalisation has been employed.

The machine-found axis has been used without correction.

Only those rare chromosomes unclassifiable by a human operator
have been omitted.
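
A rough Python sketch for reading one cell's profile file follows.
Several things here are guesses rather than documented facts: that the
fields are separated by whitespace, that the orientation flag is always
present and may or may not be wrapped in parentheses, and that the
profile points may run over more than one line.  Because the encoding
of the symbolic points is not described here, the sketch keeps them as
raw strings.  The name parse_profiles is invented for the example.

```python
def _num(token):
    """Convert a token to int if possible, otherwise to float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_profiles(text):
    """Parse the text of one cell file into a list of profile dicts."""
    tokens = text.split()  # ASSUMPTION: whitespace-separated fields
    profiles = []
    i = 0
    while i < len(tokens):
        if len(tokens) - i < 5:
            raise ValueError("truncated profile header")
        true_class = int(tokens[i])
        n_points = int(tokens[i + 1])
        # ASSUMPTION: flag written as "1", "-1", "(1)" or "(-1)".
        orientation = int(tokens[i + 2].strip("()"))
        cpos = _num(tokens[i + 3])
        ccpos = _num(tokens[i + 4])
        start = i + 5
        points = tokens[start:start + n_points]  # kept as raw strings
        if len(points) != n_points:
            raise ValueError("truncated profile points")
        profiles.append({
            "true_class": true_class,
            "orientation": orientation,
            "cpos": cpos,
            "ccpos": ccpos,
            "points": points,
        })
        i = start + n_points
    return profiles
```

If the files turn out to differ from these guesses (for example, if the
flag is omitted in some data sets), the header parsing will need
adjusting.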


Feature data:
------------

Each line of the file consists of the chromosome class followed by the
30 values of the features described below (normalised cell-wise).  The
normal classes are 1 - 24 (X = 23, Y = 24).  If the data set contains
abnormal or unclassifiable chromosomes then these will have class
number >=25.  At the end of each cell there is a line of data with
class -1 and every feature value 0.0.
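
The line format above lends itself to a simple reader.  The Python
sketch below groups chromosomes into cells using the class -1
terminator line.  It assumes the values on each line are separated by
whitespace, which is not stated explicitly here; the name read_cells is
invented for the example.

```python
N_FEATURES = 30


def read_cells(lines):
    """Group feature lines into cells.

    Returns a list of cells; each cell is a list of
    (class, [30 feature values]) tuples.
    """
    cells, current = [], []
    for line in lines:
        fields = line.split()  # ASSUMPTION: whitespace-separated
        if not fields:
            continue  # skip blank lines
        # Tolerate a class written as "-1" or "-1.0".
        cls = int(float(fields[0]))
        values = [float(v) for v in fields[1:]]
        if len(values) != N_FEATURES:
            raise ValueError("expected %d feature values" % N_FEATURES)
        if cls == -1:
            # Terminator line: close the current cell.
            cells.append(current)
            current = []
        else:
            current.append((cls, values))
    if current:
        # Data ended without a terminator; keep what was read.
        cells.append(current)
    return cells
```

Pass an open text file (or any iterable of lines).  Chromosomes with
class >= 25 are returned like any other, so filter them afterwards if
only the normal classes are wanted.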


The 30 features measured by the MRC Edinburgh chromosome system:
---------------------------------------------------------------

This note refers to data extracted after August 1991. A few features
(notably number 7) previously had different interpretations.

 0.	Normalised area
 1.	Size  =  normalised area   +   k * normalised length
 2.	Relative density  =  mass / area
 3.	Area centromeric index (scaled)
 4.	Mass centromeric index (was RATIO massci / areaci)
 5.	CVDD
 6.	NSSD
 7.	Normalised length (previously various things either highly
	correlated with other features or with no discrimination at all!)
 8.	WDD-1
 9.	WDD-2
10.	WDD-3
11.	WDD-4
12.	WDD-5
13.	WDD-6
14-19.	MWDD-1 -- MWDD-6
20-25.	GWDD-1 -- GWDD-6
26.	Length centromeric index (scaled)
27.	Normalised convex hull perimeter
28.	Number of density profile maxima (number of bands)
29.	NBINDEX (number of bands in half profile / number of bands)

For more details of features 0-27, see J. Piper and E. Granum (1989)
"On fully automatic feature measurement for banded chromosome
classification". Cytometry 10:242-255.


Queries (and acknowledgements!) to:
----------------------------------

Jim Piper				Phone:	+44 31 332 2471
MRC Human Genetics Unit			Fax:	+44 31 343 2620
Edinburgh				Email:	jimp@hgu.mrc.ac.uk