29 March 2013

The Life of PIE

Around 3600 BCE (give or take a millennium or two), humans in an area that may have included India and Eastern Europe (give or take a few thousand miles in any direction) spoke what linguists call Proto-Indo-European (PIE). As people spread out, conquered or were conquered, and assimilated the indigenous people around them, they began to live too far or too isolated from other related tribes. Their new environment was different from their now-distant cousins' world. Because of those environmental differences, their language gained new words, invented slang, and dropped words no longer needed. I know what you're thinking... "Come on, lady! You said you were doing a DNA series!" Well, I assure you, this is a great tie-in. Sit back and enjoy the ride.

A (very) small representation of the "family tree" of Proto-Indo-European languages

Proto-Indo-European is essentially the mother-tongue of hundreds of languages. PIE was the parent of Proto-Albanian, Proto-Armenian, Proto-Anatolian, Proto-Balto-Slavic, Proto-Celtic, Proto-Germanic, Proto-Greek, Proto-Indo-Iranian, Proto-Italic and Proto-Tocharian. To be sure, some researchers add more "children", some add fewer. Sadly, Proto-Anatolian and Proto-Tocharian have no living languages, having been replaced by the language of conquering peoples thousands of years ago. Proto-Albanian and Proto-Armenian mothered modern Albanian and Armenian, respectively (no siblings for these children!). All the others added several grandchildren (and a few great great great great grandchildren) to PIE's family tree. So what does this have to do with genetics? While a true scientific correlation between genes and language is still controversial, it's hard to argue that a rough pattern doesn't exist. As man emerged from a single origin point, he adapted to new environments. He became hairier in cold climates and gained more sweat glands in humid ones. He developed disease immunities and food allergies. Not every mutation is good, but any mutation that makes it to the next generation is a "winner".

I love talking languages almost as much as I like talking genealogy (sometimes it's a dead heat). One thing I enjoy using when talking about language is my first name, Starr. There is no language that I know of that doesn't have a word for "star". So it's a perfect way to illustrate the connectivity of language. And I can use that to illustrate some of the concepts of genetic genealogy that you need in order to set expectations and choose the right tests for your research. You'll note in the pictures above and below that some words are in red. Each is the word for "star" in the neighboring language. In case the large tree above is too small to read, Proto-Indo-European has the word H'ster. As you follow the branches, you'll see similarities follow through to almost every connected language. Because of accent, slang or some other mutation due to unique environments, each language changes the word just a little bit. But PIE still lives inside.

I put Persian and Urdu in their alphabet so you could see the spelling similarities and the other three in Roman characters to show a similarity of pronunciation.

Naturally, no one speaks PIE now. In fact, PIE existed (if it did exist) long before the written word, and we speak offshoots that have small similarities to PIE. So how do we prove PIE? Well, linguists noticed that Spanish, Italian, French and Romanian (and many others) all had similar words. This branch was easily connected to Latin (the language of Roman conquerors who tore through most of Europe all the way to England), because scholars and churches still used Latin. The gradual change from the mother-tongue to what are called the Romance languages was documented by their written records. Knowing when a document was created, researchers were able to identify when a spelling or total word change (mutation) happened and connect it to an earlier form of the language until they reached the purer form of Vulgar Latin (there is a so-called Church Latin that is a bit more formal). Scientists took other languages and studied their words, grammars and dates of first recorded use to help group the languages together and link them to similar but now dead languages. Parent languages were determined by having the same or similar words as all their resulting languages, but missing the differences of invented words. If three languages have the same or similar word for bicycle, but different words for car, then their parent language had no word for car, but a similar word for bicycle. As I point out in the photo above, Sanskrit uses the word Tara for "star". Vedic Sanskrit must have a similar word, because Sanskrit's sister, Prakrit, has children that use the same sounds in their words (Sitara and Takara). Also note that Persian is Urdu's 2nd cousin once removed, but their spelling for the word star is very similar. To be that close, linguists argue, Proto-Indo-European (or an intervening now dead language) had to have had a similar word and alphabet. By tracing when the differences pop up, linguists can get a rough guess of when a population diverged from its siblings.
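
The bicycle-and-car reasoning above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the three languages, their word forms, and the hand-assigned cognate labels are all made up for the example, and real reconstruction relies on regular sound correspondences rather than labels typed in by hand.

```python
# Toy sketch of inferring a parent language's vocabulary from its daughters.
# All language names, word forms, and cognate labels below are hypothetical.

# Each daughter language maps a concept to (word form, cognate group).
# Words in the same cognate group are assumed to descend from one older word.
DAUGHTERS = {
    "LangA": {"bicycle": ("velo", "V"), "car": ("auto", "A")},
    "LangB": {"bicycle": ("velu", "V"), "car": ("mashin", "M")},
    "LangC": {"bicycle": ("belo", "V"), "car": ("karr", "K")},
}


def inherited_concepts(daughters):
    """Return concepts whose words share one cognate group in every daughter."""
    concepts = set.intersection(*(set(words) for words in daughters.values()))
    inherited = set()
    for concept in concepts:
        groups = {words[concept][1] for words in daughters.values()}
        if len(groups) == 1:  # every daughter kept a form of the same word
            inherited.add(concept)
    return inherited


print(sorted(inherited_concepts(DAUGHTERS)))
```

With this made-up data, only "bicycle" is attributed to the parent language, because the three words for "car" fall into three different groups.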

While language "mutation" doesn't follow gene mutation exactly, the way geneticists determine parent genes is very similar to how linguists determine parent languages. When a gene strand, let's call it Ted, mutates, scientists mark the mutation. So after generations, four people decide to test their DNA. Judy has TedAB, Frank has TedABC, Mikki has TedABD, and Bernardo has TedAE. Obviously, Judy, Frank and Mikki are related more closely since they have the A and B mutations. Frank's family tree is documented to, say, Italy. Mikki is from Russia and Judy has documented Indian heritage for at least 10 generations. So scientists reason mutation C is an Italian mutation and D is a Russian mutation. At some point in the past (probably earlier than Mikki's family has documented), Frank and Mikki's ancestors were in India. Bernardo's family is less connected because he doesn't have the B mutation. At some point in ancient unknown history, Bernardo's family left the main group with the TedA mutation prior to the B mutation. Using historical documents and as many living test subjects as one can, scientists build an algorithm that guesses the most likely migration pattern of the Ted gene. Bernardo's family is connected to a large population of E mutations who still live in China. A scientist's best guess would be that the original TedA group split up with some going to India and some to China. But where did A come from? With more tests, scientists find a group of West Africans who have TedF. No A. An isolated tribe in South Africa is discovered and tested. They have TedFG. Aha! So their family originated from somewhere in West Africa. But where is the original Ted gene? As far as anyone could tell, since no one has found an unmutated Ted gene, the origins have to be somewhere between North and West Africa. That's a lot of mileage to cover.
The more people who are tested, and the more of them who have good documentation of their own family migration, the clearer scientists can make the picture of when and how Ted began to mutate.
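
For anyone who likes to see the logic spelled out, the Ted example can be written as a short Python sketch. The names and mutation letters come straight from the example; the helper functions are hypothetical and only compare sets of letters, which is far simpler than anything a real lab algorithm does.

```python
# Toy sketch of the "Ted" example: group testers by the mutations they share.
# The names and mutation letters come from the example above; the functions
# are hypothetical helpers, not a real genetic-genealogy algorithm.
from itertools import combinations

TESTERS = {
    "Judy": {"A", "B"},
    "Frank": {"A", "B", "C"},
    "Mikki": {"A", "B", "D"},
    "Bernardo": {"A", "E"},
}


def shared_mutations(testers):
    """Map each pair of testers to the set of mutations they have in common."""
    return {
        (a, b): testers[a] & testers[b]
        for a, b in combinations(sorted(testers), 2)
    }


def common_ancestor_mutations(testers):
    """Mutations carried by everyone, so present before the group split."""
    return set.intersection(*testers.values())


for pair, shared in shared_mutations(TESTERS).items():
    print(pair, sorted(shared))
print("shared by all:", sorted(common_ancestor_mutations(TESTERS)))
```

Every pair that includes Bernardo shares only A, while Judy, Frank and Mikki share both A and B with each other, which is the same grouping described above.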

Note how similar all those "sibling" languages are. (And not too far from their "cousin" Greek.)
When Richard III was recently discovered, scientists used two documented living relatives of Richard's sister and tested their mitochondrial DNA. Mitochondrial DNA is passed directly from mother to child with no interference from the father's side (with few exceptions). Its mutations are specific. One can watch the family tree of mtDNA grow and see where the changes were made. The more mutation markers that match between two people's mtDNA, the more closely related their direct maternal lines. It's still a bit of a best guess, but because of matches in the mtDNA (and other physical proofs), scientists feel confident enough to declare they have found Richard III. When you have your own DNA tested, whether it be the ethnicity (autosomal) test, Y chromosome, or mitochondrial DNA, you'll be matched against living people who have taken the test. You'll be connected to people who have the same mutations. The more mutations in common, the closer you're connected to that person. To find out if you are related to someone who is deceased (whether it be your unknown great grandfather or Charlemagne), you'll be matched to living people who have documented proof of their connection. If there is no hard proof that a living person is connected to the deceased person in question, you will not be able to prove a connection yourself.

Compare this to their Latin and Greek "cousins". Also note, Irish and Scots Gaelic have another name for "star" that is Seren or close to it. I wanted to show a variant that also has a similar "cousin".

Now, chances are you're reading this blog in English. Are you from England? Are your parents? If I went through your tree for 10 generations, would I find only English ancestry (1022 directly related people all from England)? I'm guessing not. English is of Germanic origin. This is better seen in Old English rather than Modern English. Why? Because English has had influence from several languages since its beginning. We no longer use "thee" and "thou", which interestingly were originally spelled with a letter that looked very similar to a y. When we dropped that letter, we replaced it with y (and that's how "thou" became "you"). We borrow from other modern languages for "taco", "kimono", and "aloha". We have thousands of borrowed or improved words from Latin, because it was the language of scholars and conquerors for so long. Prior to today's post, you may not have realised that English, Hindi and Albanian are cousins. But if you heard all three, you may have noticed similar sounds, words, or alphabets. Many people can mistakenly believe they are understanding a foreign language, because of these similarities that transcend written history.
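
A quick aside on the arithmetic behind that ancestor count, as a minimal Python sketch. The function name is made up for this example, and it assumes no ancestor appears twice in the tree. The 1022 figure matches counting yourself as generation one, which leaves nine generations of ancestors; counting ten full generations of ancestors gives 2046.

```python
def ancestors_through(generations):
    """Total direct ancestors across the given number of generations back.

    Generation 1 is your 2 parents, generation 2 your 4 grandparents, and so
    on, assuming no ancestor appears twice in the tree.
    """
    return sum(2 ** g for g in range(1, generations + 1))


print(ancestors_through(9))   # 1022
print(ancestors_through(10))  # 2046
```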

In genetics, mutations come in several forms. A gene can be deleted (goodbye to "thee" and "panchymagogue"). Or a gene can be mistranscribed in only one spot (what's called an SNP or single nucleotide polymorphism). This would be like the confusion of there/their or your/you're in English. A gene can be inverted (copied upside down). Similar to my new pet peeve: people who literally use literally wrong. I've already mentioned borrowed words. In genetics, that would be translocation or insertion. Every human gets 23 chromosomes from mom and 23 from dad. One pair is the sex chromosomes of XX or XY. The other 22 pairs are called autosomal and they provide most of our genetic makeup. In a chromosome pair, sometimes a gene will be transferred from mom's chromosome to dad's chromosome (or vice versa). Sometimes they'll swap genes. What this means is that when a cell divides to make the egg or sperm, the half that becomes the egg or sperm may have more or less of mom or dad in it, depending on which chromosome makes it to the new cell. The child made from this combination isn't an exact 25% of each grandparent and can be missing an ethnicity marker or have more than documentation would allow. You'll notice in the photo above the different words in Irish and Scots Gaelic that don't really seem to fit. There are many words for "star" in many languages and Irish Gaelic also uses "seren" like Welsh does. And because of the influence of English (due to Germanic invaders and modern prevalence), Welsh also accepts "star" in conversation. Irish and Scots share a different word that suggests an indigenous tribe of humans predated the Celts and was replaced (the now defunct father-language translocated its native word into the Celtic mother-language). And the English word "star" has been inserted into the Celtic languages.
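
The point earlier in this section, that a child isn't an exact 25% of each grandparent, can be shown with a toy simulation in Python. Everything here is a simplification assumed for the sketch: 22 equal-length autosomes, exactly one swap point per chromosome, and no sex chromosomes. Real recombination is messier, so treat the numbers as illustrative only.

```python
# Toy model of why a child is not exactly 25% of each grandparent.
# Assumptions (all simplifications made for this sketch): 22 equal-length
# autosomes, exactly one crossover per chromosome, sex chromosomes ignored.
import random

N_AUTOSOMES = 22


def parent_contribution(rng):
    """Split one parent's half of the child between that parent's own parents.

    Returns the fraction of the child's genome that traces to each of the two
    grandparents on this side; the two fractions always add up to 0.5.
    """
    from_first = 0.0
    for _ in range(N_AUTOSOMES):
        crossover = rng.random()      # where the two strands swap
        if rng.random() < 0.5:        # which grandparent's strand starts
            from_first += crossover
        else:
            from_first += 1.0 - crossover
    share_first = from_first / (2 * N_AUTOSOMES)
    return share_first, 0.5 - share_first


def grandparent_shares(seed=None):
    """Return the child's genome fraction from each of the four grandparents."""
    rng = random.Random(seed)
    mat_gm, mat_gf = parent_contribution(rng)
    pat_gm, pat_gf = parent_contribution(rng)
    return {
        "maternal grandmother": mat_gm,
        "maternal grandfather": mat_gf,
        "paternal grandmother": pat_gm,
        "paternal grandfather": pat_gf,
    }


for name, share in grandparent_shares().items():
    print(f"{name}: {share:.1%}")
```

Run it a few times and the four shares wobble around 25% without usually landing on it, while each side of the family always adds up to exactly half.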

These are so close to each other, they might as well be the "John Smith" of language!

Ideally, both linguists and geneticists want to find isolated groups of people to test. The more remote and insular, the better. Because of the large influence of Latin, Latin shows up in English despite English's Germanic origins. The above enlargement of the Proto-Balto-Slavic branch of PIE's family tree shows what scientists really hate to find. Proto-Balto-Slavic covers Central to Eastern Europe and a great deal of Asia. Where did it start? How long ago did it break away from PIE? Is Macedonian really a cousin of Croatian, or are they siblings? Where's the conclusive proof? There is none really. It's all a best guess. Researchers look at when languages were first documented. They identify the earliest known ethnic identity that is specifically different from its neighbor. And then they guess. (Seriously.) Anyone who has studied the history of the countries involved here knows that the borders changed more frequently than Taylor Swift's boyfriends. So where does one draw the line? Where does Lithuania end and Poland begin?

In genetics, this problem runs rampant. Anyone dealing with high Scandinavian or missing German ethnic markers knows what I mean. The Euro-Asiatic land mass was large and relatively accessible. People were coming and going and conquering and being conquered all of the time. One group would win today only to lose tomorrow. So testing for their specific location markers is difficult. A mutation may have originated in Scandinavia, but the Vikings roamed all over the place (more than once). And we all know about Genghis Khan! While the labs get more tests and refine the algorithms used to decide where Central Europe turns into Eastern Europe, we just have to be patient. As for Native American markers, many people want to know the specific tribe they belong to. That isn't possible, because so many tribes traded and intermarried, leaving no definitive mutations that point to one tribe over another.

Albanian and Armenian have changed a bit from their beginnings, but have no siblings. All Proto-Anatolian and Proto-Tocharian languages are now extinct, but notice how close they are to others like Welsh!
The photo above shows two still living languages (Albanian and Armenian) that have no siblings. Their respective origin tribes remained isolated and insular to the point that they remained purer to the original PIE. Before you start pointing it out, Yll may have at one time been Hyll and before that something closer to H'ster. Note also the two dead branches of Anatolian (which had the documented language of Hittite among others before going extinct) and Tocharian. Both of them share similarities to other branches of the tree that were geographically isolated from them. Linguists argue that the only way that is possible is if they have a common ancestor. Researchers hope to one day connect PIE to its sibling languages from around the world into a higher Proto language and go higher still. Few people give any credence to studies that try right now, as there is so much we don't know that it's all guesswork. It'd be the genealogical equivalent of connecting your grandmother to Adam and Eve.

Geneticists are also trying to find our origins, but have the same problems. Humans move and mate a lot. It's difficult to determine if the markers in a group's DNA are exclusive to them or a larger group. It's near impossible to be definitive on whether it proves a connection to nearby neighbors or indicates a deeper, older connection to a long dead group. All reasonable research shows a common origin point of Africa and a general migration pattern from there, but genealogists want something more definitive. And it's just not there yet. It really is more guess than science right now. Should we give up on it? Not at all! The more tests taken by more people, the more information we have to narrow down the results. The science, just like humans, is in constant evolution.

And I just realised I also illustrated why no one can define an origin point for a surname either. Dang, I'm good.


  1. I am also a student of linguistics. I enjoyed your discussion here. I did want to point out that You is actually the object form of Ye. What actually happened with Thou/Thee/Thy/Thine was a social shift. This singular/familiar form was gradually lost and replaced by the plural/formal form as a result of the rise of the middle class and social mobility in England during the Renaissance. I have been thinking recently about the use of the objective form of you as a subject and the loss of the subject form. You may be partially right in that thou and you would have sounded similar, ye sounding also like the objective thee. Anyway, I just wanted to add that minor correction while letting you know I appreciate your post.

    1. Sorry for the lateness of my reply. Thank you for your comments and it's always good to hear of another enthusiast out there. I appreciate your correction. I had read a book about the shift once and couldn't remember the details as well as I had hoped, but thought the example would be adequate to illustrate my point.

  2. Anonymous 31/3/13 05:12

    what is your Y-haplogroup ?

  3. My Y-haplogroup is R-P312.

  4. Forgive me. My Y-haplogroup is now R-DF27. I'm planning to test for SNP Z225. FTDNA is behind in designating and showing these new SNPs in its haplotree.