Lineage Specific LncRNAs and MEP development

Unappreciated lncRNAs, identification and spatial relationship to coding genes

Vikram R. Paralkar et. al have shown that lncRNAs are co-regulated with coding genes and can influence the maturation of erythroblasts. Erythroblasts are lineage from HSPCS (CD 34, 45, 59) which produce erythrocytes (red blood cells). LncRNAs are non-coding RNAs with at least 200 nucleotides in length. Paralkar et.al used polyA+ selection and deep sequencing to address the hypotheses 1) Do erythroid progenitors express lineage specific lncRNAs? 2) how are lncRNAs regulated?

Their manuscript developed an algorithm for identifying novel, unannotated, lncRNAs using 3 different algorithms which identify the non-coding potential of targeted regions. The algorithms used were 1) BlastX and Pfam protein database homology 2) codon conservation using PhyloCSF 3) presence of long open reading frames > 100 amino acids using EMBOSS GetORF.

Algorithm 1) can determine the absence of protein creation, hence non-coding and 2) 3) can assess the long non-coding potential by studying the homology and open-reading frames. Using the 3 algorithms they can cross-validate which potential novel lncRNAs are high-strigent and low strigent classifications. The HS-lncRNA classification had 64% novel lncRNAs that were not identified in RefSeq, UCSC, and ENSEMBL which is quite impressive because ENSEMBL offers leading annotation updates.

Comparing the lineage specificity, 90% coding gene mRNAs were found in all three clone types MEPs, Megakaryocytes, erythrocytes, whereas only 33% of the high-stringency lncRNAs were found in all three types. This indicates that lncRNAs may be lineage specific. Comparing lncRNAs to coding TSS, the proportion of HS/LS lncRNAs were equally frequent in diverge, antisense, intronic, and intergenic regions. In comparison to coding genes the orientation of lncRNAs is uniform, without specification of spatial relationships (ORF) to coding genes.

Transcription factors regulating mouse eyrthro-megakaryocytic lncRNAs

This group investigated promoter and enhancer signatures to determine how lncRNAs are transcribed. They used Chip-Seq for H3K4me1, H3K4me3, H3K436me3 histone marks on liver erythroblasts and megakaryocytes and mapped the chromatin signatures to genomic loci of coding genes and lncRNAs. Constructing a ratio defining promoters as H3K4me3-high/H3K4me1-low, and defining enhancers as H3K4me1-high/H3K4me3-low. ~25% of the TSS of lncRNAs had enhancer-like signatures which was significantly greater than coding genes (<10%) (Fisher’s Exact Test p= $10^{-12}$ erythroblasts, $10^{-14}$ megakaryocytes ); and ~75% of lncRNAs were transcribed from regions with promoter-like signatures. I wonder if in this calculation of TSS and chromatin signature if the length of the lncRNA was weighted or is the TSS independent of the length; intuitively the latter must be true (?).

To investigate the TF regulation, they performed Chip-Seq for GATA1, and TAL1 (erythroblasts) and GATA1,GATA2, TAL1, FLI1 in megakaryocytes (ENCODE definitions of lineage specific transcription factors). Vikram et. al compared binding of TFs in all three cell types (lncRNAs and coding genes) and found that the TFs were present uniformly frequently. In erythropoiesis, GATA1 was downregulated showed significantly higher frequencies in HS lncRNAs. The meaning of this is unclear, but implies shared regulation because of the uniform presence of known TFs.

Vikram then tested experimentally (in-vitro) the role of GATA1 regulation of erythroid lncRNAs. They used an immortalized Gata1 $^{-}$ cell line and treated with $\beta$ estradiol which triggers GATA-1 gene expression and terminal erythroid maturation. Comparing coding genes to LS/HS lncRNAs they found significant up regulation of genomic loci with GATA1 binding in both coding gene and LS/HS lncRNA, however the lncRNAs up-regulation were more pronounced.

Interestingly, Vikram et. al have recently demonstrating the heptad TF (Gata2, Runx1, Tal1, Fli1, Lmo2, Ldb1, Lyl1, and Erg) in HSPCs were lineage specific activation with respect to coding genes were upregulated during megakaryopoiesis and repressed during erythropoiesis. To verify this regulation pattern in lncRNAs, they assessed the heptad TF occupancy and found the highest frequency occupancy was up activation in megakaryopoiesis and either down activation in erythropoiesis or no change in erythropoiesis. They showed that the TF heptad previously explored in coding genes were also primed in HSPCs as a common regulatory TF in lncRNAs under specific lineages.

RNAi to assess erythroid lncRNA function

They selected abundant and erythroid-enriched lncRNAs both conserved across mouse and human and non-conserved. They selected candidate lncRNAs that were significantly expressed in erythroblasts compared to other MEPs and excluded LS-lncRNAs. They also performed enrichment testing and confirmed the candidates were signficantly enriched in adult bone marrow erythroblasts using microarray analysis. Then they created shRNAs and selected 21 lncRNAs with at least 40% knockdown of lncRNA expression. Next they assessed the maturation of erythroblasts from lncRNA-knockdown erythroblasts and identified 1 LS-lncRNA and 6 HS-lncRNA whose knockdown by at least to different shRNAs. Of the novel 7, one was previously annotated “Lincred1”, and they’ve named the other 6 (Galont, Redrum, Erytha, Scarletltr, Bloodlinc, Ggnbp2). The knockdown affect on erythroid maturation implies that these 7 lncRNAs are involved with terminal maturation of erythroid cells.

Interestingly 6 out of 7 of the lncRNAs were conserved, yet showed no transcript expression in humans implying that interspecies lncRNAs expression does not guarantee functionality.

Vikram et. al 2014.

Copy Number Variation in Cancer: Introduction to Genomic Clonal Evolution

Eric Lander’s work speaks for itself, and his work developing highly sensitive and high specific variant detection tool “MuTect” beat out other public SNP callers. Copy number variation is defined as a genomic deletion/insertion event that has direct relevance toward cancer studies because tumors tend to have heterozygous allele variation. In normal cells gene allele frequencies tend to be heterozygous (50%) or homozygous (100%). In cancer/tumor cells the gene copies tend to vary more because of genomic rearrangement events which tend to remove geneic regions alltogether, or replicate gene copies thus changing the allele ratio as genomic regions are copied/deleted more than others. Copy Number variation has been studied and associations to single nucleotide polymorphisms have been found as biomarkers of genomic disequilibrium in the International HapMap Study.

The International HapMap Project investigated point mutations and their role in site specific factors for genomic variation. Each person has two copies of each chromosome, except for males, but site specific genomic content does vary between individuals. Site specific polymorphisms have signatures for well studied diseases and thus associated to copy variation.

Germline mutations exist in the gamete and were always present in the cell contrasting somatic mutations which occurred within the cellular body. Somatic mutations are often the reason why tumor cells become the dominant clone and/or resistant to treatment. These cancerous genomic point mutations often originally occur in low allelic frequency existing in one cell out of millions of cells but later evolve into tumor resistant cells.

MuTect can detect somatic mutations with high sensitivity needed for low allelic mutations and also with high specificity reducing errors in overcalling point mutations (Type 1 error) and under-calling germline mutations (Type 2 error). MuTect uses a Bayesian algorithm to first filter sequencing read errors and also filter correlated reads to the sequencing artifacts. It then performs a second Bayesian step to identify variant mutations. It detects variants using a reference $M_0$ which assumes no variants exist at the site and any non-reference target is due to sequencing error. The comparison group is defined as $M_{f}^{m}$ where f is the estimated allelic fraction and m is the unknown true mutant at a site. f is estimated given the following:

For each site the reference allele as $r \in \{A,C,G,T\}$ , $b_{i}; i =\{1,...,d\}$ is the base of the read and d is the site and $e_{i}$ is the probability of error of that base call denoted $e_{i}^{\frac{-q_{i}}{10}}$ where $q_{i}$ is the Phred quality score. Given the Reference/comparison:

$L(M_{f}^{m}) = P(\{b_{i}\} | \{e_{i}\},r,m,f) = \prod_{i=1}^{d}P(b_{i}|e_{i},r,m,f)$

where we assume that sequencing errors across reads is independent and substitutions errors are uniform/equally likely with probability $\frac{e_{i}}{3}$

$P(b_{i} | e_{i},r,m,f) =$

$\begin{cases} \frac{f^{e_{i}}}{3} + (1-f)(1-e_{i}) & b_{i}=r\\ f(1-e_{i}) + \frac{(1-f)^{e_{i}}}{3} & b_{i} =m\\ \frac{e_{i}}{3} & \text{otherwise}\\ \end{cases}$

The conditional probability for $P(b_{i})$ is piecewise set by the empirical observations of the mutant site matching or mismatching the reference. If the reference is matched then the allelic frequencing is penalized by a factor of the sequencing error and summed with the product of the complement of allele frequency and sequencing accuracy. If the base site matches a mutant then the allele frequency is multiplied by the sequencing accuracy and the complement allele frequency is then penalized. Normally the allele frequency is a small number so if a base matches the reference we expect the allele gene variant to be penalized and the non-gene variant frequency to be rewarded. If the base does match the mutant then the allele gene variant is expected to be rewarded and the non-gene allele variant to be penalized uniformly for all substitutions.

$LOD_{T}(m,f) = \text{ln}(\frac{L(M_{f}^{m})}{L(M_{0})}) \geq \text{ln}\delta-\text{ln}(\frac{P(m,f)}{1-P(m,f)})= \theta_{T}$

To determine $P(m)$ , $P(f)$ we assume them independent and $P(f)=1$ is uniformly distributed and in practive $\theta_{T}=6.3$ due to a reasonable assumption that typical mutational frequency is $3 \times 10^{-6}$ which is odds ratio of 10:1 in favor of the reference model which is to say that 1-10 mutational variant per 1Mb yielding $\theta_{T}=6.3$ in practice.

Then calculating $LOD_{T}$ across all 3 mutational values of m allows the estimation of $f$ using maximum liklihood estimation to find $f$ that mximizes $LOD_{T}$ defined as $\widehat{f_{ML}}$

$\text{estimated} \widehat{f_{ML}} = \widehat{f} = \frac{ \text{number of mutant reads}}{\text{number of total reads}}$

Variant filtering reduces the FPR among mutant variant calls. Mutect filters in three ways: a standard, non-filtered, call (STD), a high confidence significance threshold (HC), and high confidence significance threshold coupled with a panel of normal site reference (PON) which creates prior germline mutation sites founded on normal sequence identities. High FPR is a function on sequencing depth and the deeper sequencing around 8X-12X increases power for unknown mutation sites, but for validating known SNPs depth levels of 15X-20X are needed. Further HC filtering has slightly higher FPR compared with HC+PON signifying that PON are useful but not required if one uses careful HC filters.

HC filters are listed as Proximal gap, Poor mapping, Triallelic site, Strand bias, Clustered Position, and Observed in Control. Proximal gap filters false positives where greater than 3 reads are mapped to sites in small indel sites. Clustered Position rejects candidates where alternate alleles are clustered at a consistent distance from the start and end of the read alignment (Using median less than 11 and and MAD less than 4 to identify clustered reads).

Reducing FPR used dbSNP records for identifying known germline sites, HC filters, and HC+PON.

The Business of Science

https://www.theguardian.com/science/2015/jun/23/sexism-in-science-did-watson-and-crick-really-steal-rosalind-franklins-data

this article describes the drama behind publishing the article on the double helix

Comparative Analysis of Single Cell Techniques: Ziegenhain et. al

Single Cell sequencing (scRNA-Seq) is an advancement of RNA-seq methods. Sequencing occurs by taking mRNA and inducing reverse transcription producing cDNA. Recent developments of scRNA-Seq adds unique barcodes to each mRNA molecule identifying each cell which originated during amplification. These developments adds a random demultiplexing barcode to each single cell during reverse transcription which allows for the better production of hundreds of thousands of scRNA-Seq libraries which improves the throughput by one to three orders of magnitude. For traditional RNA-Seq each sample would have thousands of libraries that were amplified during reverse transcription with one demultiplexing barcode per sample.

C1 sequencing uses Smart-Sequencing protocol kit using the C1 platform from Fluidigm that uses a microfluidic chip to automate the cell isolation protocol which separates up to 96 cells. the mRNA lysis protocol can include ERCC ( external RNA seq control consortium) to create cDNA using oligo-dt priming, template switching, and PCR (polymerase chain reactions) up to 96 NexTera reactions. The advantages of this protocol is that the cell isolation steps are automated using a microfluidic chip and the reaction volumes are small but enough such that the protocol generates full length cDNAs. The negative aspects are that the costs of the microfluidic chip, smart-seq kit, and Nextera reactions are costly. Also the smart-seq kit sequenced over the C1 platform does not use UMI’s and is expensive. However, these costs have fallen somewhat due to Drop-Seq protocols.

Drop-seq is a recently developed microdroplet-based approach. Each cDNA is decorated with a cell-specific multiplexing barcode and a UMI count abundance. Each oligo-dt primers are immobilized on beads and captured on nano-liter emulsion droplets. In the case for Drop-Seq library preparation, ERCC spike ins can not be included because it would have to immobilized on the beads an suspended. However, the Drop-Seq droplets usually do not carry cell transcriptome (mRNA) because the cell concentration has to be relatively low to avoid double droplets.

Single Cell Experiments: Contrasting Two Different Studies’ Performance

Hi.

So some journals, that are not hypothesis association journals, tend to submit the raw and analytical data incompletely. Contrastly, this paper (https://www.ncbi.nlm.nih.gov/pubmed/26951663) by Xin et.al performed C1 sequencing that has upwards of 600 samples per experiment. This paper submitted 622 samples, note that single cell sequencing has the ability to isolate cells individually, and sequence each individual cell at the individual level (by definition), reported all their samples. Their paper describes how 200 or so were filtered out, but they still included the samples for reproducibility. Although this seems like common sense, this is not a standardized practice. (GEO Submission http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77980)

For instance this paper is a bit more confusing. Grun et.al performed single cell sequencing that included 250 samples in a two group comparison to measure the and correct for noise technicalities (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54695). Grun et. al wrote the paper well and controlled for noise by sequencing RNA spike samples with CEL-Seq (cellular expression linear transcription sequencing) which uses in-vitro transcription and not PCR amplification. They had two lanes with some 70 cells for each sample type (RNA control, or single cell) in two mediums ran on two lanes (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54695). Upon analyzing the data, the data clusters very well; however for CEL-Seq I expected a sample per cell (by definition) so at this moment I’m a bit confused why there are only 16 samples, as opposed to including all the single cell information.

From their perspective the paper was focused on validation the CEL-Seq data controlling for technical noise, so the granular cell level information was not in the scope of their experiment; they were needing aggregated data. I emailed the primary author Dr. Grun, and requested cell level data, we will see if he replies.

Dr. Grun , a researcher at Max Plank institute Germany, did reply to my questions. Essentially the cellular level data was pooled into corresponding fastq files. His Supplementary figure 1a, did show an example for reads from the two lanes that they sequenced, but I had difficultly extrapolating correctly which controls matched the fastq data based on the total number of reads.

Dr. Grun’s reply suggested to look at the barcodes in the fastq files! that is a clever insight, if I look at the barcodes, they do show that each fastq file has the reported bar code for numerous samples. Nice.

Beginning a Science Blog

Hi All.

I realize that my blog has been largely an unfocused discussion, but would like to center my writing around biotechnology, ideas about research, RNA-Sequencing, and mathematics.

The research world is seeing radically democratization with the rise of public data, software ecosystems, and cloud computing. Electronic publishing has challenged the sense of prestigious impact factor which is seeing a new model; interactivity. Some argue that interactive discussions and sharing ideas continuously has a greater or equal impact compared to printed publishing.

I think that with publicizing data and accelerating publishing, quality control and security must be held.

Positive Thinking

Look. we’ve all been there. Negative thinking is a reality, but you can not be strategic without first being positive. Positive outlooks really are healthy to concentrate on.

Determination

There is always a way.

How do you learn?

So I am self teaching a computer language. how does one best learn? One must learn all the definitions, ask questions about the differences and similiarities between the terms, and then work examples. It is very important to write the definitions down, question the terms, and review each term as it comes up during readings. Do not passively learn, but become as active as possible. Create your own analogies, draw pictures. and review the terms from the last section the night before going to bed. IT is very important to create a strict schedule and stick to it.

Eliminate distractions while one is studying new terms. Do not take breaks until after a session is complete. Go for hour long sessions if possible.

Russian Dialogue , Text (Draft)

–Вчера у меня был ужасный день

— Что случилось?

— Во-первых, мне надо встать рано. Во-вторых, я проспал и был опаздывать на метро.

–почему ты не приедет на машину?

–Моя машина сломалась, и Во-третьих, я ненавижу болеть, но заболел.

–Какой Ужас!

–Сначала я заболел так как позавчера до трех ночи готовился к контрольной работе по математике.

— Я рекомендую тебе вернуться домой спать

–Привет! меня зовут Антон, и ты?
–Очень приятна, меня зовут Николай. Кто ваши родители?
— Мой отец инженер механик, и моя мать писательница. Когда мы был год, мы переехали на Аризону. Там я прожила до шестнадцати лет, и потом мы переехали на Калифорнии.

–Хорошо. Мой отец преподаватель и моя мама профессора. Я жил в России всегда, я никогда переехать .

— Где ты хочешь переехать, ли ты может?
— Я никогда думаю о теме. Но ли могу, я хочу переехать в Монтане жить в лесу. Я очень люблю природу, и всегда хотел жить в очень маленьком городе, в котором многих людей знают всего имена всего соседей.
— Очень хорошо! Пошли!

— Что случилось? Почему ты расстроен?
— У меня проблемы с компьютером. Мой компьютер завис, и я перегрузил его. Потом я терялся мои файлы.

–Ты сохранились файлы?
–Нет.
–Очень плохо! Не может быт!
— Не повезло, знаю. Сейчас, мне надо написать файлов опять.

arcolombo

Mathematics is the queen of all science