Master's/Doctoral Thesis 93242003: Detailed Record




Author: Jian-Hao Wei (魏建豪)   Department: Department of Physics
Thesis title: k-tuple Zipf m-Set analysis on DNA
(基因序列的k 字齊普夫子集解析)
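  1. This electronic thesis is authorized for immediate open access.
  2. The open-access electronic full text is licensed only for personal, non-profit searching, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.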

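Abstract (Chinese): Zipf's law, a widely used statistical tool, was applied in 1994 by Mantegna and co-workers to the relation between the frequency of occurrence of k-tuples in DNA sequences and their rank (k-tuple Zipf analysis). They emphasized that noncoding regions follow a language-like power law. That conclusion, however, has been widely questioned and debated.
Surveying the fields in which Zipf distributions have been studied, we find that although the focus differs from field to field, they share one feature: when there are N possible events, each event has probability 1/N in the random case. DNA sequences are different. The further p (the fractional A+T content of the sequence) departs from one half, the more the probabilities of individual k-tuples differ from one another even in a random sequence. In a non-random sequence, unequal probabilities therefore arise from two factors, p and biological features, and this confounds the interpretation of a Zipf distribution.
In this study we use genome sequences with different p, together with their matching random sequences, to show that the k-tuple Zipf m-Set analysis removes the influence of p, overcomes the difficulty that k-tuple Zipf analysis has in defining an exponent for random sequences, and thereby establishes the advantage of the subset analysis.
In addition, we fit four functions (linear, exponential, logarithmic, and power-law) to select the "high-frequency words" (k-tuples that occur with high frequency) that adequately characterize a species, and we look for universality in the high-frequency-word exponents of 865 species. The results show that a species' exponent is related to its complexity, reflecting the evolutionary outcome of gene duplication.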
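The composition effect described in the abstract can be illustrated with a small sketch. It assumes a random sequence of independent bases with P(A) = P(T) = p/2 and P(C) = P(G) = (1-p)/2; this model and the function names are illustrative assumptions, not taken from the thesis. Under that assumption a k-tuple containing m A or T bases has probability (p/2)^m ((1-p)/2)^(k-m), so the 4^k possible k-tuples fall into k+1 probability levels unless p = 1/2.

```python
from itertools import product


def random_kmer_probability(kmer: str, p: float) -> float:
    """Probability of `kmer` in an i.i.d. random sequence with A+T fraction p.

    Assumed model (hypothetical, for illustration only):
    P(A) = P(T) = p/2 and P(C) = P(G) = (1-p)/2.
    """
    n_at = sum(base in "AT" for base in kmer)
    n_gc = len(kmer) - n_at
    return (p / 2) ** n_at * ((1 - p) / 2) ** n_gc


def probability_levels(k: int, p: float) -> dict[int, tuple[int, float]]:
    """Group all 4**k k-tuples by their A+T count m.

    Returns {m: (number of k-tuples with that m, probability of each one)}.
    """
    levels: dict[int, tuple[int, float]] = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        m = sum(base in "AT" for base in kmer)
        count, _ = levels.get(m, (0, 0.0))
        # Every k-tuple with the same m has the same probability.
        levels[m] = (count + 1, random_kmer_probability(kmer, p))
    return levels


if __name__ == "__main__":
    for p in (0.5, 0.7):
        print(f"p = {p}")
        for m, (count, prob) in sorted(probability_levels(3, p).items()):
            print(f"  m = {m}: {count} words, probability {prob:.6f} each")
```

For k = 3 and p = 0.7 the 64 possible 3-tuples split into four levels, one per value of m, while at p = 0.5 every 3-tuple has the same probability 1/64. This is the sense in which unequal k-tuple frequencies in a real genome mix a pure composition effect with any biological signal.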
Abstract (English): Zipf’s law characterizes the relation between the frequency of a word in a text and the rank of that word in the frequency table. It states that if the text is written in a natural language, the frequency versus rank relation is approximately a power law. For a few years in the mid to late 1990s Zipf’s law was intensely discussed in the context of genomic sequences, but no clear consensus was reached as to whether, as a general rule, the word frequencies in genomic sequences, or in some specific portion thereof, obey Zipf’s law. (A genomic word is an oligonucleotide of a given length; we call a k-nucleotide word a k-mer.) Here we revisit the issue by studying the frequency versus rank relations of a large number of complete genomes, and of parts of genomes having different biological functions. We show that nucleotide composition influences the frequency versus rank relation of a genomic sequence strongly enough to mask whatever Zipf’s-law behavior the sequence may possess. Once this influence is removed, all genomes obey the same broadly defined classes of Zipf’s laws, with the most important class-defining factor being the length of the k-mers, that is, the integer k. For eukaryotes, the Zipf’s laws for the exonic and intronic segments of the genome differ significantly. Based on the observation that the Zipf’s law of a sequence is determined by the subset of k-mers having the highest frequencies of occurrence, we derive a relation between the Zipf’s-law exponent and the high-frequency tail of the frequency distribution, and infer that for genomes in general the high-frequency tail is best represented by an exponential function, as opposed to linear, logarithmic, or power-law functions.
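The frequency versus rank relation discussed in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's own procedure: it assumes overlapping windows advanced one base at a time, skips windows containing symbols other than A, C, G, T, and estimates the exponent by an ordinary least-squares fit of log(frequency) against log(rank). All function names are hypothetical.

```python
from collections import Counter
from math import log


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers with a sliding window of step 1."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows with ambiguous symbols
            counts[word] += 1
    return counts


def rank_frequency(counts: Counter) -> list[tuple[int, str, float]]:
    """Return (rank, k-mer, relative frequency), most frequent first."""
    total = sum(counts.values())
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, n / total)
            for rank, (word, n) in enumerate(ranked, start=1)]


def zipf_exponent(table: list[tuple[int, str, float]]) -> float:
    """Least-squares estimate of zeta in frequency ~ rank**(-zeta)."""
    if len(table) < 2:
        raise ValueError("need at least two ranked k-mers to fit a slope")
    xs = [log(rank) for rank, _, _ in table]
    ys = [log(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # minus the slope of the log-log line


if __name__ == "__main__":
    table = rank_frequency(kmer_counts("ATATATGCGCAT", 2))
    for rank, word, freq in table:
        print(rank, word, round(freq, 3))
    print("fitted exponent:", zipf_exponent(table))
```

Fitting over all ranks is the simplest choice. According to the abstract, the exponent is determined by the highest-frequency k-mers, so in practice the fit would likely be restricted to a chosen high-frequency range of ranks.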
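Keywords (Chinese): ★ high-frequency words
★ ranking
★ frequency of occurrence of words
★ complete genome sequences
★ language
★ Zipf's law
★ coding regions
★ noncoding regions
★ exons
★ introns
★ frequency distribution
★ power-law distribution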
Keywords (English): ★ coding parts
★ high-frequency words
★ ranking
★ k-mers
★ frequency of occurrence of words
★ complete genome sequences
★ noncoding parts
★ Zipf’s law
★ natural language
★ exons
★ introns
★ power-law distribution
★ frequency distribution
Table of contents:
Abstract (Chinese)
Abstract (English)
Preface
Acknowledgments
1. Introduction
1.1 Carriers of biological information
1.1.1 The origin of life
1.1.2 Structure of DNA sequences
1.2 Modes of evolution of DNA sequences
1.2.1 Mutation and recombination of DNA sequences
1.2.2 Natural selection and species classification
1.3 Properties of random systems
1.3.1 Definition of randomness
1.3.2 The central limit theorem
1.4 Zipf's law and observed phenomena
1.4.1 Bibliometrics of textual information
1.4.2 What is Zipf's law?
1.4.3 N-tuple Zipf's law for genome sequences
1.4.4 Zipf-like behavior in protein expression
1.4.5 Zipf's law is everywhere
1.5 Properties and applications of Zipf distributions
1.5.1 The Principle of Least Effort and the robustness of Zipf distributions
1.5.1.1 Furusawa's simple concentration-diffusion model, 2003
1.5.1.2 Ogasawara's theoretical model of evolution with genetic drift and natural selection, 2009
1.5.1.3 Bernat's simulation of city population change using algorithmic information theory, 2010
1.5.1.4 Other examples
1.5.2 Scale invariance and its exponent ζ
1.5.2.1 Nuclear fragmentation of xenon (Xe): the exponent of the fragment distribution as a new criterion for the liquid-gas phase transition
1.5.2.2 Examining cancer classification through the exponent of the maximum-likelihood distribution of gene expression levels
1.5.2.3 City population distributions, forest resource scale distributions, and optimization
1.5.3 Shannon entropy H and redundancy R as measures of information
1.5.3.1 Effect of the noncoding content of genome sequences on entropy and redundancy
1.5.3.2 Does the G+C content of DNA sequences affect the results?
1.5.4 The Zipf exponent ζ and the long-range correlation exponent α in sequence models
1.5.4.1 Boundaries of the long-range correlation exponent and the Zipf exponent for control sequences
1.5.4.2 Zipf behavior and long- or short-range correlation are not equivalent
2. Materials and Methods
2.1 Complete genome sequences
2.2 k-tuple Zipf m-Set analysis of DNA sequences
2.2.1 Sliding window and k-tuple Zipf analysis
2.2.2 Relative frequency
2.2.3 Relative subset frequency
2.3 Rank-Probability density function Histogram (RPDF Histogram)
2.4 High-frequency and low-frequency words separated at a 2% cutoff
2.4.1 Zipf m-Set plots of DNA sequence k-tuples and the high-frequency-word test
2.4.1.1 Function tests on the Zipf m-Set plot
2.4.1.2 Limitations of the rank-probability distribution (RPDF) and its replacement by the probability distribution (PDF)
2.4.1.3 Function tests on the probability distribution
3. Results
3.1 3-tuple Zipf analysis of genomes with different p and their matching random sequences
3.1.1 Zipf plots and Zipf m-Set plots
3.1.2 Zipf (m-Set) exponents of random sequences
3.1.3 Rank-probability distribution histograms
3.1.4 Random sequences highlight the advantage of Zipf m-Set analysis
3.2 Mathematical comparison of relative frequency and relative subset frequency
3.2.1 Why do random sequences show steps in relative frequency?
3.2.2 In relative frequency, the k-tuples of a random sequence form k+1 steps
3.2.3 In relative subset frequency, a random sequence has only one step
3.3 Universality of the Zipf m-Set exponent
3.3.1 Relation between k-tuple length, species classification, and exponent
3.3.2 Effect of sequence length and p on the exponent
3.3.3 Five classes defined by ranges of p and length
3.3.4 Zipf exponents of genome sequences, genic regions, intergenic regions, exons, and introns
4. Discussion
4.1 Species exponents and evolutionary relationships
4.2 Relative subset frequency is unaffected by the p of the sequence
4.3 Curves of the Zipf m-Set plot
4.3.1 Randomness of low-frequency words
4.3.2 No particular benefit for classifying forms
4.4 The Zipf exponent is independent of sequence type and depends on p and sequence length
4.4.1 Exponents do not differ by sequence class; nine classes using log(L) = 5.4 and 6.2 as new boundaries
4.4.2 In short sequences the exponent differs significantly with p and shows no specific dependence on length
4.4.3 Species' Zipf exponents explored across different sequence types
4.4.4 The Zipf exponent is independent of sequence type
References
Appendix tables
References:
1. Mantegna, R.N., et al., Linguistic Features of Noncoding DNA-Sequences. Physical Review Letters, 1994. 73(23): p. 3169-3172.
2. Mantegna, R.N., et al., Systematic Analysis of Coding and Noncoding DNA-Sequences Using Methods of Statistical Linguistics. Physical Review E, 1995. 52(3): p. 2939-2950.
3. Ramsden, J.J. and J. Vohradsky, Zipf-like behavior in procaryotic protein expression. Physical Review E, 1998. 58(6): p. 7777-7780.
4. Li, W.T., Zipf’s Law in Importance of Genes for Cancer Classification Using Microarray Data. Journal of Theoretical Biology, 2002. 219: p. 539-551.
5. Hernando, A., C. Vesperinas, and A. Plastino, Fisher information and the thermodynamics of scale-invariant systems. Physica A, 2010. 389: p. 490-498.
6. Tan, M.H., et al., Relationship between Zipf dimension and fractal dimension of city-size distribution. Geographical Research, 2004. 23(2): p. 243-248.
7. Gong, X.Q. and Z. Wang, A Note on the Zipf’s Law. Complex Systems and Complexity Science, 2008. 5(3): p. 73-78.
8. Bernat, C.M., et al., Universality of Zipf’s law. Physical Review E, 2010. 82: p. 011102.
9. Yi, L.U., Analysis of forest resource scale using on Zipf’s law. Journal of Nanjing Forestry University (Natural Science Edition), 2009. 33(2): p. 73-76.
10. Chen, H.D., The Footprint of Evolution Duplication - Universal Equivalent Length of Genomes, in NCU. 2009.
11. Li, W.T., Zipf’s Law Everywhere. Glottometrics, 2003. 5: p. 14-21.
12. Tsay, M.Y., Information-metrics and Document Properties. 2003, Taipei: Hwa Tai Publishing.
13. Zipf, G.K., Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. 1949, Cambridge, MA: Addison-Wesley.
14. You, R.Y., Zipf’s Law and the Distribution of Chinese Character Frequency. Journal of Chinese Information Processing, 1999. 14(3): p. 60-65.
15. Kosmidis, K., A. Kalampokis, and P. Argyrakis, Language time series analysis. Physica A: Statistical Mechanics and Its Applications, 2006. 370(2): p. 808-816.
16. Manning, C.D. and H. Schütze, Foundations of Statistical Natural Language Processing. 1999, MIT Press.
17. Li, W.T., Random Texts Exhibit Zipf-Law-Like Word-Frequency Distribution. IEEE Transactions on Information Theory, 1992. 38(6): p. 1842-1845.
18. Havlin, S., The Distance between Zipf Plots. Physica A: Statistical Mechanics and Its Applications, 1995. 216(1-2): p. 148-150.
19. Ferrer i Cancho, R. and R.V. Solé, Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 2003. 100(3): p. 788-791.
20. Ferrer-i-Cancho, R. and B. Elvevag, Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution. PLoS ONE, 2010. 5(3): p. e9411.
21. Bol´an, B.C., et al., Statistical properties and linguistic coherence in noncoding DNA sequences. Rev. Mex. Fis. E, 2005. 51(2): p. 118-125.
22. Flam, F., Hints of a Language in Junk DNA. Science, 1994. 266(5189): p. 1320.
23. Konopka, A.K. and C. Martindale, Noncoding DNA, Zipf's law, and language. Science, 1995. 268(5212): p. 789.
24. Voss, R.F., Linguistic features of noncoding DNA sequences - Comment. Physical Review Letters, 1996. 76(11): p. 1978.
25. Mantegna, R.N., S.V. Buldyrev, and A.L. Goldberger, Mantegna et al. Reply. Physical Review Letters, 1996. 76: p. 1979-1981.
26. Furusawa, C. and K. Kaneko, Zipf's law in gene expression. Physical Review Letters, 2003. 90(8).
27. Ogasawara, O., S. Kawamoto, and K. Okubo, Zipf's law and human transcriptomes: an explanation with an evolutionary model. Comptes Rendus Biologies, 2003. 326: p. 1097-1101.
28. Ogasawara, O. and K. Okubo, On Theoretical Models of Gene Expression Evolution with Random Genetic Drift and Natural Selection. PLoS ONE, 2009. 4(11): p. e7943.
29. Powers, M., Applications and Explanations of Zipf’s Law. New Methods in Language Processing and Computational Natural Language Learning, ACL, 1998: p. 151-160.
30. Adamic, L.A. and B.A. Huberman, Zipf’s law and the Internet. Glottometrics, 2002. 3: p. 143-150.
31. Stanley, H.E., et al., Scaling features of noncoding DNA. Physica A: Statistical Mechanics and Its Applications, 1999. 273(1-2): p. 1-18.
32. Sellis, D. and Y. Almirantis, Power-laws in the genomic distribution of coding segments in several organisms: An evolutionary trace of segmental duplications, possible paleopolyploidy and gene loss. Gene, 2009. 447(1): p. 18-28.
33. Han, D.D., et al., Nuclear fragmentation may exist in the Zipf law. Chinese Science Bulletin, 2000. 45(9): p. 913-918.
34. Bonhoeffer, S., et al., No signs of hidden language in noncoding DNA. Physical Review Letters, 1996. 76(11): p. 1977.
35. Peng, C.K., et al., Statistical Properties of DNA-Sequences. Physica A: Statistical Mechanics and Its Applications, 1995. 221(1-3): p. 180-192.
36. Peng, C.K., et al., Mosaic Organization of DNA Nucleotides. Physical Review E, 1994. 49(2): p. 1685-1689.
37. Peng, C.K., et al., Long-Range Correlations in Nucleotide-Sequences. Nature, 1992. 356(6365): p. 168-170.
38. Peng, C.K., et al., Finite-Size Effects on Long-Range Correlations - Implications for Analyzing DNA-Sequences. Physical Review E, 1993. 47(5): p. 3730-3733.
39. Buldyrev, S.V., et al., Generalized Lévy-walk model for DNA nucleotide sequences. Physical Review E, 1993. 47(6): p. 4514-4523.
40. Azbel’, M.Y., Random Two-Component One-Dimensional Ising Model for Heteropolymer Melting. Physical Review Letters, 1973. 31(9): p. 589-592.
41. Czirok, A., et al., Correlations in Binary Sequences and a Generalized Zipf Analysis. Physical Review E, 1995. 52(1): p. 446-452.
42. Voss, R.F., Evolution of Long-Range Fractal Correlations and 1/F Noise in DNA-Base Sequences. Physical Review Letters, 1992. 68(25): p. 3805-3808.
43. Li, W.T., Expansion-Modification Systems - a Model for Spatial 1/F Spectra. Physical Review A, 1991. 43(10): p. 5240-5260.
44. Li, W.T., Large-Scale Patterns in DNA Texts. Originally prepared for Scientific American, 1999: p. 1-10.
45. Israeloff, N.E., M. Kagalenko, and K. Chan, Can Zipf distinguish language from noise in noncoding DNA? Physical Review Letters, 1996. 76(11): p. 1976.
46. Trotta, E., et al., 1H NMR study of [d(GCGATCGC)]2 and its interaction with minor groove binding 4',6-diamidino-2-phenylindole. Journal of Biological Chemistry, 1993. 268(6): p. 3944-51.
47. National Center for Biotechnology Information genome database.
48. Rice Annotation Project database.
49. Hedges, S.B., The origin and evolution of model organisms. Nature Reviews Genetics, 2002. 3(11): p. 838-849.
50. Hsieh, L.C., et al., Minimal model for genome evolution and growth. Physical Review Letters, 2003. 90(1): p. -.
51. Chen, H.D., et al., Universal Global Imprints of Genome Growth and Evolution - Equivalent Length and Cumulative Mutation Density. PLoS ONE, 2010. 5(4): p. e9844, 1-15.
Advisor: Hoong-Chien Lee (李弘謙)   Approval date: 2011-8-29