Introduction to Information Retrieval Penelusuran Informasi (Information Retrieval) Sumber: CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Taufik Fuadi Abidin Link Analysis
Pembahasan Minggu Ini Hypertext dan Link (tautan): melihat lebih jauh, tidak hanya aspek tektual dari dokumen HTML Makna dari keterkaitan Aplikasi terkait 2
Hyperlink (tautan) Link dapat dimanfaatkan sebagai sumber daya yang potensial dalam menentukan authenticity& authority Jenis Link: Good, Bad& Unknown Good Bad 3
Simple Iterative Logic Link yang Goodtidak akan tunjuk ke node yang Bad Jika Goodnode link ke Badnode maka Goodnode akan menjadi Bad node Jika Goodnode link ke node tertentu, maka node tersebut merupakan node yang Good Good Bad 4
Simple Iterative Logic Good Bad Good Bad 5
Contoh Link Analysis Social networks Page Ranks http://www.cs.cornell.edu/home/kleinber/networks-book/ 6
Sec. 21.1 Web Merupakan Directed Graph Page A Anchor hyperlink Page B Asumsi 1:hyperlink antara halaman HTML menyatakan authority dari halaman tersebut (quality signal) Asumsi 2:Text yang ada pada anchor dari sebuah hyperlink menggambarkan (a glance) halaman target/tujuan (textual context)
Introduction to Information Retrieval Asumsi 1: Site Dengan Reputasi Baik 8
Introduction to Information Retrieval Asumsi 2: Menggambarkan Target 9
Sec. 21.1.1 Indexing Anchor Text Ketika mengindeks documen D, dengan bobot tertentu, anchor textyang tunjuk ke dokumen D juga di-include-kan Armonk, NY-based computer giant IBM announced today Big Blue today announced record profits for the quarter www.ibm.com Joe s computer hardware links Sun HP IBM Anchor text weight tergantung authority dari anchor page s website Jika diasumsikan bahwa konten dari cnn.com atau yahoo.com adalah authoritative (bagus) maka percaya semua anchor text dari web itu
Citation Analysis (Part of Link Analysis) Citation frequency Bibliographic coupling frequency Articles that co-cite the same articles are related Citation indexing Who is this author cited by (Garfield 1972)
Isu: Web isn t scholarly citation Miliaran partisipan, setiap partisipan punya tujuan sendiri-sendiri Spam ada dimana-mana Ketika SE mulai menggunakan link untuk ranking, (sekitar tahun 1998), spam terkait link tumbuh berkembang 12
Link Analysis Page Rank (Google) A link analysis algorithm that assigns a weight to each page to measure the relative importance of the page within the set. It interprets a link from page A to page B as a vote, by page A, for page B (see The Anatomy of a Large-Scale Hypertextual Web Search Engine. Brin, S.; Page, L., 1998) ExpertRank (Ask.com) Subject-Specific Popularity: analyzed links in context to rank a web page's importance within its specific subject. Ex. a web page about basketball would rank higher if other web pages about basketball' link to it Presented at the General Lecture, Mathematics Department, Syiah Kuala University, May 2007 13
Page Rank Algorithm Diketahui node A, B, C, and D. Initial PR adalah 0.25 Jika B, C, D tunjuk ke A maka PR(A) = PR(B) + PR(C) + PR(D) B A D C Namun, jika B jika tunjuk (link) ke C, dan D link ke semau (A, B, dan C), maka vote untuk setiap node dinormalisasi dengan B membagi jumlah outbound links ke halaman tersebut. Vote B adalah 0.125 untuk node A dan C Vote D = 0.25/3 = 0.081 untuk node A, B, dan C A C D PR(A) = PR(B)/OL(B) + PR(C)/OL(C) + PR(D)/OL(D) = 0.456 Presented at the General Lecture, Mathematics Department, Syiah Kuala University, May 2007 14
Sec. 21.3 Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find twosets of interrelated pages: Hub pagesare good lists of links on a subject. e.g., Bob s list of cancer-related links. Authority pagesoccur recurrently on good hubs for the subject.
Sec. 21.3 Hubs and Authorities Thus, a good hub page for a topic pointsto many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic.
Sec. 21.3 High-level scheme Extract from the web a base setof pages that couldbe good hubs or authorities. From these, identify a small set of top hub and authority pages; iterative algorithm.
Sec. 21.3 Representasi n n adjacency matrix A: each of the npages in the base set has a row and column in the matrix. Entry A ij = 1if page ilinks to page j, else = 0. 1 2 3 1 2 3 1 2 3 0 1 0 1 1 1 1 0 0