From Several Hundred Million to Some Billion Words: Scaling up a Corpus Indexer and a Search Engine with MapReduce

Jordi Porta
Computational Linguistics Area, Departamento de Tecnología y Sistemas
Centro de Estudios de la Real Academia Española ([email protected])

"The more observations we gather about language use, the more accurate description we have about language itself" (Lin & Dyer, 2010: Data-Intensive Text Processing with MapReduce)
"The only thing better than big data is bigger data"

Outline
• A Case Study: Elpais.com
• Basic Data Structures for Complex Queries
• Introduction to MapReduce
• The Scale-out vs. Scale-up Dilemma
• Scaling up with Phoenix++
• Experiments introducing MapReduce into a CMS (corpus management system):
  – Query engine: word counting, word cooccurrences, n-grams
  – Indexer: file inversion (text inversion, structure inversion)

A Case Study: the Elpais.com On-line Archive
Base corpus:
• Elpais.com on-line archive, 1976–2011
• 2.3 million news articles
• Cleaned and PoS tagged
• 23 million structural elements
• 878 million tokens
Typical requests on the corpus:
• Lists of linguistic elements: concordances, counts, associations, distributions, …
• Sub-corpora
• Non-trivial processing: cooccurrence networks, clustering, word vectors, …

Basic Data Structures for Corpus Indexing Supporting Complex Queries
• Textual indices are posting lists: for each term, the documents it occurs in (doc id, number of hits) and the token positions of the hits. They answer queries such as:
  – [token="de"]
  – [token="la" and msd="Nfs"]
  – [token="pelota"] within [section="sports"]
  – …
• Structural indices are interval trees (built on red/black trees): for each element type (<p>…</p>, <text>, <hi>, …), the intervals (doc id, first token, last token) that its instances span. They answer queries such as:
  – P within TEXT
  – P containing HI[REND=IT] containing [lemma="de"]
  – …

How Text is Indexed
[Diagram: on-disk layout of the indices: a document table (begin, end, offset, …), a string hash table with singletons, posting lists and positional indices, structural indices kept as interval trees, and the corpus layers (tags); for the 878-million-token corpus the component sizes range from tens of megabytes up to several gigabytes (e.g. 3.7 GB, 6.8 GB, 7.5 GB, 4.7 GB and 1.5 GB for the largest structures).]
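As a rough illustration of the two index structures above, the following C++ sketch shows a posting-list entry, a structural interval, and a naive containment test of the kind used to evaluate "X within Y" constraints. The type and field names are assumptions made for the example, not the actual CMS code, and a real query engine would answer the containment test with an interval-tree stabbing query rather than a linear scan.

    #include <cstdint>
    #include <vector>

    // One posting-list entry: a document and the token positions of the hits in it.
    struct Posting {
        uint32_t doc_id;
        std::vector<uint32_t> positions;   // token offsets inside the document
    };
    // The posting list of a term such as [token="de"] is a sequence of such entries.
    using PostingList = std::vector<Posting>;

    // One structural element (<p>, <text>, <hi rend="it">, ...) spans an interval
    // of token positions; the elements of a layer are stored in an interval tree.
    struct Interval {
        uint32_t doc_id;
        uint32_t first_token;
        uint32_t last_token;
    };

    // "X within Y": a hit at position pos satisfies the constraint if some
    // interval of layer Y contains it.
    bool within(uint32_t doc, uint32_t pos, const std::vector<Interval>& layer) {
        for (const Interval& iv : layer)
            if (iv.doc_id == doc && iv.first_token <= pos && pos <= iv.last_token)
                return true;
        return false;
    }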
What is MapReduce?
• A widely used parallel computing model
• For batch processing of data at terabyte or petabyte scales
• Typically run on clusters of commodity servers
• Fault tolerance through task monitoring and replication
• Abstracts away parallelization, synchronization and communication
• Hadoop is the most popular scale-out implementation of MapReduce

Hadoop MapReduce
    Map(k1, v1)          -> list(k2, v2)
    Reduce(k2, list(v2)) -> list(v2)

The Scale-out vs. Scale-up Dilemma
• Scaling out (horizontally): a cluster of commodity servers
  – Pros: cheaper hardware; fault tolerance; easy to upgrade
  – Cons: power consumption, cooling and space; harder to implement; networking equipment; licensing costs
• Scaling up (vertically): a single powerful server (more processors, more RAM)
  – Pros: lower power consumption, cooling and space requirements; easier to implement; almost no networking equipment; lower licensing costs
  – Cons: expensive hardware; no fault tolerance; harder to upgrade

Appuswamy et al. (2013), "Scale-up vs Scale-out for Hadoop: Time to rethink?", Proc. 4th Annual Symposium on Cloud Computing.
• Input sizes in analytics production clusters are small:
  – Microsoft / Yahoo: median job input < 14 GB (i.e. 50% of jobs)
  – Facebook: 90% of job inputs < 100 GB
• Conclusion: "Scaling up is more cost-effective for many real-world problems"

Scaling up with Phoenix++ MapReduce
Justin Talbot, Richard M. Yoo, and Christos Kozyrakis (2011), "Phoenix++: Modular MapReduce for Shared-Memory Systems", Second International Workshop on MapReduce and its Applications. (Code: http://mapreduce.stanford.edu)
• Split: divides the input data into chunks for the map tasks
• Combiners: thread-local objects holding intermediate key-value aggregates
• Map: invokes the combiner for each emitted key-value pair (not only at the end of Map)
• Partition/Shuffle: not present in Phoenix++
• Reduce: aggregates the key-value pairs held in the combiners, once all map tasks have finished
• Merge: sorts the key-value pairs (optional phase)
• Problem: a rehash is introduced between the map and reduce phases

Experiments: Data & Machinery
• Base corpus: the Elpais.com on-line archive (1976–2011), 2.3 million news articles, cleaned and PoS tagged, 23 million structural elements, 878 million tokens
• Multiples of Elpais.com: x2, x4, x8, x16; the replicated corpora do not obey Heaps' law (V(n) = K·n^β), since duplication adds tokens but hardly any new types
[Plot: vocabulary size (types, up to ~4.5 million) vs. corpus size (tokens, up to ~2 billion) for the replicated corpora.]
• Machine: 1 x Dell PowerEdge R615
  – RAM: 256 GB
  – CPUs: 2 x Xeon E5-2620 @ 2 GHz (24 hardware threads)
  – Disks: 3 x 7,200 rpm SAS in RAID-5

Word Counting
Task: given a (sub-)corpus, compute the frequencies of all the words (the word frequency profile).

    Split(docs) = divide docs into <doc, begin, end> chunks

    Map(<doc, begin, end>) =
      for i = begin to end do
        emit(words[docs[doc].offset + i], 1)
      end

Each emitted pair goes into a thread-local combiner mapping word ids to frequencies, e.g. 0 (la) -> 456, 19 (de) -> 5667, …

Results:
[Plot: elapsed time vs. corpus size (up to 8 billion tokens) for 1–24 threads. With the 1-thread baseline, counting takes ~11 s for 1 billion words (Bw) and ~75 s for 8 Bw; with 16–24 threads it takes ~2 s for 1 Bw and ~10 s for 8 Bw. 16 threads perform about as well as 24.]
[Plot: elapsed time vs. number of threads (1–24) for Elpais.com x1, x2 and x4. Doubling from 1 to 2 threads roughly halves the time (12.0 s -> 6.2 s), but per-thread efficiency decreases as more threads are added (8 -> 16 threads: 1.5 s -> 1.0 s).]
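The word-counting times above come from Phoenix++'s thread-local combiners. The sketch below reproduces the idea with plain C++ threads rather than through the actual Phoenix++ API; all names (map_chunk, word_count, …) are illustrative assumptions. Each map task counts its chunk of token ids into its own hash table, and the reduce step aggregates the per-thread tables once every map task has finished, so no locking is needed during the map phase.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    using Counts = std::unordered_map<uint32_t, uint64_t>;  // word id -> frequency

    // Map over one chunk of the token stream; emits go straight into the
    // thread-local combiner instead of producing intermediate key-value pairs.
    static void map_chunk(const std::vector<uint32_t>& tokens,
                          std::size_t begin, std::size_t end, Counts& combiner) {
        for (std::size_t i = begin; i < end; ++i)
            ++combiner[tokens[i]];
    }

    // Split the corpus into one chunk per thread, run the map tasks, and then
    // reduce by aggregating the per-thread combiners (n_threads must be >= 1).
    Counts word_count(const std::vector<uint32_t>& tokens, unsigned n_threads) {
        std::vector<Counts> combiners(n_threads);
        std::vector<std::thread> workers;
        std::size_t chunk = (tokens.size() + n_threads - 1) / n_threads;
        for (unsigned t = 0; t < n_threads; ++t) {
            std::size_t b = std::min(tokens.size(), std::size_t(t) * chunk);
            std::size_t e = std::min(tokens.size(), b + chunk);
            workers.emplace_back(map_chunk, std::cref(tokens), b, e,
                                 std::ref(combiners[t]));
        }
        for (std::thread& w : workers) w.join();
        Counts total;
        for (const Counts& c : combiners)
            for (auto [word, freq] : c) total[word] += freq;
        return total;
    }

Because every thread writes only to its own combiner, contention appears only in the final aggregation, which is consistent with the near-linear scaling observed up to about 16 threads.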
Word Cooccurrences
Task: given a word (or a lemma, a PoS tag, …), compute the frequencies of the surrounding words, i.e. of the left and right contexts within a window of L tokens to the left and R tokens to the right.

    Split(posting list) = divide the posting list into <doc, n, positions> chunks

    Map(<doc, n, pos>) =
      for i = 0 to n - 1 do
        for j = 1 to L do
          emit(words[docs[doc].offset + pos[i] - j], 1)   // left context
        end
        for j = 1 to R do
          emit(words[docs[doc].offset + pos[i] + j], 1)   // right context
        end
      end

As in word counting, the emitted pairs are aggregated in thread-local combiners mapping words to frequencies.

Results for [token="de"], which has 55.7 million occurrences in Elpais.com x1:
[Plot: elapsed time vs. radius (L = R = 1..10) for 1–24 threads; ~10 s with 1 thread at radius 10, and ~2 s with 24 threads for any radius up to 10.]
[Plot: elapsed time vs. token frequency (Elpais.com x1–x8, up to ~450 million occurrences) with 24 threads, for radii 1, 3, 5 and 10; at radius 10, 55.7 million occurrences take ~2 s and 350 million occurrences take less than 8 s.]

N-grams
Task: given a (sub-)corpus, compute the frequency of all the sequences of words (or lemmas, PoS tags, …) with lengths between a minimum and a maximum.

MapReduce N-grams: the Suffix-σ algorithm
Klaus Berberich and Srikanta Bedathur (2013), "Computing n-Gram Statistics in MapReduce", Proceedings of the 16th International Conference on Extending Database Technology (EDBT 2013).
• Stage 1: maximal n-gram counts are computed and sorted
• Stage 2: counts of shorter n-grams are computed in the reducer tasks from the suffixes of the maximal n-grams
• Reported setting: 2–5-grams with minimum frequency 10 (NYT) and 100 (CW)
  – New York Times Annotated Corpus (NYT): 1.8 million newspaper articles, 1987–2007, ~3 GB, ~1 billion tokens (345,827 types)
  – ClueWeb09-B (CW): 50 million web documents, 2009, ~246 GB, ~21 billion tokens (979,935 types)
  – Hadoop cluster of 10 nodes: 2 x 6 cores, 64 GB RAM and 4 x 2 TB HDD per node

MapReduce N-grams (maximal length max = 4)

    Split(interval tree) = divide the interval tree into intervals <doc, begin, end>

    Map(<doc, begin, end>) =
      for pos = begin to end do
        emit(<words[docs[doc].offset + pos],
              words[docs[doc].offset + pos + 1],
              ...,
              words[docs[doc].offset + pos + max - 1]>, 1)   // one maximal n-gram per position
      end

The combiner maps maximal n-grams to frequencies, e.g. <0, 10, 19, 24> (la, nube, de, cenizas) -> 456 and <19, 24, 43, 45> (de, cenizas, ",", procedente) -> 5667.

Results:
[Plot: elapsed time (up to ~800 s) vs. sample size (up to 30 million tokens) for 2–3-grams, 2–5-grams, 2–10-grams and 2–20-grams; the annotated configuration takes about 3 minutes. The hash-table combiner becomes a problem at these key volumes.]

MapReduce in the Indexer: File Inversion
• Text inversion
• Structure inversion

MapReduce File Inversion: Text Indexing
Each map task emits, for every term (key) in its document chunk, the postings found in that chunk (value); the reduce phase aggregates the values of each key into a single posting list.
[Diagram: key-value pairs emitted for Doc1 and Doc2 and their per-key aggregation.]

MapReduce File Inversion: Indexing Structure
Each map task emits, for every element type (key), a tree of intervals (value); the reduce phase joins the emitted trees into the aggregated tree for that key, with
    Time(Join(T1, T2)) = O(max(h1, h2)) = O(max(log n1, log n2))

MapReduce File Inversion: Text Indexing Performance (24 threads)
[Plot: indexing time vs. corpus size: ~2 h for 1 Bw, ~4 h for 2 Bw, ~9:30 h for 4 Bw, ~1 day for 8 Bw and ~2.5 days for 15 Bw.]
Bottleneck: concurrent read access!
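To make the text-inversion reduce step concrete, here is an illustrative C++ fragment, not the actual indexer code and with names chosen only for the example, that aggregates the per-chunk posting lists emitted for one term into a single posting list ordered by document id; it assumes, as in the splits described above, that map tasks process disjoint document ranges.

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Same shape as the posting-list entry sketched earlier.
    struct Posting { uint32_t doc_id; std::vector<uint32_t> positions; };
    using PostingList = std::vector<Posting>;

    // Reduce step for one term: concatenate the posting lists emitted by the
    // map tasks and order the result by document id.  Since map tasks cover
    // disjoint document ranges, no two parts contain the same document and a
    // plain sort yields the final posting list.
    PostingList merge_postings(std::vector<PostingList> parts) {
        PostingList merged;
        for (PostingList& part : parts)
            merged.insert(merged.end(),
                          std::make_move_iterator(part.begin()),
                          std::make_move_iterator(part.end()));
        std::sort(merged.begin(), merged.end(),
                  [](const Posting& a, const Posting& b) { return a.doc_id < b.doc_id; });
        return merged;
    }

The structural indices are aggregated analogously, except that the emitted values are interval trees and the reduction joins two trees in O(max(h1, h2)) time, as noted above.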
Conclusions & Future Work
• Conclusions:
  – The applications scale well to the sizes of interest using MapReduce
  – MapReduce on multicore servers is a cost-effective solution for corpus management tools
• Future work:
  – Improve the hash containers
  – Introduce new associative containers
  – Implement other non-trivial statistics

Thank you! Questions?