Analysis of the NCSU Library URLs in the Common Crawl Index

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people, and we would like to share an interesting use of it that was very well described in a great blog post by Jason Ronallo. Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl URL Index to look at NCSU Library URLs in the crawl corpus. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you'd have to parse through all of it yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it is still rather expensive for most people. Now, with the URL index, it is possible to query for the domains you are interested in and discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.

Scott Robertson, who was responsible for putting the index together, writes in the GitHub README about the file format used for the index and the algorithm for querying it. If you're interested, you can read the details there. If you just want to see how to get the data now, the repository provides a couple of Python scripts for querying the index. You'll need to clone the git repository to get the scripts along with the library files.

The result is a line-delimited file with information about one URL on each line. A space separates the URL from some JSON-like data.
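To make that format concrete, here is a minimal Python sketch that splits one line of the query output into its URL key and its metadata. The sample line, the reversed-hostname key, and the metadata field names in it are illustrative assumptions rather than values taken from the actual index, and the parser assumes the data after the space on each line is valid JSON.

    import json

    def parse_index_line(line):
        """Split one index line into (url_key, metadata).

        Each line holds a URL key, a single space, and a JSON object,
        so split on the first space only -- the JSON itself contains spaces.
        """
        url_key, _, raw_json = line.partition(" ")
        return url_key, json.loads(raw_json)

    # Hypothetical example line; the real keys and metadata fields may differ.
    sample = 'edu.ncsu.lib.www/:http {"arcSourceSegmentId": 1346823845675, "arcFileOffset": 1173750}'

    url_key, metadata = parse_index_line(sample)
    print(url_key)                    # edu.ncsu.lib.www/:http
    print(metadata["arcFileOffset"])  # 1173750

Applied line by line to the file the query script produces, the same function yields each URL along with the metadata that points into the crawl segments, which is what lets you grab just the pages you care about. Splitting on the first space only is the important detail, since the JSON payload itself contains spaces.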