The Sejong Corpus

One of my academic interests is corpus linguistics, and so naturally I’d love to find a great Korean corpus.  I still haven’t found one I can download and access properly, but there are some web-based ones.  One of these is the Sejong Corpus.

A corpus is just a very large collection of texts that has been annotated (generally for part of speech and lemma) so that powerful and specific searches can be performed.  The Sejong Corpus is a collection of Korean texts, and there are several ways you can search it, but it’s only available through the web site, and the search tools are not nearly as powerful as what I can do with the British National Corpus (which I have on my own computer, and can also access on-line here).

The main search form on the Sejong Corpus allows you to search for any lemma, and gives examples of that lemma, including any inflected forms.  Unfortunately, it doesn’t allow you to search for specific forms of a word.  So I can search for 연구, and it gives sentences with the word 연구 in many forms, like 연구를, 연구와, and 연구이다, but I can’t search for 연구를 alone, and I can’t search for every word ending in ~를.  This is quite a shame, as that tool would be really useful, especially for looking up examples of certain grammatical forms.  There may be some way to do this, but I haven’t found it yet.  It also seems to be lacking a way of searching for collocations.
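To make the difference between a lemma search and a form search concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three toy sentences are invented, the `Token` layout and tag names are hypothetical, and a form like 연구이다 is treated as a single token, which is a simplification. It is not how the Sejong site itself works.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary headword
    pos: str    # part-of-speech tag (hypothetical tag names)


# Toy annotated corpus: invented sentences, not Sejong data.
CORPUS = [
    [Token("연구를", "연구", "NOUN"), Token("시작했다", "시작하다", "VERB")],
    [Token("연구와", "연구", "NOUN"), Token("교육을", "교육", "NOUN"),
     Token("한다", "하다", "VERB")],
    [Token("이것은", "이것", "PRON"), Token("연구이다", "연구", "NOUN")],
]


def by_lemma(corpus, lemma):
    """Sentences containing any form of the lemma (what the site offers)."""
    return [s for s in corpus if any(t.lemma == lemma for t in s)]


def by_form(corpus, form):
    """Sentences containing exactly this surface form."""
    return [s for s in corpus if any(t.form == form for t in s)]


def by_ending(corpus, ending):
    """Surface forms that end in the given string, e.g. a particle."""
    return [t.form for s in corpus for t in s if t.form.endswith(ending)]


print(len(by_lemma(CORPUS, "연구")))   # lemma search finds every form
print(len(by_form(CORPUS, "연구를")))  # form search is narrower
print(by_ending(CORPUS, "를"))         # search by ending
```

The lemma search matches all three toy sentences, while the form search matches only the one containing 연구를. A plain string match on the ending is crude (it cannot tell a particle from a word that merely ends in the same syllable), so a real tool would want morpheme-level annotation instead.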

One useful tool is the 전자사전검색 (see the top menu), which allows you to search for words found in the corpus.  It differs from regular dictionaries in that it includes all the words found in the corpus, and only those, so it has many words not found even in the 국어사전, including ones formed by productive word-formation rules.  The really neat thing about it is that, after choosing the part of speech you want on the left (choose 용언 if you want verbs or adjectives), you can also search for 시작하는, 끝나는, or 포함하는 words; that is, words beginning with, ending with, or including your search term.  So if I want to look for examples of words ending in 꾼, I search 체언, 끝나는, “꾼”, and it gives 13 words, including 밀수꾼, 사기꾼, and 소몰이꾼.
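The three search modes are easy to picture as string tests over a word list. The sketch below is only an illustration under stated assumptions: `LEXICON` is a made-up five-word list (three of the words come from the search result above), and `lookup` is a hypothetical helper, not anything the site exposes.

```python
# Hypothetical mini word list of (word, part-of-speech category) pairs.
LEXICON = [
    ("밀수꾼", "체언"),
    ("사기꾼", "체언"),
    ("소몰이꾼", "체언"),
    ("연구", "체언"),
    ("꾸미다", "용언"),
]

# Map each search mode from the form to the matching string test.
MATCHERS = {
    "시작하는": str.startswith,    # begins with
    "끝나는": str.endswith,        # ends with
    "포함하는": str.__contains__,  # includes
}


def lookup(lexicon, pos, mode, term):
    """Return words of the given category that match term in the given mode."""
    match = MATCHERS[mode]
    return [word for word, cat in lexicon if cat == pos and match(word, term)]


print(lookup(LEXICON, "체언", "끝나는", "꾼"))
```

With this toy list, the call returns the three 꾼 words and skips 연구 and 꾸미다. The real dictionary search returns 13 words because it draws on the whole corpus.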

I still haven’t checked out the entire site, but they’ve also got ways of searching for dialectal forms and for field-specific terms (전문 용어), e.g. comparing the frequency of words in different academic fields.  Overall it’s a good start, but I can think of many other search tools they could add to make it more useful.


5 thoughts on “The Sejong Corpus”

    1. Jibril

      I know this post is a little old, but I thought you should know that you can actually request a copy of the 세종 corpus here

      You don’t have to live in Korea to make the request. The only benefit of making your request when you’re in Korea is that you have the option of choosing between receiving a DVD with the data on it or a link in your inbox with the login credentials.

      Also, SNU has set up a better set of tools for searching the 세종 corpus here. And if I remember correctly, it even lets you search for collocations in addition to searching by POS tags.

  1. Jibril

    I also forgot to add that KAIST also has a free Korean corpus that is available for download. You can choose between the annotated corpus and the raw corpus.

    They have lots of other tools and corpora available too.

  2. Ed Kim

    I submitted requests through the KAIST website multiple times in an attempt to download the corpus, but I never received a response with the download information. I may have to contact them personally.
    Thanks for the information about the Sejong corpus. Have you personally done any work with Korean corpora?

    1. Jibril

      I’ve never had any problem requesting data from KAIST. I think the e-mails they send out are automated, so it could be that the messages are being misidentified as spam. In the meantime, you can always search the KAIST corpus online:

      Anyway, I haven’t done much serious work with Korean corpora beyond counting the occurrences of whatever token I’m interested in. Since my MA coursework was more linguistic than computational or programming based, tokens and their frequencies were all I ever really needed.
