Functional and Performance Analysis of jieba Word Segmentation


Guiding questions for jieba segmentation

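  1. How large can the user dictionary be?
  2. How does the size of the user dictionary affect speed?
  3. How are words that share a prefix or suffix distinguished?
  4. How does it compare with Baidu's segmentation API?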

Question 1: Dictionary size

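Judging from the source tree, the jieba source totals 81MB, of which the system dictionary dict.txt accounts for 5.16MB, so a user dictionary of at least 5.16MB should be workable. In terms of entries, the system dictionary has 349,047 lines, each carrying three fields: word, frequency, and part-of-speech tag. A first estimate is therefore that the user dictionary can be quite large.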
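The two figures above (file size and line count) can be measured for any dictionary file. The following is a minimal sketch; the helper name `dict_stats` is made up for illustration and is not part of jieba.

```python
import os

def dict_stats(path):
    """Return (size in MB, number of non-empty lines) for a dictionary file."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    with open(path, encoding="utf-8") as f:
        # Each non-empty line is one dictionary entry.
        entries = sum(1 for line in f if line.strip())
    return size_mb, entries
```

Calling `dict_stats` on the system dictionary and on the user dictionary gives a like-for-like comparison of the two.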

import os
import pandas as pd

path = os.getcwd()
print(path)
dict_path = os.path.join(path, 'medical_dict')
# pandas.read_csv() uses the C engine as its parser by default, and the C engine can fail in some cases when the file name contains Chinese characters. Passing engine='python' avoids the problem.
names = ['部位', '疾病', '检查', '手术', '药品', '症状', '中药']
# DataFrame.append was removed in pandas 2.0, so the seven files are combined with pd.concat.
res = pd.concat(
    [pd.read_csv(os.path.join(dict_path, name + '.txt'), sep=' ', header=None, encoding='utf-8', engine='python')
     for name in names]
)

print(res.count())
0    35885
1    35880
2    35880

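This loads 35,885 medical terms, far fewer than the system dictionary holds. They are then exported as a single user dictionary with the following code: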

# Move the original second column to the third position
res[2] = res[1]
# Give every word the same frequency
res[1] = 883635
print(res.head())
res.to_csv(dict_path+'\\medicaldict.txt',sep=' ',header=False,index=False)
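jieba's README describes the user dictionary format as one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below builds such lines without pandas. It assumes that the second column of the source files is the part-of-speech tag; the function name is hypothetical.

```python
def to_userdict_lines(entries, freq=883635):
    """Format (word, tag) pairs as 'word freq tag' lines; tag may be None."""
    lines = []
    for word, tag in entries:
        parts = [word, str(freq)]
        if tag:
            # The tag is optional, so it is appended only when present.
            parts.append(tag)
        lines.append(" ".join(parts))
    return lines

print("\n".join(to_userdict_lines([("腰部酸痛", "n"), ("尿频", None)])))
```

Leaving out a missing tag, rather than writing an empty field, avoids trailing spaces in rows where the source file had no second column.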

Test sentence

#encoding=utf-8
import jieba
import jieba.posseg as pseg
import os
path = os.getcwd()
# Load the user dictionary
jieba.load_userdict(path + "\\medical_dict\\medicaldict.txt")
print( path + "\\medical_dict\\medicaldict.txt" )
test_sent = (
"患者1月前无明显诱因及前驱症状下出现腹泻,起初稀便,后为水样便,无恶心呕吐,每日2-3次,无呕血,无腹痛,无畏寒寒战,无低热盗汗,无心悸心慌,无大汗淋漓,否认里急后重感,否认蛋花样大便,当时未重视,未就诊。")
words = jieba.cut(test_sent)
print('/'.join(words))
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Public\Documents\Wondershare\CreatorTemp\jieba.cache
Loading model cost 0.921 seconds.
Prefix dict has been built succesfully.
D:\code\jiebafenci\jieba\medical_dict\medicaldict.txt
患者/1/月前/无/明显/诱因/及/前驱/症状/下/出现/腹泻/,/起初/稀便/,/后/为/水样便/,/无/恶心/呕吐/,/每日/2/-/3/次/,/无/呕血/,/无/腹痛/,/无/畏寒/寒战/,/无/低热/盗汗/,/无/心悸/心慌/,/无/大汗淋漓/,/否认/里急后重/感/,/否认/蛋/花样/大便/,/当时/未/重视/,/未/就诊/。

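The result is as expected, and segmenting a single symptom description is fast. jieba can comfortably hold the entire medical vocabulary; the effect on performance is analyzed further below.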

Question 2: Effect of dictionary size on efficiency

  1. 35,885 words, 1 test sentence
load time: 1.0382411479949951s
cut time: 0.0s
  2. 71,770 words, 1 test sentence
load time: 1.4251623153686523s
cut time: 0.0s
  3. 1,148,160 words
load time: 7.892921209335327s
cut time: 0.0s

Loading is gradually getting slower.

  4. 2,296,320 words
load time: 15.106632471084595s
cut time: 0.0s

On this machine, loading is already becoming very slow.

  5. 4,592,640 words
load time: 30.660043001174927s
cut time: 0.0s
  6. 9,185,280 words
load time: 56.30760192871094s
  7. 18,370,560 words
load time: 116.30s

[Figure: line chart of load time versus number of dictionary words]

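Plotted as a line chart (above), load time is roughly proportional to the number of words in the dictionary, at about 6 to 7 seconds per million words for the larger sizes. Because a loaded dictionary normally stays resident in memory, a large one also puts a considerable load on memory and I/O.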
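The proportionality claim can be checked against the load times listed above with an ordinary least-squares line. This is a minimal sketch in plain Python; the data points are the ones reported in this section, and the helper name `fit_line` is made up for illustration.

```python
# (number of words, load time in seconds) as reported above
measurements = [
    (35885, 1.0382411479949951),
    (71770, 1.4251623153686523),
    (1148160, 7.892921209335327),
    (2296320, 15.106632471084595),
    (4592640, 30.660043001174927),
    (9185280, 56.30760192871094),
    (18370560, 116.30),
]

def fit_line(points):
    """Ordinary least squares: return (slope, intercept) of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(measurements)
print(f"{slope * 1e6:.2f} s per million words, {intercept:.2f} s fixed cost")
```

The slope is the per-word loading cost, and the intercept approximates the fixed cost of building the default prefix dictionary that every run pays regardless of the user dictionary.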

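When 2,220 medical history records were then fed in, segmentation time was still essentially unaffected: it stayed under 0.1s, so the time spent segmenting is negligible.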
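A note on measuring cut time: `jieba.cut` returns a generator, so if the timer wrapped only the call itself (as in `words = jieba.cut(test_sent)`), a reading of 0.0s would be expected because no segmentation has run yet. Whether that is what happened in the measurements above is an assumption. The sketch below shows the effect with a hypothetical stand-in generator instead of jieba, and uses `time.perf_counter` for a high-resolution clock.

```python
import time

def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def lazy_cut(sentence):
    # Hypothetical stand-in for a generator-based segmenter:
    # the work happens only when the generator is iterated.
    for ch in sentence:
        time.sleep(0.001)
        yield ch

def eager_cut(sentence):
    # Consuming the generator inside the timed call measures the real work.
    return list(lazy_cut(sentence))

_, lazy_s = timed(lazy_cut, "无尿频")
words, eager_s = timed(eager_cut, "无尿频")
print(f"lazy: {lazy_s:.6f}s, eager: {eager_s:.6f}s")
```

With jieba itself, timing `jieba.lcut(test_sent)` or `list(jieba.cut(test_sent))` would include the segmentation work in the measurement.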

Question 3: How are words that share a prefix or suffix distinguished?

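  1. For 无尿急、尿频、尿痛 (no urinary urgency, frequency, or pain), jieba segments the text correctly once the user dictionary is loaded. The relevant case record is shown below.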
/患者/3/小时/前/无/明显/诱因/出现/上/腹部/疼痛/,/左/上腹/为主/,/持续性/隐痛/,/无/放射/,/无/恶心/及/呕吐/,/无/泛酸/及/嗳气/,/无/腹胀/及/腹泻/,/无/咳嗽/及/咳痰/,/无/胸闷/及/气急/,/无/腰酸/及/腰疼/,/无/尿急/、/尿频/及/尿痛/,/无/头晕/,/无/黒/曚/,/无/畏寒/及/发热/,/无尿/黄/,/无/口苦/,/来/我院/求治/。

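However, 无尿黄 (no dark urine) was split as 无尿/黄. A look at the user dictionary showed that it has no entry for the symptom 尿黄, so this is a gap in the dictionary and is skipped here. The symptom list does, however, contain both 无尿 (anuria) and 尿频 (urinary frequency). The initial hypotheses were that the split depends either on the order of the words in the dictionary or on jieba's internal segmentation strategy. To test the first: 无尿 is on line 14221 of the dictionary and 尿频 on line 13561; after moving 无尿 to the first line, the result was still 无/尿频. The behavior therefore comes from jieba's internal algorithm, not from dictionary order. The observation is that when two words have the same frequency, the later-matched word wins: 尿频 is matched after 无尿, so 尿频 is kept as the word, which is also the correct segmentation. (jieba's README describes the strategy as a dynamic-programming search for the maximum-probability path over word frequencies, so with 无尿 and 尿频 tied, the frequencies of the leftover single characters 无 and 频 presumably decide the outcome.)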
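To illustrate how a frequency-based maximum-probability search resolves 无尿频, here is a toy segmenter. It is a sketch in the spirit of the approach jieba's README describes, not jieba's actual code, and all frequencies in it are hypothetical.

```python
import math

def segment(sentence, freq):
    """Pick the maximum-probability split of `sentence` over a toy dictionary."""
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of the first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters get frequency 1 so every position stays reachable.
            if word in freq or j == i + 1:
                score = math.log(freq.get(word, 1)) - log_total + route[j][0]
                candidates.append((score, j))
        route[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

# Hypothetical frequencies: 无尿 and 尿频 are tied, 无 is common, 频 is rare.
toy = {"无": 5000, "频": 50, "无尿": 100, "尿频": 100}
print(segment("无尿频", toy))
# Same tie, but with the single-character frequencies swapped.
print(segment("无尿频", {"无": 50, "频": 5000, "无尿": 100, "尿频": 100}))
```

With the first table the split is 无/尿频; with the single-character frequencies swapped it becomes 无尿/频. In this toy model the tie between the two-character words is broken by the leftover characters, independent of dictionary order.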

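  2. Another example is 腰部酸痛 (lumbar soreness): the body-part list contains 腰部 and the symptom list contains 腰部酸痛. The test below checks how jieba chooses between them.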
Test dictionary:
腰部 883635
酸痛 883635
腰部酸痛 883635
Test result: 腰部酸痛

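After swapping the order of the dictionary entries and raising the frequencies of both 腰部 and 酸痛 above 883635, the result was still 腰部酸痛. jieba therefore tends to prefer the longer word, even when the shorter words have higher frequencies. A classmate of mine who studies medicine also considers the longer word the more reasonable segmentation, so nothing needs to be handled here.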
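One way to see why the longer word tends to win under a maximum-probability model: a single word contributes one probability, while a two-word split contributes the product of two probabilities, each below 1. The sketch below works through that arithmetic. The total frequency count and the split-word frequencies are hypothetical values chosen for illustration, not figures from jieba.

```python
import math

def long_word_wins(f_long, f_a, f_b, total):
    """Compare log P(long word) with log P(a) + log P(b) under unigram frequencies."""
    log_total = math.log(total)
    long_score = math.log(f_long) - log_total
    split_score = (math.log(f_a) - log_total) + (math.log(f_b) - log_total)
    return long_score > split_score

TOTAL = 100_000_000  # hypothetical sum of all dictionary frequencies

# Short words more than twice as frequent as the long word: the long word still wins.
print(long_word_wins(883635, 2_000_000, 2_000_000, TOTAL))
# The split only wins once f_a * f_b exceeds f_long * TOTAL.
print(long_word_wins(883635, 30_000_000, 30_000_000, TOTAL))
```

The first call prints True and the second False. Under this model, raising the short words' frequencies somewhat above 883635 is not enough to flip the result, which matches the behavior observed in the test.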

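  3. Reviewing the case records shows that many contain terms of the form direction + body part, which should be segmented as a single word. The code below adds direction + body-part entries to the dictionary, skipping any that already exist.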
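import pandas as pd

res = pd.read_csv(dict_path+'\\部位.txt',sep=' ',header=None,encoding='utf-8',engine='python')
direct = ['上','下','左','右','前','后']
print(res[0].head())
resum = res[0].count()
print(resum)
# Prepend 左 (left) and 右 (right) to each body-part name
new_terms = []
for item in res[0].tolist():
    # Skip names that already start with a direction character
    if item[0] in direct:
        continue
    new_terms.extend(['左' + item, '右' + item])
# Series.append was removed in pandas 2.0, so pd.concat is used instead;
# drop_duplicates removes generated terms that were already in the list
result = pd.concat([res[0], pd.Series(new_terms)]).drop_duplicates().reset_index(drop=True)
print(result.tail())
df = pd.DataFrame(result)
print(df.describe())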
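The same idea can be sketched without pandas, which makes the "skip if already present" rule explicit. The function name and the sample terms are hypothetical.

```python
def add_side_prefixes(parts, directions=("上", "下", "左", "右", "前", "后"), sides=("左", "右")):
    """Return parts plus side-prefixed variants, skipping terms that already exist."""
    existing = set(parts)
    result = list(parts)
    for part in parts:
        # Names that already start with a direction character are left alone.
        if not part or part[0] in directions:
            continue
        for side in sides:
            term = side + part
            if term not in existing:
                existing.add(term)
                result.append(term)
    return result

print(add_side_prefixes(["肾", "左肾", "上腹"]))
```

Here 左肾 is already in the list and 上腹 starts with a direction character, so only 右肾 is added.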

Baidu word segmentation

GitHub: https://github.com/baidu/lac/

Implementation code

from LAC import LAC

# Load the segmentation model
lac = LAC(mode='seg')

# Single-sample input: a Unicode string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)

# Batch input: a list of sentences, which gives a higher average throughput
texts = [u"腰部酸痛"]
lac.load_customization('userdict.txt', sep=None)
seg_result = lac.run(texts)
print(seg_result)

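The user dictionary only needs each word and its part-of-speech tag. Testing led to these conclusions:

  1. Baidu's segmenter has no notion of word frequency, but it also tends to prefer longer words.

  2. The later-matched word wins: 无尿频, for example, is also split as 无/尿频, regardless of the order of the entries in the dictionary.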

Original: https://www.cnblogs.com/linkcxt/p/14770968.html
Author: linkcxt
Title: jieba分词的功能和性能分析



Related reading

Title: 观测下老外的水平如何 (Seeing how good developers abroad are)

I'm not exactly sure what you are trying to achieve here. Whatever you transceive with IsoPcdA's transceive method are complete APDUs (as defined in ISO/IEC 7816-4, or rather any PDU within the ISO-DEP transport protocol). So the return value of transceive is a full C-APDU (command APDU) and the byte array parameter of transceive is a full R-APDU (response APDU) including the two bytes of the status word (SW1 | SW2). Thus, the last two bytes of that parameter are the status word. In your example SW1 would be 02 and SW2 would be 03.
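As a small illustration of the point about the status word, the sketch below splits a response APDU into its data field and SW1/SW2. It is written in Python for brevity, the function name is hypothetical, and the sample bytes 01 02 03 are an assumption about the example referred to above.

```python
def split_response_apdu(apdu: bytes):
    """Split an R-APDU into (data, SW1, SW2); the last two bytes are the status word."""
    if len(apdu) < 2:
        raise ValueError("an R-APDU needs at least the two status bytes")
    return apdu[:-2], apdu[-2], apdu[-1]

data, sw1, sw2 = split_response_apdu(bytes.fromhex("010203"))
print(data.hex(), f"SW1={sw1:02X}", f"SW2={sw2:02X}")
```

For the bytes 01 02 03 this yields one data byte (01) with SW1 = 02 and SW2 = 03.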

What you see as the status byte in the InDataExchange command of the PN532 NFC controller is not the status word of the APDU but the status of the command execution within the PN532 NFC controller. This status byte gives you information about buffer overflows, communication timeouts, etc., and is not something that is returned by the card side.

EDIT : Sample code + test commands:

Sample Code running on Galaxy Nexus (CM 10):

try {
  Class isoPcdA = Class.forName("android.nfc.tech.IsoPcdA");
  Method isoPcdA_get = isoPcdA.getDeclaredMethod("get", Tag.class);

  final IsoPcdA techIsoPcdA = (IsoPcdA)isoPcdA_get.invoke(null, tag);

  if (techIsoPcdA != null) {
    if (mWorker != null) {
      mInterrupt = true;
      mWorker.interrupt();
      try {
        mWorker.join();
      } catch (Exception e) {}
    }

    mInterrupt = false;
    mWorker = new Thread(new Runnable() {
      public void run () {
        try {
          techIsoPcdA.connect();

          byte[] command = techIsoPcdA.transceive(new byte[]{ (byte)0x90, (byte)0x00 });
          Log.d(CardEmulationTest.class.getName(), "Connected.");

          while (!mInterrupt) {
            Log.d(CardEmulationTest.class.getName(), "C-APDU=" + StringUtils.convertByteArrayToHexString(command));
            command = techIsoPcdA.transceive(command);
          }
        } catch (Exception e) {
          Log.e(CardEmulationTest.class.getName(), "Exception while communicating on IsoPcdA object", e);
        } finally {
          try {
            techIsoPcdA.close();
          } catch (Exception e) {}
        }
      }
    });

    mWorker.start();
  }
} catch (Exception e) {
  Log.e(CardEmulationTest.class.getName(), "Exception while processing IsoPcdA object", e);
}

Test (using ACR122U):

InListPassiveTarget (1 target at 106 kbps)

> FF00000004 D44A 0100 00
< D54B 010100046004088821310578338800 9000

InDataExchange with DATA = 0x01

> FF00000004 D440 01 01 00
< D541 00 01 9000

So we get an error code of 0x00 from the card reader (status of InDataExchange command; not part of the actual response APDU), we get 0x01 as the response (this is the IsoDepA response APDU) and we get 0x9000 as the status code for the card reader wrapper APDU (not part of the actual response APDU).

InDataExchange with DATA = 0x01 0x02

> FF00000005 D440 01 0102 00
< D541 00 0102 9000

So we get an error code of 0x00 from the card reader (status of InDataExchange command; not part of the actual response APDU), we get 0x01 0x02 as the response (this is the IsoDepA response APDU) and we get 0x9000 as the status code for the card reader wrapper APDU (not part of the actual response APDU).

InDataExchange with DATA = 0x01 0x02 0x03

> FF00000006 D440 01 010203 00
< D541 00 010203 9000

So we get an error code of 0x00 from the card reader (status of InDataExchange command; not part of the actual response APDU), we get 0x01 0x02 0x03 as the response (this is the IsoDepA response APDU) and we get 0x9000 as the status code for the card reader wrapper APDU (not part of the actual response APDU).

InDataExchange with DATA = 0x01 0x02 0x03 0x04

> FF00000007 D440 01 01020304 00
< D541 00 01020304 9000

So we get an error code of 0x00 from the card reader (status of InDataExchange command; not part of the actual response APDU), we get 0x01 0x02 0x03 0x04 as the response (this is the IsoDepA response APDU) and we get 0x9000 as the status code for the card reader wrapper APDU (not part of the actual response APDU).

Thus, the response APDU contains exactly the data that we sent as the command APDU (note that none of these APDUs is formatted according to ISO 7816-4, but that doesn't matter, as the IsoPcdA card emulation works with any ISO 14443-4 transport protocol format).

The status code of 0x9000 belongs to the card reader APDU encapsulation (CLA=FF INS=00 P1P2=0000 Lc [PN532 COMMAND] Le=00) that is required because the ACR122U's PN532 is accessed over the CCID (PC/SC) interface. This is pure reader command encapsulation and has nothing to do with the communication over ISO-DEP.

The D440 01 [DATA] is the PN532 command to exchange data (e.g. APDUs) over ISO-DEP and the D541 00 [DATA] is the associated response.
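The three layers in the traces above (reader wrapper, PN532 command, ISO-DEP payload) can be made explicit with a short sketch. It is written in Python for brevity, the function names are hypothetical, and the byte layout is taken only from the traces and the encapsulation described in this answer.

```python
def wrap_in_data_exchange(data: bytes, target: int = 0x01) -> bytes:
    """Build the reader wrapper APDU around a PN532 InDataExchange command."""
    pn532_cmd = bytes([0xD4, 0x40, target]) + data
    # CLA=FF INS=00 P1P2=0000 Lc [PN532 command] Le=00
    return bytes([0xFF, 0x00, 0x00, 0x00, len(pn532_cmd)]) + pn532_cmd + b"\x00"

def unwrap_in_data_exchange(resp: bytes):
    """Split a wrapped response into (PN532 status, ISO-DEP data, reader status word)."""
    if len(resp) < 5 or resp[:2] != b"\xD5\x41":
        raise ValueError("not an InDataExchange response")
    reader_sw = int.from_bytes(resp[-2:], "big")
    return resp[2], resp[3:-2], reader_sw

print(wrap_in_data_exchange(bytes.fromhex("0102")).hex().upper())
print(unwrap_in_data_exchange(bytes.fromhex("D5410001029000")))
```

For DATA = 0x01 0x02 the first call reproduces the command bytes of the trace above (FF00000005 D440 01 0102 00), and the second separates the PN532 status 0x00, the payload 01 02, and the wrapper status 0x9000.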

Original: https://www.cnblogs.com/jiftle/p/16508368.html
Author: jiftle
Title: 观测下老外的水平如何