Wow! Hands-on annotation with brat after installation, with one-command conversion to BIO format for model training, plus error fixes


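The command find <folder name> -name '*.txt' | sed -e 's|\.txt$|.ann|' | xargs touch creates an empty .ann annotation file for every .txt file. brat requires each .txt file in a collection to have a matching .ann file that holds its annotations, and the .ann format itself is well defined.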

jtt@jtt-System-Product-Name:/var/www/html/brat/ainer/data$ find all-txt -name '*.txt'|sed -e 's|\.txt|.ann|g'|xargs touch

jtt@jtt-System-Product-Name:/var/www/html/brat/ainer/data$ ls
all-txt
jtt@jtt-System-Product-Name:/var/www/html/brat/ainer/data$ cd all-txt
jtt@jtt-System-Product-Name:/var/www/html/brat/ainer/data/all-txt$ ls
2021-10.ann  2021-2.ann  2021-4.ann  2021-6.ann  2021-8.ann
2021-10.txt  2021-2.txt  2021-4.txt  2021-6.txt  2021-8.txt
2021-1.ann   2021-3.ann  2021-5.ann  2021-7.ann  2021-9.ann
2021-1.txt   2021-3.txt  2021-5.txt  2021-7.txt  2021-9.txt
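
The same step can also be done from Python, which may help on machines without find or sed. This is only a minimal sketch: the function name create_empty_ann_files is made up for illustration, and it deliberately leaves existing .ann files untouched.

```python
from pathlib import Path


def create_empty_ann_files(folder):
    """Create an empty .ann file next to every .txt file under folder.

    Returns the list of .ann paths that were newly created.
    """
    created = []
    for txt_path in sorted(Path(folder).rglob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")  # 2021-1.txt -> 2021-1.ann
        if not ann_path.exists():
            ann_path.touch()
            created.append(ann_path)
    return created
```

Calling create_empty_ann_files("all-txt") from the data directory would mirror the shell command above.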

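3. Annotation

In the directory that holds the data to annotate, add the configuration file annotation.conf and write down the annotation scheme, that is, which named entity types and which semantic relations you are going to annotate. For example: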

[entities]
OTH
LOC
NAME
ORG
TIME
TIL
NUM

[relations]

[events]

[attributes]

jtt@jtt-System-Product-Name:~$ cd /var/www/html/brat/ainer
jtt@jtt-System-Product-Name:/var/www/html/brat/ainer$ ls
annotation.conf  data  visual.conf
jtt@jtt-System-Product-Name:/var/www/html/brat/ainer$ vi annotation.conf

Press i on the keyboard to enter insert mode and edit the entity types, changing them as follows:

Method-tech
Area-subject
Time
Other

Relations
Applyied Arg1:Method-tech,Arg2:Area-subject
Associate Arg1:Method-tech,Arg2:Method-tech
Associate2 Arg1:Area-subject,Arg2:Area-subject
Emergence Arg1:Method-tech,Arg2:Time

Located            Arg1:Other, Arg2:Other
Geographical_part  Arg1:Other,    Arg2:Other
Family             Arg1:Person, Arg2:Other
Employment         Arg1:Other, Arg2:Other
Ownership          Arg1:Other, Arg2:Other

Attributes
Merge-time Arg:<relation>

Press Esc to leave insert mode.
Move to the end and type :wq to save and return to the command line.
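
Before opening brat, it can be worth checking that every relation argument refers to an entity type that is actually declared. For instance, the Family line above uses Person, which does not appear in the entity list shown. The sketch below is a simplified check and not part of brat: the function names and SAMPLE_CONF are made up for illustration, the sample is a shortened version of the scheme above with the section headers written out, and only plain type names and |-separated alternatives are handled.

```python
SAMPLE_CONF = """
[entities]
Method-tech
Area-subject
Time
Other

[relations]
Applyied Arg1:Method-tech, Arg2:Area-subject
Family Arg1:Person, Arg2:Other

[events]

[attributes]
"""


def parse_annotation_conf(text):
    """Split annotation.conf text into {section name: [non-empty lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def undefined_relation_args(sections):
    """Return (relation, type) pairs whose argument type is not a declared entity."""
    entities = {line.split()[0] for line in sections.get("entities", [])}
    problems = []
    for line in sections.get("relations", []):
        parts = line.split(None, 1)
        name = parts[0]
        arg_spec = parts[1] if len(parts) > 1 else ""
        for arg in arg_spec.split(","):
            arg = arg.strip()
            if not arg or arg.startswith("<"):
                continue  # special <...> arguments are outside this sketch
            type_spec = arg.partition(":")[2]
            for type_name in type_spec.split("|"):
                type_name = type_name.strip()
                if type_name and type_name not in entities:
                    problems.append((name, type_name))
    return problems
```

With the sample above, undefined_relation_args(parse_annotation_conf(SAMPLE_CONF)) reports the Family relation and its Person argument.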

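Open the brat page, log in with your own account, go into the collection directly from the page, and open a file to annotate.

To annotate a named entity, drag the cursor over the text span; to annotate a relation, drag with the mouse from one entity to the other.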

jtt@jtt-System-Product-Name:~$ cd /var/www/html/brat
jtt@jtt-System-Product-Name:/var/www/html/brat$ python2 standalone.py
Serving brat at http://127.0.0.1:8001

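At this point I hit a problem: the data to annotate did not show up anywhere. After some testing I finally tracked it down and moved the configuration files and data to the correct location. (This is the pitfall to watch out for.) In short: create a new folder inside data, /var/www/html/brat/data/all-txt, containing the text files, the .ann files, and the configuration files annotation.conf and visual.conf.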
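
A quick way to confirm a collection folder has everything listed in the summary above is a small check script. This is a hedged sketch, not a brat tool: check_collection is a hypothetical helper, and the list of required configuration files simply follows the summary above.

```python
from pathlib import Path


def check_collection(folder, required=("annotation.conf", "visual.conf")):
    """Report what a collection folder lacks.

    Returns a pair: (names of .txt files with no .ann, missing config file names).
    """
    folder = Path(folder)
    missing_ann = sorted(
        p.name for p in folder.glob("*.txt") if not p.with_suffix(".ann").exists()
    )
    missing_conf = [name for name in required if not (folder / name).exists()]
    return missing_ann, missing_conf
```

For the layout in this post, the call would be check_collection("/var/www/html/brat/data/all-txt"); two empty lists mean nothing is missing.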

jtt@jtt-System-Product-Name:/var/www/html/brat$ cd data
jtt@jtt-System-Product-Name:/var/www/html/brat/data$ ls
6.ann  6.txt  examples  tutorials

Move the folder:
jtt@jtt-System-Product-Name:/var/www/html/brat/ainer/data$ mv all-txt /var/www/html/brat/data/

jtt@jtt-System-Product-Name:/var/www/html/brat$ cd data
jtt@jtt-System-Product-Name:/var/www/html/brat/data$ ls
6.ann  6.txt  all-txt  examples  tutorials
jtt@jtt-System-Product-Name:/var/www/html/brat/data$ rm 6.ann
jtt@jtt-System-Product-Name:/var/www/html/brat/data$ rm 6.txt
jtt@jtt-System-Product-Name:/var/www/html/brat/data$ ls
all-txt  examples  tutorials

Done.

Next, edit the configuration file again.

jtt@jtt-System-Product-Name:/var/www/html/brat$ vi annotation.conf

Once the changes are saved, start annotating.

The data now shows up in brat. [screenshot not preserved]

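III. Hands-on annotation

Select the text of the entity to annotate; a dialog pops up where you pick the type. When you are finished, the annotated data can be exported.

IV. Model

Convert the annotation results to BIO tags, then use a bert-bilstm-crf model for tagging.

So how is the BIO tagging done?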

1. Automatic tagging: workflow and results

Go to the directory /var/www/html/brat/tools;

Run: python anntoconll.py <text file to convert to BIO>

Done.

For example:

The .ann file contains two annotated entities:
T1 Method-tech 4 12 COVID-19
T2 Other 225 237 intelligence
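
To work with these lines in code, they first need to be parsed. The sketch below assumes the usual brat standoff layout, where the ID, the "type start end" group, and the covered text are separated by tab characters (the tabs are not visible in the lines quoted above). Entity, parse_entities and SAMPLE_ANN are names invented for this illustration, and discontinuous spans are skipped.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


SAMPLE_ANN = "T1\tMethod-tech 4 12\tCOVID-19\nT2\tOther 225 237\tintelligence\n"


def parse_entities(ann_text):
    """Parse the text-bound (T) lines of a brat .ann file into Entity tuples."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # other annotation kinds are ignored in this sketch
        ann_id, type_and_span, text = line.split("\t", 2)
        if ";" in type_and_span:
            continue  # discontinuous spans are out of scope here
        label, start, end = type_and_span.split()
        entities.append(Entity(ann_id, label, int(start), int(end), text))
    return entities
```

Note that end minus start equals the length of the covered text: 12 - 4 is 8 characters for COVID-19, and 237 - 225 is 12 characters for intelligence.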

Run: python anntoconll.py /var/www/html/brat/data/all-txt/2021-1.txt

This generates the tagged file <filename>.conll

Result: [screenshot not preserved]

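2. Writing the tagging code yourself: this is also simple, so here is just the idea

Using the annotated .ann file, find the spans that are marked as entities.

Tag the first token of an entity as B; if the token after it still belongs to the entity, tag that one as I.

Tag everything that is not part of an entity as O.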
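
The idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what anntoconll.py does internally: tokens are split on whitespace only, spans are given as (start, end, label) character offsets like those in a .ann file, and bio_tags is a made-up name.

```python
import re


def bio_tags(text, spans):
    """Tag whitespace-delimited tokens of text with B-/I-/O labels.

    spans is a list of (start, end, label) character offsets.
    """
    tagged = []
    previous_span = None
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        # Pick the first span that overlaps this token, if any.
        current = next(
            (s for s in spans if s[0] < tok_end and tok_start < s[1]), None
        )
        if current is None:
            tag = "O"                 # not part of any entity
        elif current == previous_span:
            tag = "I-" + current[2]   # continues the entity started earlier
        else:
            tag = "B-" + current[2]   # first token of an entity
        tagged.append((match.group(), tag))
        previous_span = current
    return tagged
```

For the text "The deep learning method works" with one span (4, 17, "Method-tech"), "deep" gets B-Method-tech, "learning" gets I-Method-tech, and the other tokens get O.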

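3. Model training

The tagged data can be fed straight into an entity recognition model for training. Different models may expect different input formats, so adjust the data layout slightly where needed.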
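
As an example of such an adjustment, the sketch below groups CoNLL-style lines into sentences of (token, tag) pairs. The column layout is an assumption: it takes tab-separated columns with the tag first and the token last, and blank lines between sentences, so check it against your own .conll file and change token_col and tag_col if it differs. read_conll is a hypothetical helper name.

```python
def read_conll(conll_text, token_col=-1, tag_col=0):
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in conll_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[token_col], fields[tag_col]))
    if current:
        sentences.append(current)
    return sentences
```

Each inner list can then be written out in whatever shape the chosen model expects, for example one "token tag" pair per line.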

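Error fixes

The file that works normally for entity annotation raises an error during BIO conversion; likewise, the version that works for BIO conversion raises an error during entity annotation. The cause and the fix are as follows.

Cause:

The code in a .py file raises an error. The two versions listed below differ in Python 2 versus Python 3 syntax (the print calls and the except clause), so each one fails under the other interpreter.

Solution:

In brat/server/src/sspostproc.py, use this code for entity annotation: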

jtt@jtt-System-Product-Name:~$ cd /var/www/html/brat
jtt@jtt-System-Product-Name:/var/www/html/brat$ python2 standalone.py
Serving brat at http://127.0.0.1:8001

The file contents are:

#!/usr/bin/env python

# Python version of geniass-postproc.pl. Originally developed as a
# heuristic postprocessor for the geniass sentence splitter, drawing
# in part on Yoshimasa Tsuruoka's medss.pl.

from __future__ import with_statement

import re

INPUT_ENCODING = "UTF-8"
OUTPUT_ENCODING = "UTF-8"
DEBUG_SS_POSTPROCESSING = False

__initial = []

# TODO: some cases that heuristics could be improved on
# - no split inside matched quotes
# - "quoted." New sentence
# - 1 mg .\nkg(-1) .

# breaks sometimes missing after "?", "safe" cases
__initial.append((re.compile(r'\b([a-z]+\?) ([A-Z][a-z]+)\b'), r'\1\n\2'))
# breaks sometimes missing after "." separated with extra space, "safe" cases
__initial.append((re.compile(r'\b([a-z]+ \.) ([A-Z][a-z]+)\b'), r'\1\n\2'))

# join breaks creating lines that only contain sentence-ending punctuation
__initial.append((re.compile(r'\n([.!?]+)\n'), r' \1\n'))

# no breaks inside parens/brackets. (To protect against cases where a
# pair of locally mismatched parentheses in different parts of a large
# document happen to match, limit size of intervening context. As this
# is not an issue in cases where there are no interveining brackets,
# allow an unlimited length match in those cases.)

__repeated = []

# unlimited length for no intevening parens/brackets
__repeated.append((re.compile(r'(\([^\[\]\(\)]*)\n([^\[\]\(\)]*\))'),r'\1 \2'))
__repeated.append((re.compile(r'(\[[^\[\]\(\)]*)\n([^\[\]\(\)]*\])'),r'\1 \2'))
# standard mismatched with possible intervening
__repeated.append((re.compile(r'(\([^\(\)]{0,250})\n([^\(\)]{0,250}\))'), r'\1 \2'))
__repeated.append((re.compile(r'(\[[^\[\]]{0,250})\n([^\[\]]{0,250}\])'), r'\1 \2'))
# nesting to depth one
__repeated.append((re.compile(r'(\((?:[^\(\)]|\([^\(\)]*\)){0,250})\n((?:[^\(\)]|\([^\(\)]*\)){0,250}\))'), r'\1 \2'))
__repeated.append((re.compile(r'(\[(?:[^\[\]]|\[[^\[\]]*\]){0,250})\n((?:[^\[\]]|\[[^\[\]]*\]){0,250}\])'), r'\1 \2'))

__final = []

# no break after periods followed by a non-uppercase "normal word"
# (i.e. token with only lowercase alpha and dashes, with a minimum
# length of initial lowercase alpha).

__final.append((re.compile(r'\.\n([a-z]{3}[a-z-]{0,}[ \.\:\,\;])'), r'. \1'))

# no break in likely species names with abbreviated genus (e.g.

# "S. cerevisiae"). Differs from above in being more liberal about
# separation from following text.

__final.append((re.compile(r'\b([A-Z]\.)\n([a-z]{3,})\b'), r'\1 \2'))

# no break in likely person names with abbreviated middle name
# (e.g. "Anton P. Chekhov", "A. P. Chekhov"). Note: Won't do
# "A. Chekhov" as it yields too many false positives.

__final.append((re.compile(r'\b((?:[A-Z]\.|[A-Z][a-z]{3,}) [A-Z]\.)\n([A-Z][a-z]{3,})\b'), r'\1 \2'))

# no break before CC ..

__final.append((re.compile(r'\n((?:and|or|but|nor|yet) )'), r' \1'))

# or IN. (this is nothing like a "complete" list...)
__final.append((re.compile(r'\n((?:of|in|by|as|on|at|to|via|for|with|that|than|from|into|upon|after|while|during|within|through|between|whereas|whether) )'), r' \1'))

# no sentence breaks in the middle of specific abbreviations
__final.append((re.compile(r'\b(e\.)\n(g\.)'), r'\1 \2'))
__final.append((re.compile(r'\b(i\.)\n(e\.)'), r'\1 \2'))
__final.append((re.compile(r'\b(i\.)\n(v\.)'), r'\1 \2'))

# no sentence break after specific abbreviations
__final.append((re.compile(r'\b(e\. ?g\.|i\. ?e\.|i\. ?v\.|vs\.|cf\.|Dr\.|Mr\.|Ms\.|Mrs\.)\n'), r'\1 '))

# or others taking a number after the abbrev
__final.append((re.compile(r'\b([Aa]pprox\.|[Nn]o\.|[Ff]igs?\.)\n(\d+)'), r'\1 \2'))

# no break before comma (e.g. Smith, A., Black, B., ...)
__final.append((re.compile(r'(\.\s*)\n(\s*,)'), r'\1 \2'))

def refine_split(s):
    """
    Given a string with sentence splits as newlines, attempts to
    heuristically improve the splitting. Heuristics tuned for geniass
    sentence splitting errors.
    """

    if DEBUG_SS_POSTPROCESSING:
        orig = s

    for r, t in __initial:
        s = r.sub(t, s)

    for r, t in __repeated:
        while True:
            n = r.sub(t, s)
            if n == s: break
            s = n

    for r, t in __final:
        s = r.sub(t, s)

    # Only do final comparison in debug mode.

    if DEBUG_SS_POSTPROCESSING:
        # revised must match original when differences in space<->newline
        # substitutions are ignored
        r1 = orig.replace('\n', ' ')
        r2 = s.replace('\n', ' ')
        if r1 != r2:
            print >> sys.stderr, "refine_split(): error: text mismatch (returning original):\nORIG: '%s'\nNEW:  '%s'" % (orig, s)
            s = orig

    return s

if __name__ == "__main__":
    import sys
    import codecs

    # for testing, read stdin if no args
    if len(sys.argv) == 1:
        sys.argv.append('/dev/stdin')

    for fn in sys.argv[1:]:
        try:
            with codecs.open(fn, encoding=INPUT_ENCODING) as f:
                s = "".join(f.read())
                sys.stdout.write(refine_split(s).encode(OUTPUT_ENCODING))
        except Exception, e:
            print >> sys.stderr, "Failed to read", fn, ":", e

Code used for BIO conversion:

jtt@jtt-System-Product-Name:~$ cd /var/www/html/brat/tools
jtt@jtt-System-Product-Name:/var/www/html/brat/tools$ python anntoconll.py /var/www/html/brat/data/2021-pre2000/2021-1.txt

The file contents are:

#!/usr/bin/env python

# Python version of geniass-postproc.pl. Originally developed as a
# heuristic postprocessor for the geniass sentence splitter, drawing
# in part on Yoshimasa Tsuruoka's medss.pl.

import re

INPUT_ENCODING = "UTF-8"
OUTPUT_ENCODING = "UTF-8"
DEBUG_SS_POSTPROCESSING = False

__initial = []

# TODO: some cases that heuristics could be improved on
# - no split inside matched quotes
# - "quoted." New sentence
# - 1 mg .\nkg(-1) .

# breaks sometimes missing after "?", "safe" cases
__initial.append((re.compile(r'\b([a-z]+\?) ([A-Z][a-z]+)\b'), r'\1\n\2'))
# breaks sometimes missing after "." separated with extra space, "safe" cases
__initial.append((re.compile(r'\b([a-z]+ \.) ([A-Z][a-z]+)\b'), r'\1\n\2'))

# join breaks creating lines that only contain sentence-ending punctuation
__initial.append((re.compile(r'\n([.!?]+)\n'), r' \1\n'))

# no breaks inside parens/brackets. (To protect against cases where a
# pair of locally mismatched parentheses in different parts of a large
# document happen to match, limit size of intervening context. As this
# is not an issue in cases where there are no interveining brackets,
# allow an unlimited length match in those cases.)

__repeated = []

# unlimited length for no intevening parens/brackets
__repeated.append(
    (re.compile(r'(\([^\[\]\(\)]*)\n([^\[\]\(\)]*\))'), r'\1 \2'))
__repeated.append(
    (re.compile(r'(\[[^\[\]\(\)]*)\n([^\[\]\(\)]*\])'), r'\1 \2'))
# standard mismatched with possible intervening
__repeated.append(
    (re.compile(r'(\([^\(\)]{0,250})\n([^\(\)]{0,250}\))'), r'\1 \2'))
__repeated.append(
    (re.compile(r'(\[[^\[\]]{0,250})\n([^\[\]]{0,250}\])'), r'\1 \2'))
# nesting to depth one
__repeated.append(
    (re.compile(r'(\((?:[^\(\)]|\([^\(\)]*\)){0,250})\n((?:[^\(\)]|\([^\(\)]*\)){0,250}\))'),
     r'\1 \2'))
__repeated.append(
    (re.compile(r'(\[(?:[^\[\]]|\[[^\[\]]*\]){0,250})\n((?:[^\[\]]|\[[^\[\]]*\]){0,250}\])'),
     r'\1 \2'))

__final = []

# no break after periods followed by a non-uppercase "normal word"
# (i.e. token with only lowercase alpha and dashes, with a minimum
# length of initial lowercase alpha).

__final.append((re.compile(r'\.\n([a-z]{3}[a-z-]{0,}[ \.\:\,\;])'), r'. \1'))

# no break in likely species names with abbreviated genus (e.g.

# "S. cerevisiae"). Differs from above in being more liberal about
# separation from following text.

__final.append((re.compile(r'\b([A-Z]\.)\n([a-z]{3,})\b'), r'\1 \2'))

# no break in likely person names with abbreviated middle name
# (e.g. "Anton P. Chekhov", "A. P. Chekhov"). Note: Won't do
# "A. Chekhov" as it yields too many false positives.

__final.append(
    (re.compile(r'\b((?:[A-Z]\.|[A-Z][a-z]{3,}) [A-Z]\.)\n([A-Z][a-z]{3,})\b'),
     r'\1 \2'))

# no break before CC ..

__final.append((re.compile(r'\n((?:and|or|but|nor|yet) )'), r' \1'))

# or IN. (this is nothing like a "complete" list...)
__final.append((re.compile(
    r'\n((?:of|in|by|as|on|at|to|via|for|with|that|than|from|into|upon|after|while|during|within|through|between|whereas|whether) )'), r' \1'))

# no sentence breaks in the middle of specific abbreviations
__final.append((re.compile(r'\b(e\.)\n(g\.)'), r'\1 \2'))
__final.append((re.compile(r'\b(i\.)\n(e\.)'), r'\1 \2'))
__final.append((re.compile(r'\b(i\.)\n(v\.)'), r'\1 \2'))

# no sentence break after specific abbreviations
__final.append(
    (re.compile(r'\b(e\. ?g\.|i\. ?e\.|i\. ?v\.|vs\.|cf\.|Dr\.|Mr\.|Ms\.|Mrs\.)\n'),
     r'\1 '))

# or others taking a number after the abbrev
__final.append(
    (re.compile(r'\b([Aa]pprox\.|[Nn]o\.|[Ff]igs?\.)\n(\d+)'), r'\1 \2'))

# no break before comma (e.g. Smith, A., Black, B., ...)
__final.append((re.compile(r'(\.\s*)\n(\s*,)'), r'\1 \2'))

def refine_split(s):
    """Given a string with sentence splits as newlines, attempts to
    heuristically improve the splitting.

    Heuristics tuned for geniass sentence splitting errors.

    """

    if DEBUG_SS_POSTPROCESSING:
        orig = s

    for r, t in __initial:
        s = r.sub(t, s)

    for r, t in __repeated:
        while True:
            n = r.sub(t, s)
            if n == s:
                break
            s = n

    for r, t in __final:
        s = r.sub(t, s)

    # Only do final comparison in debug mode.

    if DEBUG_SS_POSTPROCESSING:
        # revised must match original when differences in space<->newline
        # substitutions are ignored
        r1 = orig.replace('\n', ' ')
        r2 = s.replace('\n', ' ')
        if r1 != r2:
            print("refine_split(): error: text mismatch (returning original):\nORIG: '%s'\nNEW:  '%s'" % (orig, s), file=sys.stderr)
            s = orig

    return s

if __name__ == "__main__":
    import sys
    import codecs

    # for testing, read stdin if no args
    if len(sys.argv) == 1:
        sys.argv.append('/dev/stdin')

    for fn in sys.argv[1:]:
        try:
            with codecs.open(fn, encoding=INPUT_ENCODING) as f:
                s = "".join(f.read())
                sys.stdout.write(refine_split(s).encode(OUTPUT_ENCODING))
        except Exception as e:
            print("Failed to read", fn, ":", e, file=sys.stderr)
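
Swapping the file back and forth is error-prone. One possible alternative, sketched here and not taken from brat itself, is to write the error reporting with syntax that both Python 2.6+ and Python 3 accept: the "except ... as" form and an explicit write to the stream instead of either print variant. The helper names report_read_failure and read_or_report are invented for this illustration.

```python
import sys


def report_read_failure(filename, error, stream=None):
    """Write a failure message without using any print syntax."""
    stream = sys.stderr if stream is None else stream
    stream.write("Failed to read %s : %s\n" % (filename, error))


def read_or_report(filename, stream=None):
    """Return the file contents, or None after reporting a failure."""
    try:
        with open(filename) as handle:
            return handle.read()
    except Exception as error:  # "Exception, e" is accepted by Python 2 only
        report_read_failure(filename, error, stream)
        return None
```

The same pattern could replace the two print lines that differ between the listings above, though this has not been tested inside brat.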

Original: https://blog.csdn.net/weixin_42565135/article/details/119491403
Author: Coding With you.....
Title: Wow! Hands-on annotation with brat after installation, with one-command conversion to BIO format for model training, plus error fixes


